Over the last several years, GPUs have often made headlines for their use in cryptocurrency mining. While that industry has seen some spectacular rises and falls, the use of GPUs for other technological advances has grown steadily and on a far more stable track.

Using GPUs for complex and computationally expensive ML (Machine Learning) and AI (Artificial Intelligence) workflows offers substantial performance benefits. However, because GPUs process tasks very differently from CPUs, building and using models on GPU-driven infrastructure can be quite challenging.

NVIDIA RAPIDS aims to provide a GPU-compatible suite of ML tools that can be used to create ML and AI workflows similar to those built with widely used ML toolsets such as Pandas and Scikit-learn. TMA Solutions’ Data Science Centre recently set about investigating this suite of tools and assessing their usability and performance.

 

Project Goals

The team’s first goal was to investigate the available tools in the RAPIDS framework and identify which parts of the typical machine learning (ML) workflow could be replaced. The cuDF dataframe library provides a suitable replacement for Pandas, and the cuML library suite provides a host of machine-learning algorithms similar to those provided by Scikit-learn. The cuDF library is based on the Apache Arrow columnar data format and provides a pandas-like API that facilitates a straightforward transition from CPU-based computing to GPU-based. Similarly, the cuML library provides an API that mirrors Scikit-learn.

 

Data Analysis and Workflow Conversion

To gain familiarity with the RAPIDS framework, we decided to implement an existing ML workflow using cuDF and cuML. The workflow consisted of an exploratory data analysis of a data set related to housing prices, followed by the development of a predictive model for house prices, including feature engineering and model evaluation.

The initial stages of the conversion were very straightforward. After installing RAPIDS, the cuDF package can be imported into our notebooks and data can be read and processed with minimal changes to the original Pandas code.

Figure 1. Loading data with Pandas

Figure 2. Loading data with cuDF

Figures 1 and 2 above show the similarity of the code for Pandas and cuDF.
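
Since the figures are screenshots, a minimal text sketch of the same idea may help. It is not the code from the figures: the CSV content and column names are hypothetical, and the fallback import is an assumption added so the snippet also runs on a machine without a GPU.

```python
import io

try:
    import cudf as xdf  # GPU dataframe library from RAPIDS
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) so the sketch runs without a GPU

# Hypothetical stand-in for the housing data; the real file is not shown in the post.
CSV_TEXT = (
    "LotArea,YearBuilt,SalePrice\n"
    "8450,2003,208500\n"
    "9600,1976,181500\n"
    "11250,2001,223500\n"
)

# read_csv is called the same way in both libraries.
df = xdf.read_csv(io.StringIO(CSV_TEXT))
n_rows, n_cols = df.shape
print(df.head())
```

Only the import line differs between the two libraries; the loading call itself is unchanged.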

Visualizing data is also straightforward when using the RAPIDS framework. Matplotlib and Seaborn are compatible with cuDF when cuDF’s to_cupy method is used to convert the cuDF dataframe to a CuPy array. An example of this can be seen below.

Figure 3. Plotting with cuDF and cuPy

Another example of the conversion from Pandas to cuDF is the identification of columns that contain missing values and the ordering of these columns according to which contain the highest proportion of missing values. The original Pandas code can be seen in the image below.

Figure 4. Analyzing missing data with Pandas

Figure 5 shows the same analysis of missing values using cuDF. Only the extraction of the data types for each column needed to be adjusted from the original Pandas code.

Figure 5. Analyzing missing data with cuDF
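
The missing-value analysis described above can be sketched roughly as follows. This is an illustration rather than the code in figures 4 and 5: the column names and the to_host helper are hypothetical, and the data-type extraction mentioned above is left out.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) for machines without a GPU


def to_host(obj):
    """Copy a cuDF object to pandas; pandas objects pass through unchanged."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


# Hypothetical columns; None marks a missing value.
df = xdf.DataFrame({
    "LotFrontage": [65.0, None, 68.0, None],
    "GarageYrBlt": [2003.0, 1976.0, None, 1998.0],
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0],
})

# Share of missing values per column, keeping only affected columns,
# ordered from the highest share to the lowest.
missing_share = df.isnull().mean()
missing_share = missing_share[missing_share > 0].sort_values(ascending=False)

report = to_host(missing_share)
print(report)
```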

 

Feature Selection and Engineering

The majority of feature selection and engineering operations also proved straightforward to convert from the original workflow.

Splitting the dataset into training and testing subsets follows the same process as when using Pandas, as can be seen below.

Figure 6. Splitting train and test data
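
A minimal sketch of such a split, assuming cuML's train_test_split is used in place of the Scikit-learn function (the post does not say which one appears in figure 6). The data and column names are hypothetical.

```python
try:
    import cudf as xdf
    from cuml.model_selection import train_test_split
except ImportError:
    # CPU fallback (assumption) so the sketch runs without a GPU.
    import pandas as xdf
    from sklearn.model_selection import train_test_split

# Hypothetical housing features and target.
df = xdf.DataFrame({
    "LotArea": [float(8000 + 100 * i) for i in range(10)],
    "OverallQual": [float(i % 5 + 1) for i in range(10)],
    "SalePrice": [float(150000 + 5000 * i) for i in range(10)],
})
X = df[["LotArea", "OverallQual"]]
y = df["SalePrice"]

# Hold out 20% of the rows for testing; the seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(len(X_train), len(X_test))
```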

Certain operations require some extra steps. For example, when performing MinMax scaling using cuML’s MinMaxScaler, the output data is returned in array format and therefore needs to be converted back to a data frame for further processing.

Figure 7. MinMax scaling with cuML
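
The extra conversion step might look like the sketch below. The column names are hypothetical, and the exact return type of the scaler may depend on the cuML version and its output settings, so wrapping the result in a data frame is shown here as a defensive step rather than as the code from figure 7.

```python
try:
    import cudf as xdf
    from cuml.preprocessing import MinMaxScaler
except ImportError:
    # CPU fallback (assumption) so the sketch runs without a GPU.
    import pandas as xdf
    from sklearn.preprocessing import MinMaxScaler

# Hypothetical numeric features.
X = xdf.DataFrame({
    "LotArea": [8000.0, 9000.0, 10000.0],
    "GrLivArea": [1000.0, 1500.0, 3000.0],
})

scaler = MinMaxScaler()
scaled = scaler.fit_transform(X)  # array-like result without column labels

# Restore the column names and the original index for further processing.
X_scaled = xdf.DataFrame(scaled, columns=list(X.columns), index=X.index)
print(X_scaled)
```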

We encoded categorical variables to capture the monotonic relationship between some variables and the target variable (house price). In this case the conversion from Pandas to cuDF required some minor changes, as can be seen highlighted in red in figures 8 and 9 below.

Figure 8. Encoding categorical variables with Pandas

Note: the encoding of the test set was removed in these examples for conciseness.

Figure 9. Encoding categorical variables with cuDF
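
One common way to obtain such a monotonic encoding is to rank the categories by their mean target value. The sketch below assumes that approach, which may differ from the code in figures 8 and 9; the helper names, column names, and data are hypothetical.

```python
try:
    import cudf as xdf
except ImportError:
    import pandas as xdf  # CPU fallback (assumption) for machines without a GPU


def to_host(obj):
    """Copy a cuDF object to pandas; pandas objects pass through unchanged."""
    return obj.to_pandas() if hasattr(obj, "to_pandas") else obj


def monotonic_mapping(frame, column, target):
    """Map each category to an integer rank ordered by its mean target value."""
    means = to_host(frame.groupby(column)[target].mean()).sort_values()
    return {category: rank for rank, category in enumerate(means.index)}


# Hypothetical training data.
train = xdf.DataFrame({
    "Neighborhood": ["A", "B", "A", "C", "B", "C"],
    "SalePrice": [100000.0, 200000.0, 120000.0, 300000.0, 220000.0, 280000.0],
})

# Learn the mapping on the training set only, then apply it.
mapping = monotonic_mapping(train, "Neighborhood", "SalePrice")
train["Neighborhood"] = train["Neighborhood"].map(mapping)
print(mapping)
```

The same mapping would then be applied to the test set, so that no information from the test prices leaks into the encoding.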

The cuML library offers an extensive set of tools that can be used in the feature engineering process. We assessed the effects of using a Yeo-Johnson transformation by plotting the distribution of the continuous variables in the dataset before and after applying the transformation.

The code used to apply the transformation can be seen in the example below.

Figure 10. Yeo-Johnson transformation with cuML
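
A rough sketch of the transformation step, assuming cuML's PowerTransformer mirrors the Scikit-learn class of the same name. The feature name and values are hypothetical, and this is not the code from figure 10.

```python
try:
    import cudf as xdf
    from cuml.preprocessing import PowerTransformer
except ImportError:
    # CPU fallback (assumption) so the sketch runs without a GPU.
    import pandas as xdf
    from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed continuous feature, sorted for easy inspection.
X = xdf.DataFrame({"skewed_feature": [1.0, 2.0, 2.0, 3.0, 5.0, 9.0, 20.0]})

# Yeo-Johnson also handles zero and negative values, unlike Box-Cox.
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = transformer.fit_transform(X)

# Wrap the result in a data frame again so it can be plotted or joined.
X_yj = xdf.DataFrame(transformed, columns=list(X.columns), index=X.index)
print(X_yj)
```

Plotting a histogram of the column before and after the call gives the before-and-after comparison described above.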

 

Model Training and Evaluation

We trained a regularized linear regression model (Lasso) to predict house prices. Training the model was straightforward and the implementation was the same as when using Scikit-learn.

Figure 11. Model training with cuML
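
To illustrate how closely the two APIs match, here is a minimal training sketch. The toy data, feature names, and alpha value are hypothetical and are not taken from figure 11.

```python
try:
    import cudf as xdf
    from cuml.linear_model import Lasso
except ImportError:
    # CPU fallback (assumption) so the sketch runs without a GPU.
    import pandas as xdf
    from sklearn.linear_model import Lasso

# Toy data in which the target is 2 * x1 + 1 and x2 carries no signal.
X_train = xdf.DataFrame({
    "x1": [0.0, 1.0, 2.0, 3.0, 4.0],
    "x2": [1.0, 0.0, 1.0, 0.0, 1.0],
})
y_train = xdf.Series([1.0, 3.0, 5.0, 7.0, 9.0])

# alpha sets the strength of the L1 penalty; this value is only illustrative.
model = Lasso(alpha=0.001)
model.fit(X_train, y_train)

predictions = model.predict(X_train)
print(predictions)
```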

Evaluating the model required some updates to the original code. Data needed to be converted from cuDF to cuPy array format before calculating MSE and R^2 values. This can be seen in figure 12.

Figure 12. Model evaluation
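
The metrics themselves are simple array computations, which is why array input is convenient. The sketch below computes them by hand on plain arrays; the function name and the price values are hypothetical, and the code in figure 12 may rely on library metric functions instead.

```python
try:
    import cupy as xp  # GPU arrays
except ImportError:
    import numpy as xp  # CPU fallback (assumption) for machines without a GPU


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R^2 for two equal-length arrays."""
    y_true = xp.asarray(y_true, dtype=xp.float64)
    y_pred = xp.asarray(y_pred, dtype=xp.float64)
    residuals = y_true - y_pred
    ss_res = float(xp.sum(residuals ** 2))                    # residual sum of squares
    ss_tot = float(xp.sum((y_true - xp.mean(y_true)) ** 2))   # total sum of squares
    mse = ss_res / y_true.shape[0]
    return {"mse": mse, "rmse": mse ** 0.5, "r2": 1.0 - ss_res / ss_tot}


# Hypothetical true and predicted prices.
y_true = [200000.0, 150000.0, 300000.0, 250000.0]
y_pred = [210000.0, 140000.0, 290000.0, 260000.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

With cuDF data, the arrays passed in would come from the to_cupy conversion mentioned earlier, for example y_test.to_cupy().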

Overall, the training and evaluation of the model proved to be straightforward and efficient for this dataset.

When evaluating our model as shown in figure 12, we calculated an RMSE of 28123 and an R^2 value of 0.855. We plotted our predicted values for house price against the true values, as can be seen in figure 13 below.

Figure 13. Predicted vs. true house price

We also plotted the distribution of the errors and found them to be roughly normally distributed, as can be seen in figure 14. We found cuPy to be fully compatible with both Matplotlib and Seaborn, and faced no significant challenges compared to using these plotting tools with the standard Numpy and Pandas libraries.

Figure 14. Distribution of Errors

 

Conclusion

RAPIDS provided a useful and efficient framework for data analysis and machine learning.

cuDF provides an API similar to that of Pandas, so users who are already familiar with Pandas can switch to cuDF with little effort. As seen in this report, existing Pandas workflows can be converted to cuDF with only minor changes. Because cuDF executes its operations on the GPU, it can also process large datasets more efficiently than Pandas.

TMA Solutions’ Data Science Centre believes that RAPIDS can provide a useful and highly performant suite of tools that we will be able to use to advance our clients’ AI and ML capabilities. We will continue to investigate the use of RAPIDS for more complex ML workflows and pipelines and look forward to helping our partners grow and develop their businesses further using advanced AI and ML toolsets. Our next blog post in this series will feature benchmarking experiments to compare the performance of cuML and Scikit-learn for K-means Clustering, Random Forest classification, and VLAD feature encoding.

This blog post is the first in a three-part series that is being developed in partnership with NVIDIA.