{"id":2438,"date":"2021-01-22T10:23:28","date_gmt":"2021-01-22T09:23:28","guid":{"rendered":"https:\/\/blog.besharp.it\/?p=2438"},"modified":"2021-03-18T16:27:08","modified_gmt":"2021-03-18T15:27:08","slug":"a-clustering-process-with-sagemaker-experiments-a-real-world-use-case","status":"publish","type":"post","link":"https:\/\/blog.besharp.it\/a-clustering-process-with-sagemaker-experiments-a-real-world-use-case\/","title":{"rendered":"A clustering process with SageMaker Experiments: a real-world use case"},"content":{"rendered":"\n
The development of an efficient Machine Learning<\/strong> model is a highly iterative process with continuous feedback loops from previous trials and tests, more akin to a scientific experiment than to a software development project. Data Scientists usually train many different models every day in search of the most robust model for the scenario at hand, and keeping track of all the tests carried out is often a daunting task, even in a single-person project.<\/p>\n\n\n\n Amazon offers several tools to help Data Scientists find the correct set of parameters for their models. Automatic Model Tuning and Amazon SageMaker Autopilot help explore large sections of the parameter space quickly and automatically; however, these services also contribute to the never-ending growth of training job parameters and artifacts.<\/p>\n\n\n\n If the project is big enough, multiple engineers are usually involved. Therefore, keeping the project as structured as possible, as well as finding ways of sharing all datasets, notebooks, hyperparameters, and results, is crucial for success.<\/p>\n\n\n\n The main components of a machine learning project are the datasets, the notebooks, the model hyperparameters, and the resulting artifacts.<\/p>\n\n\n\n Each team member should always have a clear understanding of which version of each component is the latest, and be able to quickly look up results and artifacts from previous runs and trials.<\/p>\n\n\n\n To help data scientists with these ML project structuring and management tasks, Amazon released a new service: SageMaker Experiments. 
This new Amazon SageMaker component aims to solve the management challenge by providing a unified view of parameters, training runs, and output artifacts.<\/p>\n\n\n\n In this article, we present a real-world case in which we used SageMaker Experiments extensively.<\/p>\n\n\n\n The project dealt with clustering a sparse dataset containing several million customers in order to understand their behavior. The structure of the dataset and the available features made both the choice of clustering algorithm and the hyperparameter tuning anything but trivial. We tested several types of clustering algorithms (K-means, Gaussian mixture, DBSCAN) with different combinations of features. PCA and variable correlation were used to identify the features relevant for clustering.<\/p>\n\n\n\n After several iterations, we found that the most stable result was obtained using DBSCAN after dimensionality reduction with UMAP (Uniform Manifold Approximation and Projection). KNN analysis was used to find the optimal radius (eps) for DBSCAN.<\/p>\n\n\n\n The UMAP, DBSCAN, and KNN algorithms can all be massively accelerated through GPU parallelization.<\/p>\n\n\n\n In order to carry out efficient clustering on our dataset, we decided to use the RapidsAI framework, which includes CUDA-enabled GPU versions of all the algorithms needed for our pipeline. AWS offers several options for GPU-enabled SageMaker ML instances. For our workload, we selected an ml.p3.2xlarge for testing and exploration and an ml.g4dn.2xlarge for model training.<\/p>\n\n\n\nCustomer Clustering<\/h2>\n\n\n\n
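To make the eps-selection step concrete, here is a minimal CPU-only sketch of the KNN "k-distance" analysis mentioned above, using plain NumPy on synthetic data. In our pipeline the same analysis ran on GPUs via RapidsAI; the dataset, the value of k, and the cluster shapes below are illustrative assumptions, not our real data:

```python
import numpy as np

def k_distances(X, k=4):
    """Return each point's distance to its k-th nearest neighbour, sorted descending."""
    # Pairwise Euclidean distances (fine for a small demo; cuML scales this to GPUs).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sorting each row puts the zero self-distance in column 0, so column k
    # holds the distance to the k-th nearest neighbour.
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)[::-1]

rng = np.random.default_rng(42)
# Two tight synthetic clusters plus a few scattered outliers.
X = np.vstack([
    rng.normal(0.0, 0.1, size=(50, 2)),
    rng.normal(3.0, 0.1, size=(50, 2)),
    rng.uniform(-2.0, 5.0, size=(5, 2)),
])
kd = k_distances(X, k=4)
# Plotting kd gives the classic k-distance curve: outliers sit on the steep
# left part, core points on the flat tail, and the "elbow" between the two
# is a common heuristic for DBSCAN's eps.
```

The same curve is what we inspected (at much larger scale) to choose eps before each DBSCAN run.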
SageMaker training on AWS GPU Instances<\/h2>\n\n\n\n
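Before looking at the details, a hedged sketch of what launching such a job involves: a SageMaker training job is described by a request to the CreateTrainingJob API (for example via boto3's `create_training_job`). The job name, container image URI, role ARN, and S3 paths below are hypothetical placeholders, not the ones we actually used:

```python
# Illustrative CreateTrainingJob request body for a GPU-backed training job.
# All names, ARNs, image URIs, and S3 paths are hypothetical placeholders.
training_job = {
    "TrainingJobName": "customer-clustering-umap-dbscan",
    "AlgorithmSpecification": {
        # A custom container with the RAPIDS stack, pushed to ECR beforehand.
        "TrainingImage": "ACCOUNT_ID.dkr.ecr.eu-west-1.amazonaws.com/rapids-clustering:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::ACCOUNT_ID:role/SageMakerExecutionRole",
    "ResourceConfig": {
        "InstanceType": "ml.g4dn.2xlarge",  # the GPU instance type used for training
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/clustering/output"},
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}
# In practice this dict would be submitted with:
#   boto3.client("sagemaker").create_training_job(**training_job)
```

Swapping `InstanceType` between ml.p3.2xlarge and ml.g4dn.2xlarge is all it takes to move between the exploration and training setups described earlier.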
Installing RapidsAI on an ML instance<\/h2>\n\n\n\n
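As a sketch of what this typically involves, RAPIDS is usually installed into its own conda environment on the notebook instance. The version pins below (rapids=0.17, CUDA 11.0, Python 3.8) are assumptions matching the RAPIDS releases available when this project ran; always pick the combination from the RAPIDS install selector that matches your instance's CUDA driver:

```shell
# Create a dedicated conda environment with the RAPIDS stack
# (channels -c rapidsai -c nvidia -c conda-forge are the ones RAPIDS documents).
conda create -n rapids-env \
    -c rapidsai -c nvidia -c conda-forge \
    rapids=0.17 python=3.8 cudatoolkit=11.0 -y

# Activate it and register it as a Jupyter kernel so SageMaker notebooks can use it.
conda activate rapids-env
python -m ipykernel install --user --name rapids-env --display-name "RAPIDS"
```

Once the kernel is registered, notebooks opened with it can import cuDF, cuML's UMAP, DBSCAN, and NearestNeighbors directly.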