Clustering Use Cases - Unsupervised Machine Learning on GCP

Clustering is one of the most common patterns in unsupervised machine learning, and it applies across a wide range of areas and use cases.

BigQuery ML: Well suited to clustering use cases, and to SQL-based machine learning more generally.

Apart from BQML, GCP offers several tools that are useful for unsupervised learning tasks:

Vertex AI: For managed training and deployment of custom unsupervised learning models.
AI Platform Notebooks (now Vertex AI Workbench): For custom unsupervised learning implementations in a notebook environment.
Dataflow: For data preprocessing and transformation.
Dataproc: For running Spark-based unsupervised learning tasks.
Cloud Storage: For storing and managing data used in unsupervised learning.

Overview: BigQuery ML allows you to create and execute machine learning models directly in BigQuery using SQL. It supports various unsupervised learning techniques such as clustering.
Unsupervised Learning Capabilities:
- Clustering: You can use the K-means algorithm for clustering tasks.
- Dimensionality Reduction: BigQuery ML also provides principal component analysis (PCA) and autoencoder model types, and plain SQL remains useful for feature engineering before training.
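
To make the K-means capability concrete, here is a minimal sketch that builds the two SQL statements involved: a CREATE MODEL statement with model_type 'KMEANS' and an ML.PREDICT query that returns a centroid_id for each row. The dataset, table, model, and column names are hypothetical placeholders, and the Python helpers only assemble the SQL text; running it against BigQuery (for example through the console or a client library) is left out.

```python
from typing import Sequence


def kmeans_create_model_sql(model: str, source_table: str,
                            feature_columns: Sequence[str],
                            num_clusters: int) -> str:
    """Build a BigQuery ML CREATE MODEL statement for K-means clustering."""
    columns = ", ".join(feature_columns)
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type = 'KMEANS', num_clusters = {num_clusters}, "
        f"standardize_features = TRUE) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`"
    )


def kmeans_predict_sql(model: str, source_table: str,
                       feature_columns: Sequence[str]) -> str:
    """Build an ML.PREDICT query that assigns each row to a cluster."""
    columns = ", ".join(feature_columns)
    return (
        f"SELECT centroid_id, {columns}\n"
        f"FROM ML.PREDICT(MODEL `{model}`,\n"
        f"  (SELECT {columns} FROM `{source_table}`))"
    )


# Hypothetical names, used for illustration only.
features = ["annual_spend", "visits_per_month"]
print(kmeans_create_model_sql("my_dataset.customer_segments",
                              "my_dataset.customers", features, 4))
print(kmeans_predict_sql("my_dataset.customer_segments",
                         "my_dataset.customers", features))
```

Only the feature columns are selected in the training query, since every column passed to a K-means model is treated as an input feature.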

Overview: Vertex AI is Google Cloud’s unified machine learning platform that provides end-to-end ML services and tools. It includes support for training and deploying machine learning models.
Unsupervised Learning Capabilities:
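- AutoML: Allows you to build models with minimal code. AutoML is aimed mainly at supervised tasks such as classification, regression, and forecasting, so clustering and anomaly detection typically go through custom training or BigQuery ML instead.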
- Custom Training: You can use Vertex AI’s custom training capabilities to implement and train unsupervised learning algorithms using TensorFlow, PyTorch, or other ML frameworks.
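
As an illustration of the kind of algorithm a custom training job might run, the following is a dependency-free sketch of K-means (Lloyd's algorithm). It is a stand-in for framework code, not a Vertex AI API: the function names, the sample points, and the choice of random initialization are all assumptions made for this example, and a real job would more likely use a library implementation.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Point, centroids: Sequence[Point]) -> int:
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points: Sequence[Point], k: int, max_iterations: int = 100,
           seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Cluster points into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialize centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(list(points), k)]
    labels = [nearest(p, centroids) for p in points]
    for _ in range(max_iterations):
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
        # Assignment step: relabel every point with its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
    return centroids, labels


# Two well-separated groups of hypothetical 2-D points.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(sample, k=2)
print(centroids, labels)
```

Lloyd's algorithm can settle in a local optimum, so library implementations usually add smarter initialization and multiple restarts.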

Overview: Dataflow is a fully managed service for stream and batch processing of data. It uses Apache Beam to handle complex data processing tasks.
Unsupervised Learning Capabilities:
- Data Preparation: Useful for preprocessing and transforming data before applying unsupervised learning models.
- Integration: Dataflow can be integrated with BigQuery and other services to streamline the data pipeline for machine learning tasks.
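
The data preparation step can be sketched as a few pure functions of the sort that an Apache Beam pipeline would apply to each element. This example leaves Beam itself out and shows only the per-record logic: parsing CSV lines, dropping malformed rows, and standardizing features so that no single column dominates distance-based clustering. The function names and sample rows are hypothetical.

```python
import math
import statistics
from typing import List, Optional, Sequence, Tuple


def parse_record(line: str, expected_fields: int) -> Optional[List[float]]:
    """Parse one CSV line into floats; return None for malformed rows."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != expected_fields:
        return None
    try:
        values = [float(part) for part in parts]
    except ValueError:
        return None
    # Reject NaN and infinite values, which would distort distances.
    return values if all(math.isfinite(v) for v in values) else None


def column_stats(records: Sequence[Sequence[float]]
                 ) -> Tuple[List[float], List[float]]:
    """Per-column mean and population standard deviation."""
    columns = list(zip(*records))
    means = [statistics.mean(column) for column in columns]
    stddevs = [statistics.pstdev(column) for column in columns]
    return means, stddevs


def standardize(record: Sequence[float], means: Sequence[float],
                stddevs: Sequence[float]) -> List[float]:
    """Scale a record to zero mean and unit variance per column."""
    return [(v - m) / s if s > 0 else 0.0
            for v, m, s in zip(record, means, stddevs)]


# Hypothetical input: two valid rows and two malformed ones.
lines = ["1.0, 200", "3.0, 400", "bad,row", "5.0"]
parsed = (parse_record(line, expected_fields=2) for line in lines)
records = [record for record in parsed if record is not None]
means, stddevs = column_stats(records)
scaled = [standardize(record, means, stddevs) for record in records]
print(scaled)
```

Standardization needs global statistics, so in a distributed pipeline the means and standard deviations would be computed in one aggregation pass and then supplied to the per-record step.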

Overview: Dataproc is a managed Spark and Hadoop service that simplifies running big data processing tasks.
Unsupervised Learning Capabilities:
- Spark MLlib: You can use Spark MLlib for various unsupervised learning tasks, including clustering and dimensionality reduction.
- Custom Algorithms: Supports custom implementations of unsupervised learning algorithms in Spark.
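
For the Spark MLlib route, a minimal PySpark sketch might look like the following. It assumes a Spark environment such as a Dataproc cluster; the column names "x" and "y", the application name, and the function name are hypothetical. The Spark imports sit inside the function so the file can still be loaded on a machine without Spark installed.

```python
from typing import List, Sequence, Tuple


def cluster_with_spark_mllib(rows: Sequence[Tuple[float, float]],
                             k: int = 2,
                             seed: int = 42) -> List[Tuple[float, float, int]]:
    """Run Spark MLlib K-means on 2-D rows; returns (x, y, cluster) tuples."""
    # Imported lazily: these modules are only available where Spark is installed.
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()
    df = spark.createDataFrame(list(rows), ["x", "y"])

    # MLlib estimators expect the features packed into one vector column.
    assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
    features = assembler.transform(df)

    model = KMeans(k=k, seed=seed, featuresCol="features",
                   predictionCol="cluster").fit(features)
    clustered = model.transform(features).select("x", "y", "cluster").collect()
    return [(row["x"], row["y"], row["cluster"]) for row in clustered]
```

Collecting results to the driver is reasonable only for small outputs; for large datasets the clustered DataFrame would normally be written back to storage instead.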

Overview: AI Platform Notebooks (now Vertex AI Workbench) provides managed Jupyter notebooks for developing and running machine learning models.
Unsupervised Learning Capabilities:
- Flexible Environment: You can install libraries such as scikit-learn or TensorFlow and implement a range of unsupervised learning algorithms interactively in the notebook.

Overview: Cloud Storage provides scalable and secure storage for large datasets.
Unsupervised Learning Capabilities:
- Data Storage: Stores the data needed for unsupervised learning tasks. It integrates with other GCP tools for data processing and analysis.
