Clustering Use Cases – Unsupervised Machine Learning on GCP

Clustering is one of the most common patterns in unsupervised machine learning. Use cases where clustering applies include:
  • Market segmentation
  • Social network analysis
  • Search result grouping
  • Medical imaging
  • Image segmentation
  • Anomaly detection

BigQuery ML (BQML) is well suited to clustering use cases, and to SQL-based machine learning in general.

Apart from BigQuery ML, GCP offers several tools that are useful for unsupervised learning tasks:

  • Vertex AI: For training and deploying custom unsupervised learning models.
  • AI Platform Notebooks (now Vertex AI Workbench): For custom unsupervised learning implementations in a notebook environment.
  • Dataflow: For data preprocessing and transformation.
  • Dataproc: For running Spark-based unsupervised learning tasks.
  • Cloud Storage: For storing and managing data used in unsupervised learning.

1. BigQuery ML

  • Overview: BigQuery ML lets you create and run machine learning models directly in BigQuery using SQL. It supports unsupervised learning techniques such as clustering.
  • Unsupervised Learning Capabilities:
    • Clustering: You can use the K-means algorithm for clustering tasks.
    • Dimensionality Reduction: BigQuery ML provides PCA and autoencoder model types for this. You can also preprocess data using SQL for tasks like feature engineering.
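
As a rough illustration of the K-means capability above, the sketch below builds the SQL statements BigQuery ML would be given to train a clustering model and assign rows to clusters. The helper functions, the model name, the table name, and the feature columns are all hypothetical; only the SQL keywords and options (CREATE MODEL, model_type = 'KMEANS', num_clusters, standardize_features, ML.PREDICT) come from BigQuery ML. The code only produces strings, so it runs without a GCP project.

```python
from typing import List


def kmeans_create_model_sql(model: str, source_table: str,
                            features: List[str], num_clusters: int) -> str:
    """Build a BigQuery ML CREATE MODEL statement for K-means (hypothetical helper)."""
    if num_clusters < 2:
        raise ValueError("num_clusters must be at least 2")
    if not features:
        raise ValueError("at least one feature column is required")
    columns = ", ".join(features)
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'KMEANS', num_clusters = {num_clusters}, "
        f"standardize_features = TRUE) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`"
    )


def kmeans_predict_sql(model: str, source_table: str) -> str:
    """Build an ML.PREDICT query; its result includes a CENTROID_ID column per row."""
    return f"SELECT * FROM ML.PREDICT(MODEL `{model}`, TABLE `{source_table}`)"


# Assumed names, for illustration only.
create_sql = kmeans_create_model_sql(
    model="my_dataset.customer_segments",
    source_table="my_dataset.customers",
    features=["recency_days", "order_count", "total_spend"],
    num_clusters=4,
)
predict_sql = kmeans_predict_sql("my_dataset.customer_segments", "my_dataset.customers")
print(create_sql)
print(predict_sql)
```

To run these statements you could paste them into the BigQuery console, or submit them from Python with the google-cloud-bigquery client, which needs a project and credentials.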

2. Vertex AI

  • Overview: Vertex AI is Google Cloud’s unified machine learning platform that provides end-to-end ML services and tools. It includes support for training and deploying machine learning models.
  • Unsupervised Learning Capabilities:
    • AutoML: Allows you to build models with minimal code. AutoML targets supervised tasks such as classification, regression, and forecasting, so clustering on Vertex AI generally means custom training.
    • Custom Training: You can use Vertex AI’s custom training capabilities to implement and train unsupervised learning algorithms using TensorFlow, PyTorch, or other ML frameworks.
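
To make the custom training option more concrete, here is a minimal sketch of the kind of algorithm a custom training script might contain: K-means (Lloyd's algorithm) written with only the Python standard library. It is an illustration under simplifying assumptions, not Vertex AI code. The function name, the toy data, and the choice of random initialization are all assumptions, and a real job would more likely use scikit-learn, TensorFlow, or PyTorch.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def kmeans(points: List[Point], k: int, max_iterations: int = 100,
           seed: int = 0) -> Tuple[List[Point], List[int]]:
    """Cluster points with Lloyd's algorithm; returns (centroids, assignments)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Initialize centroids by picking k distinct input points at random.
    centroids = list(rng.sample(points, k))
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
        # Stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
    return centroids, new_assignments


# Toy data: two well-separated groups of 2-D points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

The distance from a point to its nearest centroid can also serve as a simple anomaly score, which connects this sketch to the anomaly detection use case listed earlier.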

3. Dataflow

  • Overview: Dataflow is a fully managed service for stream and batch processing of data. It uses Apache Beam to handle complex data processing tasks.
  • Unsupervised Learning Capabilities:
    • Data Preparation: Useful for preprocessing and transforming data before applying unsupervised learning models.
    • Integration: Dataflow can be integrated with BigQuery and other services to streamline the data pipeline for machine learning tasks.
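
One common preparation step before clustering is feature scaling, because distance-based algorithms such as K-means are sensitive to the range of each column. The sketch below shows z-score standardization in plain Python on in-memory rows. It is not Apache Beam code, and the function name and sample data are assumptions. In a Dataflow pipeline the per-column statistics would typically be computed in a combine step and the scaling applied in a map step.

```python
import statistics
from typing import List, Sequence, Tuple


def standardize(rows: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    """Scale each numeric column to zero mean and unit (population) standard deviation."""
    if not rows:
        return []
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid dividing by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((value - mean) / stdev
              for value, mean, stdev in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical rows: (age, annual_income). Unscaled, income would dominate distances.
raw = [(25.0, 40000.0), (35.0, 60000.0), (45.0, 80000.0)]
for row in standardize(raw):
    print(row)
```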

4. Dataproc

  • Overview: Dataproc is a managed Spark and Hadoop service that simplifies running big data processing tasks.
  • Unsupervised Learning Capabilities:
    • Spark MLlib: You can use Spark MLlib for unsupervised learning tasks, including clustering (e.g., K-means, Gaussian mixture models) and dimensionality reduction (e.g., PCA).
    • Custom Algorithms: Supports custom implementations of unsupervised learning algorithms in Spark.

5. AI Platform Notebooks (now Vertex AI Workbench)

  • Overview: AI Platform Notebooks, since succeeded by Vertex AI Workbench, provides managed Jupyter notebooks for developing and running machine learning models.
  • Unsupervised Learning Capabilities:
    • Flexible Environment: You can install and use libraries for unsupervised learning (e.g., scikit-learn, TensorFlow) and implement unsupervised learning algorithms in a notebook environment.

6. Cloud Storage

  • Overview: Cloud Storage provides scalable and secure storage for large datasets.
  • Unsupervised Learning Capabilities:
    • Data Storage: Stores the data needed for unsupervised learning tasks. It integrates with other GCP tools for data processing and analysis.