Clustering is one of the most common patterns in unsupervised machine learning. Use cases where clustering applies include:
- Market segmentation
- Social network analysis
- Search result grouping
- Medical imaging
- Image segmentation
- Anomaly detection
BigQuery ML (BQML) is well suited to clustering use cases, and to SQL-based machine learning in general.
Apart from BQML, Google Cloud (GCP) offers several tools that are useful for unsupervised learning tasks:
- Vertex AI: For AutoML and custom unsupervised learning models.
- AI Platform Notebooks (now Vertex AI Workbench): For custom unsupervised learning implementations in a notebook environment.
- Dataflow: For data preprocessing and transformation.
- Dataproc: For running Spark-based unsupervised learning tasks.
- Cloud Storage: For storing and managing data used in unsupervised learning.
1. BigQuery ML
- Overview: BigQuery ML allows you to create and execute machine learning models directly in BigQuery using SQL. It supports various unsupervised learning techniques such as clustering.
- Unsupervised Learning Capabilities:
- Clustering: You can use the K-means algorithm for clustering tasks.
- Dimensionality Reduction: BigQuery ML provides PCA and autoencoder model types for this. You can also preprocess data using SQL for tasks like feature engineering.
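
To make the K-means capability concrete, here is a minimal sketch that builds the BigQuery ML statements for training a clustering model and assigning rows to clusters. The dataset, table, model, and column names are hypothetical placeholders, and the helper functions are illustrative rather than part of any Google library. The sketch only builds SQL strings; running them is assumed to go through the BigQuery console or a client library.

```python
from typing import Sequence


def build_kmeans_sql(model: str, source_table: str,
                     features: Sequence[str], num_clusters: int) -> str:
    """Return a BigQuery ML CREATE MODEL statement for a K-means model."""
    if num_clusters < 2:
        raise ValueError("num_clusters must be at least 2")
    if not features:
        raise ValueError("at least one feature column is required")
    cols = ", ".join(features)
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS(model_type = 'KMEANS', num_clusters = {num_clusters}) AS\n"
        f"SELECT {cols}\n"
        f"FROM `{source_table}`"
    )


def build_assignment_sql(model: str, source_table: str,
                         features: Sequence[str]) -> str:
    """Return an ML.PREDICT query that labels each row with its centroid_id."""
    cols = ", ".join(features)
    return (
        f"SELECT centroid_id, {cols}\n"
        f"FROM ML.PREDICT(MODEL `{model}`,\n"
        f"  (SELECT {cols} FROM `{source_table}`))"
    )


# Hypothetical names, for illustration only.
train_sql = build_kmeans_sql(
    "my_dataset.customer_segments",
    "my_dataset.customers",
    ["annual_spend", "visits_per_month"],
    num_clusters=4,
)
print(train_sql)

# To execute (assumes google-cloud-bigquery is installed and credentials are set):
#   from google.cloud import bigquery
#   bigquery.Client().query(train_sql).result()
```

The number of clusters is a modelling choice; in practice it is usually picked by comparing several values rather than fixed up front.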
2. Vertex AI
- Overview: Vertex AI is Google Cloud’s unified machine learning platform that provides end-to-end ML services and tools. It includes support for training and deploying machine learning models.
- Unsupervised Learning Capabilities:
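- AutoML: Allows you to build models with minimal code. AutoML is geared toward supervised tasks (classification, regression, forecasting), so clustering and anomaly detection typically call for custom training or BigQuery ML instead.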
- Custom Training: You can use Vertex AI’s custom training capabilities to implement and train unsupervised learning algorithms using TensorFlow, PyTorch, or other ML frameworks.
3. Dataflow
- Overview: Dataflow is a fully managed service for stream and batch processing of data. It uses Apache Beam to handle complex data processing tasks.
- Unsupervised Learning Capabilities:
- Data Preparation: Useful for preprocessing and transforming data before applying unsupervised learning models.
- Integration: Dataflow can be integrated with BigQuery and other services to streamline the data pipeline for machine learning tasks.
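
As an illustration of the kind of preparation step meant here, the sketch below standardizes numeric features (z-scores), which matters for distance-based methods such as K-means because unscaled columns with large ranges dominate the distance. It is plain Python operating in memory, not Apache Beam code; the function name is hypothetical, and in a distributed pipeline the per-column statistics would be computed in an aggregation step first.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each column of equal-length numeric rows to zero mean, unit variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; use 1.0 to avoid dividing by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical features: annual spend and visits per month.
raw = [[1200.0, 2.0], [300.0, 8.0], [750.0, 5.0]]
for row in standardize(raw):
    print(row)
```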
4. Dataproc
- Overview: Dataproc is a managed Spark and Hadoop service that simplifies running big data processing tasks.
- Unsupervised Learning Capabilities:
- Spark MLlib: You can use Spark MLlib for various unsupervised learning tasks, including clustering and dimensionality reduction.
- Custom Algorithms: Supports custom implementations of unsupervised learning algorithms in Spark.
5. AI Platform Notebooks
- Overview: AI Platform Notebooks (since rebranded as Vertex AI Workbench) provides managed Jupyter notebooks that can be used for developing and running machine learning models.
- Unsupervised Learning Capabilities:
- Flexible Environment: You can install and use libraries for unsupervised learning (e.g., scikit-learn, TensorFlow) and implement a range of unsupervised learning algorithms interactively.
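
As an example of what a custom implementation in a notebook might look like, here is a minimal sketch of K-means (Lloyd's algorithm) using only the Python standard library. In practice one would more likely use a library implementation such as scikit-learn's `KMeans`; this version exists only to show the assignment and update steps. The function name, parameters, and sample points are assumptions made for illustration.

```python
import random
from math import dist


def kmeans(points, k, iterations=100, seed=0):
    """Cluster tuples of numbers into k groups; return (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialise centroids by picking k distinct input points at random.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # Converged: no centroid moved.
        centroids = new_centroids
    return centroids, assignments


# Two visibly separate groups of 2-D points (made-up data).
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(sample, k=2)
print(centroids)
print(labels)
```

The result depends on the random initialisation, which is why the seed is a parameter; library implementations typically run several initialisations and keep the best.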
6. Cloud Storage
- Overview: Cloud Storage provides scalable and secure storage for large datasets.
- Unsupervised Learning Capabilities:
- Data Storage: Stores the data needed for unsupervised learning tasks. It integrates with other GCP tools for data processing and analysis.