Month: November 2020

  • Real Time Object Detection Algorithms

    YOLO provides real-time object detection. Logistic Regression, Naive Bayes and SVC are not computer vision algorithms; they are conventional machine learning algorithms. Need a hands-on Data Architect, AI, ML or GCP Consultant? Need help with your data journey? Start the conversation today.

  • Reduce Dimensionality of Data using PCA

    Reducing the number of dimensions is a common data pre-processing step. It helps overcome the downsides of high dimensionality. Principal Component Analysis (PCA) is a technique for achieving this, but it works only on numerical data.
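
    To make the idea concrete, here is a minimal sketch of PCA that reduces 2-D numerical points to one dimension. It uses the closed-form principal axis of a 2x2 covariance matrix rather than a library; the function names and the sample points are made up for illustration.

```python
import math

def pca_first_component(points):
    """Return (mean, unit direction) of the first principal component of 2-D points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix.
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # For a symmetric 2x2 matrix, the leading eigenvector lies at this angle.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))

def project(points, mean, direction):
    """Reduce each 2-D point to a single coordinate along the principal direction."""
    return [
        (x - mean[0]) * direction[0] + (y - mean[1]) * direction[1]
        for x, y in points
    ]

# Hypothetical sample: points lying roughly along the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]
mean, direction = pca_first_component(points)
reduced = project(points, mean, direction)
```

    Each point is now described by one number instead of two, while most of the variance is kept. Categorical columns would have to be encoded numerically before a step like this could be applied.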

  • Basic Data Ingestion and Real Time Processing in GCP

    Say you need to deploy sensor devices (e.g. air quality measurement devices) across cities worldwide. The data collected from these devices must be ingested, processed and analyzed in real time. The basic pipeline starts with Pub/Sub: ingest using Pub/Sub, send the data to Dataflow for pre-processing, and then from Dataflow,…

  • BigQuery Data Loading, Data Sources and Data Formats

    Also read: Basic Data Processing Pipeline in GCP. Federated Data Sources for BigQuery: A federated source (external source) is a source that BigQuery can query directly, without importing the data into BigQuery. Datastore is not a valid federated source. Cloud Storage, Cloud SQL, Bigtable and Google Drive are valid data sources for federating data…

  • Delayed Sensor Data in DataFlow

    Also read: DataFlow Basics. Basics of Late-Arriving Data: Processing time is when an event is received in Dataflow. Event time can be any custom timestamp in the event, e.g. the event creation time. Event time indicates when the event was triggered at the sensor. Due to network latency, event time is always before the processing…
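
    The difference between event time and processing time can be sketched in plain Python. This is not the Dataflow or Apache Beam API: the window size, the allowed lateness and the function names are assumptions, and a window is treated as closed once processing time passes its end, whereas a real pipeline would track a watermark.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=1)              # assumed fixed window size
ALLOWED_LATENESS = timedelta(seconds=30)   # assumed tolerance for late events

def window_start(event_time):
    """Start of the fixed event-time window that contains event_time."""
    epoch = datetime(1970, 1, 1)
    return event_time - ((event_time - epoch) % WINDOW)

def classify(event_time, processing_time):
    """Label an event as on time, late but accepted, or dropped."""
    window_end = window_start(event_time) + WINDOW
    if processing_time <= window_end:
        return "on_time"
    if processing_time <= window_end + ALLOWED_LATENESS:
        return "late_accepted"
    return "dropped"

# Hypothetical sensor event created at 12:00:50 but received at 12:01:20.
created = datetime(2020, 11, 1, 12, 0, 50)
received = datetime(2020, 11, 1, 12, 1, 20)
status = classify(created, received)
```

    The event belongs to the 12:00 window because of its event time, even though it arrives after that window has ended.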

  • BigTable Hotspots

    When most writes go to the same node in a Bigtable cluster, that node becomes a bottleneck. Such a node is called a hotspot. This can happen because of poor row key design. Key Visualizer helps detect such hotspots.
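
    A small simulation can show why row key design matters. Bigtable stores rows in sorted key order, so keys that share a prefix land in the same contiguous range. The sketch below is plain Python, not the Bigtable client: the key formats, the split points and the sensor names are all hypothetical, and a real cluster chooses its own splits.

```python
import bisect
from collections import Counter

def timestamp_first_key(sensor_id, ts):
    """Key that starts with the timestamp: simultaneous writes sort together."""
    return f"{ts}#{sensor_id}"

def sensor_first_key(sensor_id, ts):
    """Key that starts with the sensor id: writes spread across the key space."""
    return f"{sensor_id}#{ts}"

def writes_per_range(keys, split_points):
    """Count how many keys fall into each contiguous, sorted key range."""
    counts = Counter()
    for key in keys:
        counts[bisect.bisect_right(split_points, key)] += 1
    return counts

sensors = [f"sensor-{i:03d}" for i in range(100)]
now = "20201125120000"  # all sensors write at the same moment

# Assumed split points for each key layout (four ranges each).
time_splits = ["20201101", "20201110", "20201120"]
sensor_splits = ["sensor-025", "sensor-050", "sensor-075"]

hot = writes_per_range([timestamp_first_key(s, now) for s in sensors], time_splits)
spread = writes_per_range([sensor_first_key(s, now) for s in sensors], sensor_splits)
```

    With timestamp-first keys every write lands in the last range, while sensor-first keys divide the same writes evenly across all four ranges.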

  • Duplicate Sensor Data?

    A set of sensors posts events to GCP Cloud Pub/Sub over the MQTT protocol. If you observe many duplicate messages in the Pub/Sub topic, chances are that the endpoint's acknowledgement is not arriving on time. As per the GCP docs, duplicate messages in the topic can be caused by delayed acknowledgements. If acknowledgement is…
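
    Separately from fixing the acknowledgement delay, a consumer can be made tolerant of duplicates. The sketch below is an assumption on our part, not something the excerpt prescribes: it deduplicates on an application-level key (device id plus event timestamp), since re-published copies of the same reading would not share a Pub/Sub message ID. The class and field names are hypothetical, and it keeps all keys in memory, which a long-running job would need to bound.

```python
class Deduplicator:
    """Process each (device_id, event_timestamp) pair at most once."""

    def __init__(self):
        self._seen = set()

    def process(self, device_id, event_timestamp, payload, handler):
        """Call handler(payload) for new events; return False for duplicates."""
        key = (device_id, event_timestamp)
        if key in self._seen:
            return False
        handler(payload)
        # Record the key only after the handler has succeeded.
        self._seen.add(key)
        return True

stored = []
dedup = Deduplicator()
first = dedup.process("sensor-1", "2020-11-01T12:00:00Z", {"pm25": 12}, stored.append)
repeat = dedup.process("sensor-1", "2020-11-01T12:00:00Z", {"pm25": 12}, stored.append)
other = dedup.process("sensor-2", "2020-11-01T12:00:00Z", {"pm25": 9}, stored.append)
```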

  • Hive and RDBMS Sync Issues

    You use Hive to analyze data stored in HDFS. The data is regularly retrieved from an RDBMS store, and the RDBMS is frequently updated. This causes a lot of duplicate data in HDFS. How would you overcome this issue? The ORC file format supports updates on HDFS through Hive transactional tables.

  • Linear Regression for Property Price Prediction

    Example: build a predictive model for property prices in your city. Algorithm choices: Neural Network, XGBoost and SVR all have poor interpretability. Logistic Regression is suited mostly to classification. Linear Regression is the best choice here: it provides both accuracy and interpretability.
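
    The interpretability argument is easiest to see with a single feature. Below is a minimal sketch of ordinary least squares in plain Python; the areas and prices are made-up numbers, and a real model would use more features and a library implementation.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: floor area in square metres and sale price.
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150_000.0, 240_000.0, 300_000.0, 360_000.0]

slope, intercept = fit_line(areas, prices)
predicted = slope * 90.0 + intercept  # estimated price for a 90 m2 property
```

    The fitted slope reads directly as "price per additional square metre", which is the kind of explanation the less interpretable models listed above do not give as readily.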

  • OLTP and OLAP Bottlenecks

    Say you have an online transaction processing (OLTP) database. Daily transaction data is extracted, transformed and loaded from this OLTP system into a Teradata database (OLAP) during a nightly batch. As the customer base grows, the nightly ETL loads take longer and longer, and business users cannot view the latest report the next morning. What solution…