Basics of Late Arriving data
Processing time is when an event is received in Dataflow. Event time can be any custom timestamp in the event e.g. event create time. Event time indicates when the event was triggered from the sensor. Due to network latency, Event time is always before the processing time.
A watermark is a threshold that indicates when Dataflow expects all of the data in a window to have arrived. If new data arrives with a timestamp that’s in the window but older than the watermark, the data is considered late data.
Managing late data in DataFlow
You can allow late data by invoking the .withAllowedLateness
operation when you set your PCollection
‘s windowing strategy. The following code example demonstrates a windowing strategy that will allow late data up to two days after the end of a window.
PCollection<String> items = ...; PCollection<String> fixedWindowedItems = items.apply( Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))).withAllowedLateness(Duration.standardDays(2)));
Need a hands-on Data Architect, AI, ML or GCP Consultant?
Need help with your data journey? Start the conversation today.