Apache Beam

Description

Apache Beam currently provides SDKs for Java, Python, and Go. Documentation: https://beam.apache.org/documentation/

Apache Beam is the open-source successor to the Google Cloud Dataflow SDK. Cloud Dataflow is one of the runners that can execute Beam pipelines.

Cloud Dataflow is a fully managed service for transforming and enriching data in stream (real-time) and batch (historical) modes with equal reliability and expressiveness, without complex workarounds or compromises. Its serverless approach to resource provisioning and management gives you access to virtually limitless capacity for your largest data processing challenges, and you pay only for what you use.

Cloud Dataflow unlocks transformational use cases across industries, including:

  • Clickstream, Point-of-Sale, and segmentation analysis in retail

  • Fraud detection in financial services

  • Personalized user experience in gaming

  • IoT analytics in manufacturing, healthcare, and logistics

References

Tutorials & examples

Understanding Streaming

Google Dataflow Pipeline example

Talks

- Serverless data processing with Google Cloud Dataflow (Google Cloud Next '17) http://youtube.com/watch?v=3BrcmUqWNm0

- Apache Beam: Portable and Parallel Data Processing (Google Cloud Next '17) https://www.youtube.com/watch?v=owTuuVt6Oro

Built-in Transforms

How are Java exceptions handled in Dataflow?

Your pipeline may throw exceptions while processing data. Some of these errors are transient (e.g., temporary difficulty accessing an external service), but some are permanent, such as errors caused by corrupt or unparseable input data, or null pointers during computation.

Dataflow processes elements in arbitrary bundles, and will retry the complete bundle when an error is thrown for any element in that bundle. When running in batch mode, bundles including a failing item are retried 4 times. The pipeline will fail completely when a single bundle has failed 4 times. When running in streaming mode, a bundle including a failing item will be retried indefinitely, which may cause your pipeline to permanently stall.
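The batch retry rule described above can be illustrated with a small plain-Java simulation. This is only a sketch of the semantics, not Beam or Dataflow code: the class name, the method name, and the way a "bundle" is modelled as a List are all assumptions made for illustration. The limit of 4 is the figure quoted in the text, and the indefinite retry of streaming mode is not modelled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Plain-Java simulation of the batch retry rule; hypothetical names, not Beam API. */
class BundleRetryDemo {
    // Figure quoted in the text: a bundle that fails 4 times fails the pipeline.
    static final int MAX_BATCH_ATTEMPTS = 4;

    /**
     * Applies fn to every element of the bundle.
     * Returns the attempt number that succeeded, or -1 if every attempt failed.
     */
    static <T> int processBundle(List<T> bundle, Consumer<T> fn) {
        for (int attempt = 1; attempt <= MAX_BATCH_ATTEMPTS; attempt++) {
            try {
                for (T element : bundle) {
                    fn.accept(element); // one exception aborts the whole bundle
                }
                return attempt;
            } catch (RuntimeException e) {
                // The complete bundle is retried, including elements that
                // had already been processed successfully in this attempt.
            }
        }
        return -1; // in batch mode the pipeline would fail at this point
    }

    public static void main(String[] args) {
        List<String> clean = Arrays.asList("1", "2", "3");
        List<String> corrupt = Arrays.asList("1", "oops", "3");

        System.out.println(processBundle(clean, s -> Integer.parseInt(s)));   // prints 1
        System.out.println(processBundle(corrupt, s -> Integer.parseInt(s))); // prints -1
    }
}
```

Because the corrupt element fails in the same way on every attempt, retrying cannot help. This is the case where a permanent error takes down a batch pipeline or stalls a streaming one.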

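Exceptions in user code (for example, your DoFn instances) are reported in the Dataflow Monitoring Interface. If you run your pipeline with BlockingDataflowPipelineRunner, you'll also see error messages printed in your console or terminal window. (BlockingDataflowPipelineRunner belongs to the Dataflow SDK 1.x; with Beam 2.x you block on the pipeline result by calling waitUntilFinish() instead.)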

Consider guarding against errors in your code by adding exception handlers. For example, if you'd like to drop elements that fail some custom input validation done in a ParDo, use a try/catch block within your ParDo to handle the exception and drop the element. You may also want to use an Aggregator to keep track of error counts (in Beam 2.x, Aggregators were replaced by the Metrics API, for example a Counter).
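The guard pattern can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not Beam code: GuardedParser, parseAll, and droppedCount are hypothetical names, and the long field only stands in for a pipeline-level error counter. In a real DoFn, the same try/catch would sit inside the method annotated with @ProcessElement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of the guard pattern; hypothetical names, not Beam API. */
class GuardedParser {
    // Stands in for a pipeline-level error counter.
    private long droppedCount = 0;

    /** Parses each line as an integer, dropping and counting lines that fail validation. */
    List<Integer> parseAll(List<String> lines) {
        List<Integer> out = new ArrayList<>();
        for (String line : lines) {
            try {
                out.add(Integer.parseInt(line.trim()));
            } catch (NumberFormatException e) {
                // Permanent error: a retry would fail in the same way,
                // so drop the element and record that it was dropped.
                droppedCount++;
            }
        }
        return out;
    }

    long droppedCount() {
        return droppedCount;
    }

    public static void main(String[] args) {
        GuardedParser parser = new GuardedParser();
        System.out.println(parser.parseAll(Arrays.asList("1", "oops", " 3 "))); // prints [1, 3]
        System.out.println(parser.droppedCount());                              // prints 1
    }
}
```

Catching only the exception you expect (here NumberFormatException) keeps genuinely unexpected failures visible, so they still surface as bundle errors rather than being silently dropped.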

Interesting stuff