Don’t be afraid of the ‘fancy’ syntax of Apache Beam Python

Basic Concepts

Pipelines

A pipeline encapsulates the entire series of computations involved in reading input data, transforming that data, and writing output data. The input source and output sink can be the same or of different types, allowing you to convert data from one format to another. Apache Beam programs start by constructing a Pipeline object, and then using that object as the basis for creating the pipeline’s datasets. Each pipeline represents a single, repeatable job.

See also the Beam documentation on how to design your pipeline.
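
A minimal sketch of what that looks like in the Python SDK; the file names here are placeholders:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The Pipeline object is the root of the job; every dataset is created from it.
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    lines = pipeline | 'Read' >> beam.io.ReadFromText('input.txt')  # source
    shouted = lines | 'Upper' >> beam.Map(str.upper)                # transform
    shouted | 'Write' >> beam.io.WriteToText('output')              # sink
```

The `with` block runs the pipeline when it exits, so no explicit `run()` call is needed.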

PCollection

A PCollection represents a potentially distributed, multi-element dataset that acts as the pipeline’s data. Apache Beam transforms use PCollection objects as inputs and outputs for each step in your pipeline. A PCollection can hold a dataset of a fixed size or an unbounded dataset from a continuously updating data source.

A PCollection has several characteristics:

  1. The elements of a PCollection must all be of the same type.
  2. A PCollection is immutable: transforms create new collections instead of modifying their input.
  3. A PCollection does not support random access to individual elements.
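
A quick sketch of those three properties, using the built-in `beam.Create` to build a PCollection from in-memory data:

```python
import apache_beam as beam

with beam.Pipeline() as pipeline:
    # A bounded PCollection built from in-memory data; every element is a str.
    words = pipeline | beam.Create(['to', 'be', 'or', 'not', 'to', 'be'])
    # Transforms never mutate their input; Map produces a new PCollection.
    lengths = words | beam.Map(len)
    # There is no words[0]: a PCollection offers no random access to elements.
    lengths | beam.Map(print)
```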

Transforms

See also the Dataflow documentation on transforms.

A transform represents a processing operation that transforms data. A transform takes one or more PCollections as input, performs an operation that you specify on each element in that collection, and produces one or more PCollections as output. A transform can perform nearly any kind of processing operation, including performing mathematical computations on data, converting data from one format to another, grouping data together, reading and writing data, filtering data to output only the elements you want, or combining data elements into single values.

In Python, the Beam SDK provides a generic apply method, exposed as the pipe operator `|`; you can also attach a label to a step with `>>`, as in `'Label' >> transform`.
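
For example, a small word-count chain built entirely out of `|` and labelled steps:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    counts = (
        p
        | 'Create' >> beam.Create(['a b', 'b c'])
        | 'Split' >> beam.FlatMap(str.split)      # each | applies one transform
        | 'Pair' >> beam.Map(lambda w: (w, 1))
        | 'Count' >> beam.CombinePerKey(sum)
    )
    counts | beam.Map(print)  # ('a', 1), ('b', 2), ('c', 1)
```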

The core Beam transforms are the following; a short example sketch of each follows the list.

  • ParDo

    Roughly equivalent to Map: generic parallel processing. When you apply a ParDo, you need to provide a DoFn object; the DoFn contains the logic that is applied to each element in the input collection.

    When you use Beam, often the most important pieces of code you’ll write are these DoFns, since they define your pipeline’s exact data processing tasks. Lightweight DoFns can be provided inline using lambda functions.

  • GroupByKey

    GroupByKey represents a transform from a multimap (multiple keys to individual values) to a uni-map (unique keys to collections of values).

  • CoGroupByKey

    CoGroupByKey performs a relational join of two or more key/value PCollections that have the same key type.

  • Combine

    When you use Combine, you provide the combining logic: a simple callable (such as sum) for simple combines, or a CombineFn subclass for more complex ones.

  • Flatten

    Flatten merges multiple PCollection objects into a single logical PCollection.

  • Partition

    Partition is a Beam transform for PCollection objects that store the same data type. Partition splits a single PCollection into a fixed number of smaller collections.
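
A ParDo sketch; the `SplitWords` DoFn and the sample sentence are invented for illustration:

```python
import apache_beam as beam

class SplitWords(beam.DoFn):
    """The per-element logic lives in process(); it may yield 0..n outputs."""
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    (p
     | beam.Create(['the quick brown fox'])
     | beam.ParDo(SplitWords())
     | beam.Map(print))
```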
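
A GroupByKey sketch over an invented multimap:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | beam.Create([('cat', 1), ('dog', 5), ('cat', 9)])  # repeated keys
     | beam.GroupByKey()   # -> ('cat', [1, 9]), ('dog', [5])
     | beam.Map(print))
```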
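
A CoGroupByKey sketch joining two invented collections that share `str` keys:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    emails = p | 'Emails' >> beam.Create([('amy', 'amy@example.com')])
    phones = p | 'Phones' >> beam.Create([('amy', '555-1234')])
    # The dict keys name each input in the joined result.
    joined = {'emails': emails, 'phones': phones} | beam.CoGroupByKey()
    joined | beam.Map(print)  # ('amy', {'emails': [...], 'phones': [...]})
```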
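
A Combine sketch with a hand-written CombineFn; `AverageFn` is my own illustration of the four methods a CombineFn implements:

```python
import apache_beam as beam

class AverageFn(beam.CombineFn):
    """Combining logic: accumulate a (sum, count) pair, divide at the end."""
    def create_accumulator(self):
        return (0.0, 0)

    def add_input(self, accumulator, value):
        total, count = accumulator
        return (total + value, count + 1)

    def merge_accumulators(self, accumulators):
        totals, counts = zip(*accumulators)
        return (sum(totals), sum(counts))

    def extract_output(self, accumulator):
        total, count = accumulator
        return total / count if count else float('nan')

with beam.Pipeline() as p:
    (p
     | beam.Create([1, 2, 3, 4])
     | beam.CombineGlobally(AverageFn())
     | beam.Map(print))  # 2.5
```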
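
A Flatten sketch merging two small collections:

```python
import apache_beam as beam

with beam.Pipeline() as p:
    a = p | 'A' >> beam.Create([1, 2])
    b = p | 'B' >> beam.Create([3, 4])
    # One logical PCollection containing 1, 2, 3, 4.
    merged = (a, b) | beam.Flatten()
    merged | beam.Map(print)
```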
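
A Partition sketch; `by_parity` is an invented partitioning function, which must return an integer in `[0, num_partitions)`:

```python
import apache_beam as beam

def by_parity(n, num_partitions):
    return n % num_partitions  # 0 -> evens, 1 -> odds

with beam.Pipeline() as p:
    evens, odds = (p
                   | beam.Create([1, 2, 3, 4, 5])
                   | beam.Partition(by_parity, 2))  # fixed number: 2
    odds | beam.Map(print)  # 1, 3, 5
```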

Windowing

Windowing subdivides a PCollection according to the timestamps of its individual elements.

Beam provides several windowing functions, including the following (a fixed-window sketch follows the list):

  • Fixed Time Windows
  • Sliding Time Windows
  • Per-Session Windows
  • Single Global Window
  • Calendar-based Windows (not supported by the Beam SDK for Python)
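
A fixed-window sketch; the click events and their timestamps (in seconds) are invented, and `TimestampedValue` attaches them by hand since in-memory elements carry no event time of their own:

```python
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    (p
     | beam.Create([('click', 2), ('click', 8), ('click', 75)])
     # Attach an event-time timestamp to each element.
     | beam.Map(lambda kv: window.TimestampedValue(kv, kv[1]))
     # Subdivide the PCollection into fixed 60-second windows.
     | beam.WindowInto(window.FixedWindows(60))
     | beam.combiners.Count.PerKey()  # counts are now per key *per window*
     | beam.Map(print))  # ('click', 2) for [0, 60), ('click', 1) for [60, 120)
```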

Credit to