Apache Beam Spark Pipeline Engine

Beam Spark

The Apache Spark Runner can be used to execute Beam pipelines using Apache Spark.

The Spark Runner executes Beam pipelines on top of Apache Spark, providing:

  • Batch and streaming (and combined) pipelines.

  • The same fault-tolerance guarantees as provided by RDDs and DStreams.

  • The same security features Spark provides.

  • Built-in metrics reporting using Spark’s metrics system, which reports Beam Aggregators as well.

  • Native support for Beam side-inputs via spark’s Broadcast variables

Beam Spark Engine

Check the Apache Beam Spark runner docs for more information.

Options

Option Description Default

The Spark master

The url of the Spark Master. This is the equivalent of setting SparkConf#setMaster(String) and can either be local[x] to run local with x cores, spark://host:port to connect to a Spark Standalone cluster, mesos://host:port to connect to a Mesos cluster, or yarn to connect to a yarn cluster.

local[4]

Streaming: batch interval (ms)

The StreamingContext’s batchDuration - setting Spark’s batch interval.

1000

Streaming: checkpoint directory

A checkpoint directory for streaming resilience, ignored in batch. For durability, a reliable filesystem such as HDFS/S3/GS is necessary.

local dir in /tmp

Streaming: checkpoint duration (ms)

Enable Metrics sink

A servlet within the existing Spark UI to serve metrics data as JSON data.

Streaming: maximum records per batch

The maximum records per batch interval.

Streaming: minimum read time (ms)

Mimimum elapsed read time.

Bundle size

The maximum number of elements in a bundle.

User agent

A user agent string as per RFC2616, describing the pipeline to external services.

Temp location

Path for temporary files.

Plugins to stage (, delimited)

Comma separated list of plugins.

Transform plugin classes

List of transform plugin classes.

XP plugin classes

List of extensions point plugins.

Streaming Hop transforms flush interval (ms)

The amount of time after which the internal buffer is sent completely over the network and emptied.

Hop streaming transforms buffer size

The internal buffer size to use.

Fat jar file location

Fat jar location.