Apache Beam Direct Pipeline Engine

Beam Direct

The Direct runner can be used for local testing and development.

The Direct Runner executes pipelines on your machine and is designed to validate that pipelines adhere to the Apache Beam model as closely as possible. Instead of focusing on efficient pipeline execution, the Direct Runner performs additional checks to ensure that users do not rely on semantics that are not guaranteed by the model.

  • Enforcing immutability of elements

  • Enforcing encodability of elements

  • Elements are processed in an arbitrary order at all points

  • Serialization of user functions (DoFn, CombineFn, etc.)

Using the Direct Runner for testing and development helps ensure that pipelines are robust across different Beam runners. In addition, debugging failed runs can be a non-trivial task when a pipeline executes on a remote cluster. Instead, it is often faster and simpler to perform local unit testing on your pipeline code. Unit testing your pipeline locally also allows you to use your preferred local debugging tools.

Beam Direct Engine

Options

Option Description

Number of workers

The Number of threads or subprocesses is used to configure parallelism. From 2.22.0, direct_num_workers = 0 is supported. When direct_num_workers is set to 0, it will set the number of threads/subprocess to the number of cores of the machine where the pipeline is running.

User agent

A user agent string as per RFC2616, describing the pipeline to external services.

Temp location

Path for temporary files.

Plugins to stage (, delimited)

Comma separated list of plugins.

Transform plugin classes

List of transform plugin classes.

XP plugin classes

List of extensions point plugins.

Streaming Hop transforms flush interval (ms)

The amount of time after which the internal buffer is sent completely over the network and emptied.

Hop streaming transforms buffer size

The internal buffer size to use.

Fat jar file location

Fat jar location.