Apache Beam Google DataFlow Pipeline Engine

Beam DataFlow

Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. As a managed Google Cloud service, it provisions worker nodes and provides out-of-the-box optimization.

The Cloud Dataflow Runner and service are suitable for large-scale, continuous jobs and provide:

  • A fully managed service

  • Autoscaling of the number of workers throughout the lifetime of the job

  • Dynamic work re-balancing

Check the Google DataFlow docs and Apache Beam DataFlow runner docs for more information.
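
To illustrate how the engine options below map to Beam's Dataflow pipeline options, here is a minimal Java sketch of a plain Beam pipeline configured for the Dataflow runner. All project, bucket, and job names are placeholders, and the pipeline body is left empty; the Hop engine sets these options for you from the settings described in the next section.

import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowEngineSketch {
  public static void main(String[] args) {
    // Placeholder values throughout; substitute your own project, bucket and job name.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);
    options.setProject("my-gcp-project");                  // Project ID
    options.setJobName("my-hop-pipeline");                 // Application (job) name
    options.setRegion("us-central1");                      // Region
    options.setStagingLocation("gs://my-bucket/staging");  // Staging location (gs:// URL)
    options.setTempLocation("gs://my-bucket/tmp");         // Temp location (gs:// URL)
    options.setNumWorkers(1);                              // Initial number of workers
    options.setMaxNumWorkers(4);                           // Maximum number of workers
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED); // Auto-scaling algorithm
    options.setWorkerMachineType("n1-standard-2");         // Worker machine type
    options.setDiskSizeGb(30);                             // Disk size in GB

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here, then submit it to the Dataflow service:
    pipeline.run();
  }
}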

Beam Dataflow Engine

Options

Option Description

Project ID

The project ID for your Google Cloud Project. This is required if you want to run your pipeline using the Dataflow managed service.

Application name

The name of the Dataflow job being executed as it appears in Dataflow’s jobs list and job details. Also used when updating an existing pipeline.

Staging location

Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs://.

Initial number of workers

The initial number of Google Compute Engine instances to use when executing your pipeline. This option determines how many workers the Dataflow service starts up when your job begins.

Maximum number of workers

The maximum number of Compute Engine instances to be made available to your pipeline during execution. Note that this can be higher than the initial number of workers (numWorkers) to allow your job to scale up, automatically or otherwise.

Auto-scaling algorithm

The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT_BASED to enable autoscaling, or NONE to disable. See Autotuning features to learn more about how autoscaling works in the Dataflow managed service.

Worker machine type

The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types.

For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement.

Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. Check the list of machine types for reference.

Worker disk type

The type of persistent disk to use, specified by a full URL of the disk type resource.

For example, use compute.googleapis.com/projects/PROJECT/zones/ZONE/diskTypes/pd-ssd, where PROJECT and ZONE are placeholders for your project ID and zone, to specify an SSD persistent disk.
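
The same full-URL format can be set programmatically on Beam's worker pool options. The following is an illustrative sketch with a placeholder project and zone, not part of the Hop engine configuration itself.

import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class WorkerDiskTypeSketch {
  public static void main(String[] args) {
    DataflowPipelineWorkerPoolOptions poolOptions =
        PipelineOptionsFactory.create().as(DataflowPipelineWorkerPoolOptions.class);
    // Placeholder project and zone; the full disk type resource URL selects an SSD persistent disk.
    poolOptions.setWorkerDiskType(
        "compute.googleapis.com/projects/my-gcp-project/zones/us-central1-f/diskTypes/pd-ssd");
  }
}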

Disk size in GB

The disk size, in gigabytes, to use on each remote Compute Engine worker instance. If set, specify at least 30 GB to account for the worker boot image and local logs.

Region

Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for workerRegion is automatically assigned.

Note: This option cannot be combined with workerZone or zone.

Check the list of Compute Engine regions for reference.

Zone

Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs.

Note: This option cannot be combined with workerRegion or workerZone.

User agent

A user agent string as per RFC2616, describing the pipeline to external services.

Temp location

Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs://.

Plugins to stage (, delimited)

Comma-separated list of plugins to stage.

Transform plugin classes

List of transform plugin classes.

XP plugin classes

List of extension point plugin classes.

Streaming Hop transforms flush interval (ms)

The interval, in milliseconds, after which the internal buffer is flushed completely over the network and emptied.

Hop streaming transforms buffer size

The internal buffer size to use.

Fat jar file location

The location of the Hop fat jar file to use when executing the pipeline.

Environment Settings

This environment variable needs to be set locally:

GOOGLE_APPLICATION_CREDENTIALS=/path/to/google-key.json
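
The Dataflow runner and the Google Cloud client libraries pick this key up through Application Default Credentials. As a quick sanity check, separate from the Hop configuration itself, a small Java snippet like the following (using the Google auth library) resolves the credentials the same way:

import com.google.auth.oauth2.GoogleCredentials;
import java.io.IOException;

public class CredentialsCheck {
  public static void main(String[] args) throws IOException {
    // With GOOGLE_APPLICATION_CREDENTIALS set, Application Default Credentials
    // loads the service account key file the variable points to.
    GoogleCredentials credentials = GoogleCredentials.getApplicationDefault();
    System.out.println("Loaded application default credentials: " + credentials);
  }
}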