Google Dataflow Pipeline (Template)
Apache Hop pipelines can be scheduled and triggered in various ways. In this section we walk through the steps needed to schedule a pipeline on Google Dataflow using Dataflow Flex Templates. Apache Hop uses a Flex Template to launch a job on Google Dataflow.
Preparing your environment
Before we can add a new pipeline in the Google Cloud Platform console, we need to create a Google Cloud Storage bucket that contains three types of files.
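As a quick sketch, the bucket can be created with the gsutil CLI; the bucket name and region below are placeholder values you should replace with your own:

```shell
# Create a Cloud Storage bucket to hold the pipelines, metadata and template file.
# "my-hop-bucket" and "europe-west1" are example values.
gsutil mb -l europe-west1 gs://my-hop-bucket
```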
Hop pipelines
The pipelines you created using the Hop GUI and wish to schedule in Google Dataflow.
- Tip
-
You can also create a Hop project using a Google Storage bucket; this way you can directly create and edit Hop pipelines in Google Cloud Storage.
Hop Metadata
For the pipeline to be able to use Hop metadata objects and other run configurations, we need to generate a hop-metadata.json file. This file can be generated from the GUI under Tools → Export metadata to JSON, or using the export-metadata function of the Hop Conf command-line tool.
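The command-line route might look like the following; the exact option name for the metadata export can differ between Hop versions (check `sh hop-conf.sh -h`), and the bucket path is a placeholder:

```shell
# Export the project metadata to a JSON file (option name may vary by Hop version),
# then copy it to the Cloud Storage bucket created earlier.
sh hop-conf.sh --export-metadata /tmp/hop-metadata.json
gsutil cp /tmp/hop-metadata.json gs://my-hop-bucket/hop-metadata.json
```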
Beam Flex Template metadata file
The final part to get everything working is a metadata file used by Dataflow to stitch all the parts together.
{
  "defaultEnvironment": {},
  "image": "apache/hop-dataflow-template:latest",
  "metadata": {
    "description": "This template allows you to start Hop pipelines on dataflow",
    "name": "Template to start a hop pipeline",
    "parameters": [
      {
        "helpText": "Google storage location pointing to the Hop metadata file",
        "label": "Hop Metadata Location",
        "name": "hopMetadataLocation",
        "regexes": [
          ".*"
        ]
      },
      {
        "helpText": "Google storage location pointing to the pipeline you wish to start",
        "label": "Hop Pipeline Location",
        "name": "hopPipelineLocation",
        "regexes": [
          ".*"
        ]
      }
    ]
  },
  "sdkInfo": {
    "language": "JAVA"
  }
}
- Important
-
You can change the Docker image used in the metadata file.
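Instead of authoring the template spec file by hand and uploading it, you can build it with the gcloud CLI from the same image and metadata; this is a sketch, and the bucket and file paths are example values:

```shell
# Build a Flex Template spec file in the bucket from the Docker image
# and the metadata file shown above. Paths and bucket name are examples.
gcloud dataflow flex-template build gs://my-hop-bucket/templates/hop-template.json \
  --image apache/hop-dataflow-template:latest \
  --sdk-language JAVA \
  --metadata-file metadata.json
```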
Creating a Dataflow pipeline
Now we can go back to the Google Cloud console and select "Create data pipeline".
When selecting the Beam Flex Template metadata file you will notice the required parameters showing up. You can then add the paths to the Hop metadata file and the Hop pipeline stored in Cloud Storage.
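The same launch can be done from the command line with `gcloud dataflow flex-template run`, passing the two parameters defined in the metadata file; all names and paths below are example values:

```shell
# Launch a Dataflow job from the template, supplying the two parameters
# declared in the template metadata. Bucket, region and paths are examples.
gcloud dataflow flex-template run "hop-pipeline-job" \
  --template-file-gcs-location gs://my-hop-bucket/templates/hop-template.json \
  --region europe-west1 \
  --parameters hopMetadataLocation=gs://my-hop-bucket/hop-metadata.json \
  --parameters hopPipelineLocation=gs://my-hop-bucket/pipelines/my-pipeline.hpl
```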