Download a recent Hop build.
unzip hop to a local directory
change to the hop directory
Hop is a metadata driven environment where you manage your data processing workflows.
Before anything else, we need to explain Hop’s two main concepts:
Workflow is a (by default) sequential process that has a starting point and one or more endpoints. Between the start and endpoint, a variety of 'actions' can be performed. These actions can range from executing other workflows or pipelines, archiving files that were processed, sending error messages or success notifications and much more.
Pipelines are more granular items of work. A pipeline is where the actual work is done. Pipelines consist of a chain of transforms that read, process or write data. Depending on the execution engine your pipelines run, this can be in batch, streaming or a hybrid mode.
The actions in a workflow and the transforms in a pipeline are connected by 'hops'. Hop are visual links between actions (workflows) and transforms (pipelines).
As you’ll discover soon, the process of creating workflows and pipelines is very similar.
However, there are a number of conceptual differences between workflows and pipelines that you have to keep in mind:
the pipeline engine executes all transforms in a pipeline simultaneously and in parallel. The workflow engine executes all actions in a workflow sequentially by default. When action finishes, the workflow engine checks which action needs to be executed next.
hops in a pipeline pass data between transforms. In a workflow, hops can conditionally determine which action the workflow needs to execute next (on success, on failure, unconditionally)
because of their sequential nature, workflows have 1 action to start from and 1 or more end actions. Pipelines can start with input from multiple transforms simultaneously.
The following tools are at your disposal to work with Hop flows and pipelines:
the Hop Gui is your visual IDE to build, preview, run, test, deploy, … workflows and pipelines.
the Hop Server is a lightweight web server that provides a REST api to run workflows and pipelines remotely.
Hop Run is a command line utility to run workflows and pipelines.
The Hop Gui is your local development environment to build, run, preview and debug (work)flows and pipelines.
Check out this short video to learn how to download, unzip and start the Hop Gui (on Windows).
<!-- [html-validate-disable-next deprecated] -→ video::RMIOTmZK-YE[youtube, width=75%, height=400px]
After starting the Hop Gui, you’ll be presented with a window like the one below.
After clicking the 'New' icon in the upper left corner, you’ll be presented with the window below. Choose either 'New Workflow' or 'New Pipeline'.
Your new pipeline is created, and you’ll see the dialog below.
Let’s walk through the top toolbar:
start the execution of the pipeline
pause the execution of the pipeline
stop the execution of the pipeline
preview the pipeline
debug the pipeline
print the pipeline
undo an operation
redo an operation
align the specified (selected) transforms to the specified grid size
align the selected transforms with left-most selected transform in the selection
align the selected transforms with right-most selected transform in the selection
align the selected transforms with top-most selected transform in the selection
align the selected transforms with bottom-most selected transform in the selection
Distribute the selected transforms evenly between the left-most and right-most transform in your selection
Distribute the selected transforms evenly between the top-most and bottom-most transform in your selection
Pipelines consist of two main work items:
transforms are the basic operations in your pipeline. A pipeline typically consists of a lot of transforms that are chained together by hops. Transforms are granular, in the sense that each transform is designed and optimized to perform one and only one task. Although one transform by itself may not offer spectacular functionality, the combination of all transforms in a pipeline is makes your pipelines powerful.
hops link transforms together. When a transform finishes processing the data set it received, that data set is passed to the next transform through a hop. Hops are uni-directional (data can’t flow backwards). Hops only buffer and pass data around, the hop itself is transform-agnostic, it doesn’t know anything about the transforms it passes data from or to. Some transforms can read from or write to other transforms conditionally to or from a number of other transforms, but this a transform-specific configuration. The hop is unaware of it. Hops can be disabled by clicking on them, or through right-click → disable.
Click anywhere in the pipeline canvas, the area where you’ll see the image below.
Upon clicking, you’ll be presented with the dialog shown below. The search box at the top of this dialog works for transform, name, tags (TODO) etc. Once you’ve found the transform you’re looking for, click on it to add it to your pipeline. An alternative to clicking is arrow key navigation + enter. Repeat this step now or whenever you want to add more transforms to your pipeline. Once you’ve added a transform to your pipeline, you can drag to reposition it.
TODO: link to transform documentation.
Add a 'Generate Rows' and a 'Add Sequence' transform, and your pipeline should like the one below.
There are a number of ways to create a hop:
shift-drag: while holding down the shift key on your keyboard. Click on a transform, while holding down your primary mouse button, drag to the second transform. Release the primary mouse button and the shift key.
scroll-drag: scroll-click on a transform , while holding down your mouse’s scroll button, drag to the second transform. Release the scroll button.
click on a transform in your pipeline to open the 'click anywhere' dialog. Click the 'Create hop' image::getting-started/icons/HOP.svg[Create hop, 25px, align="bottom"] button and select the transform you want to create the hop to.
Click the 'run' button image::getting-started/icons/run.svg[Run, 25px, align="bottom"] in your pipeline toolbar
Let’s walk through the options in this dialog
Pipeline run configurations, edit, new, manage your run configurations. Run configurations are used to specify a name, description and engine to run your pipeline.
Log level: choose the log level for your pipeline. The available options are
Row Level (very detailed)
Clear log before running (enabled by default): logging information from previous runs will be cleared from the logging tab.
parameters: This table will show the parameter name, default value and description. enter your runtime parameters in the 'value' field.
variables: add the variable name and value you want to set in this tab.
always show dialog on run (enabled by default): You’ll be presented with this dialog every time you run this dialog. When disabled, the pipeline will run with the default options.
Click the 'New' button right next to the 'Pipeline run configuration'. Give your run configuration a name and (optionally) a description. Choose the 'local pipeline engine'. As the name implies, the 'local single threaded pipeline engine' runs the pipeline in a single CPU thread. The default 'local pipeline engine' will create a separate CPU thread for each transform in your pipeline to evenly spread the load of your pipeline over your CPU cores.
Click 'Ok' to create your configuration and select it from the dropdown list. For this getting started guide, we’ll leave all other options to the defaults. Click 'Launch'.
Since we haven’t saved our pipeline yet, you’ll be prompted to do so by the dialog below.
Your pipeline will finish in a matter of milliseconds, and the 'Execution Result' view will show up at the bottom of your IDE. This view has 5 tabs:
transform metrics: transformName, read, written, input, output, update, rejected, errors, buffers input, buffers output, speed, status (TODO: elaborate)
logging: the logging output for your pipeline
preview data: a preview of the data for the selected transform. This grid shows the data as it passed through this transform.
performance graph: TODO
While developing your pipeline, you’ll often want to check your data as it enters or exits a transform. Previews are an easy way to take a glance at the state of your data stream as it exits a transform.
To preview the data that is processed by a transform, click on a transform and select 'Preview output'. The same result can be achieved by selecting a transform in your pipeline (rectangle select) and clicking the preview (eye) icon in the pipeline toolbar.
You’ll be presented with the dialog below. You can change the number of rows to preview (1000 by default), but in most cases, you’ll just want to hit the green 'Quick Launch' button.
Once your pipeline finished processing the selected number of rows for the selected transform, a new popup dialog will show your preview results.
|your entire pipeline is executed for a preview, you’re just taking a peek into the processing at the selected transform. If your pipeline modifies data (writes, updates, deletes) further down the stream, those actions *will be performed, even if you’re previewing an earlier transform.|
Let’s take a quick look at the buttons at the bottom of this dialog:
Close: closes the preview dialog. The pipeline will remain paused, and will therefore still be active.
Stop: stop the preview and the pipeline execution.
Get more rows: fetch the next 1000 (or any other selected amount of) rows for preview.
Debugging a pipeline’s transform is very similar to previewing. Instead of pausing the pipeline execution after a given number of rows, the pipeline is paused when a given condition is met. The process to start a debug session is similar to starting the preview: click on a transform and select 'Debug output' from the pop-up dialog, or select a transform and hit the bug-icon in the pipeline toolbar.
You’ll be presented with the dialog below. You’ll recognize this dialog from the 'preview' we just did, but instead, the 'Retrieve first rows (preview)' option is now unchecked, and 'Pause pipeline on condition' is checked.
In the 'Break-point / pause condition' below that option, you can specify on which condition you want to debug. This dialog is the same as the Filter Rows transform.
In our very basic example, we’ve set a breakpoint on 'valuename > 5'.
With the 'valuename > 5' breakpoint, our pipeline is paused as soon as this condition is met (valuename = 6). The rows preceding that moment are also shown, so you can investigate how your data was processed before the breakpoint condition was true.
Similar to the preview options, you can close, stop or continue the debugging ('Get more rows'). When you tell your pipeline to 'Get more rows', the pipeline execution will be resumed until the breakpoint condition is met once more, instead of just fetching the next 1000 (default) rows.
The design and execution of workflows is very similar to that of pipelines. However, keep in mind that there are significant differences between how Hop handles workflows and pipelines under the hood.
To create a workflow, hit the 'new' icon or 'CTRL-N'. From the pop-up dialog, select 'New workflow'.
Add the following actions to your workflow and create the hops to connect them:
Double-click or single-click and choose 'Edit action' to configure the action for the pipeline you just created.
In the pipeline dialog, use the 'Browse' button to select your pipeline and give the action an appropriate name, for example 'First Pipeline'.
Notice how the hops in your workflow are a little different from what you’ve seen in pipeline hops.
Add a fourth action 'Abort' and create a hop from your pipeline action.
You now have the three types of hops that are available in workflows:
unconditional (lock icon, black hop): 'unconditional' hops are followed no matter what the exit code (true/false) of the previous action is
success (green hop, check mark): 'success' hops are used when the previous action executed successfully.
failure (red hop, error mark): 'failure' or 'error' hops are followed when the previous action failed.
|The hop type can be changed by clicking on the hop’s icon.|
With these three hop types and the actions at your disposal, you’re ready to create powerful data orchestration workflows.
As with designing workflows, the steps to run a workflow are very similar to running a pipeline.
Click the 'run' button in your workflow toolbar
In the workflow run dialog, hit the 'New' button in the upper right corner to create a new 'Workflow run configuration'.
In the dialog that pops up, add 'Local Workflow' as the workflow configuration name and choose the 'Local workflow engine'.
Click 'OK' to return to the workflow run dialog, make sure your configuration is selected and hit 'Launch'.
This workflow with our very basic pipeline should execute in less than one second. You’ll now have the execution results pane which again looks very similar to the pipeline execution results.
The first tab in your workflow execution is 'Logging'. This tab shows the logging information for your entire workflow. Any errors that occurred in your workflow will be highlighted in red.
The second tab are your workflow metrics. This tab is less verbose, but gives you an action-by-action overview of the execution of your workflow. The black, green and red color codings indicate information, success and failure. In larger worfklows, the metrics tab will give you a quick overview of what happened in your workflow, what the required time per action was, etc.
You’ll use the logging tab to find more detailed information about what happened in your workflow or in a particular action.
After you’ve designed and tested your pipeline or transform locally, you may want to run it on a headless machine.
The Hop Server is a light weight web server that you can use to run your workflows and pipelines remotely.
First, we’ll have to start the server. Head over to your Hop directory, and locate the 'hop-server' scripts (sh for Mac/Linux, bat for Windows).
Running the script without any arguments will print its usage:
Usage: hop-server <Interface address> <Port> [-h] [-p <arg>] [-s] [-u <arg>] or Usage: hop-server <Configuration File> Starts or stops the hopServer server. -h,--help This help text -p,--password <arg> The administrator password. Required only if stopping the Hop Server server. -s,--stop Stop the running hopServer server. This is only allowed when using the hostname/port form of the command. -u,--userName <arg> The administrator user name. Required only if stopping the Hop Server server. Example: hop-server.sh 127.0.0.1 8080 Example: hop-server.sh 192.168.1.221 8081 Example: hop-server.sh /foo/bar/hop-server-config.xml Example: hop-server.sh http://www.example.com/hop-server-config.xml Example: hop-server.sh 127.0.0.1 8080 -s -u cluster -p cluster
As an example, let’s run our server on our local machine on port 8085:
./hop-server.sh localhost 8085
hop-server.bat localhost 8085
The startup process shouldn’t take more than 1 or 2 seconds, and should show 2 lines of logging information:
2020/04/30 16:22:55 - HopServer - Installing timer to purge stale objects after 1440 minutes. 2020/04/30 16:22:55 - HopServer - Created listener for webserver @ address : localhost:8085
In your favorite browser, go to http://localhost:8085 and sign in with the default user 'cluster' and password 'cluster'.
Click the 'show status' link below to get to page shown in the second screenshot.
We now have verified our server is up and running. Let’s return to Hop Gui to configure a run configuration for it. Click the 'New' icon or 'CTRL-N' and select 'Slave Server'.
In the slave server dialog, enter the details for the local server we just created.
With our slave server in place, all that’s left to do is to create a run configuration for this server. Head back to your pipeline (again, the process is similar for workflows), and hit 'run'. Before running your pipeline, create a new 'Pipeline run configuration'.
Name this configuration 'Remote Pipeline', select 'Remote pipeline engine' as the engine type, select the 'local' run configuration we created earlier, and select 'localhost' for the slave server we just created.
Select this run configuration and run your pipeline. Your execution results will be almost identical to the locale execution you did earlier, however, the logs will show you executed the pipeline remotely:
2020/04/30 17:01:33 - first_pipeline - Executing this pipeline using the Remote Pipeline Engine with run configuration 'Remote Pipeline' ... ... ... 2020/04/30 17:01:34 - first_pipeline - Execution finished on a remote pipeline engine with run configuration 'Remote Pipeline'
The execution results for this pipeline will now be available in our server’s status page as well:
Select the pipeline or workflow line that you want to investigate, and choose one of the options from the options in the upper left corner of the pipeline or workflow overview table. Click the eye icon to open the details for that specfific execution:
Hop Run is the last tool we’ll discuss in this getting started overview. In many cases, you’ll want to run your workflows and pipelines on a headless server, but don’t necessarily want to run through rest services or from Hop Gui.
Hop Run is a command line that can be used to run workflows or pipelines e.g. over ssh of from a cron job.
The command to run is 'hop-run' (sh on Mac/Linux, bat on Windows). Without any arguments, hop-run shows its usage syntax:
A filename is needed to run a workflow or pipeline Usage: <main class> [-hotw] [-e=<environment>] [-f=<filename>] [-l=<level>] [-r=<runConfigurationName>] [-p=<parameters>[, <parameters>...]]... [-s=<systemProperties>[, <systemProperties>...]]... -e, --environment=<environment> The name of the environment to use -f, --file=<filename> The filename of the workflow or pipeline to run -h, --help Displays this help message and quits. -l, --level=<level> The debug level, one of NONE, MINIMAL, BASIC, DETAILED, DEBUG, ROWLEVEL -o, --printoptions Print the used options -p, --parameters=<parameters>[,<parameters>...] A comma separated list of PARAMETER=VALUE pairs -r, --runconfig=<runConfigurationName> The name of the Run Configuration to use -s, --system-properties=<systemProperties>[,<systemProperties>...] A comma separated list of KEY=VALUE pairs -t, --pipeline Force execution of a pipeline -w, --workflow Force execution of a workflow
Since we’ve been working with a very basic pipeline, running it from hop-run is as easy as specifying: * the pipeline filename to run * the run configuration to use
./hop-run.sh -f /tmp/first_pipeline.hpl -r local
You’ll get output that will be very similar to the one below:
2020/04/30 17:16:48 - first_pipeline - Executing this pipeline using the Local Pipeline Engine with run configuration 'local' 2020/04/30 17:16:48 - first_pipeline - Execution started for pipeline [first_pipeline] 2020/04/30 17:16:48 - Generate rows.0 - Finished processing (I=0, O=0, R=0, W=10, U=0, E=0) 2020/04/30 17:16:48 - Add sequence.0 - Finished processing (I=0, O=0, R=10, W=10, U=0, E=0) 2020/04/30 17:16:48 - first_pipeline - Pipeline duration : 0.079 seconds [ 0.079 ] 2020/04/30 17:16:48 - first_pipeline - Execution finished on a local pipeline engine with run configuration 'local' ./hop-run.sh -f /tmp/first_pipeline.hpl -r local 5.62s user 0.34s system 258% cpu 2.309 total
We’ll be adding more documentation as we go, so keep an eye on the Apache Hop (Incubating) documentation section.
A good place to start exploring is the detailed documentation for:
|Apache Hop (Incubating) considers high-quality documentation a very important part of the project. Help us to improve by creating tickets for any documentation errors, suggestions or feature requests in our JIRA system.|