What is Apache Hop?

Apache Hop, short for Hop Orchestration Platform, is a data orchestration and data engineering platform that aims to facilitate all aspects of data and metadata orchestration. Hop lets you focus on the problem you're trying to solve without technology getting in the way. Simple tasks should be easy, and complex tasks need to be possible.

Visual Design and Metadata

Hop allows data professionals to work visually, using metadata to describe how data should be processed. Visual design lets data developers focus on what they want to do instead of how it needs to be done. This focus on the task at hand makes Hop developers more productive than they would be writing code.
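Everything designed in Hop Gui is saved as plain metadata. The sketch below is loosely modeled on Hop's XML pipeline format (a .hpl file) and is illustrative only: it is heavily trimmed, and a real file generated by Hop Gui contains considerably more detail per transform.

    <?xml version="1.0" encoding="UTF-8"?>
    <pipeline>
      <info>
        <name>generate-and-log</name>
      </info>
      <order>
        <!-- a "hop" connects two transforms and defines the flow of data -->
        <hop>
          <from>Generate rows</from>
          <to>Write to log</to>
          <enabled>Y</enabled>
        </hop>
      </order>
      <transform>
        <name>Generate rows</name>
        <type>RowGenerator</type>
      </transform>
      <transform>
        <name>Write to log</name>
        <type>WriteToLog</type>
      </transform>
    </pipeline>

Because the design is just metadata, the same pipeline definition can be versioned in git, diffed, and executed on different engines without modification.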

Architecture

Hop was designed to be as flexible as possible: at its core is the small but powerful Hop engine. All functionality is added through plugins, and the default Hop installation comes with about 400 of them. You can remove plugins or add third-party ones to tailor Hop to your exact needs. Hop is designed to work in any scenario: from IoT to huge volumes of data, on-premises or in the cloud, on a bare OS or in containers and Kubernetes.
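Because plugins are self-contained, tailoring an installation largely comes down to managing folders. A rough sketch, assuming a default Hop installation layout; the folder and plugin names below are examples, not an exact listing:

    cd hop                                    # the unpacked Hop installation
    ls plugins/                               # plugins are grouped by type
    ls plugins/transforms/                    # one folder per transform plugin
    rm -rf plugins/transforms/unused-plugin   # slim down: drop what you don't need
    cp -r ~/downloads/vendor-plugin plugins/transforms/   # add a third-party plugin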

Flexible Runtimes

Hop developers create workflows and pipelines in a visual development environment called Hop Gui. These workflows and pipelines can be executed on a variety of engines: both can run on the native Hop engine, locally or remotely, and pipelines can also run on Apache Spark, Apache Flink and Google Dataflow through the Apache Beam runtime configurations.
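Switching engines is a matter of selecting a different run configuration, not redesigning the pipeline. A minimal sketch using Hop's hop-run command; the file, project and run configuration names here are assumptions for illustration:

    # Run a pipeline on the native local Hop engine.
    ./hop-run.sh --file=pipelines/orders.hpl --project=samples --runconfig=local

    # Run the same pipeline, unchanged, on Google Dataflow through an
    # Apache Beam run configuration defined in the project.
    ./hop-run.sh --file=pipelines/orders.hpl --project=samples --runconfig=Dataflow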

In workflows and pipelines, hundreds of operations can be applied to the data: reading from and writing to a variety of source and target platforms, but also combining, enriching, cleaning and otherwise manipulating data. Depending on the engine and the selected functionality, your data can be processed in batch, in streaming, or in a batch/streaming hybrid.

Use Cases

Common use cases for Hop include:

  • Loading large data sets into databases, taking advantage of cloud, clustered and massively parallel processing environments.

  • Data warehouse population with built-in support for Slowly Changing Dimensions (SCD), Change Data Capture (CDC), and surrogate key creation.

  • Integrating diverse data architectures, combining relational databases and files with NoSQL databases like Neo4j, MongoDB and Cassandra.

  • Data migration between different databases and applications.

  • Data profiling and data cleansing.