Apache Spark and Hop

Due to conflicting libraries, it is currently not possible to support all Beam runtime engines out of the box from the GUI and Hop Server. This is a known problem, and we are looking for a solution that supports all runners out of the box. The good news is that the Apache Flink and Google Dataflow runners work without changes, but for Apache Spark some changes to the installation are needed. This guide explains what you need to change in a Hop 2.0 installation to get Spark running.

Note that after applying these changes, the Flink and Google Dataflow runners will stop working.

Installation guide

The first step is to download Apache Spark; this can be done from the official download page. Make sure you fetch the version that is listed as supported in our documentation, as we will be using this package to add jars to the Hop installation. You can also fetch these jars from your Spark cluster if needed.
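
As a sketch, the download and unpacking could look like this. The Spark version below is an assumption; use the release listed as supported in the Hop documentation.

    # Hypothetical example: download and unpack a Spark binary distribution.
    # The version is an assumption; check the Hop documentation first.
    wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
    tar -xzf spark-3.1.2-bin-hadoop3.2.tgz
    # The jars we need later live in the jars/ folder of the unpacked distribution.
    ls spark-3.1.2-bin-hadoop3.2/jars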

Cleaning up Hop

For this to work, we need to both remove files from and add files to the Beam plugin, located in the /plugins/engines/beam/lib folder.

Files to remove

The following files have to be removed from the /plugins/engines/beam/lib folder (see the sketch after this list):

  • flink-shaded-jackson-*

  • jackson-module-scala*

  • scala-java8-compat*

  • scala-library*

  • scala-parser-combinators*
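
A minimal sketch of the cleanup, assuming HOP_HOME points to the root of your Hop 2.0 installation (the variable name is our assumption, not something Hop sets):

    # Remove the conflicting jars from the Beam plugin.
    cd $HOP_HOME/plugins/engines/beam/lib
    rm flink-shaded-jackson-* jackson-module-scala* scala-java8-compat* \
       scala-library* scala-parser-combinators*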

Files to add

The following files have to be added to the /plugins/engines/beam/lib folder. You can find them in the <spark installation>/jars folder (see the sketch after this list):

  • json4s-ast*

  • json4s-core*

  • json4s-jackson*

  • json4s-scalap*

  • log4j*

  • scala-compiler*

  • scala-library*

  • scala-parser-combinators*

  • scala-reflect*

  • scala-xml*

  • spark-unsafe*

  • xbean-asm7-shaded*
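
Again as a sketch, assuming HOP_HOME and SPARK_HOME point to the Hop and Spark installations respectively (both variable names are our assumptions):

    # Copy the required jars from the Spark distribution into the Beam plugin.
    cd $SPARK_HOME/jars
    cp json4s-ast* json4s-core* json4s-jackson* json4s-scalap* log4j* \
       scala-compiler* scala-library* scala-parser-combinators* scala-reflect* \
       scala-xml* spark-unsafe* xbean-asm7-shaded* \
       $HOP_HOME/plugins/engines/beam/lib/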

Files to fetch from Maven Central

Due to version conflicts, the following file has to be fetched from Maven Central:
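
As a generic sketch of fetching a single jar from Maven Central into the Beam plugin, using Maven's dependency plugin (the coordinates below are placeholders, not the actual artifact):

    # Resolve one artifact from Maven Central and copy it into the Beam plugin.
    # <groupId>:<artifactId>:<version> is a placeholder for the file in question.
    mvn dependency:copy -Dartifact=<groupId>:<artifactId>:<version> \
        -DoutputDirectory=$HOP_HOME/plugins/engines/beam/lib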

Rebuild the fat jar

The final step is to rebuild the fat jar. This can be done from the GUI via Tools → Generate a Hop fat jar, or from the command line with ./hop-conf.sh -fj <location>/fat-jar.jar
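
For example, from the root of the Hop installation (the output path below is a hypothetical choice; any writable location works):

    # Generate the fat jar containing the updated Beam/Spark libraries.
    ./hop-conf.sh -fj /tmp/hop-fat.jar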