Apache Tika

Description

The Apache Tika transform parses files in a wide range of formats and extracts their text content along with any available metadata. It uses the Apache Tika libraries to do the parsing.

The extracted metadata is returned in JSON format. If you need specific pieces of information from this metadata, you can extract them with a JSON Input transform.
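To illustrate what picking specific pieces out of the metadata involves, here is a minimal Python sketch. It is not part of Hop or Tika. The sample JSON, its keys and values, and the helper names `pick_fields` and `as_text` are assumptions made for illustration, and the real output of the transform may be shaped differently. Inside a pipeline, the JSON Input transform would do this selection instead.

```python
import json

# Hypothetical metadata string, shaped like the JSON the transform might emit.
# The keys and values are illustrative, not captured from a real run.
SAMPLE_METADATA = """
{
  "Content-Type": "application/pdf",
  "dc:title": "Quarterly Report",
  "dc:creator": ["A. Author", "B. Author"]
}
"""


def pick_fields(metadata_json, wanted):
    """Return only the requested keys; keys that are absent map to None."""
    metadata = json.loads(metadata_json)
    return {key: metadata.get(key) for key in wanted}


def as_text(value, separator="; "):
    """Flatten a multi-valued entry (a JSON array) into a single string."""
    if isinstance(value, list):
        return separator.join(str(item) for item in value)
    return None if value is None else str(value)


if __name__ == "__main__":
    picked = pick_fields(SAMPLE_METADATA, ["Content-Type", "dc:creator", "dc:subject"])
    for key, value in picked.items():
        print(f"{key}: {as_text(value)}")
```

The sketch treats a missing key as `None` rather than an error, because the metadata available depends on the file being parsed and a given key may not always be present.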

Supported Engines

Hop Engine	Supported
Single Threaded	Supported
Native Spark	Supported
Beam Spark	Maybe Supported
Beam Flink	Maybe Supported
Beam Dataflow	Maybe Supported

Options

Option	Description
Transform name	Name of the transform. Note: This name must be unique within a single pipeline.
File tab	Specify which files will be read and examined.
Content tab	Various content settings, such as the file encoding and the output format.
Output fields tab	Enter the names of the fields you want in the output.
