File Metadata

Description

The File Metadata transform scans a file to determine its metadata structure or layout.

Use this transforms in situations where you need to read a structured text file (e.g. CSV, TSV) when you don’t know the exact layout in advance.

The information provided in this file can be used to actually read the file later, e.g. through metadata injection.

The layout detected depends on the number of rows scanned.

For example, if the first 100 rows of a file are scanned and the first field is detected as an integer, there is a possibility this field contains alphanumerical characters in later rows. Using 0 rows for 'limit scanned rows' is a way to make sure the entire file is scanned and the layout is detected correctly, even though this may be time consuming or even impossible for large files.

Supported Engines

Hop Engine

Supported

Spark

Maybe Supported

Flink

Maybe Supported

Dataflow

Maybe Supported

Options

Option	Description
Transform name	The name for this transform
Filename	The filename to scan for metadata
Filename in a field?	Enable to specify the file name(s) in a field in the input stream
Filename field	When the previous option is enabled, you can specify the field that will contain the filename(s) at runtime.
Limit scanned rows	The number of rows to limit the scan to (default 10,000). Use 0 rows to scan the entire file.
Fallback charset	Charset to use while scanning the file
Delimiter candidates	List of delimiters to try while detecting the file layout. Tab, semicolon, comma are provided by default.
Enclosure candidates	List of delimiters to try while detecting the file layout. Double and single quote are provided by default.

Option

Description

Transform name

The name for this transform

Filename

The filename to scan for metadata

Filename in a field?

Enable to specify the file name(s) in a field in the input stream

Filename field

When the previous option is enabled, you can specify the field that will contain the filename(s) at runtime.

Limit scanned rows

The number of rows to limit the scan to (default 10,000). Use 0 rows to scan the entire file.

Fallback charset

Charset to use while scanning the file

Delimiter candidates

List of delimiters to try while detecting the file layout. Tab, semicolon, comma are provided by default.

Enclosure candidates

List of delimiters to try while detecting the file layout. Double and single quote are provided by default.

Output Fields

The fields returned by this transform are

charset
delimiter
field_count
skip_header_lines
skip_footer_lines
header_line_present
name
type
length
precision
mask
decimal_symbol
grouping_symbol