Text File Input

Description

The Text File Input transform reads data from a variety of text-file types, including formats generated by spreadsheets and fixed width flat files.

The features of the transform allow you to read from a list of files or directories, use wild cards in the form of regular expressions, and accept generalized filenames from previous transforms.

Supported Engines

Hop Engine

Supported

Spark

Supported

Flink

Supported

Dataflow

Supported

Options

General

Option	Description
Transform Name	Name of the transform this name has to be unique in a single pipeline.

Option

Description

Transform Name

Name of the transform this name has to be unique in a single pipeline.

File Tab

The table below provides a detailed descriptions of the features available on the File tab:

Option	Description
File or directory	This field specifies the location and/or name of the input text file.
Regular expression	Specify the regular expression you want to use to select the files in the directory specified in the previous option. For example, you want to process all files that have a .txt extension. (See below "Selecting file using Regular Expressions")
Exclude Regular expression	Specify a regular expression to exclude filenames within a specified directory.
Selected Files	This table contains a list of selected files (or wildcard selections) along with a property specifying if file is required or not. If a file is required and it isn’t found, an error is generated. Otherwise, the filename is skipped.
Show filenames(s)…	Displays a list of all files that will be loaded based on the current selected file definitions.
Show file content	Displays the content of the selected file.
Show content from first data line	Displays the content from the first data line only for the selected file.

Option

Description

File or directory

This field specifies the location and/or name of the input text file.

Regular expression

Specify the regular expression you want to use to select the files in the directory specified in the previous option. For example, you want to process all files that have a .txt extension. (See below "Selecting file using Regular Expressions")

Exclude Regular expression

Specify a regular expression to exclude filenames within a specified directory.

Selected Files

This table contains a list of selected files (or wildcard selections) along with a property specifying if file is required or not. If a file is required and it isn’t found, an error is generated. Otherwise, the filename is skipped.

Show filenames(s)…

Displays a list of all files that will be loaded based on the current selected file definitions.

Show file content

Displays the content of the selected file.

Show content from first data line

Displays the content from the first data line only for the selected file.

Selecting files using Regular Expressions

The Text File Input transform can search for files by wildcard in the form of a regular expression. Regular expressions are more sophisticated than using '*' and '?' wildcards. Below are a few examples of regular expressions:

Filename	Regular Expression	Files selected
/dirA/	.userdata.\.txt	Find all files in /dirA/ with names containing userdata and ending with .txt
/dirB/	AAA.*	Find all files in /dirB/ with names that start with AAA
/dirC/	[ENG:A-Z][ENG:0-9].*	Find all files in /dirC/ with names that start with a capital and followed by a digit (A0-Z9)

Filename

Regular Expression

Files selected

/dirA/

.userdata.\.txt

Find all files in /dirA/ with names containing userdata and ending with .txt

/dirB/

AAA.*

Find all files in /dirB/ with names that start with AAA

/dirC/

[ENG:A-Z][ENG:0-9].*

Find all files in /dirC/ with names that start with a capital and followed by a digit (A0-Z9)

Accepting filenames from a previous transform

This option allows even more flexibility in combination with other transforms such as "Get Filenames". You can construct your filename and pass it to this transform. This way the filename can come from any source: text file, database table, etc.

Option	Description
Accept filenames from previous transforms	Enables the option to get filenames from previous transforms.
Pass through fields from previous transform	Enable this option to add all previous fields coming into the transform to the transform output. This behaves like a join option.
Transform to read filenames from	Transform from which to read the filenames
Field in the input to use as filename	Text File Input looks in this transform to determine which filenames to use

Option

Description

Accept filenames from previous transforms

Enables the option to get filenames from previous transforms.

Pass through fields from previous transform

Enable this option to add all previous fields coming into the transform to the transform output. This behaves like a join option.

Transform to read filenames from

Transform from which to read the filenames

Field in the input to use as filename

Text File Input looks in this transform to determine which filenames to use

Content Tab

The content tab allows you to specify the format of the text files that are being read. Below is a list of the options associated with this tab:

Option	Description
File type	Can be either CSV or Fixed length. Based on this selection, Hop will launch a different helper GUI when you press the "get fields" button in the last "fields" tab.
Separator	One or more characters that separate the fields in a single line of text. Typically this is ; or a tab. Special characters (e.g. CHAR ASCII HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].
Enclosure	Some fields can be enclosed by a pair of strings to allow separator characters in fields. The enclosure string is optional. If you use repeat an enclosures allow text line 'Not the nine o''clock news.'. With ' the enclosure string, this gets parsed as Not the nine o’clock news. Special characters (e.g. CHAR ASCII HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].
Escape	Specify an escape character (or characters) if you have these types of characters in your data. If you have \ as an escape character, the text 'Not the nine o'clock news' (with ' the enclosure) gets parsed as Not the nine o’clock news. Special characters (e.g. CHAR HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].
Header & number of header lines	Enable if your text file has a header row (first lines in the file); you can specify the number of times the header lines appears. If we mistakenly leave the Header flag set on files that do not have any columns' names in its first row, Hop will set the column’s name the value found on a specific column for its specific position. In case, for that specific position, the column’s value is empty, Hop will set column’s name to EmptyField_<n> where n is the position of the column in the columns' set. NOTE: remember also to perform a check on the guessed data types and column’s specifier that was set after the file’s analysis because they could be wrong due to wrong assumptions made by Hop while looking at the sample dataset.
Wrapped lines and number of wraps	Use if you deal with data lines that have wrapped beyond a specific page limit; note that headers and footers are never considered wrapped
Paged layout and page size and doc header	Use these options as a last resort when dealing with texts meant for printing on a line printer; use the number of document header lines to skip introductory texts and the number of lines per page to position the data lines
Compression	Enable if your text file is placed in a Zip or GZip archive.Note: At the moment, only the first file in the archive is read.
No empty rows	Do not send empty rows to the next transforms.
Include filename in output	Enable if you want the filename to be part of the output
Filename field name	Name of the field that contains the filename
Rownum in output?	Enable if you want the row number to be part of the output
Row number field name	Name of the field that contains the row number
Rownum by file?	Allows the row number to be reset per file
Format	Can be either DOS, UNIX or mixed. UNIX files have lines that are terminated by line feeds. DOS files have lines separated by carriage returns and line feeds. If you specify mixed, no verification is done.
Encoding	Specify the text file encoding to use; leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Hop searches your system for available encodings.
Limit	Sets the number of lines that is read from the file; 0 means read all lines.
Be lenient when parsing dates?	Disable if you want strict parsing of data fields; if case-lenient parsing is enabled, dates like Jan 32nd will become Feb 1st.
The date format Locale	This locale is used to parse dates that have been written in full such as "February 2nd, 2006;" parsing this date on a system running in the French (fr_FR) locale would not work because February is called Février in that locale.
Add filenames to result	Adds the filenames to the internal filename result set. This internal result set can be used later on, e.g. to process all read files.

Option

Description

File type

Can be either CSV or Fixed length. Based on this selection, Hop will launch a different helper GUI when you press the "get fields" button in the last "fields" tab.

Separator

One or more characters that separate the fields in a single line of text. Typically this is ; or a tab. Special characters (e.g. CHAR ASCII HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].

Enclosure

Some fields can be enclosed by a pair of strings to allow separator characters in fields. The enclosure string is optional. If you use repeat an enclosures allow text line 'Not the nine o''clock news.'. With ' the enclosure string, this gets parsed as Not the nine o’clock news. Special characters (e.g. CHAR ASCII HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].

Escape

Specify an escape character (or characters) if you have these types of characters in your data. If you have \ as an escape character, the text 'Not the nine o'clock news' (with ' the enclosure) gets parsed as Not the nine o’clock news. Special characters (e.g. CHAR HEX01) can be set with the format $[value], e.g. $[01] or $[6F,FF,00,1F].

Header & number of header lines

Enable if your text file has a header row (first lines in the file); you can specify the number of times the header lines appears. If we mistakenly leave the Header flag set on files that do not have any columns' names in its first row, Hop will set the column’s name the value found on a specific column for its specific position. In case, for that specific position, the column’s value is empty, Hop will set column’s name to EmptyField_<n> where n is the position of the column in the columns' set. NOTE: remember also to perform a check on the guessed data types and column’s specifier that was set after the file’s analysis because they could be wrong due to wrong assumptions made by Hop while looking at the sample dataset.

Wrapped lines and number of wraps

Use if you deal with data lines that have wrapped beyond a specific page limit; note that headers and footers are never considered wrapped

Paged layout and page size and doc header

Use these options as a last resort when dealing with texts meant for printing on a line printer; use the number of document header lines to skip introductory texts and the number of lines per page to position the data lines

Compression

Enable if your text file is placed in a Zip or GZip archive.Note: At the moment, only the first file in the archive is read.

No empty rows

Do not send empty rows to the next transforms.

Include filename in output

Enable if you want the filename to be part of the output

Filename field name

Name of the field that contains the filename

Rownum in output?

Enable if you want the row number to be part of the output

Row number field name

Name of the field that contains the row number

Rownum by file?

Allows the row number to be reset per file

Format

Can be either DOS, UNIX or mixed. UNIX files have lines that are terminated by line feeds. DOS files have lines separated by carriage returns and line feeds. If you specify mixed, no verification is done.

Encoding

Specify the text file encoding to use; leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Hop searches your system for available encodings.

Limit

Sets the number of lines that is read from the file; 0 means read all lines.

Be lenient when parsing dates?

Disable if you want strict parsing of data fields; if case-lenient parsing is enabled, dates like Jan 32nd will become Feb 1st.

The date format Locale

This locale is used to parse dates that have been written in full such as "February 2nd, 2006;" parsing this date on a system running in the French (fr_FR) locale would not work because February is called Février in that locale.

Add filenames to result

Adds the filenames to the internal filename result set. This internal result set can be used later on, e.g. to process all read files.

Error Handling Tab

The error handling tab allows you to specify how the transform reacts when errors occur. The table below describes the options available for Error handling:

Option	Description
Ignore errors?	Enable if you want to ignore errors during parsing
Skip error lines	Enable if you want to skip those lines that contain errors. You can generate an extra file that contains the line numbers on which the errors occurred. Lines with errors are not skipped, the fields that have parsing errors, will be empty (null)
Error count field name	Add a field to the output stream rows; this field contains the number of errors on the line
Error fields field name	Add a field to the output stream rows; this field contains the field names on which an error occurred
Error text field name	Add a field to the output stream rows; this field contains the descriptions of the parsing errors that have occurred
Warnings file directory	When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension>
Error files directory	When errors occur related to non-existing or non-accessible files, they are placed in this directory. The name of the file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>
Failing line numbers files directory	When a parsing error occurs on a line, the line number is placed in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension>

Option

Description

Ignore errors?

Enable if you want to ignore errors during parsing

Skip error lines

Enable if you want to skip those lines that contain errors. You can generate an extra file that contains the line numbers on which the errors occurred. Lines with errors are not skipped, the fields that have parsing errors, will be empty (null)

Error count field name

Add a field to the output stream rows; this field contains the number of errors on the line

Error fields field name

Add a field to the output stream rows; this field contains the field names on which an error occurred

Error text field name

Add a field to the output stream rows; this field contains the descriptions of the parsing errors that have occurred

Warnings file directory

When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension>

Error files directory

When errors occur related to non-existing or non-accessible files, they are placed in this directory. The name of the file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>

Failing line numbers files directory

When a parsing error occurs on a line, the line number is placed in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension>

Filters Tab

The filters tab provides you with the ability to specify the lines you want to skip in the text file. The table below describes the available options for defining filters:

Option	Description
Filter string	The string for which to search
Filter position	The position where the filter string has to be at in the line. Zero (0) is the first position in the line. If you specify a value below zero (0) here, the filter string is searched for in the entire string.
Stop on filter	Specify Y here if you want to stop processing the current text file when the filter string is encountered.
Positive match	Specify Y here if you want to process lines that match the filter, or N if you want to ignore such lines.

Option

Description

Filter string

The string for which to search

Filter position

The position where the filter string has to be at in the line. Zero (0) is the first position in the line. If you specify a value below zero (0) here, the filter string is searched for in the entire string.

Stop on filter

Specify Y here if you want to stop processing the current text file when the filter string is encountered.

Positive match

Specify Y here if you want to process lines that match the filter, or N if you want to ignore such lines.

Fields Tab

The fields tab allows you to specify the information about the name and format of the fields being read from the text file. Available options include:

Option	Description
Name	Name of the field
Type	Type of the field can be either String, Date or Number
Format	See Number Formats for a complete description of format symbols.
Position	This is needed when processing the 'Fixed' filetype. It is zero based, so the first character is starting with position 0.
Length	For Number: Total number of significant figures in a number; For String: total length of string; For Date: length of printed output of the string (e.g. 4 only gives back the year).
Precision	For Number: Number of floating point digits; For String, Date, Boolean: unused;
Currency	Used to interpret numbers like $10,000.00 or E5.000,00
Decimal	A decimal point can be a "." (10;000.00) or "," (5.000,00)
Grouping	A grouping can be a dot "," (10;000.00) or "." (5.000,00)
Null if	Treat this value as NULL
Default	Default value in case the field in the text file was not specified (empty)
Trim	type trim this field (left, right, both) before processing
Repeat	If the corresponding value in this row is empty, repeat the one from the last row when it was not empty.

Option

Description

Name

Name of the field

Type

Type of the field can be either String, Date or Number

Format

See Number Formats for a complete description of format symbols.

Position

This is needed when processing the 'Fixed' filetype. It is zero based, so the first character is starting with position 0.

Length

For Number: Total number of significant figures in a number; For String: total length of string; For Date: length of printed output of the string (e.g. 4 only gives back the year).

Precision

For Number: Number of floating point digits; For String, Date, Boolean: unused;

Currency

Used to interpret numbers like $10,000.00 or E5.000,00

Decimal

A decimal point can be a "." (10;000.00) or "," (5.000,00)

Grouping

A grouping can be a dot "," (10;000.00) or "." (5.000,00)

Null if

Treat this value as NULL

Default

Default value in case the field in the text file was not specified (empty)

Trim

type trim this field (left, right, both) before processing

Repeat

If the corresponding value in this row is empty, repeat the one from the last row when it was not empty.

Number Formats

The information below on Number formats was taken from the Sun Java API documentation, located at http://java.sun.com/j2se/1.4.2/docs/api/java/text/DecimalFormat.html. For further information on valid numeric formats used in this transform, view the Number Formatting Table.

Symbol	Location	Localized	Meaning
0	Number	Yes	Digit
#	Number	Yes	Digit, zero shows as absent
.	Number	Yes	Decimal separator or monetary decimal separator
-	Number	Yes	Minus sign
,	Number	Yes	Grouping separator
E	Number	Yes	Separates mantissa and exponent in scientific notation; need not be quoted in prefix or suffix
;	Sub pattern boundary	Yes	Separates positive and negative sub patterns
%	Prefix or suffix	Yes	Multiply by 100 and show as percentage
\u2030	Prefix or suffix	Yes	Multiply by 1000 and show as per mille
€ (\u00A4)	Prefix or suffix	No	Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
'	Prefix or suffix	No	Used to quote special characters in a prefix or suffix, for example, "''" formats 123 to "123". To create a single quote itself, use two in a row: " o''clock".

Symbol

Location

Localized

Meaning

Number

Yes

Digit

Number

Yes

Digit, zero shows as absent

Number

Yes

Decimal separator or monetary decimal separator

Number

Yes

Minus sign

Number

Yes

Grouping separator

Number

Yes

Separates mantissa and exponent in scientific notation; need not be quoted in prefix or suffix

;

Sub pattern boundary

Yes

Separates positive and negative sub patterns

Prefix or suffix

Yes

Multiply by 100 and show as percentage

\u2030

Prefix or suffix

Yes

Multiply by 1000 and show as per mille

€ (\u00A4)

Prefix or suffix

Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.

Prefix or suffix

Used to quote special characters in a prefix or suffix, for example, "''" formats 123 to "123". To create a single quote itself, use two in a row: " o''clock".

Scientific Notation

In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation (for example, "0.###E0" formats the number 1234 as "1.234E3".

Date formats

The information on Date formats was taken from the Sun Java API documentation, located at:

http://java.sun.com/j2se/1.4.2/docs/api/java/text/SimpleDateFormat.html. For further information on valid date formats used in this transform, view the Date Formatting Table.

Letter	Date or Time Component	Presentation	Examples
M	Month in year	Month	July; Jul; 07
w	Week in year	Number	27
W	Week in month	Number	2
D	Day in year	Number	189
d	Day in month	Number	10
F	Day of week in month	Number	2
E	Day in week	Text	Tuesday; Tue
a	Am/pm marker	Text	PM
H	Hour in day (0-23)	Number 0
k	Hour in day (1-24)	Number 24
K	Hour in am/pm (0-11)	Number 0
h	Hour in am/pm (1-12)	Number 12
m	Minute in hour	Number 30
s	Second in minute	Number 55
S	Millisecond	Number 978
z	Time zone	General time zone	Pacific Standard Time; PST; GMT-08:00
Z	Time zone	RFC 822 time zone	-0800

Letter

Date or Time Component

Presentation

Examples

Month in year

Month

July; Jul; 07

Week in year

Number

Week in month

Number

Day in year

Number

189

Day in month

Number

Day of week in month

Number

Day in week

Text

Tuesday; Tue

Am/pm marker

Text

Hour in day (0-23)

Number 0

Hour in day (1-24)

Number 24

Hour in am/pm (0-11)

Number 0

Hour in am/pm (1-12)

Number 12

Minute in hour

Number 30

Second in minute

Number 55

Millisecond

Number 978

Time zone

General time zone

Pacific Standard Time; PST; GMT-08:00

Time zone

RFC 822 time zone

-0800

Additional Output Fields Tab

Option	Description
Short filename field	The field name that contains the filename without path information but with an extension.
Extension field	The field name that contains the extension of the filename.
Path field	The field name that contains the path in operating system format.
Size field	The field name that contains the size of the field.
Is hidden field	The field name that contains if the file is hidden or not (boolean).
Uri field	The field name that contains the URI.
Root uri field	The field name that contains only the root part of the URI.

Option

Description

Short filename field

The field name that contains the filename without path information but with an extension.

Extension field

The field name that contains the extension of the filename.

Path field

The field name that contains the path in operating system format.

Size field

The field name that contains the size of the field.

Is hidden field

The field name that contains if the file is hidden or not (boolean).

Uri field

The field name that contains the URI.

Root uri field

The field name that contains only the root part of the URI.

Buttons

Function/Button	Description
Show filenames	Displays a list of all the files selected. Note that if the pipeline is to be run on a separate server, the result might be incorrect.
Show file content	Displays the first lines of the text-file. Make sure that the file-format is correct. When in doubt, try both DOS and UNIX formats.
Show content from first data line	Helps you position the data lines in complex text files with multiple header lines and more.
Get fields	Allows you to guess the layout of the file. In case of a CSV file, this is performed almost automatically. When you select a file with fixed length fields, you must specify the field boundaries using a wizard.
Preview rows	Preview the rows generated by this transform.

Function/Button

Description

Show filenames

Displays a list of all the files selected. Note that if the pipeline is to be run on a separate server, the result might be incorrect.

Show file content

Displays the first lines of the text-file. Make sure that the file-format is correct. When in doubt, try both DOS and UNIX formats.

Show content from first data line

Helps you position the data lines in complex text files with multiple header lines and more.

Get fields

Allows you to guess the layout of the file. In case of a CSV file, this is performed almost automatically. When you select a file with fixed length fields, you must specify the field boundaries using a wizard.

Preview rows

Preview the rows generated by this transform.

Metadata Injection Support

All fields of this transform support metadata injection. You can use this transform with ETL Metadata Injection to pass metadata to your pipeline at runtime.