Microsoft Excel input transform Icon Microsoft Excel input

Description

The Microsoft Excel Input transform reads data from Microsoft Excel spreadsheet.

The default spreadsheet type (engine) is set to Excel XLSX, XLS.

When you read other file types like OpenOffice ODS and using special functions like protected worksheets, you need to change the Spread sheet type (engine) in the Content tab accordingly.

Supported Engines

Hop Engine

Supported

Spark

Supported

Flink

Supported

Dataflow

Supported

Options

Files Tab

Option Description

Transform Name

Name of the transform; the name has to be unique in a single transform.

Spread sheet type (engine)

This field allows you to specify the spreadsheet type. Currently the following are supported:

  • Excel XLSX, XLS: This is the default, if you select this spread sheet type you can read all known Excel file types. Functionality provided by the Apache POI project.

  • Excel XLSX (Streaming): This spread sheet type allows to read in large Excel files.

  • Open Office ODS: By selecting this type you can read OpenOffice spreadsheet using the ODFDOM engine.

File or directory

Specifies the location and/or name of the input text file. Note: Click Add to add the file/directory/wildcard combination to the list of selected files (grid) below.

Regular expression

Specify the regular expression you want to use to select the files in the directory specified in the previous option. For example, you want to process all files that have a .txt extension. (For further details see the Text File Input transform and the section "Selecting file using Regular Expressions")

Exclude Regular Expression

Excludes all files (in a given location) that meet the criteria specified by this regular expression.

Selected Files

Contains a list of selected files (or wildcard selections) and a property specifying if file is required or not. If a file is required and it is not found, an error is generated;otherwise, the file name is skipped.

Accept filenames from previous transforms

Allows you to read in file names from a previous transform in the transform. You must also specify which transform you are importing from, and the input field in that transform from which you will retrieve the filename data

Show filenames(s)…​

Displays a list of all files that will be loaded based on the current selected file definitions

Preview rows

Click Preview to examine the contents of the specified Excel file

Sheets

In this tab you can specify the names of the sheets in the Excel workbook to read. For each of the sheet names you can specify the row and column to start at.

Content

Option

Description

Header

Enable if the sheets specified contain a header row to skip

No empty rows

Enable if you don’t want empty rows in the output of this transform

Stop on empty row

Makes the transform stop reading the current sheet of a file when a empty line is encountered

Limit

Limits the number of rows to this number (zero (0) means all rows).

Encoding

Specifies the text file encoding to use. Leave blank to use the default encoding on your system. To use Unicode, specify UTF-8 or UTF-16. On first use, Spoon searches your system for available encodings.)

Error handling

Option

Description

Strict types?

If checked, Hop will report data type errors in the input.

Ignore errors?

Enable if you want to ignore errors during parsing

Skip error lines?

If checked, Hop will skip lines that contain errors. These lines can be dumped to a separate file by specifying a path in the Failing line numbers files directory field below. If this is not checked, lines with errors will appear as NULL values in the output.

Warnings file directory

When warnings are generated, they are placed in this directory. The name of that file is <warning dir>/filename.<date_time>.<warning extension>

Error files directory

When errors occur, they are placed in this directory. The name of that file is <errorfile_dir>/filename.<date_time>.<errorfile_extension>

Failing line numbers files directory

When a parsing error occurs on a line, the line number is placed in this directory. The name of that file is <errorline dir>/filename.<date_time>.<errorline extension>

Fields tab

The fields tab is for specifying the fields that must be read from the Excel files. You can choose to use a Schema Definition or to define the required fields' layout manually.

Use Get fields from header row to fill in the available fields if the sheets have a header row automatically.

The Type column performs type conversions for a given field. For example, if you want to read a date and you have a String value in the Excel file, specify the conversion mask. Note: In the case of Number to Date conversion (for example, 20051028-→ October 28th, 2005) specify the conversion mask yyyyMMdd because there will be an implicit Number to String conversion taking place before doing the String to Date conversion.

Option

Description

Schema Definition

Name of the Schema Definition that we want to reference.

Name

The name of the field.

Type

The field’s data type; String, Date or Number.

Length

The length option depends on the field type. Number: total number of significant figures in a number; String: total length of a string; Date: determines how much of the date string is printed or recorded.

Precision

The precision option depends on the field type, but only Number is supported; it returns the number of floating point digits.

Trim type

Truncates the field (left, right, both) before processing. Useful for fields that have no static length.

Repeat

If set to Y, will repeat this value if the field in the next row is empty.

Format

The format mask (number type). See Text File Input transform and the section "Number Formats" for a complete description of format symbols.

Currency

Symbol used to represent currencies.

Decimal

A decimal point; this is either a dot or a comma.

Grouping

A method of separating units of thousands in numbers of four digits or larger. This is either a dot or a comma.

Additional output fields tab

This tab retrieves custom metadata fields to add to the transform’s output. The purpose of each field is defined in its name, but you can use these fields for whatever you want. Each item defines an output field that will contain the following information. Some of these are missing.

Option

Description

Full filename field

The full file name plus the extension.

Sheetname field

The worksheet name you’re using.

Sheet row nr field

The current sheet row number.

Row nr written field

Number of rows written

Short filename field

The field name that contains the filename without path information but with an extension.

Extension field

The field name that contains the extension of the filename.

Path field

The field name that contains the path in operating system format.

Size field

The field name that contains the size of the file, in bytes.

Is hidden field

The field name that contains if the file is hidden or not (boolean).

Uri field

The field name that contains the URI.

Root uri field

The field name that contains only the root part of the URI.