Get Data From XML transform Icon Get Data From XML

Description

The Get Data From XML transform provides the ability to read data from any type of XML file using XPath specifications.

Get Data From XML can read data from 3 kind of sources (files, stream and url) in 2 modes (static or dynamic).

See also:

Samples (Samples project):

  • transforms/get-data-from-xml.hpl

Supported Engines

Hop Engine

Supported

Spark

Maybe Supported

Flink

Maybe Supported

Dataflow

Maybe Supported

Options

Files Tab

The files tab is where you define the location of the XML files from which you want to read. The table below contains options associated with the Files tab.

Option Description

Transform name

Name of the transform. This must be unique within the pipeline.

XML Source from field

XML source is defined in a field

The XML data to use is present ina field in the input stream.

XML source is a filename

The XML data is available in and read from a file.

Read source as URL

The XML data is read from the location specified by a URL.

Get XML source from a field

Specify the field to that contains the XML data, filename or URL from.

File or directory

Specifies the location and/or name of the input text file.

Add

Click to add the file/directory/wildcard combination to the list of selected files (grid) below.

Browse

Click to browse to the file’s location.

Regular expression

Specifies a regular expression used to select multiple files in the directory specified in the previous option.

Selected Files

Contains a list of selected files (or wildcard selections) and a property specifying if file is required or not. If a file is required and it is not found, an error is generated;otherwise, the file name is skipped.

Delete

Click to remove the selected file in the table.

Edit

Click to modify the selected file in the table.

Show filename(s)…​

Displays a list of all files that will be loaded based on the current selected file definitions

Content Tab

Option Description

Settings

Loop XPath

For every matching entry in the XML file(s) or data, one row of data is added to the output. This is the main specification used to flatten the XML file(s). You can use the "Get XPath nodes" button to search for the possible repeating nodes in the XML document, however, if the XML document is large, this can take a while.

Encoding

The XML filename encoding, in case none is specified in the XML documents.

Namespace aware?

Enable if the XML document requires namespaces to be considered while parsing.

Ignore comments?

Enable to ignore all comments in the XML document while parsing.

Validate XML?

Enable to validate the XML prior to parsing. Use a token when you want to replace dynamically in a Xpath field value. A token is between @_ and - (@_fieldname-). Please see the Example 1 to see how it works.

Use token

Enable to use a token for validating the XML.

Ignore empty file

Enable to ignore any files with no content. These are not valid XML documents.

Do not raise an error if no files

Enable to do nothing if no files are found. Otherwise, an error is returned.

Limit

Specify a maximum number of rows to return. Zero (0) returns all rows.

Prune path to handle large files

Specifies a path, similar to the Loop XPath, used to process chunks of data from the XML file. Each matching value defines a chunk of data that is read and processed. Use the prune path to speed up processing of large files. You can also use this parameter to avoid multiple HTTP URL requests. You can also do this using the XML Input Stream (StAX) transform.

Additional fields

Include filename in output?

Allows you to specify a field name to include the file name (String) in the output of this transform.

Filename fieldname

The field to read the file name value from.

Rownum in output?

Allows you to specify a field name to include the row number (Integer) in the output of this transform.

Rownum fieldname

The field to read the row number value from.

+2

Add to result filename

Add files to result filename

Fields Tab

Option Description

Name

The name of the output field

XPath

The path to the element node or attribute to read

Element

The element type to read: Node or Attribute

Result Type

Either "Value of" or "Single node"

  • Value of: retrieve the value of your XPath expression, e.g. the contents of an element or the value of an attribute.

  • Single node: retrieve the XML contents returned by an XPath expression. Contrary to "Value of", the result of "Single node" is an XML snippet.

Type

The data type to convert to

Format

The format or conversion mask to use in the data type conversion

Length

The maximum number of characters in the the output data value

Precision

The number of decimal places used to display numbers in the output data

Currency

The currency symbol to use for monetary values.

Decimal

The numeric decimal symbol to use for floating-point numbers.

Group

The numeric grouping symbol to use for separating thousands in the data

Trim type

How whitespace characters are removed from values, either from the left (trims leading spaces), right (trims trailing spaces), both (trims all whitespace), or none (no trimming is done)

Repeat

Repeat the column value of the previous row if the column value is empty (null)

Get fields

Click to populate the table with fields from the input stream.

Select fields from snippet

Click to populate the table with fields corresponding to a Loop Xpath and an XML document that must be provided in the popup dialog.

===Additional output fields tab

Option Description

Short filename field

The field used to store the file name, without the path or file extension.

Extension field

The field used to store the file extension.

Path field

The field used to stare the path to the file.

Size field

The field used to store the file size.

Is hidden field

The field used to specify whether the file is hidden.

Last modification field

The field used to store the date the file was last modified.

Uri field

The field used to store the XML document’s source URL.

Root uri field

The field used to store the XML document’s namespace URL, taken from the root element