XML Input Stream (StAX)

Description

The XML Input Stream (StAX) transform reads data from XML files using the Streaming API for the XML (StAX) parser.

This transform is optimal for quickly processing large and complex data structures.

Unlike the Get Data from XML transform, which uses in-memory processing and can require the purging of parts of files, the XML Input Stream (StAX) transform moves the processing logic into the pipeline.

The transform itself provides the raw XML data stream together with additional processing information.

This streaming transform is recommended when you have limitations with other transforms or need to parse XML when:

  • You need fast data loads which are independent of the memory regardless of the file size.

  • You need flexibility in reading various parts of the XML file in different ways, and do not want to repeatedly parse the file.

Because the processing logic of some XML files can be complex, you should have a good knowledge of the existing Hop transforms when using this transform.

Options

Option Description

Transform name

Name of the transform.

Filename

File name of the input XML file. Specify your file name by entering its path or clicking Browse. If you connect to a transform that precedes the XML Input Stream transform, the Browse button is hidden, and the text box becomes a drop-down menu that is populated with the fields from the preceding transform. Select a value from the drop-down menu to use as the path to an XML file. You can use internal variables to specify the path.

Source is from a previous transform

Accept data from a field in a previous transform.

Source field name

Selects a field from the previous transform to use as XML data.

Add filename to result?

Adds the processed XML filename to the result of this pipeline by passing the filename of the XML input file as a value on each result row. You can then use it in subsequent transforms where you want to use the filename as a value.

Skip (Elements/Attributes)

Number of elements or attributes that should be skipped. Use this field for starting the processing at a specific location in a file. The file will still be loaded by the parser, but the rows will not be produced.

Limit (Elements/Attributes)

Limits the number of elements or attributes to process. With the Skip and Limit properties, you can enable chunk loading that is defined in an outer loop.

Default String Length

The default string length for the XML data name and value fields.

Encoding

Encodes the XML file data in the specified encoding.

Add Namespace information?

Adds the XML data type NAMESPACE to the stream. You can add an optional prefix (defined in the XML data name) and URI information (defined in the XML data value). This option adds a defined prefix in the ELEMENT data type to the XML data name, for example, prefix:product. Due to the extra namespace handling, this option slows down the processing throughput.

Trim strings?

Trims all name/value elements and attributes. It eliminates white spaces, tabs, carriage returns, and line feed characters at the beginning and end of the string.

Include filename in output? / Fieldname

Adds the processed file name to the specified field name.

Row number in output? / Fieldname

Adds the processed row number (starting with 1) to the specified field name.

XML data type (numeric) in output? / Fieldname

Adds the processed data type in numeric format to the specified field name.

The following data types are defined:

  • 0 - "UNKNOWN" (Reserved)

  • 1 - "START_ELEMENT"

  • 2 - "END_ELEMENT"

  • 3 - "PROCESSING_INSTRUCTION" (Reserved)

  • 4 - "CHARACTERS"

  • 5 - "COMMENT" (Reserved)

  • 6 - "SPACE" (Reserved)

  • 7 - "START_DOCUMENT"

  • 8 - "END_DOCUMENT"

  • 9 - "ENTITY_REFERENCE" (Reserved)

  • 10-"ATTRIBUTE"

  • 11-"DTD" (Reserved)

  • 12-"CDATA" (Reserved)

  • 13-"NAMESPACE" (When namespace information is selected)

  • 14-"NOTATION_DECLARATION" (Reserved)

  • 15-"ENTITY_DECLARATION" (Reserved).

XML data type (description) in output? / Fieldname

Adds the processed data type in text format to the specified field name. This option should be used instead of the numeric data type for better readability of the pipeline. See the XML data type (numeric) description above for a list of values.

Because this option can cause slower processing of strings and extra memory consumption, it is recommended to use the numeric data type format for big data loads

XML location line in output? / Fieldname

Adds the processed source XML location line to the specified field name.

XML location column in output? / Fieldname

Adds the processed source XML location column to the specified field name.

XML element ID in output? / Fieldname

Adds the processed element number (starting with '0') to the specified field name. In contrast to adding the Row number, this field number is incremented by the count of each new element and not the row number. This numbering ensures that the nesting between levels is correct.

XML parent element ID in output? / Fieldname

Adds the parent element number to the specified field name. When you use the XML element ID with the XML parent element ID, a complete XML element tree is available for later usage.

XML element level in output? / Fieldname

Adds the processed element level to the specified field name, starting with '0' for the root START_ and END_DOCUMENT.

XML path in output? / Fieldname

Adds the processed XML path to the specified field name.

XML parent path in output? / Fieldname

Adds the processed XML parent path to the specified field name.

XML data name in output? / Fieldname

Adds the processed data name of elements, attributes, and optional namespace prefixes to the specified field name.

XML data value in output? / Fieldname

Adds the processed data value of elements, attributes and optional namespace URIs to the specified field name.