Regex Evaluation

Description

The Regex Evaluation transform matches the strings of an input field against a text pattern you define with a regular expression (regex).

This transform uses the java.util.regex package.

The syntax for creating the regular expressions used by this transform is defined in the java.util.regex.Pattern javadoc.

You can use this transform to parse a complex string of text and create new fields out of the input field with capture groups (defined by parentheses).

Supported Engines

Hop Engine

Supported

Spark

Maybe Supported

Flink

Maybe Supported

Dataflow

Maybe Supported

Usage

Pattern matching

The primary usage for this transform is to check if an input field matches the given pattern: a boolean field is added to the stream, indicating whether there is match or not.

The pattern is intended to match the entire input field, not just a part of it. For example, given the input:

"Author, Ann" - 53 posts

a regular expression like \d* posts would give no match, even if a part of the input (53 posts) indeed matches with the pattern. To get an actual match, you need to add .* in the pattern:

.*\d* posts

Capturing text

This transform can also capture parts of the input and store them in new fields of the stream: to do so, just add the usual grouping operator (simple parentheses) in your regular expression.

With the same input text as above, create a regular expression with two capture groups:

^"([^"]*)" - (\d*) posts$

The transform will capture the values Author, Ann and 53, so you can create two new fields in your pipeline (i.e. one for the name, and one for the number of posts).

Options

General

Option	Description
Transform name	Name of the transform.

Option

Description

Transform name

Name of the transform.

Capture Group Fields

Option	Description
New field	Name of the new field generated from the regular expression.
Type	Type of data.
Length	Length of the field.
Precision	Number of floating point digits for number-type fields.
Format	An optional mask for converting the format of the original field. See Common Formats for information on common valid date and numeric formats you can use in this transform.
Group	A grouping can be a "," (10,000.00 for example) or "." (5.000,00 for example)
Decimal	The character used as a decimal point.
Currency	Currency symbol ($ or € for example)
Null If	Treat this value as null.
Default	Default value when the field in the incoming file is not specified (empty).
Trim	The trim method to apply to a string.

Option

Description

New field

Name of the new field generated from the regular expression.

Type

Type of data.

Length

Length of the field.

Precision

Number of floating point digits for number-type fields.

Format

An optional mask for converting the format of the original field. See Common Formats for information on common valid date and numeric formats you can use in this transform.

Group

A grouping can be a "," (10,000.00 for example) or "." (5.000,00 for example)

Decimal

The character used as a decimal point.

Currency

Currency symbol ($ or € for example)

Null If

Treat this value as null.

Default

Default value when the field in the incoming file is not specified (empty).

Trim

The trim method to apply to a string.

Settings Tab

The Settings tab contains the following options:

Option	Description
Field to evaluate	Specify the name of the field from the incoming Hop stream to be matched against the regular expression.
Result field name	Specify the name of the output field. This field is added to the outgoing Hop stream and has a value of Y to indicate the value of the input field matched the regular expression or N to indicate it did not match.
Create fields for capture groups	Select to create new fields based on capture groups, in the regular expression. When this option is selected, substrings in the captured groups are extracted and stored in new output fields, that you specify in the Capture Group Fields table. Each capture group must have a field defined in the Capture Group Fields table. The order of the fields in the table must be the same as the order of the capturing groups in the regular expression. You can change the data type using the columns in the table.
Replace previous fields	Select to replace fields from the incoming Hop stream with fields created for the capture group field names, if the fields have the same name. If this option is clear, new fields are added to the outgoing Hop stream for each capturing group field. This option is available when you select the Create fields for capture groups option.
Regular expression	Specify your regular expression. Click Test regEx to open the Regular expression evaluation window.
Use variable substitution	Select to expand variable references to their values before evaluating the regular expression pattern.

Option

Description

Field to evaluate

Specify the name of the field from the incoming Hop stream to be matched against the regular expression.

Result field name

Specify the name of the output field. This field is added to the outgoing Hop stream and has a value of Y to indicate the value of the input field matched the regular expression or N to indicate it did not match.

Create fields for capture groups

Select to create new fields based on capture groups, in the regular expression. When this option is selected, substrings in the captured groups are extracted and stored in new output fields, that you specify in the Capture Group Fields table. Each capture group must have a field defined in the Capture Group Fields table. The order of the fields in the table must be the same as the order of the capturing groups in the regular expression. You can change the data type using the columns in the table.

Replace previous fields

Select to replace fields from the incoming Hop stream with fields created for the capture group field names, if the fields have the same name. If this option is clear, new fields are added to the outgoing Hop stream for each capturing group field. This option is available when you select the Create fields for capture groups option.

Regular expression

Specify your regular expression. Click Test regEx to open the Regular expression evaluation window.

Use variable substitution

Select to expand variable references to their values before evaluating the regular expression pattern.

Test regEx

You can test your regular expression against three different input strings using the following Regular expression evaluation window. If your expression contains a group field, type a string in the Compare section and the option below the string will be split according to your group(s).

Option	Description
Please enter a new regular expression or modify	Specify your regular expression.
Values to test	Specify the values (Value1, Value2, or Value3) to test your string. The background will turn green if that value is a match against your expression or red if it does not.
Capture from value	Displays the value of the captured string.
Captured fields	Displays the value of the captured groups.

Option

Description

Please enter a new regular expression or modify

Specify your regular expression.

Values to test

Specify the values (Value1, Value2, or Value3) to test your string. The background will turn green if that value is a match against your expression or red if it does not.

Capture from value

Displays the value of the captured string.

Captured fields

Displays the value of the captured groups.

Content Tab

The Content tab contains the following options:

Option Description

Option	Description
Ignore differences in Unicode encodings	Select to ignore different Unicode character encodings. This action may improve performance, but your data can only contain US ASCII characters.
Enables case-insensitive matching	Select to use case-insensitive matching. Only characters in the US-ASCII charset are matched. Unicode-aware case-insensitive matching can be enabled by specifying the 'Unicode-aware case…' flag in conjunction with this flag. The execution flag is `(?i)`.
Permit whitespace and comments in pattern	Select to ignore whitespace and embedded comments starting with `#` through the end of the line. In this mode, you must use the `\s` token to match whitespace. If this option is not enabled, whitespace characters appearing in the regular expression are matched as-is. The execution flag is `(?x)`.
Enable dotall mode	Select to include line terminators with the dot character expression match. The execution flag is `(?s)`.
Enable multiline mode	Select to match the start of a line `^` or the end of a line `$` of the input sequence. By default, these expressions only match at the beginning and the end of the entire input sequence. The execution flag is `(?m)`.
Enable Unicode-aware case folding	Select this option in conjunction with the Enables case-insensitive matching option to perform case-insensitive matching consistent with the Unicode standard. The execution flag is `(?u)`.
Enables Unix lines mode	Select to only recognize the line terminator in the behavior of `.`, `^`, and `$`. The execution flag is `(?d)`.

Ignore differences in Unicode encodings

Select to ignore different Unicode character encodings. This action may improve performance, but your data can only contain US ASCII characters.

Enables case-insensitive matching

Select to use case-insensitive matching. Only characters in the US-ASCII charset are matched. Unicode-aware case-insensitive matching can be enabled by specifying the 'Unicode-aware case…' flag in conjunction with this flag.

The execution flag is (?i).

Permit whitespace and comments in pattern

Select to ignore whitespace and embedded comments starting with # through the end of the line. In this mode, you must use the \s token to match whitespace. If this option is not enabled, whitespace characters appearing in the regular expression are matched as-is.

The execution flag is (?x).

Enable dotall mode

Select to include line terminators with the dot character expression match.

The execution flag is (?s).

Enable multiline mode

Select to match the start of a line ^ or the end of a line $ of the input sequence. By default, these expressions only match at the beginning and the end of the entire input sequence.

The execution flag is (?m).

Enable Unicode-aware case folding

Select this option in conjunction with the Enables case-insensitive matching option to perform case-insensitive matching consistent with the Unicode standard.

The execution flag is (?u).

Enables Unix lines mode

Select to only recognize the line terminator in the behavior of ., ^, and $.

The execution flag is (?d).

Examples

Sub-text matching

As mentioned earlier, the pattern is intended to match the entire input field, i.e. when the supplied input is the pattern.

If you just need to test if your input contains the pattern, you need to tweak your regular expression so that it matches the entire input field. You should also include the grouping operators (parentheses) to get the sub-text you intended to match, for example:

Input data: THIS IS A TITLE <PROCESSING_TAG>
RegEx 1: <.*> → returns no match, because the pattern doesn’t match the entire input
RegEx 2: .(<.>) → returns a match and you can capture the value <PROCESSING_TAG> with the grouping operators

As a consequence, you can consider the line delimiting operators ^ and $ as implied in your regular expression: the examples above are equivalent to ^<.>$ and ^.(<.*>)$ respectively.

Nested capture groups

Suppose your input field contains a text value like "Author, Ann" - 53 posts.

The following regular expression creates four capturing groups and can be used to parse out the different parts:

^"(([^"]+), ([^"])+)" - (\d+) posts\.$

This expression creates the following four capturing groups, which become output fields:

Field name RegEx segment Value

Field name	RegEx segment	Value
Fullname	`[^"]+), ([^"]+`	`Author, Ann`
Lastname	`([^"]+)` (first occurrence)	`Author`
Firstname	`([^"]+)` (second occurrence)	`Ann`
Number of posts	`(\d+)`	`53`

Fullname

[^"]+), ([^"]+

Author, Ann

Lastname

([^"]+) (first occurrence)

Author

Firstname

([^"]+) (second occurrence)

Ann

Number of posts

(\d+)

53

In this example, a field definition must be present for each of these capturing groups.

If the number of capture groups in the regular expression does not match the number of fields specified, the transform will fail and an error is written to the log.