RX Split Node


Overview
Settings
Advanced Settings
Examples
Scripting

Overview

Regular expressions are special text strings which are used to describe particular character patterns. The RX Split node uses a regular expression to split a string into separate components, each of which is added to a new output field. The generated output fields are all strings. Unlike the RX Groups node, which extracts components with different patterns, the RX Split node defines a delimiter that separates the values of interest.

The node uses the ICU Regular Expressions package. Full details of the supported syntax can be found in the ICU Regular Expressions documentation.

Settings

Match field

This is used to select the string field containing the text that should be split by the Pattern.

Prefix match field to field names

This specifies how the new field names should be generated:

  • when checked (the default setting), the new field names are generated by joining the name of the Match field to the Output suffix value followed by the appropriate split index up to the value defined by Max splits
  • when unchecked, the new field names are just the Output suffix value followed by the appropriate split index up to the value defined by Max splits
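The naming rule described above can be sketched in a few lines of Python. This is only an illustration of the rule, not the node's implementation: the function name output_field_names and its parameters are hypothetical, and the split index is assumed to start at 1.

```python
def output_field_names(match_field, output_suffix, max_splits, prefix_match_field=True):
    """Return the list of output field names (hypothetical helper)."""
    # When the option is checked, the Match field name comes first.
    base = match_field + output_suffix if prefix_match_field else output_suffix
    # Split indexes are assumed to run from 1 to max_splits inclusive.
    return [base + str(i) for i in range(1, max_splits + 1)]


print(output_field_names("IPv4Address", "_SPLIT", 3))
print(output_field_names("IPv4Address", "IP_", 3, prefix_match_field=False))
```

With the option unchecked, the Output suffix value acts as the whole base name, so a value such as IP_ reads more naturally than _SPLIT.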

Pattern

This defines the regular expression that will be matched against the content of the Match field. Common regular expression components can be viewed and added by using the context menu in the Pattern text area.

Regular Expression Options...

These are described in Advanced Settings below.

Output suffix

This defines either the suffix which will be appended to the Match field or the full name of the new field, depending on the setting of Prefix match field to field names. In both cases, an index value will also be appended to the end of the field name.

Max. splits

This defines the number of output fields that will be generated. For example:

Match field: IPv4Address

Output suffix: _SPLIT

Max. splits: 3

Output fields generated: IPv4Address_SPLIT1, IPv4Address_SPLIT2, IPv4Address_SPLIT3

If the input string is split into fewer components than there are output fields then the extra fields will contain $null$. If the input string is split into more components than there are output fields then any additional split values will be ignored.
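The padding and truncation behaviour can be approximated as follows. This is a minimal sketch that uses Python's re module as a stand-in for the ICU engine and None as a stand-in for $null$; the helper name split_to_fields is hypothetical.

```python
import re


def split_to_fields(text, pattern, max_splits):
    """Split text on pattern and fit the result to max_splits values."""
    parts = re.split(pattern, text)
    # Ignore any components beyond the number of output fields.
    parts = parts[:max_splits]
    # Pad with None (standing in for $null$) when there are too few components.
    return parts + [None] * (max_splits - len(parts))


print(split_to_fields("127.0.0.1", r"\.", 4))
print(split_to_fields("127.0.0", r"\.", 4))
print(split_to_fields("127.0.0.1.2", r"\.", 4))
```

Each call returns exactly max_splits values, mirroring the fixed number of output fields.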

Advanced Settings

These settings control the general behaviour of the regular expression matcher. The default is for all settings to be unchecked. These can generally be left in their default state.

Case insensitive

When checked, regular expression matching will ignore character case.

Multiline

By default, ^ and $ match the start and end of the input text. When checked, ^ and $ will also match the start and end of each line within the input text.

Match '.' as line terminator

When checked, a . in a pattern will also match a line terminator in the input text. By default it does not.

Comments in patterns

When checked, white space and #comments are allowed within regular expression patterns.

Use Unicode word boundaries

This controls the behaviour of \b in a pattern. When checked, word boundaries are found according to the definition of word boundaries in Unicode UAX #29.
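For readers more familiar with Python, the first four options behave much like flags in Python's re module. The sketch below uses that module purely as an analogy: the node itself uses ICU, whose behaviour may differ in detail, and Python's re has no direct counterpart to the Unicode word boundaries option, so that option is not shown.

```python
import re

# Case insensitive: the delimiter "x" also matches "X".
case_parts = re.split("x", "1X2x3", flags=re.IGNORECASE)

# Multiline: ^ also matches at the start of each line within the text.
default_caret = re.findall("^b", "a\nb")
multiline_caret = re.findall("^b", "a\nb", flags=re.MULTILINE)

# '.' matching a line terminator: off by default, on with DOTALL.
default_dot = re.findall("a.b", "a\nb")
dotall_dot = re.findall("a.b", "a\nb", flags=re.DOTALL)

# Comments in patterns: white space and # comments are ignored in the pattern.
commented_pattern = r"""
    \s* , \s*   # a comma with optional surrounding white space
"""
comment_parts = re.split(commented_pattern, "a , b,c", flags=re.VERBOSE)

print(case_parts)
print(default_caret, multiline_caret)
print(default_dot, dotall_dot)
print(comment_parts)
```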

Examples

All examples assume other settings are set to default.

Split an IPv4 address into separate components

This splits numeric IPv4 (Internet Protocol version 4) addresses into 4 output fields. Numeric IPv4 addresses have the form n.n.n.n where n is an integer in the range 0-255. The . character that separates the address components is also a special character in regular expressions, so it has to be escaped with a backslash, i.e. written as \. in the pattern. Note that the output fields are all strings.

Match field: IPv4

Pattern: \.

Output suffix: _SPLIT

Max. splits: 4

IPv4         IPv4_SPLIT1  IPv4_SPLIT2     IPv4_SPLIT3     IPv4_SPLIT4  Notes
127.0.0.1    127          0               0               1            Valid IPv4
127.0.0      127          0               0               $null$       Last component missing
127.0.0.1.2  127          0               0               1            Extra component ignored
127...1      127          (empty string)  (empty string)  1            Empty strings for missing numbers
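The effect of escaping the delimiter can be checked with a short experiment. The sketch below uses Python's re module as a stand-in for the ICU engine, so it illustrates the regular expression semantics only and not the node itself.

```python
import re

address = "127...1"

# Escaped: only literal '.' characters act as delimiters, so the
# missing numbers come back as empty strings.
escaped = re.split(r"\.", address)

# Unescaped: '.' matches any character, so every character is treated
# as a delimiter and only empty strings remain.
unescaped = re.split(".", address)

print(escaped)
print(unescaped)
```

The unescaped pattern discards every character of the address, which is why the example pattern is written as \. rather than a bare dot.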

Scripting

Settings

Node type name: regexp_split

Setting                            Property              Type     Comment
Match field                        match_field           Field    -
Prefix match field to field names  prefix_match_field    Boolean  -
Pattern                            pattern               String   -
Output suffix                      output_suffix         String   -
Max. splits                        split_count           Integer  -
Case insensitive                   opt_case_insensitive  Boolean  -
Multiline                          opt_multiline         Boolean  -
Match '.' as line terminator       opt_dotall            Boolean  -
Comments in patterns               opt_comments          Boolean  -
Use Unicode word boundaries        opt_uword_boundaries  Boolean  -

Scripting Example

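# Create an RX Split node on the current stream canvas
node = modeler.script.stream().createAt("regexp_split", u"RX Split", 512, 192)
node.setPropertyValue("match_field", u"IPv4")
# Do not prefix the Match field name: output fields are named from the suffix alone
node.setPropertyValue("prefix_match_field", False)
# Split on a literal '.'; the backslash is doubled so that the node receives \.
node.setPropertyValue("pattern", u"\\.")
node.setPropertyValue("output_suffix", u"IP_")
node.setPropertyValue("split_count", 4)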