RX Split Node

Overview
Settings
Advanced Settings
Examples
Scripting

Overview

Regular expressions are special text strings that describe particular character patterns. The RX Split node uses a regular expression to split a string into separate components, each of which is added to a new output field. The generated output fields are all strings. Unlike the RX Groups node, which extracts components using different patterns, the RX Split node defines a delimiter that separates the values of interest.

The node uses the ICU Regular Expressions package. Full details of the supported syntax can be found in the ICU documentation.
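As a rough illustration of delimiter-based splitting, the sketch below uses Python's built-in re module instead of the ICU package, so it only approximates what the node does. The input text and the delimiter pattern are made up for the example.

```python
import re

# Python's re module stands in for ICU here. This is an assumption made
# for illustration and is not how the node itself is implemented.
delimiter = re.compile(r"\s*,\s*")  # a comma with optional surrounding white space

# Every match of the delimiter is dropped; the text between matches is kept.
parts = delimiter.split("red , green,blue")
print(parts)  # ['red', 'green', 'blue']
```

The pattern describes only the separator, not the values themselves, which is the key difference from a group-extraction approach.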

Settings

Match field

This is used to select the string field containing the text that should be split by the Pattern.

Prefix match field to field names

This specifies how the new field names should be generated:

  • when checked (the default setting), each new field name is the name of the Match field, followed by the Output suffix value, followed by the split index (from 1 up to the value defined by Max. splits)
  • when unchecked, each new field name is just the Output suffix value followed by the split index (from 1 up to the value defined by Max. splits)

Pattern

This defines the regular expression that will be matched against content of the Match field. Common regular expression components can be viewed and added by using the context menu in the Pattern text area.

Regular Expression Options…

These are described in Advanced Settings below.

Output suffix

This defines either the suffix which will be appended to the Match field or the full name of the new field, depending on the setting of Prefix match field to field names. In both cases, an index value will also be appended to the end of the field name.
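The naming rule can be sketched in a few lines of Python. The function output_field_names and its parameter names are hypothetical and exist only for this illustration; they are not part of the node or its scripting API.

```python
def output_field_names(match_field, output_suffix, max_splits, prefix_match_field=True):
    """Return the generated field names in split order (hypothetical helper)."""
    # With the prefix option on, the Match field name comes first;
    # otherwise the Output suffix alone forms the base of the name.
    base = match_field + output_suffix if prefix_match_field else output_suffix
    # The split index starts at 1 and runs up to the maximum number of splits.
    return [base + str(index) for index in range(1, max_splits + 1)]


prefixed = output_field_names("IPv4", "_SPLIT", 4)
unprefixed = output_field_names("IPv4", "IP_", 4, prefix_match_field=False)
print(prefixed)    # ['IPv4_SPLIT1', 'IPv4_SPLIT2', 'IPv4_SPLIT3', 'IPv4_SPLIT4']
print(unprefixed)  # ['IP_1', 'IP_2', 'IP_3', 'IP_4']
```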

Max. splits

This defines the number of output fields that will be generated. For example:

Match field      IPv4Address
Output suffix    _SPLIT
Max. splits      3

Output fields generated: IPv4Address_SPLIT1, IPv4Address_SPLIT2, IPv4Address_SPLIT3

If the input string is split into fewer components than there are output fields then the extra fields will contain $null$. If the input string is split into more components than there are output fields then any additional split values will be ignored.
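The padding and truncation rule described above might be approximated as follows. This is a sketch using Python's re module rather than ICU; the helper split_to_fields is hypothetical, and Python's None is used here to stand in for $null$.

```python
import re


def split_to_fields(text, pattern, max_splits):
    """Split text on pattern and return exactly max_splits values (hypothetical helper)."""
    parts = re.split(pattern, text)
    # Components beyond the number of output fields are ignored.
    parts = parts[:max_splits]
    # Missing components are padded with None, standing in for $null$.
    parts += [None] * (max_splits - len(parts))
    return parts


too_few = split_to_fields("10.0", r"\.", 3)
too_many = split_to_fields("10.0.0.1", r"\.", 3)
print(too_few)   # ['10', '0', None]
print(too_many)  # ['10', '0', '0']
```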

Advanced Settings

These settings control the general behaviour of the regular expression matcher. The default is for all settings to be unchecked. These can generally be left in their default state.

Case insensitive

When checked, regular expression matching will ignore character case.
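For a feel of the effect, the sketch below uses the analogous re.IGNORECASE flag from Python's re module. It is an approximation of the ICU option, and the sample text is invented.

```python
import re

text = "10kg20KG30Kg40"

# Case-sensitive: only the lowercase "kg" acts as a delimiter.
case_sensitive = re.split(r"kg", text)
# Case-insensitive: "kg", "KG" and "Kg" all act as delimiters.
case_insensitive = re.split(r"kg", text, flags=re.IGNORECASE)

print(case_sensitive)    # ['10', '20KG30Kg40']
print(case_insensitive)  # ['10', '20', '30', '40']
```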

Multiline

By default, ^ and $ match the start and end of the input text. When checked, ^ and $ will also match the start and end of each line within the input text.
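The difference can be seen with the analogous re.MULTILINE flag in Python's re module, used here as a stand-in for the ICU option. The sample text, where a line containing only --- acts as the delimiter, is invented for the example.

```python
import re

text = "alpha\n---\nbeta"
pattern = r"\n^---$\n"  # a line consisting only of "---"

# Default: ^ can only match at the very start of the text, so nothing is split.
default_result = re.split(pattern, text)
# Multiline: ^ and $ also match at line starts and ends, so the delimiter is found.
multiline_result = re.split(pattern, text, flags=re.MULTILINE)

print(default_result)    # ['alpha\n---\nbeta']
print(multiline_result)  # ['alpha', 'beta']
```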

Match ‘.’ as line terminator

When checked, a . in a pattern will also match a line terminator in the input text. By default it does not.
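A small sketch of this behaviour, using the analogous re.DOTALL flag from Python's re module as an approximation of the ICU option. The sample text and the <.> delimiter are invented for the example.

```python
import re

text = "a<\n>b<->c"
pattern = r"<.>"  # any single character between angle brackets

# Default: . does not match the newline, so only "<->" is a delimiter.
default_result = re.split(pattern, text)
# With the option on, . also matches the newline, so "<\n>" is a delimiter too.
dotall_result = re.split(pattern, text, flags=re.DOTALL)

print(default_result)  # ['a<\n>b', 'c']
print(dotall_result)   # ['a', 'b', 'c']
```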

Comments in patterns

When checked, white space and #comments are allowed within regular expression patterns.
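The sketch below shows the same idea with the analogous re.VERBOSE flag in Python's re module, again as an approximation of the ICU option. The commented pattern and the sample text are invented.

```python
import re

# With comments enabled, white space and # comments in the pattern are ignored,
# so a delimiter can be documented piece by piece.
commented = re.compile(
    r"""
    \s*   # optional white space before the delimiter
    ;     # the delimiter itself
    \s*   # optional white space after the delimiter
    """,
    re.VERBOSE,
)

values = commented.split("a ; b;c")
print(values)  # ['a', 'b', 'c']
```

Note that in this mode a literal space or # in the pattern has to be escaped, otherwise it is treated as layout or as the start of a comment.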

Use Unicode word boundaries

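This controls the behaviour of \b in a pattern. When checked, word boundaries are found according to the definition of words in Unicode UAX 29 (Text Boundaries). When unchecked, word boundaries are identified by a simple classification of characters as either word or non-word characters.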

Examples

All examples assume other settings are set to default.

Split an IPv4 address into separate components

This splits numeric IPv4 (Internet Protocol version 4) addresses into 4 output fields. Numeric IPv4 addresses have the form n.n.n.n, where n is an integer in the range 0-255. The . character that separates the address components is also a special character in regular expressions, so it has to be escaped with a backslash, giving the pattern \. as shown below. Note that the output fields are all strings.

Match field      IPv4
Pattern          \.
Output suffix    _SPLIT
Max. splits      4

IPv4           IPv4_SPLIT1    IPv4_SPLIT2     IPv4_SPLIT3     IPv4_SPLIT4    Notes
127.0.0.1      127            0               0               1              Valid IPv4
127.0.0        127            0               0               $null$         Last component missing
127.0.0.1.2    127            0               0               1              Extra component ignored
127...1        127            empty string    empty string    1              Empty strings for missing numbers

Scripting

Settings

Node type name: regexp_split

Setting                              Property                 Type       Comment
Match field                          match_field              Field
Prefix match field to field names    prefix_match_field       Boolean
Pattern                              pattern                  String
Output suffix                        output_suffix            String
Max. splits                          split_count              Integer
Case insensitive                     opt_case_insensitive     Boolean
Multiline                            opt_multiline            Boolean
Match ‘.’ as line terminator         opt_dotall               Boolean
Comments in patterns                 opt_comments             Boolean
Use Unicode word boundaries          opt_uword_boundaries     Boolean

Scripting Example

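# Create an RX Split node on the current stream canvas
node = modeler.script.stream().createAt("regexp_split", u"RX Split", 512, 192)
node.setPropertyValue("match_field", u"IPv4")
# Do not prefix the Match field name, so the fields are named IP_1 to IP_4
node.setPropertyValue("prefix_match_field", False)
# The backslash is doubled so that the node receives the pattern \. (a literal dot)
node.setPropertyValue("pattern", u"\\.")
node.setPropertyValue("output_suffix", u"IP_")
node.setPropertyValue("split_count", 4)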