RX Split Node
Overview
Regular expressions are text strings that describe particular character patterns. The RX Split node uses a regular expression to split a string into separate components, each of which is added to a new output field. The generated output fields are all strings. Unlike the RX Groups node, which extracts components with different patterns, the RX Split node defines a delimiter that separates the values of interest.
The node uses the ICU Regular Expressions package. Full details can be found in the ICU User Guide.
Settings
Match field
This is used to select the string field containing the text that should be split by the Pattern.
Prefix match field to field names
This specifies how the new field names should be generated:
- when checked (the default setting), the new field names are generated by joining the name of the Match field to the Output suffix value, followed by the appropriate split index up to the value defined by Max. splits
- when unchecked, the new field names are just the Output suffix value followed by the appropriate split index up to the value defined by Max. splits
Pattern
This defines the regular expression that will be matched against the content of the Match field. Common regular expression components can be viewed and added by using the context menu in the Pattern text area.
Regular Expression Options…
These are described in Advanced Settings below.
Output suffix
This defines either the suffix which will be appended to the Match field or the full name of the new field, depending on the setting of Prefix match field to field names. In both cases, an index value will also be appended to the end of the field name.
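The naming rule described above can be sketched in Python. This is only an illustration of the rule as documented; the helper name `output_field_names` and its parameters are hypothetical and are not part of the node's scripting API.

```python
def output_field_names(match_field, output_suffix, max_splits, prefix_match_field=True):
    """Return the field names the node is described as generating."""
    # With the prefix option checked, the Match field name comes first.
    base = (match_field + output_suffix) if prefix_match_field else output_suffix
    # Split indexes start at 1 and run up to the Max. splits value.
    return [base + str(index) for index in range(1, max_splits + 1)]


prefixed = output_field_names("IPv4Address", "_SPLIT", 3)
unprefixed = output_field_names("IPv4Address", "IP_", 3, prefix_match_field=False)
print(prefixed)    # ['IPv4Address_SPLIT1', 'IPv4Address_SPLIT2', 'IPv4Address_SPLIT3']
print(unprefixed)  # ['IP_1', 'IP_2', 'IP_3']
```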
Max. splits
This defines the number of output fields that will be generated. For example:
Match field: IPv4Address
Output suffix: _SPLIT
Max. splits: 3
Output fields generated: IPv4Address_SPLIT1, IPv4Address_SPLIT2, IPv4Address_SPLIT3
If the input string is split into fewer components than there are output fields, the extra fields will contain $null$. If the input string is split into more components than there are output fields, any additional split values will be ignored.
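The padding and truncation behaviour can be approximated with a short Python sketch. It assumes Python's `re` module as a stand-in for the ICU engine and uses `None` to represent $null$; the function name `rx_split` is hypothetical.

```python
import re


def rx_split(text, pattern, max_splits):
    """Split text on pattern and return exactly max_splits values."""
    parts = re.split(pattern, text)
    # Components beyond max_splits are ignored.
    parts = parts[:max_splits]
    # Missing components are padded with None (standing in for $null$).
    return parts + [None] * (max_splits - len(parts))


too_few = rx_split("a,b", ",", 3)
too_many = rx_split("a,b,c,d", ",", 3)
print(too_few)   # ['a', 'b', None]
print(too_many)  # ['a', 'b', 'c']
```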
Advanced Settings
These settings control the general behaviour of the regular expression matcher. The default is for all settings to be unchecked. These can generally be left in their default state.
Case insensitive
When checked, regular expression matching will ignore character case.
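As a rough illustration, the effect on a delimiter pattern can be reproduced with Python's `re` module, which is assumed here to behave like the ICU engine for this simple case. The input string is made up for the example.

```python
import re

text = "1x2X3"
case_sensitive = re.split("x", text)
case_insensitive = re.split("x", text, flags=re.IGNORECASE)
print(case_sensitive)    # ['1', '2X3']
print(case_insensitive)  # ['1', '2', '3']
```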
Multiline
By default, ^ and $ match the start and end of the input text. When checked, ^ and $ will also match the start and end of each line within the input text.
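The difference can be seen in a small Python sketch, assuming the `re` module's MULTILINE flag mirrors this option. It uses `findall` rather than a split simply to show where the ^ anchor is allowed to match; the sample text is invented.

```python
import re

text = "one\ntwo\nthree"
default_mode = re.findall(r"^\w+", text)
multiline_mode = re.findall(r"^\w+", text, flags=re.MULTILINE)
print(default_mode)    # ['one']
print(multiline_mode)  # ['one', 'two', 'three']
```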
Match ‘.’ as line terminator
When checked, a . in a pattern will match a line terminator in the input text, which by default it will not.
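A minimal Python sketch of the same idea, assuming the `re` module's DOTALL flag corresponds to this option. The pattern and input are illustrative only.

```python
import re

text = "xa\nby"
# Without the option, '.' cannot match the newline, so nothing is split.
default_mode = re.split("a.b", text)
# With the option, '.' matches the newline and 'a\nb' acts as the delimiter.
dotall_mode = re.split("a.b", text, flags=re.DOTALL)
print(default_mode)  # ['xa\nby']
print(dotall_mode)   # ['x', 'y']
```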
Comments in patterns
When checked, white space and #comments are allowed within regular expression patterns.
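For illustration, a commented delimiter pattern might look like the following in Python, assuming the `re` module's VERBOSE flag is a reasonable analogue of this option. The pattern itself is a made-up example.

```python
import re

# White space and #comments inside the pattern are ignored by the matcher.
pattern = r"""
    \s*   # optional white space before the delimiter
    ,     # the delimiter itself
    \s*   # optional white space after the delimiter
"""
parts = re.split(pattern, "a , b,c", flags=re.VERBOSE)
print(parts)  # ['a', 'b', 'c']
```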
Use Unicode word boundaries
This controls the behaviour of \b in a pattern. When checked, word boundaries are found according to the definition of a word in Unicode UAX 29.
Examples
All examples assume other settings are set to default.
Split an IPv4 address into separate components
This splits numeric IPv4 (Internet Protocol) addresses into 4 output fields. Numeric IPv4 addresses have the form n.n.n.n where n is an integer in the range 0-255. The . character that is used to separate the address components is also a special character in regular expressions, so it has to be escaped with \ (i.e. \.). Note that the output fields are all strings.
Match field: IPv4
Pattern: \.
Output suffix: _SPLIT
Max. splits: 4
IPv4 | IPv4_SPLIT1 | IPv4_SPLIT2 | IPv4_SPLIT3 | IPv4_SPLIT4 | Notes |
---|---|---|---|---|---|
127.0.0.1 | 127 | 0 | 0 | 1 | Valid IPv4 |
127.0.0 | 127 | 0 | 0 | $null$ | Last component missing |
127.0.0.1.2 | 127 | 0 | 0 | 1 | Extra component ignored |
127...1 | 127 | empty string | empty string | 1 | Empty strings for missing numbers |
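Why the escape matters can be shown with a short Python sketch. It assumes Python's `re` module treats these simple patterns the same way as the ICU engine; the variable names are illustrative.

```python
import re

address = "127.0.0.1"
# Escaped: only literal '.' characters act as delimiters.
escaped = re.split(r"\.", address)
# Unescaped: '.' matches every character, so only empty strings remain.
unescaped = re.split(r".", address)
# Consecutive delimiters produce empty strings, as in the 127...1 row.
gaps = re.split(r"\.", "127...1")
print(escaped)    # ['127', '0', '0', '1']
print(unescaped)  # ['', '', '', '', '', '', '', '', '', '']
print(gaps)       # ['127', '', '', '1']
```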
Scripting
Settings
Node type name: regexp_split
Setting | Property | Type | Comment |
---|---|---|---|
Match field | match_field | Field | – |
Prefix match field to field names | prefix_match_field | Boolean | – |
Pattern | pattern | String | – |
Output suffix | output_suffix | String | – |
Max. splits | split_count | Integer | – |
Case insensitive | opt_case_insensitive | Boolean | – |
Multiline | opt_multiline | Boolean | – |
Match ‘.’ as line terminator | opt_dotall | Boolean | – |
Comments in patterns | opt_comments | Boolean | – |
Use Unicode word boundaries | opt_uword_boundaries | Boolean | – |
Scripting Example
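# Create the node on the current stream at canvas position (512, 192)
node = modeler.script.stream().createAt("regexp_split", u"RX Split", 512, 192)
node.setPropertyValue("match_field", u"IPv4")
# Name the output fields from the suffix alone: IP_1 to IP_4
node.setPropertyValue("prefix_match_field", False)
# The backslash is doubled so the pattern reaches the node as \.
node.setPropertyValue("pattern", u"\\.")
node.setPropertyValue("output_suffix", u"IP_")
node.setPropertyValue("split_count", 4)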