RX Groups Node


Overview

Regular expressions are text strings that describe character patterns. The RX Groups node uses a regular expression to match specific parts of a string and writes each matched part to a new output field. The generated output fields are all strings. Unlike the RX Split node, which defines a delimiter that separates the values of interest, the RX Groups node can extract components that each follow a different pattern.

The node generates new string fields containing the groups that have been extracted from the input field along with a field containing the full match.
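The behaviour described above can be sketched outside the product. This is only an illustration under stated assumptions: it uses Python's re module as a stand-in for ICU (the two engines differ in some syntax), it assumes the first match anywhere in the text is used (the node's search behaviour is not spelled out here), and the function name extract_groups is hypothetical. None stands in for $null$.

```python
import re

def extract_groups(text, pattern):
    """Return (full_match, groups); every value is None when nothing matches."""
    rx = re.compile(pattern)
    m = rx.search(text)  # assumption: first match anywhere in the text
    if m is None:
        # One None per group defined in the pattern, plus None for the full match
        return None, (None,) * rx.groups
    return m.group(0), m.groups()

# Three components, each following a different pattern
pattern = r"([A-Z]+)-(\d{4})/(\d+)"
matched = extract_groups("Ref INV-2024/0042 paid", pattern)
unmatched = extract_groups("no reference", pattern)
print(matched)
print(unmatched)
```

Each element of the returned tuple corresponds to one generated group field, and the first return value corresponds to the full-match field.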

The node uses the ICU Regular Expressions package. Full details can be found in the Regular Expressions section of the ICU User Guide.

Settings

Match field

This selects the string field containing the text to be matched against the Pattern.

Prefix match field to field names

This specifies how the new field names should be generated:

  • when checked (the default), each new field name is formed by prefixing the name of the Match field to the All match name value and to each of the Group names (for example, a Match field of URL with an All match name of _MATCH produces URL_MATCH)
  • when unchecked, the new field names are exactly the All match name value and the names defined by Group names
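The naming rule above can be written down as a small function. This is a sketch of the rule as described, not the node's own code; the function name output_field_names and its parameters are hypothetical.

```python
def output_field_names(match_field, all_match_name, group_names, prefix=True):
    """Build the output field names: the full-match field first, then one per group."""
    names = [all_match_name] + list(group_names)
    if prefix:
        # Checked: prepend the Match field name to every generated name
        return [match_field + name for name in names]
    # Unchecked: use the names exactly as given
    return names

prefixed = output_field_names("URL", "_MATCH", ["Protocol"])
unprefixed = output_field_names("URL", "Matched", ["Protocol", "Remainder"], prefix=False)
print(prefixed)    # ['URL_MATCH', 'URLProtocol']
print(unprefixed)  # ['Matched', 'Protocol', 'Remainder']
```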

Pattern

This defines the actual regular expression that will be matched against content of the Match field. Common regular expression components can be viewed and added by using the context menu in the Pattern text area.

Regular Expression Options...

These are described in Advanced Settings below.

All match name

This defines either the suffix that is appended to the Match field name or the full name of the new field, depending on the setting of Prefix match field to field names. The resulting output field contains the full text that was matched by the pattern, or $null$ if the input field did not match the regular expression in Pattern.

Group names

This defines the names, and therefore the number, of the output fields that will be generated for the groups defined in Pattern.

Reference groups

This defines how the groups defined by the Pattern are mapped to output fields:

  • By position (the default setting): output fields are filled in the order in which the groups are defined by Pattern, i.e. the first column contains the first group matched, the second column the second group matched, and so on. The Pattern regular expression may include group names, but they are ignored.
  • By name: output fields are filled from the groups with matching names defined by Pattern.
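The difference between the two modes only shows when the order of Group names differs from the order of the groups in the pattern. The sketch below illustrates one reading of the two bullets above; how the node itself treats a mismatched order is an assumption. It uses Python's re module, which writes named groups as (?P<name>...) rather than ICU's (?<name>...) and lacks [[:alpha:]], so [A-Za-z] is substituted as an ASCII-only approximation. The function name map_groups is hypothetical.

```python
import re

def map_groups(text, pattern, group_names, by="position"):
    """Map captured groups to output names, by position or by group name."""
    m = re.search(pattern, text)
    if m is None:
        return {name: None for name in group_names}  # None stands in for $null$
    if by == "position":
        # Names in the pattern are ignored: first name gets first group, etc.
        return dict(zip(group_names, m.groups()))
    # By name: each output is filled from the group with the same name
    return {name: m.group(name) for name in group_names}

pattern = r"(?P<Protocol>[A-Za-z]+)://(?P<Remainder>.*)"
url = "https://www.amazon.co.uk/"
names = ["Remainder", "Protocol"]  # deliberately listed in reverse order

by_position = map_groups(url, pattern, names, by="position")
by_name = map_groups(url, pattern, names, by="name")
print(by_position)
print(by_name)
```

By position, the output named Remainder receives the first group (https); by name, it receives the text captured by the Remainder group.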

Advanced Settings

These settings control the general behaviour of the regular expression matcher. All settings are unchecked by default and can generally be left that way.

Case insensitive

When checked, regular expression matching will ignore character case.
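As an illustration of the effect, the sketch below uses Python's re.IGNORECASE flag as a stand-in for this option; the node itself uses ICU, so this is an analogy rather than the node's implementation.

```python
import re

pattern = r"(https?)://"
text = "HTTPS://www.dmg.org"

# Default: the lower-case pattern does not match upper-case text
default_match = re.search(pattern, text)

# Case insensitive: character case is ignored, the matched text keeps its case
ci_match = re.search(pattern, text, re.IGNORECASE)

print(default_match)      # None
print(ci_match.group(1))  # HTTPS
```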

Multiline

By default, ^ and $ match the start and end of the input text. When checked, ^ and $ will also match the start and end of each line within the input text.
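A short sketch of the difference, using Python's re.MULTILINE flag as a stand-in for this option (an analogy only; the node uses ICU):

```python
import re

text = "first line\nsecond line"
pattern = r"^(\w+) line$"

# Default: ^ and $ anchor to the whole text, so nothing matches
default_hits = re.findall(pattern, text)

# Multiline: ^ and $ also anchor at each line, so both lines match
multiline_hits = re.findall(pattern, text, re.MULTILINE)

print(default_hits)    # []
print(multiline_hits)  # ['first', 'second']
```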

Match '.' as line terminator

When checked, a . in a pattern will also match a line terminator in the input text; by default it does not.
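The effect on a captured group can be sketched with Python's re.DOTALL flag, used here as a stand-in for this option (an analogy only; the node uses ICU):

```python
import re

text = "key=line one\nline two"
pattern = r"key=(.*)"

# Default: . stops at the line terminator
default_value = re.search(pattern, text).group(1)

# Dot matches line terminators: the group runs to the end of the text
dotall_value = re.search(pattern, text, re.DOTALL).group(1)

print(repr(default_value))  # 'line one'
print(repr(dotall_value))   # 'line one\nline two'
```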

Comments in patterns

When checked, white space and #comments are allowed within regular expression patterns.
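A pattern written with white space and comments might look like the sketch below. It uses Python's re.VERBOSE flag as a stand-in for this option, with [A-Za-z] substituted for ICU's [[:alpha:]] because Python's re module does not support that class.

```python
import re

# White space is ignored and # starts a comment that runs to the end of the line
pattern = r"""
    ([A-Za-z]+)   # group 1: the protocol
    ://           # separator
    (.*)          # group 2: everything after the separator
"""

m = re.search(pattern, "http://www.icu-project.org/apiref/", re.VERBOSE)
parts = m.groups()
print(parts)  # ('http', 'www.icu-project.org/apiref/')
```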

Use Unicode word boundaries

This controls the behaviour of \b in a pattern. When checked, word boundaries are found according to the definition of word boundaries in Unicode UAX #29.

Examples

All examples assume other settings are set to default.

Extract URL Protocol By Position

A URL begins with a protocol (such as http, https or ftp) that runs up to the first :. The pattern assumes that the protocol contains only alphabetic characters, although some protocols include digits (e.g. s3://, used by Amazon S3).

Match field: URL

Pattern: ([[:alpha:]]+):

All match name: _MATCH

Group names:

  • Protocol

Reference groups: By position

| URL | URL_MATCH | URLProtocol | Notes |
|---|---|---|---|
| https://www.amazon.co.uk/ | https: | https | Valid URL |
| http://www.icu-project.org/apiref/ | http: | http | Valid URL |
| www.dmg.org | $null$ | $null$ | Missing protocol, no match |
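The results above can be reproduced approximately outside the node. The sketch below uses Python's re module with [A-Za-z] as an ASCII-only substitute for [[:alpha:]], and assumes the first match anywhere in the text is used. The second pattern, which also accepts digits after the first letter, is a hypothetical variant for protocols such as s3 and is not part of the example's settings.

```python
import re

def match_and_protocol(url, pattern):
    """Return (full match, first group), or (None, None) when there is no match."""
    m = re.search(pattern, url)
    return (m.group(0), m.group(1)) if m else (None, None)

letters_only = r"([A-Za-z]+):"
letters_and_digits = r"([A-Za-z][A-Za-z0-9]*):"  # hypothetical variant

urls = ["https://www.amazon.co.uk/", "http://www.icu-project.org/apiref/", "www.dmg.org"]
rows = [match_and_protocol(u, letters_only) for u in urls]
print(rows)

# The letters-only pattern misses s3://, the variant does not
s3_strict = match_and_protocol("s3://bucket/key", letters_only)
s3_relaxed = match_and_protocol("s3://bucket/key", letters_and_digits)
print(s3_strict, s3_relaxed)
```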

Extract URL Protocol By Group Name

This example is similar to the previous one. However, in this case, the group is given a name of Protocol in the Pattern which is then used to reference the named output field by changing Reference groups to By name. Note that the results are the same as the previous example, and would also be the same even if Reference groups was changed to By position because there is only a single group being matched.

Match field: URL

Pattern: (?<Protocol>[[:alpha:]]+):

All match name: _MATCH

Group names:

  • Protocol

Reference groups: By name

| URL | URL_MATCH | URLProtocol | Notes |
|---|---|---|---|
| https://www.amazon.co.uk/ | https: | https | Valid URL |
| http://www.icu-project.org/apiref/ | http: | http | Valid URL |
| www.dmg.org | $null$ | $null$ | Missing protocol, no match |

Extract URL Protocol And Remainder By Group Name

This is similar to the previous example, except that all the text after the protocol:// component is also captured in a second group.

Match field: URL

Pattern: (?<Protocol>[[:alpha:]]+)://(?<Remainder>.*)

All match name: _MATCH

Group names:

  • Protocol
  • Remainder

Reference groups: By name

| URL | URL_MATCH | URLProtocol | URLRemainder | Notes |
|---|---|---|---|---|
| https://www.amazon.co.uk/ | https://www.amazon.co.uk/ | https | www.amazon.co.uk/ | Valid URL |
| http://www.icu-project.org/apiref/ | http://www.icu-project.org/apiref/ | http | www.icu-project.org/apiref/ | Valid URL |
| www.dmg.org | $null$ | $null$ | $null$ | No match |

Scripting

Settings

Node type name: regexp_groups

| Setting | Property | Type | Comment |
|---|---|---|---|
| Match field | match_field | Field | - |
| Prefix match field to field names | prefix_match_field | Boolean | - |
| Pattern | pattern | String | - |
| All match name | all_match_name | String | - |
| Group names | group_names | String List | - |
| Reference groups | grouping | position or name | - |
| Case insensitive | opt_case_insensitive | Boolean | - |
| Multiline | opt_multiline | Boolean | - |
| Match '.' as line terminator | opt_dotall | Boolean | - |
| Comments in patterns | opt_comments | Boolean | - |
| Use Unicode word boundaries | opt_uword_boundaries | Boolean | - |

Scripting Example

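# Create an RX Groups node in the current stream at canvas position (512, 192)
node = modeler.script.stream().createAt("regexp_groups", u"RX Groups", 512, 192)
node.setPropertyValue("match_field", u"URL")
# Use the names below as-is rather than prefixing them with the Match field name
node.setPropertyValue("prefix_match_field", False)
# Two named groups: the protocol, and everything after ://
node.setPropertyValue("pattern", u"(?<Protocol>[[:alpha:]]+)://(?<Remainder>.*)")
node.setPropertyValue("all_match_name", u"Matched")
node.setPropertyValue("group_names", [u"Protocol", u"Remainder"])
# Fill the output fields from the named groups in the pattern
node.setPropertyValue("grouping", u"name")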
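When several properties need to be set, including the advanced options, they can be collected in a dictionary and applied in a loop. The sketch below is a convenience pattern, not part of the scripting API: the helper configure_rx_groups and the StubNode class are hypothetical, and StubNode exists only so the sketch can run outside the scripting environment. Only setPropertyValue and the property names from the table above are taken from this document.

```python
def configure_rx_groups(node, properties):
    """Apply each property to the node through setPropertyValue."""
    for name, value in properties.items():
        node.setPropertyValue(name, value)
    return node

class StubNode(object):
    """Hypothetical stand-in for a real node; it just records what was set."""
    def __init__(self):
        self.values = {}

    def setPropertyValue(self, name, value):
        self.values[name] = value

# Settings from the first example, plus one advanced option
node = configure_rx_groups(StubNode(), {
    "match_field": "URL",
    "pattern": "([[:alpha:]]+):",
    "all_match_name": "_MATCH",
    "group_names": ["Protocol"],
    "grouping": "position",
    "opt_case_insensitive": True,
})
print(node.values["grouping"])
```

In a real script, the node returned by createAt would be passed in place of StubNode().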