RX Groups Node

Overview
Settings
Advanced Settings
Examples
Scripting

Overview

Regular expressions are text strings that describe particular character patterns. The RX Groups node uses a regular expression to match specific items in a string and adds each matched item to a new output field. The generated output fields are all strings. Unlike the RX Split node, which defines a delimiter that separates the values of interest, the RX Groups node can extract components that follow different patterns.

The node generates new string fields containing the groups that have been extracted from the input field along with a field containing the full match.

The node uses the ICU Regular Expressions package. Full details can be found in the ICU Regular Expressions documentation.

Settings

Match field

This selects the string field containing the text that the Pattern is matched against.

Prefix match field to field names

This specifies how the new field names should be generated:

  • when checked (the default), each new field name is the name of the Match field joined to the All match name value (for the full-match field) or to a group name (for each group field). For example, with a Match field of URL, an All match name of _MATCH and a group name of Protocol, the new fields are URL_MATCH and URLProtocol
  • when unchecked, the new field names are the All match name value and the group names defined by Group names, with nothing added
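
The naming rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not the node's implementation; the function name output_field_names is hypothetical, and the ordering (full-match field first, then the groups) is an assumption based on the column order in the example tables.

```python
def output_field_names(match_field, all_match_name, group_names, prefix=True):
    """Return the output field names: the full-match field, then one per group."""
    names = [all_match_name] + list(group_names)
    if prefix:
        # "Prefix match field to field names" checked: join the Match field name on.
        names = [match_field + name for name in names]
    return names


print(output_field_names("URL", "_MATCH", ["Protocol"]))
print(output_field_names("URL", "Matched", ["Protocol"], prefix=False))
```

With the option checked, an All match name such as _MATCH reads naturally as a suffix; with it unchecked, a standalone name such as Matched is the better choice.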

Pattern

This defines the actual regular expression that will be matched against content of the Match field. Common regular expression components can be viewed and added by using the context menu in the Pattern text area.

Regular Expression Options…

These are described in Advanced Settings below.

All match name

This defines either the suffix which will be appended to the Match field or the full name of the new field, depending on the setting of Prefix match field to field names. The resulting output field contains the full pattern that was matched or $null$ if the input field did not match the regular expression in Pattern.

Group names

This defines the names, and therefore the number, of the output fields that will be generated for the groups defined in Pattern.

Reference groups

This defines how the groups defined by the Pattern are mapped to output fields:

  • By position: (the default setting) this will produce output fields based on the order in which the groups are defined by Pattern i.e. the first column will contain the first group matched, the second column the second group matched etc. The Pattern regular expression may include group names but they will be ignored.
  • By name: this will produce output fields based on the group names defined by Pattern
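
The difference between the two settings shows up when the Group names list is not in the same order as the groups in the pattern. The sketch below uses Python's re module as a stand-in for ICU, which is an assumption: Python writes named groups as (?P<name>...) rather than (?<name>...), and [A-Za-z] replaces [[:alpha:]]. It also assumes the pattern may match anywhere in the text.

```python
import re

# Python spells named groups (?P<name>...); ICU uses (?<name>...).
pattern = re.compile(r"(?P<Protocol>[A-Za-z]+)://(?P<Remainder>.*)")
m = pattern.search("https://www.amazon.co.uk/")

group_names = ["Remainder", "Protocol"]  # deliberately not in pattern order

# By position: the i-th output field takes the i-th group; names in the pattern are ignored.
by_position = dict(zip(group_names, m.groups()))

# By name: each output field takes the group that carries the same name in the pattern.
by_name = {name: m.group(name) for name in group_names}

print(by_position)
print(by_name)
```

By position, the field called Remainder ends up holding the protocol because it is listed first; by name, each field receives the group it is named after.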

Advanced Settings

These settings control the general behaviour of the regular expression matcher. The default is for all settings to be unchecked. These can generally be left in their default state.

Case insensitive

When checked, regular expression matching will ignore character case.
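
As a rough illustration, the same idea in Python's re module (used here as a stand-in for ICU, which is an assumption) is the IGNORECASE flag. The pattern and input are made up for the example.

```python
import re

url = "HTTPS://www.amazon.co.uk/"

# Lower-case-only pattern: no match against an upper-case protocol by default.
default_match = re.search(r"([a-z]+):", url)

# With case ignored, the same pattern matches.
insensitive_match = re.search(r"([a-z]+):", url, re.IGNORECASE)

print(default_match)
print(insensitive_match.group(1))
```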

Multiline

By default, ^ and $ match the start and end of the input text. When checked, ^ and $ will also match the start and end of each line within the input text.
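
A small sketch of the effect, using Python's re module as a stand-in for ICU (an assumption; the MULTILINE flag plays the role of this option). The sample text is made up.

```python
import re

text = "first line\nsecond line"

# Default: ^ anchors only at the start of the whole input.
default_hits = re.findall(r"^\w+", text)

# Multiline: ^ also anchors at the start of each line.
multiline_hits = re.findall(r"^\w+", text, re.MULTILINE)

print(default_hits)
print(multiline_hits)
```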

Match ‘.’ as line terminator

When checked, a . in a pattern will also match a line terminator in the input text. By default it will not.
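
The behaviour can be illustrated with Python's re module, used here as a stand-in for ICU (an assumption; the DOTALL flag corresponds to this option). The sample text is made up.

```python
import re

text = "line one\nline two"

# Default: . does not match the newline between the two lines.
default_match = re.search(r"one.line", text)

# With the option on, . matches the line terminator as well.
dotall_match = re.search(r"one.line", text, re.DOTALL)

print(default_match)
print(repr(dotall_match.group(0)))
```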

Comments in patterns

When checked, white space and #comments are allowed within regular expression patterns.
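
For a sense of what a commented pattern looks like, here is a sketch using Python's re module as a stand-in for ICU (an assumption; the VERBOSE flag corresponds to this option, and [A-Za-z] replaces [[:alpha:]]).

```python
import re

# White space and #comments inside the pattern are ignored by the matcher.
pattern = re.compile(
    r"""
    ([A-Za-z]+)   # protocol, letters only
    ://           # separator between protocol and the rest
    (.*)          # remainder of the URL
    """,
    re.VERBOSE,
)

m = pattern.search("http://www.icu-project.org/apiref/")
print(m.groups())
```

Note that with this option on, a literal space or # in the pattern has to be escaped or placed in a character class.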

Use Unicode word boundaries

This controls the behaviour of \b in a pattern. When checked, word boundaries are found according to the definitions of word found in Unicode UAX 29.

Examples

All examples assume other settings are set to default.

Extract URL Protocol By Position

A URL includes a protocol (such as http, https or ftp) as the first item, up to the :. The pattern assumes that the protocol only contains alpha characters, although some protocols can include numbers (e.g. Amazon s3://).

Match field: URL

Pattern: ([[:alpha:]]+):

All match name: _MATCH

Group names:

  • Protocol

Reference groups: By position

| URL | URL_MATCH | URLProtocol | Notes |
| https://www.amazon.co.uk/ | https: | https | Valid URL |
| http://www.icu-project.org/apiref/ | http: | http | Valid URL |
| www.dmg.org | $null$ | $null$ | Missing protocol, no match |
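
The results in this example can be reproduced approximately outside the node. The sketch below uses Python's re module as a stand-in for ICU, which is an assumption: [A-Za-z] replaces [[:alpha:]], None stands in for $null$, and the pattern is assumed to match anywhere in the text. The function name extract and the s3 address are hypothetical.

```python
import re

pattern = re.compile(r"([A-Za-z]+):")


def extract(url):
    """Return the full match and the first group, or None values when nothing matches."""
    m = pattern.search(url)
    if m is None:
        return {"URL_MATCH": None, "URLProtocol": None}
    return {"URL_MATCH": m.group(0), "URLProtocol": m.group(1)}


for url in ["https://www.amazon.co.uk/", "http://www.icu-project.org/apiref/", "www.dmg.org"]:
    print(url, extract(url))

# The digit in s3 breaks the run of letters before the colon, so nothing matches here.
print(extract("s3://bucket/key"))
```

Widening the character class to include digits would be one way to accept protocols such as s3.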

Extract URL Protocol By Group Name

This example is similar to the previous one. However, in this case, the group is given a name of Protocol in the Pattern which is then used to reference the named output field by changing Reference groups to By name. Note that the results are the same as the previous example, and would also be the same even if Reference groups was changed to By position because there is only a single group being matched.

Match field: URL

Pattern: (?<Protocol>[[:alpha:]]+):

All match name: _MATCH

Group names:

  • Protocol

Reference groups: By name

| URL | URL_MATCH | URLProtocol | Notes |
| https://www.amazon.co.uk/ | https: | https | Valid URL |
| http://www.icu-project.org/apiref/ | http: | http | Valid URL |
| www.dmg.org | $null$ | $null$ | Missing protocol, no match |

Extract URL Protocol And Remainder By Group Name

This is similar to the previous example except this time, all the text after the *protocol*:// component is also matched to a group.

Match field: URL

Pattern: (?<Protocol>[[:alpha:]]+)://(?<Remainder>.*)

All match name: _MATCH

Group names:

  • Protocol
  • Remainder

Reference groups: By name

| URL | URL_MATCH | URLProtocol | URLRemainder | Notes |
| https://www.amazon.co.uk/ | https://www.amazon.co.uk/ | https | www.amazon.co.uk/ | Valid URL |
| http://www.icu-project.org/apiref/ | http://www.icu-project.org/apiref/ | http | www.icu-project.org/apiref/ | Valid URL |
| www.dmg.org | $null$ | $null$ | $null$ | No match |
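
Putting the settings of this example together, one row of output could be computed roughly as follows. This is a sketch under stated assumptions rather than the node's implementation: Python's re module stands in for ICU (so named groups are written (?P<name>...) and [A-Za-z] replaces [[:alpha:]]), None stands in for $null$, the pattern is assumed to match anywhere in the text, and the function name rx_groups_by_name is hypothetical.

```python
import re


def rx_groups_by_name(value, pattern, match_field, all_match_name, group_names):
    """Return the prefixed output fields for one input value."""
    fields = [match_field + all_match_name] + [match_field + g for g in group_names]
    m = re.search(pattern, value)
    if m is None:
        # No match: every output field is null.
        return dict.fromkeys(fields)
    values = [m.group(0)] + [m.group(g) for g in group_names]
    return dict(zip(fields, values))


PATTERN = r"(?P<Protocol>[A-Za-z]+)://(?P<Remainder>.*)"
GROUPS = ["Protocol", "Remainder"]

row = rx_groups_by_name("http://www.icu-project.org/apiref/", PATTERN, "URL", "_MATCH", GROUPS)
no_match = rx_groups_by_name("www.dmg.org", PATTERN, "URL", "_MATCH", GROUPS)
print(row)
print(no_match)
```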

Scripting

Settings

Node type name: regexp_groups

| Setting | Property | Type | Comment |
| Match field | match_field | Field | |
| Prefix match field to field names | prefix_match_field | Boolean | |
| Pattern | pattern | String | |
| All match name | all_match_name | String | |
| Group names | group_names | String List | |
| Reference groups | grouping | position or name | |
| Case insensitive | opt_case_insensitive | Boolean | |
| Multiline | opt_multiline | Boolean | |
| Match ‘.’ as line terminator | opt_dotall | Boolean | |
| Comments in patterns | opt_comments | Boolean | |
| Use Unicode word boundaries | opt_uword_boundaries | Boolean | |

Scripting Example

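# Create an RX Groups node on the current stream at canvas position (512, 192)
node = modeler.script.stream().createAt("regexp_groups", u"RX Groups", 512, 192)
# Match against the URL field; output field names are not prefixed with URL
node.setPropertyValue("match_field", u"URL")
node.setPropertyValue("prefix_match_field", False)
# Named groups capture the protocol and everything after ://
node.setPropertyValue("pattern", u"(?<Protocol>[[:alpha:]]+)://(?<Remainder>.*)")
# The full match goes to Matched; groups are mapped to output fields by name
node.setPropertyValue("all_match_name", u"Matched")
node.setPropertyValue("group_names", [u"Protocol", u"Remainder"])
node.setPropertyValue("grouping", u"name")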
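
When several RX Groups nodes need similar settings, the property names from the table above can be kept in one list and applied in a loop. The sketch below is one possible arrangement, not part of the product: configure_node and RX_GROUPS_SETTINGS are hypothetical names, and the only node method used is setPropertyValue, as in the example above. The values shown are the settings from the first example plus the case-insensitive option.

```python
# (property, value) pairs, applied in the order listed.
RX_GROUPS_SETTINGS = [
    ("match_field", u"URL"),
    ("prefix_match_field", True),
    ("pattern", u"([[:alpha:]]+):"),
    ("all_match_name", u"_MATCH"),
    ("group_names", [u"Protocol"]),
    ("grouping", u"position"),
    ("opt_case_insensitive", True),
]


def configure_node(node, settings):
    """Apply each (property, value) pair to the node with setPropertyValue."""
    for name, value in settings:
        node.setPropertyValue(name, value)
    return node


# Inside a Modeler script this would be used as:
# node = modeler.script.stream().createAt("regexp_groups", u"RX Groups", 512, 192)
# configure_node(node, RX_GROUPS_SETTINGS)
```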