String Cleaner Node

Overview

The String Cleaner node provides common string cleaning operations in a simple-to-use form. Operations include removing non-standard characters, trimming leading and/or trailing whitespace, and changing capitalization. These operations can be applied to multiple input fields, and a new string output field is created for each input field.

Settings

The node settings are split into related sub-groups.

Fields

Clean fields

This is used to select which string fields should be cleaned.

Output suffix

New field names are generated by appending the output suffix to the name of each field selected in Clean fields. For example:

Clean fields: HomePhone, MobilePhone

Output suffix: _Cleaned

Output fields generated: HomePhone_Cleaned, MobilePhone_Cleaned
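
The naming rule can be sketched in a few lines of Python. This is only an illustration of the rule described above, not the node's implementation; the function name `output_field_names` is hypothetical.

```python
def output_field_names(clean_fields, output_suffix):
    """Return one output field name per selected input field."""
    # Each output name is the input name followed by the suffix.
    return [name + output_suffix for name in clean_fields]


names = output_field_names(["HomePhone", "MobilePhone"], "_Cleaned")
print(names)  # ['HomePhone_Cleaned', 'MobilePhone_Cleaned']
```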

Whitespace

Leading and trailing spaces

This specifies how strings should be trimmed:

None: (the default setting) the string is not trimmed
Left: removes spaces at the start of the string
Right: removes spaces at the end of the string
Both: removes spaces at the start and end of the string

Replace tab with space

When checked, tab characters will be replaced with space characters.

Replace duplicate space or tab with space

When checked, two or more adjacent space or tab characters will be replaced with a single space character.
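
The combined effect of the three whitespace settings can be approximated in plain Python. This is a minimal sketch, not the node's implementation: the function name `clean_whitespace` is hypothetical, the parameter names simply mirror the scripting properties listed later on this page, and the order in which the steps run (tabs, then duplicates, then trimming) is an assumption.

```python
import re


def clean_whitespace(text, trim_mode="none", replace_tabs=False,
                     replace_duplicate_blanks=False):
    """Approximate the Whitespace settings on a single string."""
    if replace_tabs:
        # Replace every tab with a single space.
        text = text.replace("\t", " ")
    if replace_duplicate_blanks:
        # Collapse runs of two or more spaces/tabs into one space.
        text = re.sub(r"[ \t]{2,}", " ", text)
    # Trimming removes space characters only, as described above.
    if trim_mode == "left":
        text = text.lstrip(" ")
    elif trim_mode == "right":
        text = text.rstrip(" ")
    elif trim_mode == "both":
        text = text.strip(" ")
    elif trim_mode != "none":
        raise ValueError("unknown trim_mode: {0!r}".format(trim_mode))
    return text


result = clean_whitespace("  John\t\tSmith  ", trim_mode="both",
                          replace_tabs=True, replace_duplicate_blanks=True)
print(repr(result))  # 'John Smith'
```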

Capitalization

Capitalize

This specifies how character case should be changed in the string:

Leave unchanged: (the default setting) character cases are not modified
ALL UPPER CASE: Any lower case characters are converted to the equivalent upper case characters
all lower case: Any upper case characters are converted to the equivalent lower case characters
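
As a rough illustration, the three options behave like Python's built-in case conversions. The function name `apply_capitalize_mode` is hypothetical, and the mode values `none`, `upper` and `lower` are taken from the scripting property table on this page. Python's handling of non-English characters may differ from the node's.

```python
def apply_capitalize_mode(text, capitalize_mode="none"):
    """Approximate the Capitalize setting on a single string."""
    if capitalize_mode == "none":
        return text          # Leave unchanged
    if capitalize_mode == "upper":
        return text.upper()  # ALL UPPER CASE
    if capitalize_mode == "lower":
        return text.lower()  # all lower case
    raise ValueError("unknown capitalize_mode: {0!r}".format(capitalize_mode))


print(apply_capitalize_mode("Main Street 42b", "upper"))  # MAIN STREET 42B
```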

Character Categories

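Categories

This selects the character categories that Category handling applies to: Upper case English characters, Lower case English characters, Digits, Punctuation, Blanks, Spaces and Non-printing characters.

Category handling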

This specifies how character categories should be handled:

Remove selected categories: (the default setting) characters in the selected categories are removed and all other characters are kept
Keep selected categories and remove others: characters in the selected categories are kept and all other characters are removed
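
The difference between the two modes can be sketched as a character filter. This is an illustration only: the function name `filter_categories` is hypothetical, only three categories are modelled, and their membership tests (ASCII letters and digits) are assumptions, so the node's exact category definitions may differ.

```python
import string

# Assumed membership tests for three of the categories.
CATEGORIES = {
    "upper_english": lambda ch: ch in string.ascii_uppercase,
    "lower_english": lambda ch: ch in string.ascii_lowercase,
    "digits": lambda ch: ch in string.digits,
}


def filter_categories(text, selected, categories_mode="remove"):
    """Remove or keep the characters that fall in the selected categories."""
    def in_selected(ch):
        return any(CATEGORIES[name](ch) for name in selected)

    if categories_mode == "remove":
        return "".join(ch for ch in text if not in_selected(ch))
    if categories_mode == "keep":
        return "".join(ch for ch in text if in_selected(ch))
    raise ValueError("unknown categories_mode: {0!r}".format(categories_mode))


print(filter_categories("AB-12 cd", ["digits"], "keep"))    # 12
print(filter_categories("AB-12 cd", ["digits"], "remove"))  # AB- cd
```

With no categories selected, the default `remove` mode leaves the string unchanged, which matches the defaults having no effect.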

Examples

All examples assume that any settings not listed are left at their defaults.

Clean phone numbers

This removes any non-digit characters from phone number strings.

Clean fields: MobilePhone

Output suffix: _Cleaned

Digits: checked

Category handling: Keep selected categories and remove others

MobilePhone	MobilePhone_Cleaned	Notes
`+44 1234 56789`	`44123456789`	–
`(555) 567890`	`555567890`	–
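
The two rows above can be reproduced outside the node with a regular expression that drops everything except ASCII digits. This is a sketch for checking expected results, not the node's implementation; `keep_digits` is a hypothetical helper name.

```python
import re


def keep_digits(value):
    """Drop every character that is not an ASCII digit."""
    return re.sub(r"[^0-9]", "", value)


rows = ["+44 1234 56789", "(555) 567890"]
cleaned = [keep_digits(value) for value in rows]
print(cleaned)  # ['44123456789', '555567890']
```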

Scripting

Settings

Node type name: regexp_cleaner

Setting	Property	Type	Comment
Clean fields	`clean_fields`	String List	–
Output suffix	`output_suffix`	String	–
Leading and trailing spaces	`trim_mode`	`none`, `left`, `right` or `both`	–
Replace tab with space	`replace_tabs`	Boolean	–
Replace duplicate space or tab with space	`replace_duplicate_blanks`	Boolean	–
Capitalize	`capitalize_mode`	`none`, `upper` or `lower`	–
Upper case English characters	`find_upper_english_chars`	Boolean	–
Lower case English characters	`find_lower_english_chars`	Boolean	–
Digits	`find_digits`	Boolean	–
Punctuation	`find_punctuation`	Boolean	–
Blanks	`find_blanks`	Boolean	–
Spaces	`find_spaces`	Boolean	–
Non-printing characters	`find_non_printing_chars`	Boolean	–
Category handling	`categories_mode`	`remove` or `keep`	–

Scripting Example

# Create a String Cleaner node on the stream canvas at position (512, 192)
node = modeler.script.stream().createAt("regexp_cleaner", u"String Cleaner", 512, 192)

# Fields
node.setPropertyValue("clean_fields", [u"HomePhone", u"MobilePhone"])
node.setPropertyValue("output_suffix", u"_processed")

# Whitespace
node.setPropertyValue("trim_mode", u"both")
node.setPropertyValue("replace_tabs", True)
node.setPropertyValue("replace_duplicate_blanks", True)

# Capitalization
node.setPropertyValue("capitalize_mode", u"none")

# Character categories: keep digits and remove everything else
node.setPropertyValue("find_upper_english_chars", False)
node.setPropertyValue("find_lower_english_chars", False)
node.setPropertyValue("find_digits", True)
node.setPropertyValue("find_punctuation", False)
node.setPropertyValue("find_blanks", False)
node.setPropertyValue("find_spaces", False)
node.setPropertyValue("find_non_printing_chars", False)
node.setPropertyValue("categories_mode", u"keep")
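
When several nodes need the same configuration, the repeated `setPropertyValue` calls can be driven from a dictionary. The sketch below is one possible way to do this. `configure_node` and `RecordingNode` are hypothetical names: `RecordingNode` is only a stand-in so that the snippet runs outside the host application, and in a real script you would pass the node returned by `createAt` instead.

```python
def configure_node(node, settings):
    """Apply each property/value pair to the node and return it."""
    for prop, value in settings.items():
        node.setPropertyValue(prop, value)
    return node


class RecordingNode(object):
    """Stand-in (not part of the product) that records the values it is given."""

    def __init__(self):
        self.properties = {}

    def setPropertyValue(self, name, value):
        self.properties[name] = value


# Settings from the "Clean phone numbers" example on this page.
phone_settings = {
    "clean_fields": ["MobilePhone"],
    "output_suffix": "_Cleaned",
    "find_digits": True,
    "categories_mode": "keep",
}

node = configure_node(RecordingNode(), phone_settings)
print(sorted(node.properties))
```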
