String cleaner node


Overview

The String Cleaner node provides common string cleaning operations in a form that is simple to use. Operations include removing non-standard characters, trimming leading and/or trailing whitespace, and changing capitalization. These operations can be applied across multiple input fields; a new string output field is created for each input field.

Settings

The node settings are split into related sub-groups.

Fields

Clean fields

This is used to select which string fields should be cleaned.

Output suffix

New field names are generated by appending the output suffix to the name of each field selected in Clean fields. For example:

Clean fields: HomePhone, MobilePhone

Output suffix: _Cleaned

Output fields generated: HomePhone_Cleaned, MobilePhone_Cleaned
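
The naming rule above can be sketched in a few lines of Python. The helper name output_field_names is hypothetical and is not part of the node; it only illustrates how the suffix is joined to each selected field name.

```python
def output_field_names(clean_fields, output_suffix):
    """Return one output field name per selected input field.

    Illustrative helper only; the node performs this naming internally.
    """
    return [name + output_suffix for name in clean_fields]


names = output_field_names(["HomePhone", "MobilePhone"], "_Cleaned")
print(names)
```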

Whitespace

Leading and trailing spaces

This specifies how strings should be trimmed:

  • None: (the default setting) the string is not trimmed
  • Left: removes spaces at the start of the string
  • Right: removes spaces at the end of the string
  • Both: removes spaces at the start and end of the string
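
As a rough illustration, the four trim options behave like the following Python sketch. The function name trim is hypothetical, the mode names follow the trim_mode values listed under Scripting, and the sketch assumes that only the space character is stripped, since the options mention spaces rather than all whitespace.

```python
def trim(value, trim_mode="none"):
    """Trim space characters from a string according to trim_mode.

    Assumption: only " " is stripped, not tabs or new lines.
    """
    if trim_mode == "none":
        return value
    if trim_mode == "left":
        return value.lstrip(" ")
    if trim_mode == "right":
        return value.rstrip(" ")
    if trim_mode == "both":
        return value.strip(" ")
    raise ValueError("unknown trim mode: %r" % (trim_mode,))


for mode in ("none", "left", "right", "both"):
    # Brackets make the remaining leading and trailing spaces visible
    print("%-5s [%s]" % (mode, trim("  01234 567890  ", mode)))
```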

Replace tab with space

When checked, tab characters will be replaced with space characters.

Replace duplicate space or tab with space

When checked, 2 or more adjacent space or tab characters will be replaced with a single space character.
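
The two whitespace options can be approximated with the sketch below. The function name normalize_blanks is hypothetical, and the order in which the node applies the two options is an assumption (tabs are replaced first here).

```python
import re


def normalize_blanks(value, replace_tabs=False, replace_duplicate_blanks=False):
    """Approximate the two whitespace options of the node.

    Assumption: tab replacement runs before duplicate collapsing.
    """
    if replace_tabs:
        # Every tab becomes a single space
        value = value.replace("\t", " ")
    if replace_duplicate_blanks:
        # A run of 2 or more spaces and/or tabs becomes one space
        value = re.sub(r"[ \t]{2,}", " ", value)
    return value


print(repr(normalize_blanks("01234\t\t 567890", replace_duplicate_blanks=True)))
```

Note that with only the second option enabled, a single tab between two words is left in place, because it is not part of a run of two or more blank characters.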

Capitalization

Capitalize

This specifies how character case should be changed in the string:

  • Leave unchanged: (the default setting) character cases are not modified
  • ALL UPPER CASE: Any lower case characters are converted to the equivalent upper case characters
  • all lower case: Any upper case characters are converted to the equivalent lower case characters

Character Categories

Categories

This section lists various character categories which can be checked or unchecked.

  • Upper case English characters: characters representing the letters A to Z
  • Lower case English characters: characters representing the letters a to z
  • Digits: characters representing the numbers 0 to 9
  • Punctuation: punctuation characters are !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
  • Blanks: space or tab characters
  • Spaces: space, tab, new line, vertical tab, form feed or carriage return characters
  • Non-printing characters: other characters that are not normally visible but can sometimes be included in strings

Category handling

This specifies how character categories should be handled:

  • Remove selected categories: (the default setting) characters belonging to any checked category are removed from the string; all other characters are kept
  • Keep selected categories and remove others: characters belonging to any checked category are kept; all other characters are removed from the string
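
Assuming the option names describe the behavior literally, category handling amounts to filtering each character against the union of the checked categories. The sketch below is not the node's implementation: the names CATEGORIES and filter_categories are hypothetical, the category keys are invented for illustration, and the non-printing category is left out because its exact membership is not defined here.

```python
import string

# Character sets for the categories described above (illustrative keys)
CATEGORIES = {
    "upper_english": set(string.ascii_uppercase),
    "lower_english": set(string.ascii_lowercase),
    "digits": set(string.digits),
    "punctuation": set(string.punctuation),
    "blanks": set(" \t"),
    "spaces": set(" \t\n\v\f\r"),
}


def filter_categories(value, selected, categories_mode="remove"):
    """Remove or keep the characters that belong to the selected categories."""
    chars = set()
    for name in selected:
        chars |= CATEGORIES[name]
    if categories_mode == "remove":
        return "".join(c for c in value if c not in chars)
    if categories_mode == "keep":
        return "".join(c for c in value if c in chars)
    raise ValueError("unknown categories mode: %r" % (categories_mode,))


print(filter_categories("Ref: AB-12 / cd", ["punctuation", "blanks"], "remove"))
print(filter_categories("Ref: AB-12 / cd", ["upper_english", "digits"], "keep"))
```

With no categories selected and the default remove mode, the string passes through unchanged, which matches the node doing nothing to characters under default settings.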

Examples

All examples assume other settings are set to default.

Clean phone numbers

This removes any non-digit characters from phone number strings.

Clean fields: MobilePhone

Output suffix: _Cleaned

Digits: checked

Category handling: Keep selected categories and remove others

MobilePhone       MobilePhone_Cleaned    Notes
+44 1234 56789    44123456789            -
(555) 567890      555567890              -
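
The same result can be reproduced outside the node to check expected values. This is a minimal sketch using a regular expression rather than the node itself; the function name clean_phone is hypothetical.

```python
import re


def clean_phone(value):
    """Keep the digits 0 to 9 and remove every other character."""
    return re.sub(r"[^0-9]", "", value)


for raw in ["+44 1234 56789", "(555) 567890"]:
    print("%s -> %s" % (raw, clean_phone(raw)))
```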

Scripting

Settings

Node type name: regexp_cleaner

Setting                                    Property                  Type                       Comment
Clean fields                               clean_fields              String List                -
Output suffix                              output_suffix             String                     -
Trim                                       trim_mode                 none, left, right or both  -
Replace tab with space                     replace_tabs              Boolean                    -
Replace duplicate space or tab with space  replace_duplicate_blanks  Boolean                    -
Capitalize                                 capitalize_mode           none, upper or lower       -
Upper case English characters              find_upper_english_chars  Boolean                    -
Lower case English characters              find_lower_english_chars  Boolean                    -
Digits                                     find_digits               Boolean                    -
Punctuation                                find_punctuation          Boolean                    -
Blanks                                     find_blanks               Boolean                    -
Spaces                                     find_spaces               Boolean                    -
Non-printing characters                    find_non_printing_chars   Boolean                    -
Category handling                          categories_mode           remove or keep             -

Scripting Example

# Create the String Cleaner node on the current stream at canvas position (512, 192)
node = modeler.script.stream().createAt("regexp_cleaner", u"String Cleaner", 512, 192)

# Fields: the inputs to clean and the suffix used to name the outputs
node.setPropertyValue("clean_fields", [u"HomePhone", u"MobilePhone"])
node.setPropertyValue("output_suffix", u"_processed")

# Whitespace: trim both ends, replace tabs and collapse repeated blanks
node.setPropertyValue("trim_mode", u"both")
node.setPropertyValue("replace_tabs", True)
node.setPropertyValue("replace_duplicate_blanks", True)

# Capitalization: leave character case unchanged
node.setPropertyValue("capitalize_mode", u"none")

# Character categories: select digits only and keep them, removing all other characters
node.setPropertyValue("find_upper_english_chars", False)
node.setPropertyValue("find_lower_english_chars", False)
node.setPropertyValue("find_digits", True)
node.setPropertyValue("find_punctuation", False)
node.setPropertyValue("find_blanks", False)
node.setPropertyValue("find_spaces", False)
node.setPropertyValue("find_non_printing_chars", False)
node.setPropertyValue("categories_mode", u"keep")
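
When several nodes share a configuration, it can be convenient to hold the settings in a dictionary and apply them in a loop. The sketch below expresses the "Clean phone numbers" example this way. The names PHONE_CLEANER_SETTINGS, apply_settings and RecordingNode are hypothetical; RecordingNode is only a stand-in so the sketch runs outside the scripting environment, and in a real script the node returned by createAt would be passed instead.

```python
# Settings for the "Clean phone numbers" example, keyed by property name
PHONE_CLEANER_SETTINGS = {
    "clean_fields": [u"MobilePhone"],
    "output_suffix": u"_Cleaned",
    "find_digits": True,
    "categories_mode": u"keep",
}


def apply_settings(node, settings):
    """Call setPropertyValue once per entry, in a stable (sorted) order."""
    for name in sorted(settings):
        node.setPropertyValue(name, settings[name])


class RecordingNode(object):
    """Stand-in for a stream node; it only records the values it is given."""

    def __init__(self):
        self.properties = {}

    def setPropertyValue(self, name, value):
        self.properties[name] = value


node = RecordingNode()
apply_settings(node, PHONE_CLEANER_SETTINGS)
print(sorted(node.properties))
```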