String Cleaner Node

Overview
Settings
Examples
Scripting

Overview

The String Cleaner node provides common string cleaning operations in a simple-to-use form. Operations include removing non-standard characters, trimming leading and/or trailing whitespace, and changing capitalization. These operations can be applied across multiple input fields, and a new string output field is created for each input field.

Settings

The node settings are split into related sub-groups.

Fields

Clean fields

This is used to select which string fields should be cleaned.

Output suffix

New field names are generated by joining the name of each field selected in Clean fields to the output suffix.

Clean fields: HomePhone, MobilePhone

Output suffix: _Cleaned

Output fields generated: HomePhone_Cleaned, MobilePhone_Cleaned
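
The naming rule can be sketched in plain Python. This only illustrates the concatenation described above; output_field_names is a hypothetical helper and not part of the node.

```python
def output_field_names(clean_fields, output_suffix):
    """Return one output name per input field: field name + suffix."""
    return [name + output_suffix for name in clean_fields]


names = output_field_names(["HomePhone", "MobilePhone"], "_Cleaned")
print(names)  # ['HomePhone_Cleaned', 'MobilePhone_Cleaned']
```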

Whitespace

Leading and trailing spaces

This specifies how strings should be trimmed:

  • None: (the default setting) the string is not trimmed
  • Left: removes spaces at the start of the string
  • Right: removes spaces at the end of the string
  • Both: removes spaces at the start and end of the string
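
As a rough illustration, the four trim modes can be mimicked in plain Python. This is a sketch, not the node's implementation: the trim function is hypothetical, and it assumes that only the space character is stripped, since the options above mention spaces rather than whitespace in general.

```python
def trim(value, mode="none"):
    """Trim space characters from a string according to the given mode."""
    if mode == "none":
        return value
    if mode == "left":
        return value.lstrip(" ")   # assumption: strip spaces only
    if mode == "right":
        return value.rstrip(" ")
    if mode == "both":
        return value.strip(" ")
    raise ValueError("unknown trim mode: %r" % (mode,))


print(repr(trim("  abc  ", "both")))  # 'abc'
```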

Replace tab with space

When checked, tab characters will be replaced with space characters.

Replace duplicate space or tab with space

When checked, 2 or more adjacent space or tab characters will be replaced with a single space character.
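
The two whitespace options above might be pictured with the following plain Python sketch. The function name is hypothetical, and the order in which the two options are applied (tabs first, then duplicates) is an assumption; the node's actual order is not stated here.

```python
import re


def normalize_blanks(value, replace_tabs=False, replace_duplicate_blanks=False):
    """Apply the tab and duplicate-blank replacements to a string."""
    if replace_tabs:
        # Every tab character becomes a single space.
        value = value.replace("\t", " ")
    if replace_duplicate_blanks:
        # A run of two or more spaces/tabs becomes a single space.
        value = re.sub(r"[ \t]{2,}", " ", value)
    return value


print(repr(normalize_blanks("a\t\tb   c", True, True)))  # 'a b c'
```

Note that in this sketch a single tab is left alone when only the duplicate option is enabled, because it is not part of a run of two or more blanks.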

Capitalization

Capitalize

This specifies how character case should be changed in the string:

  • Leave unchanged: (the default setting) character cases are not modified
  • ALL UPPER CASE: Any lower case characters are converted to the equivalent upper case characters
  • all lower case: Any upper case characters are converted to the equivalent lower case characters
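
In plain Python, the three capitalization choices correspond roughly to the sketch below. The change_case function is hypothetical; it uses Python's own upper() and lower(), which also convert non-English letters, and whether the node does the same is not stated here.

```python
def change_case(value, mode="none"):
    """Return the string unchanged, upper-cased or lower-cased."""
    if mode == "none":
        return value
    if mode == "upper":
        return value.upper()
    if mode == "lower":
        return value.lower()
    raise ValueError("unknown capitalize mode: %r" % (mode,))


print(change_case("Flat 2b", "upper"))  # FLAT 2B
```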

Character Categories

Categories

This section lists various character categories which can be checked or unchecked.

  • Upper case English characters: characters representing the letters A to Z
  • Lower case English characters: characters representing the letters a to z
  • Digits: characters representing the numbers 0 to 9
  • Punctuation: punctuation characters are !'#$%&'()*+,-./:;<=>?@[/]^_{|}~
  • Blanks: space or tab characters
  • Spaces: space, tab, new line, vertical tab, form feed or carriage return characters
  • Non-printing characters: other characters that are not normally visible but can sometimes be included in strings

Category handling

This specifies how character categories should be handled:

  • Remove selected categories: (the default setting) characters belonging to any checked category are removed from the string
  • Keep selected categories and remove others: characters belonging to a checked category are kept and all other characters are removed
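
The two handling modes can be illustrated with a small plain Python sketch. It is not the node's implementation: apply_categories and the category keys are hypothetical names, the character sets come from Python's string module as an approximation of the categories listed above, and non-printing characters are left out because their exact definition is not given.

```python
import string

# Approximate character sets for the categories described above.
CATEGORIES = {
    "upper_english": string.ascii_uppercase,
    "lower_english": string.ascii_lowercase,
    "digits": string.digits,
    "punctuation": string.punctuation,
    "blanks": " \t",
    "spaces": string.whitespace,  # space, tab, newline, CR, VT, FF
}


def apply_categories(value, selected, mode="remove"):
    """Remove or keep the characters that fall in the selected categories."""
    chars = set("".join(CATEGORIES[name] for name in selected))
    if mode == "remove":
        return "".join(ch for ch in value if ch not in chars)
    if mode == "keep":
        return "".join(ch for ch in value if ch in chars)
    raise ValueError("unknown categories mode: %r" % (mode,))


print(apply_categories("Ab 12!", ["punctuation", "blanks"], "remove"))  # Ab12
print(apply_categories("Ab 12!", ["digits"], "keep"))                   # 12
```

With no categories selected and the mode set to remove, the sketch returns the string unchanged.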

Examples

All examples assume other settings are set to default.

Clean phone numbers

This removes any non-digit characters from phone number strings.

Clean fields: MobilePhone

Output suffix: _Cleaned

Digits: checked

Category handling: Keep selected categories and remove others

MobilePhone       MobilePhone_Cleaned   Notes
+44 1234 56789    44123456789
(555) 567890      555567890
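
The results in this example can be reproduced with a few lines of plain Python. This is only a check of the expected values under the settings above; keep_digits is a hypothetical helper, not something the node exposes.

```python
DIGITS = "0123456789"


def keep_digits(value):
    """Keep only the characters 0 to 9 and drop everything else."""
    return "".join(ch for ch in value if ch in DIGITS)


for raw in ["+44 1234 56789", "(555) 567890"]:
    print("%s -> %s" % (raw, keep_digits(raw)))
# +44 1234 56789 -> 44123456789
# (555) 567890 -> 555567890
```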

Scripting

Settings

Node type name: regexp_cleaner

Setting                                      Property                    Type                         Comment
Clean fields                                 clean_fields                String List
Output suffix                                output_suffix               String
Trim                                         trim_mode                   none, left, right or both
Replace tab with space                       replace_tabs                Boolean
Replace duplicate space or tab with space    replace_duplicate_blanks    Boolean
Capitalize                                   capitalize_mode             none, upper or lower
Upper case English characters                find_upper_english_chars    Boolean
Lower case English characters                find_lower_english_chars    Boolean
Digits                                       find_digits                 Boolean
Punctuation                                  find_punctuation            Boolean
Blanks                                       find_blanks                 Boolean
Spaces                                       find_spaces                 Boolean
Non-printing characters                      find_non_printing_chars     Boolean
Category handling                            categories_mode             remove or keep

Scripting Example

# Create a String Cleaner node in the current stream at canvas position (512, 192)
node = modeler.script.stream().createAt("regexp_cleaner", u"String Cleaner", 512, 192)

# Fields to clean and the suffix used for the generated output fields
node.setPropertyValue("clean_fields", [u"HomePhone", u"MobilePhone"])
node.setPropertyValue("output_suffix", u"_processed")

# Whitespace and capitalization settings
node.setPropertyValue("trim_mode", u"both")
node.setPropertyValue("replace_tabs", True)
node.setPropertyValue("replace_duplicate_blanks", True)
node.setPropertyValue("capitalize_mode", u"none")

# Character categories: select digits only and keep them, removing everything else
node.setPropertyValue("find_upper_english_chars", False)
node.setPropertyValue("find_lower_english_chars", False)
node.setPropertyValue("find_digits", True)
node.setPropertyValue("find_punctuation", False)
node.setPropertyValue("find_blanks", False)
node.setPropertyValue("find_spaces", False)
node.setPropertyValue("find_non_printing_chars", False)
node.setPropertyValue("categories_mode", u"keep")
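
When the same configuration is reused for several nodes, the repeated setPropertyValue calls could be driven from a dictionary. The sketch below is one possible arrangement, not a documented pattern: apply_settings is a hypothetical helper, and RecordingNode is a stand-in class that exists only so the sketch runs outside the scripting environment. In a real script the node returned by createAt would be passed instead.

```python
def apply_settings(node, settings):
    """Call node.setPropertyValue once for each entry in settings."""
    for name in sorted(settings):
        node.setPropertyValue(name, settings[name])


class RecordingNode(object):
    """Hypothetical stand-in that records the properties set on it."""

    def __init__(self):
        self.properties = {}

    def setPropertyValue(self, name, value):
        self.properties[name] = value


# Property names and values taken from the settings table above.
phone_settings = {
    "clean_fields": [u"HomePhone", u"MobilePhone"],
    "output_suffix": u"_processed",
    "find_digits": True,
    "categories_mode": u"keep",
}

node = RecordingNode()
apply_settings(node, phone_settings)
print(node.properties["categories_mode"])  # keep
```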