We added two modes to simple_word_tokenize:
Compact (limited charset)
Full (full charset from camel_tools)
The plan is to expose the mode both as a CLI option and through the API.
In the future, we could also support user-provided custom charsets, or ship more specific charsets out of the box.
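
To make the idea concrete, here is a rough sketch of how a mode switch plus an optional custom charset could look. It is not the actual camel_tools implementation: the function name `sketch_tokenize`, the `mode` and `charset` parameters, and the contents of both charsets are assumptions made up for illustration. Characters inside the active charset are grouped into word tokens, and every other non-whitespace character becomes its own token.

```python
import re


def _char_range(first, last):
    """Return the set of characters from first to last, inclusive."""
    return {chr(cp) for cp in range(ord(first), ord(last) + 1)}


# Hypothetical charsets, for illustration only. The real ones would come
# from camel_tools and are likely larger.
COMPACT_CHARSET = (
    _char_range("a", "z")
    | _char_range("A", "Z")
    | _char_range("0", "9")
    | _char_range("\u0621", "\u063A")  # Arabic letters, hamza to ghain
    | _char_range("\u0641", "\u064A")  # Arabic letters, feh to yeh
)
FULL_CHARSET = (
    COMPACT_CHARSET
    | _char_range("\u064B", "\u0652")  # Arabic diacritics
    | _char_range("\u0660", "\u0669")  # Arabic-Indic digits
)

_MODES = {"compact": COMPACT_CHARSET, "full": FULL_CHARSET}


def sketch_tokenize(text, mode="full", charset=None):
    """Split text into word tokens and single-character tokens.

    mode selects a built-in charset; charset (hypothetical) overrides it
    with any iterable of characters supplied by the caller.
    """
    if charset is None:
        if mode not in _MODES:
            raise ValueError(f"unknown mode: {mode!r}")
        charset = _MODES[mode]
    # Escape each character so it is safe inside a regex character class.
    char_class = "".join(re.escape(ch) for ch in sorted(set(charset)))
    # A run of in-charset characters is one token; any other
    # non-whitespace character is emitted as a token of its own.
    pattern = re.compile(f"[{char_class}]+|\\S")
    return pattern.findall(text)
```

With these assumed charsets, a diacritized Arabic word stays in one piece in full mode, while compact mode splits it at every diacritic because diacritics are outside the compact charset. Passing `charset=` directly is one way the custom-charset idea could be exposed later.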