Number Filter

Hi,

Any suggestions on how to filter numbers separated by blanks? E.g. 12 12 2016 -> ""

Thanks

Hi,

There are two options here. If you want to replace the blanks with something else, you should use the String Manipulation node. If you want to filter the rows in which the blanks occur, you should use the Row Filter node.

Best,

Roland

Hi alfroc,

are the numbers split up into different terms, or are they one term? E.g. is "12 12 2016" one term?

Cheers, Kilian

Hi Kilian,

if a row in a document contains "Today, 12/12/2016, is a nice day", then after executing the Replacer node you get "Today 12 12 2016 is a nice day". I'd like to remove all numeric characters to get "Today is a nice day".

Cheers, Alfredo

The problem in this case is that "12/12/2016" is one term. After replacing "/" with a whitespace, "12 12 2016" is still one term, so the Number Filter does not recognize it as a number because of the whitespace. As a workaround, you could extract all necessary data with the Document Data Extractor and rebuild the document with the Strings To Document node (this node tokenizes the sentences and creates the terms).

The second possibility is to use a different tokenizer in the first place. When you create your documents with the Strings To Document node or one of the Document Source nodes, select the OpenNLP SimpleTokenizer. This tokenizer turns every sequence of characters of the same character class into one term.

Example: Today, 12/12/2016, is a nice day

"Today" -> Term, "," -> Term, "12" -> Term, "/" -> Term, "12" -> Term, "/" -> Term, "2016" -> Term, "," -> Term, "is" -> Term, etc.

When you use this tokenizer for document creation, you can apply the Number Filter and the Replacer node (for the "/" chars) afterwards and it should work fine. But watch out: if you want some "-" separated words to be one term, don't use the SimpleTokenizer.
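
To make the tokenizer behaviour concrete, here is a minimal Python sketch of character-class tokenization as described above. It is only an approximation for illustration, not the OpenNLP implementation, and the names char_class, simple_tokenize and drop_numbers_and_slashes are made up for this example. It assumes that letters, digits and whitespace each form one class and that every distinct punctuation symbol is its own class.

```python
from itertools import groupby


def char_class(ch):
    """Return a class label for one character (assumed classes, see lead-in)."""
    if ch.isspace():
        return "space"
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    # Punctuation and other symbols: each distinct symbol is its own class.
    return ch


def simple_tokenize(text):
    """Split text into runs of characters of the same class, dropping whitespace."""
    tokens = []
    for key, run in groupby(text, key=char_class):
        if key != "space":
            tokens.append("".join(run))
    return tokens


def drop_numbers_and_slashes(tokens):
    """Stand-in for a number filter plus removing the "/" terms."""
    return [t for t in tokens if not t.isdigit() and t != "/"]


tokens = simple_tokenize("Today, 12/12/2016, is a nice day")
print(tokens)
print(drop_numbers_and_slashes(tokens))
print(simple_tokenize("well-known"))
```

With this kind of tokenization, "12", "12" and "2016" are separate terms, so a filter that only looks at whole terms can remove them. The last call shows the downside mentioned above: a hyphenated word such as "well-known" is split into three terms.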

An example workflow is attached. 

Cheers,

Julian

Hi alfroc,

you can also do this more easily by using the Regex Filter node with the expression ".*\d+.*". This will filter out all terms that contain digits. Attached is an example workflow.
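
For illustration, here is a small Python sketch of the difference between a whole-term number check and the expression above. It is not the KNIME node code. It assumes the expression has to match the whole term (which the leading and trailing ".*" suggest), and the helper names is_plain_number and regex_filter as well as the plain-number pattern are assumptions made for this example.

```python
import re

# The expression from the post: any term that contains at least one digit.
DIGIT_TERM = re.compile(r".*\d+.*")

# Assumed notion of a "plain number": optional sign, digits, optional decimals.
PLAIN_NUMBER = re.compile(r"[+-]?\d+([.,]\d+)?")


def is_plain_number(term):
    """True only if the whole term looks like a single number."""
    return PLAIN_NUMBER.fullmatch(term) is not None


def regex_filter(terms):
    """Keep only the terms that contain no digit at all."""
    return [t for t in terms if DIGIT_TERM.fullmatch(t) is None]


# "12 12 2016" is a single term, as in the thread.
terms = ["Today", "12 12 2016", "is", "a", "nice", "day"]

print(is_plain_number("2016"))
print(is_plain_number("12 12 2016"))
print(" ".join(regex_filter(terms)))
```

The whole-term check rejects "12 12 2016" because of the blanks, while the digit expression matches it, so the term is filtered out and the remaining terms read "Today is a nice day".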

Btw. I see that this is not quite obvious and the Number Filter node should be able to handle this. I opened a ticket to upgrade the Number Filter node with an option that allows filtering not only numbers but also terms that contain digits.

Cheers, Kilian

Thank you very much, Kilian!
