I am facing issues with the Tag Filter node. I expect it to isolate the identified tags, but quite frequently it just concatenates all of them. Here are a couple of examples:
1) Using dictionary tagger + Tag filter (NE-UNKNOWN)
Text:
"To investigate the cystathione beta synthase (CBS) gene T833C, G919A, 844ins68 polymorphisms and plasma homocysteine (Hcy) levels in ethnic Uyghur and Han patients with essential hypertension (EH) in Xinjiang."
Expected output after Tagger + Tag Filter:
"CBS", "homocysteine hypertension" (these two have been pre-loaded in the dictionary)
Actual output:
"CBShomocysteine hypertension"
2) Using Abner Tagger + Tag filter (ABNER-PROTEIN)
Text:
"In conclusion, we provide evidence of a frequent epigenetic inactivation of RSK4, SPARC, PROM1, HOXA10, HOXA9, WT1-AS, SFRP2, SFRP5, OPCML, and MIR34B in the development of non-serous ovarian carcinomas of Lynch and sporadic origin, as compared to serous tumors."
I assume that by 'Actual output: "CBShomocysteine hypertension"' you mean the table view of the Tag Filter node's output table, is this correct? The "concatenation" happens only in the view. The reason is that there is no whitespace after the token CBS (the next token is ")"). If CBS remains after filtering, the table view shows it directly next to the following token that also remains, without any whitespace in between. The tokenization itself is not changed, however: even if CBS and the following token are shown without whitespace, they still remain two separate tokens.
The easiest way to check the tokenization exactly is to create a bag of words (after filtering).
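To illustrate the difference between what the view displays and what the document actually contains, here is a small toy model in Python. It is not KNIME's implementation: the `Token` class, the `space_after` flag, and the `render` and `bag_of_words` helpers are hypothetical names made up for this sketch, and it assumes that "homocysteine" and "hypertension" are tagged as two separate tokens. It only shows how tokens that carry their own trailing-whitespace information can look concatenated when rendered, while still being separate tokens.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    # Hypothetical token model: each token remembers its tag (if any)
    # and whether whitespace followed it in the original text.
    text: str
    tag: Optional[str] = None
    space_after: bool = True


def render(tokens: List[Token]) -> str:
    # Mimics a view that joins tokens using only their own whitespace flag.
    return "".join(t.text + (" " if t.space_after else "") for t in tokens).strip()


def bag_of_words(tokens: List[Token]) -> List[str]:
    # One entry per token, independent of how the tokens are displayed.
    return [t.text for t in tokens]


# Simplified excerpt of the example sentence (assumed tokenization).
tokens = [
    Token("synthase"),
    Token("(", space_after=False),
    Token("CBS", tag="NE-UNKNOWN", space_after=False),  # followed by ")"
    Token(")"),
    Token("gene"),
    Token("plasma"),
    Token("homocysteine", tag="NE-UNKNOWN"),
    Token("levels"),
    Token("essential"),
    Token("hypertension", tag="NE-UNKNOWN"),
]

# Keep only the tagged tokens, as a tag filter would.
filtered = [t for t in tokens if t.tag == "NE-UNKNOWN"]

print(render(filtered))        # what a view might display
print(bag_of_words(filtered))  # the tokens that actually remain
```

In this sketch the rendered string runs "CBS" and "homocysteine" together because "CBS" had no whitespace after it in the source text, yet the bag of words still lists three separate terms.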