Problems with tag filter

I am having trouble with the Tag Filter node. I expect it to isolate the tagged terms, but quite frequently it just concatenates all of them. Here are a couple of examples:

 

1) Using Dictionary Tagger + Tag Filter (NE-UNKNOWN)

Text:

"To investigate the cystathione beta synthase (CBS) gene T833C, G919A, 844ins68 polymorphisms and plasma homocysteine (Hcy) levels in ethnic Uyghur and Han patients with essential hypertension (EH) in Xinjiang."

Expected output after Tagger + Tag Filter:

"CBS", "homocysteine hypertension"    (these two have been pre-loaded in the dictionary)

Actual output:

"CBShomocysteine hypertension"

 

2) Using Abner Tagger + Tag Filter (ABNER-PROTEIN)

Text:

"In conclusion, we provide evidence of a frequent epigenetic inactivation of RSK4, SPARC, PROM1, HOXA10, HOXA9, WT1-AS, SFRP2, SFRP5, OPCML, and MIR34B in the development of non-serous ovarian carcinomas of Lynch and sporadic origin, as compared to serous tumors."

Expected output:

"SPARC", "PROM1", "OPCML"

Actual output:

"SPARCPROM1OPCML"

 

Any ideas?

 

Cheers,

Fernando

 

Hi Fernando,

I assume that by 'Actual output: "CBShomocysteine hypertension"' you mean the table view of the Tag Filter node's output table, is this correct? The "concatenation" only happens in the view. This is because there is no whitespace after the token CBS (the next token is ")"). If CBS remains after filtering, the table view shows it directly next to the following token that also remains, without any whitespace in between. The tokenization itself is not changed, however: even if CBS and the following token are shown without whitespace, they are still two separate tokens.

The easiest way to exactly check the tokenization is to create a bag of words (after filtering).
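
To illustrate the effect outside of KNIME, here is a minimal Python sketch. It is not how the Tag Filter node is implemented; the `tokenize` and `render` helpers and the regex-based tokenizer are assumptions made purely for illustration. Each token remembers the whitespace that followed it in the original text, so a filtered view glues CBS to the next remaining token while the token list itself stays separate.

```python
import re

def tokenize(text):
    # Hypothetical tokenizer: words and single punctuation marks become tokens,
    # each paired with the whitespace that followed it in the original text.
    return [(m.group(1), m.group(2))
            for m in re.finditer(r"(\w+|[^\w\s])(\s*)", text)]

def render(tokens):
    # Mimics a view that prints each token with its original trailing whitespace.
    return "".join(tok + ws for tok, ws in tokens).strip()

text = ("To investigate the cystathione beta synthase (CBS) gene T833C, G919A, "
        "844ins68 polymorphisms and plasma homocysteine (Hcy) levels in ethnic "
        "Uyghur and Han patients with essential hypertension (EH) in Xinjiang.")

# Stand-in for the tagged terms that survive the filter (assumed for this sketch).
tagged = {"CBS", "homocysteine", "hypertension"}

kept = [(tok, ws) for tok, ws in tokenize(text) if tok in tagged]

print(render(kept))                # the "view": CBS has no trailing whitespace
print([tok for tok, _ in kept])    # the actual tokens, still separate
```

The first line printed corresponds to what the table view shows, the second to what a bag of words would contain.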

I hope this helps.

Cheers, Kilian

I see. Thanks Kilian. As a suggestion, it would be really helpful to have a node showing the actual tokens one by one in original sentence order.

 

Cheers,

Fernando

 

You are right, showing the tokenization would be a nice option for the Document Viewer node. I have opened a ticket for that.

Cheers, Kilian
 
