StanfordNLP NE Tagger Joiner

I am trying to perform Named Entity Recognition on a set of documents, but after the tagging I lose my row ID, which means I can't join the results back.

The information comes in as rows of strings, which I convert to documents using the 'String To Doc' node. I assign a row ID to each document, then use the 'StanfordNLP NE tagger' to perform the NE Recognition, create a bag of words, and use a splitter to get the useful information. I want to join the results back to the original input, and I use the document column to do this. But I am getting back an empty table, even though it is the exact same set of documents.

Any help here would be appreciated.

I am attaching an image of my flow.

Hi Sanmit,

You are trying to join the original documents to the tagged documents. Since the tagged documents also include the tagging information, they can't be matched to the original documents.

Therefore, you should connect the bottom input of your Joiner to the StanfordNLP Tagger rather than to the RowID node. If you want to continue working with the documents without tags, you can use a Tag Stripper node after joining.
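
To illustrate why the join comes back empty, here is a minimal Python sketch. It is a toy model, not KNIME's implementation: `Doc`, `tag`, `strip_tags` and `inner_join` are hypothetical names, and it simply assumes that two documents count as equal only when both their text and their tags match. The real document comparison may involve more than that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Toy document: equality covers the text AND the attached tags."""
    text: str
    tags: tuple = ()


def tag(doc):
    # Hypothetical stand-in for an NE tagger: returns a copy with one tag added.
    return Doc(doc.text, tags=(("example term", "EXAMPLE_TAG"),))


def strip_tags(doc):
    # Hypothetical stand-in for a tag stripper: returns a copy without tags.
    return Doc(doc.text)


def inner_join(left, right):
    # Keep only the left documents that have an equal document on the right.
    right_set = set(right)
    return [d for d in left if d in right_set]


originals = [Doc("first example document"), Doc("second example document")]
tagged = [tag(d) for d in originals]

# Untagged vs. tagged: no document is equal to its tagged copy.
joined_on_tagged = inner_join(originals, tagged)

# Untagged vs. stripped: in this toy model the documents match again.
joined_on_stripped = inner_join(originals, [strip_tags(d) for d in tagged])

print(len(joined_on_tagged), len(joined_on_stripped))
```

In this model the first join is empty because every tagged document differs from its untagged original, while the second join matches once the tags are removed.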

I hope that helps!

Cheers,

Roland

Thanks for the reply, Roland.

I use the RowID for inserting the data into a database, which is also indexed on RowID. That is the reason I am trying to join on the RowID node.

I tried using the Tag Stripper node to perform the join, but it still returned an empty data table.

Snapshot attached

Hi Sanmit,

Can you please try again without the Tag Stripper that connects directly to the RowID node? It's possible that the Tag Stripper changes something in the documents so that they no longer match afterwards.

If that doesn't help, can you please post your workflow here with a small sample of your data? I can then take a closer look at what is happening.

Cheers,

Roland

Roland,

Sorry for the late reply; I was out on vacation.

I tried what you suggested, but it didn't work. I am attaching the project here along with a small data set.

Here is the dataset.

You've exported a preference file, not the workflow.