How do we tag multiple entities at once in text files?

Hi,

I have a set of text files that contain product information such as manufacturer name, manufacturer ID, shipping date, product ID, product name, etc.

I would like to tag and extract the product ID and product name at once from the text files, so I used a workflow like this:
Flat file parser -> partitioning into training set and test set -> StanfordNLP NE Learner (other input is a dictionary of product IDs) -> StanfordNLP NE Tagger (other input is the test set)

This workflow extracts product ids. I repeated the same for product names. That worked fine as well.

I would like to extract both product ID and product name at once. Can we tag multiple entities (product ID and product name) at once in a document? If so, how? Could someone please share some knowledge on this?

Thanks,
Swatkat9

Hi,

You can simply use multiple tagger nodes in the same workflow. E.g., you first tag for product IDs, then for product names.

Cheers,
Roland

Thanks for the reply Roland.

One point worth mentioning is that product name and product ID are related. There could be multiple products in the same text file, and we would want to extract the product ID and product name of each one.

Consider this text file:

product name Macbook pro Mouse product ID 1000
product name Macbook pro Keyboard product ID 1001

Here I would like to extract:

  1. Macbook pro Mouse, 1000
  2. Macbook pro Keyboard, 1001

When I use multiple tagger nodes, the output could instead be:

  1. Macbook pro Mouse, 1001
  2. Macbook pro Keyboard, 1000

This can happen because the tagger nodes tag names and IDs separately. Is there a solution for this?

Thanks,
Swatkat9

Hi @swatkat9,

If your text file is structured the way you describe, I would not use the Text Processing nodes.
I would:
1.) Read the file with the Line Reader node (so every line is represented by an individual row).
2.) Use the Regex Split node with the following pattern:
product name (.*?) product id (.*)
I also enabled the Ignore Case option, since the pattern says "product id" but the file says "product ID".
I hope the result is what you expect.
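
For anyone who wants to check the pattern outside KNIME first, here is a minimal Python sketch of the same idea. It is not what the Regex Split node does internally; the helper name `split_line` and the sample lines (taken from the example above) are only for illustration.

```python
import re

# Same pattern as above; re.IGNORECASE plays the role of the Ignore Case option.
PATTERN = re.compile(r"product name (.*?) product id (.*)", re.IGNORECASE)

def split_line(line):
    """Return (name, id) for one line, or None if the line does not match."""
    match = PATTERN.search(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

# One row per line, as the Line Reader node would deliver them.
lines = [
    "product name Macbook pro Mouse product ID 1000",
    "product name Macbook pro Keyboard product ID 1001",
]
pairs = [split_line(line) for line in lines]
```

Because name and ID come out of the same match on the same line, they cannot get mixed up between products.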

Best,
Michael

Hi Michael,

The text file I mentioned was only meant to illustrate the problem I am facing.

In the actual text files, product names and product IDs are scattered among other product-related information, spread across multiple lines in unstructured, natural-language text.

The solution I am looking for is to tag product ID and product name while establishing the relationship between these entities, and then apply a learner node. Is something like that feasible in KNIME? Please advise.

Thanks,
Swatkat9

Hi @swatkat9,

One question: you said that product names and product IDs are related. Do you know how the IDs and names relate? Do you have a list of ID to product name associations available?
If so, I don't think it matters much in which order or combination you extract the IDs and names, because you can check how they should relate and correct any errors afterwards. Or not?
Anyway, in this case I would probably do the following:
If you have a list of product ID -> product name mappings, you can add an additional ID column, e.g. call it association ID. Replace both product names and product IDs in the original text with the association ID (e.g. with the Dict Replacer node) and extract it (by tagging, regex matching, etc.). Afterwards you can group by the association ID, join the association table, and see which products were mentioned in the original message. Could that (unconventional) approach help?
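
To make the idea a bit more concrete, here is a rough Python sketch of the replace, extract, group and join steps. It assumes a mapping table exists; the `ASSOC_n` tokens, the table contents and the function names are all made up for illustration, and in KNIME the same steps would be done with nodes rather than code.

```python
import re

# Hypothetical association table: association id -> (product id, product name).
ASSOCIATIONS = {
    "ASSOC_1": ("1000", "Macbook pro Mouse"),
    "ASSOC_2": ("1001", "Macbook pro Keyboard"),
}

def replace_with_association_ids(text):
    """Replace every known product name or product ID with its association id."""
    for assoc_id, (product_id, product_name) in ASSOCIATIONS.items():
        text = re.sub(re.escape(product_name), assoc_id, text, flags=re.IGNORECASE)
        # Word boundaries keep an ID from matching inside a longer number.
        text = re.sub(r"\b" + re.escape(product_id) + r"\b", assoc_id, text)
    return text

def mentioned_products(text):
    """Extract the association ids, deduplicate them, and join the table back."""
    replaced = replace_with_association_ids(text)
    found = sorted(set(re.findall(r"ASSOC_\d+", replaced)))
    return [ASSOCIATIONS[assoc_id] for assoc_id in found]

text = "We shipped the Macbook pro Keyboard today. Item 1000 follows next week."
products = mentioned_products(text)
```

It does not matter whether a product is mentioned by name or by ID, since both end up as the same token before extraction.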

Hi,

A list of IDs and product names is not available. I need to extract the IDs and their corresponding names from the text files.

Let me try regex matching. I had been hoping to take advantage of machine learning, which is why I intended to use the StanfordNLP NE Learner, so that it would pick up the product IDs and names for me.

However, as in my earlier example, the learner picks up IDs and names separately: one column holds the extracted product IDs and another holds the names. What I need is a mapping between them, i.e. which product ID belongs to which product name.

Thanks,
Swatkat9

Hi, OK, this is tough. I would probably try extracting the product IDs first (I think they usually follow a pre-defined pattern). Afterwards, try to find “meaningful” co-occurrences between each ID and single terms or n-grams in the surrounding text. Not an optimal solution, but maybe worth trying.
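
As a rough illustration of what that could look like, here is a small Python sketch. It assumes the IDs are four-digit numbers and simply counts the word bigrams that appear within a few tokens of each ID across all documents. The ID pattern, the window size, the sample sentences and the function names are assumptions made for the example, not something taken from your data.

```python
import re
from collections import Counter, defaultdict

ID_PATTERN = re.compile(r"\d{4}")  # assumption: IDs are four-digit numbers

def ngrams(tokens, n):
    """Return all n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def best_ngram_per_id(documents, window=6, n=2):
    """For each ID, return the n-gram that most often occurs near it."""
    counts = defaultdict(Counter)
    for doc in documents:
        tokens = re.findall(r"\w+", doc.lower())
        for i, token in enumerate(tokens):
            if not ID_PATTERN.fullmatch(token):
                continue
            # Left and right context are handled separately so that no
            # n-gram is built across the ID itself.
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts[token].update(ngrams(left, n))
            counts[token].update(ngrams(right, n))
    return {pid: c.most_common(1)[0][0] for pid, c in counts.items() if c}

documents = [
    "The Macbook Mouse with ID 1000 shipped on Monday.",
    "Order 1000 contains one Macbook Mouse.",
    "The Macbook Keyboard, ID 1001, arrived late.",
    "We returned a Macbook Keyboard under 1001.",
]
best = best_ngram_per_id(documents)
```

On real text you would most likely need to filter stop words and require a minimum count before trusting a pairing, since frequent filler phrases can otherwise outrank the product name.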

Thanks for all your replies. Let me look into what n-grams are.

It would be very nice if KNIME offered machine learning nodes that identify multiple entities together with the relationships between them.

Thanks,
Swatkat9