Dictionary Tagging Help

I have a table like this:

item_desc        supplier
book             comp_1
red book         comp_1
pencil           comp_4
cup              comp_2
cup holder       comp_5
paper            comp_1
white paper      comp_2
red paper        comp_4
blue book        comp_3
pencil eraser    comp_5
book cover       comp_3


Basically, it is a list of items and the company that supplies them. My goal is to be able to:

1. Find out how often a supplier is matched with an item description. E.g.

red book
========
comp_1 - 80%
comp_5 - 10%
comp_3 - 10%
-------------------
total       - 100%

 

2. Find out how often an item description is matched to a supplier, i.e. the coverage of each supplier. E.g.

comp_3
======
white paper - 70%
short pencil - 20%
blue book - 10%
---------------------
total            - 100%

At the Dictionary Tagger node, I have a table that contains every unique word/term, and I chose the sentiment tag type. The problem is, since some of the terms are subsets of other terms, the node is not tagging them all correctly. E.g. after tagging, I can find "cup" in the terms but I cannot find "cup holder". I can find "book" but not "book cover". I think the tagger only takes the first instance it sees, but I'm not sure. My goal is to tag "book" when it appears and also tag "book cover" separately, because in this example they are not the same thing.

Also, if there's a suggestion about how to solve my problem, I'll appreciate it. So far, I have read the table from Oracle and filtered to the columns I need, then Strings To Document, Punctuation Erasure, Dictionary Tagger, Bag Of Words, and finally a tag filter to filter out the tagged words. That's when I realized I am missing some terms that were supposed to be tagged. Please help. Thanks.

 

okyere

Hi,

if your data set is not too large, you can use this method:

1. Add a "dummy" column with a Constant Value Column node.
2. Use a GroupBy node to count how often each pair appears, using both columns as grouping columns and the dummy column as the basis for a count aggregation.
3. In parallel, use a second GroupBy node to count how often each item appears, grouping by the item only and count-aggregating one of the other columns.
4. Join both tables on the item columns with a Joiner.

Now you have one column with the pair count and one with the item count, and all you have to do is divide one by the other with a Math Formula node and filter out garbage with a Column Filter. This may sound complicated, but I use this pattern all the time, because as simple as the task seems, I haven't found a better way yet.
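Outside KNIME, the same GroupBy/Joiner/Math Formula pattern can be sketched in a few lines of Python; this is only an illustration of the arithmetic, with one hypothetical extra "book" row added so the shares are not all 100%:

```python
from collections import Counter

# Rows of (item_desc, supplier) from the table above, plus one hypothetical
# extra "book" row so that at least one item has more than one supplier
rows = [
    ("book", "comp_1"), ("red book", "comp_1"), ("pencil", "comp_4"),
    ("cup", "comp_2"), ("cup holder", "comp_5"), ("paper", "comp_1"),
    ("white paper", "comp_2"), ("red paper", "comp_4"), ("blue book", "comp_3"),
    ("pencil eraser", "comp_5"), ("book cover", "comp_3"),
    ("book", "comp_3"),  # hypothetical duplicate item
]

pair_count = Counter(rows)                      # first GroupBy: both columns
item_count = Counter(item for item, _ in rows)  # second GroupBy: item only

# Joiner + Math Formula: pair count divided by item count, as a percentage
for (item, supplier), n in sorted(pair_count.items()):
    print(f"{item} / {supplier}: {100.0 * n / item_count[item]:.0f}%")
```

The supplier-coverage direction (goal 2) is the same computation with the roles of the two columns swapped.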

You could also use a Pivoting node after adding a dummy column, with a Missing Value node afterwards. This would get you the totals, too, in the second and third output ports. Pro: fewer steps. Con: even slower than the first method, and with a lot of overhead if the resulting matrix is sparse (i.e. if most items are only supplied by a few suppliers, or vice versa).

Concerning the tagging, I'm not sure if you can tag these more complicated terms. But you can always use some string manipulation before converting to documents, e.g. replacing "pencil eraser" with "pencilEraser" using, say, a String Replacer (Dictionary) node, a String Replacer in a loop with a pattern from a flow variable, or something similar. Afterwards, the system will simply treat both as one term. Would that help?
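The merge trick can be sketched like this; the term list and the camel-casing helper are hypothetical, just to show the idea of collapsing multi-word terms into single tokens before tokenization:

```python
import re

# Hypothetical list of multi-word terms to protect before tokenization
multi_word_terms = ["pencil eraser", "cup holder", "book cover"]

def protect_terms(text):
    # Replace each multi-word term with a single-token form, e.g.
    # "pencil eraser" -> "pencilEraser", so a tokenizer keeps it whole
    for term in multi_word_terms:
        first, rest = term.split(" ", 1)
        merged = first + rest.title().replace(" ", "")
        text = re.sub(re.escape(term), merged, text, flags=re.IGNORECASE)
    return text

print(protect_terms("the cup holder and the book cover"))
```

After tagging, the merged tokens can be mapped back to their original spellings if needed.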

Marlin,

Thanks for your helpful insight and suggestions. I took the first approach and came up with what I have attached. Please take a look and let me know what you think. 

About the tagging part, I haven't tried what you suggested yet, but I was wondering if I'll even need to convert the text to documents and tag, etc. if I'm going to stick with the (first) implementation you suggested. It seems to me that if the workflow does what I want, then I won't need tagging at all. I may still have to consolidate some items as the same, e.g. "black bag", "blk bag", and "bag blk -04 72 / 15" are all the same item. The descriptions were entered manually by someone, so depending on how they feel, they may give as much detail as possible; or not.

So let me know how I could consolidate items that I know are the "same", and also whether I will still need the conversion to documents and tagging, etc. I'm new to data mining and this tool (KNIME), so forgive me if these are newbie questions. I'm still learning :-)
Thanks.

That workflow looks right.

Btw, you can select the renderer of a column in the table views by right-clicking the column header. Among others, columns with double values offer a renderer that interprets the values as percentages. So depending on the rest of your flow, the multiplication by 100 might be unnecessary.

The Text Processing extension is really a convenience and meant for analysing a lot of documents. So whether it is convenient for you, I don't know. If not, you can do a lot with the String Replacer, String Manipulation, and Rule Engine nodes... If you go this route, it helps to know about regular expressions and the specific quirks and possibilities that Java adds to them.

Thanks for pointing out the renderer options.

Now, about the text processing part. I know a bit about regular expressions, but I don't know how I can use that to solve this problem. What I'm hoping to do is use a node (or some nodes) that can scan across the entire description field, see the "dominant" words, and then categorize each item based on that. As I tried to illustrate in my last post, I can have "bag", "black bag", "bag 094/23", "bag sm", etc. I hope there's a node or combination of nodes that will intelligently see all of them as a bag. Another example is "tube", "tube assemb.", "tube kit", "tube washer", etc. I can manually put them all under the tube category, but I'm hoping that a node or combination of nodes can help me achieve that (I have about 12 million rows of data :-)). If you know of such, please share. Even if I have to supply the keywords to look for in the categorization, that will be OK. Thanks.

What you could do is normalize your item names, i.e. replace the names with normalized versions. Of course, you would need to provide the normalized item names as a dictionary, e.g. "tube" and "bag" as in your example above.

Create regular expressions out of these normalized item names and use the String Replacer node to apply the expressions and replace the names in your dataset. Do this by looping over the expressions, using the expressions as well as the normalized item names as flow variables to control the String Replacer node. Mark replaced items and collect all replaced items at the end of the loop.
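The loop boils down to something like this minimal Python sketch; the dictionary of normalized names and the whole-word matching rule are assumptions (the KNIME loop would drive a String Replacer node with flow variables instead):

```python
import re

# Hypothetical dictionary of normalized item names; each name becomes a
# whole-word, case-insensitive pattern applied in turn to every description
normalized_names = ["bag", "tube"]

def normalize(desc):
    for name in normalized_names:
        if re.search(r"\b" + re.escape(name) + r"\b", desc, flags=re.IGNORECASE):
            return name  # replace the description with its normalized name
    return desc  # leave unmatched descriptions as they are

for d in ["black bag", "bag 094/23", "tube assemb.", "pencil"]:
    print(d, "->", normalize(d))
```

Rows that fall through unchanged correspond to the unmatched items you would collect at the end of the loop.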

Attached is an example workflow that illustrates the approach.

Cheers, Kilian

Kilian, thanks a lot for your suggestion and sample workflow. Just as I was getting comfortable, the requirements changed. So help me out here a little more. I have a table of parts (items) and four different columns that describe the item or part in column 1. Some are long descriptions, e.g. "copper hammer 20 lbs bulk only bla bla". Others are short, e.g. "steel plate". For now, I'm only trying to target descriptions of raw commodities, e.g. steel, copper plate, steel plate, raw steel. I was able to use your suggestion from above, but only with single-word regex keywords, e.g. "steel". As you'd imagine, lots of the results had noise, because it matches everything that has "steel" in the description. So the example description above will show up, but it is not what I'm looking for. I think I can't use single-word matches. So how do I do multi-word matches, like "raw copper" or "steel plate", knowing that the words can appear in any order? Only if those are matched can I be sure they describe the raw commodity I'm looking for. On top of that, I have four columns that contain the descriptions and I have to use them all. So will I need to repeat what you demonstrated four times, once for each description column?

To summarize:
1. How do I match specific multi words that can appear in different orders?
2. How do I apply the string replacer node with four different target columns looking for the same things. Thanks.

To match multiple words appearing in any order with regular expressions, use an expression like "raw copper|copper raw". For details about regular expressions in Java, please see: http://docs.oracle.com/javase/tutorial/essential/regex/
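In Python's regex dialect (close to Java's), the alternation looks like this; the lookahead variant below it is a hypothetical generalization for when more than two words must co-occur and listing every ordering becomes unwieldy:

```python
import re

# Alternation covering both orders, as suggested above
both_orders = re.compile(r"raw copper|copper raw", re.IGNORECASE)

# Hypothetical alternative that scales better: one lookahead per word,
# so the words may appear anywhere in the string, in any order
any_order = re.compile(r"(?=.*\braw\b)(?=.*\bcopper\b)",
                       re.IGNORECASE | re.DOTALL)

print(bool(both_orders.search("RAW COPPER 20 LBS")))  # adjacent words match
print(bool(any_order.search("copper, raw, bulk")))    # non-adjacent words match
```

Note the alternation only matches the two words when they are adjacent; the lookahead form also matches when other words sit between them.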

Be aware that the order in which the expressions are applied is important. Start with the expressions that contain multiple words, e.g. first "raw copper", then "copper".

You could combine the text of these four columns and do the replacement on the combined column only. To combine strings, use the String Manipulation node and the join() function.
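A sketch of the combine-then-replace idea; the row layout and column names are made up for illustration:

```python
import re

# Hypothetical row with four description columns
row = {"desc1": "STEEL PLATE", "desc2": "bulk only", "desc3": "", "desc4": "20 lbs"}

# String Manipulation join(): concatenate the four columns into one string
combined = " ".join(row[c] for c in ("desc1", "desc2", "desc3", "desc4"))

# Now a single replacement pass over the combined column is enough
normalized = re.sub(r"steel plate|plate steel", "STEEL", combined,
                    flags=re.IGNORECASE)
print(normalized)
```

This way the replacement loop runs once per row instead of once per column.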

Cheers, Kilian

Kilian,

Thanks. I have found that these:

STEEL PLATE
STEEL SHEET
SHEET STEEL
COPPER PLATE

are the string expressions that match something in the description texts. However, instead of replacing every description that has these terms with the corresponding dictionary term, I want to replace them with just 'COPPER' or 'STEEL'. How do I configure the String Replacer node to do that, instead of replacing with words from the dictionary? I tried using the String Manipulation node with a regex at the end of the workflow to look for 'steel' in the Replacement column and append a new column with the value 'steel'. That didn't work; I'm getting 'steel' even when 'steel' is not in the string. Please help. Thanks.

Hi,

I am still not 100% sure what exactly you want to achieve. I am guessing that you have a text like

"This is a description of STEEL PLATE consisting of some words." and you want to replace STEEL PLATE with STEEL, is this correct? If so, see the attached workflow. You can use the String Replacer node to do that.
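The many-patterns-to-one-replacement step can be sketched as follows; the mapping is an assumption based on the four expressions listed above:

```python
import re

# Hypothetical mapping: several matched expressions all normalize to one term
replacements = {
    r"STEEL PLATE|STEEL SHEET|SHEET STEEL": "STEEL",
    r"COPPER PLATE": "COPPER",
}

def normalize(text):
    # Apply each pattern in turn, rewriting matches to the normalized term
    for pattern, repl in replacements.items():
        text = re.sub(pattern, repl, text)
    return text

print(normalize("This is a description of STEEL PLATE consisting of some words."))
```

Because the replacement is fixed per pattern, it does not matter which of the alternatives was actually matched in the text.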

Cheers, Kilian