Document order in output of Term Vector node

I’m struggling to understand from the output of the Term Vector node in which documents a certain term appears: the column names are cryptic (the numbers don’t refer to anything I can recognize), and the order in the collection column is not explained anywhere (the node’s help says that “the ordering is specified in the data table spec”, but I can’t find it there).

Hi @mpenalver!

Could you give us a little more detail on what you want to do? A workflow or some images might help us understand the problem.

I was playing with some documents and got the Term Vector output. In it you can see in which documents (ID numbers in the columns) a certain term (rows) appears (1 or 0). Did you use the Bag of Words and Term Frequency nodes beforehand to compute the term vector table?
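To make the layout concrete outside KNIME, here is a minimal Python sketch of a term-by-document presence table. It is only an illustration, not how the Term Vector node is implemented: the function names `term_presence_table` and `docs_containing`, the whitespace tokenization, and the sample documents are all assumptions made up for this example.

```python
def term_presence_table(docs):
    """Build {term: {doc_id: 0 or 1}} from a mapping of doc ID -> text.

    Tokenization is a naive lowercase whitespace split (an assumption
    for this sketch; real text processing does much more).
    """
    tokenized = {doc_id: set(text.lower().split()) for doc_id, text in docs.items()}
    all_terms = sorted(set().union(*tokenized.values()))
    return {
        term: {doc_id: int(term in tokens) for doc_id, tokens in tokenized.items()}
        for term in all_terms
    }


def docs_containing(table, term):
    """Return the IDs of the documents in which `term` appears."""
    return [doc_id for doc_id, present in table.get(term, {}).items() if present]


# Hypothetical sample data: two documents keyed by an explicit ID.
docs = {"doc_a": "the cat sat", "doc_b": "the dog barked"}
table = term_presence_table(docs)
cat_docs = docs_containing(table, "cat")
the_docs = docs_containing(table, "the")
```

With explicit IDs as keys, reading off the documents for a term is a simple lookup, which is the part that gets hard when the column names are opaque.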

By the way, my text data is not clean; please ignore that (it is just an example to focus on the Term Vector node).

Thank you!!


Term Vector.knwf (2.4 MB)


Thanks for the response, @cristiancandia.

Below is an example output of the Bag Of Words Creator and Term Vector nodes, which are directly connected. Each of the three terms shown only appears in one document. My question is: how can I infer to which documents of the Bag Of Words Creator node’s output the columns of the Term Vector node’s output correspond? In other words, how should I interpret those columns’ names? Notice that my docs purposely have no title, as I don’t want terms in the titles to be considered.

[Two images: example output of the Bag Of Words Creator and Term Vector nodes]


Thank you @mpenalver, now I can see the problem. You are right: when you assign an empty title to each document to keep terms in the titles from being considered, the Term Vector output is not easy to link back to the original docs.

What if you generate two different versions of the documents (two columns)?

You definitely need an ID for each document (to identify them faster). Then you can create two document columns: one with untitled documents and another with titled documents (using the ID as the title). This way, you build the BoW from the untitled column and keep the titled document column to generate the Term Vector.

I'm not sure if that is the best option, but it works for me; you could try it.


Term Vector.knwf (35.2 KB)


Thanks for the idea, @cristiancandia. I was hoping the column names coming out of the Term Vector node would be more descriptive of the documents they refer to. You are right that we need a doc ID (even when docs have titles, as two docs might share the same title), so the node’s configuration could offer the option to choose a column in the input table that holds the doc ID and use its values to name the output columns. In the absence of an ID column, the names could follow the order of the docs in the input table (the first doc would be named #1, instead of some apparently random number).
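As a rough sketch of the naming rule I have in mind (plain Python, not KNIME code; the function `output_column_names`, its `id_column` parameter, and the sample rows are hypothetical names invented for this illustration):

```python
def output_column_names(rows, id_column=None):
    """Pick one output column name per input document.

    rows: list of dicts, one per document, in input-table order.
    id_column: optional key holding the document ID. If omitted,
    fall back to positional names "#1", "#2", ...
    """
    if id_column is None:
        return [f"#{position}" for position in range(1, len(rows) + 1)]

    names = [str(row[id_column]) for row in rows]
    # IDs must be unique, otherwise two documents would share a column name.
    if len(set(names)) != len(names):
        raise ValueError("document IDs must be unique")
    return names


# Hypothetical input table with an explicit ID column.
rows = [
    {"doc_id": "A-17", "text": "first document"},
    {"doc_id": "B-02", "text": "second document"},
]
named_by_id = output_column_names(rows, id_column="doc_id")
named_by_position = output_column_names(rows)
```

Either way, a column name could be traced straight back to a row of the input table.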
