Setting document category for use later as document class

pstarrett · December 8, 2015, 12:23am

I would like to determine how to set the document category so that downstream classification (Document Class) can be done using the Category to Class node. I see that the String to Document node allows Document Category to be set but I am not sure how to take a table of documents and connect it into the String to Document node in order to set this value. Obviously, the document table is not a string so this cannot be directly connected to the String to Document node. I have done some research on forums but cannot find anything. Thank you!

An interesting note here. If you review the two examples 009001_DocumentClassification and 009002_DocumentClustering (which contain much the same functionality), the document class appears mysteriously in the Color Manager, but nowhere in the upstream node path does this value appear to be set. I was trying to answer my own question by examining these examples but there is nothing there! :)

kilian.thiel · December 8, 2015, 6:46pm

About the document class: you can insert meta information in a key,value pair manner into documents. This information can be extracted later on. If you have a class or category you can use the Meta Info Inserter and Extractor node to insert and extract this information.

About the example workflows: they have just been updated (please have a look). The class was set during creation with the Strings to Document node. Thus it can magically be extracted later on ;-). After creation the documents were saved as table files, which are used in these workflows.

Cheers, Kilian

pstarrett · December 8, 2015, 8:04pm

Kilian,

Thank you for the input and I appreciate your sticking with me on this since it has become a showstopper for me. First, I do not find the new versions of the examples. I have updated the examples from the public site and I find the same two examples; I do not see use of Strings to Documents nodes in these examples. I must have the old examples - any direction on where to find them would be greatly appreciated.

I am also trying out the Meta Info Extractor / Inserter nodes, but in the clustering example (the old one, obviously) only the Document column shows up under Document Column (Meta Info Extractor Configure window, Options tab). When I run the node, it generates an empty table. If I enter 'category' into the Meta Info Keys text box on this same tab, it also generates an empty table. In this same example, the Column Filter node on the bottom right (just before the Hierarchical Cluster View node) mysteriously finds all columns! If I drop this Column Filter node ANYWHERE upstream in this example, only the Document column is displayed. That is just plain odd. Any idea what is going on? I have tried the Document Data Extractor, which will pull and expose all columns from the Table in this example (it appears the Column Filter and Meta Info Extractor are incompatible with Tables?), so there is some progress here. When I try to connect this to a Strings to Document node, it reports something about a duplicate document column. So, I am going in circles with all of this and thought I would see if you know of a tutorial or webinar that explains data transformations in this specific area. Thank you for your continued help - it will be VERY helpful to resolve this.

kilian.thiel · December 9, 2015, 10:43am

You can find and download the new examples on the Example Server. On the website the workflows are not updated yet. Before you can extract meta information you need to insert it. Please find attached to this post an example workflow that shows how to insert and extract meta information in documents.

In the example workflows (on the Example Server and on the website) there is no meta info in the documents that can be extracted. The class info is in the category field of the documents. Documents have certain fixed fields, and category and source are two of them. These fields can be used to carry information. In addition, generic meta information can be inserted into documents as key/value pairs. In your case, since you want to add information to existing documents, you need to use the meta information to add data to them.

If you create new documents using the Strings to documents node you can store data in the category or source fields.
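To make the difference between the fixed fields and the generic meta information more concrete, here is a rough conceptual model in Python. It is only a sketch and not KNIME's API: the `Document` class and the `insert_meta_info` / `extract_meta_info` helpers are hypothetical names that mimic what the Strings to Document node and the Meta Info Inserter / Extractor nodes do.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class Document:
    """Toy stand-in for a document cell (hypothetical, not the KNIME API)."""
    text: str
    category: str = ""  # fixed field, filled in when the document is created
    source: str = ""    # fixed field, filled in when the document is created
    meta_info: Dict[str, str] = field(default_factory=dict)  # generic key/value pairs


def insert_meta_info(doc: Document, key: str, value: str) -> Document:
    """Return a copy of an existing document with one more key/value pair."""
    return replace(doc, meta_info={**doc.meta_info, key: value})


def extract_meta_info(doc: Document, key: str) -> Optional[str]:
    """Return the stored value, or None if nothing was inserted under that key."""
    return doc.meta_info.get(key)


# Case 1: a new document, so the class can go straight into the category field.
new_doc = Document(text="Great product", category="positive", source="reviews")

# Case 2: an existing document without a category; attach the class as meta info.
existing = Document(text="Terrible service")
labelled = insert_meta_info(existing, "class", "negative")

print(new_doc.category)
print(extract_meta_info(labelled, "class"))
print(extract_meta_info(existing, "class"))  # nothing was inserted, so None
```

The last line mirrors the empty table seen earlier in the thread: extracting a key that was never inserted yields nothing.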

I know this is a bit tricky and unclear at first glance. Sorry about that. I hope this helps you better understand the difference here.

Cheers, Kilian

metainfo.zip

pstarrett · December 10, 2015, 8:46pm

Hello Kilian,

First, thank you very much for the help! I have downloaded the attached example and there are a few issues:

1. It does not appear that the table file came with the example. When I look inside the Table Reader folder, all I find is a 'settings.xml' file, not a '.table' file. I am then not able to configure the remaining nodes in this example. As an aside, when I tried to use another table in my existing workflow (just to see if it would work), I found that my other Table Writers are saving tables as zip files. I do not see a configuration option in the Table Writer to save in any other format, so I am not sure what is going on there. If I navigate to the zip file from within the configuration dialog of the Table Reader, it also requires a '.table' file, not a zip. Kind of a showstopper all the way around.

2. When I load your example, I get what appears to be a benign error (though I cannot be certain). The error says your example was created in Knime 3.1 but I have 2.12. Can I (should I) upgrade to 3.1? If so, can I upgrade the existing instance or do I need to do a completely new install and import the workflows from the 2.12 workspaces?

Thank you for your continued help!

Paul

kilian.thiel · December 11, 2015, 11:39am

Hi Paul,

1. The workflow was executed and the data was available at all nodes. Since the data file was not included, the workflow should not be reset. I attached another workflow to this post with the data inside the workflow directory. You can reset the workflow and re-execute it now.

About the Table Writers / Readers: you can specify the full filename in the Table Writer; both table.zip and table.table work. It is recommended to save tables as .table files using the Table Writer node.

2. This error is caused by the version of your KNIME Analytics Platform: the workflow was created with version 3.1 and you are using 2.12. To get rid of this error you should download version 3.1. You cannot update from 2.12 to 3.1 because the Eclipse version and the Java version changed. You need to download and install a new, separate version.

Cheers, Kilian

metainfo.zip

pstarrett · December 12, 2015, 7:06pm

Kilian,

I am making headway and this has been very helpful in getting me closer to where I can make great use of Knime. Your new example works and I have been able to review the process more closely. I do see that the Strings to Document node will accept input directly from a Table Creator (as in your example). There, of course, the Category can be set.

However, I am still having a tough time getting the Strings to Document node to accept output from anything else (like output from a Table Reader or CSV Reader). I am getting a lot of 'duplicate column' errors; when I dig into the columns, I do not see the duplicate values. This should not be so complicated. Maybe there is a place within the Knime Help where the basic functionality and interconnectivity of the Readers and Writers, Document conversion, etc., is shown? I cannot seem to find it, and the node descriptions are not of much use in this context.

Also, I have not picked apart the last half of your example and I am not sure if it is supposed to show how to set column values in a document (or CSV table, or Table, etc.). It appears to add a few columns. Is this area supposed to exemplify setting column values?

I was able to successfully upgrade to 3.1. Thank you for your help there!

Thank you for the continued help! We are almost there!

Paul

kilian.thiel · December 14, 2015, 6:36pm

Hi Paul,

Can you share the CSV file that you want to use as input data for the Strings to Document node? If you attach the file I can create a workflow for you that reads in the data and converts the strings into documents.

I can also create you a workflow that injects meta info to existing documents if you send me some data (attach it to your next post) that you want to use.

Cheers, Kilian

pstarrett · December 14, 2015, 9:46pm

Kilian,

Thank you very much! Let me see if I can propose an approach that will save you time and get me on my way. I am attaching a workflow (it is a Palladian node example) that I am using and where these issues can be exemplified. I am passing it along only for high-level context, not for you to spend much (if any) time on it specifically. If you look at the node path from the Table Creator to the Content Extractor, you will find that it all works together. However, often, when I try to *replace* a node or add a node, I receive errors that the input / output type to or from nodes do not match - this includes when I try to write or read files or try to convert to a form that Strings to Document node will accept. If I had some way of knowing what format is exiting a node and what is required on input, I could convert and flow smoothly. The node descriptions I find are not helpful in this context. Is there some known map between types, for example, how to convert between the following (not exhaustive list, these are the ones I know of and are used for illustration):

- Document

- Document Cell

- CSV File

- Text File

- Table

- HTML / XML

- JSON

If I could find some resource that would explain how to "pipe" between each of the formats above then I would be set! I know there are nodes to convert XML to JSON, Table to HTML, Strings to Document, etc. What about Document to String, Table to CSV, CSV to Table, Text to Table, etc.? Does this make sense? Every node seems to require input of one of the types listed above and will output one of those types. For example, what if I wanted to save the output of the HTTPRetriever (in the attached) to CSV or text, which I cannot seem to do (saving to a table works fine)? Or what if I wanted to do the same with the Content Analyzer? If I knew how to convert types along the way I could save out data and mix and match nodes.

This might also facilitate use of setting column values such as Document Class / Category for each type. If I knew what format to convert a flow to, I might be able to set column values more seamlessly. If you do have a way to do this with Document or CSV, that would be great. Sorry if this email is long but I hope it was an easy read; I figure it was better to go into a little more depth just to get this resolved. Let me know your thoughts. Thank you very much!

palladian_02_query_a_earch_engine.zip

kilian.thiel · December 21, 2015, 1:06pm

The KNIME Analytics Platform provides various types of columns, i.e. cell types. These types can be doubles, ints, and strings, but also complex types such as Document or XML cells. Different nodes can be applied to each cell type, e.g. String Manipulation on string cells, formulas on number cells, XPath on XML cells, and text processing operations such as stop word filtering on document cells. Some types can be converted into other types. You can find the type of a column in the spec information of the data table view (second tab).

The HTML parser in your example produces XML cells. These cells can be processed with the XML nodes. XML can be converted e.g. to strings. To make use of the Textprocessing nodes you need Document cells. To create Document cells use the parser nodes, the Strings to Document node, or the Content Extractor as in your example. To extract fields of documents back to strings, use the Document Data Extractor. To extract fields from XML use the XPath node. To process XML use the XML nodes; for JSON use the JSON nodes.

In your example you use the Content Extractor to create document cells. After this node you can use the nodes of the Textprocessing extension to process these documents, e.g. to filter or stem them.

String <-> Documents
Strings to Document
Document Data Extractor
Term to String
String to Term
Tag to String

String <-> JSON
JSON nodes

String <-> XML
XML nodes
Html Parser
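As a rough illustration of how the groupings above can be read as a "piping" map, here is a small Python sketch. It is not part of KNIME: `NODES_BY_PAIR`, `nodes_between`, and `conversion_path` are hypothetical names, and the map contains only the node groupings listed in this post, kept undirected exactly as written (it does not say which direction each individual node converts).

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional

# Cell-type pairs and the nodes listed for them in the post above.
NODES_BY_PAIR: Dict[FrozenSet[str], List[str]] = {
    frozenset({"String", "Document"}): [
        "Strings to Document", "Document Data Extractor",
        "Term to String", "String to Term", "Tag to String",
    ],
    frozenset({"String", "JSON"}): ["JSON nodes"],
    frozenset({"String", "XML"}): ["XML nodes", "Html Parser"],
}


def nodes_between(type_a: str, type_b: str) -> List[str]:
    """Nodes listed for converting between two cell types (empty if none listed)."""
    return NODES_BY_PAIR.get(frozenset({type_a, type_b}), [])


def conversion_path(start: str, goal: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of cell types from start to goal."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current == goal:
            return path
        for pair in NODES_BY_PAIR:
            if current in pair:
                for nxt in pair - {current}:
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(path + [nxt])
    return None  # no chain of listed conversions connects the two types


print(conversion_path("XML", "Document"))
print(nodes_between("String", "XML"))
```

With this map, going from XML to Document passes through String, which matches the remark above that XML can be converted to strings and strings to documents. Types that are not in the lists (CSV, for example) simply return `None`.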

Cheers, Kilian

ifimsasa · December 31, 2018, 6:21am

Hello Kilian,
I had posted a query as a new thread but then saw this was a related thread so am posting here. Please help.
This is about Sentiment Analysis Using Lexicon Approach
My workflow starts with a ‘Table Creator’ which has two columns: a Text column and a Sentiment column.
I map the Sentiment column to Category in the Strings to Document node.
I have reused the remaining steps from the example on the community server.
Towards the end of the workflow there is a Category to Class node, and it adds a Document Class column. Where does it get this column from? Is this the same as the Sentiment label in the data input from the ‘Table Creator’ step?

Thanks,
Sasa

julian.bunzel · January 8, 2019, 10:05am

Hey @ifimsasa,

The Category To Class node creates a class column from the document's category.
Since you selected your sentiment column as the category in the Strings To Document node, the new class column should be the same as your previous sentiment column.

Cheers,
Julian

ifimsasa · January 11, 2019, 2:36pm

Thanks!
I was able to see that Category in Strings to Doc was getting carried over to this field.

badger101 · May 22, 2020, 10:37pm

Hi @julian.bunzel, can you elaborate further on the role of the Category to Class node in the lexicon-based approach to sentiment analysis? Thanks

ScottF · May 26, 2020, 2:37pm

Hi @badger101 -

When you originally create a document column using the String to Document node, you can also set certain metadata fields like author, source, category, and so on. This metadata is retained as part of the document throughout the workflow (you can examine it with the Document Viewer node at any point).

In the Lexicon Based Approach for Sentiment Analysis workflow, we have calculated a predicted sentiment score based on the frequency of how often different positive/negative words appear. To use the Scorer node and see how accurate the predictions are, we also need to have the true category as an explicit field again in the dataset. And that is what the Category to Class node does - it pulls out the category metadata we had originally defined, and puts it in a new column called Document class.
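The idea can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the KNIME workflow itself: the word lists, the `predict` rule (positive if more positive than negative words), and the `category_to_class` and `accuracy` helpers are all hypothetical stand-ins for the lexicon tagging, the Category to Class node, and the Scorer node.

```python
from dataclasses import dataclass
from typing import Dict, List

# Tiny made-up lexicon, for illustration only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "poor", "hate"}


@dataclass
class Document:
    text: str
    category: str  # true label stored when the document was created


def predict(text: str) -> str:
    """Label a text by comparing counts of positive and negative lexicon words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative"


def category_to_class(rows: List[Dict]) -> List[Dict]:
    """Copy each document's category into an explicit 'Document class' column."""
    return [{**row, "Document class": row["Document"].category} for row in rows]


def accuracy(rows: List[Dict], predicted: str, actual: str) -> float:
    """Fraction of rows where the predicted column equals the actual column."""
    return sum(row[predicted] == row[actual] for row in rows) / len(rows)


docs = [
    Document("great phone love it", "positive"),
    Document("bad battery poor screen", "negative"),
    Document("not good", "negative"),  # the simple lexicon rule gets this one wrong
    Document("good value", "positive"),
]

rows = [{"Document": d, "Prediction": predict(d.text)} for d in docs]
scored = category_to_class(rows)

print([row["Document class"] for row in scored])
print(accuracy(scored, "Prediction", "Document class"))
```

Without the `category_to_class` step, the true label would exist only inside each document, and there would be no column to compare the predictions against.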

Does that help?

badger101 · June 2, 2020, 9:05pm

Thanks @ScottF , yes it’s all clear now.
