I am using hierarchical clustering on text data. I have a series of letters that I convert to document vectors, from which I create a dendrogram. The documents are labeled with the author's name as the RowID, but once I create the bag of words, that RowID is lost. After the vectors are created, the RowID is an automatic sequential number, and I have difficulty interpreting the dendrogram because the numbers on the x axis are meaningless to me.
I need to label each node in the dendrogram with the name of the author of the document.
Has anyone done this? Can you help?
Thanks - Gabe
You may want to make sure the Document Cells contain the Author information so this can be extracted later after the Bag Of Words.
First, get the authors out of the RowID and into a normal column, using the RowID node.
Now use the Strings to Document node, selecting your main text for the document, and the new author column for your author. Now all the author information is kept in the Document Cells.
Create your bag of words as usual.
Now use the Document Data Extractor node and select Author. This extracts the Author information again, into a column next to your Bag Of Words. Then use the RowID node again to put the Author information back into the RowID.
You are now set up for your dendrograms with Author labels.
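For anyone who finds it easier to follow in code, here is a rough Python analogy of the steps above. It is only a sketch of the idea (carry the author inside the document, then pull it back out after the bag of words), not what the KNIME nodes do internally. The sample letters and every name in it are made up for illustration.

```python
from collections import Counter

# Hypothetical input: (author, letter text) pairs. In the workflow above,
# the author starts out as the RowID.
letters = [
    ("Alice", "dear friend the weather is fine"),
    ("Bob", "dear sir the invoice is late"),
    ("Carol", "the weather is fine dear friend"),
]

# Step 1: move the author out of the row ID into an ordinary column.
rows = [{"author": author, "text": text} for author, text in letters]

# Step 2: build "documents" that carry the author as metadata.
documents = [{"author": r["author"], "tokens": r["text"].split()} for r in rows]

# Step 3: bag of words. The row keys are now plain sequential numbers,
# which is where the author labels would normally be lost.
vocabulary = sorted({tok for d in documents for tok in d["tokens"]})
bag = {}
for i, doc in enumerate(documents):
    counts = Counter(doc["tokens"])
    bag[i] = {"document": doc, "vector": [counts[t] for t in vocabulary]}

# Step 4: extract the author from each document again and keep it
# next to the vector, so it can serve as the label for each leaf.
labels = [row["document"]["author"] for row in bag.values()]
vectors = [row["vector"] for row in bag.values()]

print(labels)
```

The `labels` list lines up row by row with `vectors`, so it can be handed to whatever draws the tree. For example, `scipy.cluster.hierarchy.dendrogram` accepts a `labels` argument for the leaf names.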
Simon, this worked like a charm. Thanks!