I recently started using KNIME. I am trying to identify clusters in my documents using k-means. I have been following one of the text processing tutorials; this is my workflow:
Database Reader > Column Filter > Strings to Document > Stanford Tagger > Bag of Words Creator > Punctuation Erasure > N Char Filter > Stop Words Filter > Porter Stemmer > Keygraph Keyword Extractor > Document Vector > k-Means
I don't know whether something is wrong with my workflow, but I keep getting:
"Execute failed: Error while writing to buffer, failed to write to file "knime_container_20151217_4305185062150924715.bin.gz": There is not enough space on the disk"
If I change the dataset, it occurs at a different node. I have allocated 11 GB of RAM to KNIME, and the disk has 299 GB of space and is almost empty.
The dataset I am using has 370288 values: 370288 rows with one column each.
1) You can apply the Bag of Words Creator after the Porter Stemmer. The online example is a bit outdated, I have to admit. A better example is this one: https://www.knime.org/blog/sentiment-analysis
Alternatively, connect to the example server and try out the new clustering example workflow, which was updated just recently.
2) You have 370,000 rows and you are creating 370,000 documents out of these rows? To process that many documents you need a machine with quite some computational power. Is it possible to aggregate the strings / rows beforehand so that you work with fewer than 100,000 documents?
Thanks Kilian, I updated my workflow and everything went fine until the data reached the Document Vector node. I want to run the k-means algorithm on my data, and for that I need document vectors.
Document Vector is taking forever to complete; it is stuck at 99% and gave me the following warning:
"WARN KNIMEApplication$3 Potential deadlock in SWT Display thread detected. Full thread dump will follow as debug ouput."
Please help me out.
I assume that you end up with too many terms as features in your data set. How many distinct terms do you have in the end that are used as features?
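To see why the number of distinct terms matters, here is a rough back-of-the-envelope sketch in Python. It assumes the document vectors are stored densely with one 8-byte value per document and term, and it ignores the compression KNIME applies to its container files, so treat the numbers as an upper bound. The term counts are hypothetical; the actual number of distinct terms in this data set is not known.

```python
def dense_vector_bytes(n_docs, n_terms, bytes_per_value=8):
    """Size of a dense document-term matrix in bytes (assumption: no compression)."""
    return n_docs * n_terms * bytes_per_value

n_docs = 370_288  # number of rows reported in the question

# Hypothetical vocabulary sizes, for illustration only
for n_terms in (1_000, 10_000, 100_000):
    gb = dense_vector_bytes(n_docs, n_terms) / 1e9
    print(f"{n_terms:>7} terms -> {gb:,.1f} GB")
```

Under these assumptions, the size grows linearly with the number of terms, so reducing the vocabulary directly reduces the table that has to be written.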
I had the same memory problem, but it was solved by placing the processing node between the 'Parallel Chunk' loop nodes. The Parallel Chunk loop nodes can be found in KNIME Labs under Parallel Execution.
Thank you for this hint. What also makes sense before creating document vectors is to filter down the number of unique terms (= features). Therefore I usually create a BoW after preprocessing the documents, group by term (maybe as a string, without tags), and count the number of documents each term occurs in.
Due to Zipf's law, many terms can be filtered out because they occur in only a very small fraction of documents. Assume a term occurs in fewer than 1% of all documents. If you use this term as a feature in the document vectors and a predictive model relies on it, then fewer than 1% of the documents could be classified based on this feature alone. Such a feature does not make much sense in terms of classification. Another way to look at this feature is that it has mostly 0 values and just a very few 1s. The variance is very low. This feature is almost constant and should be filtered out.
Filtering terms based on the number of documents they occur in makes a lot of sense. Filtering out terms that occur in less than X% (maybe X=1) of the documents reduces the number of useless features and thus the feature space of the document vectors.
An example can be found, e.g., here: https://www.knime.org/blog/sentiment-analysis
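The "almost constant" argument can be made concrete. Assuming binary document vectors (1 if the term occurs in the document, 0 otherwise) and writing p for the fraction of documents that contain the term, the feature behaves like a Bernoulli variable:

```latex
\[
X \sim \mathrm{Bernoulli}(p), \qquad \operatorname{Var}(X) = p(1-p)
\]
\[
p = 0.01:\quad \operatorname{Var}(X) = 0.01 \times 0.99 = 0.0099,
\qquad
p = 0.5:\quad \operatorname{Var}(X) = 0.5 \times 0.5 = 0.25
\]
```

So a term that occurs in 1% of the documents has roughly 25 times less variance than a term that occurs in half of them.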
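Outside of KNIME, the document-frequency filter described above could be sketched in plain Python as follows. This is only an illustration of the idea, not the KNIME workflow itself; the function names, the toy documents, and the 50% threshold are made up for the example (with real data a threshold such as 1% would be more typical).

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents it occurs in."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))  # set(): count a term at most once per document
    return df

def filter_rare_terms(docs, min_fraction=0.01):
    """Drop terms that occur in fewer than min_fraction of the documents."""
    df = document_frequency(docs)
    min_docs = min_fraction * len(docs)
    keep = {term for term, n in df.items() if n >= min_docs}
    filtered = [[t for t in terms if t in keep] for terms in docs]
    return filtered, keep

# Toy documents, already tokenized and preprocessed (hypothetical data)
docs = [
    ["knime", "cluster", "text"],
    ["knime", "text", "vector"],
    ["knime", "cluster", "kmeans"],
    ["knime", "text", "rare"],
]
filtered, vocab = filter_rare_terms(docs, min_fraction=0.5)
print(sorted(vocab))
```

Only the terms that survive the filter would then be used as columns when the document vectors are created.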