I have some positive movie reviews as .txt files in a folder called positive, and some negative movie reviews as .txt files in a folder called negative. What I want to do is build a model from the files in these two folders that tokenizes the words and learns which reviews are positive and which are negative.
In another folder I have some more movie reviews, also as .txt files. I want to apply the model to the files in this folder and find out which of them are positive reviews and which are negative.
Can anyone please help me regarding this issue?
You could try this:
Read in the data files with an appropriate parser or reader node. Create Documents so that you can use the Textprocessing nodes. Apply preprocessing (stop word filtering, stemming, ...) and create your feature space. Create document vectors (Document Vector node). Based on these vectors, which represent the documents, and the corresponding labels you have (positive, negative), train a classifier, e.g. a decision tree, and use the resulting model.
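Outside of KNIME, the same sequence of steps can be sketched in a few lines of plain Python. This is only a rough illustration, not what the KNIME nodes do internally: the stop word list, the example reviews and all function names are made up, and the classifier is a one-level decision tree (a "decision stump") swapped in to keep the sketch short.

```python
import re

# Tiny illustrative stop word list (an assumption, not KNIME's built-in list).
STOP_WORDS = {"a", "an", "the", "was", "is", "it", "this", "and", "of", "i"}

def tokenize(text):
    """Lowercase, split into word tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]

def build_vocabulary(docs):
    """Sorted list of every distinct term: the feature space."""
    return sorted({term for doc in docs for term in tokenize(doc)})

def to_vector(doc, vocabulary):
    """Binary document vector: 1 if the term occurs in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocabulary]

def train_stump(vectors, labels):
    """Find the single term whose presence/absence best separates the labels.

    Returns (feature_index, label_if_present, label_if_absent).
    """
    classes = sorted(set(labels))
    best = None
    for i in range(len(vectors[0])):
        for present in classes:
            for absent in classes:
                correct = sum(
                    (present if v[i] else absent) == y
                    for v, y in zip(vectors, labels)
                )
                if best is None or correct > best[0]:
                    best = (correct, i, present, absent)
    return best[1:]

def predict(stump, vector):
    """Apply the trained stump to one document vector."""
    i, present, absent = stump
    return present if vector[i] else absent

# Hypothetical example data standing in for the review files.
train_docs = [
    "A great and moving film",
    "Great acting, I loved it",
    "A boring and dull film",
    "Dull plot, boring acting",
]
train_labels = ["positive", "positive", "negative", "negative"]
new_docs = ["What a boring film", "Great story"]

# The feature space is built from ALL documents, labeled and unlabeled.
vocabulary = build_vocabulary(train_docs + new_docs)
stump = train_stump([to_vector(d, vocabulary) for d in train_docs], train_labels)
predictions = [predict(stump, to_vector(d, vocabulary)) for d in new_docs]
```

A real decision tree would keep splitting on further terms; the stump stops after the first split, which is enough to show how the labels and the document vectors fit together.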
When creating the feature space via the bag of words, all documents (and thereby all terms) must be available: those labeled positive or negative as well as those not labeled yet. Otherwise you would end up with two different feature spaces (one built from the labeled documents and a second built from the unlabeled ones), and a model trained on the first could not be applied to vectors from the second.
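To make the feature space issue concrete, here is a small plain Python sketch (not KNIME code; the documents and function names are invented for illustration). It compares vectors built from two separate vocabularies with vectors built from one shared vocabulary.

```python
def terms(doc):
    """Very simple tokenizer: lowercase and split on whitespace."""
    return doc.lower().split()

def vocabulary(docs):
    """Sorted list of all distinct terms in the given documents."""
    return sorted({t for d in docs for t in terms(d)})

def vectorize(doc, vocab):
    """Binary vector with one position per vocabulary term."""
    present = set(terms(doc))
    return [1 if t in present else 0 for t in vocab]

labeled = ["good film", "bad film"]   # hypothetical labeled reviews
unlabeled = ["good plot"]             # hypothetical unlabeled review

# Two separate feature spaces: the vectors have different lengths and
# the same position means a different term in each space.
vocab_labeled = vocabulary(labeled)
vocab_unlabeled = vocabulary(unlabeled)
separate_train = vectorize(labeled[0], vocab_labeled)
separate_new = vectorize(unlabeled[0], vocab_unlabeled)

# One shared feature space: every vector has the same columns.
shared_vocab = vocabulary(labeled + unlabeled)
shared_train = vectorize(labeled[0], shared_vocab)
shared_new = vectorize(unlabeled[0], shared_vocab)
```

With the separate vocabularies, position 0 means "bad" in one space and "good" in the other, so a model trained on one cannot read the other. With the shared vocabulary the columns line up.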
To avoid this problem you could try to build a dictionary of terms that occur in positive documents but not in negative ones, and vice versa. This dictionary can then be used on unlabeled documents to tag terms. Counting the tagged terms afterwards lets you weight each document as more positive or more negative.
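The dictionary idea can be sketched in plain Python as follows. This is just one possible reading of it, with made-up example reviews and function names; the "undecided" result for a tie is my own assumption.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_dictionary(positive_docs, negative_docs):
    """Return (terms only in positive docs, terms only in negative docs)."""
    pos_terms = {t for d in positive_docs for t in tokenize(d)}
    neg_terms = {t for d in negative_docs for t in tokenize(d)}
    return pos_terms - neg_terms, neg_terms - pos_terms

def score(doc, pos_only, neg_only):
    """Count of positive-tagged terms minus count of negative-tagged terms."""
    tokens = tokenize(doc)
    return sum(t in pos_only for t in tokens) - sum(t in neg_only for t in tokens)

def classify(doc, pos_only, neg_only):
    """Label a document by the sign of its score (assumption: tie = undecided)."""
    s = score(doc, pos_only, neg_only)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "undecided"

# Hypothetical labeled reviews.
positive_docs = ["a great and moving film", "great acting"]
negative_docs = ["a boring and dull film", "dull acting"]
pos_only, neg_only = build_dictionary(positive_docs, negative_docs)

result_a = classify("a moving story with great music", pos_only, neg_only)
result_b = classify("boring boring plot", pos_only, neg_only)
result_c = classify("a film", pos_only, neg_only)
```

Terms that appear in both classes ("film", "acting", "a", "and") drop out of the dictionary, so only the distinguishing terms are counted, and no shared feature space is needed.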
Thanks for your quick reply.
I started with the File Reader node and read one of the .txt files from the positive folder, but when I tried to configure the Stop Word Filter or some other preprocessing node, it shows "no column is spec compatible to document value". You mentioned that I need to create documents first. What do you mean by that, and how do I do it?
I think I should start with the Flat File Document Parser, which allows me to read all the .txt files from one folder and arrange them in rows. But then what? No preprocessing node works after it.
It depends on how your flat files are formatted. Is the text split into columns somehow, like a csv file? Or is it really plain text with no formatting at all? If it is formatted like a csv, use the File Reader node followed by the Strings to Document node. If it is really plain text, use the Flat File Document Parser node.
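For the plain text case, the idea of "one row per file, with a label" looks roughly like this in plain Python. It is only a conceptual sketch outside of KNIME, not the parser node itself; the function name `load_reviews` and the UTF-8 encoding are assumptions.

```python
from pathlib import Path

def load_reviews(folder, label):
    """Return one (filename, text, label) row per .txt file in the folder."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Assumption: the review files are UTF-8 encoded plain text.
        rows.append((path.name, path.read_text(encoding="utf-8"), label))
    return rows

# Possible usage with the folder names from the question:
# rows = load_reviews("positive", "positive") + load_reviews("negative", "negative")
```

Each row then corresponds to one document, with the label taken from the folder it was read from.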
To get started with KNIME Text Mining (creation of documents, preprocessing, vector creation, etc.) I recommend this document:
Here you find the online documentation:
Here are some workflows that might be interesting for you, once you know the basics:
The 6th workflow especially might help you.
Those are really very helpful, Thanks.. :)
This is exactly what I would like to achieve. Can you assist me please?