so I'm really new to KNIME and I have some questions about whether my plan would work with the KNIME open-source features or not.
I will do some sentiment analysis on Twitter data. The special feature here is that my Twitter data is stored in JSON format on a Hadoop cluster in the HDFS filesystem. I have additionally installed Hive so that I can query the data.
1. Question: Is it possible to connect Hive/HDFS to KNIME to get the data into KNIME?
2. Question: Where can I get a German dictionary to classify the data (positive/negative)?
3. Question: Is it possible to analyse hashtags? For example, #merkel and #steinbrück: which of them is tagged positively or negatively?
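Regarding question 2, a dictionary-based classification boils down to looking up each token in a sentiment lexicon and summing the scores. A minimal sketch, assuming a hypothetical mini-lexicon (the German terms and their scores below are placeholders; a real German sentiment dictionary would replace them):

```python
# Hypothetical mini-lexicon: word -> sentiment score.
# Replace with entries from a real German sentiment dictionary.
GERMAN_LEXICON = {
    "gut": 1, "super": 1, "toll": 1,       # example positive terms
    "schlecht": -1, "schlimm": -1,         # example negative terms
}

def classify(tweet: str) -> str:
    """Sum the lexicon scores of all tokens and map the total to a label."""
    score = sum(GERMAN_LEXICON.get(tok, 0) for tok in tweet.lower().split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("Die Rede war super und toll"))   # -> positive
print(classify("Das war wirklich schlecht"))     # -> negative
```

In KNIME the same idea maps onto the text-processing nodes: tag terms against the dictionary, then aggregate scores per document.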
1. a.) Can you create a zipped CSV file of that data via Hive, or would it get too big? If this is possible, create a CSV-formatted file and use the Strings To Document node to convert your data into KNIME documents. By the way, how many tweets do you want to analyze?
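The export step above could be sketched like this: write the query result as a gzipped CSV with one tweet per row, which a CSV reader (and then Strings To Document) can consume. The tweet rows below are hypothetical placeholders for the actual Hive query output:

```python
import csv
import gzip

# Hypothetical (id, text) pairs -- in practice these come from your
# Hive query result, not a hard-coded list.
tweets = [
    ("1", "erste Beispiel-Nachricht"),
    ("2", "zweite Beispiel-Nachricht"),
]

# Write a gzipped CSV with a header row.
with gzip.open("tweets.csv.gz", "wt", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "text"])
    writer.writerows(tweets)

# Quick round-trip check that the file is readable.
with gzip.open("tweets.csv.gz", "rt", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows[0])  # -> ['id', 'text']
```

Gzipping helps if the corpus is large; for millions of tweets, splitting the export into several files may be more practical.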
b.) KNIME provides nodes to access databases via JDBC, and there is a JDBC client for Hive (https://cwiki.apache.org/Hive/hiveclient.html); maybe this could work.
c.) Furthermore, there is a REST plugin for KNIME (http://tech.knime.org/book/krest-rest-nodes-for-knime) that allows interaction with RESTful web services. Hive in combination with HCatalog could work too.
3.) For hashtags, you could consider a hashtag (e.g. #merkel) as a document's label, with the text of the corresponding tweet as the document's text; then you can aggregate the sentiment scores per label.
I like the idea of analyzing tweets related to the election (#merkel, #steinbrück) and would be happy to help you with that if you are using KNIME. Don't hesitate to ask if you have questions or problems.
Previously I have done sentiment analysis using R (not with KNIME); there are multiple packages in R that will cover your requirements. If you are familiar with R, you could give it a try.
Apart from this, I was curious about your data collection process. What solution are you using to load tweets into HDFS?
the nodes are wrappers for the Twitter API. This API unfortunately limits the number of results you get from Twitter. The only way to get more results over time is to re-run the query and collect the tweets over time, e.g. with loop nodes and a Wait node. You can group over the IDs to eliminate duplicates, and write the new tweets to a database or file.
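The re-run-and-deduplicate loop can be sketched as follows. The batches here are hypothetical stand-ins for successive, rate-limited Twitter queries; keying by tweet ID plays the role of the grouping step:

```python
# Hypothetical result batches from repeated, rate-limited queries.
# Successive batches overlap, so some tweets appear more than once.
batches = [
    [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}],
    [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}],  # overlaps with batch 1
]

collected = {}
for batch in batches:                 # in KNIME: loop nodes + a Wait node
    for tweet in batch:
        collected[tweet["id"]] = tweet  # keyed by ID -> duplicates collapse

print(sorted(collected))  # -> [1, 2, 3]
```

After the loop, `collected.values()` holds each tweet exactly once and can be appended to a database or file.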
Hi people, I'm Josefina and I'm studying KNIME, with a case really similar to the one that Marian describes. I was wondering if you can give me any advice on where I can get a Spanish dictionary to classify the data. I am analyzing a political case, given our political situation in Argentina.
Like Josefina, I need a Spanish dictionary to perform sentiment analysis. I understand that the only Spanish dictionary available is the one made by the Spanish Society for Natural Language Processing (SEPLN) (http://www.sepln.org/?lang=en), but I haven't been able to get a copy. Do you know of any other sources?
Thank you for developing the Sentiment Analysis with N-grams workflow (https://www.knime.org/blog/sentiment-analysis-with-n-grams).
I ran the workflow with a Twitter dataset to test the sentiment (positive or negative). It worked well. The problem is how to apply it to real data. Is it possible to use real data without a pre-defined category (positive or negative)?
I also tried it with PMML (as shown below), but when I run the trained model, there is an error message saying: 'The column 'xx' does not exist in the table'. I would really appreciate it if you could advise.
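That error usually means the table being scored lacks a feature column the model was trained on. This is not KNIME's internals, just a minimal illustration of the same failure mode; the column names and the `score` helper are hypothetical:

```python
def score(row: dict, weights: dict) -> float:
    """Linear scoring: every feature column the model knows must be present."""
    missing = set(weights) - set(row)
    if missing:
        raise KeyError(
            f"The column {missing.pop()!r} does not exist in the table"
        )
    return sum(weights[col] * row[col] for col in weights)

# Hypothetical model: weights over two engineered feature columns.
weights = {"n_positive_terms": 1.0, "n_negative_terms": -1.0}

train_row = {"n_positive_terms": 3, "n_negative_terms": 1}
print(score(train_row, weights))  # -> 2.0

test_row = {"n_positive_terms": 2}  # lacks 'n_negative_terms'
try:
    score(test_row, weights)
except KeyError as exc:
    print(exc)  # same shape of error as the KNIME message
```

The fix is to run the unlabeled data through the exact same preprocessing (same n-gram/feature extraction nodes) as the training data, so both tables end up with identical column names.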