Genre analysis of forum posts

Dear all

I'm working on a project with the goal of detecting the genres of forum posts based on their titles. I have been using various resources provided on the KNIME website and have designed my workflow (please see it attached). My idea was to clean up the data, extract the keywords, and use hierarchical clustering to generate clusters of the content. Then, based on the clusters generated, I may be able to assign different genres to them. I have a list of questions and would really appreciate any input on them.

1. Is this method appropriate? The reason I chose clustering is that this is a company forum with a specific context (financial services), so I have no pre-defined labelled data that KNIME could learn from. I have no idea what subjects users might talk about, apart from the fact that everything relates to financial services.

2. How do I choose the parameters of some nodes? 

a) Keygraph Keyword Extractor. Is "Number of keywords to extract" determined by the length of each document (or the average length)? In my case, the post titles are all between 3 and 20 words long. Does this mean I need to scale down all three parameters of this node?
b) Document Vector. Bit vector or not? How do I choose, and why?
c) Distance Matrix. Which distance measure should I choose, and why?

3. I understand that in the Hierarchical Cluster View, the horizontal axis represents the column ID. However, is there a quick way to display the document (or maybe the original document) rather than the column ID? Although the clusters are visually identifiable, this is of little use if I can't link them back to the post titles.

4. Once I identify the genres (clusters), I also want to do some quantitative analysis, because alongside the titles I also obtained corresponding view counts, response counts, authors, etc. (please see the workflow for details). I'm hoping to analyse the number of views and responses that each cluster attracts. I'm aware this may require a totally different workflow, but I would really appreciate it if you could shed some light on how I can link the clusters to quantitative analytical techniques.

I understand this is quite a long post, and I thank you for taking the time to read it. I have only been learning KNIME for a week and would be grateful for any insights on my project.

Best 

James

I have been trying the Topic Extractor on my data. But it seems best suited to situations where the number of topics is known before the analysis. I have also encountered a problem: whenever I run the Topic Extractor node, the third output, Iteration Statistics, is always empty. I tried several different workflows, including samples downloaded from the KNIME website, all with the same result. If a downloaded workflow arrives pre-executed, there is data in the third output, but if I re-run it, the output becomes empty. It appears to have something to do with my PC. Is this even possible?

Also, could someone please lend a helping hand with my previous post? 

Many thanks

James

Hi James,

I inspected your workflow and have a few tips and comments about it.

  • The Keygraph Keyword Extractor is not really suited for feature extraction, which is what is happening here, since the Document Vector node creates vectors based on the extracted terms that are later used for clustering. Better try the method described here to extract / filter the words of a bag of words to create vectors: https://www.knime.org/blog/sentiment-analysis.
  • For documents I usually use complete linkage for hierarchical clustering to have really compact clusters.
  1. To detect genres in texts, why don't you use the Topic Extractor node? You can specify how many topics should be extracted. This is like the question of choosing k for k-means. You could try hierarchical clustering first to find out how many clusters there are in the data. The Topic Extractor node will assign terms that describe each topic.
  2.  a.) As mentioned above, I would not use the Keygraph Keyword Extractor node.
  3. b.) Bit vectors are fine to begin with; you can fine-tune with term frequencies (TF*IDF) later on. Usually these frequencies do not have a huge effect on the result.
  4. c.) Use the Cosine distance when comparing document vectors. Distances like Euclidean or Manhattan do not make much sense on high-dimensional data: https://en.wikipedia.org/wiki/Cosine_similarity
  5. You can change the RowID in a data table using the RowID node. Select the document column as the column to set as the RowID.
  6. That is also possible. In your original data table you have all the data. Extract e.g. the original titles or texts from the documents using the Document Data Extractor and join the original numbers. Then group over the assigned cluster labels and aggregate the numbers, e.g. by mean or sum.
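If it helps to see the distance logic outside of KNIME, here is a rough Python sketch of the cosine distance on bit vectors (the vectors and the 6-term vocabulary are made up purely for illustration; in KNIME this all happens inside the Distance Matrix node):

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; for bit vectors this reflects term overlap."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 1.0  # at least one empty document: treat as maximally distant
    return 1.0 - float(np.dot(a, b)) / denom

# Toy bit vectors over a 6-term vocabulary (hypothetical titles)
doc_a = np.array([1, 1, 0, 0, 1, 0])
doc_b = np.array([1, 0, 0, 0, 1, 1])  # shares two terms with doc_a
doc_c = np.array([0, 0, 1, 1, 0, 0])  # shares no terms with doc_a

print(cosine_distance(doc_a, doc_b))  # some overlap -> distance below 1
print(cosine_distance(doc_a, doc_c))  # no overlap -> distance exactly 1.0
```

Note that two documents with no terms in common always end up at distance exactly 1, which matters a lot when the documents are short titles.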

Cheers, Kilian

Hi James,

about the Topic Extractor: the problem with unsupervised topic extraction is similar to unsupervised clustering. We don't know how many clusters there are in the data, but some algorithms require this setting, e.g. k-means. Other algorithms are threshold based, e.g. using a cut in the dendrogram of a hierarchical clustering.

I cannot reproduce the issue with the Topic Extractor; can you please post a workflow that shows the problem?

About the previous part: can you provide a reasonably big data set that you want to work with and extract genres from? I could try to create an example workflow to get you started, but I need data for that.

Cheers, Kilian


Hi Kilian

Thank you so much for your comments. They really helped a lot.

I am reading the blog post about sentiment analysis you wrote in 2014 and I will try to apply that method to my project.

For the missing Topic Extractor iteration statistics, I have uploaded a workflow for your review. It's a sample downloaded directly from the KNIME server. I suspect something is not right with my machine.

I have also attached the dataset as a CSV file. Due to confidentiality reasons I cannot provide the original data, as the forum is for internal users at the firm. However, I managed to generate a dummy dataset; it won't be the same, but it can serve as an alternative. I simply searched the phrase "financial services" in the Financial Times and retrieved the first 500 article titles. The number of titles is close to my original data, and I also randomised the view and response counts. The data structure remains the same.

Please find both in the zip file.

Many thanks

James

Hi James,

thank you for the workflow and the data. I imported the workflow, reset the Topic Extractor, and re-executed it, but could not reproduce the problem. The third output table contains rows with the iteration statistics.

I tried to apply hierarchical clustering to your data set, but it did not work well. Since the data set only uses titles as texts, the feature overlap, and thus the similarity of documents, is quite small. Also, extracted keywords do not necessarily represent the genre: e.g. for the title "Cyber insecurity ..." the genre would be Internet, but that word does not appear in the title. I doubt that these titles can be clustered reasonably by simple term overlap. Maybe lemmatisation would help.

Cheers, Kilian


Hi Kilian

Thanks for getting back to me. 

Yes, I got a similar result with the original data. Attached is the dendrogram of both datasets (please view it in a new window, as it's cropped in the thread view), and as you said, the original data seems highly fragmented (although it looks better than the dummy data). Do you think we can conclude that even though the content was posted in a group with a common theme (financial services), there is just not enough coherence among a set of titles?

I also tried to use k-means, but the result is not promising either. I tried k values from 5 to 15 and chose k=7, which gave me clusters of sizes {422, 1, 17, 27, 9, 29, 10}. Then I applied k-means again to Cluster_0 (I don't know if this is a viable practice) and got two big clusters of sizes {127, 284}. I then tried k-means a third time and found that the cluster of 284 documents could not be separated further. I might be using k-means wrongly from the second level on, but I'm eager to know your thoughts on this.

From the result of the hierarchical clustering, can I deduce that successful clustering with other algorithms such as k-means is unlikely? If so, is there anything else I can do to extract structure or features from the data (such as the Topic Extractor)?

Many thanks

James

Hi James,

what you can see from the dendrogram is that only small groups of data points have distances below 1. In terms of cosine distance, this means they have some words in common. To the rest of the documents they have a distance of 1, meaning they have no words in common. The dendrogram shows that there are only small groups sharing some words, and also a bigger group in the middle whose documents have more words in common. To make some sense of these titles, you could add the "Cluster Assigner" node after the Hierarchical Clustering node and specify a distance threshold or cluster count to assign clusters. You don't need to apply k-means additionally. You could specify 0.975 as the distance threshold and filter out all clusters with fewer than 10 (or so) documents, assuming that the bigger clusters represent a topic.
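For what it's worth, the same pipeline (cosine distances, complete linkage, a threshold cut at 0.975) can be sketched in a few lines of Python with SciPy; the bit-vector matrix below is invented for illustration, and in KNIME this corresponds to the Distance Matrix, Hierarchical Clustering and Cluster Assigner nodes:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy bit-vector matrix: rows are documents, columns are vocabulary terms
# (a hypothetical stand-in for the Document Vector output)
X = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
])

# Pairwise cosine distances, then complete-linkage hierarchical clustering
dists = pdist(X, metric="cosine")
Z = linkage(dists, method="complete")

# Cut the dendrogram at a distance threshold (0.975, as suggested above);
# documents only end up together if all their pairwise distances are below it
labels = fcluster(Z, t=0.975, criterion="distance")
print(labels)
```

Documents with no term overlap sit at distance 1, above the cut, so they can never be merged into the same cluster.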

Cheers, Kilian

Thank you, Kilian, the explanation of the dendrogram was very helpful. Although I did my own research, it's great to have you confirm some of my interpretations.

I took your advice and applied the Cluster Assigner. However, I got stuck at "filtering any cluster with more than 10 documents". I understand it's about counting the occurrences of documents per cluster and filtering out rows accordingly. I managed the first step, but I don't know how to tie the output table back to the clustered data. I did it awkwardly with another branch of nodes and manual input in a Rule-based Row Filter (please see the image attached), but I'm pretty sure there is a better way. Could you please shed some light on this?

Another question: is there any way to combine the Topic Extractor with this method? For example, extracting one or two topics per cluster after filtering out any document that doesn't belong to a cluster? Do I need a loop to run the Topic Extractor on each cluster in order to aggregate all results into one table?

I know I have lots of knowledge gaps in using KNIME (for instance, basic data manipulation), but I have a tight deadline, so in a way I really rely on your input for my project.

Many thanks

James

Hi James,

with filtering I meant filtering out clusters with fewer than 10 documents. The Cluster Assigner will append a column with the assigned cluster label. You can group by the cluster label and count the documents assigned to each cluster. Then use a Row Filter to drop the small clusters and keep only clusters with at least 10 documents. Then use a Reference Row Filter to filter the output table of the Cluster Assigner node based on the filtered clusters table (the output table of the Row Filter). This results in data points with cluster labels from clusters that are reasonably big; the "noise" is filtered out of the data set.
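In case it helps to see the table logic spelled out, this is roughly what the grouping / filtering chain does, sketched in Python with pandas (the data is invented; in KNIME these steps are the GroupBy, Row Filter and Reference Row Filter nodes):

```python
import pandas as pd

# Hypothetical output of the Cluster Assigner: one row per document,
# with the assigned cluster label appended as a column
df = pd.DataFrame({
    "title":   [f"post {i}" for i in range(15)],
    "cluster": ["c1"] * 11 + ["c2"] * 3 + ["c3"] * 1,
})

# Group by the cluster label and count the documents per cluster
sizes = df.groupby("cluster").size()

# Keep only clusters with at least 10 documents ...
big = sizes[sizes >= 10].index

# ... and filter the full table against that list of clusters
# (the pandas analogue of the Reference Row Filter)
filtered = df[df["cluster"].isin(big)]
print(len(filtered))  # only the 11 documents of cluster c1 remain
```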

To extract topics for each cluster, use the Group Loop Start and select the cluster label as the group column. In the loop body, use the Topic Extractor node and extract n topics (I suggest 1, with n words representing that topic). Then use the Loop End to collect the data.

Attached is an example workflow.

Cheers, Kilian

Hi Kilian

Thanks so much for your advice and the sample workflow, it's very generous of you.

I managed to get a similar result using the method you described, I am very pleased with the progress so far.

Best

James