Sentiment Analysis on Amazon Reviews

It’s exactly what we agreed on with the professor.
Now I’m trying to make the TripAdvisor workflow you linked me work with the datasets I’ve found.
I’ll let you know how it goes, thanks!


I’m having a small problem with the Pivoting node in the 3rd block of the workflow. I assume the goal there is to pivot the average star rating by the topic that the LDA assigned to each review, but when I try to do that the node doesn’t work.
I tried different selections, as shown in the pictures below:

Immagine 2022-02-18 113228
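
For readers following along outside KNIME, the intended pivot (average star rating per assigned topic) can be sketched in plain Python. This is only an illustration of the aggregation, not the workflow's actual node configuration; the sample rows, the topic names, and the function name `mean_rating_per_topic` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical sample rows: (topic assigned by LDA, star rating).
reviews = [
    ("topic_0", 5),
    ("topic_0", 4),
    ("topic_1", 1),
    ("topic_1", 2),
    ("topic_1", 3),
    ("topic_2", 5),
]

def mean_rating_per_topic(rows):
    """Group rows by topic and return {topic: average star rating}."""
    totals = defaultdict(lambda: [0, 0])  # topic -> [sum of ratings, count]
    for topic, rating in rows:
        totals[topic][0] += rating
        totals[topic][1] += 1
    return {topic: s / n for topic, (s, n) in totals.items()}

pivot = mean_rating_per_topic(reviews)
for topic in sorted(pivot):
    print(f"{topic}: {pivot[topic]:.2f}")
```

In node terms this corresponds to grouping on the topic column and aggregating the rating column with the mean.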

Could you post the workflow as well? If you have sensitive data, you can use the table creator node as input with fake (but similar) data. Thank you!

Sure!
updated.json (364.7 KB)

It doesn’t let me upload the dataset directly, so here it is:

and the workflow for good measure:

I have solved the Pivoting issue but I still can’t make the whole model work.
So far I seem to get coherent topics from the LDA at least.

Did I export the model in the wrong way?

Hi @Andrea123, sorry for the delayed response. I meant: could you post the KNIME workflow itself (How to import and export KNIME Workflows - YouTube)?

This is a .knwf file (see image), and you can upload it directly here without using third-party systems. Also, the node causing you issues is the pivot node, correct? Once I see the KNIME workflow, we can address the issue. Thank you!

Screen Shot 2022-03-08 at 10.21.51 AM

There’s no hurry! Thanks for the reply: I guessed I had exported it wrongly when I tried to run the model on my laptop!
Anyway, here is the .knwf file:
Topic Models from reviews LAPTOP.knwf (1.7 MB)

The Pivot problem is already solved (I just had to tinker a bit with the node configuration).
It’s from that point onwards that the model stops working. In particular, the 4th block, where I should evaluate the results, isn’t functioning.
Thanks again for the support!

Topic Models from reviews LAPTOP.knwf (1.7 MB)

So the model seemed to work with the native KNIME nodes. As for R, do you need to use it? I don’t use R, so you would have to ask for help in a separate thread specific to whatever issue you encounter there. As for the Linear Regression node, is that appropriate for your problem?

For the topic modeling, the second node was not working, so I fixed that, but the visual you were trying to create had too many computers displayed at once, so it wasn’t useful. You’ll have to be picky about what you visualize since you have so many laptops.

I also made a change adding a flow variable so that you don’t have to manually select k in your third block.

Finally, I think the random seed generated different topics for me, because the labels you assigned seemed a little mismatched, so you may want to double-check that.

Have a look at how Machine Learning models are usually checked and done:

Thanks for the corrections!
I’ll check the link you posted and discuss with my thesis supervisor tomorrow whether I should use Linear Regression and R.

The labelling seemed coherent on my model, although I wasn’t sure about some of the labels either. I’ll definitely check again.

As always, thanks for all your help!

Just a question tho: you said that you changed a flow variable in the 3rd block (I assume in the LDA extractor right?), but how come I see no flow variables or no differences whatsoever when I imported it?
Where should I check exactly?
I imported the file through the import workflow command and had to change the name because I already had a workflow with the same name. They look identical to me.
(It’s probably just me being dumb and not knowing what/where I should search ahah)

Hi,

Apologies, I used the same file name and ended up sending you the same file. This time I renamed my file and will show a screenshot to make sure this kind of problem is avoided.


Topic Models from reviews LAPTOP_victor_edits.knar.knwf (156.6 KB)

Also, have you tried smaller k just to see if the results lead to better groupings? As in, don’t allow the algorithm to determine the optimal k, but visually inspect the keywords for k = 10, 11, etc. instead of using 15? I remember in your original comments, you said 1 of the topics was too general? Just a thought I had this morning.


Ah! I was checking every node of block 3 side by side to spot differences but couldn’t find any ahaha!
I’ll try to manually set a lower k as per your suggestion. I originally followed the model and went with what looked like the elbow range to me: it was very wide in the first step, and in the second one I saw 2 possible elbows, but the one at k = 15 had a “steeper” turn, so I thought it was the better option.
I’ll definitely try tomorrow by myself and then review everything with the professor who should sign this.
Thanks again, I’ll let you know how it goes!
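
As a rough illustration of the elbow reasoning discussed above, one way to rank candidate values of k is by how sharply the score curve bends at each point. This is a minimal sketch, not what the workflow does: the score values are invented, the function name `elbow_candidates` is hypothetical, and it assumes equally spaced k values with a score where lower is better.

```python
def elbow_candidates(ks, scores):
    """Rank interior k values by the discrete second difference of the score.

    Assumes equally spaced k values and a decreasing, lower-is-better score.
    A large positive value means the curve flattens sharply after that k.
    """
    bends = []
    for i in range(1, len(ks) - 1):
        bend = scores[i - 1] - 2 * scores[i] + scores[i + 1]
        bends.append((ks[i], bend))
    # Sharpest bend first.
    return sorted(bends, key=lambda kb: kb[1], reverse=True)

# Hypothetical scores for illustration only.
ks = [5, 10, 15, 20, 25]
scores = [900, 700, 520, 500, 490]

ranked = elbow_candidates(ks, scores)
for k, bend in ranked:
    print(f"k={k}: bend={bend}")
```

A ranking like this only narrows down the candidates. Inspecting the topic keywords for the top few values of k, as suggested above, is still the deciding step.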

@victor_palacios Hello

Do you know of any resource for running this analysis on Spanish text?

Update: the professor liked the model and, apart from the R part, everything works and the results also seem meaningful.
The linear regression also makes sense when limiting the topics, but I was wondering where I could find more information about how it’s evaluated. Is it like a classic regression model, where the greater the coefficient, the greater the positive correlation, and vice versa? Or is it something different?
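
On the coefficient question, a small worked example may help. The sketch below assumes ordinary least squares with a single 0/1 predictor (whether a review was assigned a given topic); the data and the function name `ols_slope_intercept` are invented for illustration. In that special case the coefficient equals the difference between the mean rating of reviews with the topic and the mean rating of reviews without it.

```python
def ols_slope_intercept(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: 1 if the review was assigned the topic, else 0.
has_topic = [1, 1, 0, 0, 0, 0]
stars = [1, 2, 4, 5, 4, 5]

slope, intercept = ols_slope_intercept(has_topic, stars)
print(f"intercept (mean rating without topic): {intercept:.2f}")
print(f"slope (change in mean rating with topic): {slope:.2f}")
```

A negative coefficient here means reviews with the topic have a lower average rating. With several topic columns in the model, each coefficient is instead the estimated effect of that topic with the other predictors held fixed, so the sign and size are best read alongside the p-values and standard errors the learner reports.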

EDIT: just to be clear, the k I find in step 2 is the number of topics I have to manually input in the LDA configuration in the 3rd block, right? Or is it the number of words per topic?

Hi @Jalvear,

In the workflow I shared, I don’t think there are any language-dependent nodes except the Stop Word Filter within the Preprocessing component. You can change the language there to Spanish.

@Andrea123,

Linear Regression via KNIME:

I’m still suspicious about regression, though, since ratings are not technically numerical. I would say they are classes, and I’d recommend logistic regression, random forests, etc. instead. Maybe try both linear regression (a regression technique) and logistic regression (a classification technique). Decision Trees and Random Forests can do both regression and classification as well. Then just choose the model that performs best (or that you can explain most easily).

other video:

I’ll for sure try your suggested options. I had the impression that my ratings were numerical tho, since I actually see star rating numbers in that column.
Now I’m trying a different approach: I’ve converted the ratings into a polarity value of 0 for negative sentiment (star ratings 1 and 2) and 1 for positive sentiment (star ratings 4 and 5), while discarding all neutral reviews (star rating 3).
The idea behind it is that for an SME it’s enough to know whether a product was positively or negatively received and which topics were related to that evaluation.
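
The polarity conversion described above can be written down compactly. This is a minimal sketch of the stated rule (1 and 2 stars become 0, 4 and 5 stars become 1, 3 stars are dropped); the function name `star_to_polarity` and the sample ratings are made up for the example.

```python
def star_to_polarity(stars):
    """Map a 1-5 star rating to 0 (negative) or 1 (positive).

    Returns None for neutral 3-star reviews so they can be discarded.
    """
    if stars in (1, 2):
        return 0
    if stars in (4, 5):
        return 1
    if stars == 3:
        return None
    raise ValueError(f"unexpected star rating: {stars!r}")

# Hypothetical ratings for illustration.
ratings = [5, 1, 3, 4, 2, 3]

# Keep only reviews with a defined polarity (drops the 3-star ones).
polarities = [p for p in map(star_to_polarity, ratings) if p is not None]
print(polarities)
```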

Once 0/1 is your target to predict, your goal becomes binary classification, and regression techniques cannot be used. (Just in case you want to predict neg or pos.)

More than prediction, I would like to discover potential patterns.
E.g. whether reviews are mostly negative when the “warranty and return” topic comes up, or whether “gaming” comes up mainly with good reviews.
But I’m also starting to doubt this part is actually necessary… I’m still experimenting with the other techniques you suggested.
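
For this kind of pattern discovery, a plain per-topic summary may already be enough before reaching for a model. The sketch below computes the share of negative reviews per topic; the rows are invented (only the two topic names come from the post above), and the function name `negative_share_by_topic` is hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (assigned topic, polarity), with 0 = negative, 1 = positive.
rows = [
    ("warranty and return", 0),
    ("warranty and return", 0),
    ("warranty and return", 1),
    ("gaming", 1),
    ("gaming", 1),
    ("gaming", 0),
    ("gaming", 1),
]

def negative_share_by_topic(rows):
    """Return {topic: fraction of reviews with polarity 0}."""
    counts = defaultdict(lambda: [0, 0])  # topic -> [negative count, total count]
    for topic, polarity in rows:
        if polarity == 0:
            counts[topic][0] += 1
        counts[topic][1] += 1
    return {topic: neg / total for topic, (neg, total) in counts.items()}

shares = negative_share_by_topic(rows)
for topic in sorted(shares, key=shares.get, reverse=True):
    print(f"{topic}: {shares[topic]:.0%} negative")
```

Topics whose negative share sits well above the overall negative share are the ones worth a closer look.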

