Hi guys!
I’m doing some data pre-processing on a large dataset of tweets and labels. The tweets contain URLs and Unicode artifacts (apostrophes were replaced with `\xe2`-style escape sequences) from the Tweepy extraction.
I extracted the tweets in Python using a template script, so I have no clue how to clean the data in Python. I’ve already tried a regex filter (using the regular expression `((http|https)://)?[a-zA-Z0-9./?:@-=#]+.([a-zA-Z]){2,6}([a-zA-Z0-9.&/?:@-=#])*`) and it doesn’t remove all the URLs, only a few.
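For context on what I tried: a likely culprit in that pattern is the unescaped `-` inside the character classes — `@-=` is parsed as a character range rather than three literal characters (Python’s `re` actually rejects it as a reversed range). A much simpler, more permissive pattern seems to catch typical tweet URLs; this is just a sketch assuming URLs start with `http(s)://` or `www.`, and the sample tweet is made up:

```python
import re

# Made-up sample tweet with two URL forms.
tweet = "check this https://t.co/abc123 and www.example.com/page wow"

# Simpler URL pattern: scheme-prefixed or www-prefixed runs of non-space chars.
url_re = re.compile(r"(?:https?://|www\.)\S+")

cleaned = url_re.sub("", tweet).strip()
print(cleaned)  # check this  and  wow
```

This won’t catch bare domains like `example.com` without a scheme or `www.`, but for tweets extracted via the API most links are `https://t.co/…` short links anyway.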
Can anyone help me repair those Unicode errors, strip the URLs, and tokenise the tweets?
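Here is roughly what I mean, as a minimal sketch. I’m assuming the `\xe2…` sequences are mojibake, i.e. UTF-8 bytes (e.g. `\xe2\x80\x99` for `’`) that got decoded as Latin-1, in which case re-encoding and re-decoding recovers the real characters; the sample tweet and the simple word tokeniser are my own placeholders:

```python
import re

def clean_tweet(text):
    # Repair suspected mojibake: UTF-8 bytes mis-decoded as Latin-1
    # (e.g. \xe2\x80\x99 -> ’). If that isn't the failure mode,
    # leave the text unchanged.
    try:
        text = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass
    # Strip URLs (scheme- or www-prefixed).
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    # Normalise curly apostrophes to ASCII.
    return text.replace("\u2019", "'")

def tokenise(text):
    # Naive tokeniser: lowercase words with optional internal
    # apostrophes, plus #hashtags and @mentions.
    return re.findall(r"[#@]?\w+(?:'\w+)?", text.lower())

tweet = "I can\xe2\x80\x99t wait! https://t.co/abc123 #excited"
print(tokenise(clean_tweet(tweet)))  # ['i', "can't", 'wait', '#excited']
```

A proper tweet-aware tokeniser (e.g. NLTK’s `TweetTokenizer`) would handle emoticons and punctuation better, but this shows the pipeline shape.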
Need to use Bag of Words and SVM afterwards.
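For that downstream step, a minimal sketch of what I have in mind, using scikit-learn’s `CountVectorizer` (Bag of Words) piped into `LinearSVC` — the four-tweet training set is a made-up toy stand-in for the real labelled data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the cleaned, labelled tweets.
tweets = ["i love this", "great day", "i hate this", "awful day"]
labels = ["pos", "pos", "neg", "neg"]

# Bag of Words features + linear SVM classifier in one pipeline.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(tweets, labels)

print(model.predict(["love this day"]))
```

The pipeline means the vectoriser’s vocabulary is fitted once on the training tweets and reused consistently at prediction time.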
Thanks guys, any help is appreciated.