Hi guys!
I’m doing some data pre-processing on a large dataset of tweets and labels. The tweets contain URLs and Unicode artifacts (apostrophes were replaced with `\xe2`-style escape sequences) from the Tweepy extraction.
I extracted the tweets in Python using a template script, so I have no clue how to clean the data in Python. I’ve already tried a regex filter (using the regular expression `((http|https)://)?[a-zA-Z0-9./?:@-=#]+.([a-zA-Z]){2,6}([a-zA-Z0-9.&/?:@-=#])*`) and it doesn’t remove all the URLs, only a few.
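For context on what I tried: a likely culprit in that pattern is the unescaped `-` inside the character classes — `@-=` is parsed as a character range rather than three literal characters (Python’s `re` actually rejects it as a reversed range). A much simpler, more permissive pattern seems to catch typical tweet URLs; this is just a sketch assuming URLs start with `http(s)://` or `www.`, and the sample tweet is made up:

```python
import re

# Made-up sample tweet with two URL forms.
tweet = "check this https://t.co/abc123 and www.example.com/page wow"

# Simpler URL pattern: scheme-prefixed or www-prefixed runs of non-space chars.
url_re = re.compile(r"(?:https?://|www\.)\S+")

cleaned = url_re.sub("", tweet).strip()
print(cleaned)  # check this  and  wow
```

This won’t catch bare domains like `example.com` without a scheme or `www.`, but for tweets extracted via the API most links are `https://t.co/…` short links anyway.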
Can anyone help me repair those Unicode errors, strip the URLs, and tokenise the tweets?
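Here is roughly what I mean, as a minimal sketch. I’m assuming the `\xe2…` sequences are mojibake, i.e. UTF-8 bytes (e.g. `\xe2\x80\x99` for `’`) that got decoded as Latin-1, in which case re-encoding and re-decoding recovers the real characters; the sample tweet and the simple word tokeniser are my own placeholders:

```python
import re

def clean_tweet(text):
    # Repair suspected mojibake: UTF-8 bytes mis-decoded as Latin-1
    # (e.g. \xe2\x80\x99 -> ’). If that isn't the failure mode,
    # leave the text unchanged.
    try:
        text = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass
    # Strip URLs (scheme- or www-prefixed).
    text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    # Normalise curly apostrophes to ASCII.
    return text.replace("\u2019", "'")

def tokenise(text):
    # Naive tokeniser: lowercase words with optional internal
    # apostrophes, plus #hashtags and @mentions.
    return re.findall(r"[#@]?\w+(?:'\w+)?", text.lower())

tweet = "I can\xe2\x80\x99t wait! https://t.co/abc123 #excited"
print(tokenise(clean_tweet(tweet)))  # ['i', "can't", 'wait', '#excited']
```

A proper tweet-aware tokeniser (e.g. NLTK’s `TweetTokenizer`) would handle emoticons and punctuation better, but this shows the pipeline shape.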
Need to use Bag of Words and SVM afterwards.
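For that downstream step, a minimal sketch of what I have in mind, using scikit-learn’s `CountVectorizer` (Bag of Words) piped into `LinearSVC` — the four-tweet training set is a made-up toy stand-in for the real labelled data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-ins for the cleaned, labelled tweets.
tweets = ["i love this", "great day", "i hate this", "awful day"]
labels = ["pos", "pos", "neg", "neg"]

# Bag of Words features + linear SVM classifier in one pipeline.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(tweets, labels)

print(model.predict(["love this day"]))
```

The pipeline means the vectoriser’s vocabulary is fitted once on the training tweets and reused consistently at prediction time.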
Thanks guys, any help is appreciated.