I have a question related to text classification. I have transaction data containing a text field, entered by users asking for some kind of help. I want to classify each transaction into predefined categories using that text (by guessing what the user is asking for from the combination and sequence of words). There are more than 70 categories.
I don’t have any labeled data set, just a large uncategorized list of text, around 2M records. Some examples of the mapping I want:
I want to know how to change flat tire -> car maintenance
I have lost my key in the shop -> Lost & found
I forgot my book on table -> Lost & found
I can’t find my car -> Security
My car was stolen -> Security
I’m not feeling well -> health
I’m tired -> health
I’m facing problem with my van , it isn’t working -> car maintenance
I’m facing problem with my van ,I can’t find it -> Security
I can’t find my book -> Lost & found
Can Word2vec be used in the above case?
I.e., is it possible to use Word2vec to map the words that appear in a text to a certain category?
From the above examples: if the text contains "flat tire", can Word2vec predict the category as car maintenance, and so on?
Word2Vec will not give you a category, but rather a vector. To map from that vector to a category, you’d still need to perform some sort of classification.
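To illustrate the gap between "vector" and "category", here is a minimal sketch in plain Python. It assumes word vectors are already available (the tiny 3-dimensional `WORD_VECTORS` below are made up for illustration, not output of a real model) and that a small hand-labeled seed set exists. It averages the word vectors of a text and assigns the category whose centroid is closest by cosine similarity. Nearest-centroid is just one possible classifier, chosen here for brevity.

```python
import math
import re

# Made-up toy vectors for illustration only; real ones would come
# from a trained Word2Vec model and have far more dimensions.
WORD_VECTORS = {
    "flat": [0.9, 0.1, 0.0],
    "tire": [1.0, 0.0, 0.1],
    "van": [0.8, 0.2, 0.1],
    "car": [0.5, 0.5, 0.0],
    "stolen": [0.1, 0.9, 0.0],
    "lost": [0.0, 0.2, 0.9],
    "key": [0.1, 0.1, 0.8],
    "book": [0.0, 0.1, 0.9],
}


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def document_vector(text):
    """Average the vectors of all known words; None if no word is known."""
    tokens = re.findall(r"[a-z]+", text.lower())
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    return mean_vector(known) if known else None


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(labeled_texts):
    """Build one centroid per category from (text, category) pairs."""
    grouped = {}
    for text, category in labeled_texts:
        vec = document_vector(text)
        if vec is not None:
            grouped.setdefault(category, []).append(vec)
    return {category: mean_vector(vecs) for category, vecs in grouped.items()}


def predict(text, centroids):
    """Return the category with the most similar centroid, or None."""
    vec = document_vector(text)
    if vec is None:
        return None
    return max(centroids, key=lambda category: cosine(vec, centroids[category]))


# Hypothetical hand-labeled seed set (the classification step needs labels).
SEED = [
    ("how to change flat tire", "car maintenance"),
    ("my car was stolen", "Security"),
    ("I have lost my key", "Lost & found"),
]
centroids = fit_centroids(SEED)
```

With this seed set, `predict("flat tire on my van", centroids)` lands on car maintenance. A text with no known words returns `None`, which is one reason a model trained on your own domain vocabulary matters.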
You’ll need a word vector model. There’s the Word2Vec Learner to build such a model, or alternatively you could probably use an existing model and read it with the Word Vector Model Reader. However, I’m skeptical that the latter option will give you any good results, as the existing models are obviously trained on an entirely different domain.
Basically, this approach corresponds to #1 in my above post.
However, as stated above, this is surely not the right path to tackle your specific problem, given the extremely high number of potential categories.
Rephrasing your problem: you’ll need a two-step classification, first into 125 companies, then into company-specific categories, right? Then I’d consider splitting this classification task into two steps (i.e. 1 + n_companies classifiers: one for the company, plus one per company for its categories). Still, 125 and 350 categories respectively is huge, and it’ll require a significant amount of training data and time in case you want to create a supervised approach.
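A rough sketch of that two-step layout in plain Python, to show the routing rather than a usable model. `KeywordClassifier` is a deliberately trivial stand-in (token-overlap counts) for whatever supervised learner you end up using, and the company names `GarageCo` and `MallCo` and the training rows are hypothetical.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class KeywordClassifier:
    """Trivial stand-in learner: scores a label by token overlap with its training texts."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.token_counts[label].update(tokenize(text))
        return self

    def predict(self, text):
        tokens = tokenize(text)
        scores = {
            label: sum(counts[t] for t in tokens)
            for label, counts in self.token_counts.items()
        }
        return max(scores, key=scores.get)


class TwoStepClassifier:
    """One company classifier plus one category classifier per company."""

    def __init__(self):
        self.company_clf = KeywordClassifier()
        self.category_clfs = {}

    def fit(self, rows):
        # rows: iterable of (inquiry text, company, category)
        rows = list(rows)
        self.company_clf.fit([r[0] for r in rows], [r[1] for r in rows])
        by_company = defaultdict(list)
        for text, company, category in rows:
            by_company[company].append((text, category))
        for company, pairs in by_company.items():
            texts, categories = zip(*pairs)
            self.category_clfs[company] = KeywordClassifier().fit(texts, categories)
        return self

    def predict(self, text):
        company = self.company_clf.predict(text)             # step 1
        category = self.category_clfs[company].predict(text)  # step 2
        return company, category


# Hypothetical labeled rows, for illustration only.
ROWS = [
    ("how to change flat tire", "GarageCo", "car maintenance"),
    ("my van is not working", "GarageCo", "car maintenance"),
    ("my car was stolen from the garage", "GarageCo", "Security"),
    ("lost my key in the shop", "MallCo", "Lost & found"),
    ("forgot my book on table", "MallCo", "Lost & found"),
    ("not feeling well in the shop", "MallCo", "health"),
]
model = TwoStepClassifier().fit(ROWS)
```

The point of the split is that each second-stage classifier only has to separate the categories of a single company. If the company is already known for a record, step 1 can be skipped and the record routed directly to `category_clfs[company]`.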
Are there any further features which you could exploit besides the textual information?
For now, I don’t think so, but there is a relationship: company -> inquiry type -> category
I have 2.5 million records,
with columns [transaction id, company name, inquiry type, inquiry (the text that will be used for text classification), category (new column, empty, the result of the classification)]
The only field I want to fill is the “category” field, based on classifying the text in the “inquiry” field.
My dataset contains the following columns:
1- transaction id (number, has data)
2- company name (string, has data)
3- inquiry type (string, has data)
4- inquiry (string, has data; contains the text that I’m going to use to classify and find the category)
5- category (string, NO data; I want to fill this column as the result of text classification)