Categorizing a large amount of unstructured data


I have around 2 lakh rows of data where there are thousands of categories. I want to categorize the same into much broader ones, and reduce those to around 20 categories. Also, in future whenever I receive a detailed category description in data, I want it to be automatically categorized within the 20 new ones. All the data is in a tabular form in a single Excel sheet. So instead of document classification, I require to classify the line items within a single document.

How do I go about it ? I can manually assign the broader category to the detailed ones and then train the model.

I am new to text processing so any kind of guidance is highly appreciated. Thank you !


I don't quite appear to understand what you mean by "line items within a single document" but I suppose that each line has a column with unstructured text and another with a category label, thus each line can be seen as a separate document and the whole Excel worksheet as a corpus of documents.

The first thing to find out: how do the broad topics relate to the unstructured documents ?

If the detailed category exclusively and accurately describes the document's content, then each broad topic is a mutually exclusive collection of detailed categories. In this case, the most efficient solution is to build a lookup table, which may or may not require text processing skills.

If the relationship between broad topic and detailed categories requires additional information from the unstructured text or if the detailed categories cannot be trusted (quality issues), you won't probably get around text processing. There are some useful text processing examples on the KNIME example server.

Another crucial point is domain knowledge. No matter the classification approach, given that the broad categories have to be assigned first, you'll only be able to extract as much information as you're able to recognise and understand. 

A relatively easy but not effortless way to gain such knowledge is to manually assign the broad categories on a random stratified sample of a few hundred or thousand instances, using both the category description and the unstructured text as inputs. At the same time, you are building a training data set which can be used for supervised learning model.

Unsupervised learning methods (e.g. topic modeling, clustering, LDA, etc.) can also be used to assist you with the task of assigning the broad categories. KNIME provides you with access to such methods, either directly or via R/Python integration. However, without any domain knowledge and the necessary text processing skills, this approach can be as intimidating and time consuming as the manual labeling. You could of course combine both approaches, if you had the necessary time.


Thanks a lot for your detailed response !