Doc2Vec and Labels

I am trying to adapt the Simple Document Classification Using Word Vectors example workflow to a smallish data set of news articles.

As in the provided example, I create a table with a content column and a label column. The table contains about 2,200 rows (news articles) mapped to 10 categories and is used as the input to the Doc2Vec Learner node. As in the example workflow, this is followed by the Vocabulary Extractor node. So far so good.
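
For context, my mental model of how the label column is supposed to enter the model is roughly the following gensim sketch (this is not the Learner's internals; the file name, preprocessing, and parameters are just assumptions on my part):

```python
# Illustrative sketch outside KNIME: gensim 4.x Doc2Vec, using the
# category label as the document tag, which is roughly what the
# Label Column in the Doc2Vec Learner is meant to do.
import pandas as pd
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

df = pd.read_csv("bbc_articles.csv")  # hypothetical export of the input table

# One TaggedDocument per article; the tag is the category label,
# so the model learns one "label vector" per distinct label.
corpus = [
    TaggedDocument(words=simple_preprocess(text), tags=[label])
    for text, label in zip(df["Concatenate(Col0)"], df["label"])
]

model = Doc2Vec(vector_size=100, min_count=2, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# With 10 categories I would expect 10 label vectors here.
print(model.dv.index_to_key)
```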

When I run this and look at the Label output of the Vocabulary Extractor, however, I only get a single label instead of the expected 10. I don’t understand what is going on here, and I was not able to find more detailed relevant information in the documentation or on the forum.

Btw, the only other change I made to the sample workflow was to remove the “Rules” node in “Read Training Documents”, which, as far as I understand, just shortens the label names to three-letter acronyms and does not seem essential.

Any help would be greatly appreciated.

Hi @petervk -

Are you able to post the relevant portion of your workflow with the actual data you’re using, or is that confidential? That would definitely help.

(One quick thing to check would be the Label Column setting in the configuration of the Doc2Vec Learner node.)

Hi Scott,

Nothing secret here. I am using a Kaggle dataset (BBC Full Text).

Don’t know whether it helps, but I have attached the output of the Read Training Documents as a “.table” file (using the Table Writer).
kaggle.zip (2.7 MB)

The columns in the table are:

Concatenate(Col0) - String
label - String
Iteration - Numeric (Int)

In the Doc2Vec Learner, Concatenate(Col0) is mapped to the Document Column and label to the Label Column, as in the original example.
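
If it helps, the quickest way I know to double-check the label cardinality outside of KNIME would be something like the following (assuming the table is first exported to CSV, since the .table format is KNIME-specific; the file name is hypothetical):

```python
# Quick sanity check on the input table: does the label column
# really contain 10 distinct values, and how are rows distributed?
import pandas as pd

df = pd.read_csv("read_training_documents.csv")  # hypothetical CSV export
print(df["label"].nunique())        # expected: 10
print(df["label"].value_counts())   # rows per category
```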

Hi @petervk -

Sorry for the delayed response, and thanks for providing the dataset. I am able to reproduce the behavior you’re seeing, but I can’t yet explain it.

One thing I noticed is that by placing a Row Sampling node after the Table Reader and producing a smaller dataset of n=50, stratified on label, I see 4 out of 5 labels show up in the Vocabulary Extractor. My initial suspicion is that on your ~2,200-document dataset, the Doc2Vec algorithm may not be converging with the default parameters - but I’m not sure yet.
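
If you want to reproduce that subsample outside of KNIME, the same stratified draw would look roughly like this in scikit-learn (the CSV export and column name are assumptions):

```python
# Sketch of a stratified n=50 subsample equivalent to the Row Sampling node.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("read_training_documents.csv")  # hypothetical CSV export

# Draw 50 rows while keeping each label's share of the data.
sample_df, _ = train_test_split(
    df, train_size=50, stratify=df["label"], random_state=42
)
print(sample_df["label"].value_counts())
```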

Let me check internally with one of our developers and see if we can gain any more insight.

Fabulous, thank you Scott. Just FYI, I suspected as well that convergence might be a problem, and increased the number of iterations in the hope that it would give better results. That did not help, despite an entire night of number crunching. I have not tried using a smaller dataset yet.

And btw, I did run the dataset through kNN and Naive Bayes classifiers, both of which gave almost perfect accuracy. That would suggest to me that the categories in the dataset are sufficiently distinct to also be picked up by Doc2Vec, but of course I could be wrong.
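
For completeness, the baseline I ran is conceptually similar to the following scikit-learn sketch (not the exact KNIME nodes; the file name, feature settings, and k are assumptions):

```python
# Rough baseline comparison: TF-IDF features plus kNN and Naive Bayes,
# scored with 5-fold cross-validated accuracy.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("bbc_articles.csv")  # hypothetical export of the input table
X, y = df["Concatenate(Col0)"], df["label"]

for clf in (MultinomialNB(), KNeighborsClassifier(n_neighbors=5)):
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(type(clf).__name__, round(scores.mean(), 3))
```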