Classifying early 2000s forum nicknames by gender

iagovar · July 11, 2020, 11:56am

Hi all, I have the following problem, hope someone can shed some light.

I have a list of about 10k nicknames from an old forum. About half of this list has a gender attached to it (done by the very users that registered), and the other half has no gender.

I want to do some analysis based on gender, so having ~5K users without a gender label is not helping. I’ve been playing with the Palladian Text Classifier Learner, without much success.

As training sources I’m using the labeled part of my own list (without the unclassified nicknames) and a very large list of gendered English names I found on the internet.

The results are pretty poor. Many nicknames look something like [^-J0hn-^], if you know what I mean, but it even misses many of the ones I already provided in the sources.
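One thing that might help with decorated nicknames is normalizing them before they reach the classifier. The following is only a minimal Python sketch of that idea, not something Palladian does for you: `normalize_nickname` and the `LEET` table are hypothetical names, and the substitution table is an assumption that would need tuning against the real data.

```python
import re

# Hypothetical leetspeak table; extend it with the substitutions seen in your data.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})

def normalize_nickname(nick: str) -> str:
    """Lowercase, undo common leetspeak, then drop everything except a-z."""
    text = nick.lower().translate(LEET)
    # Note: this also drops accented letters; widen the pattern if you need them.
    return re.sub(r"[^a-z]+", "", text)

print(normalize_nickname("[^-J0hn-^]"))  # john
```

The normalized string could then be matched against the list of gendered names, or used as the classifier input instead of the raw nickname.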

I don’t really know how to solve this. I’ve been thinking about using the forum messages but I don’t really know how to do it.

Any idea?

marten_kose · July 13, 2020, 9:37am

Not sure whether I fully got what you are trying to do, but one option would be to leave those unlabeled forum users out of your analysis, because trying to label them adds a whole new classification problem.

iagovar · July 13, 2020, 11:29am

I’ve been thinking about it, but sadly I have no way around it. I really do have to classify these users by gender, or everything else I do will be for nothing.

iagovar · July 14, 2020, 3:38pm

I’m now trying a different approach using forum messages. The problem I’m facing is a memory error: training is cancelled after # number of rows due to memory. My current heap space is 26GB and I’m writing tables to disk.

Is there any way around this? I already sampled down my dataset a lot (to around 80k rows out of more than 2 million), and I also run the garbage collector beforehand just in case.

I’d really like to use a larger sample.

iagovar · July 14, 2020, 7:06pm

Went after this poor guy on the second-hand market thinking it would give me some leeway, but it’s already suffering. At least it was around 200€.

qqilihq · July 15, 2020, 7:05am

The Palladian Text Classifier is rather memory and resource efficient. This looks like you’re using it in a very, very strange way. But this is just an assumption, and it’s hard to tell without knowing what the workflow looks like.

iagovar · July 15, 2020, 10:58am

Well, the workflow is huge (it’s like the inception of metanodes). It’s mostly a database export from a forum’s MariaDB, and I’m applying the learner and predictor to a dataset to predict user gender (I already have a column with a gender label, but not all users are gendered and I need all of them).

It works fine until I have to learn from forum messages. I sampled the messages down to 50K and it works now, but it really brings that computer to its knees.

qqilihq · July 15, 2020, 11:03am

What feature settings are you using?

iagovar · July 15, 2020, 11:11am

These ones. I needed accuracy, so that’s what I came up with.

qqilihq · July 15, 2020, 12:24pm

A max. length of 20 words will generate way too many phrase combinations.

I would set this to a maximum of three, no more. To begin with, I suggest min. length 1 … max. length 2 if you use word n-grams.
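To get a feeling for why the maximum length matters, here is a rough Python sketch that counts the word n-grams produced from a single message. It is not how Palladian extracts features internally; `word_ngrams` and the sample message are made up for illustration.

```python
def word_ngrams(text, min_len, max_len):
    """Return all word n-grams with min_len <= n <= max_len."""
    tokens = text.split()
    grams = []
    for n in range(min_len, max_len + 1):
        # No n-grams are produced once n exceeds the number of tokens.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Made-up 13-word message, for illustration only.
message = "i think the new season was way better than the old one honestly"

print(len(word_ngrams(message, 1, 2)))   # 25
print(len(word_ngrams(message, 1, 20)))  # 91
```

The per-message count is only part of the cost: long phrases almost never repeat across messages, so nearly every one of them is likely to become its own entry in the model, which is presumably where the memory goes.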

Alternatively, try chars, e.g. 5-grams (and decrease/increase max/min step by step).
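For illustration, character n-grams can be sketched in a few lines of Python. This is a generic sketch and not Palladian’s own code; `char_ngrams` and the example nickname are hypothetical.

```python
def char_ngrams(text, min_len, max_len):
    """Return all character n-grams with min_len <= n <= max_len."""
    grams = []
    for n in range(min_len, max_len + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

print(char_ngrams("john_83", 5, 5))  # ['john_', 'ohn_8', 'hn_83']
```

Because the n-grams are built from characters rather than whole words, fragments such as "john" can still match when the nickname or message contains decoration or misspellings.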

In fact this is a good scenario for a grid search to optimize for best results.
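A grid search here just means trying each min/max combination and keeping the one that scores best on held-out labeled data. The Python sketch below shows the idea with a tiny from-scratch naive Bayes classifier standing in for the Palladian learner; all function names and the toy nicknames are hypothetical, and the data is far too small to say anything about real accuracy.

```python
import math
from collections import Counter, defaultdict
from itertools import product

def char_ngrams(text, min_len, max_len):
    text = text.lower()
    return [text[i:i + n]
            for n in range(min_len, max_len + 1)
            for i in range(len(text) - n + 1)]

def train(samples, min_len, max_len):
    """Count n-grams per label (a tiny multinomial naive Bayes stand-in)."""
    counts, priors = defaultdict(Counter), Counter()
    for text, label in samples:
        priors[label] += 1
        counts[label].update(char_ngrams(text, min_len, max_len))
    return counts, priors

def predict(model, text, min_len, max_len):
    counts, priors = model
    # Vocabulary size for Laplace smoothing (recomputed per call for brevity).
    vocab = len(set().union(*counts.values())) or 1
    total_docs = sum(priors.values())
    best_label, best_score = None, -math.inf
    for label in priors:
        total = sum(counts[label].values())
        score = math.log(priors[label] / total_docs)
        for gram in char_ngrams(text, min_len, max_len):
            score += math.log((counts[label][gram] + 1) / (total + vocab))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def grid_search(train_set, valid_set, lengths):
    """Evaluate every (min_len, max_len) pair; best accuracy first."""
    results = []
    for min_len, max_len in product(lengths, repeat=2):
        if min_len > max_len:
            continue
        model = train(train_set, min_len, max_len)
        hits = sum(predict(model, t, min_len, max_len) == y for t, y in valid_set)
        results.append((hits / len(valid_set), min_len, max_len))
    return sorted(results, reverse=True)

# Toy, made-up data purely to make the sketch runnable.
train_set = [("maria_92", "f"), ("xx_anna_xx", "f"), ("laura.k", "f"),
             ("john83", "m"), ("mike_the_great", "m"), ("peter_pan", "m")]
valid_set = [("anna_banana", "f"), ("johnny5", "m")]

results = grid_search(train_set, valid_set, (1, 2, 3))
for accuracy, min_len, max_len in results:
    print(f"min={min_len} max={max_len} accuracy={accuracy:.2f}")
```

With real data, the validation set should be a held-out slice of the users who already have a gender label, so that the settings are not tuned on the same rows the model learned from.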

iagovar · July 15, 2020, 12:48pm

Hmm, I’ll try that, thanks.

I’m also having this problem when writing and loading the model: ERROR in Palladian node TextClassifierModelReader
