Classifying early-2000s forum nicknames by gender

Hi all, I have the following problem, hope someone can shed some light.

I have a list of about 10k nicknames from an old forum. About half of the list has a gender attached (self-reported by the users when they registered); the other half has no gender.

I want to do some analysis based on gender, so having ~5k users without one isn’t helping. I’ve been playing with the Palladian Text Classifier Learner, without much success.

As training sources I’m using the labeled part of the list I already have (excluding the unclassified nicknames), plus a very large list of English gendered names I found on the internet.

The results are pretty poor. Many nicknames look something like [^-J0hn-^], if you know what I mean, but the classifier even misses many of the names I already provided in the sources.

I don’t really know how to solve this. I’ve been thinking about using the forum messages as well, but I don’t really know how to go about it.

Any idea?
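One cheap idea before throwing a classifier at the raw nicknames: normalize them first. This is not a Palladian feature, just a plain-Python sketch (the leet-substitution table and function name are my own invention) that undoes common "leet" substitutions and strips decoration, so that e.g. [^-J0hn-^] becomes john before it reaches the learner:

```python
import re

# Hypothetical normalizer: map common "leet" digit substitutions back to
# letters, then drop everything that isn't a letter (brackets, carets,
# dashes, underscores, ...).
LEET_MAP = str.maketrans({"0": "o", "1": "i", "3": "e",
                          "4": "a", "5": "s", "7": "t"})

def normalize_nick(nick: str) -> str:
    cleaned = nick.translate(LEET_MAP).lower()
    # Keep ASCII letters only; this strips all decoration characters.
    return re.sub(r"[^a-z]", "", cleaned)

print(normalize_nick("[^-J0hn-^]"))  # -> john
```

Applying something like this to both the nickname column and the gendered-name list should make the two sources actually overlap, which may explain why the classifier currently misses names you already provided.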

Not sure whether I fully got what you are trying to do, but one option would be to exclude the unlabeled forum users from your analysis, because trying to label them adds a whole new classification problem.

I’ve been thinking about it, but sadly I have no way around it: I really do have to classify these users by gender, or everything else I do will be for nothing.

I’m now trying a different approach using the forum messages. The problem I’m facing is that training is cancelled after a certain number of rows with a memory error. My current heap space is 26 GB and I’m writing tables to disk.

Is there any way around this? I’ve already sampled my dataset down a lot (to around 80k rows out of 2+ million), and I also run the garbage collector beforehand, just in case.

I’d really like to use a larger sample.
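For what it’s worth, when down-sampling for this kind of problem it helps to keep the gender ratio intact, so the learner sees the same class balance as the full data. KNIME’s own Row Sampling node can do stratified sampling; as a generic illustration (plain Python, all names are my own), the idea looks like this:

```python
import random

def stratified_sample(rows, label_key, n, seed=42):
    """Down-sample to roughly n rows while keeping label proportions.
    `rows` is a list of dicts; `label_key` names the gender column."""
    random.seed(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    total = len(rows)
    sample = []
    for label, group in by_label.items():
        # Each label keeps its share of the target size n.
        k = round(n * len(group) / total)
        sample.extend(random.sample(group, min(k, len(group))))
    return sample
```

Sampling per label like this means even a small sample still reflects the 2+ million-row distribution.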

I went after this poor machine on the second-hand market thinking it would give me some leeway, but it’s already suffering. At least it only cost around €200.

The Palladian Text Classifier is rather memory- and resource-efficient. It looks like you’re using it in a very, very strange way. But that’s just an assumption; it’s hard to tell without knowing what the workflow looks like.

Well, the workflow is huge (it’s like the Inception of metanodes). It’s mostly a database export from the forum’s MariaDB, and I’m applying the Learner and Predictor to the dataset to predict user gender (I already have a gender column, but not all users are labeled and I need all of them).

It works fine until it has to learn from the forum messages. I sampled the messages down to 50k rows and it works now, but it really brings that computer to its knees.

What feature settings are you using?

These ones. I needed accuracy, so that’s what I came up with.

A max. length of 20 words will generate way too many phrase combinations.

I would set this to a maximum of three, no more. To begin with, I suggest min. length 1 and max. length 2 if you use word n-grams.

Alternatively, try character n-grams, e.g. 5-grams (and decrease/increase the min/max step by step).
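To see why a max. length of 20 explodes while character n-grams stay manageable, here is a rough plain-Python sketch (Palladian’s actual feature extraction may differ; this assumes simple whitespace tokenization):

```python
def word_ngrams(text, min_n, max_n):
    """All word n-grams of length min_n..max_n (as phrase strings)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n):
    """All character n-grams of one fixed length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

msg = " ".join(["word"] * 100)          # a 100-word forum message
print(len(word_ngrams(msg, 1, 20)))    # -> 1810 phrase instances
print(len(word_ngrams(msg, 1, 3)))     # -> 297
```

So per 100-word message, min 1 / max 20 extracts 1810 phrase instances versus 297 for max 3, and most of the long phrases are unique, so the model dictionary balloons, which would match the memory problems described above.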

In fact this is a good scenario for a grid search to optimize for best results.
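The grid-search idea is not Palladian-specific; the skeleton is simply "train once per parameter combination, keep the best score." A minimal sketch in plain Python (function names and the scoring hook are placeholders for your actual train/evaluate steps):

```python
from itertools import product

def grid_search(train_fn, score_fn, grid):
    """Train a model per parameter dict in `grid`, return (best_score, params)."""
    best = None
    for params in grid:
        model = train_fn(**params)       # e.g. run the Learner with these settings
        score = score_fn(model)          # e.g. accuracy on a held-out split
        if best is None or score > best[0]:
            best = (score, params)
    return best

# Candidate settings: character n-grams of length 3..7, min <= max.
grid = [{"min_len": lo, "max_len": hi}
        for lo, hi in product(range(3, 8), repeat=2) if lo <= hi]
```

In KNIME this maps naturally onto a parameter-optimization loop around the Learner/Predictor pair, scoring each setting on a held-out partition.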

Hmm, I’ll try that, thanks.

I’m also having a problem writing and loading the model: ERROR in Palladian node TextClassifierModelReader