Hi all, I have the following problem, hope someone can shed some light.
I have a list of about 10k nicknames from an old forum. About half of this list has a gender attached to it (set by the users themselves when they registered), and the other half has no gender.
I want to do some analysis based on gender, so having ~5K users without one is a problem. I’ve been playing with the Palladian Text Classifier Learner, without much success.
As training sources I’m using the labeled part of the list I already have (that is, without the unclassified nicknames) and a very large list of gendered English names I found on the internet.
The results are pretty poor. Many nicknames look something like [^-J0hn-^], if you know what I mean, but the classifier even misses many of the ones I already provided in the sources.
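For decorated nicknames such as [^-J0hn-^], one possible preprocessing step is to strip the decoration and undo common leetspeak before the text reaches the classifier. The following is only a minimal Python sketch of that idea, not something Palladian does for you: the `LEET_MAP` table and both function names are hypothetical, and the mapping would need tuning against the real nickname list.

```python
import re

# Hypothetical leetspeak table (an assumption, not taken from the forum data).
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize_nickname(nick):
    """Lowercase, undo common leetspeak, and drop every non-letter character."""
    text = nick.lower().translate(LEET_MAP)
    return re.sub(r"[^a-z]", "", text)


def char_ngrams(text, n=3):
    """Return character n-grams of the text padded with '_' on both sides."""
    padded = "_" + text + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(normalize_nickname("[^-J0hn-^]"))  # john
print(char_ngrams("john"))               # ['_jo', 'joh', 'ohn', 'hn_']
```

One caveat with this sketch: digits that are not leetspeak, such as a birth year at the end of a nickname, get partly translated and partly dropped, so it may be worth removing trailing digit runs first.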
I don’t really know how to solve this. I’ve been thinking about using the forum messages but I don’t really know how to do it.
Not sure whether I fully got what you are trying to do, but one option would be to leave those unlabeled forum users out of your analysis, because trying to label them adds a whole new classification problem.
I’ve been thinking about it, but sadly I have no way around it. I really do have to classify these users by gender, or everything else I do will be for nothing.
I’m now trying a different approach with forum messages. The problem I’m facing is a memory error: training is cancelled after # number of rows because it runs out of memory. My current heap space is 26GB and I’m writing tables to disk.
Is there any way around this? I already sampled down my dataset a lot (to around 80k rows out of more than 2 million), and I also ran the garbage collector beforehand just in case.
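One way to shrink the message table without dropping whole users is to cap the number of messages kept per user, so that very active users do not dominate the sample. The sketch below does this in plain Python with per-user reservoir sampling. It assumes the export can be read as (user_id, message) pairs; the function name, the parameter names, and the cap of 5 are all hypothetical choices for illustration.

```python
import random
from collections import defaultdict


def cap_messages_per_user(rows, max_per_user=5, seed=42):
    """Keep at most `max_per_user` randomly chosen messages for each user.

    `rows` is any iterable of (user_id, message) pairs; this layout is an
    assumption about the forum export, not something taken from the workflow.
    """
    rng = random.Random(seed)
    kept = defaultdict(list)  # user_id -> sampled messages
    seen = defaultdict(int)   # user_id -> messages seen so far
    for user_id, message in rows:
        seen[user_id] += 1
        bucket = kept[user_id]
        if len(bucket) < max_per_user:
            bucket.append(message)
        else:
            # Reservoir sampling: the new message replaces a kept one
            # with probability max_per_user / seen[user_id].
            j = rng.randrange(seen[user_id])
            if j < max_per_user:
                bucket[j] = message
    return kept


rows = [("alice", "msg %d" % i) for i in range(100)] + [("bob", "hello")]
sample = cap_messages_per_user(rows, max_per_user=5)
print(len(sample["alice"]), len(sample["bob"]))  # 5 1
```

Because the rows are consumed one at a time, only the capped sample is held in memory, not the full 2 million messages. Whether this helps the learner itself depends on how the workflow feeds it, which the sketch does not cover.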
The Palladian Text Classifier is rather memory and resource efficient. This looks like you’re using it in a very, very strange way. But this is just an assumption, and it’s hard to tell without knowing what the workflow looks like.
Well the workflow is huge (it’s like the inception of metanodes). It’s mostly a database export from a forum MariaDB, and I’m applying the learner and predictor to a dataset to predict user gender (as I already have a column with gender label, but not all users are gendered and I need all of them).
It works fine until I have to learn from forum messages. I sampled the messages down to 50K and it works now, but it really brings that computer to its knees.