Number Filter to ignore Document title

Hi,

this time I have a quick one:

I'm using the Number Filter Node on some Documents. The titles of these Documents are numbers. After the Number Filter has done its work, the Document titles are empty/blank.

Is there a way to tell the Node to ignore Document titles, or is there some other workaround? I couldn't find any so far (except using the document extractor and string to document again, but that seems way too complicated for this case).

Thanks,

Manu

Hi Manu,

document titles are treated like any other text in documents and cannot be excluded from filtering or preprocessing. You can, however, append the original document as an additional column.

Cheers, Kilian

Hi Kilian,

thanks for your idea. I built a little workaround in which the Number Filter Node appends the original document, so the original title can be extracted from it and later combined with the extracted "number filtered text".

I attached the workflow.
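In case the attachment isn't available: the idea can be sketched roughly outside of KNIME, in plain Python. This is only an illustration under my own assumptions, not KNIME code. `Doc`, `drop_numbers`, `filter_all` and `filter_keep_title` are made-up names, and the number pattern is just a simple guess at what counts as a number. In the real workflow the "original" comes from the appended original document column.

```python
import re
from dataclasses import dataclass, replace


# Hypothetical stand-in for a document with a title and a body.
# This is NOT the KNIME document type, just an illustration.
@dataclass(frozen=True)
class Doc:
    title: str
    body: str


# Assumption: a "number" is an optionally signed run of digits,
# possibly with "." or "," separated digit groups (e.g. 3.50, 1,000).
NUMBER = re.compile(r"^[+-]?\d+([.,]\d+)*$")


def drop_numbers(text: str) -> str:
    """Remove whitespace-separated tokens that are purely numeric."""
    return " ".join(tok for tok in text.split() if not NUMBER.match(tok))


def filter_all(doc: Doc) -> Doc:
    """Filter title and body alike, as described in this thread."""
    return Doc(title=drop_numbers(doc.title), body=drop_numbers(doc.body))


def filter_keep_title(doc: Doc) -> Doc:
    """Workaround: filter everything, then restore the original title."""
    filtered = filter_all(doc)
    return replace(filtered, title=doc.title)


doc = Doc(title="42", body="Chapter 7 has 3 sections")
print(filter_all(doc))         # the numeric title is filtered away
print(filter_keep_title(doc))  # the title is taken from the original
```

The point is simply that the original document is kept next to the filtered one, so the title can be copied back afterwards.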

Btw, your paper "The KNIME Text Processing Plugin" is quite helpful for my master thesis :)

Cheers,

Manu

Thank you. What is the topic of your thesis?

Cheers, Kilian

It's about "the impact of data preprocessing and feature selection on machine learning algorithms" (text mining/text categorization). At first I thought about coding it in Python, but then I checked out KNIME and started to like it :)

Since I don't want to use only existing data sets, I started to create my own using Wikipedia articles and their categories. It's distracting me from the actual work a little bit right now (https://tech.knime.org/node/55909/view), but I find it quite interesting as well.

Cheers,

Manu

Sounds very interesting. What I can tell from text mining projects is that the impact of preprocessing and feature selection on text classification is huge, and proper filtering of features is essential to reduce the computational load.

Cheers, Kilian

Is there really no prospect for 'fixing' this? I mean, wanting to apply text processing to the body text and not the title is hardly a strange desire; in fact I suspect it is almost always the default in terms of what a user would want and expect. I know there are possible workarounds but often they are messy and sometimes they don't work.

For example, in one scenario, after having lost the titles from the documents, I extract the text using the Document Data Extractor, then turn the strings back into documents in order to reinstate the original filenames. But in some cases, the extracted strings contain concatenated words (I recently posted about this here in relation to the N Chars filter) due to some kind of tokenization glitch. (On a side note, this particular workaround would not be necessary if there were some way to insert titles into documents, but the Document Data Assigner appears to let you do everything except that. And even that would be a workaround for a problem that shouldn't exist in the first place.)

I don't want to sound ungrateful, because I love the text processing features of KNIME and I can see that they are still being developed. But this is one behaviour that has long driven me crazy and for which I still can't see the logic.