Word Parser not recognizing Line Breaks

Hi everyone,

I'm trying to extract the publishing dates from Factiva articles so I can use them for Sentiment Analysis statistics. Unfortunately, line breaks in the document are not recognized; they are simply removed when the documents are converted to strings.

This stops the Palladian/DateExtractor node from recognizing the correct date if the date stamp is followed by a number (after the line break). Here is an example; this is an article I am trying to parse:

Now this is what I get as an output from DateExtractor:

[Screenshot: DateExtractor output]

(and to the right)

[Screenshot: DateExtractor output, continued to the right]

My workflow then looks like this and shows an empty table in the end:

[Screenshot: workflow]

I am using Extract Time Window and GroupBy to exclude date guesses outside of my date range and filter duplicates, which works fine when the correct date is recognized among others.

Is it possible to teach the Word parser to insert a space when a line break is detected? That would help me greatly.
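To illustrate what I mean outside of KNIME, here is a minimal Python sketch. The sample text and both function names are made up for illustration and are not taken from a real Factiva article or from the Word Parser itself; the sketch only contrasts dropping line breaks with replacing them by a space.

```python
import re

def drop_line_breaks(text: str) -> str:
    """Mimic the behaviour described above: line breaks vanish without a replacement."""
    return re.sub(r"\r?\n", "", text)

def breaks_to_spaces(text: str) -> str:
    """Replace each run of line breaks with a single space instead."""
    return re.sub(r"(?:\r?\n)+", " ", text)

# Hypothetical header lines: a date stamp followed by a number on the next line.
sample = "12 May 2014\n350 words"

print(drop_line_breaks(sample))   # 12 May 2014350 words
print(breaks_to_spaces(sample))   # 12 May 2014 350 words
```

In the first output the year runs straight into the following number, which is the situation where the date guess goes wrong for me.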

Thanks,

Julius

 

I'm not sure how to add line breaks. It looks like a bug, but for now, why not pull out the dates separately?

You could use the Wildcard Tagger after your Word Parser: enter a regular expression that captures the date format and give it a tag name such as date. For the regular expression, try something like [0-3]?[0-9] [JFMAJSOND][a-z]{2,7}[yhletr] [1-2][0-9]{3}

Then use the BoW creator and a general term filter on the tag date. Now use the Term to String node.

Hopefully you now have the date on its own, which you can pass to the String to Date/Time node.

Hope it helps,

simon.

Hi Simon, 

Thank you so much, that works! One slight correction: the regex should be [0-3]?[0-9] [JFMAJSOND][a-z]{1,7}[yhletr] [1-2][0-9]{3}, otherwise the month of May is omitted.

Now I only have to understand how to correctly process the data so that it can be used as training data for an SVM model. I'm not getting anywhere and I'm not sure what I'm doing wrong. I have built a model with a simple dictionary word count approach, but it does not work well at all...
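For anyone who wants to check the difference outside KNIME, here is a quick sketch in Python. It assumes that Python's re module treats these character classes and quantifiers the same way as the tagger's regex engine, which I have not verified; the sample dates are made up.

```python
import re

# Pattern as first suggested: at least two letters between the capital and the final letter.
ORIGINAL = re.compile(r"[0-3]?[0-9] [JFMAJSOND][a-z]{2,7}[yhletr] [1-2][0-9]{3}")
# Corrected pattern: one letter in between is enough, so three-letter "May" fits.
CORRECTED = re.compile(r"[0-3]?[0-9] [JFMAJSOND][a-z]{1,7}[yhletr] [1-2][0-9]{3}")

# Hypothetical date stamps in the "day month year" layout.
samples = ["3 May 2014", "21 January 2013", "7 September 2012"]

for s in samples:
    print(s, bool(ORIGINAL.fullmatch(s)), bool(CORRECTED.fullmatch(s)))
# 3 May 2014 False True
# 21 January 2013 True True
# 7 September 2012 True True
```

"May" has only three letters (capital, one middle letter, final letter), so a minimum of two middle letters can never match it.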

Hi Julius,

What exactly do you want to do? Train a classifier to predict the classes of documents? Here is an example for document classification: http://tech.knime.org/document-classification-example

Basically, filter and preprocess the documents, convert them into a bag of words first and then into document vectors. Finally, train a classifier on the numerical (or binary) document vectors.
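To make the bag-of-words and document-vector steps concrete, here is a rough sketch in plain Python rather than KNIME nodes. The function names and the two toy documents are invented for illustration, the preprocessing is reduced to lower-casing and stripping punctuation, and the classifier step is left out.

```python
from collections import Counter

def bag_of_words(text):
    """Split on whitespace, strip surrounding punctuation, lower-case, count terms."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return Counter(t for t in tokens if t)

def document_vectors(docs, binary=False):
    """Turn documents into vectors over a shared, sorted vocabulary."""
    bags = [bag_of_words(d) for d in docs]
    vocabulary = sorted(set().union(*bags))
    vectors = []
    for bag in bags:
        row = [bag[term] for term in vocabulary]  # a missing term counts as 0
        if binary:
            row = [1 if count > 0 else 0 for count in row]
        vectors.append(row)
    return vocabulary, vectors

# Toy documents, purely illustrative.
docs = ["Profits rise, shares rise.", "Shares fall."]

vocab, counts = document_vectors(docs)
print(vocab)    # ['fall', 'profits', 'rise', 'shares']
print(counts)   # [[0, 1, 2, 1], [1, 0, 0, 1]]

_, bits = document_vectors(docs, binary=True)
print(bits)     # [[0, 1, 1, 1], [1, 0, 0, 1]]
```

Each row is one document and each column one term; rows like these, together with the class labels, are what the classifier is trained on.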

Cheers, Kilian