DateExtract not working properly

julius.c · April 25, 2014, 7:24pm

Hi Everyone,

I'm trying to extract the publishing dates from Factiva articles so I can use them for Sentiment Analysis statistics.

They all have the same format and are split into one .doc file per article. I read them in with the Text Processing Word Parser, which works fine. Unfortunately, the Date Extract node not only returns many wrong dates (which I have worked around using "Extract Time Window" and "Group By"), but also fails to find some of the correct dates. As a result, the "Extract Time Window" node omits every article that has no date in the range, which reduces the dataset by about 20-30%.

I believe the problem lies with the Word Parser. Its description says that paragraphs are taken into account, but in reality they are not: there is not even a space between paragraphs, which makes the dates unreadable for the DateExtractor.

Is there any way I can force KNIME to look up the correct date? They are always in the format "21 May 2013" and are the only dates formatted in that way.

What am I doing wrong?

Any help is greatly appreciated.

qqilihq · April 25, 2014, 10:13pm

Hi,

I've seen that I introduced a bug due to some recent refactoring in the DateExtractorNode (month was shifted by one; in your example 21 May 2013 was erroneously parsed to "21.06.2013"). This is solved now (update available in tomorrow's nightly build of the nodes).

Besides the issue with the shifted month, are there any other problems you encounter? If so, please report them here; we're constantly improving the date parsing logic.

Best,
Philipp

julius.c · April 25, 2014, 11:37pm

Hi Philipp,

First of all, thank you for your quick reply. I will be sure to test the new build as soon as it is out. Let me give you an example of the problem I am encountering. I am aware that part of the problem lies with the Word Parser, but it would be great if I could find a workaround.

For example, I tried to parse this article:

Now this is what I get as an output from DateExtractor:

(and to the right)

My workflow then looks like this and shows an empty table in the end:

As you can see, the Word parser put together the date and the time of publication without as much as a space between, making it unreadable for your node. Also, the second time stamp at the bottom, 24-01-13, is not recognized at all.

It would be great if you had some advice on how I can prevent this. Unfortunately, with over 20,000 articles to parse, I can't edit every single one so that the node can read it...

Thanks again!

Julius

qqilihq · April 26, 2014, 12:17pm

Hi Julius,

Thank you for the detailed description. To have a closer look at the issue, it would be helpful to have the sample data you posted. Would it be possible to attach the workflow with the data here?

Best,
Philipp

qqilihq · April 26, 2014, 12:45pm

[update]

I just tried with a sample Word file and I can now see your issue. The problem is that the line breaks in the Word file are removed by the "Word Parser" node, and no whitespace is introduced when converting the Document cells to Strings. I would suggest contacting Kilian Thiel (author of the KNIME Text Processing nodes) or opening a thread in the Text Processing subforum here.
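As a stopgap until the parser issue is sorted out, and since your dates always look like "21 May 2013", a regular expression that does not insist on whitespace around the date might be enough. Below is a rough Python sketch of that idea. It is not something the Palladian nodes do; `find_dates` is a hypothetical helper, and I'm assuming English month names and glued input along the lines of "614 words21 May 201314:05".

```python
import re
from datetime import date

# Month names mapped to their numbers (assumes English month names).
MONTHS = {name: number for number, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

# Deliberately no \b anchors: the date may be glued to neighbouring text
# because the line breaks were dropped without inserting whitespace.
DATE_RE = re.compile(r"(\d{1,2})\s+(" + "|".join(MONTHS) + r")\s+(\d{4})")


def find_dates(text):
    """Return all 'D Month YYYY' dates in text, skipping impossible ones."""
    found = []
    for day, month, year in DATE_RE.findall(text):
        try:
            found.append(date(int(year), MONTHS[month], int(day)))
        except ValueError:
            # e.g. day 45 or 31 February: not a real calendar date
            pass
    return found


print(find_dates("614 words21 May 201314:05Reuters News"))
```

An impossible match such as "45 May 2013" (which can happen when a preceding number runs into the day) is dropped by the `date(...)` check rather than reported.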

Besides that, as mentioned above: since we're constantly improving the date parsing capabilities in Palladian, feel free to keep me updated in case certain date formats are not recognized correctly or at all.

Best,
Philipp

julius.c · April 26, 2014, 1:13pm

Hi Philipp,

Thanks for your help and for trying it out. Unfortunately, for copyright reasons, I can't post the sample data here, as I would risk getting in trouble with my university and Factiva. I will try to contact Kilian Thiel about the line break issue.

As for formats not identified, two come to my mind which are also used in my sample document:

1. YY-MM-DD / DD-MM-YY (obviously it is hard to tell which position holds which component, but it would already be helpful if all combinations were recognized)

2. YYYYMMDD
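To make the idea of "all combinations" concrete, here is a rough Python sketch of what I have in mind. It is not Palladian code; `candidate_dates` and its `century` default are names and assumptions I made up for illustration. For an ambiguous token it simply returns every reading that is a valid calendar date.

```python
import re
from datetime import date


def candidate_dates(token, century=2000):
    """Return every valid reading of a numeric date token, sorted.

    Handles YYYYMMDD as well as the ambiguous YY-MM-DD / DD-MM-YY.
    Two-digit years are assumed to belong to `century` (an assumption).
    """
    compact = re.fullmatch(r"(\d{4})(\d{2})(\d{2})", token)
    dashed = re.fullmatch(r"(\d{2})-(\d{2})-(\d{2})", token)
    if compact:
        year, month, day = map(int, compact.groups())
        readings = [(year, month, day)]
    elif dashed:
        a, b, c = map(int, dashed.groups())
        readings = [(century + a, b, c),   # read as YY-MM-DD
                    (century + c, b, a)]   # read as DD-MM-YY
    else:
        return []

    valid = set()
    for year, month, day in readings:
        try:
            valid.add(date(year, month, day))
        except ValueError:
            pass  # this reading is not a real calendar date
    return sorted(valid)


print(candidate_dates("24-01-13"))
print(candidate_dates("20130521"))
```

For "24-01-13" both readings are valid dates, so a later step (such as the time window filter) would still have to pick the plausible one.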

I am not an expert in node programming, so I do not know how hard it would be to implement the recognition of these formats, but it would be amazing if you found a way.

Again, thanks for taking so much time and trying to identify the problem.

Cheers,

Julius
