Rule Engine doesn't work with text pasted from PDF into Excel

Cadu · August 19, 2013, 1:37am

Hi,

I pasted article abstracts (text) from PDFs into Excel. It seems that the 'Rule Engine' node isn't working because of the hard returns that come from the PDF formatting.

For instance, the rule is to return '1' when:

$Abstract$ LIKE "*actor-network*"

But the output is '0'. When I adjusted the text by hand (deleting the hard returns), the 'Rule Engine' worked. The problem is that I have hundreds of abstracts, and it would be very tedious and time-consuming to make the adjustment by hand.

Is there any way in KNIME to properly format the text pasted from PDF into Excel so that the 'Rule Engine' works on it?

I attached the workflow in which the problem can be noticed.

Many thanks in advance,

Cadu

text_from_pdf_to_excel.zip

kilian.thiel · August 19, 2013, 10:14am

Hi Cadu,

You can use the String Replacer node to replace substrings. In the dialog, select "Regular expression", enter the expression [\r\n]+ as the pattern, and use a single space as the replacement string. Attached is your workflow with a String Replacer example.
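For readers who want to see the effect of the expression outside KNIME, here is a minimal Python sketch. The `like` helper is a hypothetical stand-in for a wildcard match in which `*` does not cross line breaks; it is an assumption for illustration and not the Rule Engine's actual implementation. The sample abstract is invented.

```python
import re

# Same expression as in the String Replacer dialog: one or more CR/LF characters.
LINE_BREAKS = re.compile(r"[\r\n]+")

def normalize_line_breaks(text: str) -> str:
    """Replace each run of carriage returns / line feeds with a single space."""
    return LINE_BREAKS.sub(" ", text)

def like(text: str, pattern: str) -> bool:
    """Hypothetical LIKE-style match: '*' matches any characters except a line feed.

    This only approximates wildcard matching for illustration; it is an
    assumption, not the Rule Engine's real code.
    """
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, text) is not None

# Invented example of an abstract pasted from a PDF with hard returns.
pasted = "This paper draws on\r\nactor-network theory to\r\nstudy innovation."
cleaned = normalize_line_breaks(pasted)

print(like(pasted, "*actor-network*"))   # False: the wildcard stops at the line break
print(like(cleaned, "*actor-network*"))  # True: the text is now a single line
```

Note that a term hyphenated across a line break in the PDF (for example "actor-" at the end of one line and "network" on the next) would become "actor- network" after this replacement and would still not match.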

Cheers, Kilian

text_from_pdf_to_excel.zip

Cadu · August 19, 2013, 11:01pm

Wow Kilian, this worked great. Many thanks for sharing the example.

An additional point:

The 'abstract' column has mixed content formatting. That is, some content came from databases (e.g. Scopus, Web of Science) and is ready to use in terms of formatting, and some content was inserted by hand (the latter was copied and pasted from PDFs, and the 'String Replacer' is working on it).

Will the 'String Replacer' act only on the content from PDFs and leave the content from the databases intact? Is there any problem with using the 'String Replacer' on a column with mixed content formatting?

Cheers,

Cadu

kilian.thiel · August 19, 2013, 11:11pm

Hi Cadu,

The String Replacer replaces strings as you specify, no matter where the strings originally come from. The replacement rule is applied to all strings in the specified column. If you want to apply it only to a subset of rows, you have to filter the rows beforehand, e.g. with the Row Filter node.
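As a rough Python analogy for restricting the replacement to a subset of rows, the sketch below applies it only where a predicate holds. The `Source` column and the sample rows are hypothetical and only serve the illustration; in KNIME the selection would be done with nodes such as the Row Filter.

```python
import re

# Hypothetical table: a 'Source' column is assumed here only for illustration.
rows = [
    {"Source": "Scopus", "Abstract": "Actor-network theory in practice."},
    {"Source": "PDF", "Abstract": "A study of\nactor-network\ntheory."},
]

def replace_in_subset(table, predicate):
    """Apply the line-break replacement only to rows selected by the predicate."""
    return [
        {**row, "Abstract": re.sub(r"[\r\n]+", " ", row["Abstract"])}
        if predicate(row)
        else row  # rows outside the subset are passed through untouched
        for row in table
    ]

result = replace_in_subset(rows, lambda row: row["Source"] == "PDF")
for row in result:
    print(row["Source"], "|", row["Abstract"])
```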

Cheers, Kilian

Cadu · August 19, 2013, 11:42pm

Hi Kilian,

Actually, I don't need to filter. It is best for the workflow to apply the 'String Replacer' to the whole column with mixed formatting. I was just concerned that the 'String Replacer' could cause 'harm' to the data that is already formatted and ready to use (and does not need the 'String Replacer'). If I understood correctly, the 'String Replacer' will act only on the content pasted from PDFs (part of the rows) and will not 'harm' the other 'ready' content.

Thank you,

Cadu

kilian.thiel · August 20, 2013, 10:03am

Hi Cadu,

In the dialog of the String Replacer, a string column can be specified. The replacement rule is applied to all cells in this column. Cells that contain no line breaks do not match the expression and are therefore left unchanged.
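A small Python sketch of what this means for a mixed column, using invented sample abstracts: the same expression is applied to every cell, and only cells that actually contain line breaks change.

```python
import re

def replace_line_breaks(cell: str) -> str:
    """Replace each run of CR/LF characters with a single space."""
    return re.sub(r"[\r\n]+", " ", cell)

# Invented sample column mixing database exports and PDF-pasted text.
abstracts = [
    "Clean abstract exported from a database.",
    "Abstract pasted\r\nfrom a PDF with\r\nhard returns.",
]

cleaned = [replace_line_breaks(cell) for cell in abstracts]

for before, after in zip(abstracts, cleaned):
    status = "changed" if before != after else "unchanged"
    print(f"{status}: {after}")
```

One caveat under this approach: if a database abstract happens to contain intentional line breaks (for example between paragraphs), those are replaced with spaces as well, since the rule does not know where a cell came from.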

Cheers, Kilian
