Sentence Extractor using Regex

ricknime · April 5, 2022, 2:31pm

Hello,

I’m creating a prototype to create an auto reference guide.

I’m trying to create several rows from a set of documents (PDF, Word, text). I used Tika Parser to load the files.

I found some guides in the forum, but I haven’t been able to achieve the objective. (I tried Palladian, Regex Extractor, but it takes a long time to respond. I don’t know whether it has an option to set a maximum number of lines to test, rather than trying to resolve the regex over the whole text at once.)

The idea is to create multiple rows selected via regex, daisy-chaining splits to reach the detailed record so that a database can be updated.

Then, by backtracking through the parsed hierarchy, the reference can be displayed. (This is not part of the KNIME process.)

Taxonomy

Law
Book
Chapter
Article
Fraction
Paragraph

The same approach would apply to processing WhatsApp multiline messages.

I don’t know if there is an equivalent of the “Sentence Extractor” node that creates a row at each regex match instead of at each PERIOD.

I’ve been trying several options, for instance applying:

(?=^\bt.tulo\b)([\s\S]*?)(?=\s^\bt.tulo\b|\Z)

Then, in the next step, apply a split for the chapter, and so on.

(?=^\bcap.tulo\b)([\s\S]*?)(?=\s^\bcap.tulo\b|\Z)

I’m looking for a simple way to match, create groups, and turn them into rows. The “Regex Split” node fails, I guess because it doesn’t work with a multiline selection.

I know that it is possible to process the files in another environment (readln … or third-party extractors).

Another option is to try to hack the “Sentence Extractor” node and use a regex instead of the search for a PERIOD followed by CR\LF. (That is my guess at what the node does to split and create the row.)

Any ideas? Maybe there is a node that I haven’t tried.
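For comparison outside KNIME, here is a minimal Python sketch of the same pattern. It assumes the multiline and case-insensitive flags are what the pattern needs (`^` only matches at line starts when multiline mode is on), and it normalizes the text to NFC so that `.` can stand in for a single accented character. The sample text and the function name `split_titulos` are made up for illustration.

```python
import re
import unicodedata

# Same pattern as above. MULTILINE lets ^ match at every line start,
# IGNORECASE lets "t.tulo" match "TÍTULO" as well as "Título".
TITULO = re.compile(
    r"(?=^\bt.tulo\b)([\s\S]*?)(?=\s^\bt.tulo\b|\Z)",
    re.MULTILINE | re.IGNORECASE,
)


def split_titulos(text):
    """Return one string per block that starts at a line beginning with 'título'."""
    # NFC keeps "Í" as one code point, so the single "." in the pattern can match it.
    text = unicodedata.normalize("NFC", text)
    return [m.group(1).strip() for m in TITULO.finditer(text)]


# Made-up sample text, only for illustration.
sample = (
    "TÍTULO PRIMERO\n"
    "Disposiciones generales\n"
    "CAPÍTULO I\n"
    "Artículo 1. Texto uno.\n"
    "TÍTULO SEGUNDO\n"
    "CAPÍTULO II\n"
    "Artículo 2. Texto dos.\n"
)

blocks = split_titulos(sample)
for block in blocks:
    print(repr(block))
```

Each returned block could then become one row. Anything before the first heading is dropped by this pattern, which may or may not be desirable.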

Illustrative images.

[Images: Prototype; Laws; Legal ref; Article, Fraction, Chapter, Law]

Other Law Taxonomy Guide - Introduction to Basic Legal Citation

victor_palacios · April 5, 2022, 2:44pm

Could you provide:

(1) A CSV or Excel file with the text from the pictures you supplied? (I’m currently using a Mac, so I can’t use OCR on the pictures you shared.)

(2) The workflow (.knar file) showing what you have done?

(3) expected output from a concrete example (1 or 2 will be sufficient).

Thank you~

For multiline issues, I recommend this thread: Multiline in the Regex Split node - #8 by armingrudd

After getting (1), (2), and (3), I may be able to help you get your desired output.

ricknime · April 5, 2022, 3:07pm

Thank you, Victor, for your answer.

The files can be downloaded in PDF or DOCX format from: https://www.diputados.gob.mx/LeyesBiblio/index.htm

Any document with structure could work.

Nodes: Tika Parser → String to Document → Sentence Extractor

Word tokenizer: StanfordNLP Spanishtokenizer (my guess is that the regular English one could also work)

The output is quite simple:
Regex 1

(?=^\bt.tulo\b)([\s\S]*?)(?=\s^\bt.tulo\b|\Z)

Regex 2

(?=^\bart..ulo\b)([\s\S]*?)(?=\s^\bart..ulo\b|\Z)

My guess is that any text with a structured taxonomy could be split this way.
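To make the daisy chain concrete, here is a rough Python sketch that applies Regex 1 and then Regex 2 to each resulting block, emitting one row per article. The helper names (`split_by_heading`, `law_to_rows`) and the sample text are hypothetical, and the multiline and case-insensitive flags are an assumption about how the patterns are meant to run.

```python
import re


def split_by_heading(text, heading):
    """Return the blocks of `text` that start at a line beginning with `heading`.

    `heading` is a regex fragment such as r"t.tulo" or r"art..ulo".
    """
    pattern = re.compile(
        rf"(?=^\b{heading}\b)([\s\S]*?)(?=\s^\b{heading}\b|\Z)",
        re.MULTILINE | re.IGNORECASE,
    )
    return [m.group(1).strip() for m in pattern.finditer(text)]


def law_to_rows(text):
    """Chain two splits: first by título, then by artículo inside each título."""
    rows = []
    for titulo in split_by_heading(text, r"t.tulo"):
        titulo_heading = titulo.splitlines()[0].strip()
        for articulo in split_by_heading(titulo, r"art..ulo"):
            rows.append({"titulo": titulo_heading, "articulo": articulo})
    return rows


# Made-up sample text, only for illustration.
sample = (
    "TÍTULO PRIMERO\n"
    "Artículo 1. Texto uno.\n"
    "Artículo 2. Texto dos.\n"
    "TÍTULO SEGUNDO\n"
    "Artículo 3. Texto tres.\n"
)

rows = law_to_rows(sample)
for row in rows:
    print(row)
```

Because each row keeps the heading of its parent block, the reference can be rebuilt later. One caveat: a wrapped body line that happens to start with "artículo" would also trigger a split, so real documents may need a stricter heading pattern.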

Table of contents
PART I. THE OECD PRIVACY GUIDELINES … 9
Chapter 1. Recommendation of the Council concerning Guidelines
governing the Protection of Privacy and Transborder Flows of
Personal Data (2013) … 11
Part One. General … 13
Part Two. Basic principles of national application … 14
Part Three. Implementing accountability … 16
Part Four. Basic principles of international application: Free flow and
legitimate restrictions … 16
Part Five. National implementation … 17
Part Six. International co-operation and interoperability … 17
Chapter 2. Supplementary explanatory memorandum to the revised
recommendation of the council concerning guidelines governing the
protection of privacy and transborder flows of personal data (2013) … 19
Introduction… 19
Context of the review … 19

PART
CHAPTER
…
Paragraph
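A line-by-line alternative to nested regex splits is to remember the most recent heading at each level and attach it to every paragraph. The sketch below assumes upper-case `PART` marks the top level and `Chapter` the next one. The `LEVELS` table, the function name, and the paragraph texts are placeholders, not taken from the real document.

```python
import re

# Ordered from the highest level to the lowest. "PART" is matched in upper
# case only, so lower-level headings such as "Part One." are not confused with it.
LEVELS = [
    ("part", re.compile(r"^PART\s+\S+")),
    ("chapter", re.compile(r"^chapter\s+\S+", re.IGNORECASE)),
]


def label_paragraphs(lines):
    """Attach the current PART and CHAPTER heading to every non-heading line."""
    context = {name: None for name, _ in LEVELS}
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for depth, (name, pattern) in enumerate(LEVELS):
            if pattern.match(line):
                context[name] = line
                # A new heading resets every deeper level.
                for deeper, _ in LEVELS[depth + 1:]:
                    context[deeper] = None
                break
        else:
            # Not a heading: emit a row carrying the current context.
            rows.append({**context, "paragraph": line})
    return rows


# Headings taken from the table of contents above; paragraph texts are placeholders.
lines = [
    "PART I. THE OECD PRIVACY GUIDELINES",
    "Chapter 1. Recommendation of the Council",
    "First paragraph text.",
    "Chapter 2. Supplementary explanatory memorandum",
    "Second paragraph text.",
]

rows = label_paragraphs(lines)
for row in rows:
    print(row)
```

Each row then carries its full path (PART, CHAPTER), which is what the backtracking step needs to display a reference.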

victor_palacios · April 5, 2022, 5:43pm

After much thought, I don’t think KNIME is suitable for this kind of work. First, I also ran into issues with Regex Split when dealing with the PDF data. So I switched to Python’s regex split (using the Python Script node), which was also not great on the PDF text. I then switched to the Word document, and that worked better with Python’s regex split. If you are familiar with JavaScript, you can also use the Column Expressions node to write code directly.
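For readers who cannot open the workflow, here is a minimal sketch of what a regex split in Python can look like. It is not necessarily what the attached workflow does. It assumes Python 3.7 or later (where `re.split` accepts zero-width matches) and that the document text is already available as a single string. The sample text and the name `split_chapters` are made up.

```python
import re

# Zero-width match at the start of every line that begins with "capítulo".
# Splitting on it keeps the heading inside the block that follows.
CHAPTER_START = re.compile(r"^(?=cap.tulo\b)", re.MULTILINE | re.IGNORECASE)


def split_chapters(text):
    """Split `text` before each chapter heading and drop empty pieces."""
    parts = CHAPTER_START.split(text)
    return [part.strip() for part in parts if part.strip()]


# Made-up sample text, only for illustration.
sample = (
    "Preámbulo\n"
    "CAPÍTULO I\n"
    "Artículo 1. Texto uno.\n"
    "CAPÍTULO II\n"
    "Artículo 2. Texto dos.\n"
)

chapters = split_chapters(sample)
for chapter in chapters:
    print(repr(chapter))
```

Unlike a match-based extraction, the split keeps any text before the first heading (here the preamble) as its own piece.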

Going deeper into each generated section will require more work. Here is what I can provide as a starter in case you want to try:

structured_taxonomy.knar.knwf (115.0 KB)

ricknime · April 6, 2022, 7:43am

Thank you Victor,

From here I can take it. I will study more about how to integrate Python or other programming tools.

Sometimes ETLs make us lazy. (If it already exists, why do I need to program it again?)
