Extract specific information from a profile description

Foxyellow · September 1, 2020, 11:04am

Hi everyone,

I am trying to extract information about a profile from a document:
first and last name, address, phone number, e-mail, age…

Some of this information is difficult to extract because it can be written in different ways. For example, the first name and the last name may each start with a single uppercase letter, or the last name may be written entirely in capital letters while only the first letter of the first name is uppercase…

Even if we cover all the possibilities, in the end we don’t know whether a match really corresponds to a proper name.
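To illustrate the name variants described above, here is a minimal Python sketch. The pattern and the function name `name_candidates` are assumptions for illustration, not something from the thread. It also shows why a regex alone cannot confirm that a match is a proper name.

```python
import re

# Hypothetical pattern: a capitalized first name followed by a last name that
# is either capitalized ("Smith") or written fully in upper case ("SMITH").
NAME_RE = re.compile(r"\b([A-Z][a-z]+)\s+([A-Z][a-z]+|[A-Z]{2,})\b")

def name_candidates(text):
    """Return (first, last) pairs that look like names. These are candidates only."""
    return NAME_RE.findall(text)

print(name_candidates("Contact: John SMITH or Mary Jones."))
# The same pattern also matches ordinary capitalized words such as a school name.
print(name_candidates("East Lake High"))
```

The second call returns a false positive, so candidates found this way would still need to be checked against context (for example a nearby "Name:" label) or a list of known first names.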

The same goes for addresses, which can be written in several different ways.

Thank you for your help.

ipazin · September 1, 2020, 1:00pm

Hi @Foxyellow,

what kind of document are you talking about? Does one document correspond to one profile, or are there multiple profiles in one document?

Br,
Ivan

Foxyellow · September 1, 2020, 1:37pm

Thank you for your reply @ipazin

It can be a bill, a cover letter, a business card…

Yes, one document corresponds to one profile.

ipazin · September 2, 2020, 10:45am

Hi @Foxyellow,

in what format are those bills, letters, cards…? PDF or?

Br,
Ivan

Foxyellow · September 2, 2020, 10:47am

Yes they are in PDF format.

ipazin · September 2, 2020, 11:00am

Hi @Foxyellow,

then you need to use the KNIME Textprocessing Extension to build your workflow with the appropriate logic. Here are two similar topics:

As I’m not into text processing, I don’t believe I can help you much more, but if you share a couple of example documents, maybe someone can give it a try. You can also look for similar workflows on the KNIME Hub. Hope this helps!

Br,
Ivan

ipazin · September 3, 2020, 8:53am

Hi @Foxyellow,

Changing the extension to .txt, for example, and then uploading should work. However, whoever downloads the file will need to change the extension back to .pdf in order to read it.

Br,
Ivan

Foxyellow · September 3, 2020, 9:07am

Thank you for the answer. I took a look at those topics and tried to reproduce the work that was done there, but I am having difficulties.

So, I will explain my approach.
I have different college profiles in PDF format. These profiles contain different sections
such as Purpose, Program, Structure, Grading… So I thought I could extract this information by splitting the text on the section names.
I use the regex extractor for that, and I specify that I want to capture everything between the Purpose section and the Program section.
I’ve used a regex like (Purpose(?s). )(Program(?s). )

I tried this method with only one document and it works: I was able to separate the sections. But with multiple documents it doesn’t work, because the order of the sections can differ. In one college profile, the Program section comes first and the Purpose section comes after it.

College Profile-Parish.txt (156.7 KB)

So I don’t know how to proceed.
After that, I will still have difficulty identifying information such as last name, first name and address.

If anyone can help me, that would be really nice.

Thank you very much.

armingrudd · September 3, 2020, 8:09pm

Hi @Foxyellow,

Would you please provide 2 or 3 different files?

Foxyellow · September 4, 2020, 7:25pm

Hi @armingrudd

Of course,

Here are two other files from different organisations. East Lake High School Profile 14-15.txt (89.2 KB) College Profile Prep.txt (19.9 KB)

Thank you very much

armingrudd · September 5, 2020, 3:16am

What are the file extensions for these 2 new files? I cannot convert them to PDF (the last file was successfully converted).

Foxyellow · September 5, 2020, 9:57am

I’m sorry about that.
College Profile Prep.txt (105.6 KB) School Profile 14-15 East Leak.txt (101.6 KB)

Can you try to convert them and tell me if it works?

Thank you.

Foxyellow · September 6, 2020, 7:15pm

I’m not an expert in regex, so I tried to specify the different orders by separating them with |,
but it produces horrible code.

In addition, covering all the possible combinations leads to a lot of alternatives, so this is clearly not a good solution.
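One way to avoid enumerating every possible order is to locate each heading independently and take the text up to the next heading, whatever it is. Below is a minimal Python sketch of that idea. It is not the KNIME workflow discussed in this thread, and it assumes that each section name sits alone on its own line; the function name `split_sections` is hypothetical.

```python
import re

# Section names taken from the thread; extend the list as needed.
SECTIONS = ["Purpose", "Program", "Structure", "Grading"]

def split_sections(text, names=SECTIONS):
    """Map each heading found in `text` to the text that follows it, up to
    the next heading, regardless of the order in which headings appear."""
    # Assumption: a heading is a known section name alone on its line.
    pattern = re.compile(
        r"^[ \t]*(" + "|".join(map(re.escape, names)) + r")[ \t]*$",
        re.IGNORECASE | re.MULTILINE,
    )
    matches = list(pattern.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # The section body ends where the next heading starts (or at end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1).title()] = text[m.end():end].strip()
    return sections

doc_a = "Purpose\nTo inform.\nProgram\nFour years.\n"
doc_b = "Program\nFour years.\nPurpose\nTo inform.\n"
print(split_sections(doc_a) == split_sections(doc_b))
```

Because the result is keyed by section name, the two sample documents give the same dictionary even though their sections appear in a different order.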

armingrudd · September 6, 2020, 8:41pm

Here is the workflow to extract name, email, other info (phone, fax, address, etc.), program, purpose, structure, curriculum and grading.

26440-1-1.knwf (503.5 KB)

Some issues may appear while extracting “name” in some other files. You can send me those files and I will update the workflow for you.

Foxyellow · September 8, 2020, 8:37pm

Hi @armingrudd

Wow, what you’ve done is just amazing. It’s exactly what I want to do. To be honest, there are some parts that are difficult for me to understand, but it’s a very good workflow.

I would like to thank you for the time you invested.

One more difficulty has come to mind. It’s about the section names, like Purpose, Program…

I tried the workflow with other examples, and I forgot to mention that the name of a section can vary. For example, instead of Grading it may be Grades or Ranking. Also, instead of Structure it can be Equipements…

A school’s profile can also have other sections, like Community, Awards and Distinctions…

I was thinking about creating variables for these section names, but I can’t manage to isolate them.

I tried a regex like this:
\n[A-Z]*\n but it’s not catching the right thing.

I need to learn more about Regex.
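A note on the pattern above: [A-Z]* only matches uppercase letters, so it cannot match a heading that contains lowercase letters or spaces (such as "Awards and Distinctions"), and because * also allows zero characters, it matches empty lines. The Python sketch below shows one possible alternative. It assumes that a heading is a short line made only of letters, spaces and "&" that starts with a capital letter; the name `heading_candidates` and the word limit are hypothetical.

```python
import re

# Assumption: a heading line contains only letters, spaces and '&',
# and starts with a capital letter. ^ and $ match per line (MULTILINE).
HEADING_RE = re.compile(
    r"^[ \t]*([A-Z][A-Za-z&]*(?:[ \t]+[A-Za-z&]+)*)[ \t]*$",
    re.MULTILINE,
)

def heading_candidates(text, max_words=4):
    """Return short letter-only lines that could be section headings."""
    return [h for h in HEADING_RE.findall(text) if len(h.split()) <= max_words]

sample = (
    "PURPOSE\n"
    "We educate students.\n"
    "\n"
    "Awards and Distinctions\n"
    "National Merit finalists in 2014.\n"
)
print(heading_candidates(sample))
```

Lines that end with punctuation or contain digits are skipped, but a short sentence without punctuation would still be returned, so the output is a list of candidates to review rather than a definitive list of headings.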

I’m sorry to ask for your time again, but you’ve built a very impressive workflow, and if I can resolve this difficulty, it would be great.

Again, thank you for your help.

armingrudd · September 9, 2020, 3:45am

Please send a few more files if possible. I’m thinking of a more general approach to solve your issue, one that covers different cases.
If not, here are my suggestions to improve the workflow:

Try to rename the sections using regex, e.g. Grades to Grading.
Add the other sections to the node’s expression (String Manipulation - Node 2), but later use a Column Filter to remove them.
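The renaming suggestion can be sketched outside KNIME as follows. This is a Python illustration only, not the String Manipulation expression used in the workflow. The synonym table is an assumption built from the variants mentioned earlier in the thread, and `normalize_headings` is a hypothetical name.

```python
import re

# Variants mentioned in the thread, mapped to one canonical section name.
# The mapping itself is an assumption for illustration.
CANONICAL = {
    "Grading": ["Grading", "Grades", "Ranking"],
    "Structure": ["Structure", "Equipements"],
}

def normalize_headings(text, canonical=CANONICAL):
    """Replace heading variants with their canonical name.
    Only lines consisting of the variant alone are rewritten."""
    for target, variants in canonical.items():
        pattern = re.compile(
            r"^[ \t]*(?:" + "|".join(map(re.escape, variants)) + r")[ \t]*$",
            re.IGNORECASE | re.MULTILINE,
        )
        text = pattern.sub(target, text)
    return text

print(normalize_headings("Grades\nA-F scale\nEquipements\nTwo labs\n"))
```

Anchoring the pattern to whole lines keeps ordinary sentences such as "Grades are weighted." untouched, so only headings are renamed before the sections are extracted.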

Foxyellow · September 9, 2020, 7:19pm

Here are 3 new files with similar section names.

Pingree Profile.txt (109.6 KB) Pace College Profile.txt (397.6 KB) Berkshire School.txt (211.6 KB)

Thank you

Foxyellow · September 15, 2020, 11:01am

Hi,

I am trying to resolve this issue using your suggestion.

So, I was thinking about extracting the section names using a regex based on a common stem. The goal is to catch all the possible names for a section.

For example, for the section about Grade/Grading/Graduation… the names have the letters “Grad” in common. So maybe we can catch the section using this pattern.

I know it can return many results which are not what we want, but it may be a good start.
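The stem idea could look like the Python sketch below. It assumes the heading lines have already been isolated; the stem table and the name `classify_heading` are hypothetical.

```python
import re

# Hypothetical stems per section. "grad" comes from the post above,
# "rank" from the Ranking variant mentioned earlier in the thread.
STEMS = {"grading": ["grad", "rank"]}

def classify_heading(line, stems=STEMS):
    """Return the section whose stem starts the heading line, or None."""
    word = line.strip().lower()
    for section, prefixes in stems.items():
        # re.match anchors at the start of the string, so the stem must
        # begin the heading; \w* accepts any ending (Grades, Grading, ...).
        if any(re.match(re.escape(p) + r"\w*", word) for p in prefixes):
            return section
    return None

for heading in ["Grades", "GRADUATION REQUIREMENTS", "Program"]:
    print(heading, "->", classify_heading(heading))
```

As noted in the post, a stem is deliberately loose: a heading such as "Graduate outcomes" would also be classified as grading, so the result is a starting point to refine rather than a final answer.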

I have another issue, which is about reading the files. I don’t know if it’s the PDF font or an encoding problem, but in some files there are characters which are not recognized and are replaced by symbols.

I also have words with spaces inside.

For example: P U R P O S E or PROGRA
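For letter-spaced words such as P U R P O S E, one possible cleanup step is to join runs of single letters before looking for headings. The Python sketch below assumes the letters are separated by exactly one space and that a run has at least three letters; `collapse_spaced_letters` is a hypothetical name. It does not address the unrecognized characters, which depend on how the PDF was parsed.

```python
import re

# A run of three or more single letters separated by single spaces,
# e.g. "P U R P O S E". Two-letter runs such as "A B" are left alone.
SPACED_RE = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")

def collapse_spaced_letters(text):
    """Join letter-spaced words back into normal words."""
    return SPACED_RE.sub(lambda m: m.group(0).replace(" ", ""), text)

print(collapse_spaced_letters("P U R P O S E of the school"))
# prints "PURPOSE of the school"
```

One limitation of this sketch: two letter-spaced words in a row that are separated by only a single space would be merged into one word, so the spacing in the extracted text needs to be checked first.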

Thank you !

armingrudd · September 17, 2020, 7:08am

Hi @Foxyellow,

Sorry for the late reply. With your latest examples the task has become trickier. One problem, as you said, is parsing the PDF files. I’m working on an approach to parse them using external tools or APIs. Even then, we still have issues with the different content formats.

I need more time to work on it when I’m free. Thank you for your patience.

Foxyellow · September 20, 2020, 1:06pm

Hi @armingrudd

Don’t worry, I appreciate your effort. Take your time.