I am trying to extract information about a profile from a document. The information includes
first and last name, address, phone number, e-mail, age…
Some of this information is difficult to extract because it can be written in different ways. For example, the first and last name may each start with an uppercase letter, or the last name may be written entirely in capital letters while only the first letter of the first name is uppercase…
Even if we cover all the possibilities, in the end we still don’t know whether a match really corresponds to a proper name.
The same goes for addresses, which can be written in several different ways.
Then you need to use the KNIME Textprocessing Extension to build your workflow with the appropriate logic. See these two similar topics:
As I’m not into text processing, I don’t believe I can help you much more, but if you share a couple of example documents, maybe someone can give it a try. You can also look for similar workflows on the KNIME Hub. Hope this helps!
Changing the extension to .txt, for example, and then uploading should work. However, whoever downloads the file will need to change the extension back to .pdf in order to read it.
Thank you for the answer. I took a look at those topics and tried to reproduce the work done there, but I am having difficulties.
So, I will explain my approach.
I have different college profiles in PDF format. These profiles contain different sections such as Purpose, Program, Structure, Grading… So I thought I could extract this information by separating the sections using the section names.
I use the regex extractor for that, and I indicate that I want to group all the information between the Purpose section and the Program section.
I’ve used a regex like (Purpose(?s).*)(Program(?s).*)
I tried this method with only one document and it works: I was able to separate the sections. But with multiple documents it doesn’t work, because the order of the sections can differ. In one college profile the Program section comes first and the Purpose section comes after.
College Profile-Parish.txt (156.7 KB)
So I don’t know how to proceed.
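One way around the ordering problem is to stop encoding the order in the regex: find every known heading first, then cut the text between consecutive headings. The sketch below is plain Python rather than a KNIME node, and it assumes that each section name sits alone on its own line. The names `split_sections` and `SECTION_NAMES` are made up for illustration.

```python
import re

# Hypothetical list of section names, taken from the ones mentioned above.
SECTION_NAMES = ["Purpose", "Program", "Structure", "Grading"]

def split_sections(text, names=SECTION_NAMES):
    """Return {section name: body}, whatever order the sections appear in."""
    # Assumption: a heading is a known name alone on its line.
    heading = re.compile(
        r"^[ \t]*(" + "|".join(map(re.escape, names)) + r")[ \t]*$",
        re.MULTILINE,
    )
    matches = list(heading.finditer(text))
    sections = {}
    for i, m in enumerate(matches):
        # A body runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[m.group(1)] = text[m.end():end].strip()
    return sections
```

Because the result is keyed by section name, a profile with Program before Purpose gives the same dictionary as one with Purpose before Program.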
Beyond that, I will still have difficulty identifying information such as last name, first name, and address.
If anyone can help me, that would be really nice.
Wow, what you’ve done is just amazing. It’s exactly what I want to do. To be honest, there are some parts that are difficult for me to understand, but it’s a very good workflow.
I would like to thank you for the time you invested.
There is one more difficulty that comes to mind. It’s about the section names, like Purpose, Program…
I tried the workflow with other examples, and I forgot to specify that the name of a section can vary. For example, it will not be Grading but Grades or Ranking. Also, instead of Structure it can be Equipements…
A school’s profile can also have other sections, like Community, Awards and Distinctions…
I was thinking about creating variables from these section names, but I can’t manage to isolate them.
I tried a regex like this:
\n[A-Z]*\n but it’s not catching the right thing.
I need to learn more about Regex.
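A note on why \n[A-Z]*\n misses the headings: [A-Z] accepts only uppercase letters, so a heading such as Purpose, which contains lowercase letters, cannot match, and because * also allows zero characters the pattern matches empty lines instead. A possible alternative, sketched in Python, anchors on line boundaries and allows mixed case. The heading shape it assumes (a short line of letters, spaces, "&" or "-", starting with a capital) is a guess about the documents, and `find_headings` is an illustrative name.

```python
import re

# Assumed heading shape: a short line that starts with a capital letter and
# contains only letters, spaces, "&" or "-" (no digits, no punctuation).
HEADING = re.compile(r"^[A-Z][A-Za-z&\- ]{2,39}$", re.MULTILINE)

def find_headings(text):
    """Return candidate section headings in order of appearance."""
    return [m.group(0).strip() for m in HEADING.finditer(text)]
```

A short body line without punctuation would also be picked up as a heading, so the candidates still need a manual check or a list of known names.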
I’m sorry to ask for your time again, but you’ve built a very impressive workflow, and if I can resolve this difficulty, it would be great.
Please send a few more files if possible. I’m thinking of a more general approach to solve your issue which covers different cases.
If not, here are my suggestions to improve the workflow:
Try to rename the sections, e.g. Grades to Grading, using regex.
Add other sections in the node’s expression (String Manipulation - Node 2) but later use a column filter to remove them.
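The renaming idea can be sketched outside KNIME as follows. This is only an illustration in Python: the synonym table is built from the names mentioned in this thread, `normalize_headings` and `CANONICAL` are made-up names, and a heading is again assumed to stand alone on its line.

```python
import re

# Hypothetical synonym table: canonical name -> variants seen in the profiles.
# "Equipements" is spelled as it appears in this thread.
CANONICAL = {
    "Grading": ["Grades", "Ranking"],
    "Structure": ["Equipements"],
}

def normalize_headings(text, canonical=CANONICAL):
    """Rewrite heading lines that use a known variant to the canonical name."""
    for target, variants in canonical.items():
        # Only whole lines are replaced, so the same word inside a sentence
        # is left untouched.
        pattern = re.compile(
            r"^[ \t]*(?:" + "|".join(map(re.escape, variants)) + r")[ \t]*$",
            re.MULTILINE,
        )
        text = pattern.sub(target, text)
    return text
```

After this step every profile uses the same section names, so the rest of the workflow only has to know the canonical ones.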
I am trying to resolve this issue using your suggestions.
So, I was thinking about extracting the section names using a regex on a common word stem. The purpose is to catch all the possible names for a section.
For example, for the section about Grade/Grading/Graduation… the names have the letters “Grad” in common, so maybe we can catch the section using this pattern.
I know it can return many results that are not what we want, but it may be a good start.
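To keep the number of unwanted results down, the stem can be required at the start of a line that looks like a heading, instead of anywhere in the text. A rough Python sketch, where `headings_with_stem` is an illustrative name and the "letters and spaces only" heading shape is an assumption:

```python
import re

def headings_with_stem(text, stem):
    """Return whole lines that look like headings and start with `stem`."""
    # Assumption: a heading is a short line made of letters and spaces only,
    # so sentences ending in punctuation are skipped.
    pattern = re.compile(
        r"^[ \t]*(" + re.escape(stem) + r"[A-Za-z ]{0,30})[ \t]*$",
        re.MULTILINE | re.IGNORECASE,
    )
    return [m.group(1).strip() for m in pattern.finditer(text)]
```

With the stem "Grad", lines such as Grading or Graduation Requirements are kept, while a sentence that merely starts with "Graduates" and ends with a period is not.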
I have another issue, concerning the reading of the files. I don’t know if it’s the PDF font or an encoding problem, but in some files certain characters are not recognized and are replaced by symbols.
Sorry for the late reply. With your latest examples the task became trickier. One problem, as you said, is parsing the PDF files. I’m working on an approach to parse them using external tools or APIs. Even then, we still have issues with the different content formats.
I need more time to work on it when I’m free. Thank you for your patience.