Note that these are different PDFs. With these I only need names and home addresses.
For example.
Row 19-21
Row 46-48
Row 67-69
Row 84-86
Row 103-105
Etc. Just names and addresses. Everything else I don’t need.
You’re a moving target. Originally you said you need 10 rows after a keyword.
Correct. I can’t seem to find that keyword anymore in these files, even though they are the exact same format. The only thing that’s different is the names.
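For reference, the "rows after a keyword" idea discussed above could look roughly like the sketch below, written in plain Python outside KNIME. Everything in it is an assumption for illustration: the keyword "Owner Information", the sample rows, and the function name `rows_after_keyword` are made up and do not come from the actual files. It only works if a stable marker line really exists in the parsed text.

```python
def rows_after_keyword(rows, keyword, count=10):
    """Return the `count` rows that follow each row containing `keyword`.

    Matching is case-insensitive. One block is returned per keyword hit.
    """
    blocks = []
    for i, row in enumerate(rows):
        if keyword.lower() in row.lower():
            # Slice the rows directly after the keyword row.
            blocks.append(rows[i + 1 : i + 1 + count])
    return blocks


# Hypothetical parsed rows; the real PDF text will look different.
sample_rows = [
    "HEADER",
    "Owner Information",
    "Jane Doe",
    "12 Example St",
    "Springfield, XX 00000",
    "other data",
]

blocks = rows_after_keyword(sample_rows, "owner information", count=3)
print(blocks)
```

If no row contains the keyword, the function returns an empty list, which matches the situation described here: without a reliable marker in the text, there is nothing to anchor on.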
Can anyone build me a regex to extract names and addresses?
The link you shared to your hub space draws a blank again.
@NabilEnn maybe you want to take a step back and try a different approach as well.
I’m at a loss as to why you think regex will work. I’m not a regex expert, but both names and addresses come in infinite variations, which would make it extraordinarily difficult to write a regex to isolate them. You have another problem: there is a lot of garbage in your data, e.g. two names in one row, or trash in a row which probably should have a name. This is probably due to how the Tika Parser node is reading the PDFs.
That is correct. I’m not sure about regex; I just always thought it would be easier to use, but I guess not. If there is a workflow or a better way to read the PDFs and isolate the trash, I’d really appreciate any help. I’ve been stuck on these files for a while, and it would take me ages to do them by typing.
[Workflow removed per user request - ScottF]
Resetting the workflow doesn’t help. You’ve still got the Parser pointed to a local file which isn’t available in an upload. You’ve got to store the data with the workflow. I would highly recommend that you spend some time learning KNIME basics like file handling.
https://docs.knime.com/latest/analytics_platform_file_handling_guide/index.html#introduction
@NabilEnn this is how you keep your data with the workflow:
Or you upload the data as a zip file separately.
Unless you’ve given up, could you upload 3-4 unprocessed PDFs? That way we can look at the original format.