Pulling numbers out of a string

dseller · July 1, 2021, 4:36pm

Hi,
I’m sure this is relatively easy but I can’t figure it out. I have parsed a pdf which has a whole lot of nonsense and then some numbers which I would like to pull out. These numbers can be anywhere in the string but typically occur toward the bottom and are preceded by keywords - “Result”, “Resultrow”, etc. Also, the numbers are integers, doubles, and dates. I have a feeling the Regexsplit node is the way to go but I have no idea what syntax to use.

PLEASE, if you provide syntax for the Regexsplit give me some hints on how this works or where to look for an explanation of the syntax. I searched Java API and came out more confused than ever.

Image of pdf:

Image of data stored in pdf:

Thanks!

takbb · July 1, 2021, 7:55pm

Hi @dseller , I’m afraid I haven’t got an answer for you at the moment, but I do have loads of questions…

When you say you want to “pull numbers out”, do you mean all the numbers, and just the numbers?

So if, as in your example, it said ResultRow2, would you want the “2” pulled out?

What about where it says “Page 3 of 4”… would you want the 3 and the 4 in this case?

And something like “NB-QCF-03”? You said that the numbers are “integers, doubles and dates”. I’m guessing therefore that the “03” in this example should be excluded as it is part of something else?

And what date format are you looking for? I would think that identifying and interpreting “1 November 10, 2017” from your example would be an interesting challenge. At what point do the contained integers become part of the date, and not considered to be numerics in their own right?

The more I write and think about this, the more I think that your comment “I’m sure this is relatively easy” was possibly optimistic

Assuming the data can be extracted, how would you want the numbers presented? A column for each number? A row for each number? A comma delimited string?

Often with a problem such as this, it is pinning down the specifics of the problem that takes the time. I’m sorry to say that at the moment, this “feels” too general. There may be an element of Regex involved in the solution, but I think right now, from what you have said, that the solution is going to be anything but straightforward.

That’s not to say that somebody here cannot assist with finding a solution, but I think we’ll need a better understanding of the actual requirement, (which I know you’ve tried to explain for us) as there are a number of complexities that present themselves here.

dseller · July 1, 2021, 8:55pm

Hi @takbb,
I’ll answer your questions as you’ve listed them.

I don’t care about all the numbers - just specifically the result numbers which are located at a couple different spots in the string/document. So “page 3 of 4”, “NB-QCF-03”, etc. can be ignored. I was imagining searching for relevant data following the keywords of “Result”, “ResultRow2”, etc.
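As a rough illustration of the keyword-based search described here, the following is a minimal sketch in Python rather than a KNIME node configuration. The sample text and the names `NUMBER`, `PATTERN`, and `extract_results` are made up for this example; the regex tokens used should carry over to the Java regex flavour that KNIME nodes rely on, but that is an assumption to verify.

```python
import re

# Hypothetical sample text; the real text parsed from the PDF will differ.
text = (
    "Page 3 of 4  NB-QCF-03  Result: 0.620712  "
    "ResultRow2: 0.794051, 0.634139, 0.616562  ResultRow3: 0"
)

# An integer or a decimal number, with an optional leading minus sign.
NUMBER = r"-?\d+(?:\.\d+)?"

# "Result" or "ResultRow<digits>", a colon, then a comma-separated list of numbers.
PATTERN = re.compile(rf"\b(Result(?:Row\d+)?):\s*({NUMBER}(?:\s*,\s*{NUMBER})*)")

def extract_results(text):
    """Map each Result keyword to the list of numbers that follows it."""
    results = {}
    for key, values in PATTERN.findall(text):
        results[key] = [float(v) for v in re.split(r"\s*,\s*", values)]
    return results

print(extract_results(text))
```

Because the pattern only accepts numbers that directly follow a `Result...:` keyword, fragments such as “Page 3 of 4” and “NB-QCF-03” are skipped without any extra filtering.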

In terms of looking for the date, I am actually using the metadata from the pdf so I can live with ignoring dates listed within the pdf.

In terms of presentation, ideally I’d like each result listed in its own separate column so I can change to an integer/double and perform statistical analysis on this. I am not sure of the easiest way to do this (column, row, comma delimited string), but I am working with multiple pdfs simultaneously so I assume individual columns would be easiest. Another complicating factor is that the “ResultRowX:” has a variable amount of numbers (integer/double) and changes (“ResultRow1/2/…”) depending on the pdf.
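One possible way to handle the variable number of values per “ResultRowX” is to pad every row to the widest case seen across all pdfs. The sketch below assumes the results have already been extracted into one dictionary per pdf; the function name `to_columns`, the `<field>_<n>` column naming, and the sample values are all hypothetical.

```python
def to_columns(rows):
    """Spread per-pdf result lists into fixed columns, padding gaps with None.

    rows: one dict per pdf, mapping a field name to a list of numbers.
    Returns (header, table) where every table row has len(header) cells.
    """
    # Widest list seen for each field, in first-seen order.
    width = {}
    for row in rows:
        for key, values in row.items():
            width[key] = max(width.get(key, 0), len(values))

    header = [f"{key}_{i + 1}" for key, n in width.items() for i in range(n)]

    table = []
    for row in rows:
        line = []
        for key, n in width.items():
            values = list(row.get(key, []))
            line.extend(values + [None] * (n - len(values)))
        table.append(line)
    return header, table

# Made-up values standing in for two different pdfs.
pdf_a = {"Result": [0.62], "ResultRow2": [0.79, 0.63, 0.61]}
pdf_b = {"Result": [0.58], "ResultRow1": [0.7, 0.5]}
header, table = to_columns([pdf_a, pdf_b])
print(header)
for line in table:
    print(line)
```

Missing values stay as `None` rather than 0 so that they are not mistaken for real measurements in later statistics.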

Hopefully this paints a better picture.

Thanks!

takbb · July 1, 2021, 10:27pm

Hi @dseller, this isn’t yet the solution to your problem, but I wonder if it might be a stepping stone.

I noticed from your sample extract that all of the “field names” such as ResultRow2, Result and so on were always preceded by two spaces, rather than one, and then followed by a colon.

I wonder if first of all trying to break that part of the document up into “key-value” pairs might be a good step in the right direction?

I constructed a piece of sample text based on one of the images you posted. The image is incomplete so I had to make it up a little, and it won’t exactly fit the field/values that you have:

Text3: 789053 Comment2: Comment3: Product was compare to three previous Test Method: ML-CST-010.01 Button9: Result: 0.620712 Specification: 0.637473, 0.620712 ResultRow2: 0.794051, 0.634139, 0.616562 ResultRow3: 0
(edit: The double-space doesn’t show up here in my post even though I’ve marked it as “pre-formatted text”, but it is correctly included in the data sample in the attached workflow)

Running the sample text through the following workflow

The first Cell Splitter breaks the text into columns on a “pair of spaces”; the result is then transposed into a single column and split again on “:”.

This turns that text into the following:

Maybe that gives you something to play with further, as you might then be able to do row filtering based on the column name, or perhaps somebody else can take this further, or suggest alternatives.
scan text and divide into fields.knwf (9.8 KB)
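For readers without KNIME to hand, the same two-step split can be sketched in Python. This is only an approximation of the workflow, not its contents: the sample string is a shortened version of the text above with the double spaces put back where they are assumed to be, and `split_fields` is a hypothetical helper name.

```python
import re

# Shortened sample; the positions of the double spaces are an assumption.
sample = (
    "Text3: 789053  Comment2:  Comment3: Product was compare to three previous  "
    "Result: 0.620712  ResultRow2: 0.794051, 0.634139, 0.616562  ResultRow3: 0"
)

def split_fields(text):
    """Split on runs of two or more spaces, then on the first colon of each chunk."""
    pairs = []
    for chunk in re.split(r" {2,}", text.strip()):
        key, sep, value = chunk.partition(":")
        if sep:  # ignore chunks that contain no colon at all
            pairs.append((key.strip(), value.strip()))
    return pairs

pairs = split_fields(sample)

# Keep only the fields whose name starts with "Result".
result_pairs = [(key, value) for key, value in pairs if key.startswith("Result")]
for key, value in result_pairs:
    print(key, "->", value)
```

Splitting on the first colon only (via `partition`) is a deliberate choice here, so that a value which itself contains a colon is not broken apart; a plain split on every “:” would behave differently in that case.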

mehrdad_bgh · July 2, 2021, 1:45pm

Hi @dseller,

Check this regex: KNIME_project378.knwf (8.3 KB)
I used Brian’s text example.
Useful website: https://www.regular-expressions.info/

GL,
Mehrdad
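The regex inside the attached workflow is not reproduced in this thread. As a hedged illustration of the kind of syntax the linked site explains, here is an annotated pattern for a single integer or decimal number, written in Python's verbose mode so each token can carry a comment. The names `NUMBER` and `numbers_in` are hypothetical, and this is not necessarily the pattern used in the workflow.

```python
import re

NUMBER = re.compile(
    r"""
    -?          # optional minus sign
    \d+         # one or more digits
    (?:         # start of a non-capturing group
        \.      #   a literal dot (a bare dot would match any character)
        \d+     #   one or more digits after the dot
    )?          # end of the group; the ? makes the whole group optional
    """,
    re.VERBOSE,
)

def numbers_in(value):
    """Return every number found in a value string as a float."""
    return [float(match) for match in NUMBER.findall(value)]

print(numbers_in("0.794051, 0.634139, 0.616562"))
print(numbers_in("0"))
```

Applied to an isolated value such as the part after “ResultRow2:”, this is enough; applied to a whole document it would also pick up page numbers and codes, which is why the keyword context matters.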

takbb · July 2, 2021, 4:09pm

@mehrdad_bgh

Nice one Mehrdad… I’ve been wondering where you were and have been missing your ever-useful regex contributions!

dseller · July 8, 2021, 6:22pm

Thank you both so much!

@mehrdad_bgh
The link to the regex explanation site and the regex extractor node are incredibly helpful. I have multiple pdfs that I am incorporating into this workflow as data and these have helped so much.

takbb · July 8, 2021, 7:29pm

Hi @dseller, another regex site that I find useful for trying things out is regex101.com

mehrdad_bgh · July 9, 2021, 9:43am

When I see String problems…REGEX :))

mehrdad_bgh · July 9, 2021, 9:47am

The “Mastering Regular Expressions in JavaScript” course by Steven Hancock is a good place to start learning regex.