Find string and return N rows beneath it

abiasella · September 9, 2021, 1:55am

Hi all - I’m very new to KNIME and I have what I think is an easy problem, but have been stumped with current solutions for over a week…

I have dozens of semi-unstructured google sheets as input, each with a cell containing the text “Plates” and then up to 12 cells underneath it with comma separated numbers. I have figured out how to loop over each sheet using flow variables.

Unfortunately the “Plates” text is in a different location in each sheet. I am able to use a Column Expressions node to find the text in an array (not sure that is the most efficient use), but then I’m stumped at how to retrieve values from rows beneath it.

e.g.

Plates
1,2,3,4
5,6,7,8
13,14,15,16
82,83,84,85
91,92,93,94
…

Ultimately, I’d like to output a new column with just the first number of each list, e.g.

FirstPlate
1
5
13
82
91
…

but I’d be happy with just learning how to retrieve 12 rows under my target text search. I can do the comma splitting later.

Thanks for any help anyone can provide!

bruno29a · September 9, 2021, 3:44am

Hi @abiasella and welcome to the Knime Community.

Can you explain what’s above the cell that has the text “Plates”? Are they empty cells?

And also, what do you have after the “up to 12 cells underneath”? Do you have other values after these? And you say up to 12, meaning there could be 0 to 12 cells? What determine if it’s 1, or 2, or 12? Unless they’re the only cells after the text Plates, that is there are no other values after them…

In the mean time, you can check the Line Reader node, which allows you to read line by line. You could just go through each line and ignore them until you reach “Plates”, and then read the next N lines.

But I can’t come up with anything until I get the answers to my questions.

You could also share a few different examples of the sheets so we can see how the data looks like, and what the variations are.

abiasella · September 9, 2021, 5:17am

Thanks for your reply. I will definitely check out the Line Reader when I get a chance.

I made a demo google sheet with the two most common example variants as tabs.

Overall I am scraping signup sheets, where there are always 12 slots per sheet. Sometimes there is another section of unneeded information after the 12 cells. There are always 12 cells either with values, or N/A. “Plates” will be either on the first row, or if not, then just empty cells above it.

I hope the demo sheet clarifies what I’m working with! Thanks.

takbb · September 9, 2021, 8:10am

Hi @abiasella , The attached workflow gives a possible mechanism for doing this. As I see it you have two complications. Namely that you know neither the row or the column containing the “plates” data.

I took your sample data and created two csv files in the data folder, with tab separated values. You may well have a workflow that connects to your google sheets, so you’d need to adapt it for that, but the body of the process would be the same.

Using the Line Reader as already suggested , it first of all uses regex in a Rule-based row filter to find a row containing the “plates” header. I’m assuming here that only one row per file will match this. If that’s not the case then a little additional thinking would be required on that.

Having found the row, it stores the row number in a variable and also calculates “row number + 13” to identify start and end rows in the current data file for the “plates” title and the 12 rows of data.

These variables are passed to a row filter to be used against the whole file and return just the rows of interest that can then be converted to columns and given the column names found in the first row of this subset of data.

It also records the file name for future use (if you need it) against each row.

After all files are processed, you end up with a data table containing the plates information from all input files, which you can then process in whatever way you require:

Find variable column and row position.knwf (52.0 KB)

During this exercise, I came upon an inconsistency in “row filter” which I have documented in red annotations on the flow. I documented it there because otherwise the way I derived the row numbers as variables to be passed to the row filter makes no sense! I should be passing 1 and 13, for example, but instead I have to pass 0 and 12 as the row numbers. This is different to the “usual” inconsistencies we find regarding 0 and 1 based row numbering, but I’ll document this elsewhere.

takbb · September 9, 2021, 9:25am

I of course meant to say “row number + 12” to collect 13 rows (title plus 12 data rows), and I made the same mistake in the annotations on and near the Math Formula node. The node config itself though should be correct. I was too late to edit the above post when I spotted it. Apologies if that caused confusion!

Daniel_Weikert · September 9, 2021, 6:03pm

@takbb
why exactlyy do you have a pipe symbol in the regex?

abiasella · September 9, 2021, 7:12pm

Thank you for your answer and workflow! I’ve tried it with other local files and it works for exactly what I need.

You may well have a workflow that connects to your google sheets, so you’d need to adapt it for that, but the body of the process would be the same.

I’ve marked your comment as solution, but I’m ultimately having trouble integrating it with Google Sheets as the Line Reader is not able to access data in the URLs I pass in. I will study your workflow more to see if I can adapt certain parts.

takbb · September 10, 2021, 9:04am

Hi @Daniel_Weikert, I had to go back and check

Ok, so the regex I’ve used to find the term “plates” as a column heading, in the Rule Based Row Filter node was this:

(^|.*\t)plates(\t.*|$)

The quick and simple answer is that the pipe character “|” allows alternative matches. That may be all you need, but I’ll elaborate further as it may assist others. (btw, somebody else may have a different/better way of achieving this, and it’s (very) possible that my regex solution isn’t perfect, but it appears to work from the tests I’ve done with it, but if you/anybody spots something wrong please let me know! )

My intention here was to find all rows where it contained the term “plates” but was a column heading in its own right. My first assertion was that column headings would be separated by tabs, so the word “plates” must be surrounded by tabs for it to be considered an actual heading and not just the string “plates” appearing in something like “contemplates” or “he threw some plates”

So initially I tested my regex as this:

(.*\t)plates(\t.*)
and I included “capture group” brackets even though here they aren’t needed right now as it keeps it separated and it’s also easy to review the different parts of the “match” on a site like regex101.com.
This would happily find “plates” in a string with something followed by tab followed by plates followed by something

e.g. Using regex101 to demonstrate the example (and I’ve put a link to this example at the end if you want to give it a try), by using the capture groups, we can see different portions colour-coded when it makes the match:

However, what if the word “plates” were the first column:
On regex101, we can see that this pattern won’t work:

In this case, we need to allow for the word “plates” being preceded by either "something and a tab, or by the start of line, and this is where the first pipe character comes into play.

(^|.*\t)
This is now finding a match with either start of line or something and tab, but as you can see from the next example, it doesn’t find a line ending with plates

So, at the other end of the regex, I have done the same with “end of line” ($) and provided that as an alternative match using a further pipe:

and just for giggles, and completeness, here is my full test:

which appears to match all lines where “plates” is a column name in its own right.
Here’s a link to the above regex example. To enter tabs into regex101, I pasted the examples in from an editor, but you can also (on Windows) use Alt 009 (with the 009 entered on the numeric keypad)
I hope that helps!

ipazin · September 10, 2021, 10:34am

Hello @abiasella,

Then use google nodes (Google Sheets Connection → Google Sheets Reader) followed by Column Aggregator node with Concatenate method and \t as delimiter to get same table output as from Line Reader.

Welcome to KNIME Community!

Br,
Ivan

Daniel_Weikert · September 10, 2021, 3:46pm

Yes of course, yesterday was probably to late for me
Thanks a lot for the detailed step by step explanation, as always highly appreciate your efforts here. I always stay curious when I go to the forum and am amazed to see what kind of solutions you and @bruno29a came up with. Both of you are clearly an enrichment to this forum and the community. Hope you continue with this.
Best regards, take care and enjoy your weekend.

system · September 17, 2021, 3:46pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.