wildcard Sum ifs between 2 tables

I am trying to replicate a sumifs statement from excel where I match a distinct query parameter from a table against a url column in another and them sum the pageviews.

2 tables

image

I have the tables in knime but no primary key so I am not sure how to replicate this. In excel I would use a sum if with a wildcard either side to check the page column

and would expect an output like this
image

any advice appreciated as I am new to Knime.

You could use string manipulation with substring formula or regex etract expression to extract the word after the ? and then join them based on this newly created column
regex would be sth like this
reg

but I think you need to escape the backlashes in string manipulation so try double backslashes
br

Hope this solution helps (see attached file as well).

  1. Use ‘Cross Joiner’ to join each row of the top table with each row of the bottom table.
  2. Use ‘Column Expressions’ to check if a page contains the parameter (true or false)
  3. Use ‘Rule-based Row Splitter’ where you only keep rows that are ‘true’
  4. Use ‘GroupBy’ to sum over the page views for each parameter

wildcard_sumifs_two_tables.knwf (14.4 KB)

1 Like

@Daniel_Weikert Issue with this approach is that some URL’s contain more than one parameter. If you write it out in full, you’ll get
image
which requires additional processing to see which parameters are actually in the URL.

@Jake120F
I would change it slightly and use (?:&[a-z]+)|(?:\?[a-z]+). This is able to capture all groups within the URL.

In KNIME, something like this should get you going.

image

First, use the Regex Extractor node and use the beforementioned code. Set the output to Rows.

In case you don’t have this node, it’s available here: Regex Extractor — NodePit

Next, apply a small clean-up to remove the HTML characters.

Finally, use a groupBy node where by the Full Match is the grouping column and the Page Views is the Manual Aggregation of type Sum.

If you want to control the list of parameters selectively, a Joiner node with an inner join after the parameter extraction would be sufficient.

See WF:
Wildcard sum ifs between 2 tables.knwf (21.5 KB)

Hope this provides some inspiration!

7 Likes

@ArjenEX all good points that you raised, and very good solution.

However, keep in mind that query strings are case sensitive and are not always lower case. Your regex is limited to lower case query strings only.

But for this situation, it should be ok, since the parameter are pre-defined (colour, shape, etc, which are all lowercase) - and perhaps that is why you did it that way on purpose :smiley:

1 Like

Good spot! Luckily quickly solvable with [A-Za-z]

1 Like

@ArjenEX Thank you so much for the response and explaining it so clearly.
That was much easier than the way I was trying round the houses with the mapping back between the two tables.

I have tweaked it for the real dataset and its behaving as expected :slight_smile:

1 Like

You are right,
sorry for the confusion. My eyes probably were too tired to spot that.
Great that you helped with your solution
br

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.