Reference row filter: thanks. That seems to do EXISTS and NOT EXISTS. It was a big gap.
What about those set operations (EXCEPT/MINUS/DIFFERENCE)?
Table validator: not for what I need.
SQL: the whole point is to avoid SQL, but natively do similar table manipulations. I want to do queries across different data sources, and pulling everything into SQL is just a bottleneck I don’t need.
Maybe you could give us an example of what you want to do.
EXCEPT I think you could do by reversing the Reference Row filter (everything that is not matched). The rest I would think you could do combinations of conditions and maybe joins.
Conditions in KNIME can come in several flavours:
String Manipulation and Math formulas (for calculations)
simple and complex Java snippets where you have the full power of Java functions
It might help to study some examples and see introduction into KNIME because if you start the mass of nodes might be a little bit confusing at first. And also the concept of doing everything in a series of single nodes is something you might have to get accustomed to.
Often it is best to start from a real-world example and try to reproduce that in KNIME. If you switch from a specific program like Excel, SAS or Alteryx there are free books to help with that. And then there is always the active KNIME community in this forum there to help
My only real world example is what you would find in an SQL manual.This is an evaluation, to see whether Knime can manipulate tables as you can in SQL. If there are gaps, I expect to have to create new nodes to fill them.
The Reference Row filter turns out to be quite limited. The Joiner node lets you define a list of joining columns, but the Reference Row filter allows only a single column. That makes it generally unsuitable for EXIST, NOT EXISTS and set operations like EXCEPT. A Joiner with a few more options would be the best solution.
I don’t think I’m confused. For the most part I’m finding it very straightforward to understand, but the terminology is non-standard and it’s hard to find the specific nodes I’m looking for, or to determine they don’t exist.Thank you for your pointers.
from your description it is not entirely clear if you want to manipulate data inside a table or data tables. Or both? Both can be done with KNIME and for several tasks there are several ways.
But one of KNIME’s strong suits is the integration of database functions wither thru dedicated DB nodes that bring KNIME like functions to the DB (with SQL in the background) - or you could just use KNIME as a tool to steer your DB operations up to Big Data and Spark operations (and automate them with the KNIME server).
Like it is so often with KNIME why choose if you can have it all (workflow and coding in R and Python, local machine and fully scalable work with big databases and clusters, replace simple Excel tasks and do complicated models while keeping everything organized and commented on in nice workflows) [end of commercial break ]
To see what is possible with regards to Big Data you could check out my school of hive and the use of H2O model with spark
My goal is to create a set of tools for manipulating tables of a wide variety of data types from a variety of sources, that can be used by people who are not programmers. Coding languages like SQL, Python and R are out, and pulling data into tables and then accessing it via SQL is out.
The KNIME design philosophy of nodes and pipes suits the task well, but most of the nodes I can find are about either numeric data or specific academic applications, and that doesn’t suit my purpose. I’m not interested in Big Data or Spark, or massive scale. Think of the things that people tend to choose Excel, Access or similar tools. I need nodes that can do things like SQL (SELECT, JOIN, GROUP etc) on CSV or XLS files or other data sources, and the final output will be most likely pivot tables, charts and reports.
As for me, the main practical restriction in KNIME implementation of SQL like functionality is a join by range. As for exists not exists they could be simulated using left/right outer joins following corresponding filtering nodes.
A range join would have to be a cross join followed by a row filter, I don’t think there is anything more better. Check the execution plan on your favorite SQL engine if you like. You can do that in Knime.
You can do EXISTS by a join followed by dropping all the RHS columns, but it can be quite inefficient… I can see no good way to do NOT EXISTS or EXCEPT/MINUS short of a highly inefficient cross join, or maybe some dinky technique based on row IDs.
No matter. Should be easy enough to write, based on the existing join node.
For me that sounds just like a task for Knime and there should be several examples on the hub. If you have some specific tasks in mind it might be best to provide an example / challenge and see if the community can come up with ideas.
If you aspire to write additional nodes that are helpful for data manipulation I am sure that is most welcome.
Some community nodes already come with specialised versions of data manipulation nodes which one would have to find and which might serve as an inspiration.
I agree that Knime is well suited to tasks like this, but the data manipulation nodes are ad hoc, inconsistent and missing important features.
SQL can do all these things, but not for this job. My inclination is to create a set of nodes that implement the Relation Algebra and see where it leads. If anyone knows anything along those lines, now would be a good time to mention it.
First comment is: if you want to to SQL, then best to put your data into a relation database. That said actually being able to query a knime table with SQL would be an interesting feature the devs should take note. But it’s probably too complex to be worth it.
For the reference row filter with multiple columns, just concatenate them (String Manipulation for example) before the filter. Only thing to take care is that the chosen columns are in the same order on both tables and if that is the case, this trick works perfectly fine. Albeit I also agree, having the actual option to choose multiple columns would make it easier for new or less technical users. This is just one example that with KNIME you have to sometimes think a bit outside the box.
EXIST, NOT EXISTS and EXCEPT are all covered by reference row filter unless I’m completely missing something.
No, for my purposes both using an existing RDBMS and writing actual SQL are absolutely ruled out. My target user just won’t hack it. And in any case, it’s not a good fit for Knime nodes.
But Relational Algebra was the basis for creating SQL, and has roughly the same capability. It also fits nicely into the pipeline and node model.And obviously it would be nice if it fills a need that others have experienced.
I’m sure there are any number of hacks by which these things can be done, but I’m looking for something better. The ways you suggest will cover most cases, but doubtless there will be edge cases where they fail in surprising ways.