I am running a join query on various modules of data. And I notices a significant difference in execution time when -
1. SQL join query on 18 Hive tables executed directly on Hue Interface vs.
2. KNIME workflow with Hive Connection and Database Nodes.
Note - I am not getting any data into KNIME workspace, all data processing is done on HDFS by using database nodes.
No. of records --> 2 million records / 23 million records
Approach 1 --> 3 min 15 sec / 8 min 27 sec
Approach 2 --> 12 min 26 sec / 34 min 33 sec
Note- these timings are taken form History Server Web UI of YARN.
was the SQL you executed in Hue exactly the same as the one that was executed via "Database Connection Table Reader"? The SQL that is executed via the "Database Connection Table Reader" can be found in the KNIME log (View > Open KNIME log) after running the node.
If yes, the amount of available resources in the cluster can also have an effect on query runtime. If you have to share resrouces with someone else this may lead to longer query runtimes, depending on what the other person is doing with the cluster.
The query I executed is join of 18 modules with 17 joins whereas I have used Knime wokflow to design the same query. That means I have used 18 Database Table Selectors and 17 Database Join Components to get the same result i.e. same number of rows and columns. Kindly have a look at attached snapshot of workflow in the original post.
Also, these processes were executed when I was the only user on Hadoop cluster so resources sharing should not be an issue.
Considering the baove facts the difference in processing time is very huge. That is the concern. Please advise.