I’m working on a component that will download and prepare data from BigQuery. It will typically deal with 1M+ rows of data for every day in the date range.
My pain points
- Some fields return as JSON, but the type is set to string. I have to add a String to JSON node, which takes time. How can I set the field type correctly at download?
- Is there any way to optimise/speed up JSON Path?
- Is there any way to optimise/speed up Pivoting (single group column (one per row), key and first aggregation)? Yes, I am using in-memory processing, etc…
That’s where the component is taking its sweet time…
Any tips on optimising processing speed are welcome. I need the component to work for folks who won’t have Python etc. installed…
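For readers outside KNIME, here is a minimal Python sketch of what the String to JSON conversion does conceptually. The row layout and field names are made up for illustration; the point is that the payload arrives as a plain string and has to be parsed once per row, which is the per-row cost being paid:

```python
import json

# Hypothetical rows as they arrive from BigQuery: the "payload" field
# holds JSON content, but is typed as a plain string.
rows = [
    {"id": 1, "payload": '{"clicks": 12, "source": "ads"}'},
    {"id": 2, "payload": '{"clicks": 7, "source": "organic"}'},
]

# What a String to JSON step does conceptually: parse each row's string
# into a structured value that downstream steps can query by key.
for row in rows:
    row["payload"] = json.loads(row["payload"])

print(rows[0]["payload"]["source"])  # prints "ads"
```

At millions of rows per day, this parse-per-row cost is why casting the field to a JSON type at the source, when possible, is attractive.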
If you have multiple nodes that are streamable (the JSON Path is one of them), it may make sense to try streaming parts of your component. You can find more information about that here: Streaming data in KNIME | KNIME. It might help you speed up the workflow a bit.
I installed the extension, but I don’t see the metanode and wrap option described on the page you suggested.
What am i missing?
Oh, my bad. “Wrapped Metanodes” were renamed in the meantime and are now called “Components”. So just wrap everything in a component instead.
Hi @Gavin_Attard ,
My suggestion is to put it into a chunk loop (splitting by a number of rows or a percentage of groups), so you can process the data chunk by chunk to test and make your transformations with the tips that @AlexanderFillbrunn gave you.
With it, you can measure performance for each chunk and see where you can make changes, limit the content to free up memory during processing, or use the write-to-disk option to use less memory (at the cost of more disk space) to load and make the transformation.
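The chunk-loop idea can be sketched outside KNIME as well. This is a conceptual stand-in, not KNIME code: `CHUNK_SIZE` and `transform` are made up, and the point is simply that only one chunk’s worth of rows is held and timed at once:

```python
import time

# Illustrative chunk size; the real workflow would use something like 100K rows.
CHUNK_SIZE = 3

def transform(chunk):
    # Stand-in for the per-chunk work (JSON parsing, pivoting, ...).
    return [value * 2 for value in chunk]

data = list(range(10))
results = []
for start in range(0, len(data), CHUNK_SIZE):
    chunk = data[start:start + CHUNK_SIZE]
    t0 = time.perf_counter()
    results.extend(transform(chunk))
    # Timing each chunk shows which part of the data is slow to process.
    elapsed = time.perf_counter() - t0

print(results)  # each input value doubled, in order
```

Timing per chunk is what lets you spot where the slow rows are, which is the testing benefit described above.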
I hope that brings you some insight to speed up your workflow.
Haha, I tried to set streaming on the overall component I made, but because it has metanodes and loops that didn’t work… ok,
let’s see if I wrap only those nodes in a component. Watch this space.
Just curious: have you tried separating the JSON from the other columns and processing them separately, e.g. in a parallel chunk loop or otherwise?
Thanks @denisfi @Daniel_Weikert @AlexanderFillbrunn
In the end it was a combination of all of those things.
Streaming was something I hadn’t tried.
Once I played around with the order of things and got them right, I managed to cut the processing time by more than half.
The challenge was that each row has 10 columns with JSON data that needs to be parsed into columns. Eight of these have known widths (no. of columns to parse out) with known keys; the other two have variable widths and keys. The time sinks were anything that does grouping or pivoting.
Once I serialized those nodes, wrapped them in parallel chunk loops, and wrapped the entire process into chunks of 100K rows, it seemed to hit the balance for all processes. It was fun seeing the cores burning hot.
In the end, it took 77.8 minutes to process 9.3M rows with all possible columns (it resulted in 222 cols), so approx 120K rows per minute end to end, as opposed to 4h+ previously… The parsing process was around 30 sec per 100K rows.
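The combination described above (fixed-key columns plus variable-key columns, parsed in parallel over chunks) can be sketched like this. Everything here is hypothetical: the column names `stats` and `extras`, the key list, and the tiny chunk size stand in for the real 10-column, 100K-chunk workflow:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical known keys for a fixed-width JSON column.
FIXED_KEYS = ["clicks", "source"]

def parse_row(row):
    fixed = json.loads(row["stats"])      # fixed keys -> predictable columns
    variable = json.loads(row["extras"])  # variable keys -> whatever is present
    out = {key: fixed.get(key) for key in FIXED_KEYS}
    out.update(variable)
    return out

rows = [
    {"stats": '{"clicks": 3, "source": "ads"}', "extras": '{"region": "EU"}'},
    {"stats": '{"clicks": 9, "source": "web"}', "extras": '{"device": "ios"}'},
]

# Tiny chunk size for illustration; the real workflow used 100K rows per chunk.
CHUNK = 1
chunks = [rows[i:i + CHUNK] for i in range(0, len(rows), CHUNK)]

# Parallel chunk loop: each chunk is parsed by its own worker, then the
# results are flattened back into row order.
with ThreadPoolExecutor() as pool:
    parsed = [row
              for chunk_result in pool.map(
                  lambda c: [parse_row(r) for r in c], chunks)
              for row in chunk_result]
```

Because the variable-key columns can yield different keys per row, the union of all keys is what produces the wide (222-column) result once the chunks are combined.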
This was fun.
Now onto the next iterations of the tool.
Thank you for the challenge and results…
Maybe in the future when you find some time you could upload the flow with demo data. Would certainly be interesting for all of us here.