I have a small project.
500,000 records (about 1.4 TB) of web server logs.
Dedicated server with the Cloudera Express 5.5 distribution (at an ISP).
1.- Load the logs
2.- Modify columns, clean rows, ...
3.- Create reports from the logs about session times, views, geolocation, ...
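For step 2, here is a minimal sketch of what "modify columns, clean rows" can look like for one log line. This assumes the Apache "combined" log format; the regex and field names are illustrative and will need adjusting to your actual log layout:

```python
import re
from datetime import datetime

# Hypothetical pattern for an Apache-style access log line.
# Adjust the groups to match the columns in your own logs.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def clean_row(line):
    """Parse one log line into cleaned columns; return None to drop the row."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None  # malformed line: drop it
    row = m.groupdict()
    row["status"] = int(row["status"])                # column type conversion
    row["size"] = 0 if row["size"] == "-" else int(row["size"])
    row["ts"] = datetime.strptime(row["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return row

line = '1.2.3.4 - - [10/Oct/2015:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
row = clean_row(line)
```

In KNIME the equivalent would be String Manipulation / Row Filter nodes; the sketch just shows the per-row logic those nodes would apply.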
I'm a newbie in KNIME. With a small part of the logs (only 30 MB) I have created a workflow that modifies columns, cleans rows... and generates one report.
With the small file everything is all right (about 3 minutes to finish the whole process), but I need to run this workflow on the whole file (the big one).
The big file is on the dedicated server, so I suppose that KNIME must not use the Database Reader because it is too slow. I think I must modify columns and clean rows with some node that executes the actions on the Hadoop server.
Is this true?
The second question is: what is the best method for doing this?
Thanks in advance, and sorry for my English.
1.4 TB is a lot, but you can probably still pipe everything through a single machine if you can spare the time. :-)
Functionalities to look for are:
- Chunk loops (to go in batches, like the 30 MB ones you have used so far)
- Parallel chunk loops (to speed up processing of batches by using multiple threads at once)
- "Don't save" nodes from the Image mining extension to save disk I/O (though you'll lose result inspection options)
- Streaming to save I/O and the wait time between nodes (works only for some nodes, and you lose inspection options, but it can speed things up massively)
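To make the first two bullets concrete, here is a plain-Python sketch of a chunk loop and a parallel chunk loop. It is not KNIME code, just an illustration of the pattern those nodes implement; chunk size and worker count are made-up values you would tune:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def read_chunks(lines, chunk_size):
    """Yield successive batches of lines, like a Chunk Loop Start node."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

def process_chunk(chunk):
    """Stand-in for the per-batch work (modify columns, clean rows, ...)."""
    return [line.strip().lower() for line in chunk if line.strip()]

def run(lines, chunk_size=2, workers=4):
    """Parallel chunk loop: batches are handed to multiple threads at once."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for cleaned in pool.map(process_chunk, read_chunks(lines, chunk_size)):
            results.extend(cleaned)  # Loop End: collect batch results in order
    return results

out = run(["  A\n", "b\n", "", "C\n", "d\n"])
```

The point is that only one batch per worker is in memory at a time, so 1.4 TB never has to fit in RAM at once.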
el Ergonomista :-)
If you have access to a Hadoop cluster, using the Hive connectors and possibly the Spark executors might be appropriate for the task that you describe. You can learn more about the technologies here:
If that looks interesting (also check that your Hadoop cluster is supported by KNIME; see the bottom of the webpage), then you can request a 30-day trial license to test your use case: https://www.knime.org/big-data-extensions-free-30-day-trial
There are some example workflows for the Big Data extensions available on the KNIME public examples server that will be helpful to get started.