How does KNIME work?

ROBERTO_GARCIA · April 8, 2016, 9:57pm

Hi all,

I have a little project.

Scenario:

500,000 records (about 1.4 TB) of web server logs.

A dedicated server running the Cloudera Express 5.5 distribution (hosted at an ISP).

I need:

1. Load the logs.

2. Modify columns, clean rows, etc.

3. Create reports from the logs about session times, views, geolocation, etc.

Questions:

I'm a newbie with KNIME. Using a small part of the logs (only 30 MB), I have created a workflow that modifies columns, cleans rows, and generates one report.

With the small file everything works fine (about 3 minutes to finish the whole process), but I need to run this workflow on the whole file (the big one).

The big file is on the dedicated server, so I suppose that KNIME should not execute a "Database Reader" node, because it would be too slow. I think I must modify columns and clean rows with some node that executes the actions on the Hadoop server.

Is this true?

My second question: what is the best method to do this?

Thanks in advance, and sorry for my English.

Roberto Garcia

Ergonomist · April 11, 2016, 3:43pm

Roberto,

1.4 TB is a lot, but you can probably still pipe everything through a single machine if you have the time. :-)

Features to look for are:

Chunk loops (to go in batches, like the 30 MB ones you have used so far)
Parallel chunk loops (to speed up processing of batches by using multiple threads at once)
"Don't save" nodes from the Image mining extension to save disk I/O (though you'll lose result inspection options)
Streaming to save I/O and the wait time between nodes (works only for some nodes, and you lose inspection options, but it can speed things up massively)
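
The chunking idea in the list above can be sketched outside KNIME in a few lines of Python. This is only an illustration of the pattern, not how KNIME implements its loop nodes. The log layout ("ip path status"), the function names, and the per-chunk step (counting requests per path) are all assumptions made up for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_paths(chunk):
    """Hypothetical per-chunk step: count requests per path (second field)."""
    counts = Counter()
    for line in chunk:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed rows
            counts[fields[1]] += 1
    return counts


def chunk_loop(lines, chunk_size):
    """Sequential batches: only one chunk is held in memory at a time."""
    total = Counter()
    for chunk in read_chunks(lines, chunk_size):
        total.update(count_paths(chunk))
    return total


def parallel_chunk_loop(lines, chunk_size, workers=4):
    """Same result, with chunks handed to a pool of worker threads.

    Caveat: Executor.map submits every chunk up front, so a real run over a
    very large file would need to limit the number of chunks in flight.
    """
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_paths, read_chunks(lines, chunk_size)):
            total.update(partial)
    return total
```

With a real file, `lines` could simply be the open file object, which Python reads line by line. Only small per-chunk summaries are kept, so memory use stays bounded in the sequential version regardless of file size.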

Regards,
the Ergonomist :-)

jonfuller · April 12, 2016, 2:37pm

Hi Roberto,

If you have access to a Hadoop cluster, using the Hive connectors and possibly the Spark executors might be appropriate for the task you describe. You can learn more about the technologies here:

https://www.knime.org/knime-big-data-extensions

If that looks interesting (also check that your Hadoop cluster is supported by KNIME; see the bottom of the web page), then you can request a 30-day trial license to test your use case: https://www.knime.org/big-data-extensions-free-30-day-trial

There are some example workflows for the Big Data extensions available on the KNIME public examples server that will be helpful to get started.
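
As a rough illustration of why in-database processing helps with the "Database Reader is too slow" concern, the sketch below compares pulling every row to the client with letting the database filter and aggregate first. It uses Python's built-in sqlite3 purely as a stand-in for Hive; the table name, columns, and sample rows are made up for the example, and nothing here is KNIME or Hive API.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory table standing in for a Hive log table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE weblog (ip TEXT, path TEXT, status INTEGER)")
    conn.executemany(
        "INSERT INTO weblog VALUES (?, ?, ?)",
        [
            ("10.0.0.1", "/index.html", 200),
            ("10.0.0.2", "/index.html", 200),
            ("10.0.0.1", "/about.html", 404),
            ("10.0.0.3", "/about.html", 200),
        ],
    )
    return conn


def views_client_side(conn):
    """Pull every row to the client, then filter and count locally."""
    counts = {}
    for _ip, path, status in conn.execute("SELECT ip, path, status FROM weblog"):
        if status == 200:
            counts[path] = counts.get(path, 0) + 1
    return counts


def views_pushed_down(conn):
    """Let the database filter and aggregate; only the summary comes back."""
    query = (
        "SELECT path, COUNT(*) FROM weblog "
        "WHERE status = 200 GROUP BY path"
    )
    return dict(conn.execute(query))
```

Both functions return the same counts, but the second one transfers one row per path instead of one row per log entry. On a table of 1.4 TB that difference is what makes running the cleaning and aggregation on the cluster attractive.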

Best,

Jon