How does KNIME work?

Hi all,

I have a little project.

Scenario:

500,000 records (about 1.4 TB) of web server logs.

A dedicated server with the Cloudera Express 5.5 distribution (hosted at an ISP).

I need:

1. Load the logs

2. Modify columns, clean rows, ...

3. Create reports from the logs about session times, page views, geolocation, ...


Questions:

I'm a newbie in KNIME. With a small part of the logs (only 30 MB) I have created a workflow that modifies columns, cleans rows... and generates one report.

With the small file everything is all right (about 3 minutes to finish the whole process), but I need to run this workflow on the whole file (the big one).

The big file is on the dedicated server, so I suppose KNIME should not use the "Database Reader" node because that would be too slow. I think I must modify columns and clean rows with nodes that execute the actions on the Hadoop server.

Is this true?

My second question: which is the best method for doing this?

Thanks in advance, and sorry for my English.

Roberto Garcia

Roberto,

1.4 TB is a lot, but you can probably still pipe everything through a single machine if you can spare the time. :-)

Functionalities to look for are:

  • Chunk loops (to process the data in batches, like the 30 MB one you have used so far)
  • Parallel chunk loops (to speed up the processing of batches by using multiple threads at once)
  • The "Don't save" nodes from the Image Mining extension to save disk I/O (though you'll lose result inspection options)
  • Streaming to save I/O and the wait time between nodes (it works only for some nodes, and you lose inspection options, but it can speed things up massively)
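The chunking-plus-parallelism idea behind the first two points is not specific to KNIME. As a rough sketch in plain Python (the `clean_chunk` logic is a hypothetical stand-in for whatever column/row cleaning your workflow does), processing a huge log in batches with a pool of workers might look like this:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def read_chunks(path, chunk_size=100_000):
    """Yield the log file in batches of lines, never holding it all in memory."""
    with open(path, encoding="utf-8", errors="replace") as f:
        while True:
            chunk = list(islice(f, chunk_size))
            if not chunk:
                return
            yield chunk

def clean_chunk(lines):
    """Stand-in for the column-modify / row-clean steps:
    here we just strip whitespace and drop blank lines."""
    return [ln.strip() for ln in lines if ln.strip()]

def process_log(path, workers=4):
    """Run the cleaning step on each batch in parallel and aggregate."""
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for cleaned in pool.map(clean_chunk, read_chunks(path)):
            total += len(cleaned)  # here you would write results instead
    return total
```

The point is the shape: a streaming reader, an independent per-batch function, and a pool mapping over batches; that is essentially what a parallel chunk loop does inside a KNIME workflow.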

Best regards,
el Ergonomista :-)

Hi Roberto,

If you have access to a Hadoop cluster, using the Hive connectors and possibly the Spark executors might be appropriate for the task you describe. You can learn more about these technologies here:

https://www.knime.org/knime-big-data-extensions

If that looks interesting (also check that your Hadoop cluster is supported by KNIME; see the bottom of that page), you can request a 30-day trial license to test your use case: https://www.knime.org/big-data-extensions-free-30-day-trial

There are some example workflows for the Big Data extensions available on the KNIME public examples server that will help you get started.
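The session-time report from step 3 of the question boils down to a grouped aggregation, which is exactly the kind of work Hive or Spark would distribute across the cluster. As a minimal illustration of the logic itself (plain Python; the `(ip, timestamp)` event shape and the 30-minute timeout are assumptions, not anything KNIME-specific):

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # common convention: a 30-minute gap starts a new session

def session_durations(events):
    """events: iterable of (visitor_ip, unix_timestamp) pairs, e.g. parsed
    from access-log lines. Returns {ip: [session_duration_in_seconds, ...]}."""
    by_ip = defaultdict(list)
    for ip, ts in events:
        by_ip[ip].append(ts)

    durations = defaultdict(list)
    for ip, stamps in by_ip.items():
        stamps.sort()
        start = prev = stamps[0]
        for ts in stamps[1:]:
            if ts - prev > SESSION_TIMEOUT:    # gap too long: close the session
                durations[ip].append(prev - start)
                start = ts
            prev = ts
        durations[ip].append(prev - start)     # close the final session
    return dict(durations)
```

In Hive or Spark the same computation would be expressed as a group-by over the visitor key, so it scales to the full 1.4 TB without ever loading the data onto one machine.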

Best,

Jon