I am working with the DL4J package and nodes. I have got them working and am now trying to use them with GPUs: 4 NVIDIA GRID K1s. I have installed the backend that DL4J connects to: the CUDA driver and CUDA Toolkit v7.5. I then went into the KNIME preferences and checked 'Use GPU' for DL4J. However, in my tests, running on the GPU takes 14 times longer than running on the CPU. Surely something is wrong. I know that the GPU itself is not the issue because, with the same backend installation, I have a deep learning program in Python that runs faster on the GPU than on the CPU. So unless some other dependencies need to be installed, I think the problem lies in how DL4J in KNIME connects to and uses the GPU.
I am on a Windows Server 2012 system.
I have attached the workflow I am using. The dataset used by the topmost node (Income) is found here: https://archive.ics.uci.edu/ml/datasets/Adult.
If anyone has experience setting up GPU use with DL4J, any help would be appreciated. Thank you.
In order to verify GPU usage, you can monitor the GPU utilization using a tool like GPU-Z (https://www.techpowerup.com/gpuz/). Is it correct that you are using 4 x NVIDIA GRID K1 GPUs? Currently, the GPU support cannot use a multi-GPU setup and will only use one GPU. I took a quick look at the specifications of your GPUs and compared the floating point performance to that of a recent GTX 1080.
single NVIDIA GRID K1: 326.4 GFLOPS
NVIDIA GTX 1080: 8,228 GFLOPS
Therefore, I think that the Python library you are using is able to use all GPUs, which is why it runs faster than on the CPU. Unfortunately, the KNIME integration cannot do that, so, given the floating point performance of a single K1, I would expect a longer runtime than on the CPU.
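To put the two figures above side by side, here is a minimal sketch of the arithmetic. It only uses the peak GFLOPS values quoted in this post; the "four GPUs combined" number assumes perfect scaling and is an idealised upper bound, not a measurement.

```python
# Peak floating point figures quoted above, in GFLOPS.
GRID_K1_SINGLE_GPU = 326.4
GTX_1080 = 8228.0


def speed_ratio(fast_gflops, slow_gflops):
    """Return how many times higher the first peak figure is than the second."""
    return fast_gflops / slow_gflops


# One GTX 1080 versus the single K1 GPU that the integration would use.
ratio_single = speed_ratio(GTX_1080, GRID_K1_SINGLE_GPU)

# Upper bound if a library spread the work over four K1 GPUs with perfect
# scaling (an assumption for illustration only).
four_k1_total = 4 * GRID_K1_SINGLE_GPU

print(f"GTX 1080 vs one K1 GPU: {ratio_single:.1f}x")
print(f"Four K1 GPUs combined:  {four_k1_total:.1f} GFLOPS")
```

Peak GFLOPS are theoretical limits, so real training throughput will differ; the sketch only shows the order of magnitude of the gap.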
I will add multi GPU support to our requested features list.
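As an alternative to watching GPU-Z by hand, the utilization check mentioned above could be scripted. The following is a rough sketch that assumes the nvidia-smi command line tool shipped with the NVIDIA driver is on the PATH; the function names and the sample text are made up for illustration.

```python
import subprocess

# nvidia-smi query for GPU index, utilization in percent and used memory in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Parse 'index, utilization, memory' rows into one dictionary per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, util, mem = [field.strip() for field in line.split(",")]
        rows.append(
            {"gpu": int(index), "util_percent": int(util), "mem_used_mib": int(mem)}
        )
    return rows


def read_gpu_usage():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


# Hypothetical sample text so the parser can be tried on a machine without a GPU.
sample = "0, 20, 512\n1, 0, 11\n"
usage = parse_gpu_rows(sample)
for row in usage:
    print(row)
```

Calling read_gpu_usage() repeatedly while a Learner node is executing would show whether only one of the GPUs is ever busy. Some GPUs report a field as not supported, in which case the int() conversion in this sketch would fail and would need handling.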
I'd like to add to this question: how can one improve DL4J's usage of the GPU? I have tried many different types of DL4J networks of all sizes and always end up with 20% GPU usage at most. What could the problem be?
As a result, my CPUs are much faster. Still, the CPU load is only 50%. Does KNIME support multi-socket CPU setups?
Preference option to use GPU is checked, CUDA 8.0 framework installed.
Hardware setup: 2 x Xeon X5650 CPUs (6 cores each) + 36 GB RAM vs. GTX 1060 6GB
Today I tried the font detection example on a Microsoft Azure NC6 VM with an NVIDIA K80 GPU.
I installed the CUDA driver and activated the GPU calculations as described here: https://www.knime.com/blog/learning-deep-learning
With GPU and without (CPU only) it took about 7 minutes. So activating/deactivating GPU calculations in the KNIME DL4J options had no impact on the calculation time.
I'm sorry but I have to say that without good GPU support the DL4J integration is pretty much useless.
Sorry for the late answer. The usage of GPU and CPU strongly depends on the network and the data used. Unfortunately, there is no single answer to these questions.
Regarding GPU usage: In the current version of KNIME we have some performance issues with GPU, which we are currently working on. It should get better in one of the next releases as we are upgrading the library version. GPU usage also depends on the batch size. With a larger batch size and bigger input data, the GPU should nevertheless be faster than the CPU. I also can't really estimate how fast a K80 GPU is expected to be, as I only have experience with a GTX 1080. However, currently only single-GPU setups are supported (I seem to remember that some GPU cards carry several GPUs on one card).
Regarding the font detection example: In this case, the example is rather small using small input images. You could try to increase the batch size. Unfortunately, due to the current performance problems, GPU will only bring better performance for larger networks and larger input sizes. Maybe you could try an AlexNet with 100x100 images.
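To illustrate the batch size suggestion: every training step carries some fixed overhead (for example moving a batch to the GPU), so fewer, larger steps tend to keep the GPU busier. A minimal sketch of how the number of steps per epoch shrinks as the batch size grows; the row count is a hypothetical value, not the size of the font detection data.

```python
import math


def iterations_per_epoch(n_rows, batch_size):
    """Number of training steps needed to see every row once."""
    return math.ceil(n_rows / batch_size)


N_ROWS = 50_000  # hypothetical dataset size, for illustration only

steps = {
    batch_size: iterations_per_epoch(N_ROWS, batch_size)
    for batch_size in (32, 256, 2048)
}

for batch_size, n_steps in steps.items():
    print(f"batch size {batch_size:>4}: {n_steps} steps per epoch")
```

A larger batch also needs more memory per step, so the batch size can only be raised as far as the memory limits allow.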
Regarding multi CPU setups: That should be no problem. However, all multi-threaded calculations are handled by the library so we can't do anything about that.
In general, there are many GPU-specific parameters that could increase GPU usage but are currently not configurable in KNIME, because we are using the DL4J default values. We plan to make them configurable in the future, but I can't promise a date. Note that the DL4J Integration is currently in KNIME Labs.
Sorry for the inconvenience.
Thank you for your answer. Today I encountered two error messages from the DL4J Feedforward Learner:
Execute failed: You can't allocate more memory, then allowed with configured value: 
Execute failed: Cannot allocate new IntPointer(8): totalBytes = 513, physicalBytes = 3G.
What do they mean?
That’s a DL4J error message indicating that you ran out of memory.
In addition to the Java heap space, DL4J uses off-heap memory. The maximum size of both can be configured in the knime.ini. If you run something that exhausts your memory limits, you may need to adjust those values. Usually, it is sufficient to set the -Xmx value to a higher number, but the memory limits can also be adjusted with some further options. For more information, please see the DL4J documentation here: https://deeplearning4j.org/memory
If you are already using the maximum amount of memory available on your machine, you could try to reduce the batch size. That’s also the easiest way to deal with this kind of problem.
Oh and I just remembered: Sometimes a restart of KNIME helps if you started a lot of Learner nodes in one session. We are currently investigating this problem.
Increasing the Xmx value helped, thank you.
Just to be clear: if I select GPU calculation in the KNIME DL4J options, does the Xmx value represent the memory used on the GPU? So if I have an NVIDIA K80 with 24GB, can I increase the Xmx value to 24GB?
Xmx is the amount of memory used by the Java heap space. Apart from that, there is the off-heap memory, whose size you can also configure. Please have a look at https://deeplearning4j.org/memory. I think by default the off-heap limit is set to twice your Xmx value. So, if you set Xmx to 12GB, it should be fine in your case.
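As a rough sketch of what the corresponding knime.ini entries might look like, the helper below builds the lines for a given heap size. The two org.bytedeco.javacpp properties are, to my knowledge, the off-heap options described on the linked DL4J memory page, so please check them against that page. The factor of 2 mirrors the "twice the Xmx value" remark above, and setting the physical limit to heap plus off-heap is an assumption of this sketch. The function name is made up for illustration.

```python
def knime_ini_memory_lines(heap_gb, off_heap_factor=2):
    """Build knime.ini memory lines for a heap size given in gigabytes."""
    # Off-heap limit as a multiple of the heap (assumed factor, see above).
    off_heap_gb = heap_gb * off_heap_factor
    # Limit on total physical memory: heap plus off-heap (an assumption).
    physical_gb = heap_gb + off_heap_gb
    return [
        f"-Xmx{heap_gb}G",
        f"-Dorg.bytedeco.javacpp.maxbytes={off_heap_gb}G",
        f"-Dorg.bytedeco.javacpp.maxphysicalbytes={physical_gb}G",
    ]


lines = knime_ini_memory_lines(12)
print("\n".join(lines))
```

These lines belong after the -vmargs entry in knime.ini, and KNIME has to be restarted for them to take effect. The sum of the values should stay below the RAM that is actually available on the machine.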