Python Script (2⇒1) stuck while running Neural Machine Translation

python

#1

Hi

I’m replicating the Google NMT benchmark for English → Vietnamese, using the KNIME example as my guide. I have successfully added the data sets for English and Vietnamese. However, the Python script that does the “index encoding and padding” seems to be stuck (I waited more than 4 hours for it to finish).
Row count of the table: 132,837

The script itself seems to run fine inside the KNIME node and in a Jupyter Notebook (see attached screenshots). The KNIME log does not show much, and I’m not sure what is causing this.
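For reference, the preprocessing step in question is roughly of this shape. This is only a minimal sketch of index encoding and padding; the vocabulary handling, the `<pad>` token, and the `max_length` value are my assumptions, not the actual script from the example workflow:

```python
def build_vocab(sentences):
    """Map each unique token to an integer index; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for token in sentence.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode_and_pad(sentences, vocab, max_length):
    """Replace tokens with their indices, then pad (or truncate) each row
    to exactly max_length entries."""
    encoded = []
    for sentence in sentences:
        ids = [vocab.get(tok, 0) for tok in sentence.split()][:max_length]
        ids += [0] * (max_length - len(ids))
        encoded.append(ids)
    return encoded


sentences = ["hello world", "hello again world"]
vocab = build_vocab(sentences)
padded = encode_and_pad(sentences, vocab, max_length=4)
# padded is [[1, 2, 0, 0], [1, 3, 2, 0]]
```

On 132k rows this pure-Python loop is cheap; the hang appears later, when the result is handed back to KNIME.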


#2

Hi,

at which percentage mark does the Python script get stuck?
Possibly at 70%?


#3

Forget what I wrote, I should have taken a closer look at your screenshot.
This is probably an issue with serialization from Python to KNIME.
Could you try whether it works for smaller tables (e.g. 10k rows), just to check?
I have an idea for an approach using Java Snippets, but I will have to verify whether it actually works.
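A quick way to run that check is to slice the input down before the encoding step. In a KNIME Python node the input arrives as a pandas DataFrame (the variable name and column below are stand-ins, not taken from the actual workflow):

```python
import pandas as pd

# Stand-in for the real 132,837-row input table of a KNIME Python node.
df = pd.DataFrame({"text": ["sentence"] * 132837})

# Keep only the first 10k rows, then run the same encoding/padding script
# on `small` to see whether serialization back to KNIME is the bottleneck.
small = df.head(10000)
```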

Cheers,

nemad


#4

Hi nemad,

Yes, it works with 10,000 records. However, it took a longer time to finish; that is mostly because it has larger max_lengths, I suppose.


#5

Hello again,

there is an alternative approach using Java Snippets; see the attached workflow:
JSnippet_Preprocessing.knwf (137.1 KB)

The workflow shows how to do the encoding and padding for the English sentences.
The same can be achieved for the other language by copy and paste.
With this preprocessing you should see a large speedup, because it doesn’t require communication with Python.

Note that I had to truncate the data because of file size limits here in the forum.
Please see this as an inspiration that still has room for improvement.

Cheers,

nemad


#6

Sure, I will try this out and let you know. It would be good to have all of it in Python if we can.

Thanks !

Mohammed Ayub


#7

Hi Nemad,

Update: the Java solution works fine. It also prompted me to set the “Rows per Chunk” value in the DL Python Network Learner to a small value like 1,000; otherwise it shows a deserialization error.