Timeout when reading CSV files

Hi everyone,

I’m running into a timeout issue when reading a large number of CSV files from a network drive and wanted to see if anyone has dealt with something similar.

I have ~400 CSV files (same layout), totaling around 80 million rows. These files are stored on a shared network location, so to run this on KNIME Hub I need the SMB Connector node.

My initial approach was:
List Files/Folders → Table Row to Variable Loop Start → CSV Reader → Loop End

This worked for a while but would consistently fail after ~150 files with a TransportException / SocketException (connection timeout).

Then I switched to a no-loop approach:
List Files/Folders → CSV Reader (filtered by wildcard)

This improved things slightly, but I’m still getting:
SMBRuntimeException: Timeout expired

The error usually happens in the middle of reading a file (so it doesn’t look like a parsing or data type issue anymore).
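
In pseudocode, both attempts boil down to the same reads. Here is a rough Python equivalent, just to show the pattern (the UNC path is made up, and in the real workflow the access goes through the SMB Connector rather than a file path):

    import glob
    import pandas as pd

    # Made-up path; the real workflow goes through the SMB Connector instead
    files = sorted(glob.glob(r"\\server\share\exports\*.csv"))

    # Loop variant: one read per file, like CSV Reader inside the loop.
    # The no-loop variant does the same reads, just driven by the reader's
    # wildcard filter instead of an explicit loop.
    frames = [pd.read_csv(path) for path in files]

    result = pd.concat(frames, ignore_index=True)

Either way, one connection to the share stays busy across all ~400 files, and a single network hiccup mid-file kills the whole run.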

Constraints:

  • This workflow will run on KNIME Hub (so I can’t just copy files locally beforehand)

  • The error message seems to point to a lost connection from the SMB Connector.

Any ideas on how I can fix this?

Here is the error message:

Execute failed: com.hierynomus.protocol.transport.TransportException - java.util.concurrent.ExecutionException: com.hierynomus.smbj.common.SMBRuntimeException: java.util.concurrent.TimeoutException: Timeout expired

Parser Configuration:
CsvParserSettings:
Auto configuration enabled=true
Auto-closing enabled=true
Autodetect column delimiter=false
Autodetect quotes=false
Column reordering enabled=true
Ignore leading whitespaces=true
Ignore trailing whitespaces=true
Input buffer size=1048576
Input reading on separate thread=true
Maximum number of characters per column=524288
Maximum number of columns=8192
Skip empty lines=true

Format configuration:
CsvFormat:
Field delimiter=,
Line separator=\n
Quote character="

Internal state when error was thrown:
line=459977, column=0, record=459977, charIndex=53869367
Headers=[Person No, Pay Rule Name, Full Name, Tots Apply Dt, Tots Labor Level Name 6 - Unit, Tots Labor Level Name 7 - Reset, Tots Money Amt, Tots Wage Amount, Tots Time In Hours, Tots Pay Code Name]

Thanks 🙂

@bsartorelli two approaches: you can list the files and add error handling (retrying the ones that failed), or you can download them to a local drive first and then import them.

Blog: Collect and Restore — or how to handle many large files and resume loops
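
To make the retry-and-resume idea concrete, here is a minimal Python sketch of the pattern (not KNIME nodes; the share path, the processed.json checkpoint file, and the retry parameters are all made-up placeholders):

    import glob
    import json
    import os
    import time
    import pandas as pd

    FILES = sorted(glob.glob(r"\\server\share\exports\*.csv"))  # made-up path
    DONE_LOG = "processed.json"  # hypothetical checkpoint file

    def read_with_retry(path, attempts=3, wait=5.0):
        # Retry a flaky read a few times before giving up;
        # attempts/wait are illustrative, not tuned values.
        for i in range(attempts):
            try:
                return pd.read_csv(path)
            except OSError:
                if i == attempts - 1:
                    raise
                time.sleep(wait * (i + 1))  # back off a bit longer each try

    done = set(json.load(open(DONE_LOG))) if os.path.exists(DONE_LOG) else set()
    for path in FILES:
        if path in done:
            continue  # resume: skip files that earlier runs already finished
        read_with_retry(path).to_parquet(os.path.basename(path) + ".parquet")
        done.add(path)
        with open(DONE_LOG, "w") as fh:
            json.dump(sorted(done), fh)  # checkpoint after every file

In KNIME terms that is roughly Try/Catch around the reader plus a record of finished files that a restarted loop filters out.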


From my gut feeling dealing with huge quantities of files on network folders, I'd suggest the following debug path and possible solution:

  • Seeing you are already using the List Files/Folders node instead of pointing the CSV Reader at the folder directly, I'd test a Row Filter between them on row IDs < X and see how far it gets without issues.
  • From there you can chain several such filters (X ≤ row IDs < Y, …), connected via flow variable so each one runs only after the previous (all taking their table connection from the list node).
  • Concatenate everything afterwards and carry on with your flow from there (sketched in Python below).
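
Here is the chained-filter idea in plain Python, just to show the batching pattern (the path and the batch size are made-up; shrink the batch size until one batch reliably finishes before the timeout):

    import glob
    import pandas as pd

    FILES = sorted(glob.glob(r"\\server\share\exports\*.csv"))  # made-up path
    BATCH = 25  # arbitrary starting point

    frames = []
    for start in range(0, len(FILES), BATCH):
        # One slice per "X <= rowID < Y" filter; it only begins after the
        # previous slice is done, so a failure costs one batch, not all 400 files.
        for path in FILES[start:start + BATCH]:
            frames.append(pd.read_csv(path))

    result = pd.concat(frames, ignore_index=True)  # the final Concatenate step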

I suggest this because oftentimes we mere users of the infrastructure can't do much about network/driver timeouts or instabilities. It's not ideal, but if it helps you move forward, that's great!
