File Reader node failing to access large text file containing 34 million records

I have a large pipe-delimited text file. This file has 35 million records in it. I used the File Reader node and faced the challenges below.

  1. Unable to add pipe, i.e. |, as the delimiter
  2. The node fails to read the file due to the large number of lines, i.e. 35 million.

I want to create a table from this file and compare the data with a database table, but while accessing the file through the File Reader node I get the error: "Execute failed: New line in quoted string (or closing quote missing). In line 3079174."
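For illustration, here is a minimal Python sketch (not the File Reader's actual code) of how a single stray quote character produces exactly this class of error: a quote-aware parser treats the newline as part of a quoted field and merges records, while turning quote handling off keeps each physical line intact.

```python
import csv
import io

# Pipe-delimited sample where the second line contains an unbalanced
# double quote -- the same failure mode behind "New line in quoted string".
data = 'a|b|c\nx|"unbalanced|y\np|q|r\n'

# With default quote handling, the stray quote swallows the newline,
# so lines 2 and 3 collapse into a single record.
rows_default = list(csv.reader(io.StringIO(data), delimiter='|'))
print(len(rows_default))  # 2 instead of the expected 3

# Disabling quote handling keeps every physical line as one record.
rows_none = list(csv.reader(io.StringIO(data), delimiter='|',
                            quoting=csv.QUOTE_NONE))
print(len(rows_none))  # 3
```

In the File Reader this corresponds to changing the quote-character settings in the configuration dialog rather than passing a parameter.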

What is the best possible way to work around this problem?

I am okay with a solution even if I have to split the file into multiple files.

Appreciate any help you can give.

Thanks,
Amit

@singlaamit501 welcome to the KNIME forum. You could try the new CSV Reader node or the R package readr, which works in most cases.

Also, if you have unbalanced quotes, you might have to read the docs and tweak the settings.

Here is an overview of several nodes:
https://hub.knime.com/mlauber71/spaces/Public/latest/forum/kn_forum_28064_r_import_csv_r_readr_strange_charaters

I want a solution for a large text file containing 35 million rows. Also, I don't know how to access the R package readr. A more detailed answer would be appreciated.

Thank you for looking into my problem. I am new to KNIME.

I would recommend trying the three options from the example on the KNIME Hub that I linked. R would only be a matter of last resort.

If you could provide us with a sample of the data, including the lines that cause the problems, it would be easier to assist.

Other than that, you would have to try a few options and settings.

If you are interested in using R, I have provided a collection on how to install R:

Thanks again for your generous help. I will look into this. I have a text file with 35 million lines which I want to compare with a database table.

Example:
XXXaabdsdf-d238-50|XX|2XXXXXXX5|BXXL|FXXXXXX|X|1XXXX|||XXX E XXX ST APT XL||||XX|XX|XXXX|
XXXaabdsdf-d238-50|XX|2XXXXXXX5|BXXL|FXXXXXX|X|1XXXX|||XXX E XXX ST APT XL||||XX|XX|XXXX|

Any idea how I can handle 35 million lines? Because the File Reader only allows me 1,129,362 lines to create a table out of this file.
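As a point of comparison, tools that stream a file row by row have no fixed row limit: memory use stays flat no matter how many lines the file has. A minimal Python sketch (the filename and row count are made up for illustration):

```python
import csv

# Build a tiny pipe-delimited sample standing in for the 35M-row export
# (assumed filename).
path = "records.txt"
with open(path, "w", encoding="utf-8") as f:
    for i in range(10):
        f.write(f"id{i}|XX|2XXXXXXX5|BXXL\n")

# Stream the file row by row instead of materializing one big table.
count = 0
with open(path, newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        count += 1  # compare each row against the database table here
print(count)  # 10
```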

Hi,
you can actually use a pipe delimiter in the File Reader node:

(screenshot: File Reader configuration)

Just type in a pipe character.

About the error: you probably have a cell at that line that is raising the error, so you can try making these changes in the configuration:

(screenshots: suggested configuration changes)

Let me know if it is working now.

Luca

4 Likes

What you could try is the Simple File Reader or the CSV Reader (Labs). Both are way faster than the File Reader and somewhat more robust when it comes to quoted strings.

Furthermore, if that node also fails, please investigate what's wrong with the line mentioned by the File Reader. Maybe you could provide us with the row so we can help you find the proper settings.
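One quick way to locate a problem row, sketched here in Python (the file name, contents, and expected field count are assumptions for the example), is to flag every line whose field count differs from the rest:

```python
# Build a small sample file with one malformed line (assumed name).
path = "data.txt"
with open(path, "w", encoding="utf-8") as f:
    f.write("a|b|c\n")
    f.write("a|b\n")      # malformed: one field short
    f.write('a|"b|c\n')   # stray quote, but still 3 pipe-split fields

# Flag lines whose raw pipe-split field count differs from the
# expected column count (line numbers are 1-based).
expected = 3
bad = []
with open(path, encoding="utf-8") as f:
    for no, line in enumerate(f, start=1):
        if line.rstrip("\n").count("|") + 1 != expected:
            bad.append(no)
print(bad)  # [2]
```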

1 Like

Hello again, mate. I figured out that the File Reader node is only showing 4,664,593 rows. I have 34,000,000 rows in total in the text file. How can I access all of those together?

Did you get any errors?

No error. The File Reader node only shows 4,664,593 rows and completes the execution successfully.

1 Like

:slight_smile: Are you really sure that you have 34M rows?

Is it possible that because of strange delimiters some lines get skipped?
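One way to settle how many rows the file really has is to count physical newlines outside of any CSV parser, so that quote characters cannot merge lines. A small Python sketch (the sample file stands in for the real export):

```python
# Write a five-line sample file standing in for the real export
# (assumed name).
path = "sample.txt"
with open(path, "w", encoding="utf-8") as f:
    f.write("a|b|c\n" * 5)

# Count newline bytes in binary mode, a megabyte at a time; quoting
# rules play no role here, so this is the true physical line count.
lines = 0
with open(path, "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        lines += chunk.count(b"\n")
print(lines)  # 5
```

Comparing this number with what the reader node reports shows immediately whether lines are being merged or skipped.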

Another possibility is to try KNIME's local big data environment, create an external table and see what that does. Other than that, I would recommend trying the R package, since it can handle a lot of import problems.

1 Like

As I said, all the lines have the same format.

example- XXXaabdsdf-d238-50|XX|2XXXXXXX5|BXXL|FXXXXXX|X|1XXXX|||XXX E XXX ST APT XL||||XX|XX|XXXX|

Can you give me a sample workflow for the big data environment? I have never used the big data environment or the R package before.

The examples are in the links to the KNIME Hub. You could download them and try to adapt them to your needs. That would also help you understand how these things work.

I am trying to add this workflow but am getting the error "Hub request failed".

Also, I tried splitting the text file into multiple files and then building a loop around accessing them, so that I can process all 35 million records. Is there any mistake in my workflow? Because I am not getting the desired results.
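For reference, the splitting step itself can be sketched outside KNIME in a few lines of Python (the file name, row count, and chunk size are made up; real runs might use a million lines per part):

```python
import itertools

# Build a toy 25-line input file standing in for the 35M-row export
# (assumed name).
src = "big.txt"
with open(src, "w", encoding="utf-8") as f:
    for i in range(25):
        f.write(f"row{i}|XX|YY\n")

chunk_size = 10  # lines per output file
parts = []
with open(src, encoding="utf-8") as f:
    for part_no in itertools.count():
        # islice takes the next chunk_size lines without rereading the file.
        lines = list(itertools.islice(f, chunk_size))
        if not lines:
            break
        name = f"part_{part_no:03d}.txt"
        with open(name, "w", encoding="utf-8") as out:
            out.writelines(lines)
        parts.append(name)
print(parts)  # ['part_000.txt', 'part_001.txt', 'part_002.txt']
```

Each part file can then be read by a reader node inside a loop and the results concatenated.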

You can try to load it here

2 Likes

I really appreciate your response. This looks more complicated to work with; there are no proper steps on how to configure it. I am a newbie to everything.

I think it would help us if you could send us the line where the node failed, before you changed its configuration. I think @mlauber71 is right and your format is not consistent.

Best
Mark

1 Like