how to parse a text file

Hi,

I have a log file (see below) from an external (rel. to KNIME) program that I would like to analyze and display with KNIME-Reporter.

As the file can be regarded as a concatenation of many different files, I think the best option is to split it into many smaller files that can then be read using the File Reader. Any other option using KNIME and either the Line Reader or File Reader would involve a VERY complicated workflow that wouldn't be very flexible either.

Do you have any other idea on how to process this file? I don't like the idea of producing many files, as some of them will probably only contain one line and that just pollutes the file system...


Thanks a lot for any comments.

Bernd

As you can see, the file contains sections that are delimited by ">> Name" and ">>END_MODULE". Within these sections, a File Reader would be able to read the columns appropriately.

##FastQC	0.7.2
>>Basic Statistics	pass
#Measure	Value	
Filename	s_1_no_adapter.txt	
File type	Conventional base calls	
Total Sequences	20194864	
Sequence length	10-36	
%GC	48	
>>END_MODULE
>>Per base sequence quality	pass
#Base	Mean	Median	Lower Quartile	Upper Quartile	10th Percentile	90th Percentile
1	36.87651870297319	38.0	37.0	38.0	34.0	38.0
2	35.91392811558424	38.0	36.0	38.0	32.0	38.0
3	36.08842396759889	38.0	36.0	38.0	32.0	38.0
>>END_MODULE
>>Per sequence quality scores	pass
#Quality	Count
11	4.0
12	3.0
13	19.0
14	41.0
>>END_MODULE
>>Per base sequence content	fail
#Base	G	A	T	C
1	3494290	1633260	13727865	1331803
2	10505590	3343578	4101416	2244081
3	9215728	4416179	3275544	3285580
4	6309121	8909270	3398507	1577865
5	3014520	5830317	3164529	8185416
6	8489220	3475586	4206412	4023646
>>END_MODULE
>>Per base GC content	fail
#Base	%GC
1	23.906676987388753
2	63.13385738263051
3	61.90902197891936
4	39.0546103462566
5	55.4595538590117
6	61.96063513970681
>>END_MODULE
>>Per sequence GC content	fail
#GC Content	Count
0	124.0
1	124.0
2	149.0
3	263.5
>>END_MODULE
>>Per base N content	pass
#Base	N-Count
1	0.037861111617290416
2	9.853990598797794E-4
3	0.009076565209847415
4	5.001271610445111E-4
5	4.060438337193061E-4
6	0.0
>>END_MODULE
>>Sequence Length Distribution	warn
#Length	Count
10	59248.0
11	43350.0
12	63321.0
13	54282.0
>>END_MODULE
>>Sequence Duplication Levels	fail
#Total Duplicate Percentage	94.91320156178112
#Duplication Level	Relative count
1	100.0
2	53.83167044994633
3	28.596315261574198
4	17.84476235724718
5	13.160534234281792
>>END_MODULE
>>Overrepresented sequences	fail
#Sequence	Count	Percentage	Possible Source
TGGACGGAGAACTGATAAGGGC	4116367	20.383237044824863	No Hit
GTCGACAGGGAGATAAATCACT	1219086	6.036614061872365	No Hit
>>END_MODULE
>>Kmer Content	fail
#Sequence	Count	Obs/Exp Overall	Obs/Exp Max	Max Obs/Exp Position
CTGAT	6703060	24.08969	683.2534	12
ACTGA	6698550	18.509642	527.1035	11
GAACT	6679100	18.455896	525.52765	9
AACTG	6672260	18.436995	528.3039	10
>>END_MODULE
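
For illustration, here is a minimal Python sketch (outside KNIME) of how the sections could be parsed in memory instead of being written out as separate files. The ">>Name<TAB>status" and ">>END_MODULE" delimiters come from the sample above; the function name and return structure are my own choices.

```python
# Sketch: parse a FastQC-style report into {module_name: (status, rows)}.
# Assumes the delimiters shown in the sample above; everything else is illustrative.

def parse_fastqc(lines):
    """Return {module_name: (status, [rows])} for each >>...>>END_MODULE block."""
    sections = {}
    name, status, rows = None, None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">>END_MODULE"):
            sections[name] = (status, rows)
            name, status, rows = None, None, []
        elif line.startswith(">>"):
            # section header: ">>Module name<TAB>status"
            name, _, status = line[2:].partition("\t")
        elif name is not None:
            rows.append(line.split("\t"))
    return sections

sample = """>>Basic Statistics\tpass
#Measure\tValue
Filename\ts_1_no_adapter.txt
>>END_MODULE
"""
sections = parse_fastqc(sample.splitlines())
print(sections["Basic Statistics"][0])  # pass
```

Each module's rows could then be fed to whatever downstream analysis is needed, without touching the file system.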

Hi Bernd,

One option is to copy the log text file to its own directory somewhere. Then use the "Flat File Document Parser" node in "KNIME Labs/Text Processing" and choose the location of that directory in the node configuration.

This puts the log file into a "Document" cell. Now use the "Document Data Extractor" node to convert this into a "String" cell; in the node configuration, choose "Text". Then connect up the "Data to Report" node.

In the Report Designer, when you choose the column (which will be called "Text") and add it to the report page, highlight it, go to Advanced Options and "WhiteSpace", and change it from "No Wrapping" to "Normal"; otherwise you will only see the first line of the log file.

Does this meet your needs?

Simon.

Hi Simon,

I actually want to plot some graphs based on the values and might even compare different files. So I think I will need the data within KNIME in a usable form.

Maybe I could write a new input node which has an arbitrary number of outputs? Is this possible? (I only know of variable inputs)...

Otherwise, I have completed the script to create multiple files, and this seems to be working. Unfortunately, this was easier than devising an algorithm in KNIME. An awk/gawk node might be very useful... Well, the hackathon is coming soon, and maybe we can come up with a solution there...
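
The awk script itself isn't shown in the thread; a rough Python equivalent of such a splitter might look like the sketch below. The output file naming scheme is my own assumption, chosen so that each section becomes a small tab-separated file that a File Reader could pick up.

```python
# Sketch: split a FastQC-style report into one file per >>...>>END_MODULE
# section, so each can be read individually. File naming is illustrative.
import os
import re

def split_fastqc(path, out_dir):
    """Write the body of each module section to <out_dir>/<Module_Name>.txt."""
    os.makedirs(out_dir, exist_ok=True)
    out = None
    with open(path) as fh:
        for line in fh:
            if line.startswith(">>END_MODULE"):
                if out:
                    out.close()
                    out = None
            elif line.startswith(">>"):
                # derive a file name from the module name before the tab
                name = line[2:].split("\t")[0]
                fname = re.sub(r"\W+", "_", name) + ".txt"
                out = open(os.path.join(out_dir, fname), "w")
            elif out:
                out.write(line)
    if out:
        out.close()
```

Calling `split_fastqc("fastqc_report.txt", "sections")` would produce files like `Basic_Statistics.txt` and `Per_base_sequence_quality.txt`, one per module.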

Thanks for all your help,


Bernd

Reading in multiple files in KNIME is quite easy. If you copy all the files to a single directory, you can use the "List Files" node to get a list of all the file names in that directory. Then connect up a "TableRow To Variable Loop Start" node; this makes all the file names available as flow variables.

You can now use the "File Reader" node by showing its variable ports and connecting the "TableRow To Variable Loop Start" output to the variable input port of the "File Reader" node. To configure this, go to the Flow Variables tab of the "File Reader" node and choose the variable called "URL" from the drop-down list next to "DataURL". Then place a "Loop End" node after it. This will load all the text files from the directory in one go. Is this helpful?
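
For comparison, outside KNIME the "List Files" + loop + "File Reader" combination corresponds to something like this Python sketch; the directory layout and `.txt` extension are assumptions for illustration.

```python
# Sketch: read every .txt file in a directory as tab-separated rows,
# collecting one table per file — the plain-Python analogue of the
# List Files -> TableRow To Variable Loop Start -> File Reader pattern.
import csv
import glob
import os

def read_all(directory):
    """Return {file_name: list_of_rows} for each .txt file in the directory."""
    tables = {}
    for path in sorted(glob.glob(os.path.join(directory, "*.txt"))):
        with open(path, newline="") as fh:
            tables[os.path.basename(path)] = list(csv.reader(fh, delimiter="\t"))
    return tables
```

As in the KNIME loop, every file is read with the same configuration, which is fine as long as all files share the same structure.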

If the File Reader node doesn't load the files the desired way, remember you can swap it for the "Flat File Document Parser" node instead, followed by the "Document Data Extractor" node and then the "Loop End" node. This approach also allows you to do more manipulations with the document: instead of extracting everything into one cell with "Document Data Extractor", you can extract one sentence per cell with the "Sentence Extractor" node, or filter out certain terms and characters with the nodes in "Text Processing/Preprocessing".

Simon.

One problem with looping over different files is that the File Reader won't adjust the format (i.e. the table structure, column names, etc.) for each new input file. That will cause a problem...

I have now solved the problem with the awk script that creates the different files, which I then read in individually (no loop). This is OK since they all have to be handled differently in the report anyway and also need separate Data to Report nodes.

Though the Report Designer is very flexible, it is still limited to very specific cases where the data has to be nearly the same. And that is actually what I want anyway. It just takes time to generate, and you have to have a plan before starting. That is somewhat unsatisfactory because the data changes quite often, but I guess I can live with that.

Thanks for your kind help,

Bernd

Hi,

I was looking at this solution, and I noticed that, as of today, the File Reader doesn't have an option to update the DataURL. Do you have any additional recommendations?

Hi @CarishmaM -

This thread is more than 10 years old, so it’s likely that the solution posted has since been superseded.

It sounds like you’re interested in reading in multiple files from a directory? Please make a new thread and include as many details as possible about what you’re attempting to do. :slight_smile:
