Text File Data Mining extracting specific text (parsing)

velasced · January 13, 2016, 7:16pm

Hi-

I'm new to KNIME and perhaps I've not done enough research (currently watching the Text Mining Webinar). I have several log files from Informatica. These files are in a directory (the names change). I need to extract specific sections of text, much like I would with UNIX awk, grep, or regular expressions.

I started with the Flat File Document Parser, but I have no clue what to do next. What would an example flow to accomplish this look like? Any advice will be greatly appreciated! Ed

Here are some entries from the logs:

These are standard (structured)

DIRECTOR> TM_6685 Workflow: [wf_ADS_Master30_Core_Aggregates_EU] Run Instance Name: [] Run Id: [2274768]
DIRECTOR> TM_6101 Mapping name: m_aggr_sales_plan_week [version 15].

These are quite unstructured

LKPDP_2> DBG_21097 Lookup Transformation [SC_Lkp_ADS_cal_dt_by_cal_dt_hist_id2]: Default sql to create lookup cache: SELECT cal_dt,cal_dt_history_id FROM cal_dt WHERE etl_current_flg = 1 ORDER BY cal_dt_history_id,cal_dt

LKPDP_2> TE_7212 Increasing [Index Cache] size for transformation [SC_Lkp_ADS_cal_dt_by_cal_dt_hist_id2] from [38836148] to [38838800]. <---this is the next log entry

kilian.thiel · January 15, 2016, 2:09pm

Hi Velasced,

Your task does not seem to me like a usual text mining task. The Textprocessing extension is useful if you want to do POS tagging, NE recognition, filtering of documents, or conversion to vector space.

If you want to process log files in an awk, grep, or regex fashion, other nodes are more useful here.

How are these log files structured? Is every line structured the same way? If so, you could start with the Line Reader node, which creates one column as output and one row per line. Then proceed with the String Manipulation or Column Splitter node and so on. If you are interested in only some specific rows/lines, use the Row Filter with e.g. a regex to filter these lines, and then split and manipulate these rows further.
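For readers who think in scripts rather than nodes, the same read, filter, split sequence can be sketched in Python. This is only an illustration, not a KNIME workflow: the `LABEL> CODE_1234 message` layout is inferred from the sample entries posted above, and the names `ENTRY` and `parse_lines` are hypothetical.

```python
import re

# Assumed entry layout, inferred from the sample log lines:
#   LABEL> CODE_1234 free-form message text
ENTRY = re.compile(r"^(?P<label>\w+)>\s+(?P<code>[A-Z]+_\d+)\s+(?P<message>.*)$")


def parse_lines(lines, code_pattern=None):
    """Turn raw log lines into rows with label, code and message fields.

    Roughly: one row per line (Line Reader), keep rows whose message code
    matches code_pattern (Row Filter), split into columns (Column Splitter).
    """
    rows = []
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None:
            continue  # blank line or a line that does not follow the layout
        if code_pattern and not re.search(code_pattern, match.group("code")):
            continue  # filtered out by the regex on the message code
        rows.append(match.groupdict())
    return rows


sample = [
    "DIRECTOR> TM_6685 Workflow: [wf_ADS_Master30_Core_Aggregates_EU] Run Instance Name: [] Run Id: [2274768]",
    "DIRECTOR> TM_6101 Mapping name: m_aggr_sales_plan_week [version 15].",
    "LKPDP_2> TE_7212 Increasing [Index Cache] size for transformation [SC_Lkp_ADS_cal_dt_by_cal_dt_hist_id2] from [38836148] to [38838800].",
]

# Keep only the TM_ entries and print the split fields.
rows = parse_lines(sample, code_pattern=r"^TM_")
for row in rows:
    print(row["label"], row["code"], row["message"], sep=" | ")
```

To run it over a real file, pass an open file handle instead of `sample`, since `parse_lines` only needs an iterable of strings.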

Cheers, Kilian

velasced · January 19, 2016, 5:05pm

Hi Kilian-

Thanks for the reply. There is no structure per se; in the log files every entry starts with a "LABEL>" followed by text. However, you can infer the text delimiters because there is standard text like "workflow:" ending in "] Run". The recommended flow is a great start, I'll give it a shot.
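To make the delimiter idea concrete, here is a minimal Python sketch that pulls out the text between "Workflow:" and "] Run", plus the run id, from the sample TM_6685 entry. The patterns and the function names `extract_workflow` and `bracketed` are assumptions based only on the lines posted above; real Informatica logs may need adjusted expressions.

```python
import re

# The value sits between "Workflow: [" and "] Run" (case-insensitive match).
WORKFLOW = re.compile(r"workflow:\s*\[(?P<workflow>[^\]]*)\]\s*Run", re.IGNORECASE)
RUN_ID = re.compile(r"Run Id:\s*\[(?P<run_id>\d+)\]")


def extract_workflow(line):
    """Return the workflow name and run id from a line, or None if absent."""
    wf = WORKFLOW.search(line)
    if wf is None:
        return None
    rid = RUN_ID.search(line)
    return {
        "workflow": wf.group("workflow"),
        "run_id": int(rid.group("run_id")) if rid else None,
    }


def bracketed(line):
    """Return every [...] value in a line, in order of appearance."""
    return re.findall(r"\[([^\]]*)\]", line)


line = ("DIRECTOR> TM_6685 Workflow: [wf_ADS_Master30_Core_Aggregates_EU] "
        "Run Instance Name: [] Run Id: [2274768]")
result = extract_workflow(line)
print(result)
print(bracketed(line))
```

For the less structured entries, `bracketed` is a fallback that collects every bracketed value without relying on the surrounding wording.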

Regards, Ed
