auto detect position in text file: discover pattern

HI ALL,
i have a txt file (no csv) where data have a specific position.

For 2020-2022 i have the Rules of these files, but before 2020 the file was little bit different and i don’t have any documentaion about.

Is it possible to get the “pattern” of this unknow data? I’m looking for a node with a name like this: discover pattern

BEFORE
row1 A12CD12
row2 A13CA12
row3 A18CB11
row4 R17DA13
row5 M22BC12

AFTER
col _1) 2) 3) 4)
row1 A 12 CD 12
row2 A 13 CA 12
row3 A 18 CB 11
row4 R 17 DA 13
row5 M 22 BC 12

My problem is more stronger…here just an example :smiling_face_with_tear:

Hi @gcas

When you say you need the node to “discover the pattern” do you mean you have no idea what the pattern might be and you need it to inspect all the data and just find one, and it could literally be given data in any format?

What is the basis by which it might “discover” a pattern? Would it always split on letter/number boundaries?
Why for example might the pattern not equally be…

Col1, Col2
A12, CD12
A13, CA12

and so on?

My approach to a task like this is to ask how I might achieve the task if I were performing it manually and then try to work on how to automate that same process. It feels at the moment like either I’m not aware of the “rules” or else there is a lot that is subjective here.

It is an interesting challenge but I think more discussion is needed before anybody would be able to embark on a generalised solution.

3 Likes

Thanks for you reply @takbb

In the real world i should print 5-10 record in order to understand where each Column start and finish and after this use “cell Splitter node”.

Could be Nice to use a node that can do this automatically, or gelo me to do in a visual mode… as Excel do in “import data”

Hi @gcas,

First, I would check the conformity and convert strings into a general format using the String Manipulation node like upper case letters to “A”, lower case letters “a” and digits to “9” leaving other characters as they are.
For example:
John Smith 36 Teacher → Aaaa Aaaaa 99 Aaaaaaa

Then I would go with Regex Split to create my separate columns. Regarding the previous example I can use this regex:
([A-Z][a-z]+)\s([A-Z][a-z]+)\s(\d+)\s([A-Z][a-z]+)
Output:

col1   col2   col3   col4
John   Smith   36   Teacher

:blush:

5 Likes

THNAKS! this look so tricky for me. As i wrote before I printed 5/10 row and I plit manually, with a pen, and after this i used cell splitter.
I supposed knime should do this automatically, but after read both answer to me i think is not possible.

44032.knwf (12.5 KB)

Here is an example workflow. I have generated some fake input to demonstrate how the method works.

First 5 values:

MollyPiper60molly.piper@someemail.comUrban_Planner
NatalieBlake40natalie.blake@someemail.orgCashier
TimCornish44tim.cornish@someemail.bizReal_Estate_Agent
BenjaminLyman33benjamin.lyman@someemail.helpMassage_Therapist
FelicityMacLeod64felicity.macleod@someemail.deProfessional_athlete

By converting the characters to base characters, we make it easy to find the pattern:

AaaaaAaaaa99aaaaa.aaaaa@aaaaaaaaa.aaaAaaaa_Aaaaaaa
AaaaaaaAaaaa99aaaaaaa.aaaaa@aaaaaaaaa.aaaAaaaaaa
AaaAaaaaaa99aaa.aaaaaaa@aaaaaaaaa.aaaAaaa_Aaaaaa_Aaaaa
AaaaaaaaAaaaa99aaaaaaaa.aaaaa@aaaaaaaaa.aaaaAaaaaaa_Aaaaaaaaa
AaaaaaaaAaaAaaa99aaaaaaaa.aaaaaaa@aaaaaaaaa.aaAaaaaaaaaaaa_aaaaaaa

Or by using even less granular form:

AaAa9a.a@a.aAa_Aa
AaAa9a.a@a.aAa
AaAa9a.a@a.aAa_Aa_Aa
AaAa9a.a@a.aAa_Aa
AaAaAa9a.a@a.aAa_a

Now it’s much easier to find the pattern, isn’t it?
Having the pattern, we can use the proper regex to spit the string into columns.

:blush:

PS: I think I could start working on some component to split string based on most repetitive groups. It may be a bit tricky.

5 Likes

thanks a lot for explanation ant time spent for me! this look soooooooo much smart, i’ll test soon as possible!

PS: If you are going to develope something i suggest to use a visual model, as excel do in “split by position”, where i can draw column selecting only selecting the position.

immagine

Could you send me a sample file containing a few examples?

1 Like

Are you reading my mind? :slight_smile:
this is the workflow i created for test.
Rows are much more

tesk.xlsx (40.3 KB)

1 Like

Actually, it would be very helpful if you provide me with the desired output table as well. So I know how exactly you need these values be split. It would be a help to create the general approach.
If that’s not possible, no worries. I will work on it anyways.

the workflow i sent to you is based on records before the 2019.

Since 2019 a have the index of that records, so i know what is (and how much long is) the column 1,2,3,…80

Before the 2019 i don’t have documentation about, for this reason i created this thread and for this reason i cannot give to you the desired output table

1 Like

Good news @gcas!

I could find a way to split your strings:
44032.knwf (102.7 KB)

The idea is to split strings character by character (I was inspired by your “split by position” idea). Then we find positions for which all rows are blank. We consider these positions as delimiters (multiple blank positions aggregate into one). And the rest is straightforward. Yet, I think there are some columns containing strings which can be split further. Take a look and let me know if this is what you were looking for.

:blush:

2 Likes

THANKS A LOT FOR YOUR TIME!
This approch look so smart and i’m going to deep dive in it

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.