Organizing Strings based on Pattern

Hello all,

I have a table that has a primary pattern of rows which everything else from that pattern should match. On the same file, I have data that must be ‘restructured’ back onto its original order. Example:

Vehicle Attributes Vehicle Data
Vehicle Tvpe Conv
Year Make 2000 Ford
Model Mustang
Mileage 88663
MPG (Citv/Highwav) 17/25
Drive Type RWD
Transmission Auto
Engine 4.6LV8
Payload (lbs) 1000/7.7
Feel Capacity(ogl) 15.7
Base Engine Displacement 4L
Vehicle Tvpe Conv
Year Make 2007 BMW
Model M6
Mileage 34737
MPG (Citv/Highwav) 44914
Drive Type RWD
Transmission Auto
Engine 5LVIO
Payload(lbs))] 12.4ft/8491bs
Feel Capacity(ogl) 18.5
Base Engine Displacement 5L
Vehicle Tvpe Conv Year Make 2008 Audi
Model TT2.0T
Mileage 29092
MPG (Citv/Highwav) 22/29
Drive Type FWD
Transmission Auto
Engine 2L14
Payload(lbs))] 8.8ft
Feel Capacity(ogl) 14.5
Base Engine Displacement 2L
Vehicle Tvpe Conv
Year Make 2003 BMW
Model Z4Roadster
Mileage 94749 MPG (Citv/Highwav) 21/29
Drive Type RWD
Transmission Manual Engine 3.0L16 Payload(lbs))] 551/9.2 ft
Feel Capacity(ogl) 14.5
Base Engine Displacement 3L
Vehicle Tvpe Conv
Year Make 2007 Honda
Model S2000
Mileage 24326
MPG (Citv/Highwav) 20/26
Drive Type RWD
Transmission Manual
Engine 2.2L14
Payload(lbs))] Sft
Feel Capacity(ogl) 13.2 Base Engine Displacement 2L
Vehicle Tvpe Coupe
Year Make 2007 Honda Model CivicLX
Mileage 45921
MPG (Citv/Highwav) 30/40
Drive Type FWD
Transmission Auto
Engine 1.8L14
Payload(lbs))] 1000111.511
Feel Capacity(ogl) 13.2
Base Engine Displacement IL
Vehicle Tvpe : Coupe
Year Make: 2005 Honda
Model RSXTypeS
Mileage 63119
MPG (Citv/Highwav) 23/31
Drive Type FWD
Transmission: Manual
Engine 2L14
Payload(lbs))] 25.31
Feel Capacity(ogl) 13.2
Base Engine Displacement 2L
Vehicle Tvpe Coupe
Year Make 2001 Honda
Model Prelude
Mileage 83408
MPG (Citv/Highwav) 22/27
Drive Type FWD
Transmission Manual
Engine 2.2L14
Payload(lbs))] 8.71
Feel Capacity(ogl) 15.9
Base Engine Displacement 2L
Vehicle Tvpe Hatchback
Year Make 2009 Toyota
Model Venza
Mileage 3677
MPG (Citv/Highwav) 19/26 Drive Type : FWD Transmission : Auto
Engine 3.5LV6
Payload(lbs))] 3500/70.11825
Feel Capacity(ogl) 17.7
Base Engine Displacement 3.5L
Vehicle Tvpe Minivan
Year Make 2005 Chevrolet
Model VentureLS
Mileage 142945
MPG (Citv/Highwav) 19/26
Drive Type FWD
Transmission Auto Engine 3.4LV6
Payload(lbs))] 140.7ftl1457lbspayload
Feel Capacity(ogl) 25
Base Engine Displacement 3L
Vehicle Tvpe Sedan
Year Make Mercedes-Benz
Model 863 AMG
Mileage 40065
MPG (Citv/Highwav) 44882
Drive Type RWD
Transmission Auto
Engine 62LV8 Payload(lbs))]: 16.7ft Feel Capacity(ogl) 0
Base Engine Displacement 0

As you see. I need to get the data on columns 2, 3 and 4, in their own rows. The order will always be based on rows 2-12 (Vehicle Type - Base Engine Displacement). And the data out of order will always be simply a ‘carriage return’ away. I can easily do this on a Rich Text Editor in chunks, but when the data becomes larger (hundreds of records), it can take sometime.

Second issue is, I need to remove some of the unwanted strings, like colons < : > and unnecessary strings as seen.

Third issue, as you can also see here. Sometimes the rows are separated by lines, I need to bring them together.

Lastly, I had this data on a single column, and I have learned how to spread them throughout columns last time, example:

Vehicle Tvpe Conv Vehicle Tvpe Conv Vehicle Tvpe Conv Vehicle Tvpe Conv
Year Make 2000 Ford Year Make 2007 BMW Year Make 2008 Audi Year Make 2003 BMW
Model Mustang Model M6 Model TT2.0T Model Z4Roadster
Mileage 88663 Mileage 34737 Mileage 29092 Mileage 94749
MPG (Citv/Highwav) 17/25 MPG (Citv/Highwav) 44914 MPG (Citv/Highwav) 22/29 MPG (Citv/Highwav) 21/29
Drive Type RWD Drive Type RWD Drive Type FWD Drive Type RWD
Transmission Auto Transmission Auto Transmission Auto Transmission Manual
Engine 4.6LV8 Engine 5LVIO Engine 2L14 Engine 3.0L16
Payload (lbs) 1000/7.7 Payload(lbs))] 12.4ft/8491bs Payload(lbs))] 8.8ft Payload(lbs))] 551/9.2 ft
Feel Capacity(ogl) 15.7 Feel Capacity(ogl) 18.5 Feel Capacity(ogl) 14.5 Feel Capacity(ogl) 14.5
Base Engine Displacement 4L Base Engine Displacement 5L Base Engine Displacement 2L Base Engine Displacement 3L

So the data is currently handled on a table manner. I will need to have it joined back onto a single column in order to spread it properly the way shown in this last table above. Any help will be extremely appreciated. I don’t even know where to start. Thanks!

J.
vehicle_sample-data.xlsx (11.2 KB)
vehicle_sample-data.txt (2.4 KB)

Hi @jarviscampbell

Your desired output confuses me a bit. Do you want a two-columns list or do you want that table overview?
It’s definitely possible to get the former with the additional headers as you described, but just want to confirm this before putting time and effort into writing about a possible solution.

Thanks a lot @ArjenEX ,

The end result should show the attributes in only the first column. All the following columns should only be the results/data organized/cleaned up.

Does that make sense?

Thanks a lot.
J.

This then?

image

1 Like

Hi @ArjenEX , Exactly that. Thanks again.

Allright @jarviscampbell! Below is a bit rough cut and probably can be done more efficient but with a computation time of 86 ms it’s not too unreasonable.

Some key points:

First task is to tackle the additional information in the most right columns. Since they are in pairs, I’m opting to put them in groups of two which makes it easier to keep the information together and split it again later. You approach this in multiple ways, collections being one of them. Afterwards, a final List is generated via the Column Aggregator node.

If you then ungroup it and split the list, all the info of the additional columns is now normalized and captured in Vehicle Attributes and Vehicle Data.

The data is indeed quite dirty, so I had to clean it pretty strictly via String Manipulation (Multiple Columns) with replace(regexReplace($$CURRENTCOLUMN$$,"[^A-Za-z0-9()/.]",""),"))",")")

To go from the column format to the table format, I opt for the Group Loop → Column Append route. To utilize this, a identifier has to be added to designate this which is initiated by looking for the VehicleType row.

Within the loop, some cleanup steps are included plus a RowID node. The ensures a clean output since the Loop End Append is done based on the RowID.

Inner Loop:
image

Loop output:

If you want the use the Vehicle Attributes again as column in an output, simple add another RowID node and use the Append RowID Column option.

image

Note: the columns are currently named after the iterations since they can not be of the same name, like you have in your expected output.

Final output:

See WF:
Organizing Strings based on Pattern.knwf (161.6 KB)

Hope this provides some inspiration!

1 Like

@ArjenEX Man, you’re awesome. Let me start going through this and trying a few things and I will reply soon. Thank you so much.

Hi @ArjenEX , I have added the table I am working with, but removed the sensitive data. I see that right on the first ‘Row Filter’, in yours, you get the Split Collection from the Node prior as a ‘String’, however, I get on mine a list and I could not see the reason behind that. Therefore, right after, on the ‘String Manipulation’ Node, it breaks.


SampleBook4KnimeTest1.xlsx (35.5 KB)

I’m afraid I need to see/test that part of your workflow to determine what could be going on there. If I copy and paste the original nodes, attach your network data, it’s still splitting it as string for me.

Hi @ArjenEX , I understand. This table is closer to the real data as I went around all over and removed sensitive data which I have not performed yesterday. I tried to create a dummy table as close as I could and that seemed to have not worked quite as I wish it did. The current table is the one that must work. Apologies. I was trying to protect data w the least resistance. Really hoping you make this one work w your current Workflow and few changes.