Manipulate Strings individually

StefanJoinPlus · October 11, 2023, 1:58pm

Hi guys,

i’m currently trying to process textual data from a PDF file into a KNIME Table.
Here the initial data from the file and how its supposed to look.

I was able to read in the data and split strings into individual rows.

My current problem is that each row is of different length, meaning I can’t just use a cell splitter using space as a delimiter or else it would turn out like this.

Like “Home” and “Office” shouldn’t be split or like especially with city names where each of them consists of multiple words like “New York City” consisting of 3 substrings and on the other side we have cities like “Seattle” only consisting of 1 substring.
Is it possible to specifically target each substring from each row individually like creating a substring for example with cities having both “New York City” and “Seattle” in the same column?

emilio_s · October 17, 2023, 10:59am

Hi @StefanJoinPlus ,

have you found a solution in the meantime?
You are unfortunately facing a common but non-trivial problem in my opinion.
Two possible solutions that you can explore:

Dive into the word of regular expressions and use a Regex Split node or
Since you can discern the domain of the data you are dealing with, a more basic solution might be creating a dictionary of the problematic words (e.g. New York City, United States etc…) and replacing them with their “hyphen” version (New-York-City, United-States…) using, for example, a String Replacer (Dictionary) node. After that, the Cell Splitter node will do the job correctly.

Let me know if this helps or if you found a more creative solution!

Cheers,
Emilio

StefanJoinPlus · October 17, 2023, 11:49am

Hi @emilio_s,

I unfortunately haven’t found a solution yet but your
suggestion with replacing problematic words with a hyphen version was exactly what I was looking for.
Thanks a lot you are a lifesaver.

Cheers,

Stefan

Daniel_Weikert · October 17, 2023, 4:28pm

which pdf parser did you choose? just curious
br

StefanJoinPlus · October 19, 2023, 8:15am

Hi @Daniel_Weikert,

I used the Tika Parser for this workflow.
My workflow was more or less based on this one Challenge 15 - Extracting a Table from a PDF – KNIME Community Hub

Cheers,

Stefan

iCFO · October 19, 2023, 1:35pm

I recommend approaching this in a multi-step approach. Looking at your specific table structure, I would focus on the fields that appear to be uniform and based on a restricted list. Identifying all of potential list of values for the following fields (Segment, Country, State, Region) and pinpointing them via Regex should allow you to break this up into columns cleanly without a complex paragraph long single Regex expression. Splitting before and after those possible values should do the bulk of the cleanup. After that, you should be able to split between Sales and Quantity columns using the space character.

Unless you get really unlucky and one of the customers happens to live in a town named “Home Office, Texas”.

system · October 26, 2023, 1:35pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.