Manipulate Strings individually

Hi guys,

i’m currently trying to process textual data from a PDF file into a KNIME Table.
Here the initial data from the file and how its supposed to look.

I was able to read in the data and split strings into individual rows.

My current problem is that each row is of different length, meaning I can’t just use a cell splitter using space as a delimiter or else it would turn out like this.

Like “Home” and “Office” shouldn’t be split or like especially with city names where each of them consists of multiple words like “New York City” consisting of 3 substrings and on the other side we have cities like “Seattle” only consisting of 1 substring.
Is it possible to specifically target each substring from each row individually like creating a substring for example with cities having both “New York City” and “Seattle” in the same column?

Hi @StefanJoinPlus ,

have you found a solution in the meantime?
You are unfortunately facing a common but non-trivial problem in my opinion.
Two possible solutions that you can explore:

  1. Dive into the word of regular expressions and use a Regex Split node or
  2. Since you can discern the domain of the data you are dealing with, a more basic solution might be creating a dictionary of the problematic words (e.g. New York City, United States etc…) and replacing them with their “hyphen” version (New-York-City, United-States…) using, for example, a String Replacer (Dictionary) node. After that, the Cell Splitter node will do the job correctly.

Let me know if this helps or if you found a more creative solution!

Cheers,
Emilio

6 Likes

Hi @emilio_s,

I unfortunately haven’t found a solution yet but your
suggestion with replacing problematic words with a hyphen version was exactly what I was looking for.
Thanks a lot you are a lifesaver.

Cheers,

Stefan

1 Like

which pdf parser did you choose? just curious
br

1 Like

Hi @Daniel_Weikert,

I used the Tika Parser for this workflow.
My workflow was more or less based on this one Challenge 15 - Extracting a Table from a PDF – KNIME Community Hub

Cheers,

Stefan

I recommend approaching this in a multi-step approach. Looking at your specific table structure, I would focus on the fields that appear to be uniform and based on a restricted list. Identifying all of potential list of values for the following fields (Segment, Country, State, Region) and pinpointing them via Regex should allow you to break this up into columns cleanly without a complex paragraph long single Regex expression. Splitting before and after those possible values should do the bulk of the cleanup. After that, you should be able to split between Sales and Quantity columns using the space character.

Unless you get really unlucky and one of the customers happens to live in a town named “Home Office, Texas”. :wink:

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.