Clean mistyped text from OCR files and count occurrences of strings

Hi all,
I am still struggling with the output of some stupid OCR files. The output looks like this:

| Document ID | weight | material type | Number of drillings | Number of bendings |
|---|---|---|---|---|
| ABC_1 | 1,31 | steel | 22x | 90.00º down 90.00º down |
| ABC_2 | 2,35 kg | aluminium | (3x) | 85.00º down. 90.00º down 90.00º down |
| ABC_3 | 3,5 KG | alu typeA | (14x | Bend Angle 90.00 |
| ABC_4 | Gew.: 5,67 kg | steel Black Sea | 5 | unten 90º |
| ABC_5 | …4,3 | steeeel | Diameter 8, 12x | 3x 45.00 |

The target structure should look like this:

| Document ID | weight | material type | Number of drillings | Number of bendings | Number of bendings_COUNT |
|---|---|---|---|---|---|
| ABC_1 | 1,31 | steel | 22 | 90.00º down 90.00º down | 2 |
| ABC_2 | 2,35 | aluminium | 3 | 85.00º down. 90.00º down 90.00º down | 3 |
| ABC_3 | 3,5 | aluminium | 14 | Bend Angle 90.00 | 1 |
| ABC_4 | 5,67 | steel | 5 | unten 90º | 1 |
| ABC_5 | 4,3 | steel | 12 | 3x 45.00 | 3 |

So there are 4 transformations that I have to do:

  1. Extract the weight as a number from the “weight” column
  2. Harmonize “similar” text expressions, using steel or aluminium as “search” words
  3. Extract the integer from the column “number of drillings”, but ONLY if it is marked with an “x”, so make a 12 out of “Diameter 8, 12x”
  4. Create a new column that counts the number of “angles” in the column “number of bendings”. Unfortunately the angles are separated by blanks, so counting the words would give the wrong number. Instead I must count “90.00 down” and “85.0 up”, or just “90.00”, as one angle each.

I already tried the String Manipulation node to replace “kg” with “” and then the String to Number node, but I could not get several “replace” commands to work in one String Manipulation node, for example to also replace “Gew.:” with “”.
Hope you guys can help with that!

Regex will be your friend here. Use the String Manipulation node with various formulas.
For the weight column, something like
regexReplace($column2$, "[^\\d,]+", "")
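To illustrate what that pattern does, here is a minimal Python sketch of the same idea outside KNIME. The function name clean_weight is made up for this example, and converting the decimal comma to a float is an assumption about what should happen after the cleanup.

```python
import re


def clean_weight(raw: str) -> float:
    """Keep only digits and commas, then read the result as a decimal number."""
    # Same character class as in the regexReplace formula: drop everything
    # that is neither a digit nor a comma.
    digits = re.sub(r"[^\d,]+", "", raw)
    # Assumption: the comma is a decimal separator, as in "5,67".
    return float(digits.replace(",", "."))


# Sample values taken from the "weight" column of the question.
for value in ["1,31", "2,35 kg", "3,5 KG", "Gew.: 5,67 kg", "…4,3"]:
    print(value, "->", clean_weight(value))
```

A value with more than one comma, or with no digits at all, would raise a ValueError here, so real data may need an extra check.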
For column 5, something like
countChars($column5$, "º", "")
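Some of the sample rows have angles without a “º” sign (“Bend Angle 90.00”) and one uses a multiplier (“3x 45.00”), so counting “º” alone would undercount those. One possible counting rule, sketched in Python: every number counts as one angle, and a preceding “Nx” repeats the next angle N times. That rule is an assumption read off the target table, and count_bendings is a hypothetical name.

```python
import re

# Either a multiplier such as "3x" (captured in group 1) or a plain number
# such as "90.00" or "90".
TOKEN = re.compile(r"(\d+)\s*x\b|\d+(?:[.,]\d+)?", re.IGNORECASE)


def count_bendings(text: str) -> int:
    """Count angles in a free-text cell, honouring "Nx" multipliers."""
    total = 0
    multiplier = 1
    for match in TOKEN.finditer(text):
        if match.group(1) is not None:
            # "Nx": remember N and apply it to the next angle found.
            multiplier = int(match.group(1))
        else:
            # A plain number is treated as one angle (times the multiplier).
            total += multiplier
            multiplier = 1
    return total


# Sample values taken from the "Number of bendings" column of the question.
for cell in ["90.00º down 90.00º down", "Bend Angle 90.00", "3x 45.00"]:
    print(cell, "->", count_bendings(cell))
```

Any other number in the cell (a radius, a length) would also be counted as an angle, so this only works while the column contains nothing but angles and multipliers.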

This will not get you to the final solution directly, but it is maybe something to work with.
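The two remaining steps (material and drillings) could follow the same regex idea. A minimal Python sketch, assuming that “steel” may be mistyped with extra e's, that anything containing “alu” means aluminium, and that a drilling count is an integer directly followed by “x”. The function names are hypothetical.

```python
import re
from typing import Optional


def harmonize_material(raw: str) -> Optional[str]:
    """Map mistyped material names to "steel" or "aluminium"."""
    text = raw.lower()
    # "ste+l" matches "steel" as well as "steeeel".
    if re.search(r"ste+l", text):
        return "steel"
    # Assumption: any occurrence of "alu" means aluminium ("alu typeA").
    if "alu" in text:
        return "aluminium"
    return None  # unknown material, left for manual review


def extract_drillings(raw: str) -> Optional[int]:
    """Return the first integer that is marked with an "x", else None."""
    match = re.search(r"(\d+)\s*x", raw, re.IGNORECASE)
    return int(match.group(1)) if match else None


print(harmonize_material("steeeel"))         # steel
print(harmonize_material("alu typeA"))       # aluminium
print(extract_drillings("Diameter 8, 12x"))  # 12
print(extract_drillings("(14x"))             # 14
```

A cell with a number but no “x” returns None here. Whether such a value should still be accepted is a decision the sketch leaves open.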
Best regards

Thanks a lot again, Daniel! Your input is helping a lot.
Best regards, Christian
