I am using XPath to parse through some XML data and have some data as collection cells that I need to convert into string cells in order to export into Excel. I have tried Split Collection Column and then Column Combine. However, some values are missing and return as a “?”, so my combined column includes a bunch of “?,?,?,?,?,?,?”. I’ve tried to simply remove these after using String Manipulation. However, I feel that this is not ideal, since I now have hundreds of columns that my collection cell has split into, most just returning “?”. I have also tried parsing the XML data into Multiple Columns instead of a collection cell, but the output would be the same as aforementioned. I have also tried Ungroup, but that just creates multiple duplicate rows, which I’m not sure I necessarily want either. Open to any solutions, so any ideas at all would be great.
node to get rid of ?.
This might also help you eventually:
First of all, I think the reason for having cells with missing values after splitting collections is that your collections do not have the same length so when you split collections, the number of columns in your table will be the same as the number of instances in the longest collection so when the shorter collections are splitted, the remaining columns are filled with missing value.
Now how to solve your problem:
If you have deselected “Quote always” option in the configuration window of the “Column Combiner” node then your output would be like this:
And so on…
To remove the additional part with missing signs from the strings, you can use a “String Manipulation” node applying this expression:
regexReplace($combined string$, ",\\?.*", "")
This will modify the examples like this:
I’ve tried using this, but not sure any of the configurations are things I would want, as the data is missing values due to what Armin has explained.
Yes, this would be perfect! Eagerly looking forward to it, Steve!
That is exactly the case here and what I have done temporarily, though I hope for a more elegant solution in the near future. Thanks so much for your detailed input, Armin!
Good news - got permission today to release it to the community. Hopefully that will now happen within the next few days. I will comment again here when it does.
Hi, Armin. I have a follow-up KNIME/Regex question I was hoping you might be able to shed some light on. I now have my data looking like this for example:
BorisovaSvetlana A.; KimHak Joong; PuXiaotao; LiuHung-wen*
I am trying to separate the last name and first names with a comma and space to look like this:
Borisova, Svetlana A.; Kim, Hak Joong; Pu, Xiaotao; Liu, Hung-wen*
I have tried using
regexReplace($new column (#3)$,"([A-Z][a-z-]+)([A-Z][a-z-]+)" , "\1, \2"), which seems to work fine on online regex testers, but in KNIME it looks like nothing is being captured, as the captured groups simply return nothing:
, A.; , Joong; , ; , *
Any ideas? I can also ask this in a separate post, if that is more appropriate. Thanks!
This is what you should apply in the String Manipulation node:
regexReplace($new column (#3)$, "(?<![a-z] )(\\b[A-Z][a-z]+)", "$1, ")
(\\b[A-Z][a-z]+) matches strings starting with a word boundary and a capital letter followed by one or more lower case letters. But this will match the separated last names like “Joong” as well. So we use a look behind like
(?<![a-z] ) to dismatch those which are preceded with a lowercase letter and a space.
$1, " reuses the match and adds a comma and a space to it.
Is it what you were looking for?
Yes! This has mostly worked, except in the cases where special characters exist, like
SánchezCésar†; Zhu, Lili‡; Bra, ñaAlfredo F.†; Salas, Aaroa P.†; Rohr, Jürgen‡§; MéndezCarmen†; Salas, José A.†§
where SanchezCesar is not separated and Bra, ñaAlfredo is separated at the ñ when it should be separated like Braña, Alfredo or
Haug, -Schifferdecker, Elisa‡; Arican, Deniz§; Br, ücknerReinhard§; Heide, Lutz‡1
where Haug, -Schifferdecker, Elisa‡ should be Haug-Schifferdecker, Elisa‡ and Br, ücknerReinhard§ should be Brückner, Reinhard§.
I tried to replace [a-z] with \w, but this did not work. Thanks again for all of your help!
This should work in that case:
regexReplace($new column (#3)$, "(?<![^;] )(\\b[A-Z][^A-Z; ]+)", "$1, ")
Thank you! Seems to work in all cases except for
Haug-, Schifferdecker, Elisa‡; Arican, Deniz§; Brückner, Reinhard§; Heide, Lutz‡1
Yeah, I just noticed that.
Haug-Schifferdecker, Elisa‡ should be the correct form.
Give me some time and let’s see if I can fix that one too.
Here you are:
regexReplace($new column (#3)$, "((?<!([^;] )|-)(\\b[A-Z][^A-Z; ]+(-[A-Z][^A-Z; ]+))|(?<!([^;] )|-)(\\b[A-Z][^A-Z; ]+))", "$1, ")
EDITED: Added “|-” to look behind patterns so that if the second part of the last name begins with a capital letter, now it won’t be a match. E.g. LiuHung-Wen
You are awesome! Thank you so much, Armin! Unfortunately, what I failed to consider personally was that going about this via string manipulation would not work in cases where names are for example McLuskey, Karen or Amstalden van Hove, Erika R., but you have been beyond helpful and I really appreciate it.
Regex is really a powerful functionality booster and KNIME makes use of it very well.
Thanks so much, Steve! It works like a dream!
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.