String Manipulation within a big string

Paddymaster · August 29, 2019, 2:26pm

Hi there,

I have a big text, e.g. the following:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd <DATA>gubergren<DATA>, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, <DATA>sed<DATA> diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet.

The text contains many <DATA> markers. I want to uppercase the first letter after <DATA>, e.g. <DATA>gubergren<DATA> becomes <DATA>Gubergren<DATA>. I need to do this for every <DATA> word.

How can I do this? Sorry, I don't know how.

Fabien_Couprie · August 29, 2019, 3:16pm

There is probably a simpler way, but at least this works.

replace_between_data.knwf (59.5 KB)

quaeler · August 29, 2019, 3:53pm

paddy’s data string.knwf (6.3 KB)

Paddymaster · August 29, 2019, 4:16pm

Thanks, it's a little long, but a good workaround.

Paddymaster · August 29, 2019, 4:16pm

Thank you for your code, and thank you for taking the time for this.

It's really genius; top.

izaychik63 · August 29, 2019, 4:42pm

I have a feeling that regex could be used for this task. Armin is a master of regex and may be able to provide a more compact solution.

armingrudd · August 30, 2019, 9:29am

Hi,

There is a feature in regex for replacement text case conversion but it seems it is not supported in KNIME.

In regex, for replacement text case conversion one can use \U (all to uppercase), \L (all to lowercase), \I or \u (capitalize - first character to uppercase) and … but they do not work in String Manipulation node or String Replacer node.

Actually, a solution to this topic could be something like these expressions in the String Manipulation node:

regexReplace($column1$, "<DATA>(.*?)<DATA>", "<DATA>\\I1<DATA>")

or

regexReplace($column1$, "<DATA>(.*?)<DATA>", "<DATA>\\u$1<DATA>")

Unfortunately, neither of these expressions works as expected. I also tried many other variations, but none of them worked either.

Maybe I am missing something here, so I am going to ask @ScottF if there is a way to use this regex feature in KNIME and how. If not, maybe it is a good idea to add the feature in the future.

References:
https://www.regular-expressions.info/refreplacecase.html
https://www.regular-expressions.info/replacecase.html

ScottF · August 30, 2019, 2:34pm

It’s a good question, but I am notoriously terrible with RegEx. Perhaps @quaeler or someone else has an idea? I can also ask internally.

dnaki · August 30, 2019, 4:34pm

Hi.
If the String Manipulation node uses the underlying Java regex class (e.g. Pattern), it should be noted that the preprocessing operations \l, \u, \L, and \U are not supported. See “Comparison To Perl 5” in https://docs.oracle.com/javase/10/docs/api/java/util/regex/Pattern.html
-Don
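
Since java.util.regex has no case-conversion escapes for replacement text, the usual plain-Java workaround is to build each replacement in code with Matcher.appendReplacement. The following is only a minimal sketch of that idea, not the contents of any workflow attached to this thread; the class and method names are made up for illustration, and it assumes the text between two <DATA> markers should have just its first character uppercased.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not taken from the thread.
class DataCapitalizer {
    // Lazily matches whatever sits between two identical <DATA> markers.
    private static final Pattern BETWEEN_DATA =
            Pattern.compile("<DATA>(.*?)<DATA>", Pattern.DOTALL);

    static String capitalizeAfterData(String text) {
        Matcher m = BETWEEN_DATA.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group(1);
            // Uppercase only the first character; leave the rest untouched.
            String capitalized = word.isEmpty()
                    ? word
                    : Character.toUpperCase(word.charAt(0)) + word.substring(1);
            // quoteReplacement keeps '$' and '\' in the data from being read as group references.
            m.appendReplacement(out,
                    Matcher.quoteReplacement("<DATA>" + capitalized + "<DATA>"));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling DataCapitalizer.capitalizeAfterData on "kasd <DATA>gubergren<DATA>, no sea" returns "kasd <DATA>Gubergren<DATA>, no sea". In KNIME, logic like this would have to live in a Java Snippet node rather than in a regexReplace expression.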

quaeler · August 30, 2019, 5:40pm

Given the Pattern dependency Don cites, one could split the string into a List column using the Column Expressions node, then Ungroup, then operate on every other row capitalizing the split column’s value with String Manipulation, then rejoin it all… or, one could use a single Java Snippet node.
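
The split, capitalize every other piece, and rejoin idea can also be sketched in plain Java, roughly as it might look inside a single Java Snippet node. This is an illustrative sketch under the assumption that the same delimiter marks both the start and the end of each word; the class and method names are hypothetical and it is not the code from the attached workflow.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not taken from the thread.
class SplitCapitalizer {
    static String capitalizeEveryOtherSegment(String text, String delimiter) {
        // Limit -1 keeps trailing empty segments, so the join below restores every delimiter.
        String[] parts = text.split(Pattern.quote(delimiter), -1);
        // Odd-indexed segments are the ones that follow an opening delimiter.
        for (int i = 1; i < parts.length; i += 2) {
            if (!parts[i].isEmpty()) {
                parts[i] = Character.toUpperCase(parts[i].charAt(0)) + parts[i].substring(1);
            }
        }
        return String.join(delimiter, parts);
    }
}
```

For example, capitalizeEveryOtherSegment("erat, <DATA>sed<DATA> diam", "<DATA>") returns "erat, <DATA>Sed<DATA> diam". The approach relies on the delimiters being properly paired; with different opening and closing tags the odd/even bookkeeping no longer holds.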

Paddymaster · September 3, 2019, 7:46am

Thank you @quaeler, @dnaki, @ScottF, @armingrudd, @izaychik63, @Fabien_Couprie.
I think the best options at the moment are the code solution from @quaeler and the non-code solution from @Fabien_Couprie. I still think it would be better to use regex, but it's not supported.

As a beginner, it's really hard for me to figure out the coding rules for some nodes. Does anybody have reference material on the syntax, etc.? I didn't find that in the KNIME tutorials. Or should I just learn JavaScript and Java, and that would be enough?

I really appreciate getting such good help from you here. I know that I come with unusual, difficult topics. Of course, I would like to take some of the load off you with more self-help ^^
Sorry.

Paddymaster · September 3, 2019, 3:13pm

Well, I don't know why, but I've run into one problem.

If I enter this string:
<DataSet> <Version>1</Version> <Media> <Table> <URL>test.txt</URL> </Table> </Media> </DataSet>

I get this string back:
<DataSet> <Version>1</Version> <Media> <Table> <URL>Test.txt</URL> </Table> </Media> </DataSet><URL>

The uppercase between <URL> and </URL> is fine, but I get an extra <URL> at the end of the string. I don't know why. I adapted your code to my real data, which is why I changed the line
final String delimiterText = "<URL>"; (in the original it was <DATA>)

BTW, the string is always XML.

quaeler · September 3, 2019, 4:44pm

The problem is that your original example used the same delimiter for both the start and the end (<DATA>, as opposed to a more standard opening and closing tag pair like <DATA> and </DATA>), and the algorithm is not suited to opening and closing tags that differ.
I'll modify the existing one to handle opening and closing tags and post it today… but to be honest, if you're dealing with correct XML, it would be easier to use an XML parser and just ask for the URL element. Perhaps there is a person here who knows of an XML node well suited to this…? Regardless, I'll include a second Java Snippet node which parses the DOM to get the URL element if I have time.
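
For well-formed XML, the parser route mentioned above could look roughly like the following sketch, which uses only the JDK's built-in DOM and Transformer classes. It is not the code from the workflow posted later in this thread; the class and method names are hypothetical, and it assumes the target element (URL here) contains only text, because setTextContent replaces any child elements.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; the names are illustrative and not taken from the thread.
class XmlElementCapitalizer {
    static String capitalizeElementText(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Visit every element with the given tag name, e.g. "URL".
            NodeList nodes = doc.getElementsByTagName(tagName);
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                String value = node.getTextContent();
                if (value != null && !value.isEmpty()) {
                    // Assumes text-only elements: setTextContent drops child elements.
                    node.setTextContent(
                            Character.toUpperCase(value.charAt(0)) + value.substring(1));
                }
            }
            // Serialize the modified DOM back to a string without the XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process XML input", e);
        }
    }
}
```

Because the parser pairs <URL> with </URL> itself, there is no delimiter bookkeeping to get wrong, which avoids the kind of stray trailing tag described above. The trade-off is that the input must be well-formed XML, otherwise parsing fails.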

quaeler · September 3, 2019, 6:15pm

paddy’s xml content.knwf (8.7 KB)

Paddymaster · September 4, 2019, 9:07am

F**K, this is really awesome. Genius code. I like you. Code is like a G6. I must learn to code like you. Any suggestions for a good book on this? I mean one suitable for KNIME. Normally Python is my #1.

quaeler · September 4, 2019, 5:13pm

Thanks - I don't know of a good book for this sort of thing. Usually I'm just tenacious and use Google to fill in the blanks; I've also been around the block like a hundred times now, which I think is what helps the most.
