Formatting cell text (Upper case after period) - STRING MANIPULATION - Capitalize (str, chars)

Hi Knime folks!

I’m currently working on fixing text format within an EXCEL file using Knime.

One of my tasks is to ensure that every time a sentence comes up after a period it starts with an Upper case.

In order to achieve this I have first loaded the excel file into an “Excel Reader” node, then connected it to a String Manipulation (Multi Column) and asked Knime to express the output with an Excel Writer Node:

My String Manipulation Node was set-up to:

  1. Ensuring that all sentences of a specific column had a full stop at the end.
  2. Ensuring that after said full stop, the next sentence begins with an Upper Case.

The first goal was achieved with the extra kind help of @takbb !

For the second goal I used the CAPITALIZE expression:

From what I understood, writing the following should ensure that every character after a period and a space would be capitalized :

capitalize($$CURRENTCOLUMN$$,". ")

However I get something like this instead:

Should Be Continuous Transmission Of Data And Open To […]

I figured perhaps there was something wrong with using a dot as a delimiter, so I used the find and replace function to replace every dot and space combo ( ". “) with an asterisk and a space (”* "). I then rewrote my code using an asteriska s a delimiter… and got this instead:

Supervision Should Be Risk -based* Specifications Should Ideally […]

So yeah… I’m quite lost here, no clue about:

  • Why this is not working
  • What to do to achieve my goal

Thanks everyone in advance!

Hi @KaboboJakobo,

in order to transform a particular character to uppercase, you first need to separate it. Here is a very simplistic approach accomplishing what I believe you asked for:

Best
Mike

4 Likes

Hi @mwiegand , now that you’ve done all the hard work… I wondered if the new String Cleaner node (KNIME 5.2 onwards) had anything that might assist.

It seems it can do the job of that “magical” :wink: String Manipulation node…


… scroll down to the “String Manipulation” section…
image

image

The rest is the same, other than the GroupBy node config changing to work on the column replaced (or appended) by String Cleaner.

@KaboboJakobo , assuming this all resolves your problem, kudos (solution) should go to @mwiegand’s post I feel.

2 Likes

More an FYI - In curiosity about real spell checking I thought of using AI nodes like ChatGPT but also have found these:

Hi!
Thank you very much for your kind answer,

This said, I’m not sure I know how to use this workflow (?):

  • I’m starting from a large pre-existing excel table, I have replaced the first node with an Excel Reader but I’m not sure whether that will create any issues?

  • I guess the column I have chosen in the case converter is the one to be split by full stop / ungrouped and loaded into the string manipulation node (?)

  • In the last node, I guess I still only select the same column, I do not need to remerge it with its former self?

  • Nothing happens when I run it, I guess I still need an Excel Writer (?)

Lastly, and more importantly, why is it that the simple capitalize expression with delimiter didnt work? Why did we need to split in between?

Thanks a million in advance :slight_smile:

Hi @KaboboJakobo , for this specific question, … ok you are on to something, lol!

Your String Manipulation’s capitalize function didn’t work because with a delimiter specified it actually capitalises the very first character of the string and then capitalises each character immediately after the delimiter. So in principle it IS what you want. except…
… when you supply a delimiter as two characters, it treats each independently, which is not what you want (yeah… it sucks!).

Let’s look at the effect of these:
i.e. in String Manipulation (Multi-Column)

capitalize($$CURRENTCOLUMN$$,".") – capswithFullStop
capitalize($$CURRENTCOLUMN$$,". ") – capswithFullStopSpace

You can see that in “Row0” the space, the space in Sample Text which appears after “dog.” gets in the way of the capitalisation if it’s just delimited on “.”, because it tries to capitalise the space! Meanwhile, if using two delimiters, every letter after each space is capitalised.

Meanwhile in “Row1”, where there wasn’t a space after “dog.”, it does pretty much what you want,… but of course there is no space between sentences.

So that got me thinking… what if we were to remove all the whitespace after every full stop, do the capitalisation and then add a space back after each full stop.

Like this, in String Manipulation (Multi-Column), :slight_smile:

regexReplace(
	capitalize(
	   regexReplace($$CURRENTCOLUMN$$, "\\.([\\s]+)","\\." )
	 ,"."),
"\\.([^\\s])","\\. $1")

and yeah, it works… as long as you don’t mind using regex and there being potential changes to whitespace between sentences (and this could present a big problem if you have line breaks for example):



:wink:

Alternatively, without regex (but similar limitations)

replace(
   capitalize(
	   replace(
	   		removeDuplicates($$CURRENTCOLUMN$$)  /* remove any duplicate spaces */
	   		,". ",".")                           /* replace period-space with period */
	   ,".")                                       /* capitalise using "."*/
	 ,".",". ")                                    /* replace period with period-space */
2 Likes

You re most welcome. Though, maybe you share the workflow in question. In case it contains sensitive data, you can anonymize via the String replace like so:

1 Like

@KaboboJakobo,

I have a further variant that I think will be “non-destructive” to any whitespace, and will also work if you have carriage returns between sentences.

replace(
	  capitalize(
		regexReplace($$CURRENTCOLUMN$$, "(\\.\\s*)([A-Za-z])", "$1¬$2")
		, "¬")
       , "¬", "")

This places a “¬” character immediately before the first letter that follows any white space after a full stop. It then capitalizes using “¬” as the delimiter, and finally removes the “¬”.

If by chance your text actually contains the ¬ character, then you’d need to choose a different character that won’t appear in the text.

Thank you! This would have been exactly what I had in mind, but for some reason it will not execute:

image

I’m a complete noob, so I’m mostly left scratching my head whenever I get an error code, dont even know where to start.

Thanks a million! But I think I would get in trouble even if sharing the data which is anonymized.

In short its just an excel with a few snippets of text, the tricky part of the text is actually public and composed of comments such as this one:

“Supervision should be risk -based. specifications should ideally include a list of species + transport -types to supervise and a list of derogation-options. should also include a timeframe in which they will occur. capacity building for veterinarians should be arranged. here the supervision at unloading by a veterinarian cannot be ensured, lawful supervision alternatives should be defined (e.g.videosurveillance).”

I did some previous work to ensure all of these comments would end-up with a full stop, while also ensuring that there were no random upper case letters. This in turn made it so that upper cases after full stops were turned into lower cases.

I was now trying to get these back to upper cases.

No probs, I can relate to the struggle. Though, did I got you right that you made some string modifications before actually causing the issue you try to fix? Maybe we can find a better way to prevent the issue from appearing in the first place?

1 Like

Hi @KaboboJakobo , this is almost certainly because for at least one value that it is trying to manipulate, you have missing data.

Try this:

$$CURRENTCOLUMN$$==null?toNull("")
:replace(
	  capitalize(
		regexReplace($$CURRENTCOLUMN$$, "(\\.\\s*)(\\S)", "$1¬$2")
		, "¬")
       , "¬", "")

This does a test, and if the particular value is null (missing) it returns toNull("") which is how String Manipulation can return a missing value, otherwise it returns the calculated value as before.

You probably won’t find that syntax in the node documentation. For further info see here:

Specifically, the syntax of the conditional statement is described here:

1 Like

Yes indeed :slight_smile: , I had a number of tasks to perform in this excel originally:

  • Correcting Typos
  • Capitalizing the first sentence in all cells
    -Getting a variable length snippet of the text inside the first column to be in italics (1- Article XYZ: Provisions for getting Pablo to be a little more efficient
    2- Article XYZw1234: Why durums should alway come with sauce andalouse )
    -Capitalizing after periods
  • Text Wrapping

I was taking each task separately and so far it seems like these two posts have helped me get through with the capitalization :hugs: !

I have used in built excel features for the typos and text wrapping, I gave up on the italics and decided to have the whole column in Italics (I sure hope that is good enough).

1 Like

Wish I could like twice, that was likely the issue, perhaps some of my periods were missing the space behind them already.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.