Remove HTML tags from strings

TigerCole · May 13, 2022, 1:32pm

Hi,

I have a project to do some analytics on customer surveys but in a number of the free-text columns, I am finding HTML tags like <br/>, <b> and <strong> that are causing all kinds of havoc. I have tried to remove the tags with the “String Manipulation” node but it is not working as I hoped.

I am sure that I am not the first person to encounter this problem. Does anyone have a solution?

tC/.

duristef · May 13, 2022, 1:49pm

Hi @TigerCole, have you tried this “quick and dirty” regex in the String Manipulation node? It works in the vast majority of cases.

strip(removeDuplicates(regexReplace($column1$, "<[^>]+>","")))

Example input

          <p>Hi,</p>
<p>I have a project to do some analytics on customer surveys but in a number of the free-text columns, I am finding HTML tags like <code>&lt;br/&gt;</code>,  <code>&lt;b&gt;</code> and  <code>&lt;strong&gt;</code>  that are causing all kinds of havoc. I have tried to remove the tags with the “String Manipulation” node but it is not working as I hoped.</p>
<p>I am sure that I am not the first person to encounter this problem. Does anyone have a solution?</p>
<p>tC/.</p>
        </div>

Output

Hi, I have a project to do some analytics on customer surveys but in a number of the free-text columns, I am finding HTML tags like &lt;br/&gt;, &lt;b&gt; and &lt;strong&gt; that are causing all kinds of havoc. I have tried to remove the tags with the “String Manipulation” node but it is not working as I hoped. I am sure that I am not the first person to encounter this problem. Does anyone have a solution? tC/.
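
For anyone who wants to try the same idea outside KNIME, here is a rough Python sketch of the expression above. It is only an approximation: it assumes that removeDuplicates collapses runs of whitespace into a single space and that strip trims both ends. The function name strip_tags and the sample string are made up for illustration.

```python
import re

# Same pattern as in the KNIME expression: "<", one or more non-">" characters, ">".
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: removeDuplicates behaves like collapsing whitespace runs to one space.
WS_RE = re.compile(r"\s+")


def strip_tags(text: str) -> str:
    """Remove anything that looks like a tag, then tidy the whitespace."""
    without_tags = TAG_RE.sub("", text)
    collapsed = WS_RE.sub(" ", without_tags)
    return collapsed.strip()


# Hypothetical sample, shortened from the example input above.
sample = "  <p>Hi,</p>\n<p>Tags like <code>&lt;br/&gt;</code> cause <b>havoc</b>.</p>  "
print(strip_tags(sample))
```

As in the output above, entities such as &lt; are left untouched; html.unescape from the Python standard library could decode them in a second step if needed. The "dirty" part is that the pattern cannot tell markup from plain text: a literal < that is followed later by a > (for example "a < b and c > d") is removed together with everything in between.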

ScottF · May 16, 2022, 3:21pm

It occurs to me we might need some special forum titles perhaps. @duristef, “RegEx Wizard”

Hmmm

duristef · May 16, 2022, 3:40pm

It would be too much honor. I think of myself as a mere substitute, a humble replacement: regexReplace("Regex","^(.)..(..)$","$1$2")

qqilihq · May 16, 2022, 4:11pm

If you want to make this more “readable”, I recommend the HTML Node to Text node from Palladian, which tries to keep the HTML semantics (i.e. new lines after block elements, filtering of comments and of script and style tags, etc.).

You’ll need to parse the HTML string using the HTML Parser node first and feed this to the HTML Node to Text.

For the snippet from above, this would produce:

Hi,

I have a project to do some analytics on customer surveys but in a number of the free-text columns, I am finding HTML tags like <br/>, <b> and <strong> that are causing all kinds of havoc. I have tried to remove the tags with the "String Manipulation" node but it is not working as I hoped.

I am sure that I am not the first person to encounter this problem. Does anyone have a solution?

tC/.

Example workflow here.
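
To illustrate the general idea of parsing first and then converting to text (this is not how the Palladian node is implemented, just a minimal sketch using Python's standard html.parser), something like the following keeps paragraph breaks and drops script, style, and comments. The class name TextExtractor, the function html_to_text, and the list of block tags are assumptions chosen for the example.

```python
import re
from html.parser import HTMLParser

# Assumption: a small, incomplete set of block-level tags is enough for the demo.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect text, adding a blank line after block elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &lt; and friends
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        # Text inside <script>/<style> is ignored; comments never arrive here.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse spaces within each line, then limit runs of blank lines to one.
    text = "\n".join(" ".join(line.split()) for line in text.split("\n"))
    return re.sub(r"\n{3,}", "\n\n", text).strip()


# Hypothetical sample in the spirit of the snippet above.
sample = (
    "<p>Hi,</p>\n<p>Tags like <code>&lt;br/&gt;</code> and <b>bold</b>.</p>"
    "<script>alert(1)</script><!-- note --><p>tC/.</p>"
)
print(html_to_text(sample))
```

Unlike the regex approach, the parser decodes entities, so &lt;br/&gt; in the input comes out as a literal <br/> in the text, and the paragraph structure survives as blank lines.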

TigerCole · May 18, 2022, 11:57am

Hi @duristef …

My apologies for taking so long to respond, but your “quick & dirty” did the trick. I really need to sort out my regex skills.

Thanks …

tC/.
