Remove HTML tags from strings

Hi,

I have a project to do some analytics on customer surveys but in a number of the free-text columns, I am finding HTML tags like <br/>, <b> and <strong> that are causing all kinds of havoc. I have tried to remove the tags with the “String Manipulation” node but it is not working as I hoped.

I am sure that I am not the first person to encounter this problem. Does anyone have a solution?

tC/.

Hi @TigerCole, have you tried this “quick and dirty” regex in the String Manipulation node? It works in the vast majority of cases:

strip(removeDuplicates(regexReplace($column1$, "<[^>]+>","")))

Example input

          <p>Hi,</p>
<p>I have a project to do some analytics on customer surveys but in a number of the free-text columns, I am finding HTML tags like <code>&lt;br/&gt;</code>,  <code>&lt;b&gt;</code> and  <code>&lt;strong&gt;</code>  that are causing all kinds of havoc. I have tried to remove the tags with the “String Manipulation” node but it is not working as I hoped.</p>
<p>I am sure that I am not the first person to encounter this problem. Does anyone have a solution?</p>
<p>tC/.</p>
        </div>

Output

Hi, I have a project to do some analytics on customer surveys but in a number of the free-text columns, I am finding HTML tags like &lt;br/&gt;, &lt;b&gt; and &lt;strong&gt; that are causing all kinds of havoc. I have tried to remove the tags with the “String Manipulation” node but it is not working as I hoped. I am sure that I am not the first person to encounter this problem. Does anyone have a solution? tC/.
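
For anyone who wants to try the same idea outside KNIME, here is a minimal Python sketch of the “quick and dirty” approach. The function name `strip_tags` and the sample string are made up for illustration, and the whitespace handling only approximates what `removeDuplicates` and `strip` appear to do in the expression above (collapse repeated whitespace, then trim the ends). As in the output above, HTML entities such as `&lt;` are left untouched.

```python
import re

# Same pattern as in the String Manipulation expression:
# a "<", then one or more characters that are not ">", then ">".
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag, then tidy whitespace.

    Hypothetical helper, not part of KNIME. Entities are not decoded.
    """
    no_tags = TAG_RE.sub("", text)
    # Assumed equivalent of removeDuplicates(...) followed by strip(...).
    return re.sub(r"\s+", " ", no_tags).strip()


sample = (
    "  <p>Hi,</p>\n"
    "<p>Tags like <code>&lt;br/&gt;</code> cause <b>havoc</b>.</p>\n"
    "</div>"
)
print(strip_tags(sample))
```

Because the pattern only looks for `<...>`, it will also delete non-HTML text that happens to sit between angle brackets, and it can trip over a `>` inside an attribute value, which is why it is “quick and dirty”.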

It occurs to me we might need some special forum titles perhaps. @duristef, “RegEx Wizard” 🙂

Hmmm 🤔

It would be too much honor. I think of myself as a mere substitute, a humble replacement: regexReplace("Regex","^(.)..(..)$","$1$2")


If you want the result to be more “readable”, I recommend the HTML Node to Text node from Palladian, which tries to preserve the HTML semantics (e.g. adding new lines after block elements and filtering out comments, script tags, and style tags).

You’ll need to parse the HTML string using the HTML Parser node first and feed the result to the HTML Node to Text node.

For the snippet from above, this would produce:

Hi,

I have a project to do some analytics on customer surveys but in a number of the free-text columns, I am finding HTML tags like <br/>, <b> and <strong> that are causing all kinds of havoc. I have tried to remove the tags with the "String Manipulation" node but it is not working as I hoped.

I am sure that I am not the first person to encounter this problem. Does anyone have a solution?

tC/.

Example workflow here.
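
The parser-based idea can also be sketched in plain Python with the standard library's `html.parser`. This is not how the Palladian node is implemented; the class name `TextExtractor`, the function `html_to_text`, and the choice of block tags are assumptions made for illustration. The sketch adds a blank line after block elements, skips `script` and `style` content, ignores comments, and decodes entities.

```python
import re
from html.parser import HTMLParser

# Assumed set of block-level tags that should end a paragraph.
BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Tags whose content should be dropped entirely.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text; comments are dropped by the default handler."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes &lt; etc.
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    text = "".join(parser.parts)
    text = re.sub(r"[ \t\r\f\v]+", " ", text)  # collapse spaces, keep newlines
    text = re.sub(r" ?\n ?", "\n", text)       # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)     # at most one blank line
    return text.strip()


sample = (
    "<p>Hi,</p>\n"
    "<p>Tags like <code>&lt;br/&gt;</code> and <b>bold</b>.</p>"
    "<script>alert(1)</script><!-- note --><p>tC/.</p>"
)
print(html_to_text(sample))
```

Unlike the regex, this keeps paragraph breaks and turns `&lt;br/&gt;` back into the literal text `<br/>`, which matches the more readable output shown above.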


Hi @duristef

My apologies for taking so long to respond, but your “quick & dirty” did the trick. I really need to sort out my regex skills.

Thanks …

tC/.

