Remove XML and HTML tags

And_Z · August 31, 2018, 10:45am

Hi all,

I’m looking for a way to remove XML tags and also HTML tags from text strings. The tags are not always separated from the text, so a normal String Replacer doesn’t really do the job. The String Manipulation node would work, but as far as I understand it I would need one manipulation for each tag, e.g. one for and the next for
…

I also tried the String Replacer (Dictionary) node to replace some Unicode characters and HTML entities, but didn’t really get it to work. The dictionary file looks like this:

beta “\” β
& “\” &
“>” “\” >
“<” “\” <

In this case I used \ as the delimiter.
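
For the entity part of the problem, a hand-written dictionary may not be needed. As a minimal sketch outside KNIME (assuming the strings can be passed through a Python step, which this thread does not confirm), the standard library's `html.unescape` decodes named and numeric HTML entities in one call. The function name `decode_entities` and the sample string are made up for illustration.

```python
import html


def decode_entities(text: str) -> str:
    """Decode HTML entities such as &beta; or &lt; into the characters they name."""
    return html.unescape(text)


# Hypothetical input, not taken from the original data.
sample = "&beta;-catenin &amp; p53 levels &gt; 5 &lt; 10"
print(decode_entities(sample))
```

If tags also have to be removed, it is probably safer to strip them first and decode entities afterwards; otherwise a decoded `<` or `>` in the text can be mistaken for part of a tag.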

Does somebody have a better/easier way to do this?

cheers

Andreas

qqilihq · August 31, 2018, 1:08pm

In case it’s mostly about HTML, and you have the input cell parsed as an XML column, you can use this node from the Palladian plugin:

It’ll try to consider HTML tag semantics when creating the text string (e.g. text from HTML block-level elements will be output to a new paragraph; comments, stylesheets, and unnecessary whitespace will be removed). Not sure whether it’ll work equally well for generic XML (which was not our focus), but it’s at least worth giving it a try.
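
To illustrate what "considering HTML tag semantics" can mean, here is a rough sketch using Python's standard `html.parser`. It is not the Palladian implementation; the class name, the function name, and the short list of block-level tags are assumptions chosen for the example.

```python
from html.parser import HTMLParser

# Assumed, incomplete set of block-level elements; extend as needed.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "br", "tr"}
# Elements whose content should not appear in the text output.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text and puts block-level content on its own line."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Comments never reach this method; style/script text is skipped here.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


print(html_to_text("<div><p>Hello   <b>world</b></p><!-- note --><p>Second</p></div>"))
```

Inline tags such as `<b>` are dropped without adding a line break, while each `<p>` ends up on its own line.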

– Philipp

And_Z · September 5, 2018, 10:32am

Hi,

So after checking your proposal I realized that it won’t really work for my specific case, since I still wanted to leave some of the tags in. So I did a string manipulation in the end, using a regex.

Just in case somebody else encounters this, I used:
(<.[^(><.)]+>)

to replace all tags and

(<(?!para|ulink|/ulink).[^(><.)]+>) for some columns where I needed to retain para and ulink
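
For anyone who wants to try these two patterns outside KNIME, here is a small sketch in Python. KNIME evaluates regexes with Java, but the constructs used here (character classes and a negative lookahead) should behave the same way in Python's `re` module. The sample strings, the helper `strip_tags`, and the pattern `STRIP_EXCEPT_ALT` are illustrative additions, not part of the original post.

```python
import re

# The two patterns exactly as posted above.
STRIP_ALL = re.compile(r"(<.[^(><.)]+>)")
STRIP_EXCEPT = re.compile(r"(<(?!para|ulink|/ulink).[^(><.)]+>)")

# Hypothetical alternative (not from the post): keeps opening AND closing
# para/ulink tags, and also matches one-letter tags such as <b>.
STRIP_EXCEPT_ALT = re.compile(r"<(?!/?(?:para|ulink)\b)[^>]+>")


def strip_tags(text: str, pattern: re.Pattern = STRIP_ALL) -> str:
    """Replace every match of the given tag pattern with an empty string."""
    return pattern.sub("", text)


# Made-up sample in a DocBook-like style.
sample = '<para>See <ulink url="x">this</ulink> and <emphasis>that</emphasis></para>'

print(strip_tags(sample))
print(strip_tags(sample, STRIP_EXCEPT))
print(strip_tags(sample, STRIP_EXCEPT_ALT))
```

A few things to be aware of with the posted patterns: they need at least two characters between `<` and `>`, so a tag like `<b>` is left in place; tags containing `.`, `(` or `)` (for example a URL in an attribute) are not matched; and the lookahead in the second pattern excludes `/ulink` but not `/para`, so the closing `</para>` is removed while the opening `<para>` is kept. Whether that is intended depends on the data.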

best

Andreas

denisfi · July 14, 2022, 11:23am

You can use “String Replacer” to change some information before using the “String to XML” component. You can use a regex in String Replacer to clean up any trash in your code. Then String to XML will format it as pretty XML code… If necessary, you can convert the XML to a table at the end of the flow.
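
The same clean, parse, then tabulate idea can be sketched in Python with the standard library. This is only an analogy to the KNIME nodes mentioned above; the function names are made up, and the assumption that the "trash" consists of control characters is mine.

```python
import re
import xml.etree.ElementTree as ET

# Assumed kind of "trash": control characters that XML 1.0 does not allow.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_and_parse(raw: str) -> ET.Element:
    """Step 1: regex cleanup. Step 2: parse the cleaned string as XML."""
    cleaned = CONTROL_CHARS.sub("", raw)
    return ET.fromstring(cleaned)


def to_rows(root: ET.Element) -> list:
    """Step 3: flatten the tree into (tag, text) rows in document order."""
    return [(el.tag, (el.text or "").strip()) for el in root.iter()]


# Made-up input containing a stray bell character.
raw = "<doc>\x07<title>Intro</title><para>Text</para></doc>"
for row in to_rows(clean_and_parse(raw)):
    print(row)
```

Note that this route only works if the cleaned string is well-formed XML; text with unbalanced or HTML-style tags will still fail to parse.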