Help on XPath to relate unrelated nodes

robdisseldorp · April 14, 2023, 3:27pm

Hello everyone,

I am quite new to KNIME and would like to scrape a website to retrieve a list of items which are unrelated in the HTML structure but need to be related into a table.

I have the following XML generated by the Webpage Retriever node (cleaned up for clarity):

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html>
<html>
	<body>
		<h2>
			<span class="headline" id="continent1">Continent1</span>
		</h2>
		<h3>
			<span class="headline" id="country1">Country1</span>
		</h3>
		<ul>
			<li>City 1 ("link_to_city1"), 350,000 citizens</li>
			<li>City 2 ("link_to_city2"), 100,000 citizens</li>
		</ul>
		<h3>
			<span class="headline" id="country2">Country2</span>
		</h3>
		<ul>
			<li>City 3 ("link_to_city3"), 20,000 citizens</li>
			<li>City 4 ("link_to_city4"), 60,000 citizens</li>
		</ul>
		<h3>
			<span class="headline" id="country3">Country 3</span>
		</h3>
		<ul>
			<li>City 5 ("link_to_city5"), 45,000 citizens
			</li>
		</ul>
		<h2>
			<span class="headline" id="continent2">Continent 2</span>
		</h2>
		<h3>
			<span class="headline" id="country6">Country 4</span>
		</h3>
		<ul>
			<li>City 6 ("link_to_city6"), 150,000 citizens</li>
		</ul>
	</body>
</html>

I would like to retrieve the continents, the countries and the cities into one table, like the one below:

[Image: Screenshot 2023-04-14 at 17.15.04]

Unfortunately, there is no real hierarchy in the HTML structure, so I cannot detect whether a city belongs to a certain country, or a country to a certain continent, other than perhaps by going through every record sequentially. However, from what I understand, parsing HTML with regex is not recommended, and I see people recommending the XPath node instead.

Having read the XPath syntax documentation, I feel it should be possible to get the data out of the document correctly with the right XPath expression.

However, I am not an expert on XPath syntax so would be interested to see if anyone has some suggestions to tackle this challenge.

<h2> contains the continent level, <h3> contains the country level and the <li> tags contain all the cities.

Many thanks in advance,
Rob

ArjenEX · April 14, 2023, 5:02pm

Hi @robdisseldorp

Welcome to the KNIME Community!

To be honest, I don’t think this is feasible with XPath, because of the offset and because the <h2> and <h3> values need to be repeated for each <li> until a new one is found.

As long as your regex is built robustly, using it shouldn’t be a problem where XPath is not sufficient.

Anyway, below is a way to achieve the desired result.

Since I’m going for a text-based approach, I get the data with a Line Reader first.

Since the original XML is formatted, the indents are stripped with the regex [\r\t\n].

Next is a Column Expressions trick. The line that holds the actual value of <h2> is always preceded by a line that contains this tag. So if you check whether the previous line contains the tag, you know whether you are looking at the right line. If so, extract the continent (cleaning up all the HTML tags). If not, leave it blank.

if (contains(column(0,-1),"<h2>") == true) {
    regexReplace(column("Column"), "<[^>]+>","")
} else {
    null
}

In the example below, when evaluating Row5, it finds that Row4 contains <h2> and therefore retrieves the content of Row5 in a cleansed format.
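
For anyone who wants to try the idea outside KNIME, here is a minimal Python sketch of the same previous-line check. It is not part of the workflow: the function name value_after_tag and the sample lines are made up for illustration, and it assumes every line has already been stripped of indentation.

```python
import re

# Same tag-stripping pattern as in the Column Expressions snippet.
TAG = re.compile(r"<[^>]+>")

def value_after_tag(lines, tag):
    """For each line, return its tag-free text if the PREVIOUS line
    contains `tag`; otherwise return None (a missing value)."""
    result = []
    for i, line in enumerate(lines):
        previous = lines[i - 1] if i > 0 else ""
        if tag in previous:
            result.append(TAG.sub("", line).strip())
        else:
            result.append(None)
    return result

# Hypothetical sample: lines as a line reader would deliver them.
lines = [
    "<h2>",
    '<span class="headline" id="continent1">Continent1</span>',
    "</h2>",
    "<h3>",
    '<span class="headline" id="country1">Country1</span>',
    "</h3>",
]

continents = value_after_tag(lines, "<h2>")
countries = value_after_tag(lines, "<h3>")
```

Note that a closing tag such as </h2> does not match "<h2>" as a substring, so only the line directly after the opening tag is picked up.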

Make sure to enable access to other rows under the Advanced tab, since the expression reads the previous row.

Repeat this process for the other levels. To associate the continents and the countries with the cities, use a Missing Value node with the replacement set to Previous Value. Then filter out the rows where the city is missing, so that only the relevant records remain.
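
As a rough illustration of the fill-and-filter step, here is a small Python sketch. It is only an analogy for the Missing Value (Previous Value) and row filter steps, not how the nodes are implemented. The helper fill_previous and the three sample columns are hypothetical, with None standing in for a missing value and the city text shortened to a label.

```python
def fill_previous(values):
    """Replace each None with the most recent non-None value above it."""
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Hypothetical sparse columns, one entry per input line.
continent = ["Continent1", None, None, None, None, None, "Continent 2", None, None]
country = [None, "Country1", None, None, "Country2", None, None, "Country 4", None]
city = [None, None, "City 1", "City 2", None, "City 3", None, None, "City 6"]

# Fill continents and countries downwards, then keep only rows with a city.
table = [
    (cont, ctry, cty)
    for cont, ctry, cty in zip(fill_previous(continent), fill_previous(country), city)
    if cty is not None
]
```

The stale country that gets carried onto the "Continent 2" heading row is harmless, because that row has no city and is dropped by the filter.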

See WF:
Help on XPath to relate unrelated nodes.knwf (36.0 KB)

Hope this helps!

robdisseldorp · April 14, 2023, 9:06pm

Hi @ArjenEX,

Wow! Thanks for the quick reply, this is awesome! It works as expected!

Am starting to love KNIME, thanks for your time!

Best,
Rob
