XPath Output String Problem

amanu · December 2, 2015, 5:57pm

Hi,

I have a problem regarding the generated output of my XPath node.

Basically, what I want to do is collect all related categories from a Wikipedia article using an XPath query.

The relevant xhtml code (example article) looks as follows:

<div class="mw-normal-classcatlinks" id="mw-normal-catlinks">
                        <a href="https://en.wikipedia.org/wiki/Help:Category" title="Help:Category">Categories</a>: <ul>
                            <li>
                                <a href="https://en.wikipedia.org/wiki/Category:Berlin" title="Category:Berlin">Berlin</a>
                            </li>
                            <li>
                                <a href="https://en.wikipedia.org/wiki/Category:German_state_capitals" title="Category:German state capitals">German state capitals</a>
                            </li>
                            <li>
                                <a href="https://en.wikipedia.org/wiki/Category:Capitals_in_Europe" title="Category:Capitals in Europe">Capitals in Europe</a>
                            </li>
                            <li>
                                <a href="https://en.wikipedia.org/wiki/Category:City-states" title="Category:City-states">City-states</a>
                            </li>
                            <li>
                                <a href="https://en.wikipedia.org/wiki/Category:Members_of_the_Hanseatic_League" title="Category:Members of the Hanseatic League">Members of the Hanseatic League</a>
                            </li>
                            <li>
                                <a href="https://en.wikipedia.org/wiki/Category:Populated_places_established_in_the_13th_century" title="Category:Populated places established in the 13th century">Populated places established in the 13th century</a>
                            </li>
                            <li>
                                <a href="https://en.wikipedia.org/wiki/Category:Populated_places_established_in_1237" title="Category:Populated places established in 1237">Populated places established in 1237</a>
                            </li>
                        </ul>
                    </div>

Now I use one of the XPath nodes (normal or deprecated) with the query

XPath:

//dns:div[@id="mw-normal-catlinks"]

or

XPath (deprecated):

//xhtml:div[@id="mw-normal-catlinks"]

as a result I get a column containing the different categories, which is nice, but unfortunately with the words attached together as one whole string, looking like:

Categories: BerlinGerman state capitalsCapitals in EuropeCity-statesMembers of the Hanseatic LeaguePopulated places established in the 13th centuryPopulated places established in 1237

Is it possible to add a space, a comma, or some other separator between the category names? For example:

Categories: Berlin, German state capitals, Capitals in Europe, City-states, Members of the Hanseatic League, Populated places established in the 13th century, Populated places established in 1237

Or, even better, to get every category in its own cell, column-wise (catword1, catword2, and so on)?

I appreciate any help!

Thanks,

Manu

thor · December 11, 2015, 9:44am

Your XPath selects the whole div element. With a result type of "string" (which I assume is what you chose), you get the text of everything inside the div run together, which is not the output you want. If you only want the categories, select the <li> elements and change the output type to something like a collection or several columns. It really depends on what exactly the output should look like.
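
For readers following along outside KNIME, the difference between selecting the whole div and selecting the individual <li> elements can be sketched in a few lines of Python. This is only an illustration using the standard library's xml.etree.ElementTree, not what the KNIME XPath node does internally; the shortened snippet and the prefix "x" are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Shortened, hypothetical version of the category box from the question.
SNIPPET = f"""<div xmlns="{XHTML_NS}" id="mw-normal-catlinks">
  <a href="https://en.wikipedia.org/wiki/Help:Category">Categories</a>: <ul>
    <li><a href="https://en.wikipedia.org/wiki/Category:Berlin">Berlin</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Category:German_state_capitals">German state capitals</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Category:City-states">City-states</a></li>
  </ul>
</div>"""

ns = {"x": XHTML_NS}  # the prefix "x" is arbitrary, chosen for this sketch
div = ET.fromstring(SNIPPET)

# Whole div as one string: all text nodes are glued together
# (whitespace-only nodes between the tags are dropped here).
whole_div = "".join(t for t in div.itertext() if t.strip())

# One entry per <li>: select the list items, then read each item's text.
categories = ["".join(li.itertext()).strip() for li in div.findall(".//x:li", ns)]

print(whole_div)
print(", ".join(categories))
```

Selecting the list items yields a list of separate values, which can then be joined with any separator or spread over several columns.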

amanu · January 9, 2016, 4:54pm

Hi Thor,

thanks for your answer.

I was able to figure out a solution, which I want to describe here, based on Tobias' advice in the following thread:

https://tech.knime.org/forum/knime-users/problem-retrieving-multiple-elements-with-xpath

Using the HtmlParser, I get the whole Wikipedia article's XHTML code.

I extract the relevant code using XPath (deprecated) with the following XPath statement:

  //xhtml:div[@id="mw-normal-catlinks"]

which gives me the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<div id="mw-normal-catlinks" xmlns="http://www.w3.org/1999/xhtml">
    <a href="http://en.wikipedia.org/wiki/Help:Category" title="Help:Category">Categories</a>: <ul>
        <li>
            <a href="http://en.wikipedia.org/wiki/Category:English_film_actresses" title="Category:English film actresses">English film actresses</a>
        </li>
        <li>
            <a href="http://en.wikipedia.org/wiki/Category:English_television_actresses" title="Category:English television actresses">English television actresses</a>
        </li>
        <li>
            <a href="http://en.wikipedia.org/wiki/Category:English_voice_actresses" title="Category:English voice actresses">English voice actresses</a>
        </li>
        <li>
            <a href="http://en.wikipedia.org/wiki/Category:Living_people" title="Category:Living people">Living people</a>
        </li>
        <li>
            <a href="http://en.wikipedia.org/wiki/Category:20th-century_English_actresses" title="Category:20th-century English actresses">20th-century English actresses</a>
        </li>
        <li>
            <a href="http://en.wikipedia.org/wiki/Category:21st-century_English_actresses" title="Category:21st-century English actresses">21st-century English actresses</a>
        </li>
        <li>
            <a href="http://en.wikipedia.org/wiki/Category:English_screen_actor_stubs" title="Category:English screen actor stubs">English screen actor stubs</a>
        </li>
        <li>
            <a href="http://en.wikipedia.org/wiki/Category:British_voice_actor_stubs" title="Category:British voice actor stubs">British voice actor stubs</a>
        </li>
    </ul>
</div>

As advised, I used the Ungroup node, after which I could use the XPath (normal) node with the XPath statement:

  //dns:div//dns:a

with Return Type 'String Cell' and the multiple tag option 'Multiple Columns'.

That adds a new column for each category (the text between <li> and </li>), which is the best possible result.
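
The "one column per category" layout can be approximated in Python as follows. This is a rough sketch, not the behaviour of the KNIME node: the function name category_columns and the column names "category 1", "category 2", ... are made up for illustration. Note that a query matching every <a> inside the div also picks up the "Categories" help link, so the sketch restricts the match to links inside <li> elements.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"dns": XHTML_NS}  # same prefix as in the XPath statement above


def category_columns(div_xml):
    """Return one hypothetical column per category link found in the div."""
    div = ET.fromstring(div_xml)
    # Only links inside list items; this skips the Help:Category link.
    links = div.findall(".//dns:li/dns:a", NS)
    return {
        f"category {i}": (a.text or "").strip()
        for i, a in enumerate(links, start=1)
    }


# Shortened, hypothetical input in the shape shown above.
example = f"""<div xmlns="{XHTML_NS}" id="mw-normal-catlinks">
  <a href="https://en.wikipedia.org/wiki/Help:Category">Categories</a>: <ul>
    <li><a href="https://en.wikipedia.org/wiki/Category:Berlin">Berlin</a></li>
    <li><a href="https://en.wikipedia.org/wiki/Category:City-states">City-states</a></li>
  </ul>
</div>"""

row = category_columns(example)
print(row)
```

An article with a single category simply produces a row with one column.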

Thor, regarding your answer: I was not able to select '<li>' as an XPath element using XPath (normal) (error: "Selected XML element is not a tag nor an attribute"). Do you know why that is?

Luckily it was enough to select the <a> elements, which led to the desired output :)
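
The thread does not establish what caused that error. One thing worth checking in situations like this is the namespace prefix: in a document that declares the XHTML namespace, an unprefixed element name usually matches nothing. The Python sketch below illustrates this general XPath behaviour with xml.etree.ElementTree; it is an assumption that this is related to the KNIME message, and the tiny document is invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Tiny, hypothetical namespaced document.
doc = ET.fromstring(
    f'<div xmlns="{XHTML_NS}"><ul><li>Berlin</li><li>City-states</li></ul></div>'
)

# Without a prefix, the query looks for <li> in no namespace and finds nothing.
unprefixed = doc.findall(".//li")

# With a prefix bound to the XHTML namespace, both list items are found.
prefixed = doc.findall(".//dns:li", {"dns": XHTML_NS})

print(len(unprefixed), len(prefixed))
```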

I'm kind of a newbie, so I have no idea why I wasn't able to use the XPath (normal) node directly after the HtmlParser. And what exactly is the difference between the deprecated node and the normal one?

I have all extensions and packages installed, but when I search for the deprecated node in the node repository I can't find it. Searching for 'XPath' only shows me the "normal" one.

I got the deprecated one from an example workflow from another thread.

Of course I will attach my solution as an example workflow, for future reference. As input I used Wikipedia's random page URL. Sometimes an article has only one category; in that case, of course, there won't be a multiple-column output.

Thanks,

Manu

xpath_example_workflow.zip