I would like to create a web crawler to extract equity information from a website.
I want to extract these columns from the page above: Stock Code, Name of Listed Securities, Board Lot, Remarks
The site's equity information is in the form of an HTML table. May I know how to extract it and transform it into a KNIME data table, so that I can save it to a CSV file or a database?
I tried the XPath node, but I am not sure what to put in the XPath query for the data extraction. I would appreciate an example that illustrates how to extract an HTML table from the web.
I can give you a couple of pointers to get started using Palladian:
- Use the HttpRetriever node to request the page to parse
- Use an HtmlParser node to turn the page into an XML document
- Use an XPath node to extract what you need from the XML document
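For readers who want to see the same idea outside of KNIME, here is a minimal Python sketch of the parse and extract steps. It is not what the Palladian nodes do internally, just an illustration of the logic. The sample markup and its values are made up, and it is already well-formed XML, which is what the HtmlParser step would normally take care of for a real page. The names `SAMPLE_XML`, `table_to_rows` and `rows_to_csv` are hypothetical helpers introduced only for this example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the parsed page (invented values, not real data).
SAMPLE_XML = """
<html><body>
<table>
  <tr><th>Stock Code</th><th>Name of Listed Securities</th><th>Board Lot</th><th>Remarks</th></tr>
  <tr><td>00001</td><td>Example Co A</td><td>500</td><td>#</td></tr>
  <tr><td>00002</td><td>Example Co B</td><td>1000</td><td></td></tr>
</table>
</body></html>
"""

HEADER = ["Stock Code", "Name of Listed Securities", "Board Lot", "Remarks"]


def table_to_rows(xml_text):
    """Return one list of cell strings per data row of the table."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # ".//table/tr" selects every <tr> that is a direct child of a <table>.
    for tr in root.findall(".//table/tr"):
        cells = tr.findall("td")
        if not cells:
            # The header row only contains <th> cells, so skip it.
            continue
        rows.append([(td.text or "").strip() for td in cells])
    return rows


def rows_to_csv(header, rows):
    """Serialize the header and rows to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


rows = table_to_rows(SAMPLE_XML)
print(rows_to_csv(HEADER, rows))
```

On a real page the table may be nested differently (for example, parsers often insert a `tbody` element), so the path would need to be adjusted to the actual structure.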
If you are not familiar with XPath, you can study one of the many available tutorials. Here is a simple one to get started:
A neat trick to use with the XPath node is the following. Once your XPath node is connected, open its configuration and click on any XML element inside the XML preview. The selected XPath will be shown in green under the XPath summary box. At this point you can simply click Add XPath to add it. This way you don't need to type the XPath query.
You may take advantage of the Multiple Rows option to retrieve all the values in each table column at once.
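To illustrate the idea of one query per column, here is a small Python sketch. It assumes a simplified, made-up table and uses a positional query of the form `tr/td[n]`, which matches the n-th cell of every row, so a single query yields a whole column. The helper `column` is hypothetical, and the exact path for a real page will depend on its structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified table (invented values).
SAMPLE_XML = """<table>
  <tr><th>Stock Code</th><th>Board Lot</th></tr>
  <tr><td>00001</td><td>500</td></tr>
  <tr><td>00002</td><td>1000</td></tr>
</table>"""

root = ET.fromstring(SAMPLE_XML)


def column(root, index):
    """Return the text of the index-th <td> (1-based) of every row."""
    return [(td.text or "").strip() for td in root.findall(f".//tr/td[{index}]")]


codes = column(root, 1)  # all values of the first column
lots = column(root, 2)   # all values of the second column

# Re-assemble the columns into rows, as they would appear in a data table.
table = list(zip(codes, lots))
print(table)
```

The header row is skipped automatically here because it contains `th` cells rather than `td` cells.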
Hope this helps.
I have now managed to do what I wanted based on your suggestions. Thanks very much.
I would like to extract data or connect from this website:
But as I'm a new user, I'm having a little difficulty.
I followed the steps above, but I think that only works if I want to check something in the XML code and not a table, am I right?
The XPath node, mentioned in step 3 of my original answer, has exactly this purpose: extracting the necessary information from the XML structure into a data table, so you can work with it further.
Hope this clarifies it, otherwise feel free to ask again.