How to import this HTML web to KNIME table?
Hi,
As @hmfa mentioned, the Webpage Retriever is probably your best bet to get the data into KNIME. Then you end up with your HTML as XML document. On this you can try this component from KNIME Hub: Table from XHTML – KNIME Community Hub to extract the table.
Kind regards,
Alexander
Hi @shafiqakib,
Could you share the HTML where you want to extract the table from? The node makes a few assumptions, such as that the data must be in elements without any child elements. If you could share the HTML, I could have a look how to adapt the component to your needs. You can also change the data in order to not share anything confidential, as long as the HTML structure stays the same.
Kind regards,
Alexander
This HTML that I am working on right now: https://exzellprofis.com/Customer_Data.htm
Hi @shafiqakib,
Thank you for the link. The problem is that this website builds the table dynamically from JavaScript. The Webpage Retriever only fetches the initial HTML and does not execute any JavaScript in the background. That means the fetched HTML does not contain the table yet and you won’t be able to read it.
If you have a lot of these tables and really need to fully automate it, you probably have to use browser automation like Selenium to achieve your goals. However, when a tiny bit of manual work to extract the data from the website is ok, I usually go a different route. I will describe it here.
In your browser, open the page with the table and then open the browser’s developer tools. I am using Chrome, so I am going to describe my approach based on that, but it works in a similar fashion in other browsers, too. In Chrome, you can right-click the page and select “Inspect” to open the dev tools. Then switch to the Console tab of the dev tools and enter the following code:
var trs = Array.from(document.getElementsByTagName("table")[0].querySelectorAll("tr")).filter(r => r.getAttribute("height") != "0");
var column_names = Array.from(trs[0].querySelectorAll("td")).map(td => td.textContent);
var rows = trs.slice(1).map(tr => Array.from(tr.querySelectorAll("td")).slice(0, -1).map(td => td.textContent));
rows.map((row, i) => row.reduce((m, c, i) => ({...m, [column_names[i]]: c}), {}));
Unfortunately, this is not the same for each table, as websites do them differently. Your table, for example, has neither <thead>
nor <tbody>
elements and also does not use <th>
elements to represent header cells. It uses <td>
for everything. Additionally, the table has an empty column on the right and an empty row at the end.
My code in the first line first gets a reference to the first table in the document (document.getElementsByTagName("table")[0]
), then inside that it gets all <tr>
elements, which represent table rows (.querySelectorAll("tr")
). It then filters out rows with a height attribute that has the value 0, because those rows have no content in your example.
In the second line, I am getting a list of text content from the first row. I use the map() function for it. It is an array function in JavaScript, but since all the DOM query functions return a NodeList object, I use Array.from() a lot for conversion.
In the third row, I am getting all rows as arrays, so that the format is [[1, 2, 3], [4, 5, 6], ...]
: an array of arrays, where the inner arrays are the cells of each row. Here I use the slice() function to remove the first row (trs.slice(1)
) and the last column (Array.from(tr.querySelectorAll("td")).slice(0, -1)
).
In the last line of the code, I am building a JSON object that KNIME understands. I keep the array of the rows, but instead of the inner arrays, I have objects where the keys are the column names and the values are the column values. This is done using the reduce() function. Because I do not assign the result of this line to a variable, it will be printed to the console. It should look somewhat like this:
You can see that it has extracted 100 rows and when you click on the small triangle on the left, you can expand the object and see what it has created.
[
{"Customer↵ Record No": '1', Order Date: '05/01/2016', Customer Type: '2', Avg No. of orders per mo: '24', Avg days Order to delivery time: '38', …}
{"Customer↵ Record No": '2', Order Date: '05/01/2016', Customer Type: '3', Avg No. of orders per mo: '36.4', Avg days Order to delivery time: '42', …}
[...]
]
Now right-click the object in the console like in the screenshot below and select “copy object”. This copies a JSON representation of the result into your clipboard.
And finally, paste the clipboard content into the first cell of the Table Creator node in the workflow I attach below. It will unpack the JSON and make a KNIME table from it.
JSON Table.knwf (13.7 KB)
I know this might be a bit harder to understand than what you have asked for, but maybe it does help you with similar tables in the future. Good luck!
Kind regards,
Alexander
WOW very well explanation. Thank you so much for helping me
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.