How do I scrape data from a website

tejalgavate · July 15, 2020, 11:29am

Hi,

I am new to KNIME. Can anyone guide me on how to scrape data from a website?

sven-abx · July 15, 2020, 11:37am

Welcome to KNIME. You can access some example workflows via https://hub.knime.com.
Alternatively, you can use the Python node.

Best regards,
sven

tejalgavate · July 15, 2020, 1:40pm

Thanks.
Actually, I am trying to extract the brand names, prices, stock availability, etc. from this website: https://mumbaidutyfree.net
Can you show me the right way to do this?

julian.bunzel · July 16, 2020, 8:14am

Hey @tejalgavate,

A first pointer would be the GET Request node or the Webpage Retriever node. Afterwards, you would use the XPath node to extract the data (if the output of the GET Request node is XML) or do the extraction with regular expressions.
Having a look at the website, I think this task might get quite messy since it’s a dynamic page: you won’t see all the products at once, because more products are loaded only when you scroll to the bottom of the page.
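
For the Python node route, the retrieve-then-extract idea for a static page can be sketched roughly as follows. This is only an illustration: the markup, the class names (product, brand, price, stock) and the values are made up and are not taken from the site in question. It also assumes the page source is well-formed XML, which real HTML often is not, and it does not handle products that are loaded on scroll.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a retrieved product page.
# The class names and values are invented for this sketch.
SAMPLE_PAGE = """
<html><body>
  <div class="product">
    <span class="brand">Brand A</span>
    <span class="price">1200</span>
    <span class="stock">In stock</span>
  </div>
  <div class="product">
    <span class="brand">Brand B</span>
    <span class="price">850</span>
    <span class="stock">Out of stock</span>
  </div>
</body></html>
"""


def extract_products(page_source):
    """Return one dict per product block found in the page source."""
    root = ET.fromstring(page_source.strip())
    products = []
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    for node in root.findall(".//div[@class='product']"):
        products.append({
            "brand": node.findtext("span[@class='brand']"),
            "price": node.findtext("span[@class='price']"),
            "stock": node.findtext("span[@class='stock']"),
        })
    return products


products = extract_products(SAMPLE_PAGE)
for product in products:
    print(product)
```

In practice the page source would come from an HTTP request instead of a constant, and the XPath expressions would have to be adapted to the real markup.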

You can give it a try, but it doesn’t look too easy.

Cheers,

Julian

JanDuo · July 16, 2020, 9:22am

Hi @tejalgavate
Web scraping is not an easy thing (as @julian.bunzel also indicated). I don’t want to scare you off from starting, but there are a few things to consider upfront.

It’s not only a matter of retrieving a webpage and interpreting its HTML code: strictly speaking, a scraper you build yourself should also adhere to the rules in robots.txt, which most websites provide. But https://mumbaidutyfree.net/robots.txt is very friendly in this respect because it allows almost anything.
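
If you go the Python route, a robots.txt check could look something like the sketch below, using the standard library's urllib.robotparser. The rules, the example.com URLs and the user agent name "my-scraper" are hypothetical and chosen only for illustration; they are not the contents of any real site's robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not copied from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /customer/
"""

USER_AGENT = "my-scraper"  # hypothetical name for your own scraper


def build_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
product_allowed = parser.can_fetch(USER_AGENT, "https://example.com/products/perfume")
checkout_allowed = parser.can_fetch(USER_AGENT, "https://example.com/checkout/cart")
print(product_allowed, checkout_allowed)
```

Against a live site, the parser's set_url() and read() methods can download the file instead of parsing a constant.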

If sites are stricter (and allow little), you can get banned from accessing them (based on your IP address and/or user agent characteristics, i.e. OS and browser version).
Search for “proxy switcher”, “free proxy switcher” or something similar: this has become a business of its own, and it exists to let web scrapers keep scraping without being noticed by the admin of a site.
If you don’t have a proxy switcher you can use, you must try to mimic a human user. So do not fire dozens of HTTP requests per second at a domain, because it might raise an alarm.
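
One simple way to avoid firing requests too quickly is to pause for a random interval between them. The sketch below assumes you supply your own fetch function; the 2 to 5 second range is an arbitrary guess, not a recommendation for any particular site. The demo uses stand-ins for fetching and sleeping, so nothing is actually requested or delayed.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Randomised pause so the requests do not arrive at a fixed rhythm.
            sleep(random.uniform(min_delay, max_delay))
        results.append(fetch(url))
    return results


# Demo with stand-ins: the fake fetch returns a stub string and the fake
# sleep only records how long the real one would have waited.
pauses = []
pages = fetch_politely(
    [
        "https://example.com/page/1",
        "https://example.com/page/2",
        "https://example.com/page/3",
    ],
    fetch=lambda url: "<html>stub for " + url + "</html>",
    sleep=pauses.append,
)
print(len(pages), "pages fetched with", len(pauses), "pauses")
```

With a real fetch function, leave the sleep argument at its default so the pauses actually happen.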

And last: if you have a working scraper, you might notice after some time that it has stopped collecting data. The site might have received an update that changed the HTML structure, which makes your coded XPath logic useless. Whatever you do, this can always happen.
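
You can’t prevent this, but you can at least notice it early by checking the extracted records after each run. A minimal sketch, assuming each record is a dict and that the field names (brand, price, stock) are the hypothetical ones you chose yourself:

```python
def find_extraction_problems(records, required_fields):
    """Return readable problems that hint at a changed page layout."""
    if not records:
        return ["no records extracted; the selectors may no longer match"]
    problems = []
    for position, record in enumerate(records):
        # A field counts as missing when it is absent, None, or empty.
        missing = [name for name in required_fields if not record.get(name)]
        if missing:
            problems.append(f"record {position} is missing: {', '.join(missing)}")
    return problems


REQUIRED = ["brand", "price", "stock"]  # hypothetical field names
good = [{"brand": "Brand A", "price": "1200", "stock": "In stock"}]
broken = [{"brand": "Brand A", "price": None, "stock": ""}]

print(find_extraction_problems(good, REQUIRED))
print(find_extraction_problems(broken, REQUIRED))
print(find_extraction_problems([], REQUIRED))
```

An empty list means the run looks fine; anything else is a signal to look at the page and update the XPath expressions.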

Scraping is still possible, but it takes some perseverance.
