Taking the Logo off of different websites

brendencampana · September 13, 2021, 1:49pm

Hey Selenium Community,

I have a list of websites that I want to extract the header image from. I was thinking the best way would be to find each site's “head” element and then extract the first image found, but I'm not sure which node to use or how best to set that up.

Thank you!

bruno29a · September 13, 2021, 2:29pm

Hi @brendencampana, I'm not sure if the Selenium node is the best fit in this situation, but I could be wrong.

Alternatively, you could just retrieve the page with an HTTP request, then parse the HTML that's returned and extract the image from there.
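
For anyone who wants to try the idea outside of KNIME, here is a minimal Python sketch of "retrieve the page, parse the HTML, take the first image". It only uses the standard library. The names `FirstImageParser`, `first_image_url` and `fetch_html` are made up for this illustration, and it assumes that the first `<img>` in the markup is the header image, which will not hold for every site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen


class FirstImageParser(HTMLParser):
    """Remember the src of the first <img> tag seen in the document."""

    def __init__(self):
        super().__init__()
        self.first_src = None

    def handle_starttag(self, tag, attrs):
        # Only keep the first image that actually has a src attribute.
        if tag == "img" and self.first_src is None:
            src = dict(attrs).get("src")
            if src:
                self.first_src = src


def first_image_url(html, base_url):
    """Return the absolute URL of the first <img> in html, or None."""
    parser = FirstImageParser()
    parser.feed(html)
    if parser.first_src is None:
        return None
    # Resolve relative paths such as "/static/logo.png" against the page URL.
    return urljoin(base_url, parser.first_src)


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example usage (needs network access; example.com is a placeholder):
# page_url = "https://example.com/"
# print(first_image_url(fetch_html(page_url), page_url))
```

Keeping the parsing separate from the download makes it easy to test the extraction on saved HTML before running it over the whole list of websites.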

brendencampana · September 13, 2021, 2:33pm

Is there another piece of software or a node that might work for this?

qqilihq · September 13, 2021, 2:49pm

Hi Brenden,

If you really want a more or less generic solution, you could extract all <img> elements and then, probably using some heuristics, filter them based on desired properties (e.g. file format, size, file name, etc.). I once did something like this to extract the teaser image of news articles. However, based on my experience, this will require quite some fine-tuning and will probably never be 100% accurate anyway.
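
To make the filtering idea a bit more concrete, here is a rough Python sketch (standard library only) that collects every `<img>` and ranks them with a few simple heuristics. The scoring rules and weights are pure assumptions for illustration, not what I used back then, and the names `ImageCollector`, `score_image` and `best_logo_candidate` are hypothetical. Image dimensions are not checked here, since that would require downloading each file.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# File extensions treated as plausible logo formats (an assumption).
LOGO_EXTENSIONS = (".png", ".svg", ".webp", ".gif", ".jpg", ".jpeg")


class ImageCollector(HTMLParser):
    """Collect the attributes of every <img>, noting if it sits inside <header>."""

    def __init__(self):
        super().__init__()
        self.images = []
        self._header_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "header":
            self._header_depth += 1
        elif tag == "img":
            info = {name: (value or "") for name, value in attrs}
            info["in_header"] = self._header_depth > 0
            self.images.append(info)

    def handle_endtag(self, tag):
        if tag == "header" and self._header_depth > 0:
            self._header_depth -= 1


def score_image(info):
    """Higher score = more likely to be the logo (illustrative weights)."""
    text = " ".join(info.get(k, "") for k in ("src", "alt", "class", "id")).lower()
    path = urlparse(info.get("src", "")).path.lower()
    score = 0
    if "logo" in text:              # file name, alt text, class or id mentions "logo"
        score += 3
    if info.get("in_header"):       # image is nested inside a <header> element
        score += 2
    if path.endswith(LOGO_EXTENSIONS):
        score += 1
    return score


def best_logo_candidate(html, base_url):
    """Return the absolute URL of the highest-scoring <img>, or None."""
    collector = ImageCollector()
    collector.feed(html)
    candidates = [img for img in collector.images if img.get("src")]
    if not candidates:
        return None
    best = max(candidates, key=score_image)  # ties go to the earliest image
    return urljoin(base_url, best["src"])
```

Expect to adjust the weights and add rules per site category; this is exactly the fine-tuning part that tends to take the most time.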

Another idea: Would the “Company Logo API” help? It lets you get a logo (not necessarily extracted from the website) for a given domain. There's even the option to define the file format, size and color mode.

You can access this API e.g. using Palladian's HTTP Retriever or some REST nodes.

Hope this helps!

bruno29a · September 13, 2021, 3:00pm

Hi @brendencampana, there are a few nodes that you can use:

EDIT: Note: The Webpage Retriever and GET Request are part of the KNIME REST Web Services. The HTTP Retriever is part of the Palladian Extension, which you would have to install.

Here's a quick example with Webpage Retriever and GET Request in case you don't have the Palladian Extension, though I would recommend installing it:
Retrieve http content.knwf (10.9 KB)
