I have a list of websites that I would like to extract the header image from. I was thinking the best way would be to find each of the “head” elements and then extract the first image found, but I'm not sure which node to use or how best to set that up.
If you really want a more or less generic solution, you could extract all <img> elements and then pick the right one using some heuristics, i.e. filter based on desired properties (e.g. file format, size, file name, etc.). I once did something like this to extract the teaser image of news articles. However, in my experience, this requires quite a bit of fine-tuning and will probably never be 100% accurate anyway.
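To make the idea a bit more concrete, here is a rough sketch of that kind of filtering in plain Python (standard library only), not an actual KNIME workflow. The names `ImgCollector` and `candidate_images`, the allowed extensions and the minimum width of 200 px are all assumptions chosen for illustration; real pages will need different rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed heuristics, purely illustrative: tune them for your own sites.
ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")
MIN_WIDTH = 200  # skip icons and tracking pixels that declare a small width


class ImgCollector(HTMLParser):
    """Collects the src and width attributes of every <img> element."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if src:
            self.images.append({
                # Resolve relative paths against the page URL.
                "src": urljoin(self.base_url, src),
                "width": attributes.get("width"),
            })


def to_int(value):
    """Returns the value as int, or None if it is missing or not a number."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def candidate_images(html, base_url):
    """Returns image URLs that pass the file-format and size filters."""
    parser = ImgCollector(base_url)
    parser.feed(html)
    keep = []
    for img in parser.images:
        path = urlparse(img["src"]).path.lower()
        if not path.endswith(ALLOWED_EXTENSIONS):
            continue
        width = to_int(img["width"])
        # Images without a declared width are kept; we cannot judge them here.
        if width is not None and width < MIN_WIDTH:
            continue
        keep.append(img["src"])
    return keep


sample = """
<html><body>
<img src="/static/pixel.gif" width="1" height="1">
<img src="/img/icon.png" width="32">
<img src="/img/header-banner.jpg" width="1200" alt="Header">
<img src="teaser.png">
</body></html>
"""

print(candidate_images(sample, "https://example.com/news/"))
```

The first surviving candidate would be a reasonable guess for the header image, but as said, expect to adjust the rules per site.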
Other idea: Would the “Company Logo API” help? This allows you to get a logo (not necessarily extracted from the website) for a given domain. There’s even the option to define file format, size and color mode.
You can access this API e.g. using Palladian’s HTTP Retriever or some REST nodes:
EDIT: Note: The Webpage Retriever and GET Request nodes are part of the KNIME REST Web Services. The HTTP Retriever is part of the Palladian Extension, which you would have to install.
Here’s a quick example with Webpage Retriever and GET Request in case you don’t have the Palladian Extension, though I would recommend it: Retrieve http content.knwf (10.9 KB)