I have been exploring how KNIME can be used to automatically extract event details from webpages and put them into a table or database, specifically details like event name, place, time, date, costs, people, and general event information. I've explored using the Palladian and OpenNLP NE taggers to gather this type of information.
Has anyone had success with this type of data extraction and databasing from webpages?
For pages that contain multiple events and related information, how could event-related details be grouped together - by relative proximity and/or page formatting (CR, LF, spaces)?
Can KNIME/Palladian identify formatting on a webpage - such as bolded text that might identify an event name?
Would RegEx be the best way to identify phone numbers, email addresses, formatted dates?
If you're extracting the mentioned kind of information from webpages, I would not recommend going the NER way unless there is a very good reason to do so (e.g. absolutely unstructured plain text), as state-of-the-art NERs are not 100% accurate and thus introduce potential errors. If you have a manageable number of different sites, rather try to make use of XPath expressions (using KNIME's XML nodes) and exploit the HTML page's structure to extract the desired data (e.g. exploit that information is given in a certain position within a table, list, or headline).
A typical workflow to extract data using XPath would look like: URL input -> HttpRetriever -> HtmlParser -> XPath -> ...
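To make the XPath step more concrete, here is a minimal sketch in Python rather than KNIME. It assumes a well-formed sample page and uses the standard library's ElementTree, which only supports a subset of XPath, as a stand-in for the HtmlParser and XPath nodes. The page content, the table id, the class names, and the function name are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page. A real page would first need an HTML
# parser that repairs malformed markup (the HtmlParser step in the workflow).
SAMPLE_PAGE = """
<html><body>
  <table id="events">
    <tr><td class="name">Spring Concert</td><td class="date">2015-04-12</td><td class="place">City Hall</td></tr>
    <tr><td class="name">Data Meetup</td><td class="date">2015-05-03</td><td class="place">Library</td></tr>
  </table>
</body></html>
"""


def extract_events(page):
    """Return one dict per table row, using fixed positions in the page structure."""
    root = ET.fromstring(page.strip())
    events = []
    # One XPath selects the repeating unit (a row), further ones select the fields.
    for row in root.findall(".//table[@id='events']/tr"):
        events.append({
            "name": row.findtext("td[@class='name']"),
            "date": row.findtext("td[@class='date']"),
            "place": row.findtext("td[@class='place']"),
        })
    return events


if __name__ == "__main__":
    for event in extract_events(SAMPLE_PAGE):
        print(event)
```

Each site would need its own set of expressions, which is why this approach fits a manageable number of sources.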
Yes, using regular expressions can make sense if you want to identify and filter certain types of information (e.g. phone numbers, address data, URLs, ...), but I would still recommend narrowing down potential information using XPath queries first.
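As a rough illustration of the regular-expression part, the following Python sketch pulls email addresses, phone numbers, and dates out of a piece of text that has already been narrowed down. The patterns are simplified assumptions (US-style 10-digit phone numbers, ISO and M/D/YYYY dates only) and would need tuning for real pages.

```python
import re

# Deliberately simple patterns; assumptions for illustration, not complete validators.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{4}\b")


def find_contact_details(text):
    """Collect all pattern matches from a text fragment, in order of appearance."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "dates": DATE_RE.findall(text),
    }


if __name__ == "__main__":
    sample = "Call (555) 123-4567 or mail info@example.org before 2015-04-12."
    print(find_contact_details(sample))
```

Running the patterns only on the text of a selected element, instead of the whole page, keeps false matches down.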
Do you have any first attempts already? If you're experiencing any specific issues feel free to get back (ideally with a sample workflow).
I've used several commercial webscraping software packages/services in an attempt to extract event data from webpages and have experienced varying results. All of the webscraping software that I have used so far requires that I identify samples of the information and identify what type of information it is in advance (date, address, name, ...). My goal is to extract event details from websites using a webbot approach.
My thinking is to use NER, regular expressions, XML, and other tools to make an automated attempt at identifying, grouping, and extracting event details, and then have the workflow flag webpages that are missing necessary details (e.g. event name, address, date, time). These flagged webpages would be reviewed by a human who could identify the missing information and feed it back into the model/workflow (NER, regular expressions) so that similar information could be identified automatically in evaluations of future webpages. The human evaluator would also review non-flagged webpages and update the models/workflow accordingly if errors in the automated evaluation were observed. This process would train the workflow to become more accurate in gathering information automatically. The process would require significant human interaction initially but would hopefully become more automatic over time.
I have not created a working sample workflow in KNIME yet. I have studied existing sample workflows for Named Entity tagging, dictionary based tagging, PMML, Palladian, and Stanford's Named Entity Tagger. I'm a novice with KNIME and am looking for previous examples or advice that could help me accelerate development of this process.
Sorry for getting back late, I'm quite busy currently.
Your project sounds interesting and quite ambitious :) Are you planning to crawl the "open web", or do you consider a fixed set of sources from which to extract your data? In the latter case, I would very strongly recommend manually creating extraction rules for your individual sources, provided the number of sources is somewhat manageable. (We, the Palladian developers, actually have an IE/IR background and also researched machine learning in the context of Web data extraction in great depth, yet from our experience I would almost always recommend a rule-based way in case it is feasible.)
Another lesson learned: When extracting data from the Web, exploiting the page's structure (i.e. the DOM tree) is in general much more promising than employing NLP extraction (i.e. NER). There are of course exceptions, e.g. in the chemical/bio domain, but for your use case, I suspect the task of e.g. identifying event names through an NER will be very hard and prone to error.
Depending on the kind of pages you want to crawl, some generic approach could be feasible. E.g. we once created a KNIME workflow for extracting products from different web shops. For this, we exploited different cue words within the DOM elements' class/id attributes and the actual text content in order to classify elements as relevant/irrelevant. While not 100% accurate, this approach worked pretty satisfactorily, yet involved plenty of fine-tuning using manually checked testing data. (A fully automatically trained approach would not have been feasible here, because we did not have enough data for training.) -- maybe such an approach could also make sense for your case?
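To give an idea of what a cue-word approach can look like, here is a heavily simplified Python sketch. It is not the workflow mentioned above: it only checks class/id attributes (not the text content), the cue-word list and the sample markup are assumptions chosen for events, and it expects well-formed markup because it uses ElementTree from the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical cue words; a real list would come from inspecting the target sites.
CUE_WORDS = ("event", "date", "venue", "location", "price")

SAMPLE_MARKUP = """
<div>
  <div class="nav-bar">Home</div>
  <div class="event-item">
    <span class="event-date">12 April</span>
    <span id="venueName">City Hall</span>
  </div>
  <div class="footer">Imprint</div>
</div>
"""


def relevant_elements(markup):
    """Return (tag, matched cue words, text) for elements whose class/id hints at event data."""
    root = ET.fromstring(markup.strip())
    hits = []
    for element in root.iter():
        attrs = (element.get("class", "") + " " + element.get("id", "")).lower()
        matched = [cue for cue in CUE_WORDS if cue in attrs]
        # Collapse whitespace so nested elements yield one readable string.
        text = " ".join("".join(element.itertext()).split())
        if matched and text:
            hits.append((element.tag, matched, text))
    return hits


if __name__ == "__main__":
    for hit in relevant_elements(SAMPLE_MARKUP):
        print(hit)
```

Substring matching like this is crude, so the cue words and any scoring on top of them would have to be tuned against manually checked pages.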
Keep me updated about your progress and if you have any specific questions, feel free to get back :)
Thank you for your insight. The project would be to do both a crawl of the open web and also have a focus on known websites. Exploiting the page's structure would be a necessity - especially when extracting from a page or website listing multiple events. The structure of a multiple-event listing is usually repetitive - I am thinking that NER, regular expressions, and/or cue words could be used to identify many known places, times/dates, names, numbers, locations, ... Then the workflow could consider (i.e. map) the DOM structure to identify unknown places, times/dates, names, numbers, locations, ... based on their position in the DOM structure relative to known values.
For identifying the event title, the workflow could consider the line at the beginning of the identified event details and/or the line that is bolded or in larger text - event titles are often presented this way.
For the crawl portion, consider the KNIME website; on the main page, there is a link titled "Events". The proposed workflow would crawl to the main page of a website and try to identify any event details on the page. Then the workflow would look for a link with the title "events" or something similar. It would then follow the link to the "event" page and use the method described in the first paragraph of this note. It would identify dates, locations, event titles (at the beginning and in bold/larger text) and then follow the links in each listing for more details from the Eventbrite or Meetup listing.
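A minimal Python sketch of this title heuristic, assuming each event has already been isolated as a small block of well-formed markup. The tag list and function name are illustrative assumptions; detecting "larger text" set via CSS would need more than this.

```python
import xml.etree.ElementTree as ET

# Tags that commonly mark emphasised or heading text (an assumption, not a complete list).
TITLE_TAGS = ("h1", "h2", "h3", "h4", "b", "strong")


def guess_title(block_markup):
    """Guess an event title: first heading/bold element, else the first non-empty text."""
    block = ET.fromstring(block_markup.strip())
    # iter() walks in document order, so the earliest emphasised element wins.
    for element in block.iter():
        if element.tag in TITLE_TAGS:
            text = " ".join("".join(element.itertext()).split())
            if text:
                return text
    # Fallback: the first piece of text at the beginning of the block.
    for text in block.itertext():
        if text.strip():
            return text.strip()
    return None


if __name__ == "__main__":
    print(guess_title("<li><span>12 April</span> <b>Spring Concert</b> at City Hall</li>"))
```

The fallback is deliberately naive; on many listings the first text is a date, so the result should be treated as a candidate for review.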
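The link-following step could be sketched in Python roughly as follows, using only the standard library. The pattern for "events or something similar" (here also "calendar") and the example URLs are assumptions; actually fetching the pages is left out.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Link texts treated as "events or something similar" (an illustrative assumption).
EVENT_LINK_RE = re.compile(r"\bevents?\b|\bcalendar\b", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a href=...> on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_event_links(page, base_url):
    """Return absolute URLs of links whose text looks like an events page."""
    collector = LinkCollector()
    collector.feed(page)
    return [urljoin(base_url, href)
            for href, text in collector.links
            if EVENT_LINK_RE.search(text)]


if __name__ == "__main__":
    html = '<a href="/about">About</a> <a href="/events">Events</a>'
    print(find_event_links(html, "https://example.org/"))
```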
It would not be a 100% reliable approach but I suspect it would capture a significant number of events and event details. The workflow would flag events that did not have enough detail (e.g. missing a date, location, or title) for review by a human. Once the missing details are filled in, the workflow could incorporate their value (for future NER) and/or their location in the DOM tree. Known DOM trees could be stored for reference when the site is visited again. Over time the process would become more reliable.
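The flagging step itself is simple to express. A small Python sketch, assuming each extracted event is a dict and that title, date, and location are the required fields (the field names are my own choice):

```python
# Fields an event needs before it counts as complete (an assumption for illustration).
REQUIRED_FIELDS = ("title", "date", "location")


def missing_fields(event):
    """List the required fields that are absent or empty in one extracted event."""
    return [field for field in REQUIRED_FIELDS
            if not (event.get(field) or "").strip()]


def split_for_review(events):
    """Separate complete events from those a human should review."""
    complete, flagged = [], []
    for event in events:
        missing = missing_fields(event)
        if missing:
            # Keep the extracted values and record what is missing.
            flagged.append({**event, "missing": missing})
        else:
            complete.append(event)
    return complete, flagged


if __name__ == "__main__":
    extracted = [
        {"title": "Spring Concert", "date": "2015-04-12", "location": "City Hall"},
        {"title": "Data Meetup", "date": "", "location": None},
    ]
    print(split_for_review(extracted))
```

The values a reviewer fills in for flagged events are what would feed back into the dictionaries, patterns, or stored DOM positions.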
The workflow could also use an API to pull data from sites that were known and permitted direct extraction.
Does this sound like a reasonable approach using KNIME and Palladian?
From a technical standpoint, KNIME with the Text Processing and Palladian nodes provides plenty of the building blocks you're looking for. I'll give you some pointers:
- NERs are provided by the Text Processing and the Palladian nodes. In Palladian we have a text-classification-based NER which might be trained to your specific use case (enough training data is required).
- We have specialized extractors for Dates and Locations (DateExtractor + Location nodes). "Location" in this context is more from a geographical standpoint; although the nodes would be able to extract well-known venues such as "Moscone Center", the focus is clearly on geo entities, such as countries, cities, landscapes, … However, the gazetteer database can be extended if you have a database of venues which you want to consider. More details on the Geo nodes are available here.
- You can exploit HTML pages’ DOM by using the Palladian HtmlParser and the XPath nodes. Tasks like identifying links or bold text are simply a matter of providing appropriate XPath expressions.
- The whole vision you're describing sounds more like a stand-alone system to me, involving components like a dedicated long-running crawler, custom extraction logic, a user interface for training/customization, etc. The KNIME way sounds feasible for first steps and a proof-of-concept of the idea ... but the more you take it in a "serious"/"production" direction, the more custom coding will be required.
- If the sources you're considering provide some API, use it :) For example, I've just checked the NYT website and they have an event catalogue. If you analyze the page, you can see that the data is pulled from some JSON API which could be easily exploited (technically -- I'm not considering any legal aspects here :) )
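To illustrate the last point: once a source hands out JSON, turning it into table rows needs no extraction rules at all. The Python sketch below uses a completely made-up payload; the keys ("results", "event_name", "date_time", "venue", "price") are assumptions and do not describe the NYT API or any other real one.

```python
import json

# Made-up response body for illustration only; real APIs will use different keys.
HYPOTHETICAL_PAYLOAD = """
{"results": [
  {"event_name": "Jazz Night", "date_time": "2015-04-12T19:30", "venue": "Blue Room", "price": "free"},
  {"event_name": "Film Talk", "date_time": "2015-04-14T18:00", "venue": null}
]}
"""


def events_from_json(payload):
    """Map each JSON record onto the columns of the event table."""
    rows = []
    for item in json.loads(payload).get("results", []):
        rows.append({
            "title": item.get("event_name"),
            "date": item.get("date_time"),
            "location": item.get("venue"),
            "cost": item.get("price"),  # None when the source omits the field
        })
    return rows


if __name__ == "__main__":
    for row in events_from_json(HYPOTHETICAL_PAYLOAD):
        print(row)
```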
Hope that helps,