In the digital era, where the majority of information is text-based, text mining plays an important role in extracting useful information and uncovering patterns and insights in otherwise unstructured data. In this webinar, Julian Bunzel, data scientist at KNIME, shows how to train your own customized named entity recognition model, how to apply it to extract entities from text, and how to create entity relation networks.
The workflows used in Julian’s demonstration can be accessed here on the KNIME Hub.
If you’d like to read up on the questions asked at this webinar, you’ll find them listed below. Click on the arrow next to a question to see the answer.
Could you please tell us how to export the trained model for later use?
There is a node called the Model Writer that exports the trained model to a file which you can use later. Simply connect the output of the Learner node to the Model Writer node.
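Outside of KNIME, the same pattern the Model Writer node implements — train once, serialize the model to a file, reload it in a later session — can be sketched in plain Python. This is illustrative only; the model object and file name here are made up, and the node names above are the actual KNIME mechanism:

```python
import os
import pickle
import tempfile

# Stand-in "trained model": any picklable object works the same way.
# (Hypothetical dictionary-based tagger, not the StanfordNLP model itself.)
model = {"DRUG": ["aspirin", "ibuprofen"]}

# "Model Writer": serialize the trained model to a file for later use.
path = os.path.join(tempfile.gettempdir(), "ner_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# "Model Reader": load the model back in a later session.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)  # the reloaded model matches the original
```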
How would you proceed if the information is not contained in one location (like PubMed) but distributed in many individual websites and blog posts?
Here’s an example workflow for retrieving data from Wikipedia, though in this case for images: https://kni.me/w/jsdGDh4XCHndFuQg
Do you have examples for grabbing data from pdf and docx files?
Here is one example workflow with a Tika Parser: https://kni.me/w/IvbjJjYIIRMwBvnx
How would you start/proceed in KNIME if you DON'T have a good dictionary at the start for the named entities (which is a common case)?
We have example workflows for text processing that use approaches other than the dictionary-based one. Check this repository on the KNIME Hub for a machine learning based approach and others. And here’s an example workflow for text classification with a neural network.
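To illustrate the idea of a machine-learning-based approach (learning from labeled examples instead of matching against a dictionary), here is a toy bag-of-words perceptron classifier in pure Python. The sentences, labels, and word features are entirely made up and much simpler than what the linked workflows do:

```python
# Toy training data: sentence -> label (1 = mentions a side effect, 0 = not).
train = [
    ("the drug caused severe nausea", 1),
    ("patients reported headache after treatment", 1),
    ("the study enrolled healthy volunteers", 0),
    ("dosage was measured twice daily", 0),
]

def features(text):
    """Bag-of-words features: the set of words in the sentence."""
    return set(text.split())

# Perceptron: one weight per word, updated only on misclassified examples.
weights, bias = {}, 0.0
for _ in range(10):  # a few passes over the tiny dataset
    for text, label in train:
        score = bias + sum(weights.get(w, 0.0) for w in features(text))
        pred = 1 if score > 0 else 0
        if pred != label:
            update = label - pred  # +1 or -1
            bias += update
            for w in features(text):
                weights[w] = weights.get(w, 0.0) + update

def predict(text):
    score = bias + sum(weights.get(w, 0.0) for w in features(text))
    return 1 if score > 0 else 0

print(predict("severe nausea was reported"))  # -> 1
```

The point is that the decision comes from learned weights rather than a hand-curated term list, which is why such approaches help when you don’t have a good dictionary to start from.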
Can you set properties on edges for example weights to improve visualizations?
Were the regular expressions used in the demo in this webinar automatically generated based on the dictionary?
- Yes, the StanfordNLP NE Learner automatically generates the regular expressions to annotate the given documents to train the model.
- Regex-based/Wildcard-based tagging can be done using the Wildcard Tagger node.
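The idea of turning a dictionary into regular expressions and using them to annotate documents can be sketched in Python with the standard `re` module. This is a simplified illustration of the concept, not the StanfordNLP NE Learner's internals; the dictionary and tag format are made up:

```python
import re

# Hypothetical entity dictionary: entity type -> list of terms.
dictionary = {"DRUG": ["aspirin", "ibuprofen"]}

def tag(text, dictionary):
    """Annotate dictionary terms in the text with their entity type."""
    annotated = text
    for entity_type, terms in dictionary.items():
        # One alternation regex per entity type; \b keeps whole-word matches.
        pattern = re.compile(
            r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
            re.IGNORECASE,
        )
        annotated = pattern.sub(lambda m: f"[{m.group(1)}/{entity_type}]",
                                annotated)
    return annotated

print(tag("Take aspirin or ibuprofen for pain.", dictionary))
# -> Take [aspirin/DRUG] or [ibuprofen/DRUG] for pain.
```

Using `re.escape` on each term matters: dictionary entries may contain characters such as `+` or `(` that would otherwise change the meaning of the pattern.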
How would you proceed to detect “white spaces” where no content or products exist? In other words if there is a discrepancy between publications and products.
It’s not entirely clear what is meant by this question, however:
- White spaces are detected by tokenizers, which are used in the different document conversion nodes (String to Document, Tika Parser, etc.).
- If, however, you mean white spaces in the sense of lack of information concerning the products:
- Cleaning up the data is important to guarantee good model quality. Products/drugs with a low number of related publications should not be part of the training data, because we might not get enough sample sentences for them. This can be seen in the preprocessing component in the second of the four workflows: I took the query term used to retrieve the articles, tagged the related articles using the Wildcard Tagger, then created a bag of words and checked whether the query term appeared in it.
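The filtering step described above can be sketched in Python: build a bag of words over each product's related articles and keep only products whose query term actually appears often enough to yield sample sentences. The article texts, the threshold, and the product names here are invented for illustration:

```python
# Hypothetical data: query term (product) -> texts of related articles.
articles = {
    "aspirin": [
        "Aspirin reduces pain.",
        "A study on aspirin dosage.",
        "Aspirin trial results.",
    ],
    "obscurin": ["An unrelated abstract."],  # term never actually appears
}

MIN_MENTIONS = 2  # assumed threshold; tune for your data

def usable_products(articles, min_mentions=MIN_MENTIONS):
    """Keep products whose query term shows up in their articles' bag of words."""
    kept = []
    for term, texts in articles.items():
        # Bag of words over all related articles for this product.
        bow = [w.strip(".,").lower() for t in texts for w in t.split()]
        # Too few actual mentions means too few sample sentences to train on.
        if bow.count(term) >= min_mentions:
            kept.append(term)
    return kept

print(usable_products(articles))  # -> ['aspirin']
```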
If the website adds a new medication name, is this process automatically updated?
The first part of the workflow can be run again to get the latest medication names. To update the medication names automatically, we would need to schedule an execution that compares the content of the site to the revision we currently have and updates it if it has changed. Scheduled execution is only possible with KNIME Server.
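The change check that such a scheduled execution would perform can be sketched in Python by comparing content digests across runs. The page content here is a made-up stand-in for the scraped medication list:

```python
import hashlib

def content_changed(new_content, stored_digest):
    """Compare a page's current content to the stored revision's digest."""
    digest = hashlib.sha256(new_content.encode("utf-8")).hexdigest()
    return digest != stored_digest

# First run: store the digest of the medication list (hypothetical content).
page = "aspirin\nibuprofen"
stored = hashlib.sha256(page.encode("utf-8")).hexdigest()

# Later scheduled run: a new medication was added, so the digest differs
# and the downstream tagging workflow should be re-executed.
updated_page = "aspirin\nibuprofen\nparacetamol"
print(content_changed(updated_page, stored))  # -> True
print(content_changed(page, stored))          # -> False
```

Hashing avoids storing and diffing the full page text; any change in the list, however small, triggers a rerun.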