Currently I’m working on a text mining project with invoices. I’m able to filter values via keywords and extract them using regex. Since I’m a bit of a novice regarding text processing and DL/ML, I need a hint on where to start. For example, the workflow should be able to detect words with a similar meaning, like “Rechnungsbetrag” and “Gesamtsumme”, and then extract the value. I know there is a workflow with a NER model, but I don’t know if this workflow is the one to go for.
Using a NER model could indeed help, but you would have to compile a list of words beforehand, which will later be used for training the NER model. Depending on how diverse the set of words you want to detect is, you might just stick to a Dictionary Tagger or Wildcard Tagger (or keep using RegEx as you did before) with that list of words.
However, you can still train an NER model and check how it performs. It’s pretty straightforward to set it up.
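To make the "list of words plus RegEx" idea a bit more concrete, here is a minimal sketch in plain Python (not KNIME nodes). The keyword list, the assumed German amount format (`1.234,56`) and the function name `extract_total` are only assumptions for illustration; real invoices will likely need more variations.

```python
import re

# Hypothetical keyword list: labels that are treated as "invoice total".
# "Gesamtbetrag" is an extra example, not taken from the thread.
TOTAL_KEYWORDS = ["Rechnungsbetrag", "Gesamtsumme", "Gesamtbetrag"]

# Assumed German number format: optional thousands dots, decimal comma.
AMOUNT = r"(\d{1,3}(?:\.\d{3})*,\d{2}|\d+,\d{2})"

# keyword, optional colon, optional currency marker, then the amount
TOTAL_PATTERN = re.compile(
    r"(?:" + "|".join(map(re.escape, TOTAL_KEYWORDS)) + r")"
    r"\s*:?\s*(?:EUR|€)?\s*" + AMOUNT,
    re.IGNORECASE,
)


def extract_total(text):
    """Return the first amount following one of the keywords, or None."""
    match = TOTAL_PATTERN.search(text)
    if match is None:
        return None
    raw = match.group(1)
    # Convert "1.234,56" to 1234.56
    return float(raw.replace(".", "").replace(",", "."))


print(extract_total("Gesamtsumme: 1.234,56 EUR"))
print(extract_total("Rechnungsbetrag 99,00"))
```

If the list of labels stays small and stable, something like this is usually enough; an NER model starts to pay off once the labels become too varied to enumerate.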
Thanks for your reply. Regarding the list of words, I bet something like “Ust-ID”, “USt-id” and “Ust-id:” is treated as something completely different, so I have to write out every variation?
If you simply want to use a Dictionary Tagger, you would have to list every variation. However, it can be set up as either case sensitive or case insensitive. The Wildcard Tagger gives you a little more flexibility, since you can provide patterns. So this could easily be done with these taggers.
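As a rough illustration of what "case insensitive plus a pattern" buys you, here is a small Python sketch (again outside KNIME, so not the actual Wildcard Tagger syntax). The pattern and the function name `find_ust_id` are assumptions; the value part assumes a German VAT ID of the form `DE` followed by nine digits.

```python
import re

# One case-insensitive pattern instead of a list of spelling variations:
# "ust", optional separator, "id", optional "Nr."/"Nr", optional colon,
# then the value. Assumes German VAT IDs: "DE" followed by 9 digits.
UST_ID_PATTERN = re.compile(
    r"\bust[\s.\-]*id(?:[\s.\-]*nr\.?)?\s*:?\s*(DE\d{9})",
    re.IGNORECASE,
)


def find_ust_id(text):
    """Return the VAT ID following a 'USt-ID'-like label, or None."""
    match = UST_ID_PATTERN.search(text)
    return match.group(1).upper() if match else None


for line in [
    "Ust-ID DE123456789",
    "USt-id: DE123456789",
    "USt-IdNr.: DE123456789",
]:
    print(line, "->", find_ust_id(line))
```

All three spellings are covered by the single pattern, which is the same effect you get from a case-insensitive tagger combined with a wildcard.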
The Stanford NLP NE Learner can also be set up to be case insensitive. So if you want to build a NER model you don’t need to provide every variation of lower case and upper case of a word.