I want to extract job names from a large number of job-offer documents with the help of a local grammar. I have created a local grammar with more than 100 rules, and I wonder what the best way is to apply these rules in KNIME. At first I thought about using the "Java Snippet (simple)" node with several regular expressions, but I don't think this is a good approach given how many regular expressions it would take.
Does anybody know an easier way to apply these rules to the job offers?
The text processing nodes in the KNIME Labs section may be the way to go, in particular the RegEx Filter node. As you have 100 or so rules, a Table Row to Variable Loop Start node together with a Loop End node would be useful, so that each rule is read from a table row and applied in turn.
I have three ideas for dealing with the large number of rules.
Is it possible in KNIME to group local alternatives like [wir suchen|suchen wir|sucht] under one tag such as "SEARCH", or [ab sofort|schnellstmöglich] as "FAST", and then combine these tags like this to simplify applying the rules:
- SEARCH "zur verstärkung unseres Teams" FAST
- SEARCH FAST
- ...
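The tag idea above can be sketched outside KNIME to show what it amounts to. This is only an illustration in Python, not a KNIME feature: the names `TAGS`, `RULES`, `tag_pattern`, `compile_rule` and `matches_any` are made up for the example, and the same approach should carry over to `java.util.regex` in a Java Snippet node. Each tag is expanded into a non-capturing alternation, and each rule is a sequence of tags and quoted literals joined by whitespace.

```python
import re

# Hypothetical tag definitions; the alternatives are taken from the post above.
TAGS = {
    "SEARCH": ["wir suchen", "suchen wir", "sucht"],
    "FAST": ["ab sofort", "schnellstmöglich"],
}

# Hypothetical rules: a token in double quotes is a literal phrase,
# anything else is looked up as a tag name.
RULES = [
    ["SEARCH", '"zur verstärkung unseres teams"', "FAST"],
    ["SEARCH", "FAST"],
]


def tag_pattern(name):
    """Expand a tag into a non-capturing alternation, longest alternative first."""
    alternatives = sorted(TAGS[name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(a) for a in alternatives) + ")"


def compile_rule(rule):
    """Turn a sequence of tags and literals into one compiled regular expression."""
    parts = []
    for token in rule:
        if token.startswith('"'):
            parts.append(re.escape(token.strip('"')))
        else:
            parts.append(tag_pattern(token))
    # Tokens are separated by one or more whitespace characters.
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


COMPILED = [compile_rule(rule) for rule in RULES]


def matches_any(text):
    """Return True if at least one rule matches somewhere in the text."""
    return any(pattern.search(text) for pattern in COMPILED)
```

With this layout, adding a new alternative to a tag changes every rule that uses it, so the rule list itself stays short and does not have to be enumerated by hand.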
Or is it possible in KNIME to pass a document from one regular expression to the next until the job name is recognized?
Or is it possible in KNIME to apply a type-2 (context-free) grammar to the job offers by using a specific parser?
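Passing a document from one expression to the next could look roughly like the following. Again this is a hedged Python sketch rather than anything KNIME provides out of the box: the two patterns, the named group `job` and the function `extract_job` are assumptions made up for the illustration, and real rules would come from the grammar. The rules are tried in order, from most to least specific, and the first one that matches decides the result.

```python
import re

# Hypothetical rules, ordered from most specific to least specific.
# Each rule captures the candidate job name in a named group "job".
RULES = [
    re.compile(
        r"(?:wir suchen|suchen wir|sucht)\s+(?:ab sofort|schnellstmöglich)\s+"
        r"(?:(?:einen|eine|ein)\s+)?(?P<job>[\w-]+)",
        re.IGNORECASE,
    ),
    re.compile(
        r"(?:wir suchen|suchen wir|sucht)\s+"
        r"(?:(?:einen|eine|ein)\s+)?(?P<job>[\w-]+)",
        re.IGNORECASE,
    ),
]


def extract_job(text):
    """Try each rule in turn and return the first captured job name, or None."""
    for rule in RULES:
        match = rule.search(text)
        if match:
            return match.group("job")
    return None
```

The order matters here: if the general rule ran first on "Wir suchen ab sofort einen Softwareentwickler", it would capture "ab" instead of the job name, which is why the more specific rule is placed before it.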
As Simon suggests, you may be able to use the text processing extension in KNIME Labs. Specifically, the Dict Replacer might help you parse your grammar but I personally don't have any experience with this. If you post some example data I'd be willing to give it a try.
As far as I know there isn't a grammar parser, but if you have one (NooJ?) we may be able to wire it up to KNIME using a Java Snippet node.
It would be great if you could give it a try with the Dict Replacer and show me an example of how to use it.
NooJ lets you model a grammar in a graphical user interface and then generate the language with all its rules. Unfortunately, NooJ crashes after generating 5,000 rules when I try to do this with my grammar (http://share-me.de/grammar.png, http://share-me.de/grammar-nooj.nog). 5,000 rules are also too many to maintain, which is why I am searching for another way to use the grammar in KNIME.