Strange selection behaviour

Hi all,

I have written a workflow to extract the names of bacteria from a large dataset obtained from PubMed. The bacteria names are extracted correctly, but some chemical names (not all of them) are selected as well.

Doc grabber > POS tagger > dictionary tagger > Bag of words > [preprocessing] > Reference Row filter > etc

The Dictionary Tagger and the Reference Row Filter both read a list of bacteria names. In the Dictionary Tagger, the bacteria names are assigned the tag type POS and the tag value UNKNOWN.
The result of the tagging and preprocessing, as displayed in the Reference Row Filter, is the following:
 

1-alkyl-3-methyl-imidazolium[UNKNOWN(POS)]

1-decyl-3-methyl-imidazolium[JJ(POS)]

My question is: why do the tagging results differ for these two chemical names? The name 1-alkyl-3-methyl-imidazolium is not present in the list of bacteria names.

Kind regards,

Ernst

Hi Ernst,

POS tagging is based on a model (OpenNLP). The model can assign different tags to the same word depending on its position in the sentence. A word can, for example, be a noun in one sentence and a verb in another. In your case the POS model tagged the word 1-decyl-3-methyl-imidazolium as JJ (adjective) in one sentence, based on the preceding POS tags. In the other sentence the context was no longer clear, so UNKNOWN was assigned.

To distinguish tags assigned by the POS tagger from tags assigned based on your dictionary, I recommend assigning a different tag in the Dictionary Tagger. Select Pharma as the type and an appropriate value from the Pharma tag set.

Filtering is easier when the dictionary matches carry a tag type of their own.
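To illustrate why a separate tag type helps, here is a minimal Python sketch (not KNIME code) that parses terms in the displayed form `word[VALUE(TYPE)]` and keeps only those carrying a given tag type. The helper names `parse_term` and `has_tag_type`, the example term `Escherichia coli`, and the tag value `SOME_VALUE` are hypothetical placeholders; pick a real value from the Pharma tag set in the node dialog.

```python
import re

# One tag in the displayed form "[VALUE(TYPE)]", e.g. "[JJ(POS)]".
TAG = re.compile(r"\[([^\[\]()]+)\(([^\[\]()]+)\)\]")
# One or more such tags at the very end of the displayed term.
TAIL = re.compile(r"(?:\[[^\[\]()]+\([^\[\]()]+\)\])+$")


def parse_term(term):
    """Split 'word[VALUE(TYPE)]...' into (word, [(value, type), ...])."""
    match = TAIL.search(term)
    if match is None:
        return term, []
    word = term[:match.start()]
    return word, TAG.findall(match.group(0))


def has_tag_type(term, tag_type):
    """True if any tag on the term has the given tag type."""
    _, tags = parse_term(term)
    return any(found_type == tag_type for _, found_type in tags)


terms = [
    "1-alkyl-3-methyl-imidazolium[UNKNOWN(POS)]",  # from the question
    "1-decyl-3-methyl-imidazolium[JJ(POS)]",       # from the question
    "Escherichia coli[SOME_VALUE(PHARMA)]",        # hypothetical dictionary hit
]

# With a dedicated tag type, only dictionary matches pass the filter.
dictionary_hits = [parse_term(t)[0] for t in terms if has_tag_type(t, "PHARMA")]
print(dictionary_hits)
```

If the dictionary matches were tagged UNKNOWN(POS) instead, a filter on that tag would also let through every word the POS tagger itself could not classify, which matches the behaviour described in the question.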

Cheers, Kilian