Example of text processing of requirements documents - related requirements, categories, etc

I am looking for an example KNIME workflow to process a number of requirements documents. My objective is to take a number of requirements documents that each have a number of columns. These columns would be similar to category, description, requirements source, notes, priority, etc.

What I would like to do is provide a number of methods to access and relate the requirements based on keywords, source, etc. So, for example, if I have requirements that contain the phrase “display a map”, I would like to see the relations to the sources, other requirements with the same phrase, and others in the same category.

I would appreciate any assistance or links to an example or tutorial.

Thank you

Unfortunately I am not clear what you are after, but there are a number of example workflows which use the text mining facilities of KNIME. These are available from the KNIME Example Workflow Server pane on the right-hand side. Connect to this, and the text processing workflows are under category 9. There are 4 you can use to see how the nodes work.

What is your source file: a table such as XLS or CSV, or a document like PDF or DOC? They can all be loaded and processed, but each is handled differently.

You can identify keywords using the tagger nodes in the KNIME Labs/Text Processing/Enrichment node repository. If there are specific phrases and words you want tagged as part of your documents, then you can specify a predefined list using a txt file which you load into the second port of the Dictionary Tagger node.

There is more info at http://tech.knime.org/knime-text-processing-0

Thanks,

Simon.

Simon,

I will try your suggestions on the enrichment nodes and the Dictionary Tagger.

Thank you for the response. I will spend some more time with the examples, but I wanted to answer your questions as to what I am after.

I have a number of requirements documents in Excel, which I export to a tab-delimited file to import into KNIME. I have gotten some of the desired results with the Table View and Pie Chart, but they could be better. The columns are:

Category – e.g. System Administration, Hardware, Virtualization, etc.

Priority – 1..n

Capability – This is the text of the requirement, e.g. "The system shall utilize commodity hardware." or "The system shall provide a common map widget." Note: These are the decomposed requirements.

Requirements Source – The source document for the requirement.

Notes – Free text.

There are other columns, but they are not relevant.

What I am trying to do is use KNIME to show the relationships between, primarily, the categories, the phrases in the capability statement, and the requirements source.

For example, if the capability column has the phrase “common map”, I would like to know:

  • how many requirements have that phrase, and then be able to show a list and drill down
  • how many categories have the phrase “common map” and be able to drill down
  • how many of the source documents have that phrase and be able to drill down

I would like to either set up or extract the needed phrases to drive the other aspects.

I am using this as a test project for larger data sets that are similar to above but with many more columns and details.

Thank you.

What you want to do then is load in your CSV or XLS file using the XLS Reader or File Reader node.

You may want to group your documents first according to the Category. So use the GroupBy node, choose to Group By the Category column, and in the aggregation section, choose to Concatenate the Capability column. This way, you can have all the Capability text associated with the Category.
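For anyone who wants to see the logic of that grouping step outside KNIME, here is a minimal plain-Python sketch of "group by Category, concatenate Capability". It is only an analogue of what the GroupBy node produces, not KNIME code. The column names come from this thread; the sample rows, the "Mapping" category, and the helper name `concatenate_by` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; column names follow the thread, values are invented.
rows = [
    {"Category": "Hardware", "Capability": "The system shall utilize commodity hardware."},
    {"Category": "Mapping", "Capability": "The system shall provide a common map widget."},
    {"Category": "Mapping", "Capability": "The system shall display a map."},
]

def concatenate_by(rows, group_col, text_col, sep=" "):
    """Group rows by group_col and join the text_col values of each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_col]].append(row[text_col])
    # One concatenated text per group, in the order the rows were read.
    return {key: sep.join(texts) for key, texts in grouped.items()}

by_category = concatenate_by(rows, "Category", "Capability")
```

Each key of `by_category` then plays the role of one "document" whose title is the category and whose full text is all of that category's capability statements.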

Then use the "Strings to Document" node in KNIME Labs/Text Processing.

In the node config, choose to have the Title as the Category column, and the Full Text as the Capability column. This way we can relate the words or phrases to your category column later.

You now need to tag the appropriate words in the document. You can do this with the POS Tagger, which tags words by "Part of Speech", such as nouns, verbs, etc. However, if you want to identify specific phrases like "Common Map", then you are best off using the "Dictionary Tagger" node and supplying a hand-written text file of the phrases you want tagged into the second port. Select the tag type you wish to apply to the words or terms, and remember it for later.

To pull out all of the words, use the "Bag of Words" node. Words or phrases will have a tag if they matched your list in the Dictionary Tagger. If you like, you can then filter the bag of words down to just your tagged words using the Standard Named Entity Filter node, choosing the tag type you specified earlier as the one to filter for.
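The idea of tagging against a phrase list and then keeping only the tagged terms can be sketched in plain Python as below. This is not how the KNIME nodes are implemented; it is a rough illustration that assumes case-insensitive, whole-word matching (my choice for the sketch, not a statement about the node's settings). The phrase list and the function name `tag_phrases` are hypothetical.

```python
import re

# Hypothetical phrase list, standing in for the text file given to the Dictionary Tagger.
phrases = ["common map", "commodity hardware"]

def tag_phrases(text, phrases):
    """Return the dictionary phrases found in text (case-insensitive, whole words)."""
    found = []
    for phrase in phrases:
        # re.escape keeps any special characters in the phrase literal.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(phrase)
    return found

text = "The system shall provide a Common Map widget."
tagged = tag_phrases(text, phrases)
```

Only phrases from the list survive, which mirrors filtering the bag of words down to the terms carrying your chosen tag.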

If you wish, you can also use the TF node to calculate how frequently each remaining word or phrase appears in each document, 1 being very high and 0 being very low. If you then use the "Term to String" node to get all your phrases into a string cell, and the "Document Data Extractor" to pull out the title, you will have all the phrases with their frequency and the category they came from. You can then use a graphing node to display the frequency of each phrase for each category.
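To give a feel for a frequency between 0 and 1, here is a small Python sketch that assumes relative frequency means "occurrences of the phrase divided by the number of words in the document". That definition is an assumption for illustration; the TF node's exact normalisation may differ. The function name `relative_frequency` and the sample text are made up.

```python
import re

def relative_frequency(phrase, text):
    """Occurrences of phrase divided by the number of words in text."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    if not words:
        return 0.0  # avoid dividing by zero on empty documents
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    hits = len(re.findall(pattern, lowered))
    return hits / len(words)

# Invented example document: 13 words, 2 occurrences of the phrase.
doc = "The system shall provide a common map widget. The common map shall zoom."
tf = relative_frequency("common map", doc)
```

Computing this per category document gives one number per phrase and category, which is the shape of data a graphing node would plot.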

You may also want to duplicate the branch above, using the Requirements Source column in place of the Category column, so you can analyse the sources in the same way.
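Put together, the three questions from the original post (which requirements, which categories, and which sources contain a phrase) boil down to one filter followed by two distinct-value lookups. The following Python sketch shows that logic under simple assumptions: plain case-insensitive substring matching, invented sample rows, and a hypothetical helper called `drill_down`. It is an illustration of the idea, not a KNIME workflow.

```python
def drill_down(rows, phrase, text_col="Capability"):
    """Return matching rows plus the distinct categories and sources they touch."""
    needle = phrase.lower()
    matches = [r for r in rows if needle in r[text_col].lower()]
    return {
        "requirements": matches,
        "categories": sorted({r["Category"] for r in matches}),
        "sources": sorted({r["Requirements Source"] for r in matches}),
    }

# Invented rows; column names follow the thread.
rows = [
    {"Category": "Mapping", "Capability": "The system shall provide a common map widget.",
     "Requirements Source": "Doc A"},
    {"Category": "Hardware", "Capability": "The system shall utilize commodity hardware.",
     "Requirements Source": "Doc A"},
    {"Category": "Visualization", "Capability": "The common map shall support layers.",
     "Requirements Source": "Doc B"},
]

result = drill_down(rows, "common map")
```

The lengths of the three entries in `result` are the counts, and the entries themselves are the lists to drill into.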

To be really fancy, you can get KNIME to automatically generate the graphs for you using the BIRT reporting facilities, but this requires quite a bit more learning.

Hope this helps and is towards what you are trying to achieve,

Simon.

Simon,

Thank you very much. This has gotten me to about an 80% solution so far and helped identify the patterns I need to take the idea further in the tool. I need to explore the sequence and the concepts you have laid out for the individual nodes here.

V/R,

Don