Help configuring SVM for abstract screening:

Hello, I am new to KNIME and would like to ask some of the more experienced users if they can help me understand how to configure the SVM in KNIME.  

I work on projects that depend on selecting the correct medical abstracts for a given report, and I would like to reduce the screening workload by automatically rejecting some of the records based on a training set.  That training data would usually consist of about 50 known useful records and approximately the same number of known non-useful records.  I would like to feed this training set to the SVM and then get a report or some other output that ranks the remaining (non-training) documents by potential relevance.
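
To make the goal concrete, here is a minimal sketch of the ranking idea in plain Python, outside KNIME. It is not what the KNIME SVM Learner does internally: the training routine is a simple Pegasos-style sub-gradient linear SVM, and the example texts and all function names are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Lower-case, whitespace-tokenised term counts (no stemming, no stop words)."""
    return Counter(text.lower().split())


def train_linear_svm(feature_dicts, labels, epochs=50, lam=0.01):
    """Pegasos-style sub-gradient descent on the hinge loss; labels are +1 / -1."""
    weights = {}
    step = 0
    for _ in range(epochs):
        for features, y in zip(feature_dicts, labels):
            step += 1
            eta = 1.0 / (lam * step)  # decaying learning rate
            margin = y * sum(weights.get(t, 0.0) * c for t, c in features.items())
            # Regularisation: shrink every weight by (1 - eta * lam) = 1 - 1/step.
            for term in weights:
                weights[term] *= 1.0 - 1.0 / step
            # Hinge loss is active only when the example is inside the margin.
            if margin < 1.0:
                for term, count in features.items():
                    weights[term] = weights.get(term, 0.0) + eta * y * count
    return weights


def score(weights, text):
    """Signed score of a text: positive leans 'include', negative 'exclude'."""
    return sum(weights.get(t, 0.0) * c for t, c in bag_of_words(text).items())


def rank_by_relevance(weights, texts):
    """Sort unscreened texts from most to least likely to be included."""
    return sorted(texts, key=lambda text: score(weights, text), reverse=True)


# Hypothetical screened records (stand-ins for real abstracts).
included = ["randomized trial statin therapy", "statin therapy reduces cholesterol"]
excluded = ["survey hospital parking", "hospital parking fees survey"]
unscreened = ["parking survey results", "statin trial outcomes"]

training = [bag_of_words(t) for t in included + excluded]
labels = [1] * len(included) + [-1] * len(excluded)
weights = train_linear_svm(training, labels)
ranking = rank_by_relevance(weights, unscreened)
```

Sorting by the signed score rather than by the predicted class is what turns a yes/no classifier into a ranked reading list.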

I'm attempting to use the 009001 document classification example but am running into errors after the 'Category to Class' node.  The Color Manager gives me the error "Column document class has no nominal values set".  I looked at the output table and it seems similar to the one in the example, except that I have one input rather than two table readers.  Should I separate my data somehow to get this to work? At present I have a category column prepared by an expert, with values 1, 0 or missing (1 being an included record, 0 an excluded record and ? a missing value).  Ideally this small set of reviewed records could then be used to tell the SVM which of the others are likely to be included.

Can someone please try to walk me through the next steps, or suggest another approach (and perhaps an example)?

Thanks for any help. I've been trying to figure this out for hours.

Bevos

Have you tried the Domain Calculator? That can probably solve the nominal values problem. (You may also have to convert your training class column to String before the domain calculation.)
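
To illustrate what the error is about: a nominal column needs a known list of possible values before a node such as the Color Manager can use it. Here is a rough sketch of that idea in plain Python. This is not KNIME code, and the column values and the helper name are made up. It converts the class values to strings and collects the distinct non-missing ones.

```python
def nominal_domain(values, missing="?"):
    """Return the sorted distinct non-missing values of a column, as strings."""
    return sorted({str(v) for v in values if v is not None and str(v) != missing})


# Hypothetical expert labels: 1 = include, 0 = exclude, "?" or None = missing.
include_column = [1, 0, "?", 1, None, 0]
domain = nominal_domain(include_column)
```

Here `domain` contains only the two class labels as strings, which is the kind of value list a colour assignment needs.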

Hi Aborg, thank you for the post.  I tried the method you described, but I think I have a bigger problem than I originally thought.  I've attached the workflow here for reference.

At the start of the workflow, the IO node reads a tab-delimited file containing a few columns (ID, Author, Year, Title, Abstract, Journal, and Include).  In the Include field I have already screened a handful of records as either include or exclude; for the majority the value is missing.  I would like to preserve these fields for use in the workflow, specifically so that I can tell the SVM Learner to predict 'Include' by comparing the BoW content of the 'includes' with that of the 'excludes'.
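
As a sketch of the split I have in mind, written in plain Python rather than KNIME: the column names follow my file, but the sample rows and the function name are made up. Screened rows become training data and everything else is left to be ranked.

```python
import csv
import io

COLUMNS = ["ID", "Author", "Year", "Title", "Abstract", "Journal", "Include"]

# Hypothetical sample rows standing in for the real tab-delimited export.
ROWS = [
    ["1", "Smith", "2010", "Statin trial", "statin therapy trial", "J Med", "include"],
    ["2", "Jones", "2011", "Parking survey", "hospital parking fees", "J Admin", "exclude"],
    ["3", "Lee", "2012", "Statin outcomes", "statin outcomes study", "J Med", ""],
]
SAMPLE_TSV = "\n".join("\t".join(row) for row in [COLUMNS] + ROWS) + "\n"


def split_by_label(tsv_text, label_column="Include"):
    """Separate screened rows (training) from unscreened rows (to be ranked)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    training, to_rank = [], []
    for row in reader:
        label = (row.get(label_column) or "").strip().lower()
        if label in ("include", "exclude"):
            training.append(row)
        else:
            to_rank.append(row)  # missing label: the classifier should score it
    return training, to_rank


training, to_rank = split_by_label(SAMPLE_TSV)
```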

However, after I use the BoW node I get a two-column output.  I looked at some of K. Thiel's posts, and this seems to be working as intended; he states that the nodes must be in the proper order.  Am I doing things the wrong way (as shown by the attached workflow)?

Does anyone know how best to accomplish this?

Thank you again for any help and I'm enjoying learning this wonderful utility,

Bevos

Hi Bevos,

The class information, which in your case is "include" or "exclude", should be set as the category information of the documents. You can do this by specifying the related column as the category column in the Strings to Document node dialog. This category (or class) information can be extracted later with the Category to Class node to get back your target column for the SVM Learner.

Attached is a small example workflow that shows how to insert nominal class information as the category and extract it again later on.
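
The idea can also be sketched in a few lines of plain Python. This is only a conceptual model, not the KNIME API: the `Document` class and the three functions are hypothetical stand-ins named after the nodes. It shows why the class survives the two-column bag of words when it travels inside the document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    text: str
    category: str  # the class label is carried inside the document


def strings_to_document(row, category_column="Include"):
    """Build a document and store the chosen column as its category."""
    return Document(row["Title"], row["Abstract"], row[category_column])


def bag_of_words(documents):
    """Two-column output: one (term, document) pair per distinct term."""
    pairs = []
    for doc in documents:
        for term in sorted(set(doc.text.lower().split())):
            pairs.append((term, doc))
    return pairs


def category_to_class(pairs):
    """Append the category stored in each document as a separate class column."""
    return [(term, doc, doc.category) for term, doc in pairs]


# Hypothetical input rows.
rows = [
    {"Title": "A", "Abstract": "statin trial", "Include": "include"},
    {"Title": "B", "Abstract": "parking survey", "Include": "exclude"},
]
documents = [strings_to_document(row) for row in rows]
bow = bag_of_words(documents)
with_class = category_to_class(bow)
```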

Cheers, Kilian