Guided Labeling for Document Classification

This workflow defines a fully automated web-based application that labels your data using active learning. It was designed so that business analysts can easily go through documents and label them with any number of classes. In each iteration the user labels more documents and the model is retrained on the instances labeled so far. With every new iteration, the model proposes the documents it is most uncertain about, as scored by the entropy scorer node. Once the user is happy with the performance achieved with the available labels, they can exit the loop and export the model to label the remaining instances.
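
For readers who want to see the idea behind entropy-based selection outside of KNIME, here is a minimal Python sketch. It is not the implementation of the entropy scorer node; the function names and the toy probabilities are made up for illustration. It only shows the general principle: the flatter the predicted class distribution of a document, the higher its entropy, and the sooner it is proposed for labeling.

```python
import math

def prediction_entropy(probabilities):
    """Shannon entropy (in bits) of one predicted class distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def most_uncertain(predictions, k):
    """Return the ids of the k documents whose predictions have the highest entropy.

    `predictions` maps a document id to its list of class probabilities.
    """
    ranked = sorted(
        predictions,
        key=lambda doc_id: prediction_entropy(predictions[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy model output for three unlabeled documents (made-up numbers).
predictions = {
    "doc_a": [0.98, 0.02],  # model is confident
    "doc_b": [0.55, 0.45],  # model is nearly undecided
    "doc_c": [0.80, 0.20],
}

to_label_next = most_uncertain(predictions, k=2)
print(to_label_next)
```

In an active learning loop, the documents returned here would be shown to the user, and the model would be retrained once they are labeled.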


This is a companion discussion topic for the original entry at https://kni.me/w/y5nhpbd1PP5F4WKH

Hello everyone,
I think it makes sense to post some screenshots of the views of this workflow.
As you can see they were taken from a web browser using the KNIME WebPortal:

You can use Tag Cloud terms to quickly filter for documents that should all receive the same label.
This screenshot is about labeling for sentiment analysis of movie reviews.

The workflow also works for multiclass classification, for example topic detection:

I am hiding a few words in the screenshots to be politically correct :wink:

If you have any questions let me know!
Cheers
Paolo

Hi,
From the screenshots it looks like a really useful utility for labeling data, so I downloaded the workflow and tried to execute it, but I am running into problems. Maybe you could help me.
No dialog box and none of the views from your screenshots pop up, and I get warnings. I installed all the plugins that KNIME asked me to install to run the workflow. The execution ends with a red cross on the Deploy node showing the error “XGBoost Linear Ensemble Learner 2:357:0:311 The selected target column is no longer valid. Please select a valid column in the dialog.”
Please let me know what I am doing wrong as I would really like to use this workflow.

Some of the initial warnings that I get are given below:

WARN Rule-based Row Filter 2:365:1167:436:432 Line: 1: Not a column: Row0
$Row0$ = $TF abs$ => TRUE
^

Hi @junejo,
to use this workflow you need to execute it iteration by iteration, labeling more and more documents. To get the full potential of this workflow, deploy it on the KNIME WebPortal, which unfortunately comes with KNIME Server and is not yet integrated in the KNIME Analytics Platform.

To troubleshoot the workflow and still label documents, however, start by right-clicking the “Label” Component, executing it and opening its view. Type in some label abbreviations and select “Apply and Close” at the bottom right corner of the view. Do not execute the Loop End node just yet. Instead, select the Loop End node and click this button on the top toolbar of the KNIME Analytics Platform twice:

You should see the workflow performing one iteration, and you should be able to open the view of the second iteration with the tag cloud present.

Read this article to know more:

Cheers
Paolo

Hey everyone,
this workflow was just updated to support both Active Learning and Weak Supervision with the new 4.1 Extensions.

Highlights:

The new Active Learning Loop is quite similar to the good old Recursive Loop, just more intuitive to use.

Use Labeling Functions (or Rules) to label documents (example: if “great movie” in document body then sentiment label : “good”)

Labeling View added. To label your instances you now only need buttons; the Table Editor is no longer required.

Visualize the Labeling Functions in a Network View to measure how well they are correlated with the label provided by the user.

Train an XGBoost model using the probabilistic output of the Weak Label Predictor node.
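
To make the labeling function idea from the highlights more concrete, here is a small Python sketch. It is only an illustration and not how the Weak Label Predictor node works internally: it simply turns the votes of the rules into vote fractions, whereas the node fits a model over the rule outputs. The second rule (`lf_waste_of_time`) is a hypothetical addition; only the “great movie” rule comes from the example above.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it does not apply to a document

def lf_great_movie(text):
    """Rule from the post: 'great movie' in the document body means 'good'."""
    return "good" if "great movie" in text.lower() else ABSTAIN

def lf_waste_of_time(text):
    """Hypothetical second rule, added only for illustration."""
    return "bad" if "waste of time" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [lf_great_movie, lf_waste_of_time]

def weak_label_distribution(text):
    """Return the fraction of non-abstaining rules voting for each label.

    An empty dict means no rule fired, so the document stays unlabeled.
    """
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return {}
    counts = Counter(votes)
    return {label: n / len(votes) for label, n in counts.items()}

print(weak_label_distribution("What a great movie!"))
print(weak_label_distribution("A great movie for some, a waste of time for me."))
```

A probabilistic output of this kind (one weight per class and document) is what a downstream classifier could then be trained on.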

Hi @paolotamag, thank you very much for this great contribution! I have deployed the workflow on the server and uploaded my own CSV file with text to be labelled. The workflow ends with an error because the CSV Reader is not able to find the uploaded file. What might be wrong?

Hi @pzkor,
Please let us know:

  • KNIME Analytics Platform version
  • KNIME Server version
  • whether you are using the old or the new WebPortal (the two can be told apart by the page header)
  • how large your CSV file is
  • the column names in your CSV file

This can help us reproduce the issue.

Thank you @paolotamag. I am using Desktop version 4.2.2, Server 4.11.1 and the new WebPortal. The file has two string columns, Title and Text, and consists of 17,000 records (3.2 MB).

hi @pzkor thanks for letting us know.
We will now try to reproduce it and get back to you asap.

In the meantime please use the old webportal if available on your server and let us know if the issue is also happening there!

To do so simply open the workflow by removing the /webportal/space/ from the URL.

Cheers
Paolo

Hi @pzkor,
I was not able to reproduce your issue with this dataset on our internal server running the same version. I uploaded a similar csv file and it worked fine.
Can you please share with us a screenshot of the output flow variable from the file upload widget before the CSV Reader? That would help us tell which team has to look into this (frontend vs backend).

To take the screenshot, go to the first Component in the workflow, locate the File Upload node, then right-click it and open its output. Finally, take a screenshot of the window that opens. There are two File Upload nodes, one for labels and one for documents, so make sure to locate the one that uploaded your file. That is, trigger the error first, then move to your KNIME Server remote view and locate the node and its output.

Thank you
Cheers
Paolo

Hi @paolotamag, apologies for the delay in coming back.

I finally had the chance to investigate the issue a bit. The problem seems to have stemmed from the encoding of my CSV files. After changing it from the OS default windows-1252 to ISO-8859-1, my text file, labels and weak rules all load successfully. The only minor issue was that I had used the numbers 0-4 as the class abbreviations, and they were interpreted as integers when loaded by the weak learning process. After changing those into alphabetic strings the upload was successful.
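
In case it helps others hitting the same two problems, both fixes can be scripted before uploading. The following Python sketch is only an assumption about how one might do it; the helper names `reencode` and `non_numeric_labels` and the `class_` prefix are made up. Note that ISO-8859-1 cannot represent every windows-1252 character (curly quotes or the euro sign, for example), so those are replaced with `?` here.

```python
def reencode(data: bytes, source="windows-1252", target="iso-8859-1") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target.

    Characters that do not exist in the target encoding become '?'.
    """
    text = data.decode(source)
    return text.encode(target, errors="replace")

def non_numeric_labels(labels, prefix="class_"):
    """Prefix purely numeric class abbreviations so they are read as strings."""
    return [prefix + label if label.isdigit() else label for label in labels]

# The euro sign is byte 0x80 in windows-1252 and has no ISO-8859-1 equivalent.
print(reencode(b"price: 5 \x80"))

# Numeric abbreviations such as 0-4 become class_0 ... class_4.
print(non_numeric_labels(["0", "1", "2", "3", "4"]))
```

For real files you would read the bytes from the CSV, pass them through `reencode`, and write the result to a new file that is then uploaded.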

The only remaining issue seems to be with the layout. I get the following warning in the latest version of the Edge browser, but it doesn’t seem to affect the functionality:
“Sorry, a problem occurred:
Node: 334:0:1185:0:932 (Header)
Message: Error in script
TypeError: Cannot read property ‘style’ of null”

Many thanks for your help and again for the really useful contribution!

Oh, and the colours of the labels don’t seem to match between the help table and the active learning prompt. The user won’t know whether to trust the colours or the class abbreviation.

Hi @pzkor,
regarding the labels I am sorry about that. It looks like a bug in the workflow design.
I will make sure it gets fixed and provide you a quick fix here in this thread before we even update the workflow on the hub.

Regarding the header not appearing: this is related to the fact that your new WebPortal is blocking it because of a security setting. I already know the fix. Same thing here: I will update the workflow on the hub, but before I do that I will provide the fixed workflow in this thread. This issue should not happen if you use the old WebPortal.

Sorry about those issues and I will get back to you with the fixes asap.

Cheers
Paolo

Hello again,
so I fixed the header here:
01_Guided_Labeling_for_Document_Classification-FIXED-HEADER.knwf (2.5 MB)

It should now also display correctly on the new KNIME WebPortal.

Regarding the label buttons, I am afraid you found a bug/missing feature. The labeling view node refuses to sort the buttons according to anything I can understand (table domain or alphabetical order). For this reason I will open a ticket and get to the bottom of this.

What you can do for now is hide the legend so that it does not confuse the user.
To do this delete all those nodes so that from this:

you then attach the flow variable to the tag cloud instruction like this:

In this case you will need the button abbreviations to be a bit more understandable than what you currently have!

Cheers
Paolo

Thanks very much @paolotamag!

Hey @pzkor,

I had the same issue a few weeks ago and found a workaround.
You need to create a flow variable string array with the labels in order and set this variable for the possibleValues option in the flow variable tab of the Labeling view node.

For example, you could use the Table Creator node and write each class abbreviation in one column (in the correct order). Afterwards use the Create Collection Column node to create a String array and then the Table Row to Variable node to convert the String array into a flow variable. Now you can connect the flow variable port to the flow variable port of the Labeling View node and set the correct configuration in the flow variable tab.

I hope this helps. :slight_smile:

Best,
Julian

@pzkor I will update you soon with a workflow using @julian.bunzel fix.
Cheers
Paolo

We just updated KNIME Hub with a new workflow that fixes both the layout header for the new WebPortal and the order of the labeling buttons.

Please be advised that on the new webportal we are fixing the following issue:

The issue affects the Label component view that is displayed while you are in the recursive human-in-the-loop. Upon clicking next, it might happen that the page does not refresh and looks like this instead of updating with a retrained model and more documents to be labeled:

However, if “next” is clicked again the page will reload correctly. You basically have to click next twice. This might affect the workflow in the new WebPortal depending on your KNIME Server version, but the old WebPortal works fine in any case. We will fix this soon and we apologize for the inconvenience.

Cheers
Paolo

The issue that the new KNIME WebPortal shows when clicking Next within the human-in-the-loop view of Guided Labeling (and also mentioned in my last post) has been fixed. This fix should be published in the next KNIME Server Bug Fix release (KNIME Server 4.11.3 that will go out in the beginning of November). We apologize for the inconvenience you might have experienced.