Identify Coding Outliers Using Unsupervised Method

I need an advice. For the file like:
Specialty1, Provider1, Code1, Code_Frequency
Specialty1, Provider1, Code2, Code_Frequency
Specialty1, Provider1, Code3, Code_Frequency

Specialty1, Provider2, Code1, Code_Frequency

I’m looking for outliers who used too many Codes or too little codes for specified Specialty. Number of codes could be 100 per specialty and about 400 providers
I used DBScan for some close task but just 10 codes.
It would be nice to see an example with data preparation as some methods require
specific input format.
Also, I need a solution using just KNIME nodes (no R, Pyton)

Thank you
Igor

Hi Igor -

Maybe I’m oversimplifying your use case, but the first thing that came to mind was the Numeric Outliers node. It would allow you to group by specialty inherently. I know it’s a straight statistical approach as opposed to unsupervised clustering, but maybe it’s worth considering?

You can see it used in this workflow, along with DBSCAN and a couple of other methods.

2 Likes

Thank you, Scott. Will play with this.

Now I experiment with Numeric outliers node and got question if I do it correct and how to get some statistics.
Firs, I transformed input file like
Specialty Provider Code1, Code2, … CodeN
Specialty1, Provider1, Frequency1, Frequency2… FrequencyN
Specialty1, Provider2, Frequency1, Frequency2… FrequencyN
Specialty1, Provider3, Frequency1, Frequency2… FrequencyN

Specialty2, Provider4, Frequency1, Frequency2… FrequencyN

I need to get a list for every specialty combination of provider with codes they use as outliers.
On configuration I use Specialty as a group and checked for Compute statistics on groups.
Currently, I got a list of all provider (if all of them outliers?) and list of codes marked as outliers but without connection to specific providers. Say, Code14 has 2 outliers in a range. Also, some lower bounds are negative?
Please, correct me on what I do wrong.

Hi Igor -

I put together a dummy dataset and sample workflow for you. I kept the data in your original format, prior to transposing. There are 2 specialties, with 10 providers and 2 code IDs for each. Here’s the abbreviated structure:

2020-01-14%2012_34_34-Output%20table%20-%200_6%20-%20Excel%20Reader%20(XLS)

The workflow takes this data, calculates outliers, and also presents box plots for each combination of specialty and code:

2020-01-14%2012_39_46-KNIME%20Analytics%20Platform

In the end, we have a conditional box plot that shows the three outliers I included in the data, indicated by the X marks:

You will see that these outliers match up with those identified by the Numeric Outliers node:

2020-01-14%2012_42_31-Treated%20table%20-%200_10%20-%20Numeric%20Outliers%20(Calculate%20and)

If you want a count of the outliers by specialty and code, along with the lower and upper bounds calculated for each, you can get those from the 2nd output port:

Lower negative bounds are not uncommon - you can see how they are calculated in the node description. It just means you have a lot of values in your group (relatively) close to zero.

Since we didn’t have a simple Numeric Outliers workflow on the Hub, I’ve uploaded this one here for those who may stumble across this thread:

Hope this helps - let me know if you have any more questions. Cheers!

3 Likes

Thank you @ScottF , It looks I did an extra step for pivoting codes. Instead, could use grouping by code in Numeric Outliers node.

1 Like

@ScottF, I added to the example the way I did it. It’s because for my real data your way is not working at least for 3.7.2 version. With your data in 4.1 both give the same result.Numeric_Outliers_Simple_Example.knwf (61.7 KB)

3 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.