Converting Dataset to ML input

mbillion · November 16, 2017, 7:58pm

I have a data set that is two million lines long. Each row is a different claim, and column 1 is the patient ID. There are 39K patient IDs, with multiple lines each.

I want to convert this set so that I can predict which patients will fall into the top 6000 or the bottom category.

My understanding is that I have to create a set with one line per patient (39K lines), with any number of columns that represent attributes, plus a two-valued column for the algorithm to predict.

I understand all the steps of creating the ML models once I have it in the form of 39K patients with the predicted categories

Is there any easy way to convert the two million line file into a file with unique rows for each patient representing categories of strong predictors of high cost from the data?

mladen · November 17, 2017, 12:42am

Can you be a bit more specific about how you want to preprocess your original file?

Tom_Hawkins · November 21, 2017, 2:16pm

Can you show a small example of your current data, so that we can see what you have now, and tell us what you would like to predict from it?

You can anonymise or fake the attributes if you want; it's the structure that matters for us to understand how to help.

ferry.abt · December 12, 2017, 2:38pm

Hey mbillion,

whether this is possible depends on what information you have, like mladen and Tom already said.

However, as far as I understand what you want to achieve, I might be able to give some pointers.

You have a list of claims. A claim consists of a patient ID and either some category of claim or some descriptive text. In the latter case, you need to do some pre-processing first, for example categorizing the claims with text mining.
Then, from the list of claims, you want to create a list of patients, where the features of a patient are the various claims.
To do this, you use a Pivoting node. Depending on what's interesting you might get boolean features (did a patient file a claim in this category) or maybe numeric features.
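For anyone who wants to see what the pivoting step does to the data, here is a minimal sketch in plain Python rather than a Pivoting node. It assumes each claim has already been reduced to a patient ID and a category; the IDs, the category names and the `pivot_claims` helper are all made up for illustration and are not taken from the original data.

```python
from collections import Counter, defaultdict

# Hypothetical claim rows: (patient_id, claim_category). Invented sample data.
claims = [
    ("P001", "pharmacy"),
    ("P001", "inpatient"),
    ("P001", "pharmacy"),
    ("P002", "outpatient"),
    ("P003", "pharmacy"),
    ("P003", "outpatient"),
]


def pivot_claims(rows):
    """Turn one-row-per-claim data into one-row-per-patient counts.

    Returns the sorted list of categories (the new columns) and a dict
    mapping each patient ID to its list of claim counts per category.
    """
    counts = defaultdict(Counter)
    for patient_id, category in rows:
        counts[patient_id][category] += 1

    categories = sorted({category for _, category in rows})
    table = {
        patient_id: [per_patient[c] for c in categories]
        for patient_id, per_patient in counts.items()
    }
    return categories, table


categories, count_table = pivot_claims(claims)

# Boolean variant: did the patient file at least one claim in this category?
flag_table = {pid: [n > 0 for n in row] for pid, row in count_table.items()}

print(categories)
for pid in sorted(count_table):
    print(pid, count_table[pid], flag_table[pid])
```

The count table gives numeric features and the flag table gives boolean ones; which of the two is more useful depends on the data.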

The next step, categorizing the patients, is too complex to explain fully in a short forum post, so here are just some key ideas. If you have data from the past, you can do supervised learning. If not, you could categorize the patients into different groups and learn over time whether your grouping was good and which group is the most expensive one.
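As a rough illustration of the supervised case, the target column could be derived from past data by ranking patients on total cost. The sketch below assumes that each past claim carries a cost value, which the thread does not confirm; the sample rows, the `label_top_n` helper and the "top"/"bottom" label names are hypothetical. With the real data, `top_n` would be 6000 instead of 2.

```python
from collections import defaultdict

# Hypothetical past claims: (patient_id, claim_cost). Invented sample data.
past_claims = [
    ("P001", 1200.0),
    ("P002", 80.0),
    ("P001", 300.0),
    ("P003", 450.0),
    ("P004", 20.0),
    ("P003", 100.0),
]


def label_top_n(rows, top_n):
    """Label each patient "top" or "bottom" by total past cost."""
    totals = defaultdict(float)
    for patient_id, cost in rows:
        totals[patient_id] += cost

    # Rank patients from most to least expensive and keep the first top_n.
    ranked = sorted(totals, key=lambda pid: totals[pid], reverse=True)
    top = set(ranked[:top_n])
    return {pid: ("top" if pid in top else "bottom") for pid in totals}


labels = label_top_n(past_claims, top_n=2)
for pid in sorted(labels):
    print(pid, labels[pid])
```

The resulting label per patient can then be joined to the pivoted feature table as the column to predict.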

Hope that helps a little.

Cheers,
Ferry