Logistic Regression

Hi everyone,

I'm a beginner and would like some help. I'm trying to reproduce the Python code below in KNIME,
but the results for train accuracy, dev accuracy, test accuracy and the confusion matrix of the logistic regression are not correct.

Python code:

# SMOTE
col = 'Category'
data = sample_data(df, col)

# Randomly distribute data into training, testing and validation sets.
# We use a 60-20-20 distribution.
un_training_x, training_y, un_testing_x, testing_y, un_validation_x, validation_y = split_random(data, percent_train=60)

# Let's normalize our X data
training_x, testing_x, validation_x = normalize_data(un_training_x, un_testing_x, un_validation_x)

# Print the X data, to be sure that we have the normalized data in the range of -1 to 1
print("X:")
print_normalized_data(training_x, testing_x, validation_x)
print("")

# Let's print the Y class, to be sure that we have a mix of positive and negative class
print("Y:")
print_normalized_data(training_y, testing_y, validation_y)
print("")
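`split_random` and `normalize_data` are helpers from the paper's code and their implementations aren't shown in this thread. As a point of comparison only, a reproducible 60-20-20 split can be sketched with scikit-learn's `train_test_split` and a fixed `random_state` (the data and seed here are illustrative, not the paper's):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in data: 100 rows, balanced binary labels
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

seed = 42  # fixing the seed makes the split identical on every run

# 60% train, then split the remaining 40% evenly into dev and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, random_state=seed, stratify=y)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=seed, stratify=y_rest)

print(len(X_train), len(X_dev), len(X_test))  # 60 20 20
```

Exporting a split produced this way (or the one from `split_random`) to CSV makes it possible to feed the exact same rows into KNIME.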

Logistic Regression
clf = sklearn.linear_model.LogisticRegressionCV(penalty='l2', solver='lbfgs', max_iter=1000, verbose=1)
print(clf.fit(training_x.T, training_y.T.reshape(training_x.shape[1],)))

LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
fit_intercept=True, intercept_scaling=1.0, max_iter=1000,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
refit=True, scoring=None, solver='lbfgs', tol=0.0001,
verbose=1.0)

acc_dic = {}
acc_dic = analyze_results(training_x, training_y, validation_x, validation_y, testing_x, testing_y, "lr_sklearn", None, clf, acc_dic)

Train accuracy: 96.61319073083779
Dev accuracy: 87.76595744680851
Test accuracy: 96.79144385026738
Confusion matrix of Testing Data:
[[86 5]
[ 1 95]]
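`analyze_results` isn't shown in the thread either. Assuming it wraps the standard scikit-learn metrics, the kind of output above can be reproduced on any fitted model like this (the synthetic data here is only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the paper's data
X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegressionCV(penalty="l2", solver="lbfgs", max_iter=1000)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("Test accuracy:", 100 * accuracy_score(y_test, pred))
print("Confusion matrix of Testing Data:")
print(confusion_matrix(y_test, pred))
```

Running the same predictions through `accuracy_score` and `confusion_matrix` on both sides removes the metric computation as a possible source of the discrepancy.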

KNIME representation

What does that mean exactly?

Python
Train accuracy: 96.61319073083779
Dev accuracy: 87.76595744680851
Test accuracy: 96.79144385026738
Confusion matrix of Testing Data:
[[86 5]
[ 1 95]]

KNIME

Train accuracy: I couldn't reproduce
Dev accuracy: 88.8
Test accuracy: 94.7
Confusion matrix of Testing Data:
[[88 5]
[ 5 89]]

Can you confirm that your train/dev/test splits result in 100% identical subdatasets in your Python and KNIME implementations?


Yes, the amounts are exactly the same for training, testing and dev. Would there be another way to do the split?

Thanks

From a high level, the way I see it, this could be an issue with the configuration of the learner or predictor nodes, or it could be an issue with the data going into those nodes.

My initial hypothesis is that the underlying data is the issue, and my previous question was an attempt to get information from you to disprove that hypothesis.

I’ll rephrase the question I asked previously. Have you inspected the data in the training set produced by KNIME and confirmed that it’s exactly the same as the data in the training set produced by the Python script? Have you done the same exercise for the dev and test sets?

If the datasets are not the same, the nodes for data processing prior to the regression nodes can be investigated further to ensure that they exactly reproduce the Python datasets.

If the datasets are the same, then the problem is elsewhere. If this is the case, I’d confirm by taking the train/dev/test data from the Python implementation and running them through the learner nodes to see what the end result is. Then the learner and regression nodes can be tweaked.
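One way to do that dataset comparison, assuming both training sets are exported to CSV (the file names and columns below are hypothetical), is to load them into pandas, sort both the same way, and check cell-by-cell equality. The toy frames here stand in for the two exports; in practice they'd come from `pd.read_csv`:

```python
import pandas as pd

# Toy stand-ins for the KNIME and Python exports of the training set
knime_train = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "Category": [0, 1, 0]})
python_train = pd.DataFrame({"f1": [0.2, 0.1, 0.3], "Category": [1, 0, 0]})

# Sort both the same way so a mere row-order difference
# doesn't look like a data mismatch
key = list(knime_train.columns)
a = knime_train.sort_values(key).reset_index(drop=True)
b = python_train.sort_values(key).reset_index(drop=True)

print("identical:", a.equals(b))  # True here: same rows, different order
```

The same check repeated for the dev and test sets would confirm or rule out the data as the source of the difference.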

Additionally, I’d be interested in knowing just how closely you need the results from the KNIME and Python approaches to match.


I had not performed these tests because split_random is used, but I will try your idea.

I would like to get as close as possible because I am reproducing a paper.

I think @elsamuel is correct.
There could be a lot of differences.
E.g. the random seed used leads to different samples in training and testing: even though you have the same split percentages in both, the selected elements could be completely different.
Besides this, the configuration of the algorithm could differ, as there are a lot of settings. You would need to check all the default settings in sklearn and see whether they are the same in KNIME.


I agree with you, there can be a lot of differences. So I'm starting to think it will be very difficult to reproduce some papers developed in Python in KNIME.
I will continue studying this paper in KNIME with the forum's help.


This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.