Multivariate Outlier Detection & Distance-based Missing Value Imputation

mychoi · February 26, 2025, 5:44am

In KNIME, the nodes for outlier detection and missing value imputation are the Numeric Outliers node and the Missing Value node, but these nodes cannot handle multivariate outlier detection and distance-based missing value imputation.

If there is a way to perform multivariate outlier detection and distance-based missing value imputation in KNIME, I would appreciate any ideas you can share.

Multivariate Outlier Detection

Outlier detection using Euclidean Distance
Outlier detection using Mahalanobis Distance

Distance-based Missing Value Imputation

KNN Imputer with distance-based mean imputation
Imputation using weighted mean

Thank you!

ActionAndi · February 26, 2025, 4:37pm

Hi,
Have you tried using a Python Script node for this?
For example, for the Mahalanobis distance:

import knime.scripting.io as knio

import numpy as np
from scipy.stats import chi2

# Read the node's first input table as a pandas DataFrame
df = knio.input_tables[0].to_pandas()

# Numeric columns to test jointly (replace with your own column names)
data = df[['Col1', 'Col2']]

# Calculate the mean and the inverse covariance matrix once
mean = data.mean()
inv_cov = np.linalg.inv(data.cov().values)

# Calculate the squared Mahalanobis distance for each point
diff = (data - mean).values
mahalanobis_sq = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Determine outliers using a chi-squared distribution.
# For roughly multivariate normal data the SQUARED distance follows a
# chi-squared distribution with df = number of columns, so the squared
# distance (not its square root) is compared with the quantile.
threshold = chi2.ppf(0.95, df=data.shape[1])  # 95% quantile
outliers = data[mahalanobis_sq > threshold]

# Output
knio.output_tables[0] = knio.Table.from_pandas(outliers)
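
For the Euclidean distance variant from the question, something along these lines might work in the same Python Script node. This is only a sketch: the function name `euclidean_outlier_mask`, the z-scoring step, and the choice of an empirical quantile as the cutoff are my assumptions, not anything KNIME prescribes.

```python
import numpy as np

def euclidean_outlier_mask(values, quantile=0.95):
    """Flag rows that lie far from the centroid in Euclidean distance.

    Hypothetical helper: columns are z-scored first so that no column
    dominates because of its scale, and the cutoff is an empirical
    quantile of the distances (an assumption, adjust as needed).
    """
    x = np.asarray(values, dtype=float)

    # z-score each column
    std = x.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant columns then contribute zero distance
    z = (x - x.mean(axis=0)) / std

    # Euclidean distance of each row to the centroid (the origin after z-scoring)
    dist = np.sqrt((z ** 2).sum(axis=1))

    # Rows beyond the chosen quantile are flagged as outliers
    cutoff = np.quantile(dist, quantile)
    return dist > cutoff, dist
```

In the node you would pass `data.values` from the script above and filter with the returned mask, e.g. `outliers = data[mask]`. Unlike the Mahalanobis distance, this ignores correlations between the columns.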

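For the distance-based imputation part of the question, if scikit-learn is installed in the node's Python environment, `sklearn.impute.KNNImputer` with `weights="distance"` should cover the weighted-mean case. If it is not available, the idea can be sketched with NumPy alone. The helper name `knn_impute`, the inverse-distance weights, and the way partially missing rows are compared are my assumptions for illustration.

```python
import numpy as np

def knn_impute(values, k=3, weighted=True):
    """Fill NaNs with the mean of the k nearest rows.

    Hypothetical helper. With weighted=True the mean is weighted by
    inverse distance, otherwise it is a plain mean of the neighbours.
    """
    x = np.asarray(values, dtype=float)
    filled = x.copy()
    missing = np.isnan(x)

    for i in np.where(missing.any(axis=1))[0]:
        for j in np.where(missing[i])[0]:
            # Candidate donors: rows where column j is observed
            donors = np.where(~missing[:, j])[0]

            # Compare only on columns observed in both rows
            shared = ~missing[i] & ~missing[donors]
            diff = np.where(shared, x[donors] - x[i], 0.0)
            n_shared = shared.sum(axis=1)

            usable = n_shared > 0
            donors, diff, n_shared = donors[usable], diff[usable], n_shared[usable]
            if donors.size == 0:
                continue  # nothing to compare against, leave the NaN

            # Root mean squared difference over the shared columns
            dist = np.sqrt((diff ** 2).sum(axis=1) / n_shared)
            nearest = np.argsort(dist)[:k]

            if weighted:
                w = 1.0 / (dist[nearest] + 1e-12)  # closer rows weigh more
            else:
                w = np.ones(nearest.size)
            filled[i, j] = np.average(x[donors[nearest], j], weights=w)

    return filled
```

In the node you could call it on the numeric columns, e.g. `data.values` from the script above, and write the result back into the DataFrame before output. Scaling the columns first is probably a good idea, since the distances depend on the units of each column.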