Multivariate Outlier Detection & Distance-based Missing Value Imputation

In KNIME, the nodes for outlier detection and missing value imputation are the Numeric Outliers node and the Missing Value node, but these nodes cannot handle multivariate outlier detection and distance-based missing value imputation.

If there is a way to perform multivariate outlier detection and distance-based missing value imputation in KNIME, I would appreciate any ideas you can share.

  1. Multivariate Outlier Detection
  • Outlier detection using Euclidean Distance
  • Outlier detection using Mahalanobis Distance
  2. Distance-based Missing Value Imputation
  • KNN Imputer with distance-based mean imputation
  • Imputation using weighted mean

Thank you!

Hi,
Have you tried using a Python Script node for this?
For example, for the Mahalanobis distance:

import knime.scripting.io as knio

import numpy as np
import pandas as pd
from scipy.stats import chi2

# Read the node's first input table
df = knio.input_tables[0].to_pandas()

# Numeric columns to check ('Col1' and 'Col2' are placeholder names)
data = df[['Col1','Col2']]

# Calculate the mean and covariance matrix
mean = data.mean()
cov_matrix = data.cov()

# Calculate the squared Mahalanobis distance for each row (invert the covariance once)
inv_cov = np.linalg.inv(cov_matrix.to_numpy())
diff = (data - mean).to_numpy()
mahalanobis_sq = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)

# Determine outliers: for roughly normal data the squared distance follows
# a chi-squared distribution, so compare the squared distance to its quantile
threshold = chi2.ppf(0.95, df=data.shape[1])  # 95% quantile
outliers = data[mahalanobis_sq > threshold]

# Output
knio.output_tables[0] = knio.Table.from_pandas(outliers)
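
The other items from the question could be handled in the same kind of node. Below is a minimal sketch, not a definitive implementation. It assumes scikit-learn is available in the node's Python environment, that the selected columns are numeric, and that every column has at least one observed value. The function names euclidean_outliers and knn_impute and the 0.95 quantile cut-off are my own choices for illustration. Setting weights="distance" in scikit-learn's KNNImputer gives the distance-weighted mean of the neighbours, and weights="uniform" gives the plain mean.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def euclidean_outliers(data: pd.DataFrame, quantile: float = 0.95) -> pd.Series:
    """Flag rows whose Euclidean distance from the column means is unusually large."""
    # Standardize first so that no column dominates the distance through its scale.
    z = (data - data.mean()) / data.std(ddof=0)
    dist = np.sqrt((z ** 2).sum(axis=1))
    # Assumed cut-off: an empirical quantile of the distances themselves.
    return dist > dist.quantile(quantile)


def knn_impute(data: pd.DataFrame, k: int = 5, weighted: bool = True) -> pd.DataFrame:
    """Fill missing values from the k nearest rows (plain or distance-weighted mean)."""
    weights = "distance" if weighted else "uniform"
    imputer = KNNImputer(n_neighbors=k, weights=weights)
    filled = imputer.fit_transform(data)
    return pd.DataFrame(filled, columns=data.columns, index=data.index)


# Inside a KNIME Python Script node this could be wired up as follows
# ('Col1' and 'Col2' are placeholder column names):
# df = knio.input_tables[0].to_pandas()
# knio.output_tables[0] = knio.Table.from_pandas(knn_impute(df[['Col1', 'Col2']]))
```

To get the outlier rows instead, filter with the returned flags, e.g. data[euclidean_outliers(data)].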