In KNIME, outlier detection and missing value imputation are handled by the Numeric Outliers node and the Missing Value node, but neither node supports multivariate outlier detection or distance-based missing value imputation.
Is there a way to do either of these in KNIME? I would appreciate any ideas you can share.
Hi,
have you tried using a Python Script node for this?
For example, for multivariate outlier detection with the Mahalanobis distance:
import knime.scripting.io as knio
# Flag multivariate outliers in the input table using the Mahalanobis distance.
import numpy as np
import pandas as pd
from scipy.stats import chi2
# Read the node's first input table; 'Col1' and 'Col2' stand in for your numeric columns
df = knio.input_tables[0].to_pandas()
data = df[['Col1','Col2']]
# Calculate the mean and covariance matrix
mean = data.mean()
cov_matrix = data.cov()
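# Calculate the Mahalanobis distance for each row (the covariance matrix is inverted only once)
inv_cov = np.linalg.inv(cov_matrix.to_numpy())
diff = (data - mean).to_numpy()
mahalanobis_dist = pd.Series(np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff)), index=data.index)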
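# Determine outliers: for roughly multivariate-normal data the *squared* distance follows
# a chi-squared distribution, so take the square root of the cutoff before comparing
threshold = np.sqrt(chi2.ppf(0.95, df=data.shape[1]))  # 95% quantile
outliers = data[mahalanobis_dist > threshold]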
# Output
knio.output_tables[0] = knio.Table.from_pandas(outliers)
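For the distance-based imputation part of the question, the same Python Script node approach could work. Below is a minimal sketch using scikit-learn's `KNNImputer`, assuming scikit-learn is available in the node's Python environment. The helper name `knn_impute`, the column names and the demo values are made up for illustration, and the KNIME wiring is only shown as comments so the snippet also runs outside KNIME.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def knn_impute(df, columns, n_neighbors=5):
    """Fill missing values in `columns` with the mean of the nearest rows.

    Hypothetical helper for illustration; `columns` must be numeric.
    """
    result = df.copy()
    # KNNImputer measures a NaN-aware Euclidean distance between rows, so
    # consider scaling the columns first if their ranges differ a lot.
    imputer = KNNImputer(n_neighbors=n_neighbors, weights="uniform")
    result[columns] = imputer.fit_transform(result[columns])
    return result


# Inside a KNIME Python Script node this could be wired up as:
#   import knime.scripting.io as knio
#   df = knio.input_tables[0].to_pandas()
#   knio.output_tables[0] = knio.Table.from_pandas(knn_impute(df, ['Col1', 'Col2']))

# Standalone demo with made-up values
demo = pd.DataFrame({
    "Col1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "Col2": [2.0, 4.0, 6.0, np.nan, 10.0],
})
imputed = knn_impute(demo, ["Col1", "Col2"], n_neighbors=2)
print(imputed)
```

Values that were already present are left unchanged; only the gaps are filled. The choice of `n_neighbors` is a tuning decision that depends on your data.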