Insider Threat Detection and Prediction

ashokkumar21 · March 10, 2025, 1:47pm

I'm working on the CMU Insider Threat dataset, version 6.2, and have been unable to find a similar workflow. The dataset comprises 8 CSV files, structured as follows:

File Name	Headers
decoy_file.csv	decoy_file, pc
device.csv	id, date, user, pc, file_tree, activity
email.csv	id, date, user, pc, to, cc, bcc, from, activity, size, attachment, content
file.csv	id, date, user, pc, filename, activity, to_removable_media, from_removable_media, content
logon.csv	id, date, user, PC, activity
LDAP.csv	employee name, user_id, email, role, projects, business_unit, functional_unit, department, team, supervisor
http.csv	id, date, user, pc, url, activity, content
psychometric.csv	employee_name, user_id, O, C, E, A, N
I would appreciate help formulating a workflow for predicting a malicious user.

thor_landstrom · March 18, 2025, 3:00pm

Hey @ashokkumar21,

Generally speaking, you can segment your workflow into sections.

Data input
Process Data
Train/Prediction

You mention multiple files, which can complicate the initial step, but for pure threat prediction I would say you only need a few of them, such as logon, file, email, http, and LDAP. Basically, anything that records user activity, since you want to find outliers in it. You can use the CSV Reader node for this.
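
If you want to prototype the input step outside KNIME first, here is a minimal Python sketch of the same idea. It is only an illustration: `read_events`, `DATE_FORMAT`, and the sample rows are made up, and the timestamp layout is an assumption you should check against your own files.

```python
import csv
import io
from datetime import datetime

# Assumed timestamp layout (MM/DD/YYYY HH:MM:SS); adjust if your files differ.
DATE_FORMAT = "%m/%d/%Y %H:%M:%S"


def read_events(handle, source):
    """Read one activity CSV and tag each row with its source file and calendar day."""
    events = []
    for row in csv.DictReader(handle):
        # Lower-case the headers, since logon.csv is listed with 'PC' and the others with 'pc'.
        row = {key.strip().lower(): value for key, value in row.items() if key is not None}
        row["source"] = source
        row["day"] = datetime.strptime(row["date"], DATE_FORMAT).date()
        events.append(row)
    return events


# Made-up rows standing in for logon.csv; on real data use open("logon.csv", newline="").
sample_logon = io.StringIO(
    "id,date,user,PC,activity\n"
    "{A1},01/04/2010 08:01:00,ABC0001,PC-1,Logon\n"
    "{A2},01/04/2010 17:30:00,ABC0001,PC-1,Logoff\n"
)
logon_events = read_events(sample_logon, "logon")
```

Deriving a `day` value at read time makes the later per-day grouping straightforward.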

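You would then want to use the Joiner node to link the tables. Going by the headers you listed, the shared column looks like ‘user’ (‘user_id’ in LDAP and the psychometric file) rather than ‘id’, which appears to be a per-row event identifier and is missing from those two files. You can then aggregate different things with a GroupBy on user and date, for example: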

count(logon) – number of logons per day
sum(email.size) – total size of the emails the user sends in a day
count(email) – number of emails the user sends in a day
etc.
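
As a rough sketch of these aggregations in Python (not a KNIME workflow): the function name `daily_features`, the sample rows, and the 'Logon'/'Logoff' activity values are assumptions for illustration, and I am assuming ‘user’ in the activity files matches ‘user_id’ in LDAP.

```python
from collections import defaultdict


def daily_features(logon_rows, email_rows, ldap_rows):
    """Aggregate events into one feature row per (user, day) and attach the LDAP role."""
    # Dictionary lookup standing in for the join: LDAP 'user_id' -> role (assumed key match).
    role_by_user = {row["user_id"]: row["role"] for row in ldap_rows}

    features = defaultdict(
        lambda: {"logon_count": 0, "email_count": 0, "email_size_sum": 0}
    )
    for row in logon_rows:
        # Count only logon events, not logoffs (activity values are assumed).
        if row["activity"] == "Logon":
            features[(row["user"], row["day"])]["logon_count"] += 1
    for row in email_rows:
        key = (row["user"], row["day"])
        features[key]["email_count"] += 1
        features[key]["email_size_sum"] += int(row["size"])

    for (user, _day), feats in features.items():
        feats["role"] = role_by_user.get(user, "unknown")
    return dict(features)


# Made-up sample rows for illustration only.
logon_rows = [
    {"user": "ABC0001", "day": "2010-01-04", "activity": "Logon"},
    {"user": "ABC0001", "day": "2010-01-04", "activity": "Logoff"},
    {"user": "ABC0001", "day": "2010-01-04", "activity": "Logon"},
    {"user": "XYZ0002", "day": "2010-01-04", "activity": "Logon"},
]
email_rows = [
    {"user": "ABC0001", "day": "2010-01-04", "size": "1500"},
    {"user": "ABC0001", "day": "2010-01-04", "size": "2000"},
]
ldap_rows = [{"user_id": "ABC0001", "role": "Engineer"}]

features = daily_features(logon_rows, email_rows, ldap_rows)
```

Each key of `features` is a (user, day) pair, which is the grain you would feed to a model.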

These are the kinds of features you want, and you can feed them into a few different models to see which performs well. If you are not sure which to try, I would point you to the AutoML component.
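
Before trying full models, a very simple baseline can show whether the features carry any signal. The sketch below is not the AutoML component; it is a plain per-user z-score that flags days far from that user's own average. The function name `flag_outlier_days`, the threshold, and the sample counts are all assumptions for illustration.

```python
from statistics import mean, pstdev


def flag_outlier_days(daily_counts, threshold=2.5):
    """Return (user, day, z) for days whose value is far from that user's own mean."""
    by_user = {}
    for (user, day), value in daily_counts.items():
        by_user.setdefault(user, []).append((day, value))

    flagged = []
    for user, series in by_user.items():
        values = [value for _, value in series]
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # no variation in this user's history, nothing to flag
        for day, value in series:
            z = (value - mu) / sigma
            if abs(z) > threshold:
                flagged.append((user, day, round(z, 2)))
    return flagged


# Made-up daily email counts: one user has a single unusually busy day.
daily_email_counts = {("ABC0001", f"2010-01-{d:02d}"): 5 for d in range(1, 10)}
daily_email_counts[("ABC0001", "2010-01-10")] = 50
daily_email_counts.update({("XYZ0002", f"2010-01-{d:02d}"): 3 for d in range(1, 11)})

flagged = flag_outlier_days(daily_email_counts)
```

Comparing each user against their own history, rather than against everyone, avoids flagging people whose normal workload is simply higher than average.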

TL
