Happy Wednesday, everybody! We’re back today with a Just KNIME It! challenge on big data processing.
You are a data scientist working at a regional airport authority that’s grappling with a familiar problem: unpredictable flight delays. You are tasked with designing a system to predict these delays before they happen, by building a solution powered by distributed computing. Can you help the regional airport authority predict flight delays with an accuracy above 80%?
Here is the challenge. Let’s use this thread to post our solutions to it, which should be uploaded to your public KNIME Hub spaces with tag JKISeason4-23.
Need help with tags? To add tag JKISeason4-23 to your workflow, go to the description panel in KNIME Analytics Platform, click the pencil to edit it, and you will see the option for adding tags right there. Let us know if you have any problems!
I really love this kind of challenge, where you get to use new nodes and concepts. I hadn’t used the Spark nodes before, and although in this use case I think they were a little slower than processing everything in RAM with the native nodes, knowing how to use them is really useful.
My thinking:
Create the Local Big Data Environment
Select just the relevant features (not identifiers)
Remove the one missing value from dep time
Partition the data (80-20, stratified for the target)
Handle other missing values (for string: Mode, for numeric: Median)
Train decision tree and random forest models and predict with both
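For readers who want to see the split and imputation steps outside of KNIME, here is a minimal plain-Python sketch. It is not the Spark nodes used in the workflow, only an illustration of the same idea: split 80-20 per target class, then learn the fill values (median for numeric columns, mode for string columns) from the training part and apply them to both parts. The function names and the column names in the usage note are hypothetical, and the sketch assumes missing values are represented as None and that every column has at least one non-missing training value.

```python
import random
from collections import Counter, defaultdict
from statistics import median


def stratified_split(rows, target, train_frac=0.8, seed=42):
    """Split rows into train/test while keeping the class ratio of `target`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    train, test = [], []
    for group in by_class.values():
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test


def fit_imputer(train, columns):
    """Learn one fill value per column: median if numeric, otherwise mode."""
    fill = {}
    for col in columns:
        values = [r[col] for r in train if r[col] is not None]
        if all(isinstance(v, (int, float)) for v in values):
            fill[col] = median(values)
        else:
            fill[col] = Counter(values).most_common(1)[0][0]
    return fill


def apply_imputer(rows, fill):
    """Return copies of the rows with missing values replaced."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]
```

A call such as `train, test = stratified_split(rows, "delayed")` followed by `fill = fit_imputer(train, ["carrier", "distance"])` and `apply_imputer(test, fill)` keeps the fill values from being influenced by the test rows.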
As both of them gave the same accuracy and Cohen’s kappa (91.639% and 0.833, respectively), I wrote out the decision tree model. It’s easier to understand, and the extra complexity of the random forest didn’t add any value to the model.
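As a reminder of how the two reported metrics relate, here is a small plain-Python sketch that computes accuracy and Cohen's kappa from lists of actual and predicted labels. Kappa compares the observed agreement with the agreement expected by chance from the class frequencies, which is why it is a value between -1 and 1 rather than a percentage. The function name is hypothetical, and the sketch assumes the inputs are non-empty, equal in length, and not all of a single class on both sides (otherwise the denominator is zero).

```python
from collections import Counter


def accuracy_and_kappa(actual, predicted):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists."""
    n = len(actual)
    # Observed agreement: share of rows where prediction matches the label.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    # Chance agreement: sum over classes of P(actual=c) * P(predicted=c).
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa
```

With balanced classes the chance agreement is about 0.5, so kappa ends up noticeably lower than accuracy even for a good model.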
I tried to visualize the results, but I couldn’t get the data back from Spark into KNIME. I know about the node for that, but it just loaded forever; I also tried it with Spark Data sampling, and that loaded forever too. Maybe I will complete this workflow with the visualization later, but I didn’t have enough time to rerun everything now.