Solutions to “Just KNIME It!” Challenge 23 - Season 4

:sun_with_face: Happy Wednesday, everybody! We’re back today with a Just KNIME It! challenge on big data processing. :brain:

:airplane: You are a data scientist working at a regional airport authority that’s grappling with a familiar problem: unpredictable flight delays. You are tasked with designing a system to predict these delays before they happen, by building a solution powered by distributed computing. :computer: Can you help the regional airport authority predict flight delays with an accuracy above 80%?

Here is the challenge. Let’s use this thread to post our solutions to it, which should be uploaded to your public KNIME Hub spaces with the tag JKISeason4-23.

:sos: Need help with tags? To add tag JKISeason4-23 to your workflow, go to the description panel in KNIME Analytics Platform, click the pencil to edit it, and you will see the option for adding tags right there. :blush: Let us know if you have any problems!

My solution:

I really love this kind of challenge, where you can use new nodes and concepts. I hadn’t used the Spark nodes before, and although in this use case I think they were a little slower than processing everything in my RAM with the native nodes, I think knowing how to use these nodes is really useful :slight_smile:

My thinking:

  • Create the Local Big Data Environment
  • Select just the relevant features (no identifiers)
  • Remove the one row with a missing value in dep time
  • Partition the data (80-20, stratified on the target)
  • Handle the other missing values (for string: mode, for numeric: median)
  • Train and predict with the decision tree and random forest models
    • As both of them gave the same accuracy and Cohen’s kappa (91.639% and 0.833), I wrote out the decision tree model. It’s easier to understand, and the extra complexity of the random forest didn’t add value to the model
    • I tried to visualize it, but I couldn’t get the data back from Spark into KNIME. I know about the node for this, but it just loaded forever. I also tried it with Spark Data sampling first, but that loaded forever too. Maybe I will complete this workflow with the visualization later, but I didn’t have enough time now to rerun everything :frowning:
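
For anyone who wants to see the partitioning and missing-value steps outside of KNIME, here is a rough plain-Python sketch of the idea. It is not what the Spark nodes do internally, and it assumes the fill values are learned from the training part only. The column names (`carrier`, `distance`, `delayed`) and the toy rows are made up for illustration and are not from the challenge dataset.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(rows, target, train_frac=0.8, seed=42):
    """Split rows so each target class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[target]].append(row)
    train, test = [], []
    for members in by_class.values():
        members = members[:]  # copy so the caller's data is not shuffled
        rng.shuffle(members)
        cut = round(len(members) * train_frac)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test


def fit_imputer(train, columns):
    """Learn fill values from the training rows: median for numeric, mode for string columns."""
    fill = {}
    for col in columns:
        present = [r[col] for r in train if r[col] is not None]
        if all(isinstance(v, (int, float)) for v in present):
            fill[col] = statistics.median(present)
        else:
            fill[col] = statistics.mode(present)
    return fill


def apply_imputer(rows, fill):
    """Replace missing values (None) with the learned fill values."""
    return [
        {k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
        for r in rows
    ]


# Hypothetical toy data: 20 flights, half of them delayed, a few values missing.
rows = []
for i in range(20):
    rows.append({
        "carrier": None if i == 3 else ("AA" if i % 3 else "DL"),
        "distance": None if i in (5, 12) else 100.0 + 10 * i,
        "delayed": "yes" if i % 2 else "no",
    })

train, test = stratified_split(rows, "delayed")
fill = fit_imputer(train, ["carrier", "distance"])
train_clean = apply_imputer(train, fill)
test_clean = apply_imputer(test, fill)
```

The same fill values are applied to both partitions, so the test rows never influence the median or the mode.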

My workflow:

As this is more of a complex, “thinking” problem, please let me know if you have any suggestions or comments! :slight_smile: