Unable to run Spark Predictor


I am trying to run a Spark Predictor using a trained Spark Linear Regression Learner, but the Spark Predictor component seems to fail every time I execute it.
This is the error that is printed from KNIME’s console:
Execute failed: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://JobServer/user/data-manager#1019598049]] after [3000 ms]. Sender[null] sent message of type “spark.jobserver.DataManagerActor$StoreData”. (RestoredThrowable)

After inspecting /var/log/spark2-job-server/spark-job-server.log, it turns out there was this exception:
ERROR akka.actor.OneForOneStrategy [] [akka://JobServer/user/data-manager] - /tmp/spark2-job-server/upload/knime-sparkModel7134388612124129131.tmp-2018-07-12T14_46_39.707+07_00.dat (No such file or directory)
java.io.FileNotFoundException: /tmp/spark2-job-server/upload/knime-sparkModel7134388612124129131.tmp-2018-07-12T14_46_39.707+07_00.dat (No such file or directory)

If it helps, I also tried looking at /var/log/spark2-job-server//spark-job-server.log, but there was nothing in the log file relating to the exception (the log just ends at “Linear Regression With SGD done” and “Job finished OK”).

Hi @veee

in /var/log/spark2-job-server/ are there any folders called “jobserver-~”?

If yes: They contain the actual logs for the Spark contexts (one context = one folder). Please locate the one for the context with the failed job and send it to me via private message, so I can take a look.

If not, then Spark Jobserver is not correctly set up. Have you already taken a look at the installation guide?
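If it helps, the context folders and their logs can be listed with something like this (assuming the default spark2-job-server log location; adjust the paths if your setup differs):

```shell
# List the per-context log folders (one folder per Spark context)
# and show the tail of each context's log.
ls -d /var/log/spark2-job-server/jobserver-*/
tail -n 50 /var/log/spark2-job-server/jobserver-*/spark-job-server.log
```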



Yes, there is a jobserver-~ folder. What I meant to say in my previous post was that I took a look at /var/log/spark2-job-server/jobserver-~/spark-job-server.log, but nothing in the log file contains something that relates to the exception.
This time however, there are additional messages containing several “removed broadcast”.

How do I send a private message by the way?

Hi @veee

sorry for the long break on this.

How do I send a private message by the way?

It seems I have to send you a message first that you can reply to. Sorry for the confusion, I thought this was possible.

I wasn’t quite sure what to do with this problem, but now someone else has run into a similar issue. After doing a bit of digging through the jobserver code, I would say that the directory


does not exist yet, but the directory


exists but belongs to the “wrong” Linux user. This would explain the FileNotFoundException with “No such file or directory”.

As per the installation guide, jobserver should run as its own Linux user, e.g. “spark-job-server”, and all the directories accessed by jobserver need to belong to that user.
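A quick way to check the ownership is something like this (the paths and the “spark-job-server” user name are the installation guide’s defaults; adjust them to match your setup):

```shell
# Print the owner of each directory the jobserver writes to.
# "missing" means the directory does not exist yet.
for dir in /tmp/spark2-job-server /var/log/spark2-job-server; do
  owner=$(stat -c '%U' "$dir" 2>/dev/null || echo "missing")
  echo "$dir is owned by: $owner"
done
```

If a directory exists but is not owned by the jobserver user, that matches the failure pattern described above.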

I would propose stopping jobserver, then deleting the /tmp/spark2-job-server/ directory, and then starting jobserver again. Afterwards, run the predictor job again to see whether that fixes the issue.
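As a sketch, assuming jobserver was installed under /opt/spark-job-server with the usual server_start.sh/server_stop.sh scripts (the install path is an assumption; point it at your actual location):

```shell
# Stop jobserver, remove the stale upload directory, and start again.
# JOBSERVER_DIR is an assumption; adjust to your installation.
JOBSERVER_DIR=/opt/spark-job-server
TMP_DIR=/tmp/spark2-job-server

"$JOBSERVER_DIR/server_stop.sh"    # stop jobserver
rm -rf "$TMP_DIR"                  # delete the directory with the wrong owner
"$JOBSERVER_DIR/server_start.sh"   # jobserver recreates TMP_DIR on startup
```

Run these as root (or via sudo), since the directory may belong to a different user.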


Hi Björn,

Sorry for the long break as well.
I finally solved the issue by restarting the Spark jobserver, without deleting the “/tmp/spark2-job-server/” directory. This issue has happened several times, but restarting the jobserver always did the trick.

Thanks a lot for your help!
