Solr, SolrJ, and Spark

I’m trying to use Solr queries inside a JavaRDD snippet. It seems to me there are two options:

  • using spark-solr to create a JavaRDD
  • using CloudSolrClient to perform a query.

The latter works in a regular Java Snippet node, but fails with a “class not found” exception in the JavaRDD snippet. I have tried Spark 1.6.2 and Spark 2.3 job servers with KNIME 3.5.3 and 3.6 (nightly), using different versions of SolrJ and Spark-Solr, and set the options spark.jars, spark.driver.extraClassPath, and spark.executor.extraClassPath accordingly.

But I get more or less the same errors in both cases:

java.lang.NoClassDefFoundError: com/lucidworks/spark/rdd/SolrJavaRDD
at SparkJavaSnippetSource5be85e31d6a5b35d678c7…apply(…
and
java.lang.NoClassDefFoundError: org/apache/solr/common/params/SolrParams
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.newInstance(Class.java:412)
at org.knime.bigdata.spark1_6.jobs.scripting.java.JavaSnippetJob.runJob(JavaSnippetJob.java:69)

respectively.

Thanks in advance for any help.

Sven

Snippet code:

// Requires the SolrJ classes (CloudSolrClient, SolrQuery, QueryResponse, SolrDocument, ...) on the snippet's classpath.
String urlZooKeeper = "host:2181/solr";
CloudSolrClient solrClient = new CloudSolrClient.Builder()
    .withZkHost(urlZooKeeper)
    .build();
List<Row> result = new ArrayList<>();
try {
    SolrQuery query = new SolrQuery();
    query.setQuery("*:*"); // match-all query
    query.setRows(100);
    final QueryResponse response = solrClient.query("table", query);
    final SolrDocumentList documents = response.getResults();
    for (SolrDocument document : documents) {
        RowBuilder rb = RowBuilder.emptyRow();
        for (String s : document.getFieldNames()) {
            rb.add((String) document.getFieldValue(s));
        }
        result.add(rb.build());
    }
    solrClient.close();
} catch (Exception e) {
    e.printStackTrace();
}
return sc.parallelize(result);

and

String urlZooKeeper = "host:2181/solr";
SolrJavaRDD solrRDD = SolrJavaRDD.get(urlZooKeeper, "table", sc.sc());
JavaRDD<SolrDocument> resultsRDD = solrRDD.queryShards("*:*"); // match-all query
return resultsRDD.map(
    new Function<SolrDocument, Row>() {
        private static final long serialVersionUID = 1L;

        @Override
        public Row call(SolrDocument doc) {
            RowBuilder rb = RowBuilder.emptyRow();
            for (String s : doc.getFieldNames()) {
                rb.add((String) doc.getFieldValue(s));
            }
            return rb.build();
        }
    }
);

Hi @sven_knime,
Are those libraries present on the cluster and on the classpath of the job server?
When you use the JavaRDD node, the Java code you write is executed on the cluster. The libraries therefore have to be present there, and the job server has to be able to find them.
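
One way to check this from inside the snippet itself is to try resolving the classes by name; this is only a diagnostic sketch, using the class names from the stack traces above:

try {
    // These are the classes from the NoClassDefFoundError messages above.
    Class.forName("org.apache.solr.common.params.SolrParams");
    Class.forName("com.lucidworks.spark.rdd.SolrJavaRDD");
    System.out.println("SolrJ and spark-solr classes are visible to the job");
} catch (ClassNotFoundException e) {
    System.out.println("Not on the job's classpath: " + e.getMessage());
}

If this already fails, the jar is not visible to the process that actually runs the snippet, regardless of what is on the local KNIME classpath.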

Yes, I had Maven bundle all dependencies into a single jar, for spark-solr 3.5.1 and 2.2.3.

Then I declare the classpaths for driver and executor in the Spark context configuration and also add the jar to the context. The jars are loaded from HDFS to the job server.

This procedure works for another library in another Spark job, and locally, in a regular Java Snippet node, it also works for spark-solr. However, the regular snippet works row by row rather than “query to table”. I have also written a KNIME node that uses spark-solr, which works in principle, but the regular KNIME output table it produces cannot be connected to a Spark RDD node, because it is incompatible with the Spark nodes.
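
For what it is worth, a quick sanity check that the shaded jar really contains the classes from the stack traces could look like this (the jar path is only a placeholder):

import java.util.jar.JarFile;

public class CheckShadedJar {
    public static void main(String[] args) throws Exception {
        // Placeholder path to the Maven-built fat jar.
        try (JarFile jar = new JarFile("/path/to/spark-solr-shaded.jar")) {
            System.out.println("SolrParams bundled:  "
                + (jar.getEntry("org/apache/solr/common/params/SolrParams.class") != null));
            System.out.println("SolrJavaRDD bundled: "
                + (jar.getEntry("com/lucidworks/spark/rdd/SolrJavaRDD.class") != null));
        }
    }
}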

Best regards,

Sven

Maybe the jars are loaded into the wrong directory. Is the destination directory present in the classpath of the job server? Are the permissions set correctly?

If you have a KNIME table, you can use the “Table to Spark” node to load the data into a Spark RDD on the server.

The paths etc. are exactly the same as for the other library that works. “Table to Spark” is what I want to avoid; everything should run on the cluster.

Could you try adding the jars with the option

dependent-jar-uris: file://path/to/file,file://path/to/file2

in the Create Spark Context configuration? Please destroy and re-create the context afterwards.
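
For example, with a shaded spark-solr jar copied onto the job server’s local file system, the setting could look like this (path and file name are placeholders):

dependent-jar-uris: file:///opt/jobserver/extra-jars/spark-solr-3.5.1-shaded.jar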

Yes, it works. It seems to have completely ignored my
sc.addJar(jar_string);
sc.setLocalProperty("spark.driver.extraClassPath", jar_string);
sc.setLocalProperty("spark.executor.extraClassPath", jar_string);
statements, and also the options
spark.jars: hdfs:///…/my.jar
spark.driver.extraClassPath: my.jar
spark.executor.extraClassPath: my.jar
in the Create Spark Context node. Is this the expected behaviour?