What is the bug?
Unable to write to OpenSearch using Spark. The following exception is raised:
Traceback (most recent call last):
  File "/Users/zgaurash/Downloads/test.py", line 21, in <module>
    .save("pyspark_idx")
  File "/Users/zgaurash/.pyenv/versions/3.6.13/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 740, in save
  File "/Users/zgaurash/.pyenv/versions/3.6.13/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/Users/zgaurash/.pyenv/versions/3.6.13/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/Users/zgaurash/.pyenv/versions/3.6.13/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o54.save.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.opensearch.spark.sql.DefaultSource15 could not be instantiated
	at java.util.ServiceLoader.fail(ServiceLoader.java:232)
	at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
	at scala.collection.Iterator.foreach(Iterator.scala:943)
	at scala.collection.Iterator.foreach$(Iterator.scala:943)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
	at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
	at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
	at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
	at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
	at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
	at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NoClassDefFoundError: scala/$less$colon$less
	at java.lang.Class.getDeclaredConstructors0(Native Method)
	at java.lang.Class.privateGetDeclaredConstructors(Class.java:2699)
	at java.lang.Class.getConstructor0(Class.java:3103)
	at java.lang.Class.newInstance(Class.java:412)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
	... 32 more
Caused by: java.lang.ClassNotFoundException: scala.$less$colon$less
	at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 37 more
Spark version: 3.2.3
Scala version: 2.13
How can one reproduce the bug?
Create the following PySpark script (test.py):
from pyspark.sql import SparkSession

# Build a local Spark session and a small test DataFrame.
sparkSession = SparkSession.builder.appName("pysparkTest").getOrCreate()
df = sparkSession.createDataFrame([(1, "value1"), (2, "value2")], ["id", "value"])
df.show()

# Write the DataFrame to the "pyspark_idx" index in OpenSearch.
(df.write
    .format("org.opensearch.spark.sql")
    .option("inferSchema", "true")
    .option("opensearch.nodes", "127.0.0.1")
    .option("opensearch.port", "9200")
    .option("opensearch.net.http.auth.user", "admin")
    .option("opensearch.net.http.auth.pass", "admin")
    .option("opensearch.net.ssl", "true")
    .option("opensearch.net.ssl.cert.allow.self.signed", "true")
    .option("opensearch.batch.write.retry.count", "9")
    .option("opensearch.http.retries", "9")
    .option("opensearch.http.timeout", "18000")
    .mode("append")
    .save("pyspark_idx"))
Start a single-node OpenSearch container:
docker run --name opensearch -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:2.7.0
Submit the script with the connector jar:
spark-submit --jars ~/opensearch-hadoop/spark/sql-30/build/libs/opensearch-spark-30_2.13-1.2.0-SNAPSHOT.jar test.py
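To confirm which Scala version the Spark runtime itself was built with (the root cause suggested in the comment below), a quick check can be run from PySpark. Note that _jvm is an internal py4j handle, so this is a debugging sketch rather than a stable API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# scala.util.Properties.versionString() reports the Scala version on the
# JVM classpath, e.g. "version 2.12.15" for the stock Spark 3.2 download.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())

The same information appears in the banner printed by spark-submit --version.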
What is the expected behavior?
The connector should be able to read from and write to OpenSearch. It should also not be limited to a specific combination of Spark and Scala versions.
What is your host/environment?
macOS, with Python 3.6.13 via pyenv (per the paths in the traceback above).
Comment:
This issue is due to a Scala version mismatch: the jar being submitted is the Scala 2.13 build of opensearch-spark (opensearch-spark-30_2.13), but the Spark cluster is likely not a Scala 2.13 distribution. The default distribution for Spark 3.2 uses Scala 2.12. The missing class in the trace, scala.$less$colon$less, is the JVM name of the <:< type class, which lives directly in the scala package only in the 2.13 standard library, so a jar built against 2.13 cannot load on a 2.12 runtime.
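A minimal sketch of the fix, assuming the build also produces a Scala 2.12 variant of the connector with the same naming pattern (the jar path below is hypothetical and must be adjusted to your build output):

from pyspark.sql import SparkSession

# Hypothetical path: the Scala 2.12 build of the connector, matching the
# Scala version of the default Spark 3.2 distribution. The "_2.12"/"_2.13"
# suffix of the jar must agree with the Scala version Spark ships with.
jar = "/path/to/opensearch-hadoop/spark/sql-30/build/libs/opensearch-spark-30_2.12-1.2.0-SNAPSHOT.jar"

spark = (SparkSession.builder
    .appName("pysparkTest")
    .config("spark.jars", jar)  # same effect as spark-submit --jars
    .getOrCreate())

Alternatively, keep the 2.13 jar and run it against a Spark distribution built for Scala 2.13.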