
Unable to write to Opensearch using Spark [BUG] #505

Closed

gsharma2907 opened this issue Aug 6, 2024 · 2 comments
Labels
bug Something isn't working

Comments


gsharma2907 commented Aug 6, 2024

What is the bug?


Unable to write to OpenSearch using Spark; the following exception is raised:

Traceback (most recent call last):
  File "/Users/zgaurash/Downloads/test.py", line 21, in <module>
    .save("pyspark_idx")
  File "/Users/zgaurash/.pyenv/versions/3.6.13/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 740, in save
  File "/Users/zgaurash/.pyenv/versions/3.6.13/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/Users/zgaurash/.pyenv/versions/3.6.13/lib/python3.6/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/Users/zgaurash/.pyenv/versions/3.6.13/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o54.save.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.opensearch.spark.sql.DefaultSource15 could not be instantiated
    at java.util.ServiceLoader.fail(ServiceLoader.java:232)
    at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
    at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
    at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:46)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at scala.collection.IterableLike.foreach(IterableLike.scala:74)
    at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
    at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
    at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
    at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
    at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
    at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
    at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
    at org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.NoClassDefFoundError: scala/$less$colon$less
    at java.lang.Class.getDeclaredConstructors0(Native Method)
    at java.lang.Class.privateGetDeclaredConstructors(Class.java:2699)
    at java.lang.Class.getConstructor0(Class.java:3103)
    at java.lang.Class.newInstance(Class.java:412)
    at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
    ... 32 more
Caused by: java.lang.ClassNotFoundException: scala.$less$colon$less
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 37 more

Spark version 3.2.3
Scala version 2.13

How can one reproduce the bug?


  1. Create a PySpark script:

from pyspark.sql import SparkSession

# Build the session and a small two-row test DataFrame.
sparkSession = SparkSession.builder.appName("pysparkTest").getOrCreate()
df = sparkSession.createDataFrame([(1, "value1"), (2, "value2")], ["id", "value"])
df.show()

# Write to OpenSearch through the connector. The parentheses make the chained
# calls a single expression; without them the leading-dot lines are syntax errors.
(df.write
    .format("org.opensearch.spark.sql")
    .option("inferSchema", "true")
    .option("opensearch.nodes", "127.0.0.1")
    .option("opensearch.port", "9200")
    .option("opensearch.net.http.auth.user", "admin")
    .option("opensearch.net.http.auth.pass", "admin")
    .option("opensearch.net.ssl", "true")
    .option("opensearch.net.ssl.cert.allow.self.signed", "true")
    .option("opensearch.batch.write.retry.count", "9")
    .option("opensearch.http.retries", "9")
    .option("opensearch.http.timeout", "18000")
    .mode("append")
    .save("pyspark_idx"))

  2. Create an OpenSearch instance in a Docker container (a quick reachability check is sketched after this list):

docker run --name opensearch -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:2.7.0

  3. Clone the opensearch-hadoop repo and build the snapshot JAR.
  4. Submit the Spark job:

spark-submit --jars ~/opensearch-hadoop/spark/sql-30/build/libs/opensearch-spark-30_2.13-1.2.0-SNAPSHOT.jar test.py
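
As an aside (not part of the original report), a quick pre-flight check that the container is reachable can rule out connectivity problems before the Spark job runs. A minimal sketch using the requests library, with the credentials and self-signed certificate matching the demo container above:

# Hypothetical pre-flight check: confirm the local OpenSearch container
# answers before submitting the Spark job.
import requests

resp = requests.get(
    "https://127.0.0.1:9200",
    auth=("admin", "admin"),  # demo defaults, as used in the repro script
    verify=False,             # the demo container uses a self-signed certificate
)
print(resp.status_code, resp.json().get("version", {}).get("number"))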

What is the expected behavior?


The connector should be able to read from and write to OpenSearch, and it should not be limited to specific Spark and Scala versions.


gsharma2907 added the bug and untriaged labels Aug 6, 2024
Xtansia removed the untriaged label Aug 7, 2024
Collaborator

Xtansia commented Aug 7, 2024

@gsharma2907 A few questions/thoughts:

  • Are you able to confirm whether you see this with the actual released artifacts (see the sketch after this list), to make sure it isn't just a local build discrepancy?
  • Could you also please check with a known-working older Spark 3.x version, to see whether this is limited to Spark 3.5 as in [FEATURE] Add support for Apache Spark 3.5.1 (streaming) #496?
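
One way to test against a released artifact rather than the local snapshot build, sketched in PySpark (the Maven coordinate below is an assumption based on the artifact naming in this thread, not taken from it; the group and version may differ):

from pyspark.sql import SparkSession

# Sketch: resolve a released connector from Maven at startup instead of
# passing a locally built snapshot JAR via --jars. The coordinate is a
# hypothetical example; check the actual released coordinates before use.
spark = (SparkSession.builder
    .appName("releasedArtifactTest")
    .config("spark.jars.packages",
            "org.opensearch.client:opensearch-spark-30_2.12:1.2.0")
    .getOrCreate())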

Collaborator

Xtansia commented Jan 14, 2025

This issue is due to a Scala version mismatch: you're using the Scala 2.13 build of opensearch-spark, but your Spark cluster is most likely not a Scala 2.13 distribution. The default distribution of Spark 3.2 is built against Scala 2.12. The missing class in the trace, scala.$less$colon$less (scala.<:<), exists as a top-level class only in the Scala 2.13 standard library, which is why loading the 2.13 connector on a 2.12 runtime fails with NoClassDefFoundError.
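
A quick way to verify which Scala version a Spark installation was built with, directly from PySpark (a minimal sketch, not from the original thread):

# Minimal sketch: print the Scala version the running Spark JVM was built
# with, via the Py4J gateway.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("scalaVersionCheck").getOrCreate()
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
# e.g. "version 2.12.15" on the default Spark 3.2 distribution

The Scala suffix of the connector JAR (here opensearch-spark-30_2.13 in the spark-submit command above) has to match this runtime version; on a Scala 2.12 distribution, the _2.12 artifact is the one to use.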

Xtansia closed this as not planned Jan 14, 2025