
Implement FlintJob to handle all query types in warmpool mode #979

Open
wants to merge 2 commits into base: main

Conversation

@saranrajnk commented Dec 9, 2024

Description

This PR introduces support for FlintJob to handle all types of queries — interactive, streaming, and batch — with all data sources in warmpool mode. Additionally, FlintJob will also support non-warmpool mode for streaming and batch queries, configurable via a Spark configuration setting.

Changes:

  • FlintJob does not use sessionId for query execution.
  • The client assigns the query at runtime for all query types in warmpool mode and executes them.
  • FlintJob now supports interactive queries by reusing several functions from FlintREPL, with access modifiers updated to public.
  • Added configuration for non-warmpool mode to support streaming and batch queries.
  • Emits success, failure, and latency metrics

Related Issues

Check List

  • Updated documentation (docs/ppl-lang/README.md)
  • Implemented unit tests
  • Implemented tests for combination with other commands
  • New added source code should include a copyright header
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

logInfo(s"WarmpoolEnabled: ${warmpoolEnabled}")

if (!warmpoolEnabled) {
val jobType = sparkSession.conf.get("spark.flint.job.type", FlintJobType.BATCH)
Contributor:

Any particular reason to have the conf key hard-coded here?
We could probably use FlintSparkConf.JOB_TYPE.key, similar to FlintSparkConf.WARMPOOL_ENABLED.key above.
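A sketch of the suggested change (assuming FlintSparkConf.JOB_TYPE exists alongside WARMPOOL_ENABLED, as the surrounding code implies):

```scala
// Sketch: read the job type via the FlintSparkConf constant rather than
// the hard-coded "spark.flint.job.type" string, mirroring how
// FlintSparkConf.WARMPOOL_ENABLED.key is used above.
val jobType = sparkSession.conf.get(FlintSparkConf.JOB_TYPE.key, FlintJobType.BATCH)
```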

Author:

The existing code was doing the same; I just wrapped it in an if block. I can modify this if needed.

CustomLogging.logInfo(s"""Job type is: ${jobType}""")
sparkSession.conf.set(FlintSparkConf.JOB_TYPE.key, jobType)

val dataSource = conf.get("spark.flint.datasource.name", "")
Contributor:

Same for DATA_SOURCE_NAME

Author:

The existing code was doing the same; I just wrapped it in an if block. I can modify this if needed.

Collaborator @ykmr1224 left a comment:

Can you clarify and document how WarmPool is abstracted and can be enabled/disabled?

val warmpoolEnabled = conf.get(FlintSparkConf.WARMPOOL_ENABLED.key, "false").toBoolean
logInfo(s"WarmpoolEnabled: ${warmpoolEnabled}")

if (!warmpoolEnabled) {
Collaborator:

This introduces a huge if/else block, which reduces readability and maintainability a lot. Can you split the class into a warmpool job and the original interactive job?

Collaborator:

+1 on this, let's abstract the common interface and move from there.
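One possible shape for that abstraction (a sketch only; the trait and class names here are illustrative, not part of the PR):

```scala
// Hypothetical sketch: the warmpool and original paths become two
// implementations of one trait, so the entry point branches once on
// warmpoolEnabled instead of carrying a large if/else block throughout.
trait FlintJobRunner {
  def run(): Unit
}

class WarmpoolJobRunner extends FlintJobRunner {
  override def run(): Unit = {
    // warmpool path: poll for assigned queries and execute them
  }
}

class BatchStreamingJobRunner extends FlintJobRunner {
  override def run(): Unit = {
    // original path: run the single configured batch/streaming query
  }
}

object FlintJobEntry {
  // Select the implementation once at startup.
  def runnerFor(warmpoolEnabled: Boolean): FlintJobRunner =
    if (warmpoolEnabled) new WarmpoolJobRunner else new BatchStreamingJobRunner
}
```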

Author:

Ack

val queryId = conf.get(FlintSparkConf.QUERY_ID.key, "")
}

def queryLoop(commandContext: CommandContext, segmentName: String): Unit = {
Collaborator:

This looks similar to the FlintREPL.queryLoop method, but modified. It would become very difficult to maintain, since we would need to carefully keep both consistent.
Can you add an abstraction and avoid the duplication?

Author:

This redundancy is expected. After this PR is merged, FlintREPL will be deprecated. FlintJob will be the single point of entry for all types of queries.

Comment on lines +550 to +644
def getSegmentName(sparkSession: SparkSession): String = {
val maxExecutorsCount =
sparkSession.conf.get(FlintSparkConf.MAX_EXECUTORS_COUNT.key, "unknown")
String.format("%se", maxExecutorsCount)
}
Collaborator:

This segmentName is specific to the warmpool logic; let's create an abstraction for warmpool and record metrics via AOP.

@@ -610,15 +610,15 @@ object FlintREPL extends Logging with FlintJobExecutor {
}
}

private def handleCommandTimeout(
def handleCommandTimeout(
Collaborator:

Both FlintJob and FlintREPL extend FlintJobExecutor; consider refactoring the common methods into FlintJobExecutor.

Author:

Ack

@@ -32,7 +32,8 @@ case class JobOperator(
dataSource: String,
resultIndex: String,
jobType: String,
streamingRunningCount: AtomicInteger)
streamingRunningCount: AtomicInteger,
statementContext: Map[String, Any] = Map.empty[String, Any])
Collaborator:

What's the purpose of adding statementContext, which belongs to the FlintStatement model?

Author:

The getNextStatement call in FlintJob retrieves all of the query information, including the statementContext. However, when the query is a streaming or batch query, we need to invoke JobOperator. Currently, JobOperator only accepts the query, queryId, and related information, not the statementContext, so when JobOperator calls executeStatement we may run into issues. To resolve this, the statementContext should be passed to JobOperator.

Additionally, the FlintStatement constructed before calling executeStatement does not currently include the statementContext. With the introduction of warmpool, it becomes necessary to include it, as there is client-side logic that depends on it.
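The threading described above might look roughly like this (a self-contained sketch; the parameter names follow the diff, while the simplified class shapes are hypothetical):

```scala
// Hypothetical sketch: FlintJob forwards the statementContext it received
// from getNextStatement into JobOperator, which attaches it to the
// FlintStatement it builds before calling executeStatement.
case class FlintStatement(
    query: String,
    queryId: String,
    context: Map[String, Any]) // the statement now carries its context

case class JobOperator(
    query: String,
    queryId: String,
    statementContext: Map[String, Any] = Map.empty[String, Any]) {

  // Client-side logic that depends on the context keeps working in
  // warmpool mode because the statement carries it end to end.
  def buildStatement(): FlintStatement =
    FlintStatement(query, queryId, statementContext)
}
```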


@saranrajnk force-pushed the nexus-wp-feat branch 2 times, most recently from cd81203 to 044aeea on December 20, 2024 20:42
Signed-off-by: Shri Saran Raj N <[email protected]>
val DEFAULT_QUERY_LOOP_EXECUTION_FREQUENCY = 100L
}

case class WarmpoolJob(
Collaborator:

Looks like it does not have tests. Can you add tests to cover the warmpool use cases?

Author:

Can I take this up in subsequent PRs?

Collaborator:

It is not good practice to separate the implementation and the unit tests into different PRs. It would also be harder to review the unit tests separately.

* @param flintStatement
* Flint statement
*/
private def finalizeCommand(
Collaborator:

Is this a duplicate of FlintREPL? Can you check for other duplicates and avoid them whenever possible?

Author:

This differs from the FlintREPL code; it requires new changes that can't be achieved with the function used in FlintREPL, so some redundancy in this function is unavoidable.

Collaborator:

Looks like the only differences are metric emission and logging. If those are beneficial for warmpool, should we generalize them and use them in FlintREPL as well?

Collaborator @noCharger left a comment:

Can we remove the concept of interactive / batch / streaming job for warm pool?

"spark.flint.job.queryLoopExecutionFrequency",
DEFAULT_QUERY_LOOP_EXECUTION_FREQUENCY)

val sessionManager = instantiateSessionManager(sparkSession, resultIndexOption)
Collaborator:

Why do we need a session manager for warmpool?

Author:

There are a bunch of places where the sessionManager needs to be passed. The resultIndex for an interactive query with a custom data source was being read from this sessionManager implementation, so this still has a minimal dependency on it.

Author:

> Can we remove the concept of interactive / batch / streaming job for warm pool?

Why? This classification is required for the WP logic. Also, it would be difficult to remove at this point. Maybe we can pick it up later if required.

}
}

def queryLoop(commandContext: CommandContext): Unit = {
Collaborator:

Why do we need the concept of a query loop for warm pool?

Author:

Warmpool also requires multiple iterations before running the actual query.
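The iteration the reply describes could be sketched as a simple polling loop (illustrative only; the fetch and execute callbacks stand in for the actual Flint client calls):

```scala
// Hypothetical sketch: the warmpool client keeps polling for the next
// assigned query and executing it until nothing remains to run.
@scala.annotation.tailrec
def queryLoop(fetchNext: () => Option[String], execute: String => Unit): Unit =
  fetchNext() match {
    case Some(query) =>
      execute(query)
      queryLoop(fetchNext, execute) // iterate again for the next assignment
    case None =>
      () // nothing assigned; stop (or idle the warm executor)
  }
```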
