Implement FlintJob to handle all query types in warmpool mode #979
Conversation
logInfo(s"WarmpoolEnabled: ${warmpoolEnabled}")

if (!warmpoolEnabled) {
  val jobType = sparkSession.conf.get("spark.flint.job.type", FlintJobType.BATCH)
Any particular reason to have the conf key hard-coded here?
We can probably use FlintSparkConf.JOB_TYPE.key, similar to FlintSparkConf.WARMPOOL_ENABLED.key above.
The existing code was doing the same; I just wrapped it in an if
block. I can modify this if needed.
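The suggested change, as a rough sketch (assuming FlintSparkConf defines JOB_TYPE and DATA_SOURCE_NAME entries the same way it defines WARMPOOL_ENABLED), would read:

```scala
// Sketch only: use FlintSparkConf constants instead of raw strings,
// mirroring how WARMPOOL_ENABLED is already read above.
val jobType =
  sparkSession.conf.get(FlintSparkConf.JOB_TYPE.key, FlintJobType.BATCH)
val dataSource =
  conf.get(FlintSparkConf.DATA_SOURCE_NAME.key, "")
```

Using the shared constants keeps the key spelling in one place, so a rename cannot silently desynchronize the reader and the writer of the conf.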
CustomLogging.logInfo(s"""Job type is: ${jobType}""")
sparkSession.conf.set(FlintSparkConf.JOB_TYPE.key, jobType)

val dataSource = conf.get("spark.flint.datasource.name", "")
Same for DATA_SOURCE_NAME
The existing code was doing the same; I just wrapped it in an if
block. I can modify this if needed.
Can you clarify and document how WarmPool is abstracted and can be enabled/disabled?
val warmpoolEnabled = conf.get(FlintSparkConf.WARMPOOL_ENABLED.key, "false").toBoolean
logInfo(s"WarmpoolEnabled: ${warmpoolEnabled}")

if (!warmpoolEnabled) {
This introduces a huge if/else block, which reduces readability and maintainability a lot. Can you split the class for warmpool and the original interactive job?
+1 on this, let's abstract the common interface and move from there.
Ack
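One possible shape for the split the reviewers are asking for — all names here are illustrative, not from the PR:

```scala
// Illustrative sketch: a shared trait so the warmpool and original
// job implementations expose one entry-point contract, and the
// warmpool flag selects an implementation instead of an if/else body.
trait FlintJobRunner {
  def start(conf: SparkConf, sparkSession: SparkSession): Unit
}

object FlintJobLauncher {
  // Hypothetical factory; WarmpoolJob / StandaloneJob stand in for
  // the two concrete classes this PR would split into.
  def select(conf: SparkConf): FlintJobRunner = {
    val warmpoolEnabled =
      conf.get(FlintSparkConf.WARMPOOL_ENABLED.key, "false").toBoolean
    if (warmpoolEnabled) new WarmpoolJob() else new StandaloneJob()
  }
}
```

With this shape, main() stays a few lines and each runner owns its own conf reads, which addresses the readability concern directly.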
val queryId = conf.get(FlintSparkConf.QUERY_ID.key, "")
}

def queryLoop(commandContext: CommandContext, segmentName: String): Unit = {
This looks similar to the FlintREPL.queryLoop method, but modified. It would become very difficult to maintain, since we would need to carefully keep both consistent.
Can you add abstraction and avoid the duplication?
This redundancy is expected. After this PR is merged, FlintREPL will be deprecated. FlintJob will be the single point of entry for all types of queries.
def getSegmentName(sparkSession: SparkSession): String = {
  val maxExecutorsCount =
    sparkSession.conf.get(FlintSparkConf.MAX_EXECUTORS_COUNT.key, "unknown")
  String.format("%se", maxExecutorsCount)
}
This segmentName is specific to warmpool logic; let us create abstractions on warmpool and record metrics via AOP.
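For clarity, the formatting in getSegmentName above just suffixes the executor count with the letter "e" — a self-contained sketch of that behavior:

```scala
// Equivalent to the String.format("%se", ...) call above:
// "%s" substitutes the count, and the literal 'e' is appended,
// e.g. "10" becomes "10e" and "unknown" becomes "unknowne".
def segmentName(maxExecutorsCount: String): String =
  s"${maxExecutorsCount}e"
```

Note the "unknown" fallback also gets the suffix ("unknowne"), which may be worth guarding against if the segment name feeds metric dimensions.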
@@ -610,15 +610,15 @@ object FlintREPL extends Logging with FlintJobExecutor {
     }
   }

-  private def handleCommandTimeout(
+  def handleCommandTimeout(
Both FlintJob and FlintREPL extend FlintJobExecutor; consider refactoring common methods into FlintJobExecutor.
Ack
@@ -32,7 +32,8 @@ case class JobOperator(
     dataSource: String,
     resultIndex: String,
     jobType: String,
-    streamingRunningCount: AtomicInteger)
+    streamingRunningCount: AtomicInteger,
+    statementContext: Map[String, Any] = Map.empty[String, Any])
What's the purpose of adding statementContext, which belongs to the FlintStatement model?
The getNextStatement call in FlintJob retrieves all the query information, including the statementContext. However, when the query is a streaming or batch query, we need to invoke the JobOperator. Currently, the JobOperator class only accepts the query, queryId, and other related information, but it does not include the statementContext. As a result, when the JobOperator calls executeStatement, we may encounter issues. To resolve this, the statementContext should be passed to the JobOperator.

Additionally, before calling executeStatement, the FlintStatement is constructed. Currently, the FlintStatement does not include the statementContext. However, with the introduction of the warmpool, it becomes necessary to include the statementContext in the FlintStatement, as there is client-side logic that depends on it.
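The flow being described can be sketched as follows; apart from statementContext (which the diff above adds with a Map.empty default), the constructor arguments and the flintStatement.context accessor are assumptions for illustration:

```scala
// Sketch: statementContext retrieved by getNextStatement travels
// through JobOperator so that the FlintStatement built before
// executeStatement carries the client-side context it needs.
val jobOperator = JobOperator(
  sparkSession,
  query,
  queryId,
  dataSource,
  resultIndex,
  jobType,
  streamingRunningCount,
  statementContext = flintStatement.context // assumed accessor
)
```

Threading the context through the constructor (with an empty-map default) keeps existing non-warmpool call sites source-compatible.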
Signed-off-by: Shri Saran Raj N <[email protected]>
val DEFAULT_QUERY_LOOP_EXECUTION_FREQUENCY = 100L
}

case class WarmpoolJob(
Looks like it does not have tests. Can you add tests to cover the warmpool use cases?
Can I take this up in subsequent PRs?
It is not a good practice to separate PR for implementation and unit test. It would also become harder to review the unit test separately.
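A test covering both branches could be sketched roughly as below; the suite name, helper style, and assertions are illustrative only, not part of the PR:

```scala
// Illustrative test sketch for the warmpool toggle. Assumes a
// ScalaTest-style suite and a hypothetical entry point that
// exposes which execution path was chosen.
class WarmpoolJobTest extends AnyFunSuite {

  test("warmpool disabled falls back to configured job type") {
    val conf = new SparkConf()
      .set(FlintSparkConf.WARMPOOL_ENABLED.key, "false")
      .set(FlintSparkConf.JOB_TYPE.key, FlintJobType.BATCH)
    // assert: the non-warmpool batch path is taken,
    // e.g. JobOperator is invoked once with jobType = BATCH
  }

  test("warmpool enabled drives queries through queryLoop") {
    val conf = new SparkConf()
      .set(FlintSparkConf.WARMPOOL_ENABLED.key, "true")
    // assert: queryLoop is entered and getNextStatement is polled
  }
}
```

Even a skeleton like this in the same PR would let reviewers check that both code paths are exercised.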
 * @param flintStatement
 *   Flint statement
 */
private def finalizeCommand(
Is it duplicate with FlintREPL? Can you check other duplicate and avoid it whenever possible?
This differs from the FlintREPL code, as I require new changes that can't be achieved with the same function used in FlintREPL. Therefore, some redundancy in this function is unavoidable.
Looks like the only difference is emit metrics and logging. If those are beneficial for Warmpool, should we generalize it and use for FlintREPL as well?
Can we remove the concept of interactive / batch / streaming job for warm pool?
"spark.flint.job.queryLoopExecutionFrequency",
DEFAULT_QUERY_LOOP_EXECUTION_FREQUENCY)

val sessionManager = instantiateSessionManager(sparkSession, resultIndexOption)
why do we need session manager for warmpool?
There are a bunch of places where the sessionManager needs to be passed. The resultIndex for an interactive query with a custom datasource was being read from this SessionManagerImpl, so the dependency is still minimal.
Can we remove the concept of interactive / batch / streaming job for warm pool?
Why? This classification is required for the WP logic. It would also be difficult to remove at this point. Maybe we can pick it up later if required.
  }
}

def queryLoop(commandContext: CommandContext): Unit = {
why do we need the concept of query loop for warm pool?
Warmpool requires multiple iterations as well before running the actual query.
Description
This PR introduces support for FlintJob to handle all types of queries — interactive, streaming, and batch — with all data sources in warmpool mode. Additionally, FlintJob will also support non-warmpool mode for streaming and batch queries, configurable via a Spark configuration setting.
Changes:
Related Issues
Check List
Commits are signed with --signoff
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.