Description
The current implementation of the operator uses a Kubernetes Job workload running spark-submit in order to create Spark applications.
While this works fine for most cases, it has several drawbacks that cannot be easily addressed.
The primary cause of these drawbacks is that the Job pod adds a level of indirection between the operator and the application: the operator loses control over parts of the resources the application owns.
This architecture makes it unnecessarily complex to run applications in "client" mode, even though that mode has several benefits over "cluster" mode:
- It makes application monitoring more robust and reliable.
- It makes the application life cycle management more robust. This is particularly important for streaming applications.
- It makes application owned resource cleanup configurable and transparent.
Note
In client mode, the driver runs in the same Pod as the submit command and requires a headless service through which the executors connect back to the driver. To create this service, the operator needs to know the name of the driver Pod. The operator cannot know this name until it creates the Job object, at which point it no longer has control over any other resources (most notably the executor Pods). This lack of control can then lead to application startup failures.
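To make the ordering problem concrete, the headless service in question might look like the sketch below. This is an illustrative assumption, not the operator's actual code: the names (`my-app`, the label keys, the port numbers) are hypothetical placeholders, while the overall layout follows the standard Kubernetes Service schema. The selector must match labels on a driver Pod that, in the current architecture, does not exist until the Job has been created.

```python
# Sketch of the headless Service the operator would need in client mode.
# All names and labels here are hypothetical placeholders; in the current
# Job-based architecture they are not known until after the Job pod exists,
# which is exactly the ordering problem described above.

def driver_headless_service(app_name: str, namespace: str) -> dict:
    """Build a headless Service manifest that selects the driver Pod."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": f"{app_name}-driver-svc",
            "namespace": namespace,
        },
        "spec": {
            # clusterIP: None makes the Service headless, so executors
            # resolve the driver Pod's address directly via DNS.
            "clusterIP": "None",
            "selector": {"spark-app-name": app_name, "spark-role": "driver"},
            "ports": [
                {"name": "driver-rpc", "port": 7078},
                {"name": "blockmanager", "port": 7079},
            ],
        },
    }

svc = driver_headless_service("my-app", "default")
```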
Further improvements:
- Allow users to configure automatic deletion of application CRs on a condition (always, onSuccess) to free up resources and make application names reusable.
Proposal
Summary: Introduce a new version of the SparkApplication CRD that omits the spec.job field and runs applications in client mode only.
- How does it make monitoring more robust?
When the operator is in control of driver and executor pods, it can take all of them into consideration when updating the application status.
Currently only the driver pod is taken into consideration.
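A minimal sketch of what such aggregation could look like once the operator owns all pods. The phase names are standard Kubernetes Pod phases; the aggregation policy itself is an assumption for illustration, not the operator's actual logic:

```python
# Sketch: derive an application-level status from ALL pods, not just the
# driver. The aggregation policy below is an illustrative assumption.

def application_status(driver_phase: str, executor_phases: list[str]) -> str:
    """Collapse the driver and executor Pod phases into one app status."""
    if driver_phase == "Failed" or "Failed" in executor_phases:
        return "Failing"
    if driver_phase == "Succeeded":
        return "Succeeded"
    if driver_phase == "Running" and all(p == "Running" for p in executor_phases):
        return "Running"
    return "Pending"
```

With only the driver in view, the second case below would go unnoticed; seeing all pods lets the operator surface it in the application status.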
- How does it make application life cycle more robust?
The operator can recognize failure states of both driver and executors and can act accordingly.
Currently the Job workload resubmits applications without much control from the operator.
In some cases, failure states are not propagated properly from executors to the Job pod, which leaves the application "hanging" in a failure state.
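Once the operator observes failures directly instead of the Job controller resubmitting on its own, it can apply an explicit restart policy. A sketch under stated assumptions: the policy names mirror Kubernetes restartPolicy values, and the decision logic is illustrative, not the operator's actual implementation.

```python
# Sketch: explicit life cycle decisions, instead of the opaque resubmission
# behavior of the Job workload. Policy names mirror Kubernetes restartPolicy
# values; the logic is an illustrative assumption.

def decide_action(app_status: str, restart_policy: str,
                  restarts: int, max_restarts: int) -> str:
    """Decide what the operator should do on the next reconcile."""
    if app_status != "Failing":
        return "none"
    if restart_policy == "Never":
        return "mark-failed"
    if restart_policy in ("Always", "OnFailure") and restarts < max_restarts:
        return "resubmit"
    return "mark-failed"
```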
- How does it make resource cleanup better?
All resources owned by the application are created by the operator and not by the submit job.
Therefore the operator knows exactly what it needs to clean up and when.
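Because every resource would be created by the operator, cleanup can also lean on standard Kubernetes garbage collection by attaching an ownerReference pointing at the application CR. The sketch below uses the standard ownerReferences schema; the API group/version string is a hypothetical placeholder.

```python
# Sketch: attach an ownerReference so that deleting the SparkApplication CR
# garbage-collects the owned resource automatically. The apiVersion below is
# a hypothetical placeholder; the ownerReferences fields are standard.

def with_owner(resource: dict, app_uid: str, app_name: str) -> dict:
    """Mark `resource` as owned by the SparkApplication CR."""
    ref = {
        "apiVersion": "example.com/v1alpha1",  # hypothetical group/version
        "kind": "SparkApplication",
        "name": app_name,
        "uid": app_uid,
        "controller": True,
        "blockOwnerDeletion": True,
    }
    meta = resource.setdefault("metadata", {})
    meta.setdefault("ownerReferences", []).append(ref)
    return resource
```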