
[SPARK-56966][SPARK-56967][CORE] Auto-create missing event log directories #56035

Open
sharma-0311 wants to merge 1 commit into apache:master from sharma-0311:SPARK-56966-SPARK-56967-auto-create-event-log-dirs


@sharma-0311

What changes were proposed in this pull request?

This PR fixes two related issues where Spark fails when configured log directories do not already exist.

Changes

  1. Auto-create spark.history.fs.logDirectory if missing before History Server startup.
  2. Auto-create spark.eventLog.dir if missing before event logging initialization.

Why are the changes needed?

Currently:

  • FsHistoryProvider fails when the configured history log directory does not exist.
  • EventLogFileWriter throws FileNotFoundException if the event log directory does not exist.

This behavior affects local filesystems as well as S3/Hadoop-backed filesystems.

The fix creates the directories automatically using Hadoop FileSystem.mkdirs() before validation proceeds.
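As a rough illustration of the "create, then validate" order described above, here is a minimal sketch. It is not the PR's code: it assumes `java.nio.file` as a stand-in for Hadoop's `FileSystem` so that it runs without Hadoop on the classpath, and the class name `EventLogDirSketch` is hypothetical. The method name mirrors `requireLogBaseDirAsDirectory()` from the commit message only for readability.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical stand-in: java.nio.file replaces Hadoop's FileSystem so the
// sketch is self-contained. It only illustrates the ordering of the steps.
final class EventLogDirSketch {

    // Create the base directory if it is missing, then validate it.
    static void requireLogBaseDirAsDirectory(Path logBaseDir) throws IOException {
        if (Files.notExists(logBaseDir)) {
            // Plays the role of FileSystem.mkdirs(): missing parents are created too.
            Files.createDirectories(logBaseDir);
        }
        // Validation still runs afterwards, so a regular file at this path is rejected.
        if (!Files.isDirectory(logBaseDir)) {
            throw new IllegalArgumentException(
                "Log directory " + logBaseDir + " is not a directory.");
        }
    }
}
```

Calling `EventLogDirSketch.requireLogBaseDirAsDirectory(path)` on a missing path creates it, while a path that points at a regular file still fails validation. On a real Hadoop-backed store the outcome of the create step also depends on the permissions of the configured filesystem.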

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests:

  • EventLogFileWritersSuite
  • FsHistoryProviderSuite

Tested automatic creation of missing directories during initialization.

…rver log directories if they do not exist

- FsHistoryProvider: when spark.history.fs.logDirectory does not exist,
  attempt to create it automatically via FileSystem.mkdirs instead of
  immediately failing. Falls back to a warning if creation fails.

- EventLogFileWriters: in requireLogBaseDirAsDirectory(), check for
  directory existence before getFileStatus and auto-create via
  FileSystem.mkdirs if missing. This prevents FileNotFoundException
  when spark.eventLog.dir (including S3 paths) has not been pre-created.
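
The History Server behavior in the first bullet (try to create, fall back to a warning) could look roughly like the sketch below. This is an assumption-laden illustration, not the patch itself: `java.nio.file` and `java.util.logging` stand in for Hadoop's `FileSystem` and Spark's logging, and `HistoryLogDirSketch` and `ensureLogDirectory` are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.logging.Logger;

// Hypothetical stand-in for the History Server path; not Spark's actual code.
final class HistoryLogDirSketch {
    private static final Logger LOG = Logger.getLogger("HistoryLogDirSketch");

    // Returns true if the directory exists after the call. A failed creation
    // is logged as a warning instead of being thrown to the caller.
    static boolean ensureLogDirectory(Path logDir) {
        if (Files.isDirectory(logDir)) {
            return true;
        }
        try {
            Files.createDirectories(logDir);
            return true;
        } catch (IOException | SecurityException e) {
            // For example: read-only storage, or a regular file blocking the path.
            LOG.warning("Could not create log directory " + logDir + ": " + e);
            return false;
        }
    }
}
```

Returning a boolean rather than throwing keeps the decision about whether to continue startup with the caller, which matches the "falls back to a warning" wording above.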
@dongjoon-hyun (Member) left a comment


Thank you for making a PR. However, there is a reason why we didn't do this.

The Apache Spark community considers this one of its security features. The Apache Spark History Server should work on read-only storage.
