Improvement: remove possible race-condition with Gitsync #732

@adwk67

Description

Under certain conditions, gitsync calls can result in Airflow DAG-parsing errors that are not easily cleared as they persist in the metadata database. This should at least be documented (for cases where configuration suffices) but ideally fixed internally in the framework. This ticket will require some research work before implementing a solution.

Example

A combination of

  1. A DAG that uses submodules
  2. Gitsync calls not run regularly enough in relation to the DAG-processor calls

can result in submodule cache files (*.pyc) being in an inconsistent state when Airflow, detecting changes to DAGs, starts to (re-)parse them. The parsing interval is defined by AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL and defaults to 30s; the period between git-sync runs is defined by the gitsync resource field wait, which defaults to 20s. With these defaults, a sync may still be in progress when DAG processing starts, leaving the cache inconsistent. Documenting this should be sufficient for many situations and users.
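To see how the two defaults interact, here is a toy timeline sketch. The 30s parse interval and 20s sync period come from the text above; the sync duration is purely an assumption for illustration, as is the helper name:

```python
# Illustrative only: parse starts every 30s, a sync starts every 20s,
# and each sync is assumed (hypothetically) to take 5s to rewrite files.
PARSE_INTERVAL = 30   # AIRFLOW__DAG_PROCESSOR__MIN_FILE_PROCESS_INTERVAL default
SYNC_INTERVAL = 20    # gitsync `wait` default
SYNC_DURATION = 5     # assumption: how long one git-sync run churns the files

def parses_during_sync(horizon=300):
    """Return parse start times (seconds) that fall inside a sync window."""
    racy = []
    for t in range(0, horizon, PARSE_INTERVAL):  # parse start times
        phase = t % SYNC_INTERVAL                # position within the sync cycle
        if phase < SYNC_DURATION:                # a sync is still in progress
            racy.append(t)
    return racy

print(parses_during_sync())  # [0, 60, 120, 180, 240]
```

Under these (assumed) numbers a parse collides with an in-flight sync every 60 seconds, which is why the combination of intervals matters even though each default looks harmless on its own.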

Note

This problem does not seem to happen when the dag-processor process is part of the scheduler pod rather than being a standalone role.

Improving symlinks (does NOT fix the problem)

The /stackable/app/git-x folder looks like this:

drwxr-sr-x 9 stackable stackable 4096 Jan  8 17:01 .git
drwxr-sr-x 3 stackable stackable 4096 Jan  8 16:58 .worktrees
lrwxrwxrwx 1 stackable stackable   51 Jan  8 16:58 current -> .worktrees/933f524d2aac463b2e5904fe566af1a74b3ff378

with e.g. AIRFLOW__CORE__DAGS_FOLDER=/stackable/app/git-x/current/mount-dags-gitsync/dags_airflow3

current is flipped to a new worktree once the git-sync completes, but if Airflow is watching current (or something under it) it is not insulated from the filesystem churn that is happening: although the symlink update itself is atomic, the file operations seen through it are not.

An alternative could be to use the exechook parameter to flip a second symlink to the target DAG folder:

ln -sfn /stackable/app/git-x/current /stackable/app/airflow-dags
export AIRFLOW__CORE__DAGS_FOLDER=/stackable/app/airflow-dags/mount-dags-gitsync/dags_airflow3
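Note that `ln -sfn` itself is an unlink followed by a symlink, i.e. two steps. A truly atomic flip creates a temporary link and renames it over the old one; a minimal Python sketch of that pattern (function and directory names are illustrative, not from the operator):

```python
import os
import tempfile

def flip_symlink(target, link):
    """Atomically repoint `link` at `target`: create a temporary symlink,
    then rename it over the old one. rename() is atomic on POSIX
    filesystems, so readers never observe a missing or half-made link."""
    tmp = link + ".tmp"
    try:
        os.remove(tmp)          # clean up any leftover from a crashed run
    except FileNotFoundError:
        pass
    os.symlink(target, tmp)
    os.replace(tmp, link)       # the atomic step

# Demo: flip `current` between two worktrees with no window where it is gone.
root = tempfile.mkdtemp()
for name in ("worktree_a", "worktree_b"):
    os.mkdir(os.path.join(root, name))
link = os.path.join(root, "current")
flip_symlink(os.path.join(root, "worktree_a"), link)
flip_symlink(os.path.join(root, "worktree_b"), link)
print(os.readlink(link))  # path ending in "worktree_b"
```

Even with an atomic flip, as noted above this does not fix the problem: the contents reached through the link still change underneath a watcher.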

Force python to empty its caches (work-around)

This solves the problem. Add this at the top of each DAG:

import importlib
import site

importlib.reload(site)           # re-run site/path setup so new path entries are seen
importlib.invalidate_caches()    # drop stale finder caches (e.g. cached directory listings)
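A self-contained sketch of why the cache clearing matters: a module that appears on disk after Python has already cached the directory listing (as happens when git-sync delivers a submodule) may not be importable until the caches are invalidated. The module name and temp directory here are hypothetical:

```python
import importlib
import os
import sys
import tempfile

path = tempfile.mkdtemp()
sys.path.insert(0, path)

# A failed import lets the path finder cache the (still empty) directory.
try:
    import late_module  # hypothetical name; the file does not exist yet
except ImportError:
    pass

# git-sync now delivers the file...
with open(os.path.join(path, "late_module.py"), "w") as f:
    f.write("VALUE = 42\n")

# ...but a long-running process may keep the stale listing until this call.
importlib.invalidate_caches()
import late_module
print(late_module.VALUE)  # 42
```

This mirrors the situation of the long-running Airflow Python process: the interpreter, not Airflow, holds the stale view of the DAG folder.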

Always run git-sync also in an initContainer

This removes the initial race condition entirely: the init run of git-sync completes before Airflow's Python process starts, so submodules are found on the first parse. It does not, however, handle submodules that are added to the target repo in GitHub later. This appears to be a drawback of Python's import system rather than an Airflow problem per se.
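A rough sketch of what the initContainer could look like (image tag, repo URL, and paths are illustrative assumptions; `--one-time=true` is the flag referenced in the tasks below):

```yaml
initContainers:
  - name: gitsync-init
    image: registry.k8s.io/git-sync/git-sync:v4   # tag illustrative
    args:
      - --repo=https://github.com/example/dags-repo   # hypothetical repo
      - --root=/stackable/app/git-x
      - --one-time=true   # run a single sync and exit before Airflow starts
```

The regular sidecar would still run alongside it to pick up later commits.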

Tasks (Updated following testing)

  • extend documentation to highlight possible problematic config combinations and advise around them
  • research if the second symlink approach will work (it doesn't help)
  • implement initContainer with one-time=true (as well as the sidecar - but this won't help with later changes in the repo)
  • implement in the gitsync component in operator-rs
  • verify the workaround (with logging etc.), confirming that the race condition is caused by the long-running Airflow Python process not updating its cached submodule files
  • if so, update the airflow gitsync test (to add the dag-processor role) and the DAGs in the target repo to empty the cache
  • check demos to see if changes are needed and make them
  • roll out to airflow- and nifi-operators (not needed as there are no changes to the operator code)

Metadata

Status: Selected for Development