
Conversation

@keshav78-78
Purpose of PR?:

Refactor the containerd handler to process containerd events via a bounded worker-pool and introduce a small PID/namespace cache to improve event throughput and reduce latency under bursty container workloads, without changing existing behavior.

Fixes #2289

Does this PR introduce a breaking change?

No.
The external behavior of containerd handling, container maps, endpoints, and policy enforcement remains unchanged.
Only the internal execution model for containerd events is changed from synchronous to worker-pool based processing.
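For context, here is a minimal sketch of the worker-pool pattern this PR adopts. The pool size (8) and queue size (200) come from this PR; the containerdJob type, the startWorkerPool wiring, and the log text are illustrative stand-ins, not the exact KubeArmor code.

```go
package main

import (
	"fmt"
	"sync"
)

// containerdJob carries one containerd event to a worker.
// The payload shape is an assumption for this sketch.
type containerdJob struct {
	topic   string // e.g. /tasks/start, /tasks/exit, /containers/delete
	payload []byte // raw event body
}

const (
	numWorkers = 8   // fixed pool size, per this PR
	queueSize  = 200 // bounded queue, per this PR
)

// startWorkerPool launches numWorkers goroutines that drain a bounded
// job channel. Each job runs under a per-worker recover, so a panic in
// one event cannot take down the whole event loop (this is the recover
// around processContainerdJob mentioned under "Stability & error
// handling" below).
func startWorkerPool(process func(containerdJob)) (chan<- containerdJob, *sync.WaitGroup) {
	jobs := make(chan containerdJob, queueSize)
	var wg sync.WaitGroup
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for job := range jobs {
				func() {
					defer func() {
						if r := recover(); r != nil {
							fmt.Printf("[containerd-worker %d] recovered: %v\n", id, r)
						}
					}()
					process(job)
				}()
			}
		}(i)
	}
	return jobs, &wg
}

func main() {
	jobs, wg := startWorkerPool(func(j containerdJob) {
		fmt.Printf("[containerd-worker] handling event: topic=%s\n", j.topic)
	})
	jobs <- containerdJob{topic: "/tasks/start"}
	jobs <- containerdJob{topic: "/tasks/exit"}
	close(jobs)
	wg.Wait()
}
```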

If the changes in this PR are manually verified, list the scenarios covered:

  • Non-k8s mode with containerd:

    • Started KubeArmor with:
      • -k8s=false
      • -logPath=stdout
      • CRI_SOCKET=unix:///run/k3s/containerd/containerd.sock (via env)
    • Verified log line:
      • Started to monitor Containerd events (worker-pool mode)
  • Container lifecycle via containerd (ctr):

    • Pulled test image:
      • sudo ctr --address /run/k3s/containerd/containerd.sock -n k8s.io images pull docker.io/library/alpine:latest
    • Ran short-lived containers:
      • sudo ctr --address /run/k3s/containerd/containerd.sock -n k8s.io run docker.io/library/alpine:latest ct-test-1 sleep 3
      • Repeated runs to generate bursts of /tasks/start, /tasks/exit, /containers/delete.
    • Observed in KubeArmor logs:
      • Worker handling:
        • [containerd-worker] handling event: topic=/tasks/start
        • [containerd-worker] handling event: topic=/tasks/exit
        • [containerd-worker] handling event: topic=/containers/delete
      • Container lifecycle:
        • Detected a container (added/...)
        • Detected a container (removed/...)
      • Queue/worker metrics:
        • containerd events: queued=<N> processed=<M> busy=<K>
  • PID / Namespace correctness:

    • Confirmed that for started containers:
      • Pid, PidNS, MntNS are populated correctly via:
        • cached helper (getPrimaryPidAndNSCached) when eventpid == 0
        • /proc/<pid>/ns/{pid,mnt} lookup when eventpid != 0
    • Verified that these values still match the previous behavior (PID/NS enumeration via containerd and /proc); an illustrative cache sketch follows this list.
  • Stability & error handling:

    • Verified that:
      • KubeArmor continues running while processing containerd events in workers.
      • No panics observed due to per-worker recover around processContainerdJob.
      • On exit events, UpdateContainerdContainer(..., "destroy") is only invoked when the exiting PID matches the tracked container PID (same logic as before).
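For reference, below is an illustrative sketch of the caching described in the "PID / Namespace correctness" scenario above. The helper name getPrimaryPidAndNSCached and the 5s TTL come from this PR; the cache shape, field names, and resolve callback are assumptions for this sketch.

```go
package main

import (
	"fmt"
	"os"
	"sync"
	"time"
)

// nsInfo holds the primary PID and its pid/mnt namespace links.
type nsInfo struct {
	pid   uint32
	pidNS string
	mntNS string
}

type cacheEntry struct {
	info    nsInfo
	expires time.Time
}

var (
	nsCacheMu sync.Mutex
	nsCache   = map[string]cacheEntry{} // keyed by container ID
)

const nsCacheTTL = 5 * time.Second // short-lived cache, per this PR

// readNS resolves one namespace symlink, e.g. /proc/1234/ns/mnt.
func readNS(pid uint32, kind string) (string, error) {
	return os.Readlink(fmt.Sprintf("/proc/%d/ns/%s", pid, kind))
}

// getPrimaryPidAndNSCached returns cached PID/NS info for a container,
// calling resolve() again only after the TTL has expired. This avoids
// repeated /proc lookups when events for the same container arrive in
// bursts.
func getPrimaryPidAndNSCached(containerID string, resolve func() (nsInfo, error)) (nsInfo, error) {
	nsCacheMu.Lock()
	defer nsCacheMu.Unlock()
	if e, ok := nsCache[containerID]; ok && time.Now().Before(e.expires) {
		return e.info, nil
	}
	info, err := resolve()
	if err != nil {
		return nsInfo{}, err
	}
	nsCache[containerID] = cacheEntry{info: info, expires: time.Now().Add(nsCacheTTL)}
	return info, nil
}

func main() {
	// Use the current process as a stand-in for a container's primary PID.
	info, err := getPrimaryPidAndNSCached("ct-test-1", func() (nsInfo, error) {
		pid := uint32(os.Getpid())
		pidNS, err := readNS(pid, "pid")
		if err != nil {
			return nsInfo{}, err
		}
		mntNS, err := readNS(pid, "mnt")
		if err != nil {
			return nsInfo{}, err
		}
		return nsInfo{pid: pid, pidNS: pidNS, mntNS: mntNS}, nil
	})
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Printf("pid=%d pidns=%s mntns=%s\n", info.pid, info.pidNS, info.mntNS)
}
```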

Additional information for reviewer?:

  • This change is primarily a performance / scalability improvement for containerd event handling:
    • Moves event processing from a single synchronous loop to a bounded pool of 8 workers fed by a 200-entry queue.
    • Introduces a short-lived PID/NS cache (5s) to avoid repeated /proc namespace lookups when events arrive in bursts.
    • Keeps all existing data structures and external behavior intact.
  • Concurrency model:
    • Full parallel mode is used (no strict per-container ordering is enforced).
    • This matches current expectations where strict ordering between containerd events for the same container is not required.
  • Debug / observability:
    • Additional [containerd-worker] ... logs and periodic queue metrics were added to make it easier to profile and monitor behavior in real environments (a sketch of such a metrics logger follows this list).
    • Most of this extra debug logging has already been reduced or removed in this final PR.
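As mentioned, here is a rough sketch of how the periodic "containerd events: queued=<N> processed=<M> busy=<K>" line could be produced. The counter names and the logging interval are assumptions for illustration.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

var (
	queued    atomic.Int64 // incremented when an event is enqueued
	processed atomic.Int64 // incremented when a worker finishes a job
	busy      atomic.Int64 // workers currently handling an event
)

// startMetricsLogger prints queue/worker counters at a fixed interval
// until the stop channel is closed.
func startMetricsLogger(interval time.Duration, stop <-chan struct{}) {
	go func() {
		t := time.NewTicker(interval)
		defer t.Stop()
		for {
			select {
			case <-t.C:
				fmt.Printf("containerd events: queued=%d processed=%d busy=%d\n",
					queued.Load(), processed.Load(), busy.Load())
			case <-stop:
				return
			}
		}
	}()
}

func main() {
	stop := make(chan struct{})
	startMetricsLogger(time.Second, stop)
	queued.Add(2)    // pretend two events were enqueued...
	processed.Add(2) // ...and both were handled
	time.Sleep(1500 * time.Millisecond)
	close(stop)
}
```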

Checklist:

  • Bug fix. Fixes #
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • PR Title follows the convention of <type>(<scope>): <subject>
  • Commit has unit tests
  • Commit has integration tests

@keshav78-78 force-pushed the containerd-workerpool-clean-1 branch from 160c240 to 413b5d3 on December 5, 2025 12:59
@keshav78-78 (Author)

Review please: @Aryan-sharma11, @rksharma95, @AryanBakliwal

Aryan-sharma11 and others added 7 commits December 7, 2025 08:26
Signed-off-by: Aryan-sharma11 <aryan1126.sharma@gmail.com>
Signed-off-by: Keshav Kapoor <keshav.10919051722@std.ggsipu.ac.in>
Signed-off-by: Keshav Kapoor <keshav.10919051722@std.ggsipu.ac.in>
Signed-off-by: Aryan-sharma11 <aryan1126.sharma@gmail.com>
Signed-off-by: Keshav Kapoor <keshav.10919051722@std.ggsipu.ac.in>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: Keshav Kapoor <keshav.10919051722@std.ggsipu.ac.in>
Signed-off-by: Jones Jefferson <jones@accuknox.com>
Signed-off-by: Keshav Kapoor <keshav.10919051722@std.ggsipu.ac.in>
Signed-off-by: Jones Jefferson <jones@accuknox.com>
Signed-off-by: Keshav Kapoor <keshav.10919051722@std.ggsipu.ac.in>
Signed-off-by: Keshav Kapoor <keshav.10919051722@std.ggsipu.ac.in>
@keshav78-78 force-pushed the containerd-workerpool-clean-1 branch from f360a68 to c2268e9 on December 7, 2025 08:27
@rksharma95 (Collaborator)

Hi @keshav78-78 thanks for your PR,

I have a few things to discuss before adopting this worker-pool approach:

First, the only point where the number of events could be large is during startup, when existing containers are synced. Once that sync is done, IMO we don't expect event bursts, and in that case the existing approach looks simpler and has no performance bottleneck.

Can we measure the performance impact in both scenarios?

  • Throughput with and without the worker pool.
  • CPU usage in both scenarios.
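One possible way to get both numbers (this is a sketch, not code from this PR): a Go micro-benchmark that pushes synthetic events through each execution model, with handleEvent as a stand-in for the real per-event work.

```go
package bench

import (
	"sync"
	"testing"
)

// handleEvent is a placeholder for the real per-event processing.
func handleEvent(i int) { _ = i * i }

// BenchmarkSynchronous models the current single-loop approach.
func BenchmarkSynchronous(b *testing.B) {
	for i := 0; i < b.N; i++ {
		handleEvent(i)
	}
}

// BenchmarkWorkerPool models the proposed 8-worker, 200-entry queue.
func BenchmarkWorkerPool(b *testing.B) {
	jobs := make(chan int, 200)
	var wg sync.WaitGroup
	for w := 0; w < 8; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				handleEvent(i)
			}
		}()
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
}
```

Running it with go test -bench . -cpuprofile cpu.out would give throughput (ns/op) for each variant plus a CPU profile to compare usage.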


Development

Successfully merging this pull request may close these issues.

Introduce Worker-Pool for Containerd Event Handling to Prevent Blocking and Improve Throughput
