Migrate from NGINX Ingress to Envoy Gateway by Mushtaq-BGA · Pull Request #81 · opea-project/Enterprise-Inference

Mushtaq-BGA · 2026-03-31T06:05:46Z

Replace NGINX Ingress with Envoy Gateway as cluster edge provider
Convert all ingress.yaml helm templates to HTTPRoute resources
Add GatewayClass, Gateway, and Envoy Backend support
Update deploy-ingress-controller.yml for Envoy Gateway deployment
Add Keycloak+APISIX HTTPRoute integration for model routing
Add envoy-gateway-deployment-guide and migration docs
Update observability, genai-gateway, and model chart templates
Support edge_provider config option (nginx/envoy) in inference-config.cfg

vhpintel · 2026-03-31T15:56:39Z

core/inventory/inference-config.cfg

Automation scripts still require models, hugging_face_token, and cpu_or_gpu. We still need these configs, right? Why are we removing them?

Also, the Envoy changes should not affect the config and hosts YAML files. Why are these modified?

The hosts.yaml and all.yml changes were accidentally committed with local test values. I've reverted both to their original generalized state. The only remaining config change is adding http_proxy/https_proxy/no_proxy fields to inference-config.cfg, which is needed for Envoy Gateway: the OCI helm chart pull (oci://docker.io/envoyproxy/gateway-helm) requires proxy settings in environments behind a proxy, and centralizing them in the config file keeps deployment consistent.
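For illustration, a minimal sketch of how centralized proxy fields could be picked up before the chart pull. It assumes the config file uses plain key=value lines, which this thread does not confirm, and `load_proxy_settings` is a hypothetical helper, not a function from the repository.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: read proxy fields from a key=value config file and
# export them for later commands in the same shell.
# ASSUMPTION: the file layout is key=value; this is not taken from the PR.

load_proxy_settings() {
    local cfg_file="$1" key value
    # The "|| [ -n "$key" ]" part handles a last line without a newline.
    while IFS='=' read -r key value || [ -n "$key" ]; do
        case "$key" in
            http_proxy|https_proxy|no_proxy)
                # Strip optional surrounding double quotes from the value.
                value="${value%\"}"
                value="${value#\"}"
                # Export only non-empty values so empty defaults are ignored.
                if [ -n "$value" ]; then
                    export "$key=$value"
                fi
                ;;
        esac
    done < "$cfg_file"
}
```

After calling `load_proxy_settings` on the config file, commands started from the same shell inherit the exported variables, so a chart pull from oci://docker.io/envoyproxy/gateway-helm would go through the configured proxy.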

amberjain1 · 2026-04-01T02:46:06Z

core/helm-charts/genai-gateway-trace/charts/langfuse/values.yaml

      path: "/api/public/health"
      # -- Initial delay seconds for livenessProbe.
-      initialDelaySeconds: 20
+      initialDelaySeconds: 120


Do we really have to raise these probe values so high? In which cases were they failing?

On single-node deployments, Langfuse was getting killed in a restart loop before it could finish initializing. The root cause is that Langfuse depends on ClickHouse + PostgreSQL + Redis all starting on the same node, and ClickHouse runs schema migrations on first boot which are CPU/IO-heavy.

With the original values (initialDelaySeconds: 20, failureThreshold: 3, periodSeconds: 10), k8s would kill the pod after ~50 seconds (20 + 3×10). On a single node where ClickHouse is still migrating, Langfuse's /api/public/health endpoint isn't reachable yet at that point.
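The 20 + 3×10 estimate can be reproduced with a small helper. This is a rough upper bound only: the exact restart time depends on when probes actually fire and on timeoutSeconds. `liveness_budget_seconds` is a hypothetical name, and the second call assumes failureThreshold and periodSeconds were left unchanged, which the diff hunk does not show.

```shell
#!/usr/bin/env bash
# Upper-bound estimate of how long a container can stay unhealthy before
# failing liveness probes trigger a restart:
#   initialDelaySeconds + failureThreshold * periodSeconds
# Probe timeouts and scheduling jitter are ignored.
liveness_budget_seconds() {
    local initial_delay="$1" failure_threshold="$2" period="$3"
    echo $(( initial_delay + failure_threshold * period ))
}

liveness_budget_seconds 20 3 10    # original values from the thread: prints 50
liveness_budget_seconds 120 3 10   # new initialDelaySeconds, rest assumed: prints 150
```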

Hence I tried to reduce the replicas of the GenAI resources for single-node testing.

amberjain1 · 2026-04-01T02:53:41Z

core/helm-charts/vllm/templates/ingress.yaml

-# and an empty file will abort the edit. If an error occurs while saving this file will be
-# reopened with the relevant failures.
-{{- if .Values.ingress.enabled}}
+{{- if .Values.ingress.enabled }}


Is .Values.ingress.enabled still valid for non-EKS?
Also, for EKS, do we not have the Envoy option?

Yes, ingress.enabled is still valid for non-EKS. Here's the flow:

- install-model.sh sets ingress_enabled=true when deploy_ingress_controller=yes (regardless of platform).
- This is passed to the Helm chart as --set ingress.enabled=true.
- For non-EKS on-prem: ingress.yaml renders an HTTPRoute (Envoy Gateway), which is the migration target.
- For EKS: ingress_eks.yaml renders a traditional Ingress (ALB), gated by {{- if and .Values.ingress.enabled (eq .Values.platform "eks") }}.

So the same ingress.enabled flag controls both paths; the template logic decides whether to create an HTTPRoute or an ALB Ingress based on platform.
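The flow above can be sketched as a small decision function. This follows the description in the reply rather than the templates themselves, and both function names are hypothetical; they do not come from install-model.sh or the charts.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the flag and platform logic described in the reply.

# Mirrors: ingress_enabled=true when deploy_ingress_controller=yes.
ingress_enabled_for() {
    if [ "$1" = "yes" ]; then echo "true"; else echo "false"; fi
}

# Mirrors the template gating: which resource kind ends up rendered.
rendered_resource() {
    local ingress_enabled="$1" platform="$2"
    if [ "$ingress_enabled" != "true" ]; then
        echo "none"
    elif [ "$platform" = "eks" ]; then
        echo "Ingress"      # ingress_eks.yaml (ALB)
    else
        echo "HTTPRoute"    # ingress.yaml (Envoy Gateway)
    fi
}
```

For example, `rendered_resource "$(ingress_enabled_for yes)" onprem` yields HTTPRoute, while the same call with `eks` yields Ingress. Here `onprem` is only a placeholder for any non-EKS platform value.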

amberjain1 · 2026-04-01T02:54:25Z

core/inventory/metadata/all.yml

-# http_proxy: ""
-# https_proxy: ""
+http_proxy: ""
+https_proxy: ""


Why are these uncommented?

My local proxy values got into this commit. I've reverted all.yml back to the original state with commented-out proxy lines and empty defaults.

amberjain1 · 2026-04-01T02:55:39Z

core/inventory/hosts.yaml

+    master1:
+      ansible_connection: local
+      ansible_user: general
+      ansible_become: true


Do not remove the generalizations

amberjain1 · 2026-04-01T02:57:28Z

core/inventory/inference-config.cfg

Also, the Envoy changes should not affect the config and hosts YAML files. Why are these modified?

amberjain1 · 2026-04-01T02:59:49Z

core/lib/system/precheck/readiness-check.sh

+    if [ ! -f "$HOMEDIR/inventory/hosts.yaml" ]; then
+        echo -e "${YELLOW}Inventory file not found — auto-generating hosts.yaml for single-node...${NC}"
+        bash "$HOMEDIR/scripts/generate-hosts.sh"
+    fi


Is this change needed at all? It is definitely not part of the Envoy-related changes.

amberjain1 · 2026-04-01T03:00:09Z

core/lib/system/setup-env.sh

+        echo -e "${YELLOW}No hosts.yaml found — auto-generating for single-node deployment...${NC}"
+        bash "$HOMEDIR/scripts/generate-hosts.sh"
+    fi
+


Same as above.

amberjain1 · 2026-04-01T03:01:26Z

core/playbooks/deploy-genai-gateway.yml

+        --set redis.primary.resources.limits.cpu=500m
+        --set redis.primary.resources.requests.memory=256Mi
+        --set redis.primary.resources.limits.memory=512Mi
+        {% endif %}


Why are we modifying the ClickHouse and Redis settings?
If they are not directly related to this change, create a separate PR for them.

As I said in an earlier comment, on a single node I was getting resource issues, hence the changes for single-node deployments. I can revert these changes, keep the files as they were, and create a separate PR for this.

amberjain1 · 2026-04-01T03:07:10Z

core/scripts/generate-vault-secrets.sh

Let's not change the file permissions.

amberjain1 · 2026-04-01T03:09:31Z

docs/single-node-deployment.md

+<details>
+<summary>Optional: Manual override</summary>
+
+If you need to customize the inventory (e.g., use a different user or SSH key), you can still create the file manually:


It is good to copy host.yaml and config.yaml from the example folder, as it helps users get comfortable with how to use them.

Yes, I will keep it the same as before.

- Replace NGINX Ingress with Envoy Gateway as cluster edge provider
- Convert all ingress.yaml helm templates to HTTPRoute resources
- Add GatewayClass, Gateway, and EnvoyProxy configuration
- Update deploy-ingress-controller.yml for Envoy Gateway deployment
- Add Keycloak, Grafana, GenAI Gateway Trace HTTPRoute integration
- Gate EKS ingress_eks.yaml templates with platform check
- Rename run_ingress_nginx_playbook() to run_edge_gateway_playbook()
- Add envoy-gateway-deployment-guide and migration docs
- Update observability, genai-gateway, and model chart templates
- Increase Langfuse probe thresholds for single-node stability
- Add single-node ClickHouse/Redis resource limits
- Update proxy handling in read-config-file.sh
- Rename Gaudi references to Intel AI Accelerator in docs
- Update OVMS model deploy guide with generic model routing

vhpintel · 2026-04-02T09:15:46Z

Tested the happy path, and the Envoy Gateway pods are healthy from a runtime perspective. We just need to rename the config option so that it refers to the gateway instead of ingress.

vhpintel

LGTM

vhpintel reviewed Mar 31, 2026

amberjain1 reviewed Apr 1, 2026

Mushtaq-BGA force-pushed the main branch 2 times, most recently from 435fc81 to a2fdc15, April 1, 2026 06:22

Mushtaq-BGA force-pushed the main branch from a7f0875 to 2e7d300, April 1, 2026 06:26

Merge branch 'opea-project:main' into main

ee8ba9d

Mushtaq-BGA requested review from amberjain1 and vhpintel April 1, 2026 07:37

vhpintel approved these changes Apr 2, 2026
