Migrate from NGINX Ingress to Envoy Gateway #81
Mushtaq-BGA wants to merge 2 commits into opea-project:main from
Mushtaq-BGA commented Mar 31, 2026
- Replace NGINX Ingress with Envoy Gateway as cluster edge provider
- Convert all ingress.yaml helm templates to HTTPRoute resources
- Add GatewayClass, Gateway, and Envoy Backend support
- Update deploy-ingress-controller.yml for Envoy Gateway deployment
- Add Keycloak+APISIX HTTPRoute integration for model routing
- Add envoy-gateway-deployment-guide and migration docs
- Update observability, genai-gateway, and model chart templates
- Support edge_provider config option (nginx/envoy) in inference-config.cfg
Automation scripts still require models, hugging_face_token, and cpu_or_gpu. We still need these configs, right? Why are we removing them?
Also, the Envoy changes should not impact the config and host YAML files. Why are these modified?
The hosts.yaml and all.yml changes were accidentally committed with local test values. I've reverted both to their original generalized state. The only remaining config change is the addition of http_proxy/https_proxy/no_proxy fields to inference-config.cfg, which Envoy Gateway needs: the OCI helm chart pull (oci://docker.io/envoyproxy/gateway-helm) requires proxy settings in environments behind a proxy, and centralizing them in the config file keeps deployment consistent.
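As a rough illustration of the proxy handling described above, the sketch below exports the three proxy fields so that a later helm OCI pull can pick them up from the environment. It assumes inference-config.cfg holds plain, unquoted key=value lines; `load_proxy_settings` is a hypothetical helper, and the actual parsing in read-config-file.sh may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not the project's code. Assumes the config file
# contains unquoted key=value lines, one per line.
load_proxy_settings() {
  local cfg=$1 key value
  while IFS='=' read -r key value; do
    case "$key" in
      http_proxy|https_proxy|no_proxy)
        # Export only non-empty values so hosts without a proxy are unaffected.
        if [ -n "$value" ]; then
          export "$key=$value"
        fi
        ;;
    esac
  done < "$cfg"
}

# Example usage (the chart reference is the one quoted in the comment above):
#   load_proxy_settings inference-config.cfg
#   helm pull oci://docker.io/envoyproxy/gateway-helm
```

Exporting only non-empty values keeps the empty defaults in the config file harmless on machines with direct internet access.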
  path: "/api/public/health"
  # -- Initial delay seconds for livenessProbe.
- initialDelaySeconds: 20
+ initialDelaySeconds: 120
Do we really need to raise these probe values this much? In which cases were the probes failing?
On single-node deployments, Langfuse was getting killed in a restart loop before it could finish initializing. The root cause is that Langfuse depends on ClickHouse, PostgreSQL, and Redis all starting on the same node, and ClickHouse runs CPU- and IO-heavy schema migrations on first boot.
With the original values (initialDelaySeconds: 20, failureThreshold: 3, periodSeconds: 10), k8s would kill the pod after ~50 seconds (20 + 3×10). On a single node where ClickHouse is still migrating, Langfuse's /api/public/health endpoint isn't reachable yet at that point.
Hence I also tried reducing the replicas of the genai resources for single-node testing.
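The estimate in this comment can be reproduced with a small calculation. This is only a sketch of the approximation used above (initial delay plus failure threshold times period), and `probe_budget` is a hypothetical helper name. For the new value, failureThreshold and periodSeconds are assumed unchanged, which the diff excerpt does not confirm.

```shell
#!/usr/bin/env bash
# Approximate seconds before a container whose liveness probe never succeeds
# gets restarted, using the estimate from the comment above:
#   initialDelaySeconds + failureThreshold * periodSeconds
probe_budget() {
  local initial_delay=$1 failure_threshold=$2 period=$3
  echo $(( initial_delay + failure_threshold * period ))
}

# Original values quoted in the comment.
probe_budget 20 3 10

# Raised initialDelaySeconds; the other two values are assumed unchanged.
probe_budget 120 3 10
```

Under that assumption, the startup window grows from about 50 to about 150 seconds.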
  # and an empty file will abort the edit. If an error occurs while saving this file will be
  # reopened with the relevant failures.
- {{- if .Values.ingress.enabled}}
+ {{- if .Values.ingress.enabled }}
Is .Values.ingress.enabled still valid for non-EKS?
Also, for EKS, do we not have the Envoy option?
Yes, ingress.enabled is still valid for non-EKS. Here's the flow:
- install-model.sh sets ingress_enabled=true when deploy_ingress_controller=yes (regardless of platform).
- This is passed to the Helm chart as --set ingress.enabled=true.
- For non-EKS on-prem: ingress.yaml renders an HTTPRoute (Envoy Gateway). This is the migration target.
- For EKS: ingress_eks.yaml renders a traditional Ingress (ALB), gated by {{- if and .Values.ingress.enabled (eq .Values.platform "eks") }}
So the same ingress.enabled flag controls both paths: the template logic decides whether to create an HTTPRoute or an ALB Ingress based on platform.
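The decision described in this reply can be sketched as a small shell function. `rendered_resource` is a hypothetical helper written only to mirror the explanation. The real logic lives in the Helm templates, and treating every non-"eks" platform as the HTTPRoute path is an assumption drawn from the reply rather than from the chart itself.

```shell
#!/usr/bin/env bash
# Hypothetical mirror of the template gating described above.
#   $1: value of ingress.enabled ("true" or "false")
#   $2: value of platform (for example "eks"; anything else is treated as on-prem)
rendered_resource() {
  local ingress_enabled=$1 platform=$2
  if [ "$ingress_enabled" != "true" ]; then
    echo "none"
  elif [ "$platform" = "eks" ]; then
    echo "Ingress (ALB) from ingress_eks.yaml"
  else
    echo "HTTPRoute from ingress.yaml"
  fi
}

rendered_resource true eks
rendered_resource true onprem
rendered_resource false onprem
```

Keeping a single flag and branching on platform means install-model.sh does not need to know which edge resource a chart will create.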
core/inventory/metadata/all.yml
Outdated
- # http_proxy: ""
- # https_proxy: ""
+ http_proxy: ""
+ https_proxy: ""
Why are these uncommented?
My local proxy values slipped into this commit. I've reverted all.yml to its original state, with commented-out proxy lines and empty defaults.
core/inventory/hosts.yaml
Outdated
master1:
  ansible_connection: local
  ansible_user: general
  ansible_become: true
Do not remove the generalizations
if [ ! -f "$HOMEDIR/inventory/hosts.yaml" ]; then
  echo -e "${YELLOW}Inventory file not found — auto-generating hosts.yaml for single-node...${NC}"
  bash "$HOMEDIR/scripts/generate-hosts.sh"
fi
Is this change needed at all? It is definitely not part of the Envoy-related changes.
  echo -e "${YELLOW}No hosts.yaml found — auto-generating for single-node deployment...${NC}"
  bash "$HOMEDIR/scripts/generate-hosts.sh"
fi
--set redis.primary.resources.limits.cpu=500m
--set redis.primary.resources.requests.memory=256Mi
--set redis.primary.resources.limits.memory=512Mi
{% endif %}
Why are we modifying the ClickHouse and Redis settings?
If this is not directly related to this change, create a separate PR for it.
As I said in an earlier comment, I was running into resource issues on a single node, hence the single-node changes. I can revert these changes and create a separate PR for them.
Let's not change the file permissions.
docs/single-node-deployment.md
Outdated
<details>
<summary>Optional: Manual override</summary>

If you need to customize the inventory (e.g., use a different user or SSH key), you can still create the file manually:
It is good to copy host.yaml and config.yaml from the example folder, as it helps users get comfortable with how to use them.
Yes, I will keep it the same as before.
435fc81 to a2fdc15
- Replace NGINX Ingress with Envoy Gateway as cluster edge provider
- Convert all ingress.yaml helm templates to HTTPRoute resources
- Add GatewayClass, Gateway, and EnvoyProxy configuration
- Update deploy-ingress-controller.yml for Envoy Gateway deployment
- Add Keycloak, Grafana, GenAI Gateway Trace HTTPRoute integration
- Gate EKS ingress_eks.yaml templates with platform check
- Rename run_ingress_nginx_playbook() to run_edge_gateway_playbook()
- Add envoy-gateway-deployment-guide and migration docs
- Update observability, genai-gateway, and model chart templates
- Increase Langfuse probe thresholds for single-node stability
- Add single-node ClickHouse/Redis resource limits
- Update proxy handling in read-config-file.sh
- Rename Gaudi references to Intel AI Accelerator in docs
- Update OVMS model deploy guide with generic model routing
Tested the happy path; the Envoy Gateway pods are healthy from a runtime perspective. We just need to rename the config option so it refers to the gateway instead of ingress.