Migrate from NGINX Ingress to Envoy Gateway #81

Open
Mushtaq-BGA wants to merge 2 commits into opea-project:main from Mushtaq-BGA:main

Conversation

@Mushtaq-BGA

  • Replace NGINX Ingress with Envoy Gateway as cluster edge provider
  • Convert all ingress.yaml helm templates to HTTPRoute resources
  • Add GatewayClass, Gateway, and Envoy Backend support
  • Update deploy-ingress-controller.yml for Envoy Gateway deployment
  • Add Keycloak+APISIX HTTPRoute integration for model routing
  • Add envoy-gateway-deployment-guide and migration docs
  • Update observability, genai-gateway, and model chart templates
  • Support edge_provider config option (nginx/envoy) in inference-config.cfg

Contributor

The automation scripts still require models, hugging_face_token, and cpu_or_gpu. We still need these configs, right? Why are we removing them?

Collaborator

Also, the Envoy changes should not impact the config and host YAML files. Why are these modified?

Author

The host.yaml and all.yaml changes were accidentally committed with local test values. I've reverted both to their original generalized state. The only remaining config change is adding http_proxy/https_proxy/no_proxy fields to inference-config.cfg, which Envoy Gateway needs: the OCI Helm chart pull (oci://docker.io/envoyproxy/gateway-helm) requires proxy settings in environments behind a proxy, and centralizing them in the config file keeps deployment consistent.
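
For illustration only, the proxy handling described here could look roughly like the sketch below. It assumes the three values have already been read from inference-config.cfg into shell variables. The function name `export_proxy_env` is hypothetical and does not come from this PR, and the actual logic in read-config-file.sh may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): export proxy settings only when they
# are set, so that later commands (such as the OCI chart pull) inherit them.
export_proxy_env() {
    local http="$1" https="$2" no="$3"
    # Empty values are skipped so environments without a proxy stay untouched.
    [ -n "$http" ]  && export http_proxy="$http"   HTTP_PROXY="$http"
    [ -n "$https" ] && export https_proxy="$https" HTTPS_PROXY="$https"
    [ -n "$no" ]    && export no_proxy="$no"       NO_PROXY="$no"
    # Always succeed, even when the last value was empty.
    return 0
}
```

Calling `export_proxy_env "$http_proxy_cfg" "$https_proxy_cfg" "$no_proxy_cfg"` before the chart pull (the `*_cfg` variable names are placeholders) would leave the environment unchanged when all three config fields are empty.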

  path: "/api/public/health"
  # -- Initial delay seconds for livenessProbe.
- initialDelaySeconds: 20
+ initialDelaySeconds: 120

Collaborator

Do we really have to increase these probes so high? What were the cases where these were failing?

Author

On single-node deployments, Langfuse was getting killed in a restart loop before it could finish initializing. The root cause is that Langfuse depends on ClickHouse + PostgreSQL + Redis all starting on the same node, and ClickHouse runs schema migrations on first boot which are CPU/IO-heavy.

With the original values (initialDelaySeconds: 20, failureThreshold: 3, periodSeconds: 10), k8s would kill the pod after ~50 seconds (20 + 3×10). On a single node where ClickHouse is still migrating, Langfuse's /api/public/health endpoint isn't reachable yet at that point.

Hence I tried reducing the replicas of the GenAI resources for single-node testing.
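
The estimate in this comment can be reproduced with a small sketch. The function name `probe_kill_after` is hypothetical, and the formula is the comment's own back-of-the-envelope approximation. Actual kubelet timing also depends on timeoutSeconds and on when each probe fires. The second call assumes failureThreshold and periodSeconds stay at 3 and 10, which the thread does not confirm.

```shell
#!/usr/bin/env bash
# Rough estimate used in the comment:
#   initialDelaySeconds + failureThreshold * periodSeconds
# This is an approximation, not the kubelet's exact behaviour.
probe_kill_after() {
    local initial_delay="$1" failure_threshold="$2" period="$3"
    echo $(( initial_delay + failure_threshold * period ))
}

probe_kill_after 20 3 10    # original values from the thread: prints 50
probe_kill_after 120 3 10   # new initialDelaySeconds, other values assumed unchanged: prints 150
```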

  # and an empty file will abort the edit. If an error occurs while saving this file will be
  # reopened with the relevant failures.
- {{- if .Values.ingress.enabled}}
+ {{- if .Values.ingress.enabled }}

Collaborator

Is values.ingress.enabled still valid for non-EKS?
Also, for EKS, do we not have the Envoy option?

Author

Yes, ingress.enabled is still valid for non-EKS. Here's the flow:

- install-model.sh sets ingress_enabled=true when deploy_ingress_controller=yes (regardless of platform).
- This is passed to the Helm chart as --set ingress.enabled=true.
- For non-EKS on-prem: ingress.yaml renders an HTTPRoute (Envoy Gateway). This is the migration target.
- For EKS: ingress_eks.yaml renders a traditional Ingress (ALB), gated by {{- if and .Values.ingress.enabled (eq .Values.platform "eks") }}.

So the same ingress.enabled flag controls both paths: the template logic decides whether to create an HTTPRoute or an ALB Ingress based on platform.
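
The decision described in this reply can be modelled as a small shell sketch. It is only an illustration: the real gating lives in the Helm templates, the function name `edge_resource_kind` is hypothetical, and it assumes ingress.yaml is not rendered on EKS, which the thread implies but does not state outright.

```shell
#!/usr/bin/env bash
# Hypothetical model of the template gating; not code from the PR.
edge_resource_kind() {
    local ingress_enabled="$1" platform="$2"
    if [ "$ingress_enabled" != "true" ]; then
        echo "none"        # ingress.enabled is off, so nothing is rendered
    elif [ "$platform" = "eks" ]; then
        echo "Ingress"     # ingress_eks.yaml: traditional ALB Ingress
    else
        echo "HTTPRoute"   # ingress.yaml: Envoy Gateway route (assumed skipped on EKS)
    fi
}
```

For example, `edge_resource_kind true eks` prints `Ingress`, while any other platform value with the flag enabled prints `HTTPRoute`.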

- # http_proxy: ""
- # https_proxy: ""
+ http_proxy: ""
+ https_proxy: ""

Collaborator

Why are these uncommented?

Author

My local proxy values got into this commit. I've reverted all.yml back to the original state, with commented-out proxy lines and empty defaults.

master1:
  ansible_connection: local
  ansible_user: general
  ansible_become: true

Collaborator

Do not remove the generalizations

Author

reverted

if [ ! -f "$HOMEDIR/inventory/hosts.yaml" ]; then
    echo -e "${YELLOW}Inventory file not found — auto-generating hosts.yaml for single-node...${NC}"
    bash "$HOMEDIR/scripts/generate-hosts.sh"
fi

Collaborator

Is this change needed at all? It is definitely not part of the Envoy-related changes.

    echo -e "${YELLOW}No hosts.yaml found — auto-generating for single-node deployment...${NC}"
    bash "$HOMEDIR/scripts/generate-hosts.sh"
fi

Collaborator

Same as above.

--set redis.primary.resources.limits.cpu=500m
--set redis.primary.resources.requests.memory=256Mi
--set redis.primary.resources.limits.memory=512Mi
{% endif %}

Collaborator

Why are we modifying the ClickHouse and Redis settings?
If they are not directly related to this change, create a separate PR for them.

Author

As I said in an earlier comment, on a single node I was getting resource issues, hence the change for single-node. I can revert these changes, keep the file as it was, and create a separate PR for this.

Collaborator

Let's not change the file permissions.

Author

reverted

<details>
<summary>Optional: Manual override</summary>

If you need to customize the inventory (e.g., use a different user or SSH key), you can still create the file manually:

Collaborator

It is good to copy host.yaml and config.yaml from the example folder, as it helps users get comfortable with how to use them.

Author

Yes, I will keep it the same as before.

Mushtaq-BGA force-pushed the main branch 2 times, most recently from 435fc81 to a2fdc15 on April 1, 2026 06:22

- Replace NGINX Ingress with Envoy Gateway as cluster edge provider
- Convert all ingress.yaml helm templates to HTTPRoute resources
- Add GatewayClass, Gateway, and EnvoyProxy configuration
- Update deploy-ingress-controller.yml for Envoy Gateway deployment
- Add Keycloak, Grafana, GenAI Gateway Trace HTTPRoute integration
- Gate EKS ingress_eks.yaml templates with platform check
- Rename run_ingress_nginx_playbook() to run_edge_gateway_playbook()
- Add envoy-gateway-deployment-guide and migration docs
- Update observability, genai-gateway, and model chart templates
- Increase Langfuse probe thresholds for single-node stability
- Add single-node ClickHouse/Redis resource limits
- Update proxy handling in read-config-file.sh
- Rename Gaudi references to Intel AI Accelerator in docs
- Update OVMS model deploy guide with generic model routing

@vhpintel
Contributor

vhpintel commented Apr 2, 2026

Tested the happy path, and the Envoy Gateway pods are healthy from a runtime perspective. We just need to rename the config option so it refers to the gateway instead of ingress.

Contributor

vhpintel left a comment

LGTM

3 participants