Here’s a detailed breakdown of the common root causes behind Kubernetes LoadBalancer problems — organized by layer of the system, from infrastructure to application. Each category maps back to the issues from the previous list.
🏗️ 1. Infrastructure & Cloud Provider Causes
| Cause |
Explanation |
| Cloud API limits |
Each provider (AWS, GCP, Azure) enforces quotas for active load balancers or public IPs. Hitting limits leaves services in Pending state. |
| IAM permission gaps |
The Kubernetes Cloud Controller Manager (CCM) can’t call APIs like CreateLoadBalancer, causing silent or retry-loop failures. |
| Incorrect cloud tags |
Nodes missing kubernetes.io/cluster/<name>=owned tag → CCM fails to attach backends. |
| Regional vs zonal LB mismatch |
Nodes deployed in different zones than the LB region prevent registration. |
| Unmanaged network routes |
Provider VPC routing tables not updated for the service subnet. |
| Cloud controller crashes |
Outdated or mismatched cloud-controller-manager images lead to reconciliation bugs. |
| Security group misconfigurations |
Missing inbound rules for NodePorts, blocking health checks or client connections. |
🌐 2. Network Stack & CNI-Level Causes
| Cause |
Explanation |
| Overlay/underlay mismatch |
VXLAN or Calico encapsulation conflicts with external routes; packets drop at the host boundary. |
| MTU inconsistencies |
Large packets fragmented or dropped across overlay networks. |
| ARP/NDP suppression |
MetalLB can’t announce VIPs because kernel ARP settings block them. |
| conntrack saturation |
Stateful connections overflow Linux conntrack tables → random drops and “stuck” TCP sessions. |
| Reverse path filtering |
Kernel rp_filter=1 drops return traffic that appears asymmetric. |
| BGP session flapping |
In MetalLB, unstable BGP peer state due to router misconfig or missing keepalives. |
| No masquerade for pod egress |
External LBs can’t reach pod IPs directly when masquerading disabled. |
| Unreachable node IPs |
Nodes behind NAT or private subnets cannot receive LB health checks. |
⚙️ 3. Service & kube-proxy Causes
| Cause |
Explanation |
| kube-proxy mode mismatch |
IPVS, iptables, or userspace modes behave differently; inconsistent conntrack rules. |
| Stale iptables rules |
kube-proxy crashed or desynced from API state → packets misrouted. |
| ExternalTrafficPolicy misused |
Local skips SNAT but drops traffic if no pod on that node → partial reachability. |
| NodePort range firewall blocked |
External firewalls not opened for 30000–32767. |
| Service CIDR overlap |
Overlapping with cluster or VPC CIDR → routing ambiguity. |
| Multiple services sharing same selector |
LBs front unintended pods due to selector collision. |
🧱 4. Configuration & Annotation Causes
| Cause |
Explanation |
| Missing annotations |
Cloud-specific options (service.beta.kubernetes.io/aws-load-balancer-type) required for correct LB provisioning. |
| Wrong health check port/path |
LB fails health probes, marking backends unhealthy. |
| Invalid SSL cert references |
Ingress or LB configured with missing Secret → SSL handshake errors. |
| Session affinity misused |
Sticky sessions overloading a single node. |
Typo in loadBalancerClass |
Cluster ignores Service spec, falls back to default controller. |
| Default backend missing |
Ingress with unmatched host/path sends traffic nowhere. |
Improper loadBalancerSourceRanges |
Restrictive IP filters block clients unexpectedly. |
| Incorrect subnet pool |
Assigned static IP not available in the network. |
🧩 5. Ingress & Application Layer Causes
| Cause |
Explanation |
| IngressClass not defined |
Controller doesn’t reconcile objects without class label. |
| Path rewrite misrules |
Traefik/NGINX rewriting causes wrong upstream URLs. |
| TLS secret rotation |
cert-manager updates secret but ingress controller not reloaded. |
| Redirect loops (HTTP↔HTTPS) |
Both app and ingress enforcing redirection. |
| Hostname mismatch |
Client requests www.app.com, but cert CN=app.com → TLS failure. |
| Backend readiness issues |
Liveness/readiness probes fail before pod is healthy. |
| Timeout configuration mismatch |
App idle timeout < LB keep-alive timeout → premature disconnects. |
| Ingress sync delay |
Controller’s sync interval too long; stale config in NGINX pods. |
🧠 6. Operational & Maintenance Causes
| Cause |
Explanation |
| Version drift |
Kubelet, kube-proxy, and CCM running mismatched versions. |
| Stale endpoints |
Pods deleted, but endpoint objects not updated → blackhole traffic. |
| Resource pressure |
Nodes OOM-killing kube-proxy or metallb-speaker pods. |
| Misleading “Pending” state |
CCM waiting on async API callback but users interpret it as crash. |
| CrashLoopBackOff in controller pods |
Missing RBAC roles to patch Service status. |
| Improper node taints |
LB avoids nodes labeled “NoSchedule” or “NoExecute”. |
| DNS propagation delays |
External IP allocated but DNS record still stale. |
| Static IP reuse conflicts |
Multiple LBs trying to claim same external IP. |
| Metrics ignored |
No Prometheus alerting on kube_service_created_seconds, leading to unnoticed slow creations. |
| Load test misconfiguration |
Synthetic clients reuse connections, hiding LB routing bugs. |
🧰 7. Root Software Causes (Source-Level)
These issues stem from how Kubernetes components internally reconcile and manage services:
| Component |
Source Code Responsibility |
pkg/controller/service/service_controller.go |
Cloud load balancer reconciliation logic; errors here often show as “pending” services. |
pkg/proxy/ipvs/proxier.go & iptables/proxier.go |
Handles kube-proxy rule sync; bugs or latency here break traffic paths. |
pkg/cloudprovider/providers/* |
Provider-specific code paths (AWS, Azure, GCP) managing LB creation/deletion. |
pkg/apis/core/v1/types.go |
Defines ServiceTypeLoadBalancer behavior; certain defaults cause unexpected changes when omitted. |
pkg/controller/endpointslice |
Endpoints not updating → traffic to terminated pods. |
MetalLB (internal/speaker/arp.go, internal/bgp_controller.go) |
ARP/NDP or BGP advertisement bugs cause “flapping IPs.” |
⚡ 8. Human & Process Causes
| Cause |
Explanation |
| No understanding of cloud controller flow |
Teams skip studying how CCM provisions cloud LBs → blame kube-proxy. |
| No resource cleanup |
Deleted namespaces leave orphan LBs consuming IPs. |
| Unclear team boundaries |
NetOps vs DevOps → firewall rules never updated. |
| Manual edits to cloud infra |
External modifications to LB rules fight with CCM reconciliation. |
| Ignoring warning events |
kubectl describe svc emits useful clues often overlooked. |
| No observability |
Missing metrics for LB provision latency or failure rates. |
| Skipping provider documentation |
AWS ALB vs NLB vs ELB differences misapplied. |
| Testing only with localhost clients |
Never testing from external internet path. |
| Improper CI/CD teardown |
Automated tests leave residual load balancers. |
| No version pinning |
Helm or CCM upgrades break annotation compatibility. |
🔍 Summary of Common Root Themes
- Network misalignment between cluster overlay and underlay (CNI, routes, NAT).
- Cloud controller integration issues (permissions, API limits, tags).
- Configuration drift in service annotations or ingress manifests.
- Health check failures due to port/path mismatch or readiness delay.
- Operational neglect — ignoring events, metrics, and cleanup.
- Lack of end-to-end visibility — no tracing from LB → node → pod.