johanneskueber.com

Bare-Metal LoadBalancer Services on Talos with Cilium L2 Announcements

This article documents how to replace MetalLB with Cilium’s built-in L2 announcement feature, including the IPAM pool, the announcement policy, the supporting Cilium values, and what to verify on the wire.


1. MetalLB’s role

On a bare-metal cluster, a Service of type: LoadBalancer stays pending until something assigns it an external IP and answers ARP for that IP on the local segment. The classic answer is MetalLB in L2 mode: a controller allocates an IP from a pool, the speaker DaemonSet elects one node per IP, and that node answers ARP requests and sends a gratuitous ARP when it takes over.

MetalLB works, but it is a second control plane to maintain — separate CRDs, Helm chart, and RBAC. When the cluster’s CNI is Cilium, every primitive MetalLB provides is already inside Cilium.


2. Cilium components

Three Cilium subsystems combine to replace MetalLB:

  1. LoadBalancer IPAM. A CiliumLoadBalancerIPPool declares one or more CIDRs from which Cilium allocates IPs to LoadBalancer Services.
  2. L2 announcements. A CiliumL2AnnouncementPolicy selects which nodes answer ARP/NDP for which IPs on which interfaces, with leader election among the selected nodes.
  3. kube-proxy replacement. Cilium’s eBPF datapath services Service IPs without kube-proxy. This is required for L2 announcements; with both active, kube-proxy and Cilium race for ownership of the forwarding decision.

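The first two are configured through custom resources applied as ordinary manifests (L2 announcements additionally need a Helm value to switch the feature on). The third is a Helm value on the Cilium install.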


3. Cilium Helm values

The Cilium values for this setup:

# cilium-values.yaml
ipv4NativeRoutingCIDR: "10.100.0.0/16"
autoDirectNodeRoutes: true
routingMode: native

k8sServiceHost: "localhost"
k8sServicePort: "7445"

kubeProxyReplacement: true

encryption:
  enabled: true
  type: wireguard

ipam:
  operator:
    clusterPoolIPv4PodCIDRList: "10.100.0.0/16"

bpf:
  masquerade: false
  datapathMode: veth

bandwidthManager:
  enabled: true
  bbr: true

l2announcements:
  enabled: true

envoy:
  enabled: false

hubble:
  enabled: true
  relay: { enabled: true }
  ui:    { enabled: true, replicas: 1 }

operator:
  replicas: 2

# Talos integration
cgroup:
  autoMount: { enabled: false }
  hostRoot: "/sys/fs/cgroup"

Field reference for the keys relevant here:

  • kubeProxyReplacement: true makes Cilium service all ClusterIP / NodePort / LoadBalancer traffic in eBPF, taking over kube-proxy’s job. Required for L2 announcements. The value does not remove kube-proxy itself; on Talos that takes a machine-config change (see section 7).
  • l2announcements.enabled: true turns on the controller that watches CiliumL2AnnouncementPolicy resources and the responder that replies to ARP. It is off by default because the leader election it relies on adds API-server load.
  • routingMode: native with autoDirectNodeRoutes: true makes pods routable on the underlay between nodes without overlay encapsulation. L2 announcements work in either mode, but native routing keeps the data path shorter.
  • encryption.type: wireguard encrypts pod-to-pod traffic between nodes. It is independent of L2 announcements but runs on the same eBPF datapath.
  • k8sServiceHost: localhost with k8sServicePort: 7445 points Cilium at KubePrism, the API-server load balancer that Talos runs on every node. KubePrism does not depend on Cilium, which avoids a circular dependency at bootstrap.
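
As a rough sketch of how these values might be applied, the function below wraps the usual Helm commands. The function name install_cilium and the default file name are illustrative, and no chart version is pinned here; in practice a --version flag matching the intended Cilium release would normally be added.

```shell
# Install or upgrade Cilium from the official Helm repository using a values file.
# install_cilium and the default file name are illustrative, not part of Cilium.
install_cilium() {
  values_file="${1:-cilium-values.yaml}"
  helm repo add cilium https://helm.cilium.io/ &&
  helm repo update &&
  helm upgrade --install cilium cilium/cilium \
    --namespace kube-system \
    --values "$values_file"
}
```

Calling install_cilium cilium-values.yaml and then kubectl -n kube-system rollout status daemonset/cilium is one way to wait until the agents are up before applying the pool and policy.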

4. The IP pool

apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: web-pool
spec:
  blocks:
    - cidr: 192.168.105.128/27
  serviceSelector:
    matchExpressions:
      - { key: io.cilium/lb-ipam-ips, operator: DoesNotExist }

Field reference:

  • blocks[].cidr is a single CIDR; a block can instead be given as a start/stop pair. Multiple blocks are allowed. The pool must be on the same L2 segment as the nodes that will announce it; ARP cannot traverse a router.
  • serviceSelector restricts which Services pull from this pool. It is a label selector, so it is evaluated against Service labels, not annotations: the expression above matches every Service that carries no label with the key io.cilium/lb-ipam-ips, which in practice includes Services that request a specific IP through the annotation of the same name. Omitting serviceSelector makes the pool match every Service.
  • For multiple pools (public vs internal), label the Services and give each pool a matching serviceSelector.matchLabels.
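
A minimal sketch of the multi-pool case, written as a manifest file from the shell. Everything in it is hypothetical: the pool name internal-pool, the label pool: internal, the Service internal-api, and the CIDR 192.168.105.160/27 (chosen only because it does not overlap the /27 above).

```shell
# Write a second pool plus a Service that opts into it via a label.
# All names, the label, and the CIDR below are hypothetical.
cat > internal-pool.yaml <<'EOF'
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: internal-pool
spec:
  blocks:
    - cidr: 192.168.105.160/27
  serviceSelector:
    matchLabels:
      pool: internal
---
apiVersion: v1
kind: Service
metadata:
  name: internal-api
  labels:
    pool: internal
spec:
  type: LoadBalancer
  selector:
    app: internal-api
  ports:
    - port: 80
      targetPort: 8080
EOF
```

The file can then be applied with kubectl apply -f internal-pool.yaml. For the split to be exclusive, every other pool needs a selector that does not also match these Services; otherwise a labelled Service may draw its IP from either pool.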

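Services consume from the pool either implicitly (type: LoadBalancer and a matching selector) or by requesting a specific IP, via spec.loadBalancerIP (deprecated) or the annotation io.cilium/lb-ipam-ips: "192.168.105.130". A requested IP must still lie inside a pool whose selector matches the Service. Newer Cilium releases name this annotation lbipam.cilium.io/ips; check the documentation for the installed version.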
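
Before requesting a specific address, it can help to check that it actually falls inside the pool. The helpers below are a small sketch in plain shell arithmetic; the function names are made up for this example, and the count is the raw size of the CIDR (whether Cilium hands out the first and last address depends on version and pool settings, which is not modelled here).

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1   # split on dots into $1..$4
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Convert a 32-bit integer back to dotted-quad notation.
int_to_ip() {
  echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Print "first last count" for a CIDR such as 192.168.105.128/27.
cidr_range() {
  local base=${1%/*}
  local prefix=${1#*/}
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  local first=$(( $(ip_to_int "$base") & mask ))
  local last=$(( first | (~mask & 0xFFFFFFFF) ))
  echo "$(int_to_ip "$first") $(int_to_ip "$last") $(( last - first + 1 ))"
}

# Succeed if address $1 lies inside CIDR $2.
ip_in_cidr() {
  local prefix=${2#*/}
  local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "${2%/*}") & mask )) ]
}

cidr_range 192.168.105.128/27
# prints: 192.168.105.128 192.168.105.159 32
```

With the pool above, ip_in_cidr 192.168.105.130 192.168.105.128/27 succeeds, so the address in the annotation example is a valid request.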


5. The announcement policy

apiVersion: cilium.io/v2alpha1
kind: CiliumL2AnnouncementPolicy
metadata:
  name: l2-announcement-policy
spec:
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist
  interfaces:
    - ens20
  externalIPs: false
  loadBalancerIPs: true

Field reference:

  • nodeSelector restricts which nodes participate in announcing. Here only the worker nodes should announce, because they run the Envoy gateway DaemonSet that handles the traffic. On a Service with externalTrafficPolicy: Local, a node that answers ARP but has no local backend draws the traffic and then drops it; with the default Cluster policy it forwards to a backend on another node at the cost of an extra hop. The selector above (node-role.kubernetes.io/control-plane DoesNotExist) excludes control-plane nodes, leaving only the workers to respond. On a homelab with combined control-plane/worker nodes, drop the selector.
  • interfaces lists the interfaces on which to answer ARP. On a multi-NIC node, an unrestricted policy can answer ARP on an interface that sits in the wrong segment. ens20 here is the Talos VM’s data network on VLAN 105.
  • externalIPs: false / loadBalancerIPs: true declare which kinds of Service IPs to announce. externalIPs covers the less commonly used spec.externalIPs field of a Service; loadBalancerIPs covers status.loadBalancer.ingress[].ip. Most setups want only the latter.

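Leader election among the selected nodes happens via Kubernetes Leases, one per Service, so exactly one node answers ARP for any given IP at a time. On leader change, Cilium emits a gratuitous ARP to push the new MAC into adjacent ARP caches. Clients pick up the new MAC almost immediately after that; the total failover time is dominated by how long the lease takes to change hands, which depends on the lease settings (l2announcements.leaseDuration and the related Helm values).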
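
To see which node currently holds the lease for a Service, something like the following helper may be useful. It is a sketch that assumes Cilium runs in kube-system and names its leases cilium-l2announce-<namespace>-<service>; listing the leases with kubectl -n kube-system get lease is a quick way to confirm the naming on a given cluster. The function name l2_leader is made up for this example.

```shell
# Print the node that currently holds the L2 announcement lease for a Service.
# Assumption: leases live in kube-system and are named
# cilium-l2announce-<namespace>-<service>.
# Usage: l2_leader <namespace> <service>
l2_leader() {
  kubectl -n kube-system get lease "cilium-l2announce-$1-$2" \
    -o jsonpath='{.spec.holderIdentity}'
}
```

For a Service named web in the default namespace, l2_leader default web prints the holder identity, which is the name of the announcing node.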


6. Verification

Pool allocations:

kubectl get ciliumloadbalancerippool web-pool -o yaml | yq .status

status.conditions[] should show cilium.io/PoolConflict with status False, and cilium.io/IPsAvailable reporting the remaining capacity.

Service got an IP:

kubectl get svc -A -o wide | awk '$5 ~ /^192\.168\.105\./'

ARP works from a client on the segment:

ip neigh flush all
ping -c1 192.168.105.130
ip neigh show 192.168.105.130
# 192.168.105.130 dev wlp4s0 lladdr aa:bb:cc:dd:ee:ff REACHABLE

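The MAC address shown is the announcing node’s data-NIC MAC. To confirm failover, delete the Cilium agent pod on the current leader only (kubectl delete pod -n kube-system -l app.kubernetes.io/name=cilium-agent --field-selector spec.nodeName=<leader-node>; without the field selector the command restarts the agent on every node) and repeat the check: the MAC changes within a few seconds and traffic continues.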
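
The failover check can be left running in a second terminal. The sketch below polls the neighbour table once per second and prints a line whenever the MAC behind the IP changes; neigh_mac and watch_failover are hypothetical helper names, and the loop assumes a Linux client with iproute2 and iputils ping.

```shell
# Extract the link-layer address from `ip neigh show <ip>` output.
neigh_mac() {
  ip neigh show "$1" |
    awk '{ for (i = 1; i < NF; i++) if ($i == "lladdr") { print $(i + 1); exit } }'
}

# Ping once per second and report whenever the announcing MAC changes.
# Runs until interrupted with Ctrl-C.
watch_failover() {
  last=""
  while true; do
    ping -c1 -W1 "$1" >/dev/null 2>&1
    mac=$(neigh_mac "$1")
    if [ -n "$mac" ] && [ "$mac" != "$last" ]; then
      echo "$(date +%T) $1 is at $mac"
      last=$mac
    fi
    sleep 1
  done
}
```

Running watch_failover 192.168.105.130 while deleting the agent pod on the leader shows the moment the new node takes over.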

Hubble flow inspection:

hubble observe --to-ip 192.168.105.130 --last 50

This confirms that the inbound traffic reaches the eBPF datapath, which helps narrow a problem down to ARP, eBPF service load-balancing, or backend pod readiness.


7. Talos configuration

On Talos, kube-proxy is part of the cluster manifest, not a separately deployable workload. Disabling it requires a machine-config patch:

cluster:
  proxy:
    disabled: true
  network:
    cni:
      name: none

Apply via talosctl apply-config and reboot. Cilium then installs as the only CNI and provides kube-proxy replacement. Confirm with:

kubectl get pods -A | grep -E 'kube-proxy|cilium'
# only cilium pods should appear

l2announcements is incompatible with kube-proxy running alongside Cilium: both would try to program forwarding for the same Service IPs, and the result depends on which one wins.
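
As an alternative to re-applying a full machine config, the patch from this section can be kept in its own file and merged into a node's existing config with talosctl patch machineconfig. This is a sketch, not the procedure the text above describes: the file name, the function name patch_node, and the node address passed to it are placeholders.

```shell
# Write the patch from this section to a file.
# The file name is a placeholder.
cat > disable-kube-proxy.yaml <<'EOF'
cluster:
  proxy:
    disabled: true
  network:
    cni:
      name: none
EOF

# Merge the patch into the machine config of one node.
# Usage: patch_node <node-address>
patch_node() {
  talosctl patch machineconfig --nodes "$1" --patch @disable-kube-proxy.yaml
}
```

The function is called once per node, for example patch_node followed by the address of a control-plane node.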