
fix(encapsulation): route Cilium IPIP traffic through VxLAN overlay #6

Merged: kvaps merged 3 commits into main from fix/cilium-ipip-overlay on Feb 15, 2026

@kvaps (Member) commented Feb 14, 2026

Summary

  • Rewrite the Cilium encapsulator to create IPIP tunnels instead of using the cilium_host interface directly. This prevents routing loops when --compatibility=cilium is combined with --encapsulation=always or --encapsulation=crosssubnet
  • Each node autodiscovers its cilium_host IP and advertises it via the kilo.squat.ai/cilium-internal-ip annotation, so that other nodes can route IPIP outer packets through Cilium's VxLAN overlay
  • Add a LocalIP() method to the Encapsulator interface for overlay IP autodiscovery

Test plan

  • Deploy with --compatibility=cilium --encapsulation=always and verify IPIP routes use Cilium internal IPs as gateways
  • Verify cross-node pod traffic flows through VxLAN without routing loops
  • Test with --compatibility=cilium --encapsulation=crosssubnet
  • Verify non-Cilium encapsulators (IPIP, Flannel, Noop) are unaffected

Summary by CodeRabbit

Release Notes

  • Refactor

    • Refactored Cilium encapsulation to create IPIP tunnels whose outer packets are routed through Cilium's VxLAN overlay, preventing routing loops.
    • Updated gateway routing logic to use each peer node's Cilium internal IP when it is available.
    • Added a CiliumInternalIP field to the mesh node representation.
  • Enhancement

    • Added support for the kilo.squat.ai/cilium-internal-ip node annotation.
    • Node equality checks now also compare CiliumInternalIP.

Rewrite Cilium encapsulator to create IPIP tunnels instead of using
cilium_host interface directly. Each node autodiscovers its cilium_host
IP and advertises it via kilo.squat.ai/cilium-internal-ip annotation,
allowing other nodes to route IPIP outer packets through Cilium's VxLAN
overlay and preventing routing loops.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai (bot) commented Feb 14, 2026

📝 Walkthrough

The PR adds support for tracking and using Cilium internal IPs throughout the mesh networking system. It extends the Encapsulator interface so that Gw accepts a ciliumInternalIP parameter and adds a LocalIP() method; refactors Cilium encapsulation to use IPIP tunnels; propagates the Cilium internal IP through mesh nodes and topology segments; extends the Kubernetes backend to read and write the Cilium internal IP annotation; and passes the new parameter in route gateway (Gw) calculations.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Encapsulation Interface Update | pkg/encapsulation/encapsulation.go | Extended the Encapsulator interface: Gw now takes four parameters instead of three (an additional net.IP before the *net.IPNet); added a new LocalIP() net.IP method. |
| Cilium IPIP Tunnel Refactor | pkg/encapsulation/cilium.go | Replaced netlink-based interface tracking with an IPIP tunnel lifecycle managed via iproute: Init creates and brings up the tunnel; CleanUp deletes its addresses and removes the interface; Gw returns ciliumIP when available; added LocalIP, Index, Rules, Set, and Strategy methods; removed the netlink scaffolding. |
| Encapsulation Implementation Updates | pkg/encapsulation/flannel.go, pkg/encapsulation/ipip.go, pkg/encapsulation/noop.go | Updated Gw signatures to accept the additional net.IP parameter; added no-op LocalIP() methods returning nil. |
| Mesh Node Data Model Extensions | pkg/mesh/backend.go, pkg/mesh/mesh.go | Added a CiliumInternalIP field to the Node struct; set it in handleLocal via m.enc.LocalIP(); updated nodesAreEqual to compare CiliumInternalIP values. |
| Mesh Topology Cilium IP Tracking | pkg/mesh/topology.go, pkg/mesh/topology_test.go | Introduced a ciliumInternalIPs field in the segment struct, populated from node.CiliumInternalIP during topology construction; updated test fixtures with nil-initialized ciliumInternalIPs slices. |
| Kubernetes Backend Annotation Support | pkg/k8s/backend.go | Added the ciliumInternalIPAnnotationKey constant; reads and writes CiliumInternalIP to and from node annotations; populates mesh.Node.CiliumInternalIP during Kubernetes Node translation. |
| Route Gateway Calculation Updates | pkg/mesh/routes.go | Updated all Gw invocations to pass ciliumInternalIP from segment.ciliumInternalIPs, changing the call shape from (gatewayIP, privateIP, cidr) to (gatewayIP, privateIP, ciliumInternalIP, cidr). |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Cilium's internal routes now flow so bright,
With tunnel IPs tracked through each segment's height,
From K8s annotations to mesh nodes we go,
LocalIPs and Gw calls make pathways show! 🛣️✨

🚥 Pre-merge checks: ✅ 3, ❌ 1
❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 60.00%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped: CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The pull request title accurately summarizes the main change: routing Cilium IPIP traffic through the VxLAN overlay, which is the core objective of refactoring the Cilium encapsulator. |
| Merge Conflict Detection | ✅ Passed | No merge conflicts detected when merging into main. |


@gemini-code-assist

Summary of Changes

Hello @kvaps, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the Cilium encapsulation mechanism to resolve routing loop issues when IPIP tunnels are used in conjunction with Cilium's networking. By introducing a dedicated IPIP tunnel for Cilium traffic and leveraging Cilium's VxLAN overlay for routing, it ensures seamless cross-node pod communication. The changes also enable nodes to advertise their internal Cilium IP, allowing for more intelligent gateway selection and robust network topology management.

Highlights

  • Cilium Encapsulation Rework: The Cilium encapsulator has been rewritten to create IPIP tunnels, preventing routing loops when using Cilium with always or crosssubnet encapsulation strategies. This ensures IPIP outer packets are routed through Cilium's VxLAN overlay.
  • Cilium Internal IP Discovery and Advertisement: Each node now automatically discovers its cilium_host interface IP and advertises it via a new Kubernetes annotation (kilo.squat.ai/cilium-internal-ip). This allows other nodes to correctly route IPIP traffic through the Cilium VxLAN overlay.
  • Encapsulator Interface Enhancement: A new LocalIP() method has been added to the Encapsulator interface to facilitate overlay IP autodiscovery. The Gw method signature was also updated across all encapsulators to accept the ciliumIP.
  • IPIP Tunnel Management: The cilium encapsulator now includes logic for creating, initializing, setting addresses, and cleaning up IPIP tunnel interfaces, along with generating necessary iptables rules for IPIP traffic.
  • Node Metadata Update: The mesh.Node structure and Kubernetes backend now support storing and retrieving the CiliumInternalIP for each node, which is then used in topology and route calculations.


Changelog
  • pkg/encapsulation/cilium.go
    • Refactored the cilium struct, removing netlink subscription and sync.Mutex for interface management.
    • Implemented CleanUp to remove created IPIP devices.
    • Updated Gw method to prioritize ciliumIP for gateway selection.
    • Added LocalIP method to return the cilium_host interface IP.
    • Implemented Init to create and bring up a new IPIP tunnel interface.
    • Added Rules method to generate iptables rules for IPIP traffic acceptance and dropping.
    • Implemented Set method to assign an IP address to the IPIP tunnel interface.
  • pkg/encapsulation/encapsulation.go
    • Added LocalIP() method to the Encapsulator interface.
    • Modified the Gw method signature to include a ciliumIP parameter.
  • pkg/encapsulation/flannel.go
    • Added a no-op LocalIP() method.
    • Modified the Gw method signature to align with the updated Encapsulator interface.
  • pkg/encapsulation/ipip.go
    • Added a no-op LocalIP() method.
    • Modified the Gw method signature to align with the updated Encapsulator interface.
  • pkg/encapsulation/noop.go
    • Added a no-op LocalIP() method.
    • Modified the Gw method signature to align with the updated Encapsulator interface.
  • pkg/k8s/backend.go
    • Defined a new constant ciliumInternalIPAnnotationKey for Kubernetes annotations.
    • Updated the Set method to store the CiliumInternalIP in node annotations.
    • Modified translateNode to parse and retrieve the CiliumInternalIP from node annotations.
  • pkg/mesh/backend.go
    • Added a CiliumInternalIP field of type net.IP to the Node struct.
  • pkg/mesh/mesh.go
    • Updated the handleLocal method to populate the CiliumInternalIP field of the local node using enc.LocalIP().
    • Modified nodesAreEqual to include comparison of CiliumInternalIP for node equality checks.
  • pkg/mesh/routes.go
    • Updated calls to the enc.Gw method in Routes to pass the ciliumInternalIPs for correct gateway determination in route calculations.
  • pkg/mesh/topology.go
    • Added ciliumInternalIPs slice to the segment struct to store Cilium internal IPs of peers.
    • Updated NewTopology to collect and populate the ciliumInternalIPs for each segment.

@gemini-code-assist (bot) left a comment

Code Review

This pull request refactors the Cilium encapsulator to use IPIP tunnels routed through Cilium's VxLAN overlay, which is a significant improvement for compatibility. It introduces a mechanism for nodes to discover and advertise their Cilium internal IP via Kubernetes annotations. The changes are well-distributed across the codebase and appear to correctly implement the intended logic. I have identified one critical issue that could lead to a panic and one medium-severity issue related to error handling that should be addressed.

return a.Key.String() == b.Key.String() &&
ipNetsEqual(a.WireGuardIP, b.WireGuardIP) &&
ipNetsEqual(a.InternalIP, b.InternalIP) &&
a.CiliumInternalIP.Equal(b.CiliumInternalIP) &&


critical

Calling .Equal() on a nil net.IP slice will cause a panic. The CiliumInternalIP field can be nil, for example, if a node does not have the Cilium internal IP annotation. To prevent a panic, you should use a nil-safe comparison method, such as bytes.Equal.

Suggested change
-a.CiliumInternalIP.Equal(b.CiliumInternalIP) &&
+bytes.Equal(a.CiliumInternalIP, b.CiliumInternalIP) &&

Comment on lines 40 to 42
if err := iproute.DeleteAddresses(c.iface); err != nil {
return nil
}


medium

The error handling in this function is inconsistent. If iproute.DeleteAddresses fails, the error is swallowed, and iproute.RemoveInterface is not called. However, if iproute.RemoveInterface fails, the error is returned. Swallowing the error from DeleteAddresses can hide underlying issues during cleanup. It's better to return the error to the caller for logging and diagnostics.

Suggested change
-if err := iproute.DeleteAddresses(c.iface); err != nil {
-	return nil
-}
+if err := iproute.DeleteAddresses(c.iface); err != nil {
+	return err
+}

kvaps and others added 2 commits February 14, 2026 11:50
Align constant block formatting for gofmt, add ciliumInternalIPs
to expected topology test segments, use bytes.Equal for nil-safe
CiliumInternalIP comparison, and return error from CleanUp.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

staticcheck SA1021 requires net.IP.Equal for IP comparison.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps marked this pull request as ready for review on February 15, 2026 21:24

@coderabbitai (bot) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/k8s/backend.go (1)

132-146: ⚠️ Potential issue | 🟠 Major

Missing cleanup of ciliumInternalIPAnnotationKey annotation.

The Set method writes ciliumInternalIPAnnotationKey (line 246), but CleanUp does not remove it. This leaves a stale annotation on the Kubernetes node object when the agent shuts down.

Proposed fix
 patch := []byte("[" + strings.Join([]string{
     fmt.Sprintf(jsonRemovePatch, path.Join("/metadata", "annotations", strings.Replace(endpointAnnotationKey, "/", jsonPatchSlash, 1))),
     fmt.Sprintf(jsonRemovePatch, path.Join("/metadata", "annotations", strings.Replace(internalIPAnnotationKey, "/", jsonPatchSlash, 1))),
     fmt.Sprintf(jsonRemovePatch, path.Join("/metadata", "annotations", strings.Replace(keyAnnotationKey, "/", jsonPatchSlash, 1))),
     fmt.Sprintf(jsonRemovePatch, path.Join("/metadata", "annotations", strings.Replace(lastSeenAnnotationKey, "/", jsonPatchSlash, 1))),
     fmt.Sprintf(jsonRemovePatch, path.Join("/metadata", "annotations", strings.Replace(wireGuardIPAnnotationKey, "/", jsonPatchSlash, 1))),
     fmt.Sprintf(jsonRemovePatch, path.Join("/metadata", "annotations", strings.Replace(discoveredEndpointsKey, "/", jsonPatchSlash, 1))),
     fmt.Sprintf(jsonRemovePatch, path.Join("/metadata", "annotations", strings.Replace(granularityKey, "/", jsonPatchSlash, 1))),
+    fmt.Sprintf(jsonRemovePatch, path.Join("/metadata", "annotations", strings.Replace(ciliumInternalIPAnnotationKey, "/", jsonPatchSlash, 1))),
 }, ",") + "]")
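A sketch of generating that removal patch for a list of annotation keys. The removeOp format string is an assumed stand-in for Kilo's jsonRemovePatch, and the hypothetical escapeKey applies full JSON Pointer escaping (RFC 6901: "~" to "~0", then "/" to "~1"), which is slightly more general than replacing only the first slash.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// Assumed shape of a JSON Patch "remove" operation (RFC 6902); Kilo's own
// jsonRemovePatch format string may differ.
const removeOp = `{"op": "remove", "path": "%s"}`

// escapeKey escapes a key for use as a JSON Pointer segment (RFC 6901).
// "~" must be escaped before "/" so the "~1" we emit is not re-escaped.
func escapeKey(k string) string {
	return strings.ReplaceAll(strings.ReplaceAll(k, "~", "~0"), "/", "~1")
}

// removeAnnotationsPatch builds one JSON Patch document that removes every key.
func removeAnnotationsPatch(keys ...string) []byte {
	ops := make([]string, 0, len(keys))
	for _, k := range keys {
		p := path.Join("/metadata", "annotations", escapeKey(k))
		ops = append(ops, fmt.Sprintf(removeOp, p))
	}
	return []byte("[" + strings.Join(ops, ",") + "]")
}

func main() {
	fmt.Println(string(removeAnnotationsPatch("kilo.squat.ai/cilium-internal-ip")))
	// [{"op": "remove", "path": "/metadata/annotations/kilo.squat.ai~1cilium-internal-ip"}]
}
```

One caveat worth checking: per RFC 6902, a remove operation fails if its target path does not exist, and a single failing operation fails the whole patch. If Set does not write the annotation on nodes without a Cilium internal IP, adding this key to the shared CleanUp patch could make the entire cleanup patch fail on those nodes.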
pkg/encapsulation/ipip.go (1)

36-41: ⚠️ Potential issue | 🔴 Critical

Bug: CleanUp swallows DeleteAddresses error and skips RemoveInterface on failure.

When DeleteAddresses fails, the method returns nil (line 38) instead of the error. This both hides the failure and skips the RemoveInterface call. When DeleteAddresses succeeds, execution correctly falls through to RemoveInterface; only the error branch is wrong.

🐛 Proposed fix
 func (i *ipip) CleanUp() error {
 	if err := iproute.DeleteAddresses(i.iface); err != nil {
-		return nil
+		return err
 	}
 	return iproute.RemoveInterface(i.iface)
 }
🤖 Fix all issues with AI agents
In `@pkg/mesh/routes.go`:
- Line 43: The three per-node slices on segment (privateIPs, cidrs, ciliumInternalIPs) can get out of sync because privateIPs is only appended when node.InternalIP != nil. Fix the population so all three slices keep the same length (append a nil placeholder to privateIPs when InternalIP is absent), or bound iteration by the shortest of the three slices to avoid out-of-bounds access. Also guard the segment.leader index into privateIPs in the enc.Gw(...) call (the one involving updateEndpoint) by ensuring segment.leader < len(privateIPs), so the call cannot index past the end of privateIPs.
🧹 Nitpick comments (2)
pkg/encapsulation/cilium.go (1)

38-44: Early return in CleanUp skips interface removal if address deletion fails.

If iproute.DeleteAddresses fails, the IPIP interface itself is never removed. Consider cleaning up the interface regardless of address deletion errors.

Proposed fix
 func (c *cilium) CleanUp() error {
-	if err := iproute.DeleteAddresses(c.iface); err != nil {
-		return err
-	}
-	return iproute.RemoveInterface(c.iface)
+	var errs []error
+	if err := iproute.DeleteAddresses(c.iface); err != nil {
+		errs = append(errs, err)
+	}
+	if err := iproute.RemoveInterface(c.iface); err != nil {
+		errs = append(errs, err)
+	}
+	return errors.Join(errs...)
 }
pkg/encapsulation/encapsulation.go (1)

47-58: Consider adding parameter names to the Gw signature for clarity.

The Gw method now takes four parameters of similar types (net.IP, net.IP, net.IP, *net.IPNet), making it hard to understand each parameter's purpose from the interface alone. Named parameters improve self-documentation.

Suggested improvement
-	Gw(net.IP, net.IP, net.IP, *net.IPNet) net.IP
+	Gw(src, private, ciliumIP net.IP, subnet *net.IPNet) net.IP

 for _, segment := range t.segments {
 	if segment.location == t.location {
-		gw = enc.Gw(t.updateEndpoint(segment.endpoint, segment.key, &segment.persistentKeepalive).IP(), segment.privateIPs[segment.leader], segment.cidrs[segment.leader])
+		gw = enc.Gw(t.updateEndpoint(segment.endpoint, segment.key, &segment.persistentKeepalive).IP(), segment.privateIPs[segment.leader], segment.ciliumInternalIPs[segment.leader], segment.cidrs[segment.leader])

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

rg -n "ciliumInternalIPs" --type go -C 3

Repository: cozystack/kilo

Length of output: 15502


🏁 Script executed:

sed -n '157,185p' pkg/mesh/topology.go

Repository: cozystack/kilo

Length of output: 1160


🏁 Script executed:

sed -n '40,75p' pkg/mesh/routes.go

Repository: cozystack/kilo

Length of output: 1701


🏁 Script executed:

sed -n '150,175p' pkg/mesh/routes.go

Repository: cozystack/kilo

Length of output: 1264


Fix slice length mismatch between privateIPs and cidrs/ciliumInternalIPs in segment population.

The privateIPs slice is populated conditionally (only when node.InternalIP != nil), but cidrs and ciliumInternalIPs are always populated. This causes privateIPs to have fewer elements than the other slices, leading to index out of bounds panics when routes.go loops over segment.cidrs and accesses segment.privateIPs[i] at indices where privateIPs was never appended (lines 61 and 156). Additionally, segment.leader indexing on line 43 could exceed privateIPs bounds. Ensure all three slices maintain the same length, or adjust the loop bounds to match the shortest slice.


@kvaps merged commit 4c81d44 into main on Feb 15, 2026
13 checks passed