TCP Flow Collision Over Multiple LoadBalancers

What happens

In typical high-traffic scenarios, server clusters (e.g., game servers or chat platforms) must handle billions of concurrent TCP connections. To achieve this, they deploy distributed infrastructure components such as Kubernetes pods and virtual machines. To aggregate traffic, a gateway layer, typically implemented as multiple load balancers, is essential for an efficient architecture.

To understand load balancers (LBs), refer to resources like Wikipedia. Two common LB implementations are reverse proxy and DNAT (Destination Network Address Translation). The TCP flow collision problem occurs specifically in DNAT-based load balancers.

Consider this simulation of two flows:

In this scenario:

  1. Flow1 and Flow2 originate from the same client (IP1:PORT1)
  2. Both target different gateway endpoints (GWIP1:GWPORT1 vs GWIP2:GWPORT2)
  3. The client’s TCP stack may reuse the source port since the destinations differ
  4. With high connection volumes, port reuse probability increases significantly
  5. Both gateways perform DNAT, rewriting destination to VMIP1:VMPORT1
  6. When hashed to the same backend VM, the flows share identical 5-tuples: (SrcIP=IP1, SrcPort=PORT1, DstIP=VMIP1, DstPort=VMPORT1, Protocol=TCP)
  7. This 5-tuple collision causes flow misidentification at the server.
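
The steps above can be sketched in a few lines of Go. This is only an illustration: the `FlowKey` type, the `dnat` helper, and all addresses (taken from documentation and private ranges) are hypothetical and do not come from any real load balancer.

```go
package main

import "fmt"

// FlowKey is the 5-tuple a host uses to identify a TCP flow.
type FlowKey struct {
	SrcIP   string
	SrcPort uint16
	DstIP   string
	DstPort uint16
	Proto   string
}

// dnat mimics a DNAT load balancer: only the destination is rewritten,
// the client's source IP and port pass through unchanged.
func dnat(f FlowKey, backendIP string, backendPort uint16) FlowKey {
	f.DstIP = backendIP
	f.DstPort = backendPort
	return f
}

func main() {
	// Same client IP and source port, two different gateway endpoints.
	flow1 := FlowKey{"198.51.100.7", 40000, "203.0.113.1", 443, "tcp"}
	flow2 := FlowKey{"198.51.100.7", 40000, "203.0.113.2", 443, "tcp"}
	fmt.Println("distinct at client:", flow1 != flow2)

	// Both gateways pick the same backend VM.
	s1 := dnat(flow1, "10.0.0.5", 8080)
	s2 := dnat(flow2, "10.0.0.5", 8080)
	fmt.Println("identical at server:", s1 == s2)
}
```

Both comparisons hold: the flows are distinguishable on the client side, but the backend receives two flows with the same key.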

When Collisions Occur

TCP flow collisions result from specific coincidences in 4-tuple matching. Their impact varies by deployment scenario:

| Client Type | Collision Risk | Recommendation |
| --- | --- | --- |
| Direct user clients (low connection volume per IP) | Negligible (< 0.001%) | Acceptable risk; implement standard TCP retry mechanisms |
| Gateway-concentrated traffic (CDN/anti-DDoS egress points) | High (> 1% possible) | Requires mitigation, especially when few gateway IPs funnel massive traffic |

TCP Flow Collision Probability Formula

**The results are from DeepSeek-R1. To simplify them, we assume that all flows come from the same client. When flows are evenly distributed across multiple clients, the outcome is the same.**

The result has not been rigorously proven. To check the accuracy of the formula, I simulated the scenario with a simple Go program. The simulated and computed results are close, but they are for reference only.

Where:

  • N = Total concurrent TCP flows
  • R = Number of real servers
  • S = Source port range size (typically 64,512)
  • L = Number of load balancers (each with unique VIP)
  • C = Number of distinct client IPs

Key Assumptions:

  1. Each LB has a unique Virtual IP (VIP)
  2. Flows are evenly distributed across LBs ($\frac{N}{L}$ flows per LB)
  3. Real servers see original client IP/port after DNAT
  4. Collisions occur when two flows to the same real server have identical:
    (client IP, client port, server IP, server port)

Formula Derivation:

  1. Per-LB flow distribution:
    Probability a flow goes to LB $i$: $P(LB_i) = \frac{1}{L}$

  2. Collision condition:
    Two flows collide when:

    • Same client IP/port (probability $\frac{1}{C \cdot S}$)
    • Different LBs (probability $\frac{L-1}{L}$)
    • Same real server (probability $\frac{1}{R}$)
  3. Simplified conservative model:

$$ P_{\text{collision}} \approx \left(1 - \left(1 - \frac{1}{R \cdot S}\right)^{N \cdot \frac{L-1}{L}}\right) \times 100\% $$

Simulation Result

N(10000) R(10) S(60000) L(1) C(1) Simulation collision: 0.000%, Formula collision: 0.000%
N(10000) R(10) S(60000) L(2) C(1) Simulation collision: 0.820%, Formula collision: 0.830%
N(100000) R(10) S(60000) L(2) C(1) Simulation collision: 8.358%, Formula collision: 7.996%
N(100000) R(20) S(60000) L(2) C(1) Simulation collision: 4.198%, Formula collision: 4.081%
N(100000) R(30) S(60000) L(2) C(1) Simulation collision: 2.762%, Formula collision: 2.740%
N(100000) R(30) S(60000) L(3) C(1) Simulation collision: 3.700%, Formula collision: 3.636%
N(200000) R(30) S(60000) L(3) C(2) Simulation collision: 3.623%, Formula collision: 3.636%

Mitigation Strategies

Isolate Upstream Gateways

  • Mechanism: Assign dedicated real servers per gateway
  • Effect: Eliminates cross-gateway 4-tuple collisions
  • Deployment: Requires gateway-aware load balancing
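
As a rough illustration of gateway-aware load balancing, the Go sketch below partitions the real servers so that each gateway only forwards to its own group. The function `dedicatedServers` and the addresses are hypothetical. A real deployment would express this in the load balancer configuration rather than in application code.

```go
package main

import "fmt"

// dedicatedServers splits servers into one contiguous, non-overlapping
// group per gateway. Gateway g may only DNAT to groups[g].
// If there are more gateways than servers, some groups stay empty.
func dedicatedServers(servers []string, gateways int) [][]string {
	groups := make([][]string, gateways)
	for i, s := range servers {
		g := i * gateways / len(servers)
		groups[g] = append(groups[g], s)
	}
	return groups
}

func main() {
	servers := []string{"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"}
	for g, group := range dedicatedServers(servers, 2) {
		fmt.Printf("gateway %d -> %v\n", g, group)
	}
}
```

Because no real server belongs to two groups, two flows with the same client IP and port that arrive through different gateways can never reach the same server.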

Increase Client IP Diversity

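According to the formula in the previous section, the collision rate grows with the number of concurrent TCP flows that share a client IP. Spreading the same traffic over more client IPs therefore lowers it.

  • Increase the number of client IPs by using a larger IP pool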
  • Enable IPv6 to expand client IP space

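Optimize Load Balancer Count

Increasing the number of load balancers raises the collision rate, and reducing it lowers the rate. In the formula, the exponent scales with $\frac{L-1}{L}$, which is 0 for a single load balancer, $\frac{1}{2}$ for two, and approaches 1 as $L$ grows.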

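Switch Load Balancers to FullNAT Mode

Migrating load balancers to FullNAT mode eliminates flow collisions fundamentally. In FullNAT mode the load balancer rewrites the source address as well as the destination, so the real server no longer sees the original client IP and port. The trade-off is that FullNAT demands more performance and resources from the LB service.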
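
The effect can be illustrated with a short Go sketch. It assumes that each load balancer performs source NAT with a local IP address that no other load balancer uses. The `Tuple` type, the `fullNAT` helper, and the addresses are hypothetical.

```go
package main

import "fmt"

// Tuple is the 4-tuple the real server sees for a TCP flow.
type Tuple struct {
	SrcIP   string
	SrcPort uint16
	DstIP   string
	DstPort uint16
}

// fullNAT rewrites both ends: the source becomes an address and port owned
// by the load balancer, the destination becomes the real server.
func fullNAT(lbLocalIP string, lbLocalPort uint16, serverIP string, serverPort uint16) Tuple {
	return Tuple{lbLocalIP, lbLocalPort, serverIP, serverPort}
}

func main() {
	// Two flows that would collide under DNAT (same client IP and port),
	// arriving through two load balancers with different local addresses.
	viaLB1 := fullNAT("10.0.1.1", 40000, "10.0.0.5", 8080)
	viaLB2 := fullNAT("10.0.2.1", 40000, "10.0.0.5", 8080)
	fmt.Println("distinct at server:", viaLB1 != viaLB2)
}
```

Even if both load balancers happen to pick the same local port, the source IPs differ, so the real server sees two distinct flows.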