From TCP Sockets to Unix Sockets: A Caddy Performance Case Study
A recent GitHub issue, #6751 in the Caddy server repository, revealed a counterintuitive performance bottleneck: despite low CPU usage (1-5%), a multi-layer reverse proxy setup suffered severe throughput degradation. The investigation surfaced a critical lesson: low CPU usage doesn't guarantee good performance. The culprit? Network stack overhead hiding beneath the surface. Here's what was discovered and how it was resolved.
The Problem
A user reported significant performance degradation when implementing multiple layers of reverse proxies in Caddy v2.8.4. The setup consisted of a chain of reverse proxies (a reconstructed Caddyfile sketch follows the list):
- Port 8081: Serving static files
- Port 8082: Proxying to 8081
- Port 8083: Proxying to 8082
- Port 8084: Proxying to 8083
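For reference, the TCP-based setup looked roughly like the following Caddyfile (reconstructed from the report; the loopback addresses and directive order are assumptions):

```
:8081 {
    file_server browse
    root * /opt/www
}

:8082 {
    reverse_proxy 127.0.0.1:8081
}

:8083 {
    reverse_proxy 127.0.0.1:8082
}

:8084 {
    reverse_proxy 127.0.0.1:8083
}
```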
When testing with a 1000 MiB file download, the performance metrics showed a clear pattern of degradation:
Multi-Threading Performance Impact
- Direct file server (8081): ~300 Mbps with 5 threads
- First proxy layer (8082): ~60 Mbps with 5 threads
- Second proxy layer (8083): ~16 Mbps with 5 threads
- Third proxy layer (8084): ~16 Mbps with 5 threads
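These figures were gathered by downloading the 1000 MiB test file with five concurrent curl workers, essentially the same command analyzed later in Table 3 (`host.domain:port` is a placeholder for the layer under test):

```
# each echoed token spawns one curl download of the same file, up to 5 running in parallel
echo 1 1 1 1 1 | xargs -n1 -P5 curl -s -o /dev/null http://host.domain:port/1000MiB
```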
What made this particularly interesting was that the server's CPU usage remained surprisingly low (1-5%), suggesting that the bottleneck wasn't in processing power.
When to Use Unix Sockets vs TCP
Before diving into the investigation, it's worth knowing when this optimization applies:
| Criterion | Unix Sockets | TCP Sockets |
|---|---|---|
| Latency-sensitive (local services) | ✅ Use Unix sockets | ❌ Avoid |
| Same machine communication | ✅ Preferred | ⚠️ Only if required |
| Remote services | ❌ Cannot use | ✅ Required |
| Load balancing across machines | ❌ Not suitable | ✅ Required |
| Filesystem permissions needed | ✅ Leverages OS permissions | ❌ Not applicable |
| Container orchestration | ⚠️ Complex (mount volumes) | ✅ Simpler with env vars |
| Development/testing | ✅ Faster local iteration | ⚠️ Adds network latency |
| High-throughput local proxying | ✅ 80%+ overhead reduction | ❌ Network stack overhead |
Rule of thumb: If your reverse proxy backend is on the same machine, Unix sockets will give you a significant performance boost. If it's remote, TCP is your only option.
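In Caddyfile terms, the difference is a single argument to `reverse_proxy`; the addresses below are illustrative:

```
# backend on the same machine: prefer a unix socket
reverse_proxy unix//dev/shm/backend.sock

# backend on another machine: TCP is the only option
reverse_proxy 10.0.0.5:8080
```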
The Solution
The breakthrough came when testing with Unix sockets instead of TCP connections. By modifying the Caddyfile to use Unix sockets for inter-process communication, the performance issues were completely resolved. Here's what the optimized configuration looked like:
```
:8081 {
    bind 0.0.0.0 unix//dev/shm/8081.sock
    file_server browse
    root * /opt/www
}

:8082 {
    bind 0.0.0.0 unix//dev/shm/8082.sock
    reverse_proxy unix//dev/shm/8081.sock
}

:8083 {
    bind 0.0.0.0 unix//dev/shm/8083.sock
    reverse_proxy unix//dev/shm/8082.sock
}

:8084 {
    reverse_proxy unix//dev/shm/8083.sock
}
```
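One way to sanity-check this setup is to hit a socket directly with curl's `--unix-socket` flag; for example (socket path from the config above, request path and host illustrative):

```
# bypass TCP entirely and download straight from the file server's socket
curl --unix-socket /dev/shm/8081.sock -s -o /dev/null -w "%{speed_download}\n" http://localhost/1000MiB
```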
Key Takeaways
- TCP connection overhead can significantly impact performance in multi-layer reverse proxy setups
- Unix sockets provide a more efficient alternative for local inter-process communication
- Low CPU usage doesn't always mean optimal performance: network stack overhead can be the bottleneck
- When dealing with multiple local reverse proxies, consider using Unix sockets instead of TCP connections
The Investigation
The investigation, led by Caddy maintainers including Matt Holt, involved:
- Gathering system metrics
- Analyzing CPU and memory profiles
- Testing different network configurations
- Examining kernel settings
Table 1: System Metrics
| Commands | Why It Is Relevant to Debugging | Output and Conclusion |
|---|---|---|
| `ulimit -a` | Checks system limits such as the maximum number of open files and other resource constraints that could impact performance. | No bottlenecks identified in file descriptors or resource limits. |
| `sysctl -p` | Confirms network-related kernel parameters such as buffer sizes, the default queuing discipline, and TCP congestion control. | `net.core.rmem_max = 2097152`<br>`net.core.wmem_max = 2097152`<br>`net.core.default_qdisc = fq`<br>`net.ipv4.tcp_congestion_control = bbr`<br>Settings were optimized for high-speed networking; TCP congestion control was correctly set to BBR. |
| General hardware specs (CPU, RAM, NIC, etc.) | Establishes the hardware baseline. | Verified adequate resources (1 core of a Ryzen 5950X, 1024 MB RAM, 10 Gbps NIC). No resource-related constraints. |
Table 2: Profile Analysis
| Commands | Why It Is Relevant to Debugging | Output and Conclusion |
|---|---|---|
| Attempted to collect goroutine profiles | Helps identify bottlenecks or inefficiencies in goroutines that may be causing performance issues. | Could not identify significant bottlenecks in goroutines. |
| Accessed CPU Profile via browser | Provides CPU usage details to determine if high CPU usage is a factor affecting performance. | No high CPU usage detected. CPU load was between 1-5%. |
| `wget http://127.0.0.1:2019/debug/pprof/profile?seconds=1000` | Downloads detailed CPU profiles for offline analysis (see the example after this table). | Profiles downloaded successfully. Further analysis confirmed no CPU bottlenecks or inefficiencies. |
| Collected heap profiles | Helps analyze memory usage and potential leaks in the application. | Memory usage was within acceptable limits, with no indication of memory leaks. |
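A profile downloaded this way can be inspected offline with Go's standard pprof tooling; a typical invocation, assuming the dump was saved as `profile`, looks like this:

```
# print the functions consuming the most CPU time
go tool pprof -top profile

# or explore the profile interactively in a web UI
go tool pprof -http=:8080 profile
```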
Table 3: Network Testing
| Commands | Why It Is Relevant to Debugging | Output and Conclusion |
|---|---|---|
| Tests from multiple locations (Singapore, Los Angeles, Seoul) | Evaluates network performance across different regions to identify geographical bottlenecks. | Performance was consistent across all regions. |
| Tests with different file sizes (100MiB, 1000MiB) | Determines if performance issues are related to file size or payload. | No significant performance variance with different file sizes. |
| `curl -o /dev/null http://host.domain:port/1000MiB` | Single-threaded test evaluates download performance under minimal concurrency. | Acceptable download speed; no bottleneck observed at low concurrency. |
| `echo 1 1 1 1 1 \| xargs -n1 -P5 curl -s -o /dev/null http://host.domain:port/1000MiB` | Multi-threaded test assesses network performance under concurrent load. | Reproduced the reported degradation: throughput dropped from ~300 Mbps against the file server to ~16 Mbps behind two or more proxy layers. |
Table 4: Kernel Analysis
| Commands | Why It Is Relevant to Debugging | Output and Conclusion |
|---|---|---|
| Checked systemd service file settings | Confirms that the maximum number of open files is sufficient for high-concurrency workloads. | Verified `LimitNOFILE=1048576`. No issues found. |
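For reference, that limit corresponds to the `LimitNOFILE` directive in Caddy's systemd unit; a minimal drop-in sketch (the exact unit and file names depend on the install) looks like this:

```
# /etc/systemd/system/caddy.service.d/override.conf
[Service]
LimitNOFILE=1048576
```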
