When Your API Becomes a Punching Bag
I’ve seen it happen more than once: a service goes live, everything looks fine, then traffic spikes — or someone starts hammering the endpoints — and suddenly response times climb, databases struggle, and users start seeing errors. The worst part? It’s entirely preventable.
Rate limiting is one of the first things I wire into any production API now. Not as an afterthought, not a library I bolt on without understanding — but something I know well enough to implement from scratch. The Token Bucket algorithm is my go-to: it’s simple, flexible, and handles real-world traffic patterns far better than fixed window approaches.
If you’re building APIs in Go and haven’t added rate limiting yet, this tutorial walks you through building it from the ground up — about 100 lines of code that you’ll actually understand.
The Problem with Unprotected APIs
Without rate limiting, a single misconfigured client can bring down your service. I’ve watched it happen with a script that was missing a sleep call — not even a real attack, just someone’s automation hitting a webhook in a tight loop. The server started queueing requests, memory climbed, and within a minute legitimate users were timing out.
Rate limiting solves three distinct problems:
- DoS protection — a single IP can’t monopolize your server resources
- Fair usage — one heavy client can’t degrade service for everyone else
- Cost control — downstream services (databases, external APIs) don’t get flooded
Understanding the Token Bucket Algorithm
Picture a bucket that holds tokens. Every incoming request consumes one token. Tokens refill at a fixed rate. If the bucket is empty when a request arrives, that request gets rejected.
Two parameters define everything:
- Capacity — the maximum tokens the bucket holds (controls burst size)
- Refill rate — tokens added per second (controls sustained throughput)
Token Bucket has a real edge over fixed window counting. A fixed window says “100 requests per minute,” but nothing stops a client from sending all 100 in the first second, or 100 at the end of one window and 100 more at the start of the next. Token Bucket lets you set burst size and sustained rate independently, so legitimate bursts pass while the long-run rate stays capped. A user who sends 10 quick requests after being idle gets through; someone hitting the API 1000 times in a loop gets throttled.
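The refill-then-consume step described above can be sketched as a pure function with elapsed time passed in explicitly. The name `step` and its signature are illustrative assumptions, not part of the implementation built later in this tutorial.

```go
package main

import "fmt"

// step applies one token-bucket update: refill for the elapsed time,
// cap at capacity, then try to consume one token.
// It returns the new token count and whether the request is allowed.
func step(tokens, capacity, refillRate, elapsedSeconds float64) (float64, bool) {
	tokens += elapsedSeconds * refillRate
	if tokens > capacity {
		tokens = capacity
	}
	if tokens >= 1 {
		return tokens - 1, true
	}
	return tokens, false
}

func main() {
	// Capacity 10, refill 2 tokens/second, bucket starts full.
	tokens := 10.0
	allowed := 0
	// 15 back-to-back requests with no time passing between them.
	for i := 0; i < 15; i++ {
		var ok bool
		tokens, ok = step(tokens, 10, 2, 0)
		if ok {
			allowed++
		}
	}
	fmt.Println("burst allowed:", allowed) // 10 of the 15 get through

	// After 1.5 idle seconds, 3 tokens have refilled; one is consumed.
	tokens, ok := step(tokens, 10, 2, 1.5)
	fmt.Println(ok, tokens) // true 2
}
```

Capacity decides how many of the back-to-back requests pass, while the refill rate decides how quickly the client earns new ones.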
Why Build It Instead of Using a Library?
You could reach for golang.org/x/time/rate — it’s solid and battle-tested. But understanding what’s under the hood matters. When something breaks in prod at 3am, “I used a library” doesn’t help you debug it. Building it yourself takes 30 minutes and you’ll never be confused about the behavior again. I once traced a production rate limiting bug to a subtle clock issue — knowing the implementation meant a 10-minute fix instead of a night spent reading library source.
Building the Rate Limiter in Go
The Core Data Structure
Start with the bucket itself:
package ratelimiter

import (
	"sync"
	"time"
)

type TokenBucket struct {
	capacity   float64
	tokens     float64
	refillRate float64 // tokens per second
	lastRefill time.Time
	lastAccess time.Time
	mu         sync.Mutex
}

func NewTokenBucket(capacity float64, refillRate float64) *TokenBucket {
	now := time.Now()
	return &TokenBucket{
		capacity:   capacity,
		tokens:     capacity, // start full
		refillRate: refillRate,
		lastRefill: now,
		lastAccess: now,
	}
}
I’m using float64 for tokens instead of integers. This keeps the refill math clean: an integer counter would truncate the fractional tokens that accumulate between closely spaced requests, so a busy client could refill more slowly than the configured rate.
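To see why fractional tokens matter, here is a small sketch comparing an integer counter with a float64 one. Both helper functions are hypothetical and exist only for this comparison.

```go
package main

import "fmt"

// refillInt truncates the fractional part of every refill.
func refillInt(tokens int, refillRate, elapsedSeconds float64) int {
	return tokens + int(elapsedSeconds*refillRate)
}

// refillFloat keeps fractional tokens between calls.
func refillFloat(tokens, refillRate, elapsedSeconds float64) float64 {
	return tokens + elapsedSeconds*refillRate
}

func main() {
	intTokens, floatTokens := 0, 0.0
	// Ten refill steps 300ms apart at 2 tokens/second: 6 tokens are owed.
	for i := 0; i < 10; i++ {
		intTokens = refillInt(intTokens, 2, 0.3)
		floatTokens = refillFloat(floatTokens, 2, 0.3)
	}
	fmt.Println("int tokens:", intTokens)           // 0: each 0.6 was truncated away
	fmt.Printf("float tokens: %.1f\n", floatTokens) // about 6
}
```

A client polling every 300ms would never earn a token from the integer version, even though it stays well under the configured rate.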
The Allow() Method
Two operations happen on every call: refill tokens based on elapsed time, then attempt to consume one:
func (tb *TokenBucket) Allow() bool {
	tb.mu.Lock()
	defer tb.mu.Unlock()

	now := time.Now()
	elapsed := now.Sub(tb.lastRefill).Seconds()

	// Add tokens based on elapsed time
	tb.tokens += elapsed * tb.refillRate
	if tb.tokens > tb.capacity {
		tb.tokens = tb.capacity
	}
	tb.lastRefill = now
	tb.lastAccess = now

	// Try to consume a token
	if tb.tokens >= 1 {
		tb.tokens--
		return true
	}
	return false
}
The mutex is non-negotiable. Go’s HTTP server handles each request in its own goroutine, and without synchronization you’ll get race conditions corrupting the token count. I hit this on an early version — stress testing showed erratic behavior under concurrent load that was nearly impossible to reproduce in isolation.
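One way to check the synchronization is to hammer a bucket from many goroutines and count the successes. The sketch below uses a stripped-down `bucket` type with no refill (an assumption made so the result is deterministic) and a hypothetical `countAllowed` helper.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// bucket is a stripped-down stand-in for TokenBucket with no refill,
// so the only thing under test is concurrent consumption.
type bucket struct {
	mu     sync.Mutex
	tokens float64
}

func (b *bucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// countAllowed fires n concurrent Allow calls and returns how many succeed.
func countAllowed(b *bucket, n int) int64 {
	var allowed int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if b.Allow() {
				atomic.AddInt64(&allowed, 1)
			}
		}()
	}
	wg.Wait()
	return allowed
}

func main() {
	b := &bucket{tokens: 10}
	fmt.Println(countAllowed(b, 1000)) // 10: the mutex keeps the count exact
}
```

If you remove the lock to experiment, running with Go's `-race` flag should report the data race directly instead of leaving you to infer it from erratic counts.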
Per-Client Rate Limiting
A single global bucket throttles your entire API at once, which isn’t what you want. You need one bucket per client — keyed by IP address or API key:
type RateLimiter struct {
	clients    map[string]*TokenBucket
	mu         sync.RWMutex
	capacity   float64
	refillRate float64
}

func NewRateLimiter(capacity, refillRate float64) *RateLimiter {
	rl := &RateLimiter{
		clients:    make(map[string]*TokenBucket),
		capacity:   capacity,
		refillRate: refillRate,
	}
	go rl.cleanup()
	return rl
}

func (rl *RateLimiter) getBucket(clientID string) *TokenBucket {
	rl.mu.RLock()
	bucket, exists := rl.clients[clientID]
	rl.mu.RUnlock()
	if exists {
		return bucket
	}

	rl.mu.Lock()
	defer rl.mu.Unlock()
	// Double-check after acquiring write lock
	if bucket, exists = rl.clients[clientID]; exists {
		return bucket
	}
	bucket = NewTokenBucket(rl.capacity, rl.refillRate)
	rl.clients[clientID] = bucket
	return bucket
}

func (rl *RateLimiter) Allow(clientID string) bool {
	return rl.getBucket(clientID).Allow()
}
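Note the double-check pattern inside getBucket(). Between releasing the read lock and acquiring the write lock, another goroutine could have created the same bucket. Skip the second check and two goroutines racing on a new client will each allocate a bucket, with the later one overwriting the earlier; the client gets a fresh, full bucket and any tokens already spent are forgotten. Double-checked locking like this is a common Go approach to lazy initialization under concurrent access.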
Cleaning Up Stale Buckets
Left unchecked, the client map grows by one entry per unique caller — on a public API facing thousands of unique IPs, that’s unbounded memory growth. A background goroutine handles eviction:
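func (rl *RateLimiter) cleanup() {
	// Sweep every 5 minutes for as long as the process runs.
	ticker := time.NewTicker(5 * time.Minute)
	for range ticker.C {
		rl.mu.Lock()
		for id, bucket := range rl.clients {
			// lastAccess is written by Allow(), so read it under the bucket's lock.
			bucket.mu.Lock()
			if time.Since(bucket.lastAccess) > 10*time.Minute {
				delete(rl.clients, id) // evict clients idle for over 10 minutes
			}
			bucket.mu.Unlock()
		}
		rl.mu.Unlock()
	}
}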
HTTP Middleware
Drop this into your middleware chain and you’re done:
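// Needs "net", "net/http", and "strconv" in the import block.
func RateLimitMiddleware(rl *RateLimiter) func(http.Handler) http.Handler {
	// Advertise the configured capacity rather than a hardcoded value.
	limit := strconv.FormatFloat(rl.capacity, 'f', -1, 64)
	return func(next http.Handler) http.Handler {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			// RemoteAddr is "host:port". Key on the host alone, otherwise
			// every new connection from a client gets its own bucket.
			// In production behind a proxy, use X-Forwarded-For.
			clientIP, _, err := net.SplitHostPort(r.RemoteAddr)
			if err != nil {
				clientIP = r.RemoteAddr
			}
			if !rl.Allow(clientIP) {
				w.Header().Set("Retry-After", "1")
				w.Header().Set("X-RateLimit-Limit", limit)
				http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
				return
			}
			next.ServeHTTP(w, r)
		})
	}
}

func main() {
	// 10 token burst, 2 tokens/second sustained
	limiter := NewRateLimiter(10, 2)
	mux := http.NewServeMux()
	mux.HandleFunc("/api/data", handleData)
	handler := RateLimitMiddleware(limiter)(mux)
	http.ListenAndServe(":8080", handler)
}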
Testing Burst and Throttle Behavior
Don’t skip the tests. Verifying burst behavior up front saves you from tuning blindly in production:
func TestTokenBucket_BurstThenThrottle(t *testing.T) {
	// 5 token capacity, 1 token/second refill
	bucket := NewTokenBucket(5, 1)

	// Burst: 5 consecutive requests should all go through
	for i := 0; i < 5; i++ {
		if !bucket.Allow() {
			t.Fatalf("expected request %d to be allowed", i+1)
		}
	}

	// 6th should be rejected: bucket is empty
	if bucket.Allow() {
		t.Fatal("expected 6th request to be rejected")
	}

	// Wait for 1 token to refill
	time.Sleep(1100 * time.Millisecond)
	if !bucket.Allow() {
		t.Fatal("expected request to be allowed after refill")
	}
}
Tuning and Production Tips
Choosing Parameters That Make Sense
Getting the numbers wrong is the most common mistake. My starting point for public APIs:
- Set capacity to 2–3× your expected legitimate burst (users sometimes retry quickly after errors)
- Set refill rate to your sustained requests-per-second budget
- Authenticated endpoints: more generous — 50 token capacity, 10/sec refill
- Unauthenticated or sensitive endpoints: strict — 5 token capacity, 1/sec refill
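To sanity-check a parameter pair before deploying it, two quick calculations help: the most requests a client can get through in a given window, and how long an empty bucket takes to refill. The helper names below are hypothetical.

```go
package main

import "fmt"

// maxRequests is an upper bound on the requests a client can get through
// in a window of the given length, starting from a full bucket.
func maxRequests(capacity, refillRate, windowSeconds float64) float64 {
	return capacity + refillRate*windowSeconds
}

// secondsToRefill is how long an empty bucket takes to fill completely.
func secondsToRefill(capacity, refillRate float64) float64 {
	return capacity / refillRate
}

func main() {
	// Authenticated profile: 50 token capacity, 10/sec refill.
	fmt.Println(maxRequests(50, 10, 60), secondsToRefill(50, 10)) // 650 5
	// Strict profile: 5 token capacity, 1/sec refill.
	fmt.Println(maxRequests(5, 1, 60), secondsToRefill(5, 1)) // 65 5
}
```

If the per-minute ceiling is higher than your backend can absorb from a single client, lower the refill rate first; capacity only affects the initial burst.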
Distributed Rate Limiting
The in-memory implementation works perfectly for a single server. Scale horizontally and the problem surfaces immediately: each server has its own bucket state, so a client can hit 10 req/sec on each of 5 servers for 50 req/sec total. For distributed setups, the token state moves to Redis using atomic Lua scripts. The algorithm transfers directly — same math, different storage backend. That’s worth a separate deep dive once you’ve outgrown a single node.
Headers Tell Clients What’s Happening
A 429 response without context is frustrating to debug. Always return:
- X-RateLimit-Limit: your bucket capacity
- X-RateLimit-Remaining: current token count (read before consuming)
- Retry-After: seconds until the next token is available
Well-behaved clients will back off automatically when they see these headers. Badly behaved ones at least give you something to point to in the logs.
Where to Go From Here
The implementation above is around 100 lines of Go, handles concurrent access correctly, cleans up stale state automatically, and plugs into standard HTTP middleware. Small enough to read in one sitting — which matters when you need to adapt it, debug it at 3am, or walk the next engineer through every line.
Rate limiting belongs in your first commit, not your postmortem. Start with the defaults here, run load tests against your actual endpoints, and tune from real traffic numbers rather than guesses.

