Building a Rate Limiter from Scratch in Go: Token Bucket Algorithm for API Protection

When Your API Becomes a Punching Bag

I’ve seen it happen more than once: a service goes live, everything looks fine, then traffic spikes — or someone starts hammering the endpoints — and suddenly response times climb, databases struggle, and users start seeing errors. The worst part? It’s entirely preventable.

Rate limiting is one of the first things I wire into any production API now. Not as an afterthought, not a library I bolt on without understanding — but something I know well enough to implement from scratch. The Token Bucket algorithm is my go-to: it’s simple, flexible, and handles real-world traffic patterns far better than fixed window approaches.

If you’re building APIs in Go and haven’t added rate limiting yet, this tutorial walks you through building it from the ground up — about 100 lines of code that you’ll actually understand.

The Problem with Unprotected APIs

Without rate limiting, a single misconfigured client can bring down your service. I’ve watched it happen with a script that was missing a sleep call — not even a real attack, just someone’s automation hitting a webhook in a tight loop. The server started queueing requests, memory climbed, and within a minute legitimate users were timing out.

Rate limiting solves three distinct problems:

DoS protection — a single IP can’t monopolize your server resources
Fair usage — one heavy client can’t degrade service for everyone else
Cost control — downstream services (databases, external APIs) don’t get flooded

Understanding the Token Bucket Algorithm

Picture a bucket that holds tokens. Every incoming request consumes one token. Tokens refill at a fixed rate. If the bucket is empty when a request arrives, that request gets rejected.

Two parameters define everything:

Capacity — the maximum tokens the bucket holds (controls burst size)
Refill rate — tokens added per second (controls sustained throughput)

Token Bucket has a real edge over fixed window counting. A fixed window says “100 requests per minute”, but nothing stops a client from sending all 100 in the first second, or 200 in a couple of seconds by straddling the boundary between two windows. Token Bucket caps any burst at the bucket capacity and then holds the client to the refill rate. A user who sends 10 quick requests after being idle gets through; someone hitting the API 1000 times in a loop gets throttled.

Why Build It Instead of Using a Library?

You could reach for golang.org/x/time/rate — it’s solid and battle-tested. But understanding what’s under the hood matters. When something breaks in prod at 3am, “I used a library” doesn’t help you debug it. Building it yourself takes 30 minutes and you’ll never be confused about the behavior again. I once traced a production rate limiting bug to a subtle clock issue — knowing the implementation meant a 10-minute fix instead of a night spent reading library source.

Building the Rate Limiter in Go

The Core Data Structure

Start with the bucket itself:

package ratelimiter

import (
    "sync"
    "time"
)

type TokenBucket struct {
    capacity   float64
    tokens     float64
    refillRate float64 // tokens per second
    lastRefill time.Time
    lastAccess time.Time
    mu         sync.Mutex
}

func NewTokenBucket(capacity float64, refillRate float64) *TokenBucket {
    now := time.Now()
    return &TokenBucket{
        capacity:   capacity,
        tokens:     capacity, // start full
        refillRate: refillRate,
        lastRefill: now,
        lastAccess: now,
    }
}

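I’m using float64 for tokens instead of integers. This keeps the refill math simple: fractional tokens accumulate between requests instead of being truncated to zero on every call, which would quietly starve clients whose requests arrive closer together than the time it takes to earn one full token.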
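
A small sketch of why the fractional part matters. The helpers refillFloat and refillInt are hypothetical and exist only for this comparison; the rate (2.5 tokens/sec, checked every 100 ms) is an assumed example.

```go
package main

import "fmt"

// refillFloat applies `ticks` refills of `perTick` tokens to an empty
// bucket, keeping the fractional part between ticks.
func refillFloat(perTick float64, ticks int) float64 {
    tokens := 0.0
    for i := 0; i < ticks; i++ {
        tokens += perTick
    }
    return tokens
}

// refillInt does the same with an integer counter, so any refill smaller
// than one whole token is truncated away on every tick.
func refillInt(perTick float64, ticks int) int {
    tokens := 0
    for i := 0; i < ticks; i++ {
        tokens += int(perTick) // int(0.25) == 0
    }
    return tokens
}

func main() {
    // 2.5 tokens/second observed every 100 ms is 0.25 tokens per tick.
    // 40 ticks is 4 seconds, so 10 tokens should have accumulated.
    fmt.Println(refillFloat(0.25, 40)) // 10
    fmt.Println(refillInt(0.25, 40))   // 0
}
```

An integer bucket can be made to work by tracking leftover time instead of leftover tokens, but float64 gets the same result with less bookkeeping.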

The Allow() Method

Two operations happen on every call: refill tokens based on elapsed time, then attempt to consume one:

func (tb *TokenBucket) Allow() bool {
    tb.mu.Lock()
    defer tb.mu.Unlock()

    now := time.Now()
    elapsed := now.Sub(tb.lastRefill).Seconds()

    // Add tokens based on elapsed time
    tb.tokens += elapsed * tb.refillRate
    if tb.tokens > tb.capacity {
        tb.tokens = tb.capacity
    }
    tb.lastRefill = now
    tb.lastAccess = now

    // Try to consume a token
    if tb.tokens >= 1 {
        tb.tokens--
        return true
    }
    return false
}

The mutex is non-negotiable. Go’s HTTP server handles each request in its own goroutine, and without synchronization you’ll get race conditions corrupting the token count. I hit this on an early version — stress testing showed erratic behavior under concurrent load that was nearly impossible to reproduce in isolation.
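
One way to check the locking is to hammer a bucket from many goroutines and count the successes. The sketch below uses fixedBucket, a stripped-down stand-in for TokenBucket with no refill (an assumption made so the expected count is exact), and a hypothetical hammer helper.

```go
package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

// fixedBucket is a simplified stand-in for TokenBucket: same locking,
// but no refill, so allowed calls must equal the starting token count.
type fixedBucket struct {
    mu     sync.Mutex
    tokens float64
}

func (b *fixedBucket) Allow() bool {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.tokens >= 1 {
        b.tokens--
        return true
    }
    return false
}

// hammer calls Allow from `workers` goroutines, `perWorker` times each,
// and returns the total number of allowed calls.
func hammer(b *fixedBucket, workers, perWorker int) int64 {
    var allowed int64
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < perWorker; i++ {
                if b.Allow() {
                    atomic.AddInt64(&allowed, 1)
                }
            }
        }()
    }
    wg.Wait()
    return allowed
}

func main() {
    b := &fixedBucket{tokens: 100}
    // 5000 attempts against 100 tokens: exactly 100 may succeed.
    fmt.Println(hammer(b, 50, 100)) // 100
}
```

Remove the Lock and Unlock calls and the count can drift; running the same code under go test -race or go run -race reports the data race directly.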

Per-Client Rate Limiting

A single global bucket throttles your entire API at once, which isn’t what you want. You need one bucket per client — keyed by IP address or API key:

type RateLimiter struct {
    clients    map[string]*TokenBucket
    mu         sync.RWMutex
    capacity   float64
    refillRate float64
}

func NewRateLimiter(capacity, refillRate float64) *RateLimiter {
    rl := &RateLimiter{
        clients:    make(map[string]*TokenBucket),
        capacity:   capacity,
        refillRate: refillRate,
    }
    go rl.cleanup()
    return rl
}

func (rl *RateLimiter) getBucket(clientID string) *TokenBucket {
    rl.mu.RLock()
    bucket, exists := rl.clients[clientID]
    rl.mu.RUnlock()

    if exists {
        return bucket
    }

    rl.mu.Lock()
    defer rl.mu.Unlock()
    // Double-check after acquiring write lock
    if bucket, exists = rl.clients[clientID]; exists {
        return bucket
    }
    bucket = NewTokenBucket(rl.capacity, rl.refillRate)
    rl.clients[clientID] = bucket
    return bucket
}

func (rl *RateLimiter) Allow(clientID string) bool {
    return rl.getBucket(clientID).Allow()
}

Note the double-check pattern inside getBucket(). Between releasing the read lock and acquiring the write lock, another goroutine could have created the same bucket. Skip the second check and two goroutines can each create a bucket for the same client; the later write replaces the earlier one, and the client briefly gets a fresh, full bucket. Check, lock, check again is the usual Go pattern for lazy initialization under concurrent access.

Cleaning Up Stale Buckets

Left unchecked, the client map grows by one entry per unique caller — on a public API facing thousands of unique IPs, that’s unbounded memory growth. A background goroutine handles eviction:

func (rl *RateLimiter) cleanup() {
    ticker := time.NewTicker(5 * time.Minute)
    for range ticker.C {
        rl.mu.Lock()
        for id, bucket := range rl.clients {
            bucket.mu.Lock()
            if time.Since(bucket.lastAccess) > 10*time.Minute {
                delete(rl.clients, id)
            }
            bucket.mu.Unlock()
        }
        rl.mu.Unlock()
    }
}

HTTP Middleware

Drop this into your middleware chain and you’re done:

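// Requires "net", "net/http", and "strconv" in the import block.
func RateLimitMiddleware(rl *RateLimiter) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // RemoteAddr is "ip:port". Key on the IP alone, otherwise every
            // new connection from the same client gets its own full bucket.
            // Behind a proxy you control, derive the client IP from
            // X-Forwarded-For instead, and trust it only from that proxy.
            clientIP, _, err := net.SplitHostPort(r.RemoteAddr)
            if err != nil {
                clientIP = r.RemoteAddr // no port present; use the value as-is
            }

            if !rl.Allow(clientIP) {
                w.Header().Set("Retry-After", "1")
                // Report the configured capacity rather than a hard-coded value.
                w.Header().Set("X-RateLimit-Limit", strconv.FormatFloat(rl.capacity, 'f', -1, 64))
                http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }
}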

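// Requires "log" and "net/http" in the import block.
func main() {
    // 10 token burst, 2 tokens/second sustained
    limiter := NewRateLimiter(10, 2)

    mux := http.NewServeMux()
    mux.HandleFunc("/api/data", handleData) // handleData is your own handler

    handler := RateLimitMiddleware(limiter)(mux)
    // ListenAndServe only returns on failure, so surface the error.
    log.Fatal(http.ListenAndServe(":8080", handler))
}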

Testing Burst and Throttle Behavior

Don’t skip the tests. Verifying burst behavior up front saves you from tuning blindly in production:

func TestTokenBucket_BurstThenThrottle(t *testing.T) {
    // 5 token capacity, 1 token/second refill
    bucket := NewTokenBucket(5, 1)

    // Burst: 5 consecutive requests should all go through
    for i := 0; i < 5; i++ {
        if !bucket.Allow() {
            t.Fatalf("expected request %d to be allowed", i+1)
        }
    }

    // 6th should be rejected — bucket is empty
    if bucket.Allow() {
        t.Fatal("expected 6th request to be rejected")
    }

    // Wait for 1 token to refill
    time.Sleep(1100 * time.Millisecond)
    if !bucket.Allow() {
        t.Fatal("expected request to be allowed after refill")
    }
}

Tuning and Production Tips

Choosing Parameters That Make Sense

Getting the numbers wrong is the most common mistake. My starting point for public APIs:

Set capacity to 2–3× your expected legitimate burst (users sometimes retry quickly after errors)
Set refill rate to your sustained requests-per-second budget
Authenticated endpoints: more generous — 50 token capacity, 10/sec refill
Unauthenticated or sensitive endpoints: strict — 5 token capacity, 1/sec refill

Distributed Rate Limiting

The in-memory implementation works well for a single server. Scale horizontally and the problem surfaces immediately: each server keeps its own bucket state, so a client limited to 10 req/sec per server can reach 50 req/sec across 5 servers. For distributed setups, a common approach is to move the token state into Redis and update it with atomic Lua scripts. The algorithm transfers directly: same math, different storage backend. That’s worth a separate deep dive once you’ve outgrown a single node.

Headers Tell Clients What’s Happening

A 429 response without context is frustrating to debug. Always return:

X-RateLimit-Limit — your bucket capacity
X-RateLimit-Remaining — current token count (read before consuming)
Retry-After — seconds until the next token is available

Well-behaved clients will back off automatically when they see these headers. Badly behaved ones at least give you something to point to in the logs.

Where to Go From Here

The implementation above is around 100 lines of Go, handles concurrent access correctly, cleans up stale state automatically, and plugs into standard HTTP middleware. Small enough to read in one sitting — which matters when you need to adapt it, debug it at 3am, or walk the next engineer through every line.

Rate limiting belongs in your first commit, not your postmortem. Start with the defaults here, run load tests against your actual endpoints, and tune from real traffic numbers rather than guesses.