How I Cut Our OpenAI Bill by 50% Using the Batch API

AI tutorial - IT technology blog

The $5,000 Monthly Bill That Didn’t Need to Happen

Last quarter, my team hit a wall. We were tasked with categorizing 2.4 million customer feedback entries for a major retail client. Using the standard OpenAI chat.completions endpoint, our projected costs sat at a painful $5,100. Worse, we were hitting our tokens-per-minute (TPM) rate limit every few minutes, which kept stalling our ingestion pipeline.

The project wasn’t time-sensitive. We didn’t need responses in 500 milliseconds; we just needed the data processed within a 24-hour window. By moving the workload to the OpenAI Batch API, our bill dropped to $2,550, exactly half the projection. The rate limit headaches went away as well. If you are scaling AI features on a budget, this is one of the most effective architectural shifts you can make.

The “Immediacy Premium” is Killing Your Margins

Most developers default to synchronous API calls. You send a request, wait for the response, and move to the next task. This is the right choice for a chatbot. However, it is a massive liability for background tasks like data enrichment, large-scale SEO content generation, or historical analysis.

With synchronous calls you pay full price for immediacy: the provider has to serve your request the moment it arrives. Synchronous calls are also bound by the rate limits of your usage tier. If you try to push 1,000,000 prompts through a standard pipe, you’ll spend more time debugging 429 (Too Many Requests) errors than actually generating value. Batching sidesteps this by letting OpenAI run your jobs in the background whenever it has spare capacity.

Comparing the Options: Sync vs. Batch

Before refactoring our pipeline, I weighed three different strategies for handling our 2.4 million rows:

  • Standard Synchronous API: Full price ($5.00 per 1M input tokens for GPT-4o). High friction due to rate limits. Instant results.
  • Multi-threading/AsyncIO: Better throughput, but still full price. It requires retry logic with exponential backoff to handle rate limit bursts.
  • OpenAI Batch API: 50% discount ($2.50 per 1M input tokens for GPT-4o). A separate, much larger rate limit pool. Results arrive within 24 hours.
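
To make the retry logic from the second option concrete, here is a minimal sketch of exponential backoff with jitter. The names call_with_backoff and is_rate_limit are hypothetical helpers for illustration, not part of the OpenAI SDK, and the delays are arbitrary defaults.

```python
import random
import time


def call_with_backoff(fn, is_rate_limit, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a rate limit error, wait and retry with a doubling delay.

    fn            -- zero-argument callable that performs the request
    is_rate_limit -- callable(exc) -> bool, decides whether an error is retryable
    sleep         -- injectable so the function can be tested without waiting
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Give up on non-retryable errors or once the retry budget is spent.
            if not is_rate_limit(exc) or attempt == max_retries:
                raise
            # 1s, 2s, 4s, ... plus random jitter so parallel workers
            # do not all retry at the same moment.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

With the OpenAI Python SDK, is_rate_limit would typically be something like `lambda e: isinstance(e, openai.RateLimitError)`. Every worker in a multi-threaded pipeline needs a wrapper like this, which is the complexity the Batch API lets you drop.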

Implementation: The Batch Workflow

The Batch API doesn’t use a standard JSON POST body for the request itself. Instead, you upload a .jsonl file where each line is an individual request. OpenAI processes these in the background whenever they have spare capacity.

1. Creating the Request File (.jsonl)

Every request needs a custom_id. This is vital because the output file won’t necessarily be in the same order as your input. You’ll use this ID to map results back to your database.

{"custom_id": "user-rev-101", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "system", "content": "Categorize sentiment."}, {"role": "user", "content": "Great product!"}]}}
{"custom_id": "user-rev-102", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "gpt-4o-mini", "messages": [{"role": "system", "content": "Categorize sentiment."}, {"role": "user", "content": "Arrived broken."}]}}
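
In practice you generate these lines from your data rather than writing them by hand. The sketch below is one way to do it, assuming the rows arrive as (custom_id, text) pairs; build_request and write_batch_files are hypothetical helper names, and the 50,000-requests-per-file default is an assumption you should check against OpenAI’s current limits.

```python
import json


def build_request(custom_id, text, model="gpt-4o-mini"):
    """Build one Batch API request line in the same shape as the example above."""
    return {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [
                {"role": "system", "content": "Categorize sentiment."},
                {"role": "user", "content": text},
            ],
        },
    }


def write_batch_files(rows, prefix="tasks", max_requests=50_000):
    """Write rows to one or more .jsonl files, at most max_requests lines each.

    rows is an iterable of (custom_id, text) pairs. Returns the paths written.
    """
    paths = []
    out = None
    written = 0  # requests written to the current file
    try:
        for custom_id, text in rows:
            if out is None or written == max_requests:
                if out is not None:
                    out.close()
                path = f"{prefix}_{len(paths):04d}.jsonl"
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
                written = 0
            # json.dumps handles quoting and escaping, so each line is valid JSON.
            out.write(json.dumps(build_request(custom_id, text), ensure_ascii=False) + "\n")
            written += 1
    finally:
        if out is not None:
            out.close()
    return paths
```

Each resulting file is then uploaded as its own batch job, so a 2.4 million row workload becomes a few dozen jobs rather than one.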

2. Uploading and Triggering the Job

Once your file is ready, use the Python SDK to send it to OpenAI. Here is the exact boilerplate I use in production:

from openai import OpenAI

# The client reads the OPENAI_API_KEY environment variable by default,
# so the key does not have to be hard-coded in the script.
client = OpenAI()

# Upload the JSONL file to OpenAI's servers
with open("tasks.jsonl", "rb") as f:
    batch_file = client.files.create(file=f, purpose="batch")

# Start the batch processing job
batch_job = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

print(f"Job started: {batch_job.id}")

3. Retrieving the Data

You don’t need to poll the API every second. I usually set up a simple script to check the status every 15 minutes. Once the status is completed, you can download the results file.

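# Check if the job is done
job_status = client.batches.retrieve(batch_job.id)

if job_status.status == "completed":
    # Fetch the actual response data
    results = client.files.content(job_status.output_file_id)
    with open("final_results.jsonl", "wb") as f:
        f.write(results.read())
    print("Processing complete. Results saved.")
elif job_status.status in ("failed", "expired", "cancelled"):
    # The job ended without completing; inspect it before resubmitting
    print(f"Batch ended with status: {job_status.status}")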
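
The 15-minute check can be wrapped in a small polling loop. This is a sketch, not the script I run: wait_for_batch is a hypothetical helper, the retrieve function is passed in so the loop can be tested without network access, and the interval and maximum number of checks are assumptions.

```python
import time

# Statuses after which a batch will not change any more.
TERMINAL_STATES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(retrieve, batch_id, interval_s=900, max_checks=100, sleep=time.sleep):
    """Poll a batch until it reaches a terminal status, then return the job object.

    retrieve -- callable(batch_id) returning an object with a .status attribute,
                for example client.batches.retrieve
    """
    for _ in range(max_checks):
        job = retrieve(batch_id)
        if job.status in TERMINAL_STATES:
            return job
        sleep(interval_s)
    raise TimeoutError(f"Batch {batch_id} still running after {max_checks} checks")
```

Usage would look like `job_status = wait_for_batch(client.batches.retrieve, batch_job.id)`, followed by the download step above when the returned status is completed.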

Hard-Won Lessons from Production

After processing tens of millions of tokens, I’ve noticed patterns that the documentation skips. First, the 24-hour window is a conservative estimate. Most of my batches containing 50,000 requests finish in under 45 minutes. However, never promise your stakeholders a sub-hour turnaround for batch jobs.

Second, validation is your best friend. If a single line in your 200 MB JSONL file (the input size limit at the time of writing) has a missing comma, the entire batch job might fail before it even starts. I now use a local Pydantic script to validate every line before I hit the upload button. It saves hours of frustration.
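
To show the kind of checks that pay off, here is a simplified validator. It is not my Pydantic script: it uses only the standard library json module, and validate_jsonl and REQUIRED_KEYS are hypothetical names. It checks that every line parses, carries the four top-level keys from the request format above, and has a unique custom_id.

```python
import json

# Top-level keys every request line is expected to carry.
REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    seen_ids = set()
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            if not line.strip():
                problems.append((n, "blank line"))
                continue
            try:
                req = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((n, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(req, dict):
                problems.append((n, "line is not a JSON object"))
                continue
            missing = REQUIRED_KEYS - req.keys()
            if missing:
                problems.append((n, f"missing keys: {sorted(missing)}"))
            custom_id = req.get("custom_id")
            if isinstance(custom_id, str):
                # Duplicate IDs make it impossible to map results back to rows.
                if custom_id in seen_ids:
                    problems.append((n, f"duplicate custom_id: {custom_id}"))
                seen_ids.add(custom_id)
    return problems
```

Running this before the upload and refusing to continue when the list is non-empty catches a malformed line in seconds instead of after a failed job.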

Handling Partial Success

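Not every request in a batch will succeed. If a specific prompt triggers a content filter or an internal server error, OpenAI will still provide results for the other 9,999 requests. Your post-processing code must check, for every custom_id, both the error field and the status_code of the response in the output file. Failed requests can also be reported in a separate error file (the batch’s error_file_id), so download that file too when one is present.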
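
A sketch of that post-processing step follows. It assumes each result line is a JSON object with custom_id, error, and a response object holding status_code and a chat completion body; verify that layout against a real results file before relying on it. split_results is a hypothetical helper name.

```python
import json


def split_results(path):
    """Split a results .jsonl file into successes and failures, keyed by custom_id.

    Returns (succeeded, failed): succeeded maps custom_id to the model's text,
    failed maps custom_id to whatever error detail the line carries.
    """
    succeeded, failed = {}, {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            custom_id = record["custom_id"]
            response = record.get("response") or {}
            if record.get("error") is None and response.get("status_code") == 200:
                # Assumed chat completion layout: first choice, message content.
                succeeded[custom_id] = response["body"]["choices"][0]["message"]["content"]
            else:
                failed[custom_id] = record.get("error") or response.get("body")
    return succeeded, failed
```

The failed dictionary gives you the list of custom_id values to fix and resubmit in a follow-up batch, while the succeeded entries can be written back to the database straight away.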

Cost-Benefit Breakdown

| Metric | Standard API (GPT-4o) | Batch API (GPT-4o) |
| --- | --- | --- |
| Price per 1M input tokens | $5.00 | $2.50 |
| Price per 1M output tokens | $15.00 | $7.50 |
| Rate limits | Shared/restrictive | Dedicated/high-volume |
| Turnaround time | Immediate | Up to 24 hours |
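
To apply these prices to your own workload, a small estimator is enough. The sketch below uses the per-million-token prices from the table; the token counts in the example are made up for illustration and are not our project’s actual figures. estimate_cost and PRICES are hypothetical names.

```python
# (input price, output price) in USD per 1M tokens, taken from the table above.
PRICES = {
    "standard": (5.00, 15.00),
    "batch": (2.50, 7.50),
}


def estimate_cost(input_tokens, output_tokens, mode):
    """Estimate the cost in USD for a workload under the given pricing mode."""
    input_price, output_price = PRICES[mode]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 100M input tokens and 10M output tokens.
standard = estimate_cost(100_000_000, 10_000_000, "standard")
batch = estimate_cost(100_000_000, 10_000_000, "batch")
print(f"Standard: ${standard:,.2f}")  # Standard: $650.00
print(f"Batch:    ${batch:,.2f}")     # Batch:    $325.00
```

Because both input and output prices are halved, the saving is 50% regardless of how your tokens split between prompts and completions.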

Final Thoughts: Hot vs. Cold Architecture

Adopting the Batch API changed how I architect AI systems. I now split tasks into two buckets. “Hot” tasks, like a user asking a question in a chat box, stay on the synchronous API. “Cold” tasks, like generating weekly reports or tagging a database, move to the Batch API. This simple split allowed us to scale our processing power by 10x without increasing our monthly budget. If you are still using synchronous calls for bulk data, you are essentially leaving half your budget on the table.
