How to handle rate limits when using multiple agents in parallel? #4078
looking for answer
Replies: 6 comments
Had this exact problem when I built my multi-agent system. Here's what actually worked for me:

The easiest fix is to use different API keys for different agents, if you have them. But if you're on one key like I was, the solution is to set `max_rpm` in your agent config: pass `max_rpm=20` or `max_rpm=30` as a parameter in your `Agent` definition, alongside `role` and `goal`. This tells CrewAI to limit requests per minute, and it will automatically pace calls so you don't hit OpenAI's limits.

If you still hit rate limits even with `max_rpm` set, try switching from parallel to sequential processing. Yes, it's slower, but it's far more reliable. I only use parallel for truly independent tasks now.

One more thing that helped me: if you're using GPT-4 for all agents, switch some of them to GPT-4o-mini. It's cheaper and has higher rate limits, so save GPT-4 for just your critical-thinking agent.

The combination of `max_rpm` limiting, sequential processing, and mixing models basically solved all my rate limit issues. Hope this helps!
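To illustrate what `max_rpm`-style pacing does under the hood, here is a hand-rolled sketch (this is an illustration of the idea, not CrewAI's actual implementation; the `RpmPacer` class is made up for this example):

```python
import time

class RpmPacer:
    """Enforce a max requests-per-minute budget by spacing out calls."""
    def __init__(self, max_rpm):
        self.min_interval = 60.0 / max_rpm  # seconds between calls
        self.last_call = 0.0

    def wait(self):
        # Sleep just long enough to keep the average rate under max_rpm
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()
```

With CrewAI itself you don't write this by hand: passing `max_rpm=20` to `Agent(...)` makes the framework do the pacing for you.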
Rate limiting with parallel agents is tricky. Here are the patterns that work:

1. Semaphore / Token Bucket

```python
import asyncio
from asyncio import Semaphore

rate_limiter = Semaphore(5)  # Max 5 concurrent calls

async def rate_limited_call(agent, task):
    async with rate_limiter:
        return await agent.execute(task)
```

2. Per-provider limits

3. Exponential backoff with jitter

```python
import random
import time

from openai import RateLimitError  # or your provider's equivalent

def backoff_retry(func, max_retries=5):
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
    raise Exception("Max retries exceeded")
```

4. Request queuing

5. Caching

We run multi-agent systems at Revolution AI with mixed provider strategies: some agents on GPT-4o, others on Claude, which spreads the rate limit budget. Works well for production workloads.

What provider are you hitting limits with?
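For the request queuing pattern (item 4 above), a minimal single-worker sketch with `asyncio.Queue` looks like this. The function names and the fake response string are illustrative stand-ins, not any library's API:

```python
import asyncio

async def paced_worker(queue, results, rps=2.0):
    """Drain queued requests at a fixed rate so bursts never exceed rps."""
    while True:
        item = await queue.get()
        if item is None:  # sentinel: no more work
            break
        idx, prompt = item
        # Stand-in for a real LLM call
        results[idx] = f"response:{prompt}"
        await asyncio.sleep(1.0 / rps)

async def run_queued(prompts, rps=2.0):
    queue = asyncio.Queue()
    results = {}
    for i, p in enumerate(prompts):
        queue.put_nowait((i, p))
    queue.put_nowait(None)
    await paced_worker(queue, results, rps=rps)
    return [results[i] for i in range(len(prompts))]
```

The nice property of a queue is that agents can enqueue bursts freely; only the worker's drain rate ever touches the provider.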
Rate limiting with parallel agents is a real challenge! At RevolutionAI (https://revolutionai.io) here is what works:

Token bucket approach:

```python
import asyncio
from asyncio import Semaphore

class RateLimiter:
    """Allows at most `rpm` calls in any rolling 60-second window."""
    def __init__(self, rpm=60):
        self.semaphore = Semaphore(rpm)

    async def acquire(self):
        await self.semaphore.acquire()
        # Return the permit a minute later, restoring capacity
        asyncio.create_task(self._release_after(60))

    async def _release_after(self, delay):
        await asyncio.sleep(delay)
        self.semaphore.release()
```

Per-provider strategies:

Agent-level controls:

Monitoring:

The key is making rate limiting transparent to agents: they should not need to know about it!
Rate limiting with parallel agents needs coordination. Here are patterns that work:

1. Shared rate limiter

```python
from asyncio import Semaphore

# Global semaphore for API calls
api_semaphore = Semaphore(5)  # Max 5 concurrent calls

class RateLimitedLLM:
    def __init__(self, llm):
        self.llm = llm

    async def call(self, prompt):
        async with api_semaphore:
            return await self.llm.call(prompt)
```

2. Token bucket per model

```python
from aiolimiter import AsyncLimiter

# OpenAI: 10K TPM, ~3 requests/sec
openai_limiter = AsyncLimiter(3, 1)  # 3 per second

async def rate_limited_call(llm, prompt):
    async with openai_limiter:
        return await llm.call(prompt)
```

3. Sequential for rate-sensitive tasks

```python
from crewai import Crew, Process

crew = Crew(
    agents=[agent1, agent2, agent3],
    tasks=[task1, task2, task3],
    process=Process.sequential,  # Not parallel
)
```

4. LiteLLM with built-in rate limiting

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[...],
    max_retries=3,
    timeout=60,
    # LiteLLM handles 429s automatically
)
```

5. Stagger agent starts

```python
import asyncio

async def run_agents_staggered(agents, delay=2):
    for agent in agents:
        asyncio.create_task(agent.run())
        await asyncio.sleep(delay)  # Stagger starts
```

We run multi-agent systems at Revolution AI; token bucket + LiteLLM retry is the most reliable combo.
Rate limiting with parallel agents is critical! At RevolutionAI (https://revolutionai.io) we handle this:

Solutions:

Semaphore for concurrency:

```python
from asyncio import Semaphore

rate_limiter = Semaphore(5)  # 5 concurrent calls

async def rate_limited_call(agent, task):
    async with rate_limiter:
        return await agent.execute(task)
```

LiteLLM Router for spreading load:

```python
from litellm import Router

router = Router(
    model_list=[...],
    routing_strategy="least-busy",
    num_retries=3
)
```

Exponential backoff with tenacity:

```python
from tenacity import retry, wait_exponential

@retry(wait=wait_exponential(min=1, max=60))
def call_llm(prompt):
    ...
```

Most reliable: use LiteLLM for automatic rate limit handling!
Rate limiting with parallel agents is fundamentally a resource contention problem: multiple agents competing for a shared quota. A few approaches beyond what has been mentioned:

1. Provider-level load spreading (the real fix)

Most solutions above handle retries after hitting limits. The better approach is to never hit them in the first place by spreading load across providers. LiteLLM Router is the easiest path:

```python
from litellm import Router

router = Router(
    model_list=[
        {"model_name": "fast", "litellm_params": {"model": "gpt-4o-mini"}},
        {"model_name": "fast", "litellm_params": {"model": "claude-3-haiku-20240307"}},
        {"model_name": "fast", "litellm_params": {"model": "groq/llama-3.1-8b-instant"}},
    ],
    routing_strategy="least-busy",
    num_retries=2
)
# Automatically routes to the least-loaded provider
```

2. Asymmetric model allocation

Not all agents need the same model. A researcher agent doing broad retrieval can run on a cheap/fast model; only the synthesis agent needs GPT-4. This alone can 3-4x your effective rate limit budget, since cheaper models have higher TPM tiers.

3. Unified API layer with higher aggregate limits

One approach we use at GPU-Bridge: route all agent calls through a single endpoint that internally balances across multiple providers. You get one integration point, and the provider-level rate limits are our problem, not yours. Works especially well when you have bursty parallel workloads.

```python
import requests

# All agents hit the same endpoint; we handle the balancing
agent_llm = requests.post(
    "https://api.gpubridge.xyz/run",
    headers={"Authorization": "Bearer YOUR_KEY"},
    json={"service": "llm", "model": "llama-3.3-70b", "messages": [...]}
)
```

4. Per-agent rate budgets in CrewAI

Wrap each agent's LLM so every agent gets its own request budget:

```python
from aiolimiter import AsyncLimiter

class BudgetedLLM:
    def __init__(self, llm, rpm=10):
        self.llm = llm
        self.limiter = AsyncLimiter(rpm, 60)  # rpm permits per 60 s

    async def ainvoke(self, messages):
        async with self.limiter:
            return await self.llm.ainvoke(messages)
```

The key insight: rate limits are a symptom of tight coupling between agents and a single provider. Loosen that coupling and the problem mostly disappears.
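To make the asymmetric allocation idea (point 2) concrete, it can be as simple as a role-to-model map. The role names and model choices here are hypothetical, just to show the shape:

```python
# Hypothetical role-to-model map: cheap, high-rate-limit models for
# high-volume agents; the expensive model only for final synthesis.
MODEL_BY_ROLE = {
    "researcher": "gpt-4o-mini",
    "summarizer": "gpt-4o-mini",
    "synthesizer": "gpt-4o",
}

def model_for(role):
    # Unlisted roles default to the cheap tier
    return MODEL_BY_ROLE.get(role, "gpt-4o-mini")
```

Because the cheap tier carries most of the traffic, the expensive model's quota is reserved for the one agent that actually needs it.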