The Developer's Guide to Background Jobs and Task Queues
Why Your Request Handler Should Not Do Everything
A user clicks “Generate Report.” Your server starts crunching numbers, querying the database, formatting a PDF, and attaching it to an email. Thirty seconds later, the request times out. The user refreshes, triggers the whole process again, and now you have two reports being generated simultaneously.
This is what happens when you try to do too much inside a single HTTP request. Background jobs solve this problem by moving slow, unreliable, or resource-heavy work out of the request/response cycle entirely.
If you have ever built resilient APIs, you already understand the importance of keeping request handlers fast and predictable. Background jobs are the other half of that equation.
When You Need Background Jobs
Not every operation belongs in a queue. Here is a practical framework for deciding.
Move to a background job when:
- The operation takes more than 1-2 seconds
- It involves a third-party API that might be slow or rate-limited
- It can fail and needs automatic retries
- The user does not need the result immediately
- It is resource-intensive (CPU, memory, or I/O)
- It needs to run on a schedule
Keep it inline when:
- The user needs the result before the page renders
- The operation is fast and reliable (under 200ms)
- Failure means the user action should fail too (like a payment charge)
| Operation | Inline or Background? | Why |
|---|---|---|
| Sending a welcome email | Background | User does not need to wait; email services can be slow |
| Resizing an uploaded image | Background | CPU-intensive; user can see a placeholder while it processes |
| Validating a form | Inline | User needs immediate feedback |
| Generating a monthly invoice PDF | Background | Can take several seconds; deliver via email or dashboard notification |
| Checking stock availability | Inline | User needs the answer before proceeding to checkout |
| Syncing data to a CRM | Background | Third-party API; can retry if it fails |
The Architecture of a Task Queue
Every task queue system has three core components.
Producers
The part of your application that creates jobs. When a user signs up, your registration handler produces a “send welcome email” job and places it on the queue.
// Producer: enqueue a job from your API handler
await emailQueue.add('send-welcome', {
userId: user.id,
email: user.email,
template: 'welcome'
});
The Queue (Broker)
The queue stores jobs until a worker is ready to process them. Redis ↗ is the most popular backing store for task queues in the Node.js and Python ecosystems. RabbitMQ ↗ is a dedicated message broker that offers more advanced routing and delivery guarantees.
Workers (Consumers)
Separate processes that pull jobs from the queue and execute them. Workers run independently of your web server, which means a slow job does not block your API.
// Worker: process jobs from the queue
const worker = new Worker('send-welcome', async (job) => {
const { userId, email, template } = job.data;
await sendEmail(email, template);
await markEmailSent(userId, 'welcome');
});
Choosing the Right Tool
The best tool depends on your language, infrastructure, and how much you want to manage yourself.
Self-hosted options
| Tool | Language | Broker | Best For |
|---|---|---|---|
| BullMQ ↗ | Node.js / TypeScript | Redis | Most Node.js applications; great DX, strong community |
| Celery ↗ | Python | Redis or RabbitMQ | Django and Flask apps; mature and battle-tested |
| Sidekiq | Ruby | Redis | Rails applications; known for performance |
| Hangfire | C# | SQL Server or Redis | .NET applications |
Managed cloud options
| Service | Provider | Best For |
|---|---|---|
| Amazon SQS ↗ | AWS | Serverless architectures; massive scale |
| Google Cloud Tasks ↗ | GCP | HTTP-based task execution |
| Cloudflare Queues | Cloudflare | Edge-first architectures with Workers |
If you are already running Redis for caching or sessions, BullMQ or Celery with a Redis broker is the lowest-friction option. If you need advanced routing, fan-out, or strict message ordering, RabbitMQ is worth the operational overhead.
Common Patterns
Fire and forget
The simplest pattern. Produce a job and move on. The worker processes it whenever it gets to it. Good for non-critical operations like analytics events, audit logs, or cache warming.
await analyticsQueue.add('page-view', {
page: '/pricing',
userId: session.userId,
timestamp: Date.now()
});
// Response sent immediately; analytics processed later
Request, then poll
For operations where the user eventually needs the result, return a job ID and let the client poll for completion. This keeps the initial response fast while still delivering the result.
// API endpoint: start report generation
const job = await reportQueue.add('monthly-report', {
accountId: req.params.id,
month: '2026-03'
});
res.json({ jobId: job.id, status: 'processing' });
// Separate endpoint: check job status
const job = await reportQueue.getJob(req.params.jobId);
res.json({
status: await job.getState(),
result: job.returnvalue
});
Scheduled jobs (cron-style)
Run jobs on a recurring schedule. Useful for daily report generation, data cleanup, subscription renewals, and similar tasks. Most task queue libraries support this natively.
await reportQueue.add('daily-digest', { type: 'daily' }, {
repeat: { cron: '0 7 * * *' } // Every day at 07:00
});
Priority queues
Not all jobs are equally urgent. A password reset email should process before a weekly newsletter. Priority queues let you assign urgency levels so important jobs jump ahead.
// High priority: password reset
await emailQueue.add('password-reset', data, { priority: 1 });
// Low priority: marketing newsletter
await emailQueue.add('newsletter', data, { priority: 10 });
Handling Failures Gracefully
Jobs fail. APIs time out, databases go down, and third-party services return errors. A good task queue strategy assumes failure and plans for it.
Retry with exponential backoff
Instead of retrying immediately (which hammers an already struggling service), increase the delay between each attempt.
const worker = new Worker('sync-crm', processCrmSync, {
settings: {
backoffStrategy: (attemptsMade) => {
return Math.min(1000 * Math.pow(2, attemptsMade), 30000);
}
}
});
This gives the failing service time to recover. First retry after 2 seconds, then 4, then 8, capping at 30 seconds.
Dead letter queues
After a job exhausts all its retries, it moves to a dead letter queue (DLQ). This prevents failed jobs from blocking the main queue while preserving them for inspection and manual replay.
Think of the DLQ as a holding area for jobs that need human attention. Good logging and observability make debugging these failures far easier.
Idempotency
If a job might run more than once (due to retries or infrastructure hiccups), make sure running it twice produces the same result as running it once. This is idempotency, and it is essential for background jobs.
async function processPaymentReminder(job) {
const { invoiceId } = job.data;
// Check if reminder already sent (idempotency guard)
const existing = await db.reminders.findOne({
invoiceId,
type: 'payment',
sentAt: { $exists: true }
});
if (existing) return { skipped: true };
await sendReminder(invoiceId);
await db.reminders.updateOne(
{ invoiceId, type: 'payment' },
{ $set: { sentAt: new Date() } },
{ upsert: true }
);
}
Scaling Workers
One of the biggest advantages of background jobs is that you can scale workers independently of your web server. If your queue is backing up, add more workers. If traffic is quiet, scale down.
Concurrency
Most task queue libraries let you control how many jobs a single worker processes simultaneously.
const worker = new Worker('image-resize', processImage, {
concurrency: 5 // Process 5 images at once
});
Be careful with concurrency. CPU-bound tasks (image processing, PDF generation) benefit from fewer concurrent jobs per worker, with more worker instances instead. I/O-bound tasks (API calls, email sending) can safely run higher concurrency on a single worker.
Separate queues for different workloads
Do not put everything in one queue. A slow report generation job should not block time-sensitive email delivery. Use separate queues with dedicated workers.
| Queue | Concurrency | Workers | SLA |
|---|---|---|---|
email-critical | 10 | 2 | Under 30 seconds |
email-marketing | 5 | 1 | Under 5 minutes |
report-generation | 2 | 1 | Under 1 hour |
data-sync | 3 | 1 | Best effort |
This maps naturally to microservice boundaries if your architecture is heading in that direction.
Monitoring and Observability
A queue you cannot see into is a queue that will surprise you. At minimum, track these metrics:
- Queue depth: how many jobs are waiting. A growing queue means workers cannot keep up.
- Processing time: how long each job type takes. Sudden increases signal a problem.
- Failure rate: what percentage of jobs fail. Set alerts for spikes.
- Worker utilisation: are workers idle or maxed out? This guides scaling decisions.
BullMQ provides a built-in dashboard through Bull Board. Celery has Flower. For custom setups, export metrics to your existing monitoring stack.
If you are running background jobs in a Docker environment, make sure your container orchestration handles worker health checks and restarts. A crashed worker that nobody notices is worse than no worker at all.
Deploying Workers Alongside Your Application
Workers need to be part of your CI/CD pipeline, just like your web server. A common mistake is deploying application code without restarting workers, which leaves them running stale job handlers.
Graceful shutdown
When deploying new code, workers should finish their current jobs before shutting down. Killing a worker mid-job can leave data in an inconsistent state.
process.on('SIGTERM', async () => {
await worker.close(); // Finishes current jobs, stops accepting new ones
process.exit(0);
});
Infrastructure as code
Define your workers in the same infrastructure configuration as the rest of your stack. This ensures they scale together and share the same deployment lifecycle.
Getting Started: A Practical Checklist
If you are adding background jobs to an existing application, here is a step-by-step approach:
- Identify the bottleneck. Find the slowest operations in your request handlers. These are your first candidates for background processing.
- Pick a tool. If you are using Node.js and already have Redis, start with BullMQ. Python with Django or Flask? Celery is the standard choice.
- Start with one queue. Do not over-engineer. Move one operation to a background job, deploy it, and observe how it behaves in production.
- Add retries and a dead letter queue. Make sure failed jobs do not disappear silently.
- Add monitoring. Even a simple dashboard showing queue depth and failure rate is better than nothing.
- Scale when you need to. Add concurrency, then add workers, then add separate queues. In that order.
Background jobs are one of those patterns that, once you start using them, you wonder how you managed without them. They make your applications faster for users, more resilient to failures, and easier to scale. Start small, keep your jobs idempotent, and make sure you can see what is happening in your queues.
Frequently asked questions
What is a background job?
A background job is a unit of work that runs outside the main request/response cycle of your application. Instead of making the user wait for a slow operation to complete, you hand the work off to a separate process that runs independently. Common examples include sending emails, generating reports, processing file uploads, and syncing data with third-party APIs.
When should I use a task queue instead of processing inline?
Use a task queue when the operation takes more than a couple of seconds, when it can fail and needs retries, when it involves a third-party service with rate limits, or when you need to schedule work for a specific time. If the user does not need the result immediately, it is almost always better to offload it to a queue.
What is the difference between a message queue and a task queue?
A message queue is a general-purpose system for sending messages between services. A task queue is built on top of a message queue and adds features specifically for job processing, such as retries, scheduling, prioritisation, and progress tracking. Tools like BullMQ and Celery are task queues; RabbitMQ and Redis Streams are message queues.
How do I handle failed background jobs?
Implement exponential backoff for retries, set a maximum retry count, and route permanently failed jobs to a dead letter queue for manual inspection. Log enough context with each job that you can debug failures without reproducing them. Monitoring and alerting on failure rates is essential for catching problems early.
Can I use background jobs with serverless functions?
Yes. Cloud providers offer managed queue services that integrate with serverless. AWS SQS with Lambda, Google Cloud Tasks with Cloud Functions, and Cloudflare Queues with Workers all support this pattern. The trade-off is that you have less control over concurrency and execution environment compared to running your own workers.
Enjoyed this article? Get more developer tips straight to your inbox.
Comments
Join the conversation. Share your experience or ask a question below.
No comments yet. Be the first to share your thoughts.