Author - Biniyam Gebreyohannes

Dynamic Batching
Inference Engine

A high-concurrency inference serving system that uses asynchronous request queueing and dynamic batching to maximize throughput while preserving low tail latency.

View on GitHub →
async producer-consumer · dynamic batching · PyTorch · FastAPI + uvloop · CPU / GPU · Docker

01 The Problem

Naive inference serving — one request in, one forward pass out — wastes compute. Accelerators perform dramatically better on batched workloads. But batching means waiting, and waiting means latency.

Under bursty traffic, an inference system must balance hardware utilization against responsiveness. Batch too aggressively and P99 latency spikes. Batch too conservatively and the hardware sits idle. DBIE implements the queueing, scheduling, and execution machinery to explore this tradeoff directly.

The core tension: Larger batches improve throughput and resource utilization, but increase waiting time and tail latency. Every component in DBIE is built around this tradeoff.
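A back-of-envelope model makes the tradeoff concrete. The formula and its constants (per-batch overhead, per-item cost) are illustrative assumptions for a simple open-loop arrival model, not measurements from DBIE:

```python
# Hypothetical cost model: with a fixed per-batch overhead and a roughly
# constant per-item cost, larger batches raise throughput, but the head
# request must wait for the batch to fill before any work starts.
def batch_tradeoff(batch_size, arrival_rate_per_s, overhead_ms=5.0, per_item_ms=1.0):
    """Return (throughput in req/s, head-of-line latency in ms)."""
    service_ms = overhead_ms + per_item_ms * batch_size
    fill_ms = 1000.0 * (batch_size - 1) / arrival_rate_per_s  # waiting for the batch to fill
    throughput = batch_size / (service_ms / 1000.0)
    return throughput, fill_ms + service_ms

for b in (1, 8, 32):
    tput, wait = batch_tradeoff(b, arrival_rate_per_s=200)
    print(f"batch={b:2d}  throughput={tput:7.1f} req/s  head latency={wait:6.1f} ms")
```

Even this toy model shows throughput and head-of-line latency both growing with batch size, which is the tension every later section works around.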

02 Architecture

DBIE is a four-stage producer-consumer pipeline. HTTP handlers produce requests. A scheduler forms batches. A runner executes inference. Futures return results.

Produce HTTP handlers validate and enqueue requests into a bounded async queue
Schedule FIFO or adaptive scheduler decides when to flush a batch
Execute Model runner processes the batch in a ThreadPoolExecutor
Return Results demuxed, futures resolved thread-safely on event loop
%%{init: {'theme': 'dark'}}%%
graph TB
  subgraph CLIENTS ["CLIENTS"]
    C1["Client 1"]
    C2["Client 2"]
    CN["Client N"]
  end
  subgraph API ["API LAYER"]
    VAL["Validation"]
    ROC["Request + Future"]
    BP{"Queue Full?"}
    R429["429"]
  end
  subgraph QUEUE ["QUEUE"]
    AQ["asyncio.Queue maxsize=1000"]
  end
  subgraph SCHED ["SCHEDULER"]
    FIFO["FIFO"]
    ADAPT["Adaptive"]
  end
  subgraph EXEC ["EXECUTOR"]
    BB["Batch Builder"]
    MR["Model Runner"]
    CPU["CPU"]
    GPU["GPU"]
    DEMUX["Demux Results"]
  end
  subgraph OBS ["OBSERVABILITY"]
    MC["Metrics"]
    HE["/health"]
    ME["/metrics"]
  end
  C1 & C2 & CN --> VAL --> ROC --> BP
  BP -->|yes| R429
  BP -->|no| AQ
  AQ --> FIFO & ADAPT
  FIFO & ADAPT --> BB --> MR
  MR --> CPU & GPU
  CPU & GPU --> DEMUX --> C1 & C2 & CN
  MR -.-> MC
  MC --- HE & ME

03 Request Lifecycle

One request, from HTTP ingress through batched inference and back.

01
Client sends POST /infer
Flat array of floats matching the model's input dimension.
02
Payload validation
Dimension mismatch returns 422 immediately. No queue space wasted.
03
Request object creation
UUID, monotonic timestamp, tensor payload, and an asyncio.Future for result delivery.
04
Queue insertion
put_nowait() into bounded queue. Full queue = HTTP 429. Never blocks.
05
Scheduler forms batch
Waits for deadline (anchored to head request arrival time), then drains queue.
06
Batch assembly
torch.stack() combines payloads. Shape mismatch resolves all futures with error.
07
Executor dispatch
run_in_executor() moves blocking inference to a thread. Event loop stays free.
08
Model forward pass
Single batched inference. CUDA sync on GPU path.
09
Result demux + future resolution
outputs[i] delivered via call_soon_threadsafe() to each request's future.
10
HTTP response
Handler awaits future, returns JSON with result and measured latency.
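The lifecycle above condenses into a runnable sketch. The Request class, batch_worker loop, and the stand-in model (element-wise doubling in place of a PyTorch forward pass) are illustrative names and simplifications, not DBIE's actual code:

```python
import asyncio, time, uuid

# Condensed sketch of steps 03-09: request objects carry a future, a worker
# drains the queue against a deadline anchored to the head request, runs the
# "model" off the event loop, and resolves futures thread-safely.
class Request:
    def __init__(self, payload, loop):
        self.id = str(uuid.uuid4())
        self.arrived = time.monotonic()
        self.payload = payload
        self.future = loop.create_future()

async def batch_worker(queue, max_batch=4, max_wait=0.01):
    loop = asyncio.get_running_loop()
    while True:
        batch = [await queue.get()]
        deadline = batch[0].arrived + max_wait  # anchored to head arrival
        while len(batch) < max_batch and time.monotonic() < deadline:
            try:
                batch.append(queue.get_nowait())
            except asyncio.QueueEmpty:
                await asyncio.sleep(0.001)
        # blocking "inference" runs on an executor thread, not the loop
        outputs = await loop.run_in_executor(
            None, lambda: [[2 * x for x in r.payload] for r in batch])
        for req, out in zip(batch, outputs):
            loop.call_soon_threadsafe(req.future.set_result, out)

async def main():
    queue = asyncio.Queue(maxsize=1000)
    loop = asyncio.get_running_loop()
    worker = asyncio.create_task(batch_worker(queue))
    reqs = [Request([float(i)], loop) for i in range(4)]
    for r in reqs:
        queue.put_nowait(r)  # a full queue would raise QueueFull -> HTTP 429
    results = [await asyncio.wait_for(r.future, timeout=1.0) for r in reqs]
    worker.cancel()
    return results

print(asyncio.run(main()))  # [[0.0], [2.0], [4.0], [6.0]]
```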

04 Scheduler Design

The scheduler answers one question: when should I stop waiting and flush the current batch?

FIFO Scheduler

  • Fixed batch size and wait timeout
  • Deadline anchored to oldest request arrival
  • Simple, predictable baseline strategy
  • Works well under steady-state load
  • Cannot adapt to traffic changes
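The FIFO flush decision reduces to a small predicate. The function name and defaults below are illustrative, chosen to match MAX_BATCH_SIZE=32 and MAX_WAIT_MS=50:

```python
# Flush when the batch is full, or when the oldest (head) request has
# waited past the deadline anchored to its arrival time.
def should_flush(batch_len, head_arrival_s, now_s, max_batch=32, max_wait_s=0.05):
    if batch_len == 0:
        return False  # nothing to flush
    return batch_len >= max_batch or (now_s - head_arrival_s) >= max_wait_s

print(should_flush(32, 0.0, 0.001))  # True: batch is full
print(should_flush(3, 0.0, 0.06))    # True: head deadline passed
print(should_flush(3, 0.0, 0.01))    # False: keep waiting
```

Anchoring the deadline to the head request (rather than the most recent arrival) is what bounds the worst-case wait for any single request.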

Adaptive Scheduler

  • EMA of queue depth and inter-arrival time
  • Dynamically adjusts batch size + wait window
  • Reduces batch under P99 latency pressure
  • Hysteresis prevents oscillation (requires N consecutive signals)
  • Grows batches under bursty traffic
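A sketch of how these signals might compose: EMA smoothing of queue depth plus N-consecutive-signal hysteresis. The class name, constants, and the doubling/halving policy are illustrative assumptions, not DBIE's actual tuning:

```python
# Adaptive batch sizing: an EMA smooths bursty queue-depth readings, and
# hysteresis requires N same-direction signals before the batch size moves.
class AdaptiveState:
    def __init__(self, alpha=0.2, hysteresis_n=3, batch=8):
        self.alpha = alpha
        self.hysteresis_n = hysteresis_n
        self.batch = batch
        self.depth_ema = 0.0
        self.streak = 0  # consecutive grow (+) or shrink (-) signals

    def observe(self, queue_depth, p99_ms, target_ms=100.0):
        self.depth_ema = self.alpha * queue_depth + (1 - self.alpha) * self.depth_ema
        if p99_ms > target_ms:
            signal = -1          # latency pressure: shrink
        elif self.depth_ema > self.batch:
            signal = 1           # sustained backlog: grow
        else:
            signal = 0
        # hysteresis: reset the streak whenever the direction flips
        self.streak = self.streak + signal if (signal and signal * self.streak >= 0) else signal
        if self.streak >= self.hysteresis_n:
            self.batch, self.streak = min(self.batch * 2, 64), 0
        elif self.streak <= -self.hysteresis_n:
            self.batch, self.streak = max(self.batch // 2, 1), 0
        return self.batch

s = AdaptiveState()
print([s.observe(100, 50) for _ in range(3)])  # [8, 8, 16]: grows only on the 3rd signal
```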

05 Systems Decisions

Non-blocking queue insertion

Handlers use put_nowait() and return 429 on overflow. Blocking on a full queue would stall the event loop and freeze all in-flight requests.
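A minimal sketch of the overflow path; the helper name and status-code mapping are illustrative. put_nowait() never awaits, so a full queue surfaces immediately as QueueFull rather than parking the coroutine:

```python
import asyncio

# Non-blocking enqueue: overflow maps to backpressure (429), never to a
# stalled event loop.
def try_enqueue(queue: asyncio.Queue, request) -> int:
    try:
        queue.put_nowait(request)
        return 202  # accepted; a future will carry the result back
    except asyncio.QueueFull:
        return 429  # backpressure: tell the client to retry later

q = asyncio.Queue(maxsize=2)
print([try_enqueue(q, i) for i in range(3)])  # [202, 202, 429]
```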

Executor isolation

PyTorch forward passes are synchronous and CPU-bound. run_in_executor() moves them to a thread so the event loop can continue accepting requests.
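A small demonstration of the isolation, with blocking_infer standing in for a synchronous PyTorch forward pass:

```python
import asyncio, concurrent.futures, threading

# Blocking work runs on an executor thread while the event loop stays free.
def blocking_infer(batch):
    return [x * 2 for x in batch], threading.current_thread().name

async def main():
    loop = asyncio.get_running_loop()
    # max_workers=1 mirrors the single-executor-worker decision
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        result, thread = await loop.run_in_executor(pool, blocking_infer, [1, 2, 3])
    # confirm the forward pass did not run on the event loop thread
    return result, thread != threading.current_thread().name

print(asyncio.run(main()))  # ([2, 4, 6], True)
```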

Single executor worker

Multiple threads cause OpenMP/MKL over-subscription. A single worker avoids context-switching overhead and delivers cleaner throughput.

Thread-safe future resolution

asyncio futures are not thread-safe: calling Future.set_result() from an executor thread races with the event loop. call_soon_threadsafe() schedules resolution on the event loop thread instead.
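The pattern in miniature; the worker function is illustrative:

```python
import asyncio, threading

# Resolving a future from a worker thread goes through call_soon_threadsafe,
# which hands set_result to the event loop thread rather than mutating loop
# state from outside it.
async def main():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()

    def worker():
        # NOT fut.set_result(42): asyncio objects are not thread-safe
        loop.call_soon_threadsafe(fut.set_result, 42)

    threading.Thread(target=worker).start()
    return await fut

print(asyncio.run(main()))  # 42
```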

Single ASGI worker

Multiple uvicorn workers split traffic across independent queues and models, destroying batching efficiency. DBIE runs one worker by design.

Timeout on futures

asyncio.wait_for() bounds how long a handler waits. Prevents memory leaks from disconnected clients whose futures would never be read.
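A sketch of the handler-side timeout; the helper name and status codes are illustrative:

```python
import asyncio

# wait_for bounds how long a handler holds the future; on timeout the
# handler can return an error and the future is released instead of
# lingering after a client disconnect.
async def await_result(fut, timeout_s=0.01):
    try:
        return 200, await asyncio.wait_for(fut, timeout=timeout_s)
    except asyncio.TimeoutError:
        return 504, None

async def main():
    loop = asyncio.get_running_loop()
    never = loop.create_future()   # result never arrives (orphaned request)
    prompt = loop.create_future()
    prompt.set_result("ok")
    return await await_result(prompt), await await_result(never)

print(asyncio.run(main()))  # ((200, 'ok'), (504, None))
```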

06 Failure Modes

Issue → Mitigation

  • Queue overflow → Immediate 429 via non-blocking put. Never stalls the event loop.
  • Tensor shape mismatch → torch.stack wrapped in try/except. All futures in the batch receive the error.
  • Client disconnect → Configurable timeout on future await prevents orphaned memory.
  • Cold-start latency → Warm-up batches run before the server accepts traffic.
  • Scheduler oscillation → Hysteresis requires N consecutive signals before changing batch size.
  • Docker health during warmup → start_period: 30s prevents kill-restart loops.
  • GPU memory fragmentation → torch.cuda.empty_cache() called every 100 batches.

07 Deployment

Single-process FastAPI on uvicorn. Environment-variable-driven configuration. Multi-stage Docker builds for CPU and GPU variants.

# CPU
python -m dbie

# GPU
INFERENCE_DEVICE=cuda:0 python -m dbie

# Adaptive scheduler
SCHEDULER_STRATEGY=adaptive MAX_WAIT_MS=30 python -m dbie

# Docker
docker compose up inference-cpu

Variable                      Default
MAX_BATCH_SIZE                32
MAX_WAIT_MS                   50
QUEUE_MAX_SIZE                1000
INFERENCE_DEVICE              cpu
SCHEDULER_STRATEGY            fifo
ADAPTIVE_TARGET_LATENCY_MS    100
WARMUP_BATCHES                10

08 Benchmark Results

Open-loop load testing across four traffic patterns. FIFO scheduler, MAX_BATCH_SIZE=32, MAX_WAIT_MS=50, CPU execution.

Latency and Throughput Overview

Benchmark Overview

Left: P50/P95/P99 latency across load patterns. Right: Achieved throughput. Burst traffic shows highest latencies as the scheduler absorbs spikes into large batches. Ramp achieves highest throughput as gradual load increase fills batches efficiently.

Latency Percentile Fan

Latency Fan

The fan between P50 and P99 widens under burst and step-function traffic — exactly where the batching tradeoff is most visible.

Request Success Rate

Success Rate

100% success rate across all patterns (3,463 requests, zero failures). Backpressure (HTTP 429) was not triggered — queue capacity of 1000 was sufficient.

09 Why This Matters

DBIE demonstrates engineering at the intersection of backend systems and ML infrastructure.

Async concurrency

Producer-consumer pipeline with futures, executors, and event loop isolation.

Queueing & scheduling

Bounded queues, backpressure, two scheduling strategies with different throughput-latency profiles.

Inference systems

Batching, warm-up, device management, and practical constraints of serving models.

Production awareness

Health checks, timeouts, memory management, graceful shutdown, observability.

The same problem space as NVIDIA Triton Inference Server, TensorFlow Serving, and vLLM's scheduling layer.