The effects of batch size and linger time on Kafka throughput

By default, the Kafka producer sends records as soon as possible, allowing up to max.in.flight.requests.per.connection unacknowledged requests per connection. Records that can't be sent immediately accumulate as batches in the producer's buffer; if records arrive faster than they can be sent and the buffer (buffer.memory) fills up, further send() calls block until space frees up.

We can optimise the way the producer batches messages to achieve higher throughput through two settings, batch.size and linger.ms.

linger.ms is the number of milliseconds that the producer waits for more messages before sending a batch. By default, this value is 0, meaning that the producer attempts to send messages immediately. Smaller batches have higher overhead – they are less compressible, and there's a per-request cost to processing and acknowledging them. Under moderate load, messages may not arrive frequently enough to fill a batch immediately, but by introducing a small delay (e.g. 50ms), we increase the likelihood that the producer can batch more messages together, improving throughput.
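To make the effect concrete, here's a simplified batching model – not the real client's logic (it ignores byte sizes and in-flight limits), just a sketch of how a small linger window reduces the number of batches sent:

```python
def count_batches(n_messages, interarrival_ms, linger_ms, max_batch_msgs):
    """Toy model of producer batching: a batch is sent when it is full,
    or when linger_ms has elapsed since its first message arrived."""
    batches = 0
    sent = 0
    while sent < n_messages:
        if interarrival_ms > 0:
            # messages arriving within the linger window join the batch
            fit_by_time = int(linger_ms // interarrival_ms) + 1
        else:
            fit_by_time = max_batch_msgs
        batch = min(max_batch_msgs, fit_by_time, n_messages - sent)
        sent += batch
        batches += 1
    return batches

# One message every 5 ms, batches capped at 100 messages:
print(count_batches(1000, 5, 0, 100))   # linger.ms=0  -> 1000 single-message batches
print(count_batches(1000, 5, 50, 100))  # linger.ms=50 -> 91 batches of ~11 messages
```

With no linger, every message goes out on its own; a 50 ms window cuts the number of requests by an order of magnitude, which is where the compression and per-request savings come from.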

The other configuration we can consider is batch.size. As the name suggests, this controls the maximum size of a batch in bytes (per partition), and Kafka will always send a batch once it's full, even if linger.ms hasn't elapsed. By increasing batch.size, we make it possible to send more messages at once.
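As a sketch, the two settings might be tuned together like this. The values are illustrative, not recommendations, and the broker address is a placeholder; the keys are Kafka's standard producer config names, as accepted by librdkafka-based clients such as confluent-kafka:

```python
# Illustrative producer config - tune the numbers for your own workload.
producer_conf = {
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "linger.ms": 75,      # wait up to 75 ms for more messages (default: 0)
    "batch.size": 65536,  # max batch size in bytes, per partition (default: 16384)
}
```

Raising both together is usually what you want: a larger batch.size is only useful if linger.ms gives the producer time to fill it.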

We did some testing on this, and found that the throughput gains level off at around 75–100ms of linger time for our message size and volume:

[Chart: throughput vs. linger time for three batch sizes – steep improvements levelling off at around 100ms]

Summary

  • Increasing the producer’s ability to batch – by raising linger.ms (e.g. we found ~75–100 ms worked best for our workload) and increasing batch.size – significantly improved throughput because it lets Kafka send larger, more-compressible batches and amortise per-request overhead.
  • But this is a trade-off – you add up to linger.ms of extra latency for the first message in a batch, and very large batches increase memory pressure and recovery cost on retries.
  • Testing different values will allow you to optimise throughput for your own workload.
