The effects of batch size and linger time on Kafka throughput

By default, the Kafka producer attempts to send records as soon as possible, allowing up to max.in.flight.requests.per.connection unacknowledged requests per broker connection. Records produced while those requests are in flight accumulate into batches, but if you keep producing faster than the broker acknowledges, the producer’s buffer (buffer.memory) eventually fills and send() blocks.

We can optimise the way the producer batches messages to achieve higher throughput through two settings, batch.size and linger.ms.

linger.ms is the number of milliseconds that the producer waits for more messages before sending a batch. Before Kafka 4.0 the default is 0 (4.0 raised it to 5), meaning that the producer attempts to send messages immediately. Smaller batches carry more overhead: they compress less well, and each one has a cost to process and acknowledge. Under moderate load, messages may not arrive frequently enough to fill a batch immediately, but by introducing a small delay (e.g. 50 ms), we increase the likelihood that the producer can batch more messages together, improving throughput.

The other setting to consider is batch.size. As the name suggests, this controls the maximum size of a batch in bytes, per partition (the default is 16384, i.e. 16 KB). Kafka always sends a batch once it’s full, even if linger.ms hasn’t elapsed, so by increasing batch.size we make it possible to send more messages at once.
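To make the interaction between the two settings concrete, here is a deliberately simplified model of a single partition, written in Python. It is a sketch, not the producer's real algorithm: the function name simulate_batches and its parameters are made up for illustration, every record is assumed to be the same size, and it ignores the fact that a real producer with linger.ms=0 still batches records that arrive while earlier requests are in flight.

```python
def simulate_batches(arrivals_ms, record_bytes, linger_ms, batch_size):
    """Return the record count of each batch sent for one partition.

    Simplified model: a batch is sent when it is full, or when a record
    arrives after the batch has been open for at least linger_ms.
    """
    batches = []
    current = 0        # records in the currently open batch
    opened_at = None   # arrival time of the first record in the open batch
    max_records = max(1, batch_size // record_bytes)

    for t in arrivals_ms:
        # The open batch has lingered long enough: send it before adding more.
        if current and t - opened_at >= linger_ms:
            batches.append(current)
            current = 0
        if current == 0:
            opened_at = t
        current += 1
        # A full batch is sent immediately, regardless of linger time.
        if current >= max_records:
            batches.append(current)
            current = 0

    if current:
        batches.append(current)  # flush whatever is left at the end
    return batches


# 100 records of 1,000 bytes, one every 10 ms, with a 16 KB batch size.
arrivals = list(range(0, 1000, 10))
for linger in (0, 50, 100, 200):
    batches = simulate_batches(arrivals, 1000, linger, 16384)
    print(f"linger.ms={linger:>3}: {len(batches)} batches")
# linger.ms=  0: 100 batches
# linger.ms= 50: 20 batches
# linger.ms=100: 10 batches
# linger.ms=200: 7 batches
```

Note how the benefit flattens once batches start filling before the linger time expires: at that point batch.size, not linger.ms, is the limit.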

We tested this and found that, for our message size and volume, the throughput gains level off at a linger.ms of around 75–100 ms:

[Chart: steep performance improvements for 3 different batch sizes, levelling off at around 100 ms]

Summary

  • Increasing the producer’s ability to batch – by raising linger.ms (e.g. we found ~75–100 ms worked best for our workload) and increasing batch.size – significantly improved throughput because it lets Kafka send larger, more-compressible batches and amortise per-request overhead.
  • But this is a trade-off – you add up to linger.ms of extra latency for the first message in a batch, and very large batches increase memory pressure and recovery cost on retries.
  • Testing different values will allow you to optimise throughput for your own workload.

The Multiplier Effect: How The Aggregation of Marginal Delays Derails Projects

The Aggregation of Marginal Gains is an improvement model attributed to Dave Brailsford. When he took over as performance director of British Cycling in 2003, the team was near the bottom of the rankings. Brailsford’s approach was to look at their processes in tiny detail, and improve each part by 1%, the logic being that these would all add up to a significant improvement. He talks about shipping mattresses between hotels so that athletes get a better night’s sleep, or fastidiously cleaning the team truck to prevent dust and debris undermining the fine tuning of the bikes. And it was a success – at both the 2008 and 2012 Olympics, the team won 70% of the gold medals for track cycling.

I want to propose a contrasting notion – the Aggregation of Marginal Delays – the slow accumulation of tiny lags and delays in a project that add up to a significant slip in delivery performance. These delays are often so small and (at the time) inconsequential that team members just brush them off. Perhaps you need to get approval from three people with busy calendars – it might take you a few days to get in their diaries. Frustrating – but we’re all busy, right? Maybe you need to request something from another team, which takes half a day to be released to you. Annoying – but the other team’s process is clearly laid out on their website – didn’t you read it? That person you needed to get advice from has taken the afternoon off to watch their kid’s nativity play. Who’d begrudge them that?

But these delays, each one small and explainable, add up, both quantitatively and culturally. None of them is worth escalating – by the time you get one in front of someone who could change it, the delay is in the past. But the cumulative impact of a few hours here, half a day there, across dozens of events, across months of work, is significant. And it sets the tone for how things are done – we, as an organisation, start to feel that it’s an acceptable state of affairs – like I said, all of the causes of the delays are reasonable, and none are malicious or the result of incompetence. And because of that, even the delays which could be avoided often aren’t.

I’m afraid I don’t have a silver bullet here. We’ve tried lots of things to make the impact of these delays visible, but none have worked. In most cases, the cure is worse than the disease, creating massive overhead. Here’s what we’ve tried:

  • Immediate escalation – but this made people nervous, and often there was nothing to be done – are we really going to summon that team member back from their child’s school play to answer our question?
  • Flagging potential bottlenecks up front – although it did help somewhat to remind team members to consider when they might have to book things in advance, too much forward planning is somewhat wasteful, and hinders the agility of the team.
  • Pushing accountability down to teams – we already do this as much as we can. But the culture in our part of the organisation, which wants to move fast, is at odds with central IT services providing IT to the wider company, which needs to deliver a secure and reliable service which meets everyone’s needs. So we can’t make all the decisions.
  • Capture data on areas with consistent delays – we tried to use this to systematically improve the processes in those areas, but often the teams which own them weren’t interested in change. They felt our proposed ‘improvements’ would introduce too much risk. And it was quite burdensome to track every request too.

So what can you do? In short, treat it like any other Continuous Improvement exercise – gather data, plan, execute, review.

  • Introduce a system to track delays – partial data is better than none.
  • Hold other teams to their SLAs, and use your data to demonstrate this.
  • Run a “Delay Spotlight” in your team meetings, where team members can raise frustrating delays, and teams can brainstorm improvements. Focus pilots on areas you can control, rather than trying to change central teams.

The Aggregation of Marginal Delays is an inherent challenge in large, complex organisations, but one that we can chip away at through collective analysis, data, communication and a mindset of continuous improvement. Remember that Marginal Gains accumulate too.

Names change, and so do email addresses

We’ve recently been rolling out a new internal application. At our organisation, users have an email address which is generally firstname.lastname@company.com, or something like that. When a user logs in to the application, the app will look them up using their email address and figure out what parts of the application the user should be able to use.

The problem

One day we got a ticket from a user who was adamant that they had access, but when we looked in the application, we couldn’t even find them in the system! Probing a bit further, it turned out that they had recently changed their name, and as a result, their email address had changed.

There are lots of common misconceptions around names, including that names never change. But people change their names for a variety of reasons:

  • In the UK and US, many people choose to change their last name when they get married, and may change it back if they subsequently divorce.
  • People who identify as trans or non-binary may choose to change their name to better reflect their gender identity.
  • People may choose to select or return to a name which they feel better reflects their cultural identity.
  • They may not like their name and want a better one.

When we designed the system, we didn’t think of this. My email address hasn’t changed for years. But in retrospect, it’s so obvious that we should have.
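One way to avoid this class of bug, sketched here in Python under assumptions the story above doesn’t confirm, is to key access on an identifier that never changes and treat the email address as just another attribute. The names User, UserDirectory and user_id are hypothetical, as is the idea that your identity provider issues a stable ID for each person; this is not a description of how our application actually works.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    user_id: str   # hypothetical immutable ID, e.g. issued by the identity provider
    email: str     # mutable attribute; never used as the lookup key
    permissions: set = field(default_factory=set)


class UserDirectory:
    """In-memory stand-in for the application's user store."""

    def __init__(self):
        self._by_id = {}

    def add(self, user):
        self._by_id[user.user_id] = user

    def lookup(self, user_id):
        # Returns None when the ID is unknown.
        return self._by_id.get(user_id)

    def change_email(self, user_id, new_email):
        # A name change only touches the email attribute;
        # the key and the permissions are untouched.
        self._by_id[user_id].email = new_email


directory = UserDirectory()
directory.add(User("u-1001", "jane.smith@company.com", {"reports"}))
directory.change_email("u-1001", "jane.jones@company.com")
print(directory.lookup("u-1001").permissions)  # {'reports'}
```

With this shape, a change of name updates one field and the user keeps their access, rather than silently becoming a different (unknown) user.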

Inclusive design

Inclusive design means thinking about all of our users, even ones who don’t yet exist or that we have yet to identify. Inclusive design makes the experience better for everyone, and it takes almost no effort or cost. Even if you don’t think you’ll benefit, you never know who might.

The Two Email Rule: When to Escalate from Email to Real Conversation

When I think back to the deepest of the many deep holes I’ve dug myself into over the years, they almost all start with an email.

When working through my inbox, it’s all too easy to just bash out a reply and hit send. Usually, that’s fine – a quick email is all it takes, and the issue is closed. But sometimes, that email triggers a reply, and that reply another, and it’s hard to predict when but eventually I’m having a complex conversation about a complex issue and it all goes wrong and before I know it, we’re at loggerheads and 37 people on the CC list think I’m a jerk.

Email is a low bandwidth communication channel. And this means that it’s hard to get across what we mean without misunderstanding. In an email, no one can tell if you’re smiling. You can’t acknowledge a change of mind, or moderate your language when you detect frustration on the other end.

And that’s why I have the “two email rule”. Whenever I find myself reading or writing a third email on a topic, I know I need to escalate to a higher-bandwidth channel – send an IM, call them, book a meeting. Obviously, in distributed teams, or during lockdown, this is hard. But it’s necessary.

So if I suddenly stop responding to your email, it’s not because I don’t care, it’s because I want to properly discuss this with you.

Can you do that new job?

Generally when evaluating someone for a role, I look for 5 things:

  1. Behaviours – how do they operate in a team? Do they admit to mistakes and learn from them? Do they help others? Communicate and live to their personal values? Are those values ones I want people in the team to live to?
  2. Accountability – can this person handle the magnitude of the role? Are they able to manage stakeholders of the right level of seniority?
  3. Domain – how deep is their knowledge of this business, industry, sector etc.? And how deep does it need to be?
  4. Function – what is their level of skill in this type of role? For example, if hiring a business analyst, how good a business analyst are they?
  5. Organisation – perhaps summarised as “knowing how things are done around here” – processes, culture etc. – does this person have the knowledge to make things happen?

Obviously, number 1 is a given – no one wants a brilliant jerk on the team. But most people have some of each of the others. The question is whether it’s enough to set them up for success in the new role. Usually, I’d expect someone to have strength in 1 or 2 of the others, and to have one or at most two which give headroom to grow as:

  • No headroom in role = boring job
  • Too many development areas = set up for failure

When a candidate is moving roles internally, they probably have number 5. So a step up to greater accountability, or moving to an entirely new business domain (if the company is big enough) might represent a solid plan. Doing both at once is probably too much for most people.

External candidates probably don’t possess organisational knowledge, so we should assume that’s a growth area. And that means they need to be fairly strong in two of the other areas. In my experience, people usually move company to step up. So I would expect external candidates to have strong domain knowledge and functional skill.

The 9 Ps

I often have conversations with friends and colleagues about their careers. And many times, I point people to a great blog post by my colleague Liz Aab, about the “7 Ps”. But I always find myself adding two to the list, so I thought I’d just post it here.

There are lots of factors which go into choosing a job. You can’t have all of them, all of the time. At least, I think you can’t. But you can (and should) decide which are most important to you. Here are Liz’s 7 Ps (which she says were originally 5 Ps from some other source). I’ve added my two on the end, and I’ve reworded some of Liz’s original post:

  1. Place : Where geographically do you want to work? The city/country you are based in and your commute affect how you spend your time, and who you spend your time with, both inside and outside work.
  2. People : Who specifically would you work with on a daily basis? Do you like them? Does your boss care about you and want to see you succeed?
  3. Pay : Does the job or sector pay you enough to live the life you want? If not, will your pay increase in a few years in this career path? Or, are you happy to change your lifestyle to accommodate a lower salary?
  4. Progression : Will you develop skills, knowledge, a network or a reputation that will help you move forward in your career? Does this job offer defined progression opportunities, or do you need to develop these for yourself? If so, are you comfortable with this?
  5. Perception : How do people react when you tell them what you do? Whose opinion do you really care about, and how important is that to you? Of course perceptions of jobs and industries change over time.
  6. Purpose : What is the company or organisation trying to achieve, and do you support that? It’s not just millennials that want to work on something they believe in.
  7. Procedures : In Liz’s list, this is how you do your job day to day. I’ve reworked it – for me, procedures is how the organisation operates. Do they expect a rigid 9-5, or are you trusted to deliver a result? Do decisions get made once and then implemented, or does it take a consensus to make change? Are you empowered to deliver, or do you need permission to take a bathroom break?
  8. Projects : While procedures might be how the work gets done, this is what you’re actually doing. Are you spending your day on the phone, or sitting reading stacks of paper, or crunching Excel, or standing on your feet in front of 25 teenagers? Is your work indoors or outdoors? And do you like doing those things?
  9. Pace : Is it frantic from the moment you wake to when you sleep? Or is there lots of space in the day for you to collect your thoughts or think things through? Are you expected to check your emails after hours, or do you ‘clock off’ when you’re done? What do you need to thrive?

Of course, as Liz points out, what you value today will differ from what’s important to you tomorrow. When you’re young and eager, you may want a role which is always on the go (high pace), and with a compelling purpose. If you start to plan a family, pay and progression move up the list.