Concurrency Black-Art: Excerpts from Real-world concurrency

Tricks to write multi threading

Know your cold path from hot paths

  1. In cold paths, keep the locking as coarse-grained as possible.
  2. In hot paths, locking should be simple and fine grained.
  3. If you don’t know the path, assume it’s a cold path and use coarse-grained locking. Be prepared to be proven wrong.

Intuition is frequently wrong — be data intensive

  1. In order to test, you should know how to put load on the system. Consider load generation and simulation as a first class problem; so should handle this during development(the earlier is better). Although a test load should mimic its production equivalent as closely as possible, timeliness is more important than absolute accuracy.
  2. The second step is to understand the scalability inhibitors on a production system which requires the ability to safely dynamically instrument its synchronization primitives. Dynamic instrumentation will provide the data needed to find those parts of the system that are inhibiting scalability. It will also help gather sufficient data to understand which techniques will be suitable for reducing contention.

Know when and when not to break up a lock

  1. Global locks can naturally becomes scalability inhibitors, it is reasonable to break up the lock into per-CPU locks, a hash table of locks, etc.
  2. Breaking up a lock is not the only way to reduce contention, and contention can be(and often is) more easily reduced by decreasing the hold time of the lock(by algorithmic optimization or finding activity that is needlessly protected by the lock).

Be wary of readers/writer locks

  1. Reader/Writer lock doesn’t scale well when lock is not long, because the state associated with the Reader/Writer lock must itself be updated atomically.
  2. In the absence of any sophisticated synchronization primitives, a reader/writer lock will use a single word of memory to store the number of readers. The number of readers must be updated atomically, acquiring the lock as a reader requires the same bus transaction, as acquiring a mutex, and contention on that line can hurt also.
  3. Reader/writer lock should be carefully used when a writer is blocked, reader can’t recursively acquire a lock as a reader and when a writer blocks between initial acquisition as a reader and recursive acquisition as a reader, deadlock will occur.

Consider per-CPU locking

  1. Per-CPU locking can be a convenient technique for diffracting contention, as a per-CPU lock is not likely to be contended.
  2. Let’s say, there are two types of threads with operating modes of different coherence requirements. One has a shorter hold times(and occur frequently) than the other and it can acquire the a per-CPU lock. The other type can grab all the per-CPU locks to construct coherent states. This paper implements distributed reader writer lock using per-CPU locking.
  3. It should be employed only when the data indicates that it’s necessary, as it introduces substantial complexity in implementation. In cold path, all per-CPU locks should be acquired in a single order.

Know when to broadcast — and when to signal

  1. Broadcast should be used to indicate state change and signal should be used for resource availability.

Learn to debug postmortem

  1. Problems in highly parallel systems are not necessarily reproducible, and a single core dump is often one’s only chance to debug them.
  2. One needs to have only a list of threads, their corresponding stack backtraces, and some knowledge of the system. The information is contained in a snapshot of state(core file).

Design your systems to be composable

  1. One can make locking entirely internal to the subsystem. e.g. in concurrent OS, locking never returns to user level with in-kernel locks held. This model can work whenever a crisp interface exists between software components: as long as control flow is never returned to the caller with locks held, the subsystem will remain composable.
  2. One can achieve concurrency and composability by having no locks whatever. In this case, there must be no global subsystem state — subsystem state must be captured in per-instance state, and it must be up to consumers of the subsystem to assure that they don’t access their instance in parallel. By leaving locking up to the client of the subsystem itself can be used concurrently by different subsystems and in different contexts. I don’t believe in this statement. It’s like keeping the boat in the harbor for safety.

Don’t use a semaphore where a mutex would suffice

  1. There is no way of propagating the blocking thread’s scheduling priority to the thread that is in critical section. This ability to propagate scheduling priority — priority inheritance — is critical in realtime system, and in the absence of other protocols, semaphore-based systems will always be vulnerable to priority inversion.
  2. Lack of ownership deprives the system of the ability to make assertions about itself. When ownership is tracked, the machinery that implements thread blocking can detect deadlocks and recursive lock acquisition.

Consider using nonblocking synchronization routines to monitor contention

  1. The second variant can be used to monitor one’s own contention. The failure indicates there is a contention. This failure indication can be used to dynamically reduce contention.

When reacquiring locks, consider using generation counts to detect state change

  1. In this case, a generation count may be associated with the data structure, so every time the data is changed the count is incremented.
  2. The logic that drops and reacquires the lock must cache the generation count before dropping the lock. After reacquisition, the count is compared and the state change logic may react accordingly.

Use wait- and lock-free structures only if absolutely must

Prepare for the thrill of victory — and the agony of defeat

  1. Removing the last impediment will give the best performance.
  2. Sometimes after breaking up a complicated lock, you will realize it’s the last one, it’s merely hiding another impediment to scalability. Removing this increases a very little performance or not at all.




C++11/14, Qt, Juce

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

4 common mistakes to avoid that will make your Startup a success

KWoC Blog — 2020

Build a Simple Pin Tool

Monitoring Cassandra Health and Performance Metrics

Summary of my understanding of Elasticsearch

Humans of DataHub: Arun Vasudevan

What is Google Kickstart?

Your ‘Hello World’ in Decarbonization

Your ‘Hello World’ in Decarbonization

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Asit Dhal

Asit Dhal

C++11/14, Qt, Juce

More from Medium


Understanding Microservices as a Software Architecture

Setting up Circle CI for Android (Github repo)

Percentile Latency in Backend