## Concurrency Kit Towards accessible non-blocking technology for C

Samy Al Bahra AppNexus, Inc. August 26, 2012



### History





### Motivation

Research and develop scalable synchronization methods for NUMA architectures.

What did we need?

- Concurrent memory model
- Support for atomic operations



| Spinlocks   |            |            |            |            |      |             |            |            |            |            |  |
|-------------|------------|------------|------------|------------|------|-------------|------------|------------|------------|------------|--|
| Algorithm   | cas        | s d        | ec_zer     | ro fa      | aa f | as          | load       | sto        | ore        |            |  |
| anderson    | $\diamond$ |            |            | \$         |      |             | $\diamond$ | $\diamond$ |            |            |  |
| cas         | $\diamond$ |            |            |            |      |             |            | $\diamond$ |            |            |  |
| clh         |            |            |            |            | <    | >           | $\diamond$ | $\diamond$ |            |            |  |
| dec         |            | $\diamond$ |            |            |      |             | $\diamond$ | $\diamond$ |            |            |  |
| fas         |            |            |            |            | <    | >           |            | $\diamond$ |            |            |  |
| ticket      |            |            |            | $\diamond$ |      |             | $\diamond$ | $\diamond$ |            |            |  |
| ticket_pb   |            |            |            | $\diamond$ |      |             | $\diamond$ | $\diamond$ |            |            |  |
| mcs         | $\diamond$ |            |            |            | <    | <b>&gt;</b> | $\diamond$ | $\diamond$ |            |            |  |
|             |            |            |            |            |      |             |            |            |            |            |  |
| RW          |            |            |            |            |      |             |            |            |            |            |  |
| Algorithr   | n          | cas        | inc        | dec        | faa  | fas         | s lo       | bad        | store      | RMO        |  |
| brloc       | :k         | 0          |            |            |      | 0           | \$         |            | $\diamond$ | $\diamond$ |  |
| byteloc     |            | $\diamond$ | $\diamond$ | $\diamond$ | 0    |             | $\diamond$ |            | $\diamond$ | $\diamond$ |  |
| rwlock_naiv | e          | 0          | $\diamond$ | $\diamond$ |      | 0           | \$         |            | \$         | $\diamond$ |  |



### Atomic Operation Support

| Operation    | apr        | atomic_ops | ck         | gcc        | glib       | liblfds    | urcu       |
|--------------|------------|------------|------------|------------|------------|------------|------------|
| •            |            | - 1        |            | 0          | 0          |            |            |
| add          |            |            | \$         |            |            |            | \$         |
| and          |            |            | <u>ہ</u>   |            | $\diamond$ |            | $\diamond$ |
| btc<br>btr   |            |            | <u>ہ</u>   |            |            |            |            |
|              |            |            | <u>ہ</u>   |            |            |            |            |
| bts          |            |            | \$         |            |            |            |            |
| cas          |            | <u> </u>   | <b>\$</b>  | $\diamond$ |            | \$         |            |
| cas2         |            | <u> </u>   | \$         |            |            |            |            |
| cas2_value   |            | \$         | \$         |            |            | \$         |            |
| cas_value    | \$         | <b>\$</b>  | \$         | <b>\$</b>  | <b>\$</b>  | \$         | \$         |
| , dec        | <b>\$</b>  |            | <b>\$</b>  |            |            |            | <b>\$</b>  |
| dec_zero     |            |            | <b>\$</b>  |            | <b>\$</b>  |            |            |
| faa          | <b>\$</b>  | <b>\$</b>  | $\diamond$ |            | \$         |            | <b>\$</b>  |
| fas          | <b>\$</b>  | <b>\$</b>  | <b>\$</b>  | <b>\$</b>  |            |            | <b>\$</b>  |
| inc          | <b>\$</b>  |            | $\diamond$ |            | <b>\$</b>  | $\diamond$ | <b>\$</b>  |
| inc_zero     |            |            | $\diamond$ |            |            |            |            |
| load         | $\diamond$ | <b>\$</b>  | $\diamond$ |            | $\diamond$ |            | $\diamond$ |
| neg          |            |            | $\diamond$ |            |            |            |            |
| neg_zero     |            |            | $\diamond$ |            |            |            |            |
| not          |            |            | $\diamond$ |            |            |            |            |
| or           |            | \$         | $\diamond$ |            | $\diamond$ |            | $\diamond$ |
| store        | $\diamond$ | $\diamond$ | $\diamond$ |            | $\diamond$ |            | $\diamond$ |
| sub          | $\diamond$ |            | $\diamond$ |            |            |            |            |
| xor          |            |            | $\diamond$ |            | $\diamond$ |            |            |
| Memory Model | G          | R          | R          | G          | G          | N          | R          |



## Memory Models

#### Intel Architecture Manual, Volume 3A, 8.5

"Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access."

- Is wait-free atomic snapshot with respect to loads and CAS2 possible?
- Implications on algorithms with asymmetric accesses.
- How can higher-level memory models accomodate ordering exceptions?

Bytelock read acquisition in the real world: 4 ticks  $\rightarrow$  38 ticks



#### Memory Ordering Modifiers

| Library    | LoadLoad   |       | LoadStore |        | StoreLoad  |       |        | StoreStore |       |            |            |       |  |
|------------|------------|-------|-----------|--------|------------|-------|--------|------------|-------|------------|------------|-------|--|
| atomic_ops | 0          |       | 0         | 0      | $\diamond$ | 0     | 0      |            | 0     | 0          | $\diamond$ | 0     |  |
| ck         | •          | ◇     | ◇         |        |            |       |        |            |       | •          | ◇          | ◇     |  |
| freebsd    | $\diamond$ | ◇     | *         |        |            |       |        |            |       | $\diamond$ | ◇          | *     |  |
| linux      | •          | ◇     |           |        |            |       |        |            |       | •          | ◇          | ◇     |  |
| urcu       | •          | ◇     | ◇         |        |            |       |        |            |       | •          |            | ♦     |  |
|            | x86-64     | Power | SPARC     | x86-64 | Power      | SPARC | x86-64 | Power      | SPARC | x86-64     | Power      | SPARC |  |

= NOP (strict fallback)

o = NOP

 $\star$  = Full semantics

♦ = Yes



Portability

A port should require little besides atomic load, store, fences and a single universal atomic primitive.



Transparency

Non-blocking progress guarantees should reflect rank in the wait-free hierarchy.



#### The ABA Problem

- Algorithms subject to CAS-based speculation are prone to ABA.
- ABA-safety typically provided with ABA counters or SMR.
- Currently, we have pigeon-holed LL/SC-based architectures into SMR.

#### NI 548

"The weak compare-and-exchange operations may fail spuriously, that is, return zero while leaving the value pointed to by expected unchanged."

The operation is guaranteed to fail if object was the destination of any concurrent write operations, regardless of functional side-effects.



#### Atomics License

| atomic_ops | ck          | urcu    |
|------------|-------------|---------|
| MIT        | Revised BSD | LGPL2.1 |

#### LGPL2 is restrictive

"If such an object file uses only numerical parameters, data structure layouts and accessors, and small macros and small inline functions (ten lines or less in length), then the use of the object file is unrestricted, regardless of whether it is legally a derivative work ... Otherwise, if the work is a derivative of the Library, you may distribute the object code for the work under the terms of Section 6. Any executables containing that work also fall under Section 6, whether or not they are linked directly with the Library itself."



# Summary of Opinions

- There is no silver bullet, concurrent applications that care about performance eventually require their own concurrent memory model
- A model should:
  - Expose wait-free hierarchy
  - Allow for exceptions to the memory model
- An implementation should optimize for underlying instruction set
- LL/SC needs love, ABA-safety yields fair number of performance wins







#### Examples of Progress

"Thank you for your message. I have no knowledge of the legal issues related to the use of [...], or to represent [...]'s position ..."

| State of the World |                                                             |     |    |      |  |  |  |  |  |
|--------------------|-------------------------------------------------------------|-----|----|------|--|--|--|--|--|
| EBR                | ΗP                                                          | PTB | PC | QSBR |  |  |  |  |  |
| 0                  | ٠                                                           | •   | 0  | 0    |  |  |  |  |  |
| $\circ = A_V$      | $\circ = Available$ (at least partially) $\bullet = Locked$ |     |    |      |  |  |  |  |  |





#### Literature Survey

- HP, RCU, QSBR and PTB have comprehensive documentation
- Proxy collectors backed by mailing list threads and one brief blog post
- A handful of paragraphs exploring epoch-based reclamation
  - Primarily thanks to Thomas Hart and Paul McKenney



Bringing hobbyists and academics back into the fold...

- Define the performance metrics for SMR.
  - Identify characteristic workloads
  - What can we quantify?
- Provide a starting point...
  - Reference Implementations
  - Documentation
  - Open Problems
  - Bounties and grants



#### **Epoch Reclamation**

- Amortizing protected section begin/end yields performance that is practically on-par with QSBR.
- Implemented lock-free and wait-free variants built on deferral scheme with thread-local retirement lists.
- Still vulnerable to unbounded memory growth.



PC, QSBR and EBR are all non-intrusive forms of SMR

| PnP SMR                                                                             |                                         |                                                   |                                    |
|-------------------------------------------------------------------------------------|-----------------------------------------|---------------------------------------------------|------------------------------------|
| Read-Side                                                                           |                                         |                                                   |                                    |
| EBR                                                                                 | PC                                      | QSBR                                              |                                    |
| <pre>epoch_read_begin(); [] epoch_read_end();</pre>                                 | <pre>pc_addref(); [] pc_delref();</pre> | <pre>rcu_read_lock(); [] rcu_read_unlock();</pre> |                                    |
| Write-Side (deferral)                                                               |                                         |                                                   |                                    |
| EBR                                                                                 | PC                                      |                                                   | QSBR                               |
| <pre>epoch_write_begin(); [] epoch_write_end(); epoch_defer(&amp;a-&gt;e, des</pre> | []<br>stroy); pc_de                     | efer(&a->e, destroy);                             | [_]<br>qsbr_defer(&a->e, destroy); |

- Do we want to generalize an interface?
- Can we create a general interface to these forms of SMR?

# Summary of Opinions

- Documentation and reference implementations are underwhelming for neophytes (exception is RCU and QSBR).
- No metrics and standards of quality for prospective contributors.
- NBDS is vulnerable to SMR-scheme lock-down, difficult to plug and play schemes for various workloads.
- There is a lack of independent surveys.







| State of the World |      |      |      |      |      |      |      |  |  |  |  |
|--------------------|------|------|------|------|------|------|------|--|--|--|--|
| Structure          | SPSC | SPMC | SPNC | UPMC | MPMC | MPSC | MPNC |  |  |  |  |
| bag                |      | 0/0  |      |      |      |      |      |  |  |  |  |
| bitmap             |      |      |      |      | 0/0  |      |      |  |  |  |  |
| fifo               | 0/0  |      |      | ●/●  |      |      |      |  |  |  |  |
| ht                 |      | 0/●  |      |      |      |      |      |  |  |  |  |
| stack              |      |      |      | ●/●  | ●/●  |      | 0/-  |  |  |  |  |
| ring               | 0/0  |      |      |      |      |      |      |  |  |  |  |
| queue.h            | 0/0  |      |      |      |      |      |      |  |  |  |  |

o = WF • = LF
Format is consumer>/<consumer>



#### Hash tables for unmanaged languages

- Bytestring keys allow for program-defined lifetime semantics
- Well-defined semantics
- Supports replace, insert, growth, delete, get, iteration
- Power-of-2 hash table with linear-probing/double-hashing hybrid
- Statistically a wait-free hash table

#### Future Work

- Provide an MPMC variant
- Trim down constant factors associated with interface
- Allow for user-defined upper-bounds on retry probability
- Allow bytestring tables to emulate memory savings of pointer-packing
- Expose memory hierarchy to probe sequence



#### Future Work

- RADIX tree
- Set
- Skiplist
- T-Tree (SPMC)



# Summary of Opinions

- First instinct is to generalize to MPMC while interesting design patterns and avenues can be found through SPMC
- Allow applications to define lifetime of referenced objects with minimal data duplication
- General purpose libraries should allow for workload specialization
- Reference implementations needed for the classics
- Expose rank in wait-free hierarchy



### The End

http://concurrencykit.org http://appnexus.com





Nothing in this presentation constitutes legal advice. Consult with a lawyer, accountant, and insurance professional before making your decisions. The views and opinions in this presentation represent my own and not those of people, institutions or organizations I am affiliated with unless stated explicitly. This presentation does not represent the views, position or attitudes of my employer, their clients, or any of their affiliated companies.

