Linux Device Driver 3rd edition - CH5 recap

Concurrency and Race Conditions

what is concurrency?
- what happens when the system tries to do more than one thing a once.
concurrency is one of the core problems in operating system.
concurrency-related bugs are some of the easiest to create and some of hardest to find.

Pitfalls in scull

// write logic
if (!dptr->data[s_pos]) {
    dptr->data[s_pos] = kmalloc(quantum, GFP_KERNEL);
    if (!dptr->data[s_pos])
        goto out;
}

when there are two processes independently do the write logic at the same time, they will allocate memory at the same time, as a result, only the second assignment will “win”, scull will forget the first memory allocated (memory leak).
also called race condition.
- result of uncontrolled access to shared data.
- diff access patterns cause diff unexpected results.

Concurrency and Its Management

sources of concurrency and race condition.
- multiple user-space processes are running.
- SMP system can be executing your code simultaneously on different processors.
- kernel code is preemptible, your driver’s code can lose the processor at any time, the process that replace it could also be running in your driver.
- device interrupts are asynchronous events that can cause concurrent execution of your code.
- kernel mechanisms for delayed code execution.
  - workqueues
  - tasklets
  - timers
- device could simply disappear while working on it (hot-plug).
root cause of race condition
- shared access to resources.
first rule of thumb
- avoid shared resources whenever possible.
- avoid using global variables.
but sharing is a fact of life.
hard rule of resource sharing
- any time that a hardware or software resource is shared beyond a single thread of execution.
- possibility exists that one thread could encounter an inconsistent view of that resource.
  - explicitly manage access to that resource.
access management : locking or mutual exclusion.
one other important rule
- shared objects should exist until there’s no outside reference to it.

Semaphores and Mutexes

goal
- make our operations on the scull data structure atomic.
  - set up critical sections.
    - code that can be executed by only one thread at any given time.
when a linux process reaches a point where it cannot make any further process, it goes to sleep (or “blocks”).
- often sleep when waiting for I/O to complete.
semaphores is a single integer value combined with a pair of functions that are typically called P and V.
- a process wishing to enter a critical seciton will call P on the relevant semaphore.
  - if the semaphore’s value is greater than zero
    - the value is decremented by one and the process continues.
  - if the semaphore’s value is zero or less.
    - the process must wait until somebody else releases the semaphore.
- call V to unlock a semaphore.
  - increments the value of the semaphore.
  - wakes up processes that are waiting if necessary.
semaphore as a mutex (mutual exclusion)
- initially set semaphore value to 1

The Linux Semaphore Implementation

<asm/semaphore.h>

void sema_init(struct semaphore *sem, int val);

val
- semaphore initial value.

// use semaphore as mutex
DECLARE_MUTEX(name);        // init to 1
DECLARE_MUTEX_LOCKED(name);   // init to 0

name
- semaphore variable.

// initialized at runtime (allocated dynamically)
void init_MUTEX(struct semaphore *sem);
void init_MUTEX_LOCKED(struct semaphore *sem);

```
// P functions
void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_trylock(struct semaphore *sem);
```
- down
  - decrements the value of the semaphore and waits as long as need be.
- down_interruptible
  - same as above, but the operation is interruptible.
  - almost always the one you will want.
  - allows a user-space process that is waiting on a semaphore to be interrupted by the user.
  - noninterruptible operations are a good way to create unkillable processes (the dreaded “D state” seen in ps)
  - proper use of down_interruptible requires always checking the return value and responding accordingly.
- down_trylock
  - never sleeps.
  - if the semaphore is not available, it returns immediately with a nonzero value.
```
// V functions
void up(struct semaphore *sem);
```
- up
  - once called, the process no longer holds the semaphore.
any thread that takes out a semaphore is required to release it with one (and only one) call to up.
- if an error is encountered while a semaphore is held, that semaphore must be released before returning the error status to the caller.
- failure to free a semaphore is an easy error to make.
  - the result (processes hanging in seemingly unrelated places) can be hard to reproduce and track down.

Using Semaphores in scull

struct scull_dev {
    ...
    struct semaphore sem; // mutual exclusion semaphore
    ...
}

for (i = 0; i < scull_nr_devs; i++) {
    ...
    init_MUTEX(&scull_devices[i].sem);
    ...
}

if (down_interruptible(&dev->sem))
    return -ERESTARTSYS;

out:
    up(&dev->sem);
    return retval;

Reader / Writer Semaphores

Many tasks break down into two distinct types of work
- tasks that only need to read the protected data.
- tasks that must make changes to the protected data.
often possible to allow multiple concurrent readers.
- optimize performance significantly.
- read-only tasks can get their work done in parallel without having to wait for other readers to exit the critical section.

// header file
#include <linux/rwsem.h>
    
// data structure
struct rw_semaphore sem;

// init function
void init_rwsem(struct rw_semaphore *sem);

// read-only access
void down_read(struct rw_semaphore *sem);
int down_read_trylock(struct rw_semaphore *sem);
void up_read(struct rw_semaphore *sem);

// write access
void down_write(struct rw_semaphore *sem);
int down_write_trylock(struct rw_semaphore *sem);
void up_write(struct rw_semaphore *sem);
void downgrade_write(struct rw_semaphore *sem);

down_read
- may put the calling process into an uninterruptible sleep.
down_read_trylock
- will not wait if read access is unavailable.
  - returns nonzero value.
downgrade_write
- If you have a situation where a writer lock is needed for a quick change, followed by a longer period of read-only access, you can use downgrade_write to allow other readers in once you have finished making changes.

an rwsem allows either one writer or an unlimited number of readers to hold the semaphore.
- writers get priority
  - as soon as writer tries to enter the critical section, no readers will be allowed in untill all writers have completed their work.
  - can lead to reader starvation.
    - readers are denied access for a long time if you have a large number of writers contending for the semaphore.
  - rwsems are best used when write access is required only rarely, and writer access is held for short periods of time.

Completions

A common pattern in kernel programming.
- initiating some activity outside of the current thread, then waiting for that activity to complete.
  - creation of a new kernel thread or user-space process.
  - a request to an existing process.
  - some sort of hardware-based action.
- it can be tempting to use a semaphore for synchronization of the two tasks.
  - ```
  struct semaphore sem;
  init_MUTEX_LOCKED(&sem);
  start_external_task(&sem);
  down(&sem);
  /* the external task can then call up(&sem) when its work is done. */
```
semaphores are not the best tool to use in this situation.
- semaphores have been heavily optimized for the “available” case.
  - performance will suffer when it needs to wait.
- semaphores can also be subject to (difficult) race condition when used in this way if they are declared as automatic variable.
- in some cases, the semaphore could vanish before the process calling up is finished with it.

Completions are a lightweight mechanism with one task

allowing one thread to tell another that the job is done.

// header file
#include <linux/completion.h>
        
// create completion
DECLARE_COMPLETION(my_completion);

// create and initialize dynamically
struct completion my_completion;
init_completion(&my_completion);

// wait for the completion
void wait_for_completion(struct completion *c);

// signal completion
void complete(struct completion *c);
void complete_all(struct completion *c);

wait_for_completion
- cause uninterruptible wait.
- if nobody ever completes the task, the result will be an unkillable process.
complete
- wakes up only one of the waiting threads.
complete_all
- wakes up all of the waiting threads to proceed.

A completion is normally a one-shot device.

it is used once then discarded.

// reinitialize the completion structure before reusing it.
INIT_COMPLETION(struct completion c);

example code

any process trying to read from the device will wait until some other process writes to the device.

DECLARE_COMPLETION(comp);

ssize_t complete_read(struct file *filp, char __user *buf, size_t count, loff_t *pos) {
    printk(KERN_DEBUG "process %i (%s) going to sleep\n",
            current->pid, current->comm);
    wait_for_completion(&comp);
    printk(KERN_DEBUG "awoken %i (%s)\n", current->pid, current->comm);
    return 0; /* EOF */
}

ssize_t complete_write(struct file *filp, const char __user *buf, size_t count, loff_t *pos) {
    printk(KERN_DEBUG "process %i (%s) awakening the readers...\n",
            current->pid, current->comm);
    complete(&comp);
    return count; /* succeed, to avoid retrial */
}

a typical use of the completion mechanism is with kernel thread termination at module exit time.
- ```
void complete_and_exit(struct completion *c, long retval);
```

Spinlocks

most locking is implemented with a mechanism called spinlock.
- may be used in code that cannot sleep. (e.g., interrupt handlers)
- offers higher performance than semaphores in general when properly used.
a spinlock
- is a mutual exclusion device that can have only two values
  - locked
  - unlocked
- usually implemented as a single bit in an integer value.
- if the lock is available
  - the “locked” bit is set and the code continues into the critical section.
- if the lock has been taken by somebody else
  - the code goes into a tight loop where it repeatedly checks the lock until it becomes available.
  - this loop is the “spin” part of a spinlock.
the real implementation of a spinlock is a bit more complex.
- the “test and set” operation must be done in an atomic manner.
  - only one thread can obtain the lock, even if several are spinning at any given time.
- care must also be taken to avoid deadlocks on hyperthreaded processors
  - chips that implement multiple, virtual CPUs sharing a single processor core and cache.
- different for every architecture that Linux supports.
  - core concept is the same on all systems.

Introduction to the Spinlock API

// header file
#include <linux/spinlock.h>

// initialization at compile time
spinlock_t my_lock = SPIN_LOCK_UNLOCKED;

// or at runtime
void spin_lock_init(spinlock_t *lock);

// obtain lock before entering a critical section
void spin_lock(spinlock_t *lock);

// release a lock
void spin_unlock(spinlock_t *lock);

Spinlocks and Atomic Context

When your driver acuires a spinlock then loses the processor due to
- called a function (e.g., copy_from_user) that puts the process to sleep.
- kernel preemption kicks in.
- higher-priority process pushes your code aside.
your code is now holding a lock that it will not release any time in the foreseeable future.
- if some other thread tries to obtain the same lock
  - in the best case, wait (spinning in the processor) for a very long time.
  - in the worst case, the system could deadlock entirely.
The core rule that applies to spinlocks is that any code must, while holding a spinlock, be atomic.
- cannot sleep
- cannot relinquish the processor for any reason except to service interrupts (and sometimes not even then).
The kernel preemption case is handled by the spinlock code itself.
- any time kernel code holds a spinlock, preemption is disabled on the relevant processor.
Avoid sleep while holding a lock can be more difficult.
- many kernel functions can sleep
  - copy data to or from user space
    - the required user-space page may need to be swapped in from the disk before the copy can proceed
  - any operation that must allocate memory can sleep.
    - kmalloc can decide to give up the processor, and wait for more memory to become available.
  - surprising places.
- writing code that will execute under a spinlock requires paying attention to every function that you call.
another scenario
- your driver is executing and has just taken out a lock that controls access to its device.
- while the lock is held, the device issues an interrupt, which causes your interrupt handler to run.
- the interrupt handler, before accessing the device, must also obtain the lock.
  - taking out a spinlock in an interrupt handler is a legitimate thing to do.
    - one of the reasons tha spinlock operations do not sleep.
- what happens if the interrupt routine executes in the same processor as the code that took out the lock originally?
  - the interrupt handler is spinning
  - the noninterrupt code will not be able to run to release the lock.
  - that processor will spin forever.
Avoiding this trap requires disabling interrupts (on the local CPU only) while the spinlock is held.
The last important rule for spinlock usage is that spinlocks must always be held for the minimum time possible.
- the longer you hold a lock
  - the longer another processor may have to spin waiting for you to release it.
  - the chance of it having to spin at all is greater.
- long lock hold times keep the current processor from scheduling
  - higher priority process which really should be able to get the CPU may have to wait.
- a poorly written driver can wipe out all that progress just by holding a lock for too long.

The Spinlock Functions

```
void spin_lock(spinlock_t *lock);
void spin_lock_irqsave(spinlock_t *lock, unsigned long flags);
void spin_lock_irq(spinlock_t *lock);
void spin_lock_bh(spinlock_t *lock);
```
- spin_lock_irqsave
  - disables interrupts (on the local processor only) before taking the spinlock.
  - previous interrupt state is stored in flags.
- spin_lock_irq
  - used when you are absolutely sure nothing else might have already disabled interrupts on your processor (or, in other words, you are sure that you should enable interrupts when you release your spinlock)
  - not have to keep track of the flags.
- spin_lock_bh
  - disables software interrupts before taking the lock.
  - leaves hardware interrupts enabled.

// corresponding unlock functions.
void spin_unlock(spinlock_t *lock);
void spin_unlock_irqrestore(spinlock_t *lock, unsigned long flags);
void spin_unlock_irq(spinlock_t *lock);
void spin_unlock_bh(spinlock_t *lock);

// nonblocking spinlock operations
void spin_trylock(spinlock_t *lock);
void spin_trylock_bh(spinlock_t *lock);

returns nonzero on success (the lock was obtained), 0 otherwise.
no “try” version that disables interrupts.

Reader / Writer Spinlocks

// header file
#include <linux/spinlock.h>

rwlock_t my_rwlock = RW_LOCK_UNLOCKED; /* Static initialization */

rwlock_t my_rwlock;
rwlock_init(&my_rwlock); /* Dynamic initialization */

// reader lock/unlock functions
void read_lock(rwlock_t *lock);
void read_lock_irqsave(rwlock_t *lock, unsigned long flags);
void read_lock_irq(rwlock_t *lock);
void read_lock_bh(rwlock_t *lock);

void read_unlock(rwlock_t *lock);
void read_unlock_irqsave(rwlock_t *lock, unsigned long flags);
void read_unlock_irq(rwlock_t *lock);
void read_unlock_bh(rwlock_t *lock);
/* there's no read_trylock */

// writer lock/unlock functions
void write_lock(rwlock_t *lock);
void write_lock_saveirq(rwlock_t *lock, unsigned long flags);
void write_lock_irq(rwlock_t *lock);
void write_lock_bh(rwlock_t *lock);
int write_trylock(rwlock_t *lock);
    
void write_unlock(rwlock_t *lock);
void write_unlock_saveirq(rwlock_t *lock, unsigned long flags);
void write_unlock_irq(rwlock_t *lock);
void write_unlock_bh(rwlock_t *lock);

Locking Traps

Ambiguous Rules

A proper locking scheme requires clear and explicit rules
if one function has a lock and then calls another function that also attempts to acuire the lock, your code deadlocks.
write functions with the assumption that their caller has already aquired the relavent lock(s).
- functions called outside must handle locking explicitly.
- document those assumptions explicitly.

Lock Ordering Rules

the more lock you hold, the more possibility you deadlock.
when multiple locks must be acuired
- they should always be acuired in the same order.
- nearest to local first, nearest to kernel last.
- semaphores first, spinlocks last.
  - calling down (which can sleep) while holding a spinlock is a serious error.
try to avoid situations where you need more than one lock.

Fine- Versus Coarse-Grained Locking

2.0
- the big kernel lock turned the entire kernel into one large critial section.
- pros
  - solved the concurrency problem well enough.
- cons
  - not scale very well.
  - multi-processor could spend a significant amount of time simply waiting for the big kernel lock.
2.2
- A modern kernel can contain thouthands of locks, each protecting one small resource.
- pros
  - good for scalability.
  - allow each processor to work on its specific task without contending for locks used by other processors.
- cons
  - very hard to know which locks you need. (there are thouthands of locks)
  - in which order to acuire the locks.
  - more locks provide more opportunities for truly nasty locking bugs to creep into the kernel.
start with relatively coarse locking unless you have a real reason to believe that contention could be a problem.
- resist the urge to optimize prematurely.
- the real performance constraints often show up in unexpected places.
measure time spent waiting in locks.
- lockmeter tool
  - http://oss.sgi.com/projects/lockmeter/

Alternative to Locking

Lock-Free Algorithm

recast algorithms to avoid the need of locking altogether.
- only one writer in reader/writer situations can often work in this manner.
useful data structure for lockess producer/consumer tasks
- circular buffer
  - empty
    - read & write pointers are equal.
  - full
    - write pointer is right behind read pointer.
  - header file: <linux/kfifo.h>
    Atomic Variables
the operations are very fast
- they compile to a single machine instruction whenever possible.

// header file
#include <asm/atomic.h>

// set atomic variable v to value i.
void atomic_set(atomic_t *v, int i);
/* or use ATOMIC_INIT macro at compile time. */

// return the current value of v.
int atomic_read(atomic_t *v);

// add i to the atomic variable pointed to v.
void atomic_add(int i, atomic_t *v);

// subtract i from *v.
void atomic_sub(int i, atomic_t *v);

// increment or decrement and atomic variable.
void atomic_inc(atomic_t *v);
void atomic_dec(atomic_t *v);

/* perform the specified operation and test the result,
 * if the value is 0 after operation, return true.
 */
int atomic_inc_and_test(atomic_t *v);
int atomic_dec_and_test(atomic_t *v);
int atomic_sub_and_test(int i, atomic_t *v);

// add i to v, return true if the result is negative.
int atomic_add_negative(int i, atomic_t *v);

// return new atomic value after operation
int atomic_add_return(int i, atomic_t *v);
int atomic_sub_return(int i, atomic_t *v);
int atomic_inc_return(atomic_t *v);
int atomic_dec_return(atomic_t *v);

Bit Operations

architecture dependent.
nr describes which bit to manipulate.

// header file
#include <asm/bitops.h>

// sets bit number nr in the data item pointed to by addr.
void set_bit(nr, void *addr);

// clears bit number nr in the data item pointed to by addr.
void clear_bit(nr, void *addr);

// toggles the bit
void change_bit(void *addr);

// returns the current value of the bit.
test_bit(nr, void *addr);

// combo operations and return prev bit value.
int test_and_set_bit(nr, void *addr);
int test_and_clear_bit(nr, void *addr);
int test_and_change_bit(nr, void *addr);


// example
/* try to set lock */
while (test_and_set_bit(nr, addr) != 0)
    wait_for_a_while();

do_your_work();

/* release lock, and check... */
if (test_and_clear_bit(nr, addr) == 0))
    something_went_wrong(); /* already released: error */

seqlocks

work in situatios where the resource to be protected is
- small
- simple
- frequently accessed.
- write access is rare but must fast.

// header file
#include <linux/seqlock.h>

// two methods for init a seqlock.
seqlock_t lock1 = SEQLOCK_UNLOCKED;

seqlock_t lock2;
seqlock_init(&lock2);

// reader code format
unsigned int seq;
do {
    seq = read_seqbegin(&the_lock);
    do_your_work();
} while read_seqretry(&the_lock, seq);

/* if seqlock might be accessed from an interrupt handler,
 * use irq-safe version instead.
 */
unsigned int read_seqbegin_irqsave(seqlock_t *lock, unsigned long flags);
int read_seqretry_irqrestore(seqlock_t *lock, unsigned int seq, unsigned long flags);

// writer lock/unlock functions
void write_seqlock(seqlock_t *lock);
void write_seqlock_irqsave(seqlock_t *lock, unsigned long flags);
void write_seqlock_irq(seqlock_t *lock);
void write_seqlock_bh(seqlock_t *lock);
/* write_tryseqlock returns nonzero if it was able to obtain the lock. */

void write_sequnlock(seqlock_t *lock);
void write_sequnlock_irqrestore(seqlock_t *lock, unsigned long flags);
void write_sequnlock_irq(seqlock_t *lock);
void write_sequnlock_bh(seqlock_t *lock);

Read-Copy-Update

RCU
- advanced mutual exclusion scheme
- can yield high performance in the right conditions.
- RCU algo: http://www.rdrop.com/users/paulmck/rclock/intro/rclock_intro.html
optimized for situations where
- reads are common
- writes are rare
- the protected resources should be accessed via pointers.
- all references to those resources must be held only by atomic code.

// header file
#include <linux/rcupdate.h>

// format
struct my_stuff *stuff;

rcu_read_lock();
stuff = find_the_stuff(args...);
do_something_with(stuff);
rcu_read_unlock();

no reference to the protected resource may be used after the call to rcu_read_unlock.
Steps to change the protected structure
- allocates a new structure.
- copies data from the old one (if need be).
- replace the pointer that is seen by the read code.
the writer code must wait until it knows that no such reference can exist.
cleanup callback
- void call_rcu(struct rcu_head *head, void (*func)(void *atg), void *arg);
  - func
    - usually just need to call kfree to free resources.
  - arg
    - the arg in func is same as the one in call_rcu.
  - more details can be found in the header file.

Concurrency and Race Conditions

Pitfalls in scull

Concurrency and Its Management

Semaphores and Mutexes

The Linux Semaphore Implementation

Using Semaphores in scull

Reader / Writer Semaphores

Completions

Spinlocks

Introduction to the Spinlock API

Spinlocks and Atomic Context

The Spinlock Functions

Reader / Writer Spinlocks

Locking Traps

Ambiguous Rules

Lock Ordering Rules

Fine- Versus Coarse-Grained Locking

Alternative to Locking

Lock-Free Algorithm

Atomic Variables

Bit Operations

seqlocks

Read-Copy-Update