LinuxCon Europe 2013 - About Data Races in the Linux Kernel

On October 22, 2013, Eugene Shatokhin, one of our developers, gave a talk on data races in Linux kernel modules at LinuxCon Europe (slides, notes for the slides).

According to one possible definition, a data race occurs when two or more threads access the same memory location simultaneously and at least one of them modifies it.
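
The definition above can be written down as a small predicate. This is only an illustrative sketch: the `Access` record, its field names and the use of timestamps to model "simultaneously" are assumptions made for the example, not something taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    thread: str     # which thread performed the access
    address: int    # memory location touched
    is_write: bool  # True for a store, False for a load
    start: int      # hypothetical timestamps delimiting the access
    end: int


def overlaps(a, b):
    """True if the two accesses are simultaneous (their intervals intersect)."""
    return a.start <= b.end and b.start <= a.end


def is_data_race(a, b):
    """Different threads, same location, simultaneous, at least one write."""
    return (a.thread != b.thread
            and a.address == b.address
            and overlaps(a, b)
            and (a.is_write or b.is_write))


reader = Access("T1", 0x1000, False, 0, 5)
writer = Access("T2", 0x1000, True, 3, 8)
other_reader = Access("T3", 0x1000, False, 2, 4)
late_writer = Access("T2", 0x1000, True, 10, 12)

print(is_data_race(reader, writer))        # True: concurrent read and write
print(is_data_race(reader, other_reader))  # False: nobody writes
print(is_data_race(reader, late_writer))   # False: the accesses do not overlap
```

Two concurrent reads are harmless under this definition. A pair of accesses becomes a race only when at least one of them is a write.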

Such errors can be very hard to find, and their consequences range from negligible to critical. Hunting races down is especially important in the Linux kernel, where, for example, driver code can be executed by many threads at the same time. Add interrupts and other asynchronous events, and remember that the synchronization rules for the data are not always fully documented (if they are documented at all)...

Most of the talk was about tools that can detect data races in the Linux kernel. KernelStrider and RaceHound, two tools of which Eugene is one of the main developers, were covered in more detail.

KernelStrider collects data at runtime about the operation of the kernel component under analysis (e.g. a driver). The information about memory accesses, allocations and deallocations, locks and unlocks, etc., is then analyzed in user space by ThreadSanitizer (Google). The algorithm used to search for races is briefly described here.
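
To give a feel for this kind of offline analysis, here is a minimal sketch that scans a recorded trace of lock, unlock and memory access events. It is not ThreadSanitizer's algorithm, which is more elaborate. A simplified lockset-style check is swapped in for illustration, and the trace format, the event names and the `find_suspects` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical trace format: (thread, event kind, argument).
trace = [
    ("T1", "lock", "dev_lock"),
    ("T1", "write", 0x1000),
    ("T1", "unlock", "dev_lock"),
    ("T2", "lock", "dev_lock"),
    ("T2", "read", 0x1000),
    ("T2", "unlock", "dev_lock"),
    ("T1", "write", 0x2000),  # no lock held
    ("T2", "read", 0x2000),   # no lock held
]


def find_suspects(trace):
    """Return addresses accessed by two threads with no common lock held."""
    held = defaultdict(set)   # thread -> locks currently held
    seen = defaultdict(list)  # address -> [(thread, kind, locks held)]
    suspects = set()
    for thread, kind, arg in trace:
        if kind == "lock":
            held[thread].add(arg)
        elif kind == "unlock":
            held[thread].discard(arg)
        elif kind in ("read", "write"):
            locks = frozenset(held[thread])
            for other_thread, other_kind, other_locks in seen[arg]:
                if (other_thread != thread
                        and "write" in (kind, other_kind)
                        and not (locks & other_locks)):
                    suspects.add(arg)
            seen[arg].append((thread, kind, locks))
    return sorted(suspects)


print([hex(a) for a in find_suspects(trace)])  # ['0x2000']
```

Address 0x1000 is always accessed under `dev_lock`, so it is not reported. Address 0x2000 is written and read by different threads with no lock in common, so it ends up on the list of suspects.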

KernelStrider may issue false alarms in some cases. For example, a false alarm occurs when a network driver disables interrupts in hardware and then accesses shared data: at that point there is no risk of a conflict with the interrupt handlers, yet a warning may still be issued.
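
The false alarm described above can be modeled in a few lines. This is a toy model with made-up field names, not KernelStrider's actual data: it assumes a checker that reasons about locks only, while the fact that the device's interrupts were masked is known to the driver author but absent from the recorded events.

```python
# Hypothetical accesses to one shared field of a network driver.
# "irq_masked" is ground truth about the hardware state at the time of the
# access. The lock-based check below never looks at it.
thread_access = {"context": "driver thread", "kind": "write",
                 "locks": set(), "irq_masked": True}
handler_access = {"context": "interrupt handler", "kind": "read",
                  "locks": set(), "irq_masked": False}


def reported_by_lock_based_check(a, b):
    """What a checker concludes if it only knows about locks."""
    return "write" in (a["kind"], b["kind"]) and not (a["locks"] & b["locks"])


def can_really_conflict(a, b):
    """Ground truth: the handler cannot run while the interrupts are masked,
    so an access made with interrupts masked cannot overlap the handler's."""
    return (reported_by_lock_based_check(a, b)
            and not (a["irq_masked"] or b["irq_masked"]))


print(reported_by_lock_based_check(thread_access, handler_access))  # True
print(can_really_conflict(thread_access, handler_access))           # False
```

The first result is the warning and the second is the reality, which is why such warnings need to be confirmed by other means.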

RaceHound makes it possible to check the data race warnings issued by KernelStrider and to find the real races among them. It works as follows.

  • A software breakpoint is placed on an instruction in the binary code of the driver that may be involved in a data race.
  • When the breakpoint triggers, RaceHound determines the address of the memory area that the instruction is about to access. It then places a hardware breakpoint to track accesses of the relevant kinds (writes only, or both reads and writes) to that memory area.
  • RaceHound introduces a small delay before the instruction is executed.
  • If some other thread accesses that memory area during the delay, the hardware breakpoint will trigger and RaceHound will report a race.

That is, KernelStrider plays the part of a "detective" or an "analyst" here and narrows the range of "suspects" - the fragments of the code possibly involved in data races. RaceHound is then a covert monitoring system tracking these suspects. If it catches a suspect "red-handed", everything is clear: the "crime" (data race) is confirmed.
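
The four RaceHound steps listed above can be mimicked in user space with ordinary threads. The sketch below is only a simulation under stated assumptions: `WatchedMemory`, `instrumented_write` and `other_thread` are hypothetical names, and the "hardware breakpoint" is a plain Python check rather than a debug register. The real tool works on the binary code of a kernel module.

```python
import threading


class WatchedMemory:
    """Toy memory with a single 'hardware breakpoint' (watchpoint) slot."""

    def __init__(self):
        self.cells = {}
        self.guard = threading.Lock()   # protects the watchpoint bookkeeping only
        self.watch = None               # (address, owner thread id, writes_only)
        self.hit = threading.Event()    # set when the watchpoint triggers
        self.armed = threading.Event()  # set once the watchpoint is in place

    def arm(self, address, writes_only):
        with self.guard:
            self.watch = (address, threading.get_ident(), writes_only)
        self.armed.set()

    def disarm(self):
        with self.guard:
            self.watch = None

    def _check(self, address, is_write):
        with self.guard:
            if self.watch is None:
                return
            watched, owner, writes_only = self.watch
        if (address == watched and threading.get_ident() != owner
                and (is_write or not writes_only)):
            self.hit.set()

    def read(self, address):
        self._check(address, is_write=False)
        return self.cells.get(address, 0)

    def write(self, address, value):
        self._check(address, is_write=True)
        self.cells[address] = value


def instrumented_write(mem, address, value, delay=1.0):
    """Stands in for the instruction that carries the software breakpoint."""
    # Steps 1-2: the "breakpoint" fired and the address is known, so arm the
    # watchpoint. The instruction writes, so this sketch tracks reads as well.
    mem.arm(address, writes_only=False)
    # Step 3: delay before executing the instruction (ends early on a hit).
    raced = mem.hit.wait(timeout=delay)
    mem.disarm()
    mem.write(address, value)  # now execute the instruction itself
    return raced               # Step 4: report whether a race was caught


def other_thread(mem, address):
    mem.armed.wait()  # keeps the demo deterministic; real races are less polite
    mem.read(address)


mem = WatchedMemory()
t = threading.Thread(target=other_thread, args=(mem, 0x1000))
t.start()
race_found = instrumented_write(mem, 0x1000, 42)
t.join()
print("race detected:", race_found)
```

In this demo the second thread deliberately waits until the watchpoint is armed, so the conflicting read always lands inside the delay window. In a real system the delay only raises the chance of catching a conflicting access. A silent run therefore does not prove that the code is free of races.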

Many questions were asked both during the talk and after it. Among other things, the audience was interested in the following.

  • Plans to support ARM (the mentioned tools currently work on x86 only) - maybe later, but not in the near future.
  • Situations when KernelStrider misses data races - yes, this is possible in some cases, mostly because of how ThreadSanitizer works and because the event ordering rules used are sometimes inaccurate.
  • Support for suspend/resume in KernelStrider - yes, KernelStrider operates during suspend and resume too.
  • Support for analysis of the kernel proper rather than the modules in KernelStrider and RaceHound - not implemented at the moment.
  • Instrumentation of the code to be analyzed during compilation rather than during loading, as KernelStrider does now - this may be beneficial and is in fact one of the future directions of development.
  • and so on.

Developers from Intel took an active part in discussing the races found by the tools mentioned above. This is no surprise, because these races were found in e1000, a network driver created by Intel. A strange thing became obvious during that discussion: it is common practice in network drivers to omit synchronization in some cases even if a race may happen as a result (and races were actually found there). This is the case, for example, for NAPI and some of the functions involved in data transmission. The likely motive is to avoid the performance cost of locking, but estimates of that cost, as well as guidelines on how to avoid problems there, are yet to be found.

It seems that many kernel developers share the following attitude toward data races:

- Have you observed any particular problems due to that race? Has anything crashed or otherwise worked wrong?

- Not yet.

- Oh, well.

And nothing happens then.

Reasonable? Perhaps, but if one recalls this article, for example, that reasoning becomes less convincing.
