LinuxCon Europe 2013 - About Data Races in the Linux Kernel

{{wl-publish: 2013-10-29 13:15:17 +0400 | Eugene.shatokhin }}

Latest revision as of 12:15, 29 October 2013


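On October 22, 2013, Eugene Shatokhin, one of our developers, gave a talk at LinuxCon Europe on data races in Linux kernel modules (slides: http://cdn.2safe.com/153759033759/LinuxCon_2013-Shatokhin-v03.pdf, notes for the slides: http://cdn.2safe.com/200700033759/speaker_notes.odt).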

According to one possible definition, a data race occurs when two or more threads access the same memory location at the same time and at least one of them modifies it.
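
To make the definition concrete, here is a small deterministic sketch of the classic lost update. It does not use real threads: the two "threads" are just labels, and the `schedule` list (a name made up for this illustration) fixes the order in which their load and store steps run, so the interleaving that a real race would produce only occasionally is reproduced every time.

```python
def run(schedule):
    """Replay the load/store steps of two increments of a shared counter."""
    shared = {"counter": 0}   # the memory location both threads access
    local = {}                # per-thread "register" holding the loaded value
    for thread, step in schedule:
        if step == "load":
            local[thread] = shared["counter"]
        else:  # "store": write back the loaded value plus one
            shared["counter"] = local[thread] + 1
    return shared["counter"]

# Each thread finishes its increment before the other one starts.
serial = [("A", "load"), ("A", "store"), ("B", "load"), ("B", "store")]

# Both threads load the old value before either stores: one update is lost.
racy = [("A", "load"), ("B", "load"), ("A", "store"), ("B", "store")]

print(run(serial))  # 2
print(run(racy))    # 1
```

With the serial schedule both increments survive; with the interleaved one the counter ends at 1 instead of 2, although each thread did exactly the same work.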

Such errors can be very hard to find, and their consequences range from negligible to critical. Hunting races down is especially important in the Linux kernel, where, for example, driver code can be executed by many threads at the same time. Now add interrupts and other asynchronous events, and remember that the synchronization rules for the data are not always fully documented (if documented at all)...

Most of the talk was about the tools that can detect data races in the Linux kernel. Two of them, KernelStrider (http://code.google.com/p/kernel-strider/) and RaceHound (https://github.com/winnukem/racehound), were covered in more detail; Eugene is one of their main developers.

KernelStrider collects data at run time about the operation of the kernel component under analysis (e.g. a driver). The information about memory accesses, allocations and deallocations, locks and unlocks, etc., is then analyzed in user space by ThreadSanitizer from Google (http://code.google.com/p/data-race-test/). The race detection algorithm is briefly described at http://code.google.com/p/data-race-test/wiki/ThreadSanitizerAlgorithm.

KernelStrider may issue false alarms in some cases. For example, they occur when a network driver turns off interrupts in the hardware and then accesses shared data with no risk of conflicting with the interrupt handlers.
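
The following sketch shows what offline analysis of a recorded event trace can look like, and why a false alarm of the kind just described can arise. It is a deliberately simplified lockset-style check, not the algorithm ThreadSanitizer actually uses, and the trace format, the function name `find_race_candidates` and the labels `stats` and `ring` are all invented for the illustration.

```python
from collections import defaultdict

def find_race_candidates(trace):
    """Return addresses accessed by different threads with no common lock.

    trace: iterable of (thread, op, target), op being one of
    "lock", "unlock", "read", "write" (a made-up format for this sketch).
    """
    held = defaultdict(set)    # thread -> locks it currently holds
    seen = defaultdict(list)   # address -> [(thread, is_write, locks held)]
    racy = set()
    for thread, op, target in trace:
        if op == "lock":
            held[thread].add(target)
        elif op == "unlock":
            held[thread].discard(target)
        else:
            is_write = op == "write"
            locks = frozenset(held[thread])
            for other, other_write, other_locks in seen[target]:
                # Conflict: different threads, at least one write,
                # and no lock that both accesses were made under.
                if (other != thread and (is_write or other_write)
                        and not (locks & other_locks)):
                    racy.add(target)
            seen[target].append((thread, is_write, locks))
    return sorted(racy)

trace = [
    ("T1", "lock", "L"), ("T1", "write", "stats"), ("T1", "unlock", "L"),
    ("T2", "lock", "L"), ("T2", "read", "stats"), ("T2", "unlock", "L"),
    ("T1", "write", "ring"), ("IRQ", "read", "ring"),
]
print(find_race_candidates(trace))  # ['ring']
```

The accesses to `stats` share lock `L` and are not reported. The accesses to `ring` are reported because the trace contains nothing that orders them. If the driver had in fact disabled the device's interrupts before the write, the report would be a false alarm: that kind of synchronization simply is not visible in a trace like this one.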

RaceHound makes it possible to check the data race warnings issued by KernelStrider and to find the real races among them. It works as follows.

  • A software breakpoint is placed on an instruction in the binary code of the driver that may be involved in a data race.
  • When the breakpoint triggers, RaceHound determines the address of the memory area that the instruction is about to access. Then it places a hardware breakpoint on that memory area to track the relevant kinds of access (writes only, or both reads and writes).
  • A small delay is made before the execution of the instruction.
  • If some other thread accesses that memory area during the delay, the hardware breakpoint will trigger and RaceHound will report a race.
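
The steps above can be modeled in user space with ordinary Python threads. This is only a toy sketch, not RaceHound's code: a dictionary entry stands in for the hardware breakpoint, a timed wait stands in for the delay, and every name (`ToyRaceHound`, `monitored_access`, `plain_access`, `demo`) is hypothetical. The choice of what to watch is also an assumption that follows from the definition of a data race: a monitored read conflicts only with writes, while a monitored write conflicts with both reads and writes.

```python
import threading

class ToyRaceHound:
    """Toy model of the breakpoint-and-delay scheme; not the real tool."""

    def __init__(self, delay=1.0):
        self.delay = delay               # upper bound on the pause, seconds
        self.lock = threading.Lock()     # protects the detector's own state
        self.watch = {}                  # addr -> (owner, writes_only, hit)
        self.reports = []                # (addr, owner, other) per race
        self.armed = threading.Event()   # demo hook: a watch is in place

    def monitored_access(self, thread, addr, is_write):
        """Access made by the instruction carrying the 'software breakpoint'."""
        hit = threading.Event()
        with self.lock:
            # A read is watched for writes only, a write for any access.
            self.watch[addr] = (thread, not is_write, hit)
        self.armed.set()
        hit.wait(timeout=self.delay)     # the small delay before the access
        with self.lock:
            del self.watch[addr]

    def plain_access(self, thread, addr, is_write):
        """Access by any other code; plays the part of the hardware check."""
        with self.lock:
            entry = self.watch.get(addr)
            if entry is None:
                return
            owner, writes_only, hit = entry
            if owner != thread and (is_write or not writes_only):
                self.reports.append((addr, owner, thread))
                hit.set()                # the "hardware breakpoint" triggers

def demo(first_is_write, second_is_write, delay=1.0):
    rh = ToyRaceHound(delay)

    def other():
        rh.armed.wait()                  # access only once the watch is set
        rh.plain_access("B", 0x1000, second_is_write)

    a = threading.Thread(target=rh.monitored_access,
                         args=("A", 0x1000, first_is_write))
    b = threading.Thread(target=other)
    a.start(); b.start()
    a.join(); b.join()
    return rh.reports
```

Calling `demo(False, True)` models a monitored read that is hit by a write from another thread during the delay, so one race is reported. `demo(False, False, delay=0.05)` models two reads and reports nothing. The real tool has no helper like `armed`; it depends on the delay being long enough for a conflicting access to actually happen.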

That is, KernelStrider plays the part of a "detective" or an "analyst" here and narrows the range of "suspects" - the fragments of the code possibly involved in data races. RaceHound is then a covert monitoring system tracking these suspects. If it catches a suspect "red-handed", everything is clear, the "crime" (data race) is confirmed.

Many questions were asked both during and after the talk. Among other things, the audience was interested in the following.

  • Plans to support ARM (the tools currently work on x86 only) - maybe later, but not in the near future.
  • Situations when KernelStrider misses data races - yes, this is possible in some cases, mostly because of how ThreadSanitizer works and because the event ordering rules used are sometimes inaccurate.
  • Support for suspend/resume in KernelStrider - yes, KernelStrider operates during suspend and resume too.
  • Support in KernelStrider and RaceHound for analyzing the kernel proper rather than modules - not implemented at the moment.
  • Instrumenting the code to be analyzed at compile time rather than at load time, as KernelStrider does now - this may be beneficial and is in fact one of the future directions of development.
  • and so on.

Developers from Intel took an active part in discussing the races found by the tools mentioned above. This is no surprise, because the races were found in the e1000 network driver, which Intel created (http://sourceforge.net/mailarchive/message.php?msg_id=31245543). One strange thing became apparent during that discussion: it is common practice in network drivers to omit synchronization in some cases, even if a race may result (and races were actually found there). This is the case, for example, for NAPI and some of the functions involved in data transmission. The likely motive is to avoid the performance cost of locking, but estimates of that cost, as well as guidelines on how to avoid problems there, are yet to be found.

It seems that many kernel developers share the following attitude toward data races:

- Have you observed any particular problems due to that race? Has anything crashed or otherwise worked wrong?

- Not yet.

- Oh, well.

And then nothing is done about it.

Reasonable? Perhaps, but in light of this article, for example (https://www.usenix.org/legacy/event/hotpar11/tech/final_files/Boehm.pdf), that reasoning looks less certain.
