Linux Kernel Debugging on Google-Sized ClustersMartin Bligh (email@example.com)
This paper will discuss the difficulties and methods involved in debugging the Linux kernel. Intermittent errors that occur once every few years are hard to debug ... but a problem when running across thousands of machines simultaneously. The more we scale to very large clusters, the more reliablilty becomes critical. In such environments, many of the normal debugging luxuries are gone (like a serial console, or any physical access), and we're forced to change to a different strategy to solve thorny intermittent race conditions.