Speed up your Linux - the whole story
Quote from a magazine interview: "You wouldn't know it from seeing him online, but by day Con Kolivas works as an anaesthetist at a hospital in Melbourne. Linux kernel hacking is just one of his hobbies. Despite this, Con is one of the most well known names in Linux kernel development -- and with good reason. His focus on improving the kernel for desktop performance has won him a legion of fans, and his patchsets for the Linux kernel (marked as -ck) have had a significant impact. So much so that some of his changes have been directly incorporated into the kernel, and some of his ideas inspire changes still taking place.
Recently, however, Con announced he was leaving it all behind. Interested in hearing what prompted the move, I contacted Con to talk about the reasons for his leaving, what it takes to be a kernel developer, and the future as he sees it.
Computers of today may be 1,000 times faster than they were a decade ago, yet the things that matter are slower.
The standard argument people give me in response is 'but they do so much more these days it isn't a fair comparison'. Well, they're 10 times slower despite being 1000 times faster, so they must be doing 10,000 times as many things. Clearly the 10,000 times more things they're doing are all in the wrong place.
Eventually the only places I noticed any improvements in speed were kernel developments. They were never huge, but caused slightly noticeable changes in things like snappiness, behaviour under CPU load and so on. The first patchset I released to the public contained none of my own code and was for kernel 2.4.18 which was about February 2002. I didn't even know what C code looked like back then, having never actually been formally taught any computer science.
So I stuck with that for a while until the 2.6 development process was under way (we were still in a 2.5 kernel at the time). I watched the development and to be honest... I was horrified. The names of all the kernel hackers I had come to respect and observe were all frantically working away on this new and improved kernel and pretty much everyone was working on all this enterprise crap that a desktop cares not about.
Even worse than that, while I obviously like to see Linux run on 1024 CPUs and 1000 hard drives, I loathe the fact that to implement that we have to kill performance on the desktop. What's that? Kill performance? Yes, that's what I mean.
If we numerically quantify it with all the known measurable quantities, performance is better than ever. Yet all it took was to start up an audio application and wonder why on earth if you breathed on it the audio would skip. Skip! Jigabazillion bagigamaherz of CPU and we couldn't play audio?
Or click on a window and drag it across the screen and it would spit and stutter in starts and bursts. Or write one large file to disk and find that the mouse cursor would move and everything else on the desktop would be dead without refreshing for a minute.
I felt like crying.
I even recall one bug report we tried to submit about this and one developer said he couldn't reproduce the problem on his quad-CPU 4GB RAM machine with 4 striped RAID array disks... think about the sort of hardware the average user would have had four years ago. Is it any wonder the desktop sucked so much?
The developers were all developing for something that wasn't the desktop. They had all been employed by big name manufacturers who couldn't care less about the desktop (and still don't) but want their last 1% on their database benchmark or throughput benchmark or whatever.
The users had lost. The desktop PC, which Linux started out being developed for, had fallen by the wayside. Performance, as home desktop users understand performance, was gone. Worse yet, there was no way to quantify it, and the developers couldn't care less if we couldn't prove it. The one place where I had found some performance to be gained on the desktop (the kernel) was now doing the opposite.
I had some experience at merging patches from previous kernels and ironically most of them were code around the CPU scheduler. Although I'd never learnt how to program, looking at the code it eventually started making sense.
After a few failed experiments I started writing some code which helped... a lot. As it turns out people did pay attention and eventually my code got incorporated. I was never very happy with how the CPU scheduler tackled interactivity but at least it was now usable on the desktop.
Not being happy with how the actual underlying mechanism worked I set out to redesign that myself from scratch, and the second generation of the -ck patchset was born. This time it was mostly my own code. So this is the story of how I started writing my own code for the linux kernel.
However I have tried very hard to make myself relatively resistant to this placebo effect from years of testing, and I guess the fact that my website has close to 1 million hits suggests there are people who agree it makes a difference. This inability to prove quantitatively the advantage that -ck offered, though, was basically what would eventually spell the death of it.
It seems that the emerging challenges for the linux kernel on the desktop never seem to get whole-heartedly tackled by any full time developer, and only get a sideways glance when the problems are so obvious that even those on the linux kernel mailing list are willing to complain about them.
So I still haven't answered your question about what made me stop kernel development have I? I guess explaining my motivations helps me explain why I stopped.
The user response was still there. If anything, the users got more vocal than ever as I was announcing quitting kernel development.
The intellectual challenge? Well that still existed of course.
The fun? Yes that's what was killed. It stopped being fun. In my now quite public email announcing that I was quitting I explained it briefly. The -ck patchset was, for quite a while, a meaningless playground for my experiments, out of mainline's spotlight. As the scope of changes got larger, the improvements became more drastic and were more acutely noticeable. This led to more and more people asking me when the changes would be merged into mainline. As the number of requests grew, my resolve to not get mainline involved diminished. So I'd occasionally post patches as examples to the linux kernel mailing list. This generated more interest and requests to get mainline involved. So I tried.
You asked before what patches from -ck got into mainline and I listed a whole lot of random minor patches. The magnitude of the changes in the patches that did _not_ get in stands out far more than those that did. My first major rejection was the original staircase CPU scheduler. It stood out as being far better in interactivity than the mainline CPU scheduler, but ultimately, just like the mainline CPU scheduler, it had corner cases that meant it was not perfect. While I was still developing it, the attention moved away from the CPU scheduler at that time. The reason given by Andrew Morton (the maintainer and second-to-last gateway into the mainline kernel) at the time was that the kernel had more burning issues and bugs to address.
Of course it did. There were so many subsystems being repeatedly rewritten that there was never-ending breakage. And rewriting working subsystems and breaking them is far more important than something that might improve the desktop, right? Oops, some of my bitterness crept in there. I'll try and keep emotion out and just tell the rest of the story as objectively as I can. With a truckload of help from William Lee Irwin III (who wrote the main architecture) I posted a pluggable CPU scheduler framework that would allow you to build as many CPU schedulers as you like into the kernel and choose at boot time which one to run. I planned to extend that to runtime selection as well. This is much like the modular pluggable I/O scheduler framework that the Linux kernel currently has. It was flat out refused by both Linus and Ingo (who is the CPU scheduler maintainer) as leading to specialisation of CPU schedulers; they both preferred there to be one CPU scheduler that was good at everything. I guess you can say the CPU scheduler is a steamroller that we as desktop users use to crack nuts with, and they didn't want us to build a nutcracker into the kernel.
Then along came swap prefetch. I spent a long time maintaining and improving it. It was merged into the -mm kernel 18 months ago and I've been supporting it since. Andrew to this day remains unconvinced that it helps, and suspects it 'might' have negative consequences elsewhere. No bug report or performance complaint has been forthcoming in the last 9 months. I even wrote a benchmark that showed how it worked, which managed to quantify it! In a hilarious turnaround Linus asked me offlist 'yeah but does it really help'. Well, user reports and benchmarks weren't enough... It's still in limbo now, but with no one maintaining it they'll have no choice but to drop it.
A lot of users and even kernel developers found that many long lasting complaints with the mainline and other schedulers were fixed by this code and practically begged me to push it into mainline, and one user demanded Linus merge it as soon as possible as a bugfix. So I supported the code and fixed it as problems arose and did many bugfixes and improvements along the way.
Then I hit an impasse. One very vocal user found that the unfair behaviour in the mainline scheduler was something he had come to expect. A flamewar of sorts erupted at the time, because to fix 100% of the problems with the CPU scheduler we had to sacrifice interactivity on some workloads. It wasn't a dramatic loss of interactivity, but it was definitely there. Rather than use 'nice' to proportion CPU according to where the user told the operating system it should go, the user believed it was the kernel's responsibility to guess. As it turns out, guessing means that no matter how hard and how smart you make the CPU scheduler, it will get it wrong some of the time. The more it tries to guess, the worse the corner cases of misbehaving will be. Then one day, presumably, Ingo decided it was a good idea and the way forward and... wrote his own fair scheduling interactive design with a modular almost pluggable CPU scheduling framework... and had help with the code from the person who refused to accept fair behaviour in my flamewar.
So I had plenty of time lying on my back to reflect on what I was doing and why, and whether I was going to regret it from that point on. I decided to complete the work on Staircase Deadline to make sure it was the reference for comparison, instead of having the CPU scheduler maintainer's new code compared to the old clunky scheduler. Then I quit forever. If there is any one big problem with kernel development and Linux it is the complete disconnection of the development process from normal users. You know, the ones who constitute 99.9% of the Linux user base.
The Linux kernel mailing list is the way to communicate with the kernel developers. To put it mildly, the Linux kernel mailing list (lkml) is about as scary a communication forum as they come. Most people are absolutely terrified of mailing the list lest they get flamed for their inexperience, an inappropriate bug report, being stupid or whatever. And for the most part they're absolutely right. There is no friendly way to communicate normal users' issues that are kernel related. Yes of course the kernel developers are fun loving, happy-go-lucky friendly people. Just look at any interview with Linus and see how he views himself. I think the kernel developers at large haven't got the faintest idea just how big the problems in userspace are. It is a very small brave minority that are happy to post to lkml, and I keep getting users telling me on IRC, in person, and via my own mailing list, what their problems are. And they've even become fearful of me, even though I've never viewed myself as a real kernel developer.
And there are all the obvious bug reports. They're afraid to mention these. How scary do you think it is to say 'my Firefox tabs open slowly since the last CPU scheduler upgrade'? To top it all off, the enterprise users are the opposite. Just watch each kernel release and see how quickly some $bullshit_benchmark degraded by .1% with patch $Y gets reported. See also how quickly it gets attended to."
More information about Brain Fuck Scheduler
Why "Brain Fuck"?
Because it throws out everything about what we know is good about how to design a modern scheduler in scalability.
Because it's so ridiculously simple. Because it performs so ridiculously well on what it's good at despite being that simple.
Because it's designed in such a way that mainline would never be interested in adopting it, which is how I like it.
Because it will make people sit up and take notice of where the problems are in the current design.
Because it throws out the philosophy that one scheduler fits all and shows that you can do a -lot- better with a scheduler designed for a particular purpose.
I don't want to use a steamroller to crack nuts. Because it actually means that more CPUs means better latencies.
Because I must be fucked in the head to be working on this again. I'll think of some more becauses later.
How scalable is it?
I don't own the sort of hardware that is likely to suffer from using it, so I can't find the upper limit. Based on first principles about the overhead of locking, and the way lookups occur, I'd guess that a machine with 16 CPUs or more would start to have exponentially less performance (thanks Ingo for confirming this). Note that the number of logical CPUs is what affects BFS' scalability, not the physical ones. By that I mean that a hyperthreaded quad core is 8 EIGHT logical CPUs, so it is NOT the same as a quad core without hyperthreading.
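A quick way to check how many logical CPUs your machine exposes is to count the processor entries the kernel reports, for example:
# grep -c '^processor' /proc/cpuinfo    (a hyperthreaded quad core reports 8)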
Since version 0.300, scalability improvements have been added that should further improve performance, including NUMA support! No scalability benchmarks have been performed on very big machines to compare the performance of new versions.
The O(n) lookup of BFS will cause people some concern because of the notation. However, if the actual overhead is very small, then even with large numbers of n, it can be lower overhead than an O(1) design. Testing this scheduler against CFS with the test app "forks", which forks 1000 tasks that do simple work, shows no difference in time to completion. That's a load of 1000 on a quad core machine. But note that BFS gets much faster when the loads are lower and approximate the number of CPUs, which is much more what you would experience on a desktop.
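The "forks" test app itself isn't reproduced here, but a rough shell stand-in for that kind of fork-heavy load (just an approximation for illustration, not the actual test program) looks like this:
# time sh -c 'for i in $(seq 1 1000); do (n=0; while [ $n -lt 100000 ]; do n=$((n+1)); done) & done; wait'
Run the same command under both schedulers and compare the reported wall-clock times.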
Multicore processors?
This is where BFS shines.
Single processors?
Single processors benefit a lot from BFS too.
Is this stable?
It is now pretty stable. It has been a while since serious crashes have been
reported. It first booted on 25th August 2009 but the codebase has since
become a lot more robust. Of course the usual warnings apply that it might
eat up your children and spit out your data, or worse, eat up your data and
spit out children (but I doubt it).
Quick walkthrough on manually patching to -ck2 for beginners.
As of this writing the stable Linux kernel is 2.6.32 and Con Kolivas' matching patchset is 2.6.32-ck2, so first we need the development tools and headers/libraries required to build the new patched kernel. On Ubuntu/Debian, run apt-get install build-essential first. Then download the vanilla kernel sources to /usr/src:
# sudo wget http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.32.tar.bz2 -P /usr/src
# cd /usr/src
# sudo tar xjf linux-2.6.32.tar.bz2
# sudo ln -s linux-2.6.32 linux
# cd linux
Download the patchset and apply:
# sudo wget http://users.on.net/~ckolivas/kernel/patch-2.6.32-ck2.bz2 && bzcat patch-2.6.32-ck2.bz2 | sudo patch -p1
Now we need to configure and recompile our new patched kernel...
If you want, you can extract the old config from your currently running kernel (if it has support for storing it in procfs; Debian/Ubuntu kernels do).
# sudo sh -c 'gunzip -c /proc/config.gz > .config' (run through sh -c so the redirection into root-owned /usr/src works)
The next step will only prompt you for any new features in the kernel, which are usually easy enough to understand; generally, if you don't know, choosing the default recommended by the script by just pressing Enter will do.
# sudo make oldconfig
A lot of distributions support methods for installing custom kernels so if you desire you can use that method. Otherwise here is a manual summary:
# sudo make bzImage
# sudo cp arch/x86/boot/bzImage /boot/vmlinuz-2.6.32-ck2
If you have another architecture such as x86_64, the copy step is instead:
# sudo cp arch/x86_64/boot/bzImage /boot/vmlinuz-2.6.32-ck2
# sudo make modules && sudo make modules_install
# sudo cp .config /boot/config-2.6.32-ck2 (optional)
# sudo mkinitrd /boot/initrd-2.6.32-ck2.img 2.6.32-ck2 (only required if you use an initrd; on Debian/Ubuntu the equivalent is: sudo update-initramfs -c -k 2.6.32-ck2)
Now configure your GRUB/LILO boot loader and that's all.
Many distributions have automated tools for installing new custom kernels in their configuration tools and you can use those. Be careful never to make the new kernel the only boot option; if it doesn't boot you may be in serious trouble. It is also not a good idea to make it the default kernel until you have booted it successfully at least once.
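As an illustration only (the device names below are placeholders, adjust them to your own disk layout): on distributions using GRUB 2, regenerating the menu is usually enough, while on GRUB legacy you add a stanza to /boot/grub/menu.lst by hand alongside the existing entries.
# sudo update-grub (GRUB 2 on Debian/Ubuntu: picks up the new kernel and initrd from /boot automatically)
An example GRUB legacy stanza might look like:
title Linux 2.6.32-ck2
root (hd0,0)
kernel /boot/vmlinuz-2.6.32-ck2 root=/dev/sda1 ro
initrd /boot/initrd-2.6.32-ck2.img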
Prelink Guide
Most common applications make use of shared libraries. These shared libraries need to be loaded into memory at runtime and the various symbol references need to be resolved. For most small programs this dynamic linking is very quick. But for programs written in C++ and that have many library dependencies, the dynamic linking can take a fair amount of time.
On most systems, libraries are not changed very often and when a program is run, the operations taken to link the program are the same every time. Prelink takes advantage of this by carrying out the linking and storing it in the executable, in effect prelinking it.
Prelinking can cut the startup times of applications. For example, a typical KDE program's loading time can be cut by as much as 50%. The only maintenance required is re-running prelink whenever a library used by a prelinked executable is upgraded.
- Prelinking is done via a program called, surprisingly, prelink. It changes the binary to make it start faster.
- If an application's dependent libraries change after you have prelinked it, you need to re-prelink the application, otherwise you lose the speed advantage. That is to say, every time you update a package via your distribution's package management tool and it updates libraries, they need to be re-prelinked.
- The change to the binary is fully reversible. prelink has an undo function.
You can install this tool with your distribution's package management tool (the exact command depends on the Linux distribution).
On Debian/Ubuntu:
# sudo apt-get install prelink
On Gentoo:
# emerge prelink
On Fedora/Red Hat/CentOS:
# yum install prelink
After a successful install you can run:
# sudo prelink -avmR
These prelink options mean:
a - Prelink all binaries and dependent libraries found in directory hierarchies specified in /etc/prelink.conf. Normally, only binaries specified on the command line and their dependent libraries are prelinked.
v - verbose output
m - Conserve memory by allowing two libraries that are never present together in any binary to share the same virtual address space slot. This results in a smaller virtual address space range used for libraries. On the other hand, if prelink sees a binary during incremental prelinking which puts together two libraries that were not present together in any other binary and were given the same virtual address space slots, then that binary cannot be prelinked. Without this option, each library is assigned a unique virtual address space slot.
R - When assigning addresses to libraries, start with a random address within the architecture-dependent virtual address space range. This can make some buffer overflow attacks slightly harder to exploit, because libraries are not present at the same addresses across different machines. Normally, assigning virtual addresses starts at the bottom of the architecture-dependent range.
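Should you want to reverse prelinking later (the change to binaries is fully reversible, as noted above), the same tool undoes it; combining -u (undo) with -a applies the undo to everything listed in /etc/prelink.conf:
# sudo prelink -ua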
------------------------------------------------------------------------------------------------------------
That's all! If you managed to get it working then you will have a powerful Linux machine! I personally know Con Kolivas, he is smart and great! Greetz flying out to the CK community!