Greg's The year of BSD

This is a historical document from Daemon News, now defunct. It was originally published in August 2000. It's based on the copy that I submitted, and may be marginally different. I have reformatted it and removed the multiple markup errors and updated some links.

My last article on the Microsoft mail worm brought quite a bit of feedback. A surprising number of people disagreed with my viewpoint (well, I was surprised, anyway). The reasons were interesting. Claus Andersen thought that I was being offensive to Bill Gates personally, though he wasn't able to specify how. In case others got this impression, my apologies: my only reference to Bill Gates was because he's the richest man in the world, not because he's the founder of Microsoft. Claus goes on to say

My point is that ILOVEYOU cannot be attributed to either M$ is bad or BSD is TheRightThing™. ILOVEYOU was caused by what most BOFH's would call lusers! The rest of the article dealing with BSD and Linux is nice - but the lead in is an off-shot by far!

I have to disagree here: this is like saying “yes, I know you lost your right hand when the guard fell off the chainsaw you were using, but you can't blame the manufacturer, because you were using it at the time”.

In case you're wondering about the term BOFH above, it's short for Bastard Operator From Hell.

Also on this topic, Alexander Langer pointed out that in Germany the press did indeed point out Microsoft's missed responsibility to supply products with adequate security. Others didn't seem to agree: a number of messages stated that Microsoft was not responsible for users executing scripts they received via mail, since the user himself has to “open” them. I disagree: Microsoft requires you to “open” a number of kinds of attachment, and the displayed names usually drop the suffix, which is what Microsoft uses to identify the type of a file. Nobody addressed the fact that the complete system environment is available to a script which is being executed in this context. In UNIX, we would put it in a sandbox, so that even if it were malicious or buggy, it couldn't harm the rest of the system.

BSD at USENIX

June is traditionally the month of the USENIX technical conference, which was in San Diego this year. This year was also the 25th anniversary celebration, and BSD was very much in evidence. The carrier bags issued to everybody on registration carried a prominent BSD and a daemon image. I saw Linus Torvalds in person for the first time at the Linux BOF (Birds of a Feather) session. He was wearing BSD daemon horns.

All this is just superficial, of course, but it's indicative of the way things have been going lately. Berkeley Software Design, Inc was issuing clip-on daemon horns, and everybody was wearing them. It might sound like just a gimmick, and of course it was, but don't forget that one of the many reasons people have stated for the relative success of Linux against BSD was that BSD wasn't as well known as Linux. BSDi is helping there.

Revamping the BSD multiprocessor code

Remember this time last year when Mindcraft published benchmarks showing that Microsoft NT could outperform Linux in some very specific areas? As I commented at the time, nobody in the BSD camp got up and said “we can do better”. We were pretty sure that our results would still not have been as good as Microsoft's.

The SMP problem

To explain all this, and what we're doing about it, I'll have to be a little more technical than I usually am in these columns. Bear with me.

UNIX was written for single processor machines, and many of the design choices are not only suboptimal for SMP, they're just plain ugly. In particular the synchronization mechanisms don't work well with more than one processor. Briefly:

The process context, including the upper half of device drivers, doesn't need to protect itself. The kernel is non-preemptive: as long as a process is executing in the kernel, no other process can execute in the kernel. If another process, even with higher priority, becomes runnable while a process is executing kernel code, it will have to wait until the active process leaves the kernel or sleeps.
Processes protect themselves against the interrupt context, primarily the bottom half of device drivers, by masking interrupts. The original PDP-11 UNIX used the hardware priority levels (numbered 4 to 7), and even today you'll find function calls like spl4() and spl7() in System V code. BSD changed the names to more descriptive terms like splbio(), splnet() and splhigh(), and also replaced the fixed priorities by interrupt masks, but the principle remains the same. It's not always easy to solve the question of which interrupts need to be masked in which context, and one of the interesting observations at this meeting was that as time goes on, the interrupt masks are getting “blacker”: each spl() is masking off more and more bits in the interrupt mask register. This is not good for performance.
Processes synchronize with each other using the sleep() or tsleep() calls. Traditional UNIX, including System V, uses sleep(), but BSD prefers tsleep(), which provides nice strings which ps(1) displays to show what the process is waiting for. FreeBSD no longer has a sleep() call, while BSD/OS has both, but sleep() is deprecated. tsleep() is used both for voluntary process synchronization (e.g. send a request to another process and wait until it is finished), and for involuntary synchronization (e.g. wait for a shared resource to become available).
Processes sleep on a specific address. In many cases, the address in itself has no meaning, and it's probably easier to think of it as a number. When a process sleeps, it is put on a sleep queue. The wakeup() function takes the sleep address, walks through the sleep queue, and wakes every process which is sleeping on this address. This can cause massive problems even on single processor machines; UNIX was never really intended to have hundreds of processes waiting on the same resource, and a number of Apache performance problems center around this behaviour. As a partial solution, FreeBSD also has an additional function, wakeup_one(), which only wakes one process.
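The difference between wakeup() and wakeup_one() can be illustrated with a toy model. This is only a sketch in Python, not kernel code (the real functions are written in C and also involve priorities and timeouts); the class `SleepQueue`, the helper `herd_size()`, the addresses and the wait messages are all made up for the illustration.

```python
from collections import deque


class SleepQueue:
    """Toy model of a single global sleep queue (not real kernel code)."""

    def __init__(self):
        # Each entry is (pid, address, wmesg), kept in arrival order.
        self.sleepers = deque()

    def tsleep(self, pid, address, wmesg):
        # The process sleeps on an address; wmesg is the kind of string
        # that ps(1) would display for it.
        self.sleepers.append((pid, address, wmesg))

    def wakeup(self, address):
        # Walk the whole queue and wake every process sleeping on address.
        woken = [pid for pid, addr, _ in self.sleepers if addr == address]
        self.sleepers = deque(s for s in self.sleepers if s[1] != address)
        return woken

    def wakeup_one(self, address):
        # Wake only the first process found sleeping on address.
        for entry in self.sleepers:
            if entry[1] == address:
                self.sleepers.remove(entry)
                return [entry[0]]
        return []


def herd_size(use_wakeup_one, nprocs=100, address=0xC0DE):
    """Return how many processes become runnable when one resource is freed."""
    q = SleepQueue()
    for pid in range(nprocs):
        q.tsleep(pid, address, "biowait")    # hypothetical wait message
    q.tsleep(nprocs, 0xBEEF, "select")       # unrelated sleeper stays asleep
    woken = q.wakeup_one(address) if use_wakeup_one else q.wakeup(address)
    return len(woken)
```

With 100 processes sleeping on the same address, `herd_size(False)` reports 100 processes made runnable, of which only one can actually get the resource, while `herd_size(True)` reports 1.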

There are a number of reasons why this concept is not a good solution for SMP. Firstly, the simplistic assumption “nothing else can be executing in the kernel while I am” falls flat. FreeBSD currently hasn't implemented a solution for this. Instead, we found a way of enforcing this illogical state, the Big Giant Lock (BGL). Any process entering the kernel must first obtain the BGL; if a process executing on another processor has the lock, then the current processor spins (it sits in a tight loop waiting for the lock to become available); it can't even schedule another process to run, because that requires entering the kernel. This method works surprisingly well for compute-bound processes, but for a large number of applications, including database and networking, it can give performance that is only a fraction of what the hardware is capable of. This is the background to the success of the Mindcraft benchmark: at the time, Linux was also using this kind of synchronization.
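A back-of-the-envelope model shows why a single kernel lock hurts some workloads and not others. The model is an assumption of this sketch, not something measured: every process spends a fixed fraction of its time in the kernel, all kernel time is serialized by the one lock, and user time runs fully in parallel. Under those assumptions the speedup over a single processor can be no better than the smaller of the number of processors and the reciprocal of the kernel fraction. The function name `giant_lock_speedup()` is hypothetical.

```python
def giant_lock_speedup(ncpus, kernel_fraction):
    """Upper bound on speedup when one lock serializes all kernel time.

    Simplified model (an assumption, not a measurement): user time scales
    across CPUs, kernel time does not, and time wasted spinning is ignored.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    if kernel_fraction == 0.0:
        return float(ncpus)              # nothing is serialized
    return min(float(ncpus), 1.0 / kernel_fraction)
```

For a compute-bound job that spends 5% of its time in the kernel, `giant_lock_speedup(4, 0.05)` is still 4.0. For a network server that spends half its time in the kernel, `giant_lock_speedup(4, 0.5)` is 2.0, no matter how many more processors are added.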

The other issue is with masking interrupts. This is also quite a problem for SMP machines, since it requires masking the interrupts on all processors, which requires an expensive synchronization.

Solving the problem

There's no quick and easy solution to this synchronization problem. Sun Microsystems has probably spent more effort on a good SMP implementation than anybody else, but it has taken them the best part of 10 years to do so, and only now is their Solaris 2 operating system showing the benefits.

The Linux people started working on improving their SMP support shortly after the Mindcraft results became known, and they have made significant progress. By comparison, in the FreeBSD camp, we have done almost nothing. NetBSD and OpenBSD haven't even released any SMP support at all. Why?

For some time, I have had a theory that the open source model works well for small projects, but it is not optimal for really big undertakings. Even before the Mindcraft incident I had decided that getting good SMP support for BSD would be a proof of this theory. Well, we're on the way to better support now, but the way it happened is rather unexpected.

BSDi to the rescue

A few months ago we talked about the merger between Berkeley Software Design, Inc. and Walnut Creek CDROM. At the time of the merger, we had been told that FreeBSD and BSDi's proprietary operating system, BSD/OS, would be merged. It didn't take long for BSDi to announce that this wasn't going to happen, and there was some dissatisfaction as a result. BSDi did agree, however, to let the FreeBSD project merge some BSD/OS code into FreeBSD. In mid-May, BSDi made a snapshot of their development source tree available to the FreeBSD developers.

On the 15th and 16th June (2000) we had a meeting of BSDi and FreeBSD developers at Yahoo!'s facility in Sunnyvale, CA. Chuck Patterson, BSDi's lead SMP developer, spent Thursday presenting how BSDi implemented SMP in BSD/OS 5.0 (as yet unreleased). Chuck also briefly explained BSD/OS 4.x's SMP implementation. On Friday we discussed how to incorporate the structures into FreeBSD.

The BSD/OS 4.x SMP implementation consists mainly of a giant lock, but with a twist. Whenever a processor tries to acquire the giant lock it can either succeed or fail. If it succeeds, then it's business as usual. However, if the acquisition fails, the processor does not spin on the giant lock (in other words, it doesn't just keep looping until the lock becomes free). Instead, it acquires another lock, the scheduler lock or schedlock, which protects scheduler-related portions of the kernel, and schedules another runnable process, if any. This technique works extremely well for heavy workloads that have less than one CPU's worth of system (kernel processing) load. It is very simple, and it achieves good throughput for these workloads.
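The “try, and if that fails do something else” idea can be sketched with ordinary user-level locks. This is a rough analogy in Python, not the BSD/OS code: `giant_lock` and `sched_lock` are plain `threading.Lock` objects, and `enter_kernel()`, `syscall` and `run_queue` are hypothetical names standing in for the real kernel entry path and scheduler.

```python
import threading

giant_lock = threading.Lock()    # stands in for the giant lock
sched_lock = threading.Lock()    # stands in for schedlock


def enter_kernel(syscall, run_queue):
    """Try the giant lock once; if it is busy, pick other work instead."""
    if giant_lock.acquire(blocking=False):
        # Success: business as usual, do the kernel work under the lock.
        try:
            return ("kernel", syscall())
        finally:
            giant_lock.release()
    # Failure: don't spin. Take the scheduler lock, which protects only
    # the run queue, and choose another runnable process, if there is one.
    with sched_lock:
        other = run_queue.pop(0) if run_queue else None
    return ("switched", other)
```

If nobody holds `giant_lock`, `enter_kernel(lambda: 42, ["p2"])` returns `("kernel", 42)`. If the lock is already held elsewhere, the same call returns `("switched", "p2")` immediately instead of waiting.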

The meeting concentrated on the BSD/OS 5.0 SMP implementation, which is more complex:

The BGL remains, but becomes increasingly meaningless. In particular, it is not always necessary to obtain it in order to enter the kernel.
Instead, the system protects shared data structures with mutexes. These mutexes replace calls to tsleep() when waiting on shared resources (the involuntary process synchronization mentioned above). They will also be used much more widely than tsleep() was in traditional UNIX, since they must protect data structures which were previously protected implicitly by the non-preemptive nature of the kernel. Compared with the use of tsleep(), mutexes have a number of advantages:

Each mutex has its own wait (sleep) queue. When a process releases a mutex, it automatically schedules the next process waiting on the queue. This is more efficient than searching a possibly very long, linear sleep queue. It also avoids the flooding when multiple processes get scheduled, and most of them have to go back to sleep again.
Mutexes can be a combination of spin and sleep mutexes: for a resource which may be held only for a very short period of time, even the overhead of sleeping and rescheduling may be higher than waiting in a tight loop. A spin/sleep lock might first wait in a tight loop for 2 microseconds and then sleep if the lock is still not available at that time. This is an issue which Sun has investigated in great detail with Solaris. BSDi has not pursued this yet, though the BSD/OS threading primitives make this an easy extension to add. It's possibly an area for us to investigate once the system is up and limping again.

Interrupt lockouts (spl()s) go away completely. Instead, interrupt functions use mutexes for synchronization. This means that an interrupt function must be capable of blocking, which is currently impossible. In order to block, the function must have a “process” context (a stack and a process structure). In particular, this could include kernel threads.
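The two mutex properties described above (a wait queue per mutex, and spinning briefly before sleeping) can be sketched in a few lines. Both pieces are illustrations under stated assumptions, not the BSD/OS implementation. `ToyMutex` is a deterministic model which assumes that ownership is handed directly to the next waiter. `spin_then_sleep()` uses a real `threading.Lock`, with a blocking acquire standing in for sleeping; the 2 microsecond default comes from the example in the text, and Python cannot time such a short interval accurately, so only the control flow is meaningful.

```python
import threading
import time
from collections import deque


class ToyMutex:
    """Deterministic model of a mutex that owns its wait queue."""

    def __init__(self):
        self.owner = None
        self.waiters = deque()       # only processes waiting for THIS mutex

    def lock(self, pid):
        if self.owner is None:
            self.owner = pid
            return True              # acquired immediately
        self.waiters.append(pid)     # the caller would now go to sleep
        return False

    def unlock(self, pid):
        if self.owner != pid:
            raise RuntimeError("unlock by non-owner")
        # Assumption: hand the mutex straight to the next waiter, so exactly
        # one process becomes runnable and no long queue is searched.
        self.owner = self.waiters.popleft() if self.waiters else None
        return self.owner


def spin_then_sleep(lock, spin_seconds=2e-6):
    """Acquire a threading.Lock: poll briefly, then fall back to blocking."""
    deadline = time.monotonic() + spin_seconds
    while True:
        if lock.acquire(blocking=False):
            return "spun"            # got the lock without sleeping
        if time.monotonic() >= deadline:
            break
    lock.acquire()                   # blocking acquire stands in for sleeping
    return "slept"
```

Releasing a `ToyMutex` with two waiters makes exactly one of them the new owner and leaves the other queued, in contrast to a wakeup() which makes every sleeper runnable. `spin_then_sleep()` returns "spun" for a lock that is free or freed almost at once, and "slept" when the holder keeps the lock for longer than the spin interval.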

BSD/OS on Intel currently uses light-weight interrupt threads to process interrupts, while on SPARC it uses normal (“heavy-weight”) processes. Chuck indicated that the decision to implement light-weight threads first was probably the wrong one, since it gave rise to a large number of problems. Although the heavy-weight process model would give lousy performance, it would probably make it easier to develop the kernel while the light-weight threads were being debugged. It is also possible to build a kernel with either kind of support, so that in case of problems during development it would be possible to revert to the heavy-weight processes while searching for the bug.
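The reason an interrupt handler needs its own context can be seen in a user-level analogy. In this sketch (Python threads, not kernel code) the handler runs in a thread of its own, so it is able to block on a mutex instead of relying on interrupts being masked. The names `InterruptThread`, `post()`, `disk_handler()`, `buffer_lock` and `received` are all hypothetical.

```python
import queue
import threading


class InterruptThread(threading.Thread):
    """Hypothetical stand-in for a per-device interrupt thread."""

    def __init__(self, handler):
        super().__init__(daemon=True)
        self.pending = queue.Queue()     # interrupts posted but not yet handled
        self.handler = handler

    def post(self, data):
        # "Hardware" side: just record the interrupt and return at once.
        self.pending.put(data)

    def shutdown(self):
        self.pending.put(None)           # sentinel: stop after pending work
        self.join()

    def run(self):
        while True:
            data = self.pending.get()
            if data is None:
                return
            self.handler(data)           # runs with its own stack; may block


buffer_lock = threading.Lock()           # protects the shared list below
received = []


def disk_handler(block):
    # No spl-style masking: the handler simply blocks until the lock is free.
    with buffer_lock:
        received.append(block)
```

If the main thread holds `buffer_lock` while calling `post()` twice, nothing is lost and nothing is masked: the handler thread waits for the lock, and once the lock is released and `shutdown()` has returned, `received` contains both blocks in the order they were posted.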

The FreeBSD way

On Friday (16 June 2000) we discussed how to implement this code in FreeBSD.

There are a number of things we need to do. During the meeting we didn't get beyond deciding the first couple of things:

First remove the BGL (currently a spinlock) and replace it with two, maybe three mutexes: one for the scheduler (schedlock), and a blocking mutex for the kernel in place of the BGL. BSD/OS also has an ipending lock for posting interrupts. At the time, we thought it might be a good idea to implement it as well.
In addition, implement the heavy-weight interrupt processes. These would remain in place while the light-weight threads were being debugged.

That was six weeks ago. In the meantime, we have effectively completed these modifications, though debugging the interrupt processes is proving interesting. Still, we have had the first machine up and running for several minutes, and things are looking good.

What about NetBSD and OpenBSD?

I'm not aware of the state of negotiations between BSDi and the NetBSD and OpenBSD communities. The people I've spoken to at BSDi sounded very interested in supplying the code to NetBSD and OpenBSD as well, and hopefully they'll be able to come to an agreement on how to use the code.
