[The following meeting summary was originally written by Greg Lehey, and he later revised it to include various points from the notes that I took during the meeting. Finally, I edited (added some, changed some, removed some) Greg's summary. Thanks go to Greg for doing the majority of the writing work!] On the 15th and 16th of June we had a seminar at Yahoo! in Sunnyvale about the recent changes to the BSD/OS kernel designed to improve SMP performance. Participants were, in seating order: Don Brady Apple Computer File systems Ramesh ? Apple Computer Ted Walker Apple Computer network drivers Jeffrey Hsu FreeBSD project Chuck Paterson BSDi Chief developer Jonathan Lemon Cisco, FreeBSD project Matt Dillon FreeBSD project VM, NFS Paul Saab Yahoo! Kirk McKusick Peter Wemm Yahoo! Jayanth ? Yahoo! Doug Rabson FreeBSD project Alpha port Jason Evans FreeBSD project kernel threads David Greenman FreeBSD project chief architect Justin Gibbs Adaptec, FreeBSD project SCSI, 0 copy TCP Greg Lehey Linuxcare, FreeBSD project storage management Mike Smith BSDi, FreeBSD project hardware, iA64 port Alfred Perlstein Wintel, FreeBSD project David O'Brien BSDi, FreeBSD project compilers, binutils Ceren Ercen Linuxcare Daemon babe We met for approximately 8 hours on Thursday and 4 hours on Friday. Chuck Patterson spent Thursday presenting how BSDi implemented SMP in BSD/OS 5.0 (as of yet unreleased). Chuck also briefly explained BSD/OS 4.x's SMP implementation. The BSD/OS 4.x SMP implementation is mainly comprised of a giant lock, but with a twist. Whenever a processor tries to acquire the giant lock it can either succeed or fail. If it succeeds, then it's business as usual. However, if the acquisition fails, the processor does not spin on the giant lock. Instead, it acquires the schedlock (which protects scheduler-related portions of the kernel) and schedules another runnable process, if any. This technique works extremely well for heavy work loads that have less than one CPU worth of system (kernel processing) load. It is very simple, and it achieves optimal throughput. The BSD/OS 5.0 SMP implementation is more complex, and is what most of the meeting time was spent discussing. From here on, all discussion of BSD/OS is with regard to 5.0. 1. Source code access. BSD/OS is a proprietary operating system, for which binary-only and source code licenses are available. BSD/OS is based on the same free sources (4.4BSD) as the free BSD operating systems. It is similar to FreeBSD, though the two have diverged significantly enough to cause serious pains when moving kernel code between them. A few weeks back, BSDi made the source code of BSD/OS available to all FreeBSD committers. During the meeting we discussed what this really means, and Kirk McKusick (amongst other things chairman of the board of BSDi) said, "Well, we're quite happy for you to take generous chunks, but if you ended up taking it all, people might get a little uneasy". Basically, anything short of simple repackaging of BSD/OS is not an issue. 2. The current problems. UNIX was written for single processor machines, and many of the design choices are not just suboptimal for SMP, they're just plain ugly. In particular the synchronization mechanisms don't work well with more than one processor. Briefly: - The process context, including the upper half of device drivers, doesn't need to protect itself. The kernel is non-preemptive, so as long as a process is executing in the kernel, no other process can execute in the kernel. If another process, even with higher priority, becomes runnable while a process is executing kernel code, it will have to wait until the active process leaves the kernel or sleeps. - Processes protect themselves against the interrupt context, primarily the bottom half of device drivers, by masking interrupts. The original PDP-11 UNIX used the hardware priority levels (numbered 4 to 7), and even today you'll find function calls like spl4() and spl7() in System V code. BSD changed the names to more descriptive terms like splbio(), splnet() and splhigh(), and also replaced the fixed priorities by interrupt masks, but the principle remains the same. It's not always easy to solve the question of which interrupts need to be masked in which context, and one of the interesting observations at this meeting was that as time goes on, the interrupt masks are getting blacker. In other words each spl() is masking off more and more bits in the interrupt mask register. This is not good for performance. - Processes synchronize with each other using the sleep() or tsleep() calls. Traditional UNIX, including System V, uses sleep(), and BSD prefers tsleep(), which provides nice strings which ps(1) displays to show what the process is waiting for. FreeBSD no longer has a sleep() call, while BSD/OS has both, but sleep() is deprecated. tsleep() is used both for voluntary process synchronization (e.g. send a request to another process and wait until it is finished), and for involuntary synchronization (e.g. wait for a shared resource to become available). Processes sleep on a specific address. In many cases, the address in itself has no meaning, and it's probably easier to think of it as a number. When a process sleeps, it is put on a sleep queue. The wakeup() function takes the sleep address, walks through the sleep queue, and wakes *every* process which is sleeping on this address. This can cause massive problems even on single processor machines; UNIX was never really intended to have hundreds of processes waiting on the same resource, and a number of Apache performance problems center around this behavior. As a partial solution, FreeBSD also has an additional function, wakeup_one(), which only wakes one process. There are a number of reasons why this concept is not a good solution for SMP. Firstly, the simplistic assumption "nothing else can be executing in the kernel while I am" falls flat. We currently haven't implemented a solution for this. Instead, we found a way of enforcing this illogical state, the Big Giant Lock (BGL). Any process entering the kernel must first obtain the BGL; if a process executing on another processor has the lock, then the current processor spins; it can't even schedule another process to run, because that requires entering the kernel. The other issue is with masking interrupts. This is also quite a problem for SMP machines, since it requires masking the interrupts on all processors, again requiring an expensive synchronization. 3. The BSD/OS solution. - The BGL remains, but becomes increasingly meaningless. In particular, it is not always necessary to obtain it in order to enter the kernel. - Instead the system protects shared data structures with mutexes. These mutexes replace calls to tsleep() when waiting on shared resources (the involuntary process synchronization mentioned above). In contrast to traditional UNIX, mutexes will be used much more frequently in order to protect data structures which were previously implicitly protected by the non-preemptive nature of the kernel. This mechanism will replace calls to tsleep() for involuntary context switches. Compared with the use of tsleep(), mutexes have a number of advantages: - Each mutex has its own wait (sleep) queue. When a process releases a mutex, it automatically schedules the next process waiting on the queue. This is more efficient than searching a possibly very long, linear sleep queue. It also avoids the flooding when multiple processes get scheduled, and most of them have to go back to sleep again. - Mutexes can be a combination of spin and sleep mutexes: for a resource which may be held only for a very short period of time, even the overhead of sleeping and rescheduling may be higher than waiting in a tight loop. A spin/sleep lock might first wait in a tight loop for 2 microseconds and then sleep if the lock is still not available at that time. This is an issue which Sun has investigated in great detail with Solaris. BSDi has not pursued this yet, though the BSD/OS threading primitives make this an easy extention to add. It's possibly an area for us to investigate once the system is up and limping again. - Interrupt lockout (spl()s) go away completely. Instead, interrupt functions use mutexes for synchronization. This means that an interrupt function must be capable of blocking, which is currently impossible. In order to block, the function must have a "process" context (a stack and a process control block). In particular, this could include kernel threads. BSD/OS on Intel currently uses light-weight interrupt threads to process interrupts, while on SPARC uses normal ("heavyweight") processes. Chuck indicated that the decision to implement light-weight threads initially was probably the wrong one, since it gave rise to a large number of problems, and although the heavyweight process model would give lousy performance, it would probably make it easier to develop the kernel while the light-weight processes were being debugged. There is also the possibility of building a kernel with one or the other support, so that in case of problems during development it would be possible to revert to the heavy-weight processes while searching for the bug. Other details we discussed included: - BSD/OS has not implemented condition variables. We didn't go into details. The opinion was expressed that they would be very useful for synchronization, but that they require coding discipline somewhat different than the tsleep() mechanism. Converting all use of tsleep() is a lot of work, and of dubious value. However, condition variables can live along with tsleep(), so a full changeover is not necessary. - BSD/OS also does not implement read/write locks, a thing that Linux has recently introduced. We didn't discuss this further, but the concepts are well understood, and it should be relatively simple to implement them if we find a need. - Netgraph poses locking performance problems, since locks have to be released at multiple potential transfer points, regardless of whether Netgraph is in use. This problem also exists with System V STREAMS. During the meeting we didn't come to a clear consensus on how much of a problem this really is. - Interrupts can have priority inversion problems on MP machines in combination with lazy context switching (aka context stealing). However, it's a temporary inversion that just causes latency. The reason that deadlock never occurs is that as soon as a lock is missed, the interrupt stack stealing is unwound, so there is never a situation where a lock is held that can cause deadlock. When a high-priority process waits for a lower priority process, the blocking process temporarily lends its priority to the running process in order to ensure that it finishes quickly. This technique is interchangeably called priority inheritance, priority lending, deadlock avoidance, and probably other names, just to make things confusing. - NFS leasing causes big problems. Samba will have similar problems, potentially. - Message queues are probably worthwhile, but they're currently not a high priority. - There are a number of global variable updates that are not locked, and can thus result in partially updated variables (i.e. reads can get corrupt values). This requires either using a locked instruction, or using a mutex. A mutex isn't much more expensive, and is probably easier. - We should split part of struct proc out into a fixed-size kproc for ps use. This isn't really related to the SMP work, but it would be nice to get rid of the dreaded "proc size mismatch" error message that people get when their kernel is out of sync with userland. - We spoke about naming conventions. Some people weren't too happy with BSD/OS's macro names. Chuck agreed and said that he would adopt our naming convention if we chose a better one. - Per-CPU variables need GET_*() and SET_*() routines to lock. 4. Things we need to do. There are a number of things we need to do. During the meeting we didn't get beyond deciding the first couple of things: - First remove the BGL (currently a spinlock) and replace it with two, maybe three mutexes, one for the scheduler (schedlock), and a blocking mutex for the kernel in place of the BGL. BSD/OS also has an ipending lock for posting interrupts, which we should probably implement in the short term, though it's possible that it might go again. - In addition, implement the heavy-weight interrupt processes. These would remain in place while the light-weight threads were being debugged. 5. Who does what? A number of people will work on the SMP project. During the first stage: - Matt Dillon will put in locking primitives and schedlock. This includes resurrecting our long-dead idle process to scan the run queue for interrupt threads. He won't have time for NFS. - Doug Rabson will work on the alpha bits, so that it doesn't get left in the dust. - Greg Lehey will implement the heavyweight interrupt processes and lightweight interrupt threads. - Jason Evans will be the project manager. 6. Timing. We have a general agreement that it's better to do it right than do it quickly. Thus far, Matt has implemented much of his part and is now waiting on Greg to do the interrupt processes. When they've done that, they'll do their own tests, and others will do additional testing. All commits will be dependent on approval from Jason, and the first can be expected within two months (probably sooner). The SMP changes will be maintained as patches against -current until the following milestones have been met: - Port the BSD/OS locking primitives to the i386 port (Matt) and the alpha port (Doug Rabson). - Convert the BGL to a blocking lock, add the schedlock, add per-CPU idle processes (Matt). - Implement heavy-weight interrupt threads (Greg). Light-weight interrupt thread context switching may be working by the time the first commit is made, but this is not a requirement. - Stub out (basically disable) spl()s. - Demonstrated successful compilation and reasonable stability (self-hosted kernel build) on both i386 (UP and SMP) and alpha. The maintenance of the patches is expected to be a bit of pain, but we have decided not to branch due to technical issues with maintaining branches in CVS. The patches are expected to exist only until the first commit is made. At that point, all further development will be done directly on HEAD in cvs. On the light side, we had a rather amusing experience on Friday. We wanted to order some sandwiches, but something went wrong with the order, so Paul ordered pizza instead. A bit later, the pizza boy came in and deposited the pizzas on the conference table and was about to leave when Paul introduced him. His name is David Filo. Thanks for the pizza!