[The following meeting summary was originally written by Greg Lehey, and he
later revised it to include various points from the notes that I took
during the meeting.  Finally, I edited (added some, changed some, removed
some) Greg's summary.  Thanks go to Greg for doing the majority of the
writing work!]

On the 15th and 16th of June we had a seminar at Yahoo! in Sunnyvale about
the recent changes to the BSD/OS kernel designed to improve SMP
performance.

Participants were, in seating order:

  Don Brady	   Apple Computer	       File systems
  Ramesh ?	   Apple Computer
  Ted Walker	   Apple Computer              network drivers
  Jeffrey Hsu	   FreeBSD project
  Chuck Paterson   BSDi			       Chief developer
  Jonathan Lemon   Cisco, FreeBSD project
  Matt Dillon	   FreeBSD project             VM, NFS
  Paul Saab	   Yahoo!
  Kirk McKusick
  Peter Wemm	   Yahoo!
  Jayanth ?	   Yahoo!
  Doug Rabson	   FreeBSD project	       Alpha port
  Jason Evans	   FreeBSD project	       kernel threads
  David Greenman   FreeBSD project	       chief architect
  Justin Gibbs	   Adaptec, FreeBSD project    SCSI, 0 copy TCP
  Greg Lehey	   Linuxcare, FreeBSD project  storage management
  Mike Smith	   BSDi, FreeBSD project       hardware, iA64 port
  Alfred Perlstein Wintel, FreeBSD project
  David O'Brien	   BSDi, FreeBSD project       compilers, binutils
  Ceren Ercen	   Linuxcare                   Daemon babe

We met for approximately 8 hours on Thursday and 4 hours on Friday.

Chuck Patterson spent Thursday presenting how BSDi implemented SMP in
BSD/OS 5.0 (as of yet unreleased).  Chuck also briefly explained BSD/OS
4.x's SMP implementation.

The BSD/OS 4.x SMP implementation is mainly comprised of a giant lock, but
with a twist.  Whenever a processor tries to acquire the giant lock it can
either succeed or fail.  If it succeeds, then it's business as usual.
However, if the acquisition fails, the processor does not spin on the giant
lock.  Instead, it acquires the schedlock (which protects scheduler-related
portions of the kernel) and schedules another runnable process, if any.
This technique works extremely well for heavy work loads that have less
than one CPU worth of system (kernel processing) load.  It is very simple,
and it achieves optimal throughput.

The BSD/OS 5.0 SMP implementation is more complex, and is what most of the
meeting time was spent discussing.  From here on, all discussion of BSD/OS
is with regard to 5.0.

1.  Source code access.

    BSD/OS is a proprietary operating system, for which binary-only and
    source code licenses are available.  BSD/OS is based on the same free
    sources (4.4BSD) as the free BSD operating systems.  It is similar to
    FreeBSD, though the two have diverged significantly enough to cause
    serious pains when moving kernel code between them.

    A few weeks back, BSDi made the source code of BSD/OS available to all
    FreeBSD committers.  During the meeting we discussed what this really
    means, and Kirk McKusick (amongst other things chairman of the board of
    BSDi) said, "Well, we're quite happy for you to take generous chunks,
    but if you ended up taking it all, people might get a little uneasy".
    Basically, anything short of simple repackaging of BSD/OS is not an
    issue.

2.  The current problems.

    UNIX was written for single processor machines, and many of the design
    choices are not just suboptimal for SMP, they're just plain ugly.  In
    particular the synchronization mechanisms don't work well with more
    than one processor.  Briefly:

    - The process context, including the upper half of device drivers,
      doesn't need to protect itself.  The kernel is non-preemptive, so as
      long as a process is executing in the kernel, no other process can
      execute in the kernel.  If another process, even with higher
      priority, becomes runnable while a process is executing kernel code,
      it will have to wait until the active process leaves the kernel or
      sleeps.

    - Processes protect themselves against the interrupt context, primarily
      the bottom half of device drivers, by masking interrupts.  The
      original PDP-11 UNIX used the hardware priority levels (numbered 4 to
      7), and even today you'll find function calls like spl4() and spl7()
      in System V code.  BSD changed the names to more descriptive terms
      like splbio(), splnet() and splhigh(), and also replaced the fixed
      priorities by interrupt masks, but the principle remains the same.
      It's not always easy to solve the question of which interrupts need
      to be masked in which context, and one of the interesting
      observations at this meeting was that as time goes on, the interrupt
      masks are getting blacker.  In other words each spl() is masking off
      more and more bits in the interrupt mask register.  This is not good
      for performance.

    - Processes synchronize with each other using the sleep() or tsleep()
      calls.  Traditional UNIX, including System V, uses sleep(), and BSD
      prefers tsleep(), which provides nice strings which ps(1) displays to
      show what the process is waiting for.  FreeBSD no longer has a
      sleep() call, while BSD/OS has both, but sleep() is deprecated.
      tsleep() is used both for voluntary process synchronization
      (e.g. send a request to another process and wait until it is
      finished), and for involuntary synchronization (e.g. wait for a
      shared resource to become available).

      Processes sleep on a specific address.  In many cases, the address in
      itself has no meaning, and it's probably easier to think of it as a
      number.  When a process sleeps, it is put on a sleep queue.  The
      wakeup() function takes the sleep address, walks through the sleep
      queue, and wakes *every* process which is sleeping on this address.
      This can cause massive problems even on single processor machines;
      UNIX was never really intended to have hundreds of processes waiting
      on the same resource, and a number of Apache performance problems
      center around this behavior.  As a partial solution, FreeBSD also has
      an additional function, wakeup_one(), which only wakes one process.

   There are a number of reasons why this concept is not a good solution
   for SMP.  Firstly, the simplistic assumption "nothing else can be
   executing in the kernel while I am" falls flat.  We currently haven't
   implemented a solution for this.  Instead, we found a way of enforcing
   this illogical state, the Big Giant Lock (BGL).  Any process entering
   the kernel must first obtain the BGL; if a process executing on another
   processor has the lock, then the current processor spins; it can't even
   schedule another process to run, because that requires entering the
   kernel.

   The other issue is with masking interrupts.  This is also quite a
   problem for SMP machines, since it requires masking the interrupts on
   all processors, again requiring an expensive synchronization.

3. The BSD/OS solution.

   - The BGL remains, but becomes increasingly meaningless.  In particular,
     it is not always necessary to obtain it in order to enter the kernel.

   - Instead the system protects shared data structures with mutexes.
     These mutexes replace calls to tsleep() when waiting on shared
     resources (the involuntary process synchronization mentioned above).
     In contrast to traditional UNIX, mutexes will be used much more
     frequently in order to protect data structures which were previously
     implicitly protected by the non-preemptive nature of the kernel.  This
     mechanism will replace calls to tsleep() for involuntary context
     switches.

     Compared with the use of tsleep(), mutexes have a number of
     advantages:

     - Each mutex has its own wait (sleep) queue.  When a process releases
       a mutex, it automatically schedules the next process waiting on the
       queue.  This is more efficient than searching a possibly very long,
       linear sleep queue.  It also avoids the flooding when multiple
       processes get scheduled, and most of them have to go back to sleep
       again.

     - Mutexes can be a combination of spin and sleep mutexes: for a
       resource which may be held only for a very short period of time,
       even the overhead of sleeping and rescheduling may be higher than
       waiting in a tight loop.  A spin/sleep lock might first wait in a
       tight loop for 2 microseconds and then sleep if the lock is still
       not available at that time.  This is an issue which Sun has
       investigated in great detail with Solaris.  BSDi has not pursued
       this yet, though the BSD/OS threading primitives make this an easy
       extention to add.  It's possibly an area for us to investigate once
       the system is up and limping again.

   - Interrupt lockout (spl()s) go away completely.  Instead, interrupt
     functions use mutexes for synchronization.  This means that an
     interrupt function must be capable of blocking, which is currently
     impossible.  In order to block, the function must have a "process"
     context (a stack and a process control block).  In particular, this
     could include kernel threads.

     BSD/OS on Intel currently uses light-weight interrupt threads to
     process interrupts, while on SPARC uses normal ("heavyweight")
     processes.  Chuck indicated that the decision to implement
     light-weight threads initially was probably the wrong one, since it
     gave rise to a large number of problems, and although the heavyweight
     process model would give lousy performance, it would probably make it
     easier to develop the kernel while the light-weight processes were
     being debugged.  There is also the possibility of building a kernel
     with one or the other support, so that in case of problems during
     development it would be possible to revert to the heavy-weight
     processes while searching for the bug.

   Other details we discussed included:

   - BSD/OS has not implemented condition variables.  We didn't go into
     details.  The opinion was expressed that they would be very useful for
     synchronization, but that they require coding discipline somewhat
     different than the tsleep() mechanism.  Converting all use of tsleep()
     is a lot of work, and of dubious value.  However, condition variables
     can live along with tsleep(), so a full changeover is not necessary.

   - BSD/OS also does not implement read/write locks, a thing that Linux
     has recently introduced.  We didn't discuss this further, but the
     concepts are well understood, and it should be relatively simple to
     implement them if we find a need.

   - Netgraph poses locking performance problems, since locks have to be
     released at multiple potential transfer points, regardless of whether
     Netgraph is in use.  This problem also exists with System V STREAMS.
     During the meeting we didn't come to a clear consensus on how much of
     a problem this really is.

   - Interrupts can have priority inversion problems on MP machines in
     combination with lazy context switching (aka context stealing).
     However, it's a temporary inversion that just causes latency.  The
     reason that deadlock never occurs is that as soon as a lock is missed,
     the interrupt stack stealing is unwound, so there is never a situation
     where a lock is held that can cause deadlock.  When a high-priority
     process waits for a lower priority process, the blocking process
     temporarily lends its priority to the running process in order to
     ensure that it finishes quickly.  This technique is interchangeably
     called priority inheritance, priority lending, deadlock avoidance, and
     probably other names, just to make things confusing.

   - NFS leasing causes big problems.  Samba will have similar problems,
     potentially.

   - Message queues are probably worthwhile, but they're currently not a
     high priority.

   - There are a number of global variable updates that are not locked, and
     can thus result in partially updated variables (i.e. reads can get
     corrupt values).  This requires either using a locked instruction, or
     using a mutex.  A mutex isn't much more expensive, and is probably
     easier.

   - We should split part of struct proc out into a fixed-size kproc for ps
     use.  This isn't really related to the SMP work, but it would be nice
     to get rid of the dreaded "proc size mismatch" error message that
     people get when their kernel is out of sync with userland.

   - We spoke about naming conventions.  Some people weren't too happy with
     BSD/OS's macro names.  Chuck agreed and said that he would adopt our
     naming convention if we chose a better one.

   - Per-CPU variables need GET_*() and SET_*() routines to lock.

4. Things we need to do.

   There are a number of things we need to do.  During the meeting we
   didn't get beyond deciding the first couple of things:

    - First remove the BGL (currently a spinlock) and replace it with two,
      maybe three mutexes, one for the scheduler (schedlock), and a
      blocking mutex for the kernel in place of the BGL.  BSD/OS also has
      an ipending lock for posting interrupts, which we should probably
      implement in the short term, though it's possible that it might go
      again.

    - In addition, implement the heavy-weight interrupt processes.  These
      would remain in place while the light-weight threads were being
      debugged.

5. Who does what?

   A number of people will work on the SMP project.  During the first stage:

  - Matt Dillon will put in locking primitives and schedlock.  This
    includes resurrecting our long-dead idle process to scan the run queue
    for interrupt threads.  He won't have time for NFS.

  - Doug Rabson will work on the alpha bits, so that it doesn't get left in
    the dust.

  - Greg Lehey will implement the heavyweight interrupt processes and
    lightweight interrupt threads.

  - Jason Evans will be the project manager.

6. Timing.

  We have a general agreement that it's better to do it right than do it
  quickly.  Thus far, Matt has implemented much of his part and is now
  waiting on Greg to do the interrupt processes.  When they've done that,
  they'll do their own tests, and others will do additional testing.  All
  commits will be dependent on approval from Jason, and the first can be
  expected within two months (probably sooner).

  The SMP changes will be maintained as patches against -current until the
  following milestones have been met:

   - Port the BSD/OS locking primitives to the i386 port (Matt) and the
     alpha port (Doug Rabson).

   - Convert the BGL to a blocking lock, add the schedlock, add per-CPU
     idle processes (Matt).

   - Implement heavy-weight interrupt threads (Greg).  Light-weight
     interrupt thread context switching may be working by the time the
     first commit is made, but this is not a requirement.

   - Stub out (basically disable) spl()s.

   - Demonstrated successful compilation and reasonable stability
     (self-hosted kernel build) on both i386 (UP and SMP) and alpha.

  The maintenance of the patches is expected to be a bit of pain, but we
  have decided not to branch due to technical issues with maintaining
  branches in CVS.  The patches are expected to exist only until the first
  commit is made.  At that point, all further development will be done
  directly on HEAD in cvs.

On the light side, we had a rather amusing experience on Friday.  We wanted
to order some sandwiches, but something went wrong with the order, so Paul
ordered pizza instead.  A bit later, the pizza boy came in and deposited
the pizzas on the conference table and was about to leave when Paul
introduced him.  His name is David Filo.  Thanks for the pizza!