Day 2, 21 Oct 2002

Today was a full day of Linux Kernel Internals, basically an introduction to the kernel source code. The speaker was Theodore Ts'o, who was dynamic enough to keep things moving for most of the day and kept me awake for almost all of it. Impressive. The course covered the major elements of the Linux kernel from about a 1,000-foot level -- just enough to find the major subsystem where a kernel might be having trouble, or to figure out what parts need rewriting for a modification. It was based mostly on 2.2 and 2.4 code. Numbers in * * are slide numbers.

*9* Started with a short history of Linux, a 'hobby which got out of control'. Originally Linux was an entertainment, a way to play with the multitasking and scheduling capabilities of the i386 chip. For a while it was a terminal program to talk to Linus's grad school. Then a famous flame war erupted between Linus and Andrew Tanenbaum over whether a monolithic kernel or a microkernel is the best way to implement an OS. So far Linux is winning, for a variety of reasons.

It's almost 100% original GPLed code, with perhaps 8K of borrowed code, such as the Jacobson PPP code which nobody wanted to rewrite. The GPL is good in several ways. The BSD internals have not changed in a good 12 to 15 years, which is good for experimenting with novel TCP/IP implementations, but bad for efficiency. For example, the TTY character queueing assumes it is running on a VAX: the character queue is a bunch of tiny nodes in a linked list, instead of just an array, a legacy of limitations in the VAX tty hardware. Linux, by contrast, has gone through three or four complete reimplementations of the network layer, hence inefficiencies like this have been ironed out. It is not locked in to yesterday's design decisions. BSD also has stuff like "cylinder groups", which use the fact that each track on a disk has a constant number of sectors to minimize horizontal seeks on disk platter stacks. Unfortunately, modern disks do not have this property -- they are constant density, not constant sectors. Lots of complexity in BSD's disk access code was devoted to trying to work this out heuristically; it never really worked and eventually was dropped -- but the cylinder group code still persists in the disk access code. ext2, by contrast, has block groups (which work) but not cylinder groups (which do not).

The kernel is the first truly distributed open-source project, mostly because it came along at exactly the right time in sociology and technology. In 1991, the arpanet was downright painful to use. It took as long as 2 minutes to successfully route around a problem, and because nodes were very slow, an overload looked like a failure. So the internet would stay down for too long to do real distributed development... The kernel is developed in a distributed hierarchical model, with one main developer per module or subsystem -- say, Dave Miller in charge of the networking stack -- and Linus in charge of all. It is a working meritocracy. Developers try hard not to let corporate affiliation affect coding decisions, and Linus went to work for Transmeta deliberately to avoid even the appearance of a corporate bias. If a patch is good, eventually it will make it into the kernel.

Kernel development strongly discourages binary-only modules, because the kernel code changes so fast that you must keep recompiling for each new version. For example, essential control structures get re-ordered for efficiency, forcing recompilation of all drivers to pick up the new structs.
If you're trying to maintain binary-only code, you must now ship builds for both the new and the old versions of the kernel, and that is just for a point revision in the stable series. Discouraging binary-only modules also makes the system more stable, because more source code is available for bug-fixing. Newer kernels have a "tainted" flag in kernel dumps which tells whether binary-only code was in the kernel at the time of a panic. If you submit such a dump to the kernel mailing list, the first response you will get is a request to remove the binary-only code and try to reproduce the bug. Only if you can reproduce the bug on an untainted kernel will you get help. This policy arose because many binary-only drivers caused problems which manifested themselves only very remotely from the driver itself, leading to wild goose chases where the real problem was outside the ability of anyone without source access to repair.

Linux is a monolithic kernel, so there are no protection boundaries between the various subsystems. Microkernels pay a lot of overhead to cross protection boundaries, which is why most microkernel OSs end up with almost everything inside one protection boundary for performance. The original NT had its graphics code outside of the main code, but the overhead of crossing the protection boundary meant that you couldn't run Quake on it worth a damn. Versions after 3.5 moved the graphics code into the same protection boundary as the rest of the system for performance reasons, but that made the OS radically less stable, since display driver writers were used to a much greater degree of protection from their bugs. The Linux code is nonetheless modular from a software engineering point of view: a networking driver should not be changing file system data structures, even though no specific mechanism exists to ensure that it cannot.

Portable -- only the asm and arch stuff is truly machine dependent; the common code is generally very portable. 1.0.* only worked on Intel, but over time that has changed. The kernel list is pragmatic rather than zealous about portability, unlike, say, NetBSD. The result is that NetBSD did not have PCMCIA support until 3 full years after Linux, because they had to do so much infrastructure work first. With PCMCIA, since you can jerk the card out of the machine at any time, you must worry about all sorts of edge conditions which could crash the entire OS. But because they wrote all that infrastructure for PCMCIA, NetBSD had USB support a full year before Linux did.

Linux can be thought of as an object-oriented system. Although we don't use C++, we make heavy use of function pointers, so that the same logic can be applied to multiple different implementations.
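To make that concrete, here is a minimal sketch of the idiom in plain C, modeled loosely on the kernel's struct file_operations. The toy_* names are invented for illustration; this is not kernel code.

    /* Function pointers as a "virtual function table", modeled loosely
     * on the kernel's struct file_operations.  The toy_* names are
     * made up for illustration. */
    #include <stddef.h>

    struct toy_device;

    struct toy_ops {
            int  (*open)(struct toy_device *dev);
            int  (*read)(struct toy_device *dev, char *buf, size_t len);
            void (*close)(struct toy_device *dev);
    };

    struct toy_device {
            const struct toy_ops *ops;  /* this device type's "vtable" */
            void *private_data;         /* per-implementation state */
    };

    /* Generic code dispatches through the table without knowing which
     * concrete implementation it is talking to. */
    static int toy_read(struct toy_device *dev, char *buf, size_t len)
    {
            if (!dev->ops->read)
                    return -1;          /* operation not supported */
            return dev->ops->read(dev, buf, len);
    }

Every driver or filesystem fills in a table like this and the generic layer calls through it -- polymorphism without C++.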
OS as resource manager -- physical resources (memory, disk) and logical resources (a mutually consenting illusion), such as pids, users, processes, etc. A resource manager must provide: (1) performance -- manage resources as efficiently as possible; (2) security -- if someone has a resource, make sure that someone else cannot interfere with it; (3) robustness -- cope gracefully with hard disk or memory system problems.

Multics: old timers always say "But Multics did...". On a Multics system, you could fsck a mounted volume while it was in use. It had magnetic core memory and extreme robustness: memory could be powered off and on without loss, and Multics would simply come back where it left off. In fact, you could even turn off portions of main memory while the OS was running, and only the processes in the lost memory would die.

Hardware was so bad in those days that sometimes something would pull down an interrupt, then report the wrong service number. So the OS would respond to the interrupt which said tty3 had something to say, and ask tty3 for it, but tty3 would say "Who, me?". There was a section of the Multics code entitled "Timmy, Lassie's trying to tell us something" which was devoted to straightening this mess out.

(4) Conformity -- an OS must export the standard Unix APIs. A classic problem of this nature is readdir/telldir/seekdir. This API is broken if you're using, say, a B-tree file system, but you must implement it to stay POSIX compliant. Rob Pike wrote a famous paper in which he said "operating system research is dead"; he was embittered because 80% of Plan 9 was just stuff to do POSIX compliance. JFS keeps an entirely separate B-tree on disk just so readdir/telldir/seekdir works properly.

Performance: efficient use of space and time, high throughput, predictability, fairness -- all mutually limiting.

Local resources are:
* reliable
* trusted
* always there
* always necessary

Sudden death if one is not present is acceptable. Memory is a local resource on Linux, but on Multics it was a remote resource.

Kernel vs. process context. Global resources: kernel memory, I/O, DMA and IRQ channels, time and timers. Resource registration exists to prevent device conflicts, improve autoprobe stability, let the admin read the current configuration, and track dynamically loaded modules. The mechanism is advisory, not enforced -- in kernel context, you are omnipotent.

IRQ registration: request_irq (like sigaction). SA_INTERRUPT marks a fast interrupt -- all other interrupts are disabled on this CPU, and some overhead is skipped, e.g. the segment registers are not saved before executing the handler. SA_SHIRQ means multiple handlers can share this interrupt vector. Older ISA cards could damage the actual TTL logic if one chip tried to drive the line high while another drove it low; that could short all 5 volts to ground. Now cheap PCI machines use only one interrupt for all cards. This is Not Good. SA_SAMPLERANDOM says this interrupt can be sampled for /dev/random. The timer interrupt, for example, is a poor choice for this, because it is completely nonrandom. But the disk drive interrupt could be good: it varies chaotically with system load, disk seek conditions, and even the air currents inside the drive. Network cards are also quite unpredictable, but may be a poor choice because their interrupts can be predicted from inside your own LAN.
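Here is roughly what that registration looks like in a 2.4-flavored driver. A hedged sketch: the mydev names are invented, and header locations and flag spellings shift between kernel versions.

    /* A 2.4-flavored sketch of request_irq(); the mydev names are
     * invented, and headers move around between kernel versions. */
    #include <linux/sched.h>
    #include <linux/interrupt.h>
    #include <linux/errno.h>

    struct mydev {
            unsigned int irq;       /* the line our card was assigned */
            /* ... device state ... */
    };

    /* 2.4 handlers return void.  They borrow someone else's process
     * context, so no sleeping or blocking in here. */
    static void mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
            struct mydev *dev = dev_id;
            /* acknowledge the card, queue the real work for a
             * bottom half or tasklet */
            (void)dev;
    }

    static int mydev_setup(struct mydev *dev)
    {
            /* SA_SHIRQ: willing to share the line (safe on PCI); dev is
             * handed back as dev_id so shared handlers can tell whose
             * card fired.  SA_SAMPLERANDOM: let this IRQ's timing feed
             * the /dev/random entropy pool. */
            if (request_irq(dev->irq, mydev_interrupt,
                            SA_SHIRQ | SA_SAMPLERANDOM, "mydev", dev))
                    return -EBUSY;
            return 0;
    }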
Bottom-half black magic: there used to be a request_dma and a request_io call, but in 2.4 there is a struct resource tree, and various drivers can grab whole portions of the tree. This is a bigger problem on architectures other than i386.

*26* Scheduling. All kernel code -- except some specific code at the very beginning of boot time -- runs in a process context ("foreground kernel code"). Kernel tasks are never preempted by other processes; kernel code yields control explicitly, or implicitly by page fault. Interrupts 'borrow' someone else's process context, so you cannot sleep or block inside an interrupt. Most interrupts allow a second interrupt. Interrupts can request bottom-half handling: network packets can take a long time to route, so the interrupt simply puts the packet on a queue, then requests the bottom-half handler to take care of it on return. (Top half: user-level code; bottom half: system-level code.) The name "bottom half" is another name for a "soft IRQ" -- which Linus said was bad, evil and wrong, but it snuck in under a different name.

Bottom-half routines may not block. The bottom-half queue is global, not per-processor. Do not wait for I/O to complete in a bottom-half routine. Kernel coding means you're often restricted in what you can do. 2.2 kernels have only the simple bottom-half support. The 2.4 kernel has:
* the 2.2 bottom-half compatibility model
* soft IRQs -- can run on multiple CPUs at once, but you are responsible for multithread support
* tasklets -- dynamically registered soft IRQs which can only run on one CPU at a time, so no need for explicit multithread support

Example: a high-priority clock tick handler paired with a low-priority clock-tick bottom half. The tick handler just increments a counter; the bottom-half handler updates the system clock, checks for timer expiry, et cetera.

*31* Kernel timers: the old style has a fixed number of timers, checked every clock tick if required. One-shot (new) timers (sched.c) are unlimited in number and implemented as a list of buckets, so only the bucket where a timer goes off needs to be checked.

*33* Kernel synchronization: wait queues, locks, semaphores. A wait queue is a linked list of processes waiting for a thing to happen. To join a wait queue, mark yourself asleep _first_, then check the wakeup condition, then call schedule() and retest until it's true. The wakeup() call makes you runnable again. Do it in any other order and characters can show up between when you put yourself on the wait queue and when you put yourself to sleep -- in which case you will never wake up to see them, and your process will hang. This happens a lot, which is why I'm spending time on it.
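The ordering rule is easier to see in code. A minimal 2.4-style sketch, with invented mydev names; the waker's side (say, the interrupt handler) simply sets the condition and calls wake_up(&mydev_wq).

    /* The sleep/wakeup ordering rule above, as a 2.4-style sketch. */
    #include <linux/sched.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(mydev_wq);
    static volatile int data_ready;     /* the wakeup condition */

    static void mydev_wait_for_data(void)
    {
            DECLARE_WAITQUEUE(wait, current);

            add_wait_queue(&mydev_wq, &wait);
            for (;;) {
                    /* Mark ourselves asleep BEFORE testing the condition.
                     * A wakeup that races in between just makes us
                     * runnable again, instead of being lost forever. */
                    set_current_state(TASK_INTERRUPTIBLE);
                    if (data_ready)
                            break;
                    schedule();         /* really go to sleep */
            }
            set_current_state(TASK_RUNNING);
            remove_wait_queue(&mydev_wq, &wait);
    }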
*36* BKL -- Big Kernel Lock. Several user-space processes can run at once in SMP, but only one process at a time can make a system call in 2.0.36. The BKL just spins, and the CPU in question does no work while it waits. This is only almost-SMP, since only workloads which make minimal use of system calls see a real performance boost. SMP scalability: how many _effective_ CPUs do you get from 4 CPUs? In 2.0.x, maybe 1.6, because the minute a user process makes a system call, everything else stops until it's finished. The BKL is a cheap and dirty way to get multi-CPU capability.

*37* The 2.2 and 2.4 kernels broke the locks down by subsystem, e.g. VM, networking, filesystem. This required more kernel changes than the BKL, but still pretty minimal ones. The Mindcraft benchmark forced the networking stack to become fine-grain locked, so that 4 network cards on 4 different CPUs could run completely in parallel. Linux still ran faster with one gigabit card on one CPU, but with 4 cards on 4 processors NT beat it before this patch. You can compile all this away by not selecting SMP. Linux 2.6 adds a preemptible kernel, with preemption points at every spinlock -- but this means a lot of overhead. Getting everything to run well on a 2-CPU system may not be good for a 16- or 32-way system; too many spinlocks increase overhead on smaller systems unnecessarily. Adding compile-time switches for different numbers of CPUs leads to a testing nightmare, because now you must test everything on 2-, 4-, 8-, and 16-processor systems instead of just one processor and "more". Scaling also depends on workload: doing nothing but matrix math looks great out to 32 processors, but on a real-world problem 4-way is best. A 2.6 system may go as far as 8-way. The VM and I/O layers are the most performance-sensitive parts of the kernel.

No spinlocks needed for the atomic operations: toggle, clear, set, and test-and-set bit; the xchg operator; inc, dec, add, and subtract on a memory location, all atomic. These are hardware on x86; other platforms have architecture-dependent layers to emulate them if hardware support is not present.

*38* New in 2.2: if you need to read one field in a struct and change another field based on it, you now need a semaphore, including a read/write semaphore. If the sequence (1) read lock, (2) write lock, (3) read lock occurs, does (3) wait until after the write or not? In 2.2, (3) waits until the write lock clears, because of the danger of lock starvation -- otherwise write locks might never be granted, as more read locks keep arriving before they can take effect. This is a fairness issue. A semaphore is a blocking operation, so you can use it when you have a process context; taking a semaphore allows someone else to run. Spinlocks exist only on SMP machines. They're lightweight because there is no schedule() call -- you just poll the lock until it is free. Best for short durations, e.g. updating pointers in a linked list, where you only spin for 12-15 cycles.

*40* Big Reader Lock, e.g. for the network routing table. This prevents "cache ping-pong" effects. Each CPU has its own cache memory. If a memory location cached on multiple CPUs is updated, then you must invalidate it in every CPU cache which contains it -- a substantial performance penalty. Memory is very slow relative to the CPU, and it has been getting further and further away.

Processes and memory maps. The global variable "current" points to whatever process is currently executing on this CPU. The swapper runs as a kernel thread and thus is allowed to block. Kernel threads are normally created at boot time; user-space programs strictly speaking cannot create a kernel thread. They're mostly created through system calls or loadable module initialization.

*44* File system context -- cwd, whether chrooted, open files, an index into the open files. Memory mappings include shared libraries, signal handlers, and any sigactions declared by the program. The clone() system call specifies which parts of the current context get shared and which get copied; fork() always copies. This means that if a fork()ed child closes a file, it's still open in the parent. But if a clone()d child closes a file and the fs context has also been clone()d, then the file closes in the parent too. clone() is specific to Linux but is used to implement POSIX threads; it is similar to Plan 9 and SGI's R4. clone() does not set up a new stack, so clone()d children must explicitly set up a new stack and move their stack pointer to it. The POSIX thread API is horrible. This is also why Java sucks on Linux: the implementations use old threading models, a result of the poor POSIX threading API available.
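A user-space sketch of the stack business, using the glibc clone() wrapper. The flag set here is illustrative only, not what any particular thread library does.

    /* clone(2) via the glibc wrapper.  CLONE_VM shares the address
     * space, CLONE_FILES the open-file table -- so a close() in the
     * child closes the file in the parent too. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STACK_SIZE (64 * 1024)

    static int child_fn(void *arg)
    {
            printf("child: pid %d\n", (int)getpid());
            return 0;
    }

    int main(void)
    {
            char *stack = malloc(STACK_SIZE);
            if (!stack)
                    return 1;
            /* clone() does not create a stack for the child: we pass
             * the TOP of a region we allocated, since x86 stacks grow
             * down. */
            pid_t pid = clone(child_fn, stack + STACK_SIZE,
                              CLONE_VM | CLONE_FILES | SIGCHLD, NULL);
            if (pid < 0)
                    return 1;
            waitpid(pid, NULL, 0);  /* SIGCHLD on exit, so plain wait works */
            free(stack);
            return 0;
    }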
*45* Linux uses the same page tables in user and kernel mode, which is very different from most other OSs.

*46* boot -- ring 0 mode, the legacy 16-bit mode of the processor. Linux uses only rings 0 and 3 of the four-level ix86 security ring model. Kmode is very limited and essentially for software interrupts.

*48* The scheduler decides when processes run. (All this changes in 2.6.) The main benefit of the 2.4 scheduler is that it is simple. Processes can be sleeping (on a wait queue) or runnable. POSIX 4 also defines soft realtime priorities -- higher than any other process on the system. Most systems have none of these running; their primary use is data acquisition. Safety tip: if you're writing or debugging a "soft realtime" process, make sure that you have a shell runnable at the same realtime priority. Mostly it will just sleep with nothing to do, but if the soft realtime process blocks or goes into an infinite loop, that shell is the only thing which can kill it.

The realtime Linux patches allow the scheduler to gain control more quickly from system routines. The POSIX 4 designation allows the high-priority process to gain control once the scheduler has it.

*49* The scheduler computes a 'goodness value' which determines which process runs next. Normal processes have a counter variable which says how many clock ticks they may run before being preempted. They also get a priority boost on the CPU they last ran on, so that the CPU can use its memory cache more effectively.

*50* The scheduler is clever: if you have some process which is mostly sleeping (e.g. the shell), it accumulates credits, so when it wakes up (you move the mouse or hit a key) it has higher priority. This makes the system interact well with the outside world. As load on a system rises, scheduling overhead falls, because the scheduler runs less frequently. This is a behavior you want. (90 percent of the cost of a phone call is the accounting for it -- one reason for the current spate of 'all the minutes you want' plans: they're actually more efficient and hence more profitable than more complex plans.) Many academic schedulers soak up so many resources ensuring fairness that all processes would be better off if the scheduler didn't run at all.

*53* A per-CPU flag, need_resched: on the return path, instead of returning to the process, the kernel returns to the scheduler. The realtime patches change this so it can happen inside a system call.

Java has no select() or poll() -- you must start another thread and block it on any I/O. So every socket must have two threads associated with it, one read thread and one write thread, and a Java benchmark with 10,000 connections means 20,000 threads. (Java 3 has an actual select() call, which makes this a little better.) The O(1) scheduler maintains separate run queues for each CPU, so each queue can be rescheduled independently. It's called the O(1) scheduler because it takes constant time; the older scheduler slows down linearly as the number of threads on the machine goes up. Several efforts to solve the Java problem are happening simultaneously on different fronts:
* the O(1) scheduler
* a better Java
* multiplexing virtual threads onto real threads (NGPT and Ingo's are user-space threading libraries which are linked in explicitly)
* kernel preemption -- though some kernel code and device drivers may fail badly if this is turned on

LinuxThreads can have issues when threading models are mixed across forked/execed executables:

    Proc A (LinuxThreads)
      |
      +-- forks Proc B (POSIX threads)

Proc A segfaults, but some threads in Proc B keep running.

*57* VM. Different views of memory:
* Kernel memory -- below 1 GB, mapped 1-to-1 to physical memory; other parts may be different. (Linus Torvalds walked in here.)
* Bus memory -- identical to physical memory on i386, but not necessarily on non-Intel machines.
* User mode -- the per-process memory map.

Every physical page of memory has a struct page associated with it: in 2.2 it is an unsigned long, in 2.4 a pointer to struct page, and the 2.6 rmap work means more changes to this model. If highmem is not available, a request for high memory must be satisfied from DMA or normal memory. If there is a read error, the up-to-date bit is not set. This is currently the only I/O state we have; some FS folks are getting unhappy with that, and it may change. On 2.2+ you also have the 'slab' allocator, which can allocate sub-page amounts of memory.

*62* (order) is a power-of-two number of pages: order 1 = 2 pages, 2 = 4, 3 = 8, et cetera.
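In code, that order is the second argument to the 2.4 buddy allocator. A sketch; the grab_buffer name is mine.

    /* The 'order' argument to the 2.4 page allocator: order n returns
     * 2^n physically contiguous pages. */
    #include <linux/mm.h>

    static unsigned long grab_buffer(void)
    {
            /* order 3 = 8 contiguous pages = 32K with 4K pages.  High
             * orders fail more easily once memory fragments, so always
             * check for failure. */
            unsigned long buf = __get_free_pages(GFP_KERNEL, 3);
            if (!buf)
                    return 0;           /* caller must cope */
            return buf;                 /* later: free_pages(buf, 3) */
    }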
Mostly you don't need physically contiguous memory unless you are working with a frame grabber or something else which does not understand discontiguous memory, so this allocation path is only used by low-level stuff.

*61* Highmem is generally reserved for stuff rarely touched by the kernel. Free pages are arranged as a buddy heap: if two adjacent blocks are free, they are coalesced into one larger block. But there's no other garbage collection, and in practice the buddy won't necessarily be freed, so you can't easily coalesce back up. At boot time you can get arbitrarily large pieces of contiguous memory, but after that it gets a little tricky. A page size of 4K is, in practice, a hard-coded architecture constant.

*64* The slab allocator first showed up in Solaris, in a Usenix paper 11 years ago. It reduces the Translation Lookaside Buffer (TLB) footprint. Solaris sets up some fields to auto-initialize on the first allocation of the page, so reallocation won't re-init them; Linux does _not_ do this.

*67* The low 0.9 GB of physical memory is mapped into the kernel's address space above 3 GB. Memory between 0 and 3 GB contains whatever user process is currently running. So the mapping above 3 GB won't change, but below 3 GB changes randomly. The lowest 0.9 GB is identity-mapped, shifted up above 3 GB; above that is high memory. High memory is only seen in two ways: explicitly requested, or already mapped. kmap does not allocate memory -- you give it a struct page pointer and it creates a page table entry in the high 128 MB, so the page becomes visible to the rest of the system. The translation lookaside buffer (TLB) translates a kernel address into physical memory without reference to the page table, so it is a lot faster. But when you clear a page table entry, you must clear it from the TLB as well.

*68* If you need to emulate the accessed bit, set the page's not-present bit; when an access to the page faults, set the present and accessed bits.

*70* A process starts with a new, empty page table, so the first memory access causes a page fault, and we work on from there.

*71* A ring buffer is reserved for kmap; we only flush the TLB when it runs out. This matters because we must flush the TLBs of all processors if we're in SMP mode.

*72* Use these APIs even though it may not matter on your architecture, because it _will_ matter on others.

*76* The PCI bus has its own scheme for working with this. *77* Most systems have only one PCI bus, but big scary things may have more than one. Sometimes a PCI bus is accessible from only one CPU. *79* Bridges also have their own nodes in pci_dev. You must turn things on and off in the right order, so that you turn off all devices on a bus before turning off the bus itself. Otherwise it's like sitting on a tree limb and sawing it off between you and the trunk.

*80* Jumping up a level: vm_area structs describe what the various ranges of user space contain, e.g. whether they're swapped out, et cetera.

*81* This is the first example of using an array of structs as a virtual function table for a given area of memory -- the function-pointer idiom sketched earlier, used frequently in kernel code. This array of function pointers tells what to do in response to a page fault, et cetera.

*82* Most often seen if you're writing an ioctl handler. This features some scary gcc magic so no type specification is required on the pointer. On the x86, the FS segment register always pointed to user space, so the old names memcpy_tofs and memcpy_fromfs were a blatant reference to the x86 architecture; they've since been renamed (copy_to_user and copy_from_user). These functions will cause a page fault if the user memory is not resident, so watch out if you are in a critical section.
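A 2.4-style sketch of such an ioctl handler, with invented mydev names. Note the copy can sleep, which is exactly the critical-section warning above.

    /* Moving data from user space in an ioctl handler, 2.4-style. */
    #include <linux/fs.h>
    #include <linux/errno.h>
    #include <asm/uaccess.h>

    struct mydev_config {           /* invented example layout */
            int rate;
            int depth;
    };

    static int mydev_ioctl(struct inode *inode, struct file *filp,
                           unsigned int cmd, unsigned long arg)
    {
            struct mydev_config cfg;

            /* copy_from_user may fault the user page in -- i.e. it can
             * sleep -- so never call it while holding a spinlock.  It
             * returns the number of bytes it could NOT copy. */
            if (copy_from_user(&cfg, (void *)arg, sizeof(cfg)))
                    return -EFAULT;
            /* ... validate and apply cfg ... */
            return 0;
    }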
*83* verify_area: you must call it, for security -- otherwise, after 2.2, your function can be used to overwrite kernel memory. The exception handler is written so that if someone passes in an illegal pointer, the faulting address is looked up in an exception table to work out where it came from and what to do next. Good calls go fast; errors take as long as they need.

*84* COW is used for ptrace &c. Ptrace won't touch I/O mappings, since you don't know what will happen if you read that stuff. It handles its own security. Heavy magic here.

*89* shm_fs is a temporary fs for handling this, present even when nobody is using it. *90* It is also used as a temporary FS which goes away on every reboot. This is a Good Thing, because otherwise my /tmp would become my home directory -- I have stuff there which I depend on for months before clearing it out...

*91* Modern mallocs use the memory right above the data segment for small allocations, but mmap for large ones. Malloc does this so sbrk doesn't get stuck high up in process space when a large allocation is followed by a small request and the large allocation is then freed. But mmap is expensive, so sometimes it can be _very_ slow. You can tune malloc toward either approach (see the mallopt sketch below): frequent large allocs and deallocs should use sbrk rather than the standard technique.

*92* Don't put large arrays on the stack, because you can run out of kernel stack space. It's hard to figure out which memory to free: a perfect memory freer would look into the future and free the least-used memory. Failing that, we are stuck with various algorithms that make good guesses based on past behavior. Allocate at device setup, not during use, if at all possible. Remember that network drivers can drop packets on the floor if they get into trouble, e.g. when a memory allocation fails.

*95* kswapd wakes up and initiates I/O, as well as finding clean pages. *97* Assumes past performance equals future returns. Very simple, so that fork/exec/exit are fast. *99* There is concern over whether the benefits exceed the costs on many systems, but as of now it will be in 2.6.

*100* Swap cache: in 2.4, pages in swap are not released until possibly long after they come back into RAM. The madvise call can pass hints to the kernel about the best strategy for reclaiming memory, but very few people use it. Some flags make no difference at all, per Linus. Theodore Ts'o says that madvise does change the algorithm used to free memory, but in practice it's difficult to see any performance change. Relying on user advice for performance is bad, because sysadmins and application programmers tend to be clueless about these things.

*** I left for 30 minutes or so here ***

*124* 2.4.9 is stable, but not in low-memory situations, because the balance between the buffer cache and the page cache was wrong. *125* This is how scatter-gather I/O got implemented. *126* ll_rw_block temporarily holds I/O requests and then merges adjacent ones. On large requests this meant you: chopped the request into little chunks; glued the little chunks back together into one big request; then filled the request. Large I/O requests on SCSI were thus often CPU bound (!). 2.6 replaces this interface with a sane one -- a painful change, because a lot of code was affected.

*127* Paging uses struct buffer_head so we don't have to rewrite every block device driver out there. It also assumes the VM page size is greater than or equal to the filesystem block size; modern drives use 32K blocks, but the VM page layer assumes 4K blocks. The buffer cache is gradually being made less significant, so that struct buffer_head can be phased out.
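The malloc tuning mentioned back at *91* is done through mallopt(3) in glibc. A user-space sketch; the 512 KB threshold is an arbitrary example.

    /* Tuning glibc malloc's sbrk-vs-mmap split with mallopt(3). */
    #include <malloc.h>
    #include <stdlib.h>

    int main(void)
    {
            /* Requests at or above the threshold are served by mmap()
             * and returned to the kernel on free(); smaller ones come
             * from the sbrk-grown heap.  Raising it keeps frequent
             * large alloc/free cycles on the cheap sbrk path. */
            mallopt(M_MMAP_THRESHOLD, 512 * 1024);

            char *big = malloc(256 * 1024);  /* below threshold now: sbrk heap */
            char *huge = malloc(1024 * 1024); /* above threshold: mmap */
            free(big);
            free(huge);
            return 0;
    }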
*128* The best way to make mmap work is to use the page cache. But vfat does not use this method. *130* All these functions are in mm_map.c. The only thing a file system must provide is a get_block method to fetch the actual contents given an inode and block number. *131* The I/O scheduling mechanism from the buffer cache is reused for the page cache, even though the page cache is not on the same list. Pages must be cleaned here under memory pressure, so they can be reclaimed for other processes. *132* Memory-mapped files are flushed later than regularly written files, to match application coders' expectations. Exec an ELF binary and, in 2.4, the binary is mmapped to where the executable goes, then dropped into place. *134* "Except we got rid of cylinder groups."

*140* Networking. Unlike everything else, the network may lie to you. You need not be paranoid about the disk drive -- it's not out to get you, and it basically only reacts. Networking happens at arbitrary times. And performance is critical. Linux will never have SysV streams: they solve problem (3) below, but at a large cost in performance. Code in the networking layer is mostly easy to understand, but it is amazingly sensitive, because it's tweaked and tuned heavily for performance.

*141* A pretty good generic list, but for networking these points are critical:
1. Memory copies are expensive. You don't think of memcpy as expensive in most applications, but in networking it is.
2. Do as much as possible _around_ a copy, e.g. copy + checksum (see the sketch at the end of these networking notes). Memory is so slow compared to the processor that we can checksum each byte while waiting for memory to respond, so the checksum is basically free, hidden in the memory latency.
3. Align headers on cache-line boundaries. In practice this means 16- or 32-byte cache lines. With headers cache-aligned, a missed fetch gets 16 bytes, so your structure is exactly one cache read.
4. It does not take long to overflow the network card's buffer. Since you have end-to-end retransmission, under memory or CPU congestion you can just drop a packet on the floor. If you must drop a packet, drop it as early as possible, so you don't waste time processing it further. This is counter-intuitive.
5. Routing is expensive. Do it once and cache the result, just like the TLB in the virtual memory system.

*142* We do have some things arranged as a stack, but it's not as modular and flexible as Unix/BSD. It's faster, though. *143* Queue discipline: control the order in which packets are sent to a device, so that some kinds of packets have higher priority than others. *146* If this changed, everything else would change. *147* No linked list here as BSD has, so it is faster. *148* eth0, ppp0 etc., plus a virtual function table. *149* Queueing discipline can reorganize the queue for best interactivity. *151* Related to the next layer up -- the routing decision. Rate limiting here is separate from the traffic shaper, and redundant. *152* Received packets were placed on a backlog queue in 2.2. 2.4 uses the softnet architecture, so every CPU has its own transmit and receive soft IRQs running in parallel -- cf. the Mindcraft benchmarks. *153* Interrupt binding ties one network card to one CPU. Good for the Mindcraft benchmark, bad for most real life.

*156* IP routing is shared between UDP and TCP, so it is abstracted out. Academically this is unpleasing, but it is a sensible implementation choice to minimize interface and maintenance costs. ARP is handled in a library routine called by the device drivers only when needed, because routing is really only needed by TCP/IP.
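The copy + checksum trick from item 2 of *141* exists as a real primitive in the 2.4 tree. A hedged sketch: the wrapper name is mine, and the exact signature varies by architecture and version.

    /* One pass both moves the bytes AND accumulates the Internet
     * checksum; the adds hide behind the memory stalls. */
    #include <net/checksum.h>

    static unsigned int copy_and_checksum(char *dst, const char *src, int len)
    {
            /* The last argument seeds the checksum accumulator. */
            return csum_partial_copy_nocheck(src, dst, len, 0);
    }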
*157* Alexei -- the mad Russian -- wanted to make Linux into a Cisco replacement. He succeeded surprisingly well for lower-end boxes, but for really high-end stuff, e.g. OC3 &c., Cisco uses dedicated hardware which a PC running Linux of course cannot match.

*158* Fragments have a timer associated with them; if it runs out, you throw them away. Fragmenting is bad because if you lose even one fragment you must retransmit the entire packet.

*159* pppoe added 6 bytes of overhead to every packet, thus blowing the maximum packet size on a lot of routers. The normal case is to send an ICMP packet back to the sender asking it to make its packets smaller. But many dain-bramaged firewalls will not let ICMP packets through, since they can be used for DoS attacks. The result is that the packet appears to be simply dropped, and the sender resends it until timeout. The ISPs solved this by actually changing the MTU of the packets they ship, on the fly, thus messing with content -- a horrible hack, but necessary, because in many aspects of CS there's an "if you touched it last, it's your fault" theory. In Linux you can actually set a per-route MTU, as well as a per-device one, which is useful in this case.

*160* Every fragment consumes kernel memory, so we must have a rate limiter. *162* Most folks use the Sun route(8) command, but the iproute(8) command has far more flexibility. You can use it to route by packet source as well as packet destination, for example. *164* Somewhat slow on lookup, but frequent queries to this table are cached. *166* A horrid hack that works. Rusty separated that from net filtering.

*171* TCP hangs onto a packet for 30-40 seconds. It also has a congestion window governing how many bytes of outgoing data can be outstanding without acknowledgement. Van Jacobson did a lot of this, based on hydrodynamics. TCP optimizes the use of a shared network automagically, since everyone throttles back on congestion. An evil TCP/IP implementation might exploit this by being more aggressive than the others, soaking up whatever bandwidth is left. Figuring out router technology to detect and penalize cheaters is an active area of research right now.

*175* Provides extra knobs into various kernel parameters -- how long to wait before closing out a journal transaction, for instance. Also turns on debugging &c. The switches turn up in /proc/sys.

*176* A great way to speed up the edit/compile/debug cycle on a device driver is to develop it as a module (a skeleton follows below). That way you don't have to stop and start the kernel most of the time; you can just unload the module. You can also use a sacrificial machine, optimized for fast boot by hardware and configuration, with a tftp-bootable floppy that fetches the kernel from your development machine.

*178* printk will not work very, very early in the boot sequence (before virtual memory); the messages get saved up and printed when the system is up (assuming it comes up). Linus is prejudiced against debuggers because it's easy to use them sloppily: you see that x = 3 where it should be 4, and rather than figure out the exact cause, you simply increment it just before. This is bad. But there are some good kernel debuggers around. kdb is an in-kernel debugger patch, and there's also a way to connect gdb to a kernel running on another machine; in that case, you get full symbolic debugging, with source line numbering and everything. Also, with user-mode Linux you can run a kernel as an ordinary Linux user process and simply gdb it. But user-mode Linux won't work with device drivers.
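The develop-as-a-module workflow from *176*, as a minimal 2.4-style skeleton ("hello" is a placeholder name):

    /* Minimal 2.4-style module skeleton. */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>

    static int __init hello_init(void)
    {
            printk(KERN_INFO "hello: loaded\n");
            return 0;               /* nonzero aborts the load */
    }

    static void __exit hello_exit(void)
    {
            printk(KERN_INFO "hello: unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);
    MODULE_LICENSE("GPL");          /* keeps the "tainted" flag clear */

Build it against the kernel headers, then insmod hello.o to load and rmmod hello to unload -- no reboot anywhere in the cycle.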
Panic early, panic often. The bosses where TOPS-10 was developed decided that customers were complaining too often about debug halts, so they directed the engineers to replace all debug halts with a debug message and have the OS continue to march. The results were disastrous: the operating system staggered on, mortally wounded, often corrupting data. Much better to go down early and hard at the first sign of a problem. You can set an option in /proc/sys to reboot immediately on panic. You can also use hardware and software watchdog timers in production systems.

*179* e.g. there is no standard Linux system call table; we re-use this stuff from the architecture we port from. So on ix86 we use the Minix syscall table, Sparc uses the Solaris one, et cetera.

*180* This is out of date now. Scaling numbers will double with 2.6, but we probably won't go much beyond 32-way, if that.

*******

Poor-mouthed Java in Eric Raymond's "Zen of Unix" talk and found myself talking to Brandon Wiley (Tristero, http://tristero.sourceforge.net) late in the evening. I poured my heart out to him about the inadequacies of Java in a mixed environment, especially wrt our troubles reading the environment and execing programs written in other languages, and he agreed that Java had major weaknesses there. He did mention that loading a new class in a running JVM is tricky but possible, and that he had some GPL code which might help us. He also asked about a job, although he lives in Austin. Could be a cool dude, I thought. Card says brandon@blanu.net, (512) 750-8474, http://blanu.net.