Day 2, 21 Oct 2002

Today was a full day of Linux Kernel Internals, basically an introduction to the kernel source code. The speaker was Theodore Ts'o, who was dynamic enough to keep things moving for most of the day and kept me awake for almost all of it. Impressive. The course covered the major elements of the Linux kernel from about a 1,000-foot level -- just enough to find the major subsystem where a kernel might be having trouble, or to figure out what parts need rewriting for a modification. It was based mostly on 2.2 and 2.4 code. Numbers in * * are slide numbers.

*9* Started with a short history of Linux, a 'hobby which got out of control'. Originally Linux was an entertainment, a way to play with the multitasking and scheduling capabilities of the i386 chip. For a while it was a terminal program to talk to Linus's grad school. Then a famous flame war erupted between Linus and Andrew Tanenbaum over whether a monolithic kernel or a microkernel is the best way to implement an OS. So far Linux is winning, for a variety of reasons.

It's almost 100% original GPLed code, with perhaps 8K of borrowed code, such as the Jacobson PPP code which nobody wanted to rewrite. The GPL is good in several ways. The BSD internals have not changed in a good 12 to 15 years, which is good for experimenting with novel TCP/IP implementations, but bad for efficiency. For example, the TTY character queueing assumes it is running on a VAX: the character queue is a bunch of tiny nodes in a linked list, instead of just an array, a legacy of limitations in the VAX tty hardware. Linux, by contrast, has gone through three or four complete reimplementations of the network layer, hence inefficiencies like this have been ironed out. It is not locked in to yesterday's design decisions. BSD also has stuff like "cylinder groups", which use the fact that each track on a disk has a constant number of sectors to minimize horizontal seeks on disk platter stacks. Unfortunately, modern disks do not have this property -- they are constant density, not constant sectors. Lots of complexity in BSD's disk access code was devoted to trying to work this out heuristically; it never really worked and eventually was dropped -- but the cylinder group code still persists in the disk access code. ext2, by contrast, has block groups (which work) but not cylinder groups (which do not).

The kernel is the first truly distributed open-source project, mostly because it came along at exactly the right time in sociology and technology. In 1991, the arpanet was downright painful to use. It took as long as 2 minutes to successfully route around a problem, and because nodes were very slow, an overload looked like a failure. So the internet would stay down for too long to do real distributed development... The kernel is developed in a distributed hierarchical model, with one main developer per module or subsystem -- say, Dave Miller in charge of the networking stack -- and Linus in charge of all. It is a working meritocracy. Developers try hard not to let corporate affiliation affect coding decisions, and Linus went to work for Transmeta deliberately to avoid even the appearance of a corporate bias. If a patch is good, eventually it will make it into the kernel.

Kernel development strongly discourages binary-only modules, because the kernel code changes so fast that you must keep recompiling for each new version. For example, essential control structures get re-ordered for efficiency, forcing recompilation of all drivers to pick up the new structs.
If you're trying to maintain binary-only code, you must now ship builds for both the new and the old versions of the kernel, and that is just for a point revision in the stable series. Discouraging binary-only modules also makes the system more stable, because more source code is available for bug-fixing. Newer kernels have a "tainted" flag in kernel dumps which tells whether binary-only code was in the kernel at the time of a panic. If you submit such a dump to the kernel mailing list, the first response you will get is a request to remove the binary-only code and try to reproduce the bug. Only if you can reproduce the bug on an untainted kernel will you get help. This policy arose because many binary-only drivers caused problems which manifested themselves only very remotely from the driver itself, leading to wild goose chases where the real problem was outside the ability of anyone without source access to repair.

Linux is a monolithic kernel, so there are no protection boundaries between the various subsystems. Microkernels pay a lot of overhead to cross protection boundaries, which is why most microkernel OSs end up with almost everything inside one protection boundary for performance. The original NT had its graphics code outside of the main code, but the overhead of crossing the protection boundary meant that you couldn't run Quake on it worth a damn. Versions after 3.5 moved the graphics code into the same protection boundary as the rest of the system for performance reasons, but that made the OS radically less stable, since display driver writers were used to a much greater degree of protection from their bugs. The Linux code is nonetheless modular from a software engineering point of view: a networking driver should not be changing file system data structures, even though no specific mechanism exists to ensure that it cannot.

Portable -- only the asm and arch stuff is truly machine dependent; the common code is generally very portable. 1.0.* only worked on Intel, but over time that has changed. The kernel list is pragmatic rather than zealous about portability, unlike, say, NetBSD. The result is that NetBSD did not have PCMCIA support until 3 full years after Linux, because they had to do so much infrastructure work first. With PCMCIA, since you can jerk the card out of the machine at any time, you must worry about all sorts of edge conditions which could crash the entire OS. But because they wrote all that infrastructure for PCMCIA, NetBSD had USB support a full year before Linux did.

Linux can be thought of as an object-oriented system. Although we don't use C++, we make heavy use of function pointers, so that the same logic can be applied to multiple different implementations.
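To make that concrete, here is a minimal sketch of the idiom in plain C, modeled loosely on the kernel's struct file_operations. The toy_* names are invented for illustration; this is not kernel code.

    /* Function pointers as a "virtual function table", modeled loosely
     * on the kernel's struct file_operations.  The toy_* names are
     * made up for illustration. */
    #include <stddef.h>

    struct toy_device;

    struct toy_ops {
            int  (*open)(struct toy_device *dev);
            int  (*read)(struct toy_device *dev, char *buf, size_t len);
            void (*close)(struct toy_device *dev);
    };

    struct toy_device {
            const struct toy_ops *ops;  /* this device type's "vtable" */
            void *private_data;         /* per-implementation state */
    };

    /* Generic code dispatches through the table without knowing which
     * concrete implementation it is talking to. */
    static int toy_read(struct toy_device *dev, char *buf, size_t len)
    {
            if (!dev->ops->read)
                    return -1;          /* operation not supported */
            return dev->ops->read(dev, buf, len);
    }

Every driver or filesystem fills in a table like this and the generic layer calls through it -- polymorphism without C++.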
OS as resource manager -- physical resources (memory, disk) and logical resources (a mutually consenting illusion), such as pids, users, processes, etc. A resource manager must provide: (1) performance -- manage resources as efficiently as possible; (2) security -- if someone has a resource, make sure that someone else cannot interfere with it; (3) robustness -- cope gracefully with hard disk or memory system problems.

Multics: old timers always say "But Multics did...". On a Multics system, you could fsck a mounted volume while it was in use. It had magnetic core memory and extreme robustness: memory could be powered off and on without loss, and Multics would simply come back where it left off. In fact, you could even turn off portions of main memory while the OS was running, and only the processes in the lost memory would die.

Hardware was so bad in those days that sometimes something would pull down an interrupt, then report the wrong service number. So the OS would respond to the interrupt which said tty3 had something to say, and ask tty3 for it, but tty3 would say "Who, me?". There was a section of the Multics code entitled "Timmy, Lassie's trying to tell us something" which was devoted to straightening this mess out.

(4) Conformity -- an OS must export the standard Unix APIs. A classic problem of this nature is readdir/telldir/seekdir. This API is broken if you're using, say, a B-tree file system, but you must implement it to stay POSIX compliant. Rob Pike wrote a famous paper in which he said "operating system research is dead"; he was embittered because 80% of Plan 9 was just stuff to do POSIX compliance. JFS keeps an entirely separate B-tree on disk just so readdir/telldir/seekdir works properly.

Performance: efficient use of space and time, high throughput, predictability, fairness -- all mutually limiting.

Local resources are:
* reliable
* trusted
* always there
* always necessary

Sudden death if one is not present is acceptable. Memory is a local resource on Linux, but on Multics it was a remote resource.

Kernel vs. process context. Global resources: kernel memory, I/O, DMA and IRQ channels, time and timers. Resource registration exists to prevent device conflicts, improve autoprobe stability, let the admin read the current configuration, and track dynamically loaded modules. The mechanism is advisory, not enforced -- in kernel context, you are omnipotent.

IRQ registration: request_irq (like sigaction). SA_INTERRUPT marks a fast interrupt -- all other interrupts are disabled on this CPU, and some overhead is skipped, e.g. the segment registers are not saved before executing the handler. SA_SHIRQ means multiple handlers can share this interrupt vector. Older ISA cards could damage the actual TTL logic if one chip tried to drive the line high while another drove it low; that could short all 5 volts to ground. Now cheap PCI machines use only one interrupt for all cards. This is Not Good. SA_SAMPLERANDOM says this interrupt can be sampled for /dev/random. The timer interrupt, for example, is a poor choice for this, because it is completely nonrandom. But the disk drive interrupt could be good: it varies chaotically with system load, disk seek conditions, and even the air currents inside the drive. Network cards are also quite unpredictable, but may be a poor choice because their interrupts can be predicted from inside your own LAN.
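Here is roughly what that registration looks like in a 2.4-flavored driver. A hedged sketch: the mydev names are invented, and header locations and flag spellings shift between kernel versions.

    /* A 2.4-flavored sketch of request_irq(); the mydev names are
     * invented, and headers move around between kernel versions. */
    #include <linux/sched.h>
    #include <linux/interrupt.h>
    #include <linux/errno.h>

    struct mydev {
            unsigned int irq;       /* the line our card was assigned */
            /* ... device state ... */
    };

    /* 2.4 handlers return void.  They borrow someone else's process
     * context, so no sleeping or blocking in here. */
    static void mydev_interrupt(int irq, void *dev_id, struct pt_regs *regs)
    {
            struct mydev *dev = dev_id;
            /* acknowledge the card, queue the real work for a
             * bottom half or tasklet */
            (void)dev;
    }

    static int mydev_setup(struct mydev *dev)
    {
            /* SA_SHIRQ: willing to share the line (safe on PCI); dev is
             * handed back as dev_id so shared handlers can tell whose
             * card fired.  SA_SAMPLERANDOM: let this IRQ's timing feed
             * the /dev/random entropy pool. */
            if (request_irq(dev->irq, mydev_interrupt,
                            SA_SHIRQ | SA_SAMPLERANDOM, "mydev", dev))
                    return -EBUSY;
            return 0;
    }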
Bottom-half black magic: there used to be a request_dma and a request_io call, but in 2.4 there is a struct resource tree, and various drivers can grab whole portions of the tree. This is a bigger problem on architectures other than i386.

*26* Scheduling. All kernel code -- except some specific code at the very beginning of boot time -- runs in a process context ("foreground kernel code"). Kernel tasks are never preempted by other processes; kernel code yields control explicitly, or implicitly by page fault. Interrupts 'borrow' someone else's process context, so you cannot sleep or block inside an interrupt. Most interrupts allow a second interrupt. Interrupts can request bottom-half handling: network packets can take a long time to route, so the interrupt simply puts the packet on a queue, then requests the bottom-half handler to take care of it on return. (Top half: user-level code; bottom half: system-level code.) The name "bottom half" is another name for a "soft IRQ" -- which Linus said was bad, evil and wrong, but it snuck in under a different name.

Bottom-half routines may not block. The bottom-half queue is global, not per-processor. Do not wait for I/O to complete in a bottom-half routine. Kernel coding means you're often restricted in what you can do. 2.2 kernels have only the simple bottom-half support. The 2.4 kernel has:
* the 2.2 bottom-half compatibility model
* soft IRQs -- can run on multiple CPUs at once, but you are responsible for multithread support
* tasklets -- dynamically registered soft IRQs which can only run on one CPU at a time, so no need for explicit multithread support

Example: a high-priority clock tick handler paired with a low-priority clock-tick bottom half. The tick handler just increments a counter; the bottom-half handler updates the system clock, checks for timer expiry, et cetera.

*31* Kernel timers: the old style has a fixed number of timers, checked every clock tick if required. One-shot (new) timers (sched.c) are unlimited in number and implemented as a list of buckets, so only the bucket where a timer goes off needs to be checked.

*33* Kernel synchronization: wait queues, locks, semaphores. A wait queue is a linked list of processes waiting for a thing to happen. To join a wait queue, mark yourself asleep _first_, then check the wakeup condition, then call schedule() and retest until it's true. The wakeup() call makes you runnable again. Do it in any other order and characters can show up between when you put yourself on the wait queue and when you put yourself to sleep -- in which case you will never wake up to see them, and your process will hang. This happens a lot, which is why I'm spending time on it.
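The ordering rule is easier to see in code. A minimal 2.4-style sketch, with invented mydev names; the waker's side (say, the interrupt handler) simply sets the condition and calls wake_up(&mydev_wq).

    /* The sleep/wakeup ordering rule above, as a 2.4-style sketch. */
    #include <linux/sched.h>
    #include <linux/wait.h>

    static DECLARE_WAIT_QUEUE_HEAD(mydev_wq);
    static volatile int data_ready;     /* the wakeup condition */

    static void mydev_wait_for_data(void)
    {
            DECLARE_WAITQUEUE(wait, current);

            add_wait_queue(&mydev_wq, &wait);
            for (;;) {
                    /* Mark ourselves asleep BEFORE testing the condition.
                     * A wakeup that races in between just makes us
                     * runnable again, instead of being lost forever. */
                    set_current_state(TASK_INTERRUPTIBLE);
                    if (data_ready)
                            break;
                    schedule();         /* really go to sleep */
            }
            set_current_state(TASK_RUNNING);
            remove_wait_queue(&mydev_wq, &wait);
    }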
*36* BKL -- Big Kernel Lock. Several user-space processes can run at once in SMP, but only one process at a time can make a system call in 2.0.36. The BKL just spins, and the CPU in question does no work while it waits. This is only almost-SMP, since only workloads which make minimal use of system calls see a real performance boost. SMP scalability: how many _effective_ CPUs do you get from 4 CPUs? In 2.0.x, maybe 1.6, because the minute a user process makes a system call, everything else stops until it's finished. The BKL is a cheap and dirty way to get multi-CPU capability.

*37* The 2.2 and 2.4 kernels broke the locks down by subsystem, e.g. VM, networking, filesystem. This required more kernel changes than the BKL, but still pretty minimal ones. The Mindcraft benchmark forced the networking stack to become fine-grain locked, so that 4 network cards on 4 different CPUs could run completely in parallel. Linux still ran faster with one gigabit card on one CPU, but with 4 cards on 4 processors NT beat it before this patch. You can compile all this away by not selecting SMP. Linux 2.6 adds a preemptible kernel, with preemption points at every spinlock -- but this means a lot of overhead. Getting everything to run well on a 2-CPU system may not be good for a 16- or 32-way system; too many spinlocks increase overhead on smaller systems unnecessarily. Adding compile-time switches for different numbers of CPUs leads to a testing nightmare, because now you must test everything on 2-, 4-, 8-, and 16-processor systems instead of just one processor and "more". Scaling also depends on workload: doing nothing but matrix math looks great out to 32 processors, but on a real-world problem 4-way is best. A 2.6 system may go as far as 8-way. The VM and I/O layers are the most performance-sensitive parts of the kernel.

No spinlocks needed for the atomic operations: toggle, clear, set, and test-and-set bit; the xchg operator; inc, dec, add, and subtract on a memory location, all atomic. These are hardware on x86; other platforms have architecture-dependent layers to emulate them if hardware support is not present.

*38* New in 2.2: if you need to read one field in a struct and change another field based on it, you now need a semaphore, including a read/write semaphore. If the sequence (1) read lock, (2) write lock, (3) read lock occurs, does (3) wait until after the write or not? In 2.2, (3) waits until the write lock clears, because of the danger of lock starvation -- otherwise write locks might never be granted, as more read locks keep arriving before they can take effect. This is a fairness issue. A semaphore is a blocking operation, so you can use it when you have a process context; taking a semaphore allows someone else to run. Spinlocks exist only on SMP machines. They're lightweight because there is no schedule() call -- you just poll the lock until it is free. Best for short durations, e.g. updating pointers in a linked list, where you only spin for 12-15 cycles.

*40* Big Reader Lock, e.g. for the network routing table. This prevents "cache ping-pong" effects. Each CPU has its own cache memory. If a memory location cached on multiple CPUs is updated, then you must invalidate it in every CPU cache which contains it -- a substantial performance penalty. Memory is very slow relative to the CPU, and it has been getting further and further away.

Processes and memory maps. The global variable "current" points to whatever process is currently executing on this CPU. The swapper runs as a kernel thread and thus is allowed to block. Kernel threads are normally created at boot time; user-space programs strictly speaking cannot create a kernel thread. They're mostly created through system calls or loadable module initialization.

*44* File system context -- cwd, whether chrooted, open files, an index into the open files. Memory mappings include shared libraries, signal handlers, and any sigactions declared by the program. The clone() system call specifies which parts of the current context get shared and which get copied; fork() always copies. This means that if a fork()ed child closes a file, it's still open in the parent. But if a clone()d child closes a file and the fs context has also been clone()d, then the file closes in the parent too. clone() is specific to Linux but is used to implement POSIX threads; it is similar to Plan 9 and SGI's R4. clone() does not set up a new stack, so clone()d children must explicitly set up a new stack and move their stack pointer to it. The POSIX thread API is horrible. This is also why Java sucks on Linux: the implementations use old threading models, a result of the poor POSIX threading API available.
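A user-space sketch of the stack business, using the glibc clone() wrapper. The flag set here is illustrative only, not what any particular thread library does.

    /* clone(2) via the glibc wrapper.  CLONE_VM shares the address
     * space, CLONE_FILES the open-file table -- so a close() in the
     * child closes the file in the parent too. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define STACK_SIZE (64 * 1024)

    static int child_fn(void *arg)
    {
            printf("child: pid %d\n", (int)getpid());
            return 0;
    }

    int main(void)
    {
            char *stack = malloc(STACK_SIZE);
            if (!stack)
                    return 1;
            /* clone() does not create a stack for the child: we pass
             * the TOP of a region we allocated, since x86 stacks grow
             * down. */
            pid_t pid = clone(child_fn, stack + STACK_SIZE,
                              CLONE_VM | CLONE_FILES | SIGCHLD, NULL);
            if (pid < 0)
                    return 1;
            waitpid(pid, NULL, 0);  /* SIGCHLD on exit, so plain wait works */
            free(stack);
            return 0;
    }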
*45* Linux uses the same page tables in user and kernel mode, which is very different from most other OSs.

*46* boot -- ring 0 mode, the legacy 16-bit mode of the processor. Linux uses only rings 0 and 3 of the four-level ix86 security ring model. Kmode is very limited and essentially for software interrupts.

*48* The scheduler decides when processes run. (All this changes in 2.6.) The main benefit of the 2.4 scheduler is that it is simple. Processes can be sleeping (on a wait queue) or runnable. POSIX 4 also defines soft realtime priorities -- higher than any other process on the system. Most systems have none of these running; their primary use is data acquisition. Safety tip: if you're writing or debugging a "soft realtime" process, make sure that you have a shell runnable at the same realtime priority. Mostly it will just sleep with nothing to do, but if the soft realtime process blocks or goes into an infinite loop, that shell is the only thing which can kill it.

The realtime Linux patches allow the scheduler to gain control more quickly from system routines. The POSIX 4 designation allows the high-priority process to gain control once the scheduler has it.

*49* The scheduler computes a 'goodness value' which determines which process runs next. Normal processes have a counter variable which says how many clock ticks they may run before being preempted. They also get a priority boost on the CPU they last ran on, so that the CPU can use its memory cache more effectively.

*50* The scheduler is clever: if you have some process which is mostly sleeping (e.g. the shell), it accumulates credits, so when it wakes up (you move the mouse or hit a key) it has higher priority. This makes the system interact well with the outside world. As load on a system rises, scheduling overhead falls, because the scheduler runs less frequently. This is a behavior you want. (90 percent of the cost of a phone call is the accounting for it -- one reason for the current spate of 'all the minutes you want' plans: they're actually more efficient and hence more profitable than more complex plans.) Many academic schedulers soak up so many resources ensuring fairness that all processes would be better off if the scheduler didn't run at all.

*53* A per-CPU flag, need_resched: on the return path, instead of returning to the process, the kernel returns to the scheduler. The realtime patches change this so it can happen inside a system call.

Java has no select() or poll() -- you must start another thread and block it on any I/O. So every socket must have two threads associated with it, one read thread and one write thread, and a Java benchmark with 10,000 connections means 20,000 threads. (Java 3 has an actual select() call, which makes this a little better.) The O(1) scheduler maintains separate run queues for each CPU, so each queue can be rescheduled independently. It's called the O(1) scheduler because it takes constant time; the older scheduler slows down linearly as the number of threads on the machine goes up. Several efforts to solve the Java problem are happening simultaneously on different fronts:
* the O(1) scheduler
* a better Java
* multiplexing virtual threads onto real threads (NGPT and Ingo's are user-space threading libraries which are linked in explicitly)
* kernel preemption -- though some kernel code and device drivers may fail badly if this is turned on

LinuxThreads can have issues when threading models are mixed across forked/execed executables:

    Proc A (LinuxThreads)
      |
      +-- forks Proc B (POSIX threads)

Proc A segfaults, but some threads in Proc B keep running.

*57* VM. Different views of memory:
* Kernel memory -- below 1 GB, mapped 1-to-1 to physical memory; other parts may be different. (Linus Torvalds walked in here.)
* Bus memory -- identical to physical memory on i386, but not necessarily on non-Intel machines.
* User mode -- the per-process memory map.

Every physical page of memory has a struct page associated with it: in 2.2 it is an unsigned long, in 2.4 a pointer to struct page, and the 2.6 rmap work means more changes to this model. If highmem is not available, a request for high memory must be satisfied from DMA or normal memory. If there is a read error, the up-to-date bit is not set. This is currently the only I/O state we have; some FS folks are getting unhappy with that, and it may change. On 2.2+ you also have the 'slab' allocator, which can allocate sub-page amounts of memory.

*62* (order) is a power-of-two number of pages: order 1 = 2 pages, 2 = 4, 3 = 8, et cetera.
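In code, that order is the second argument to the 2.4 buddy allocator. A sketch; the grab_buffer name is mine.

    /* The 'order' argument to the 2.4 page allocator: order n returns
     * 2^n physically contiguous pages. */
    #include <linux/mm.h>

    static unsigned long grab_buffer(void)
    {
            /* order 3 = 8 contiguous pages = 32K with 4K pages.  High
             * orders fail more easily once memory fragments, so always
             * check for failure. */
            unsigned long buf = __get_free_pages(GFP_KERNEL, 3);
            if (!buf)
                    return 0;           /* caller must cope */
            return buf;                 /* later: free_pages(buf, 3) */
    }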
Mostly you don't need physically contiguous memory unless you are working with a frame grabber or something else which does not understand discontiguous memory, so this allocation path is only used by low-level stuff.

*61* Highmem is generally reserved for stuff rarely touched by the kernel. Free pages are arranged as a buddy heap: if two adjacent blocks are free, they are coalesced into one larger block. But there's no other garbage collection, and in practice the buddy won't necessarily be freed, so you can't easily coalesce back up. At boot time you can get arbitrarily large pieces of contiguous memory, but after that it gets a little tricky. A page size of 4K is, in practice, a hard-coded architecture constant.

*64* The slab allocator first showed up in Solaris, in a Usenix paper 11 years ago. It reduces the Translation Lookaside Buffer (TLB) footprint. Solaris sets up some fields to auto-initialize on the first allocation of the page, so reallocation won't re-init them; Linux does _not_ do this.

*67* The low 0.9 GB of physical memory is mapped into the kernel's address space above 3 GB. Memory between 0 and 3 GB contains whatever user process is currently running. So the mapping above 3 GB won't change, but below 3 GB changes randomly. The lowest 0.9 GB is identity-mapped, shifted up above 3 GB; above that is high memory. High memory is only seen in two ways: explicitly requested, or already mapped. kmap does not allocate memory -- you give it a struct page pointer and it creates a page table entry in the high 128 MB, so the page becomes visible to the rest of the system. The translation lookaside buffer (TLB) translates a kernel address into physical memory without reference to the page table, so it is a lot faster. But when you clear a page table entry, you must clear it from the TLB as well.

*68* If you need to emulate the accessed bit, set the page's not-present bit; when an access to the page faults, set the present and accessed bits.

*70* A process starts with a new, empty page table, so the first memory access causes a page fault, and we work on from there.

*71* A ring buffer is reserved for kmap; we only flush the TLB when it runs out. This matters because we must flush the TLBs of all processors if we're in SMP mode.

*72* Use these APIs even though it may not matter on your architecture, because it _will_ matter on others.

*76* The PCI bus has its own scheme for working with this. *77* Most systems have only one PCI bus, but big scary things may have more than one. Sometimes a PCI bus is accessible from only one CPU. *79* Bridges also have their own nodes in pci_dev. You must turn things on and off in the right order, so that you turn off all devices on a bus before turning off the bus itself. Otherwise it's like sitting on a tree limb and sawing it off between you and the trunk.

*80* Jumping up a level: vm_area structs describe what the various ranges of user space contain, e.g. whether they're swapped out, et cetera.

*81* This is the first example of using an array of structs as a virtual function table for a given area of memory -- the function-pointer idiom sketched earlier, used frequently in kernel code. This array of function pointers tells what to do in response to a page fault, et cetera.

*82* Most often seen if you're writing an ioctl handler. This features some scary gcc magic so no type specification is required on the pointer. On the x86, the FS segment register always pointed to user space, so the old names memcpy_tofs and memcpy_fromfs were a blatant reference to the x86 architecture; they've since been renamed (copy_to_user and copy_from_user). These functions will cause a page fault if the user memory is not resident, so watch out if you are in a critical section.
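A 2.4-style sketch of such an ioctl handler, with invented mydev names. Note the copy can sleep, which is exactly the critical-section warning above.

    /* Moving data from user space in an ioctl handler, 2.4-style. */
    #include <linux/fs.h>
    #include <linux/errno.h>
    #include <asm/uaccess.h>

    struct mydev_config {           /* invented example layout */
            int rate;
            int depth;
    };

    static int mydev_ioctl(struct inode *inode, struct file *filp,
                           unsigned int cmd, unsigned long arg)
    {
            struct mydev_config cfg;

            /* copy_from_user may fault the user page in -- i.e. it can
             * sleep -- so never call it while holding a spinlock.  It
             * returns the number of bytes it could NOT copy. */
            if (copy_from_user(&cfg, (void *)arg, sizeof(cfg)))
                    return -EFAULT;
            /* ... validate and apply cfg ... */
            return 0;
    }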
*83* verify_area: you must call it, for security -- otherwise, after 2.2, your function can be used to overwrite kernel memory. The exception handler is written so that if someone passes in an illegal pointer, the faulting address is looked up in an exception table to work out where it came from and what to do next. Good calls go fast; errors take as long as they need.

*84* COW is used for ptrace &c. Ptrace won't touch I/O mappings, since you don't know what will happen if you read that stuff. It handles its own security. Heavy magic here.

*89* shm_fs is a temporary fs for handling this, present even when nobody is using it. *90* It is also used as a temporary FS which goes away on every reboot. This is a Good Thing, because otherwise my /tmp would become my home directory -- I have stuff there which I depend on for months before clearing it out...

*91* Modern mallocs use the memory right above the data segment for small allocations, but mmap for large ones. Malloc does this so sbrk doesn't get stuck high up in process space when a large allocation is followed by a small request and the large allocation is then freed. But mmap is expensive, so sometimes it can be _very_ slow. You can tune malloc toward either approach (see the mallopt sketch below): frequent large allocs and deallocs should use sbrk rather than the standard technique.

*92* Don't put large arrays on the stack, because you can run out of kernel stack space. It's hard to figure out which memory to free: a perfect memory freer would look into the future and free the least-used memory. Failing that, we are stuck with various algorithms that make good guesses based on past behavior. Allocate at device setup, not during use, if at all possible. Remember that network drivers can drop packets on the floor if they get into trouble, e.g. when a memory allocation fails.

*95* kswapd wakes up and initiates I/O, as well as finding clean pages. *97* Assumes past performance equals future returns. Very simple, so that fork/exec/exit are fast. *99* There is concern over whether the benefits exceed the costs on many systems, but as of now it will be in 2.6.

*100* Swap cache: in 2.4, pages in swap are not released until possibly long after they come back into RAM. The madvise call can pass hints to the kernel about the best strategy for reclaiming memory, but very few people use it. Some flags make no difference at all, per Linus. Theodore Ts'o says that madvise does change the algorithm used to free memory, but in practice it's difficult to see any performance change. Relying on user advice for performance is bad, because sysadmins and application programmers tend to be clueless about these things.

*** I left for 30 minutes or so here ***

*124* 2.4.9 is stable, but not in low-memory situations, because the balance between the buffer cache and the page cache was wrong. *125* This is how scatter-gather I/O got implemented. *126* ll_rw_block temporarily holds I/O requests and then merges adjacent ones. On large requests this meant you: chopped the request into little chunks; glued the little chunks back together into one big request; then filled the request. Large I/O requests on SCSI were thus often CPU bound (!). 2.6 replaces this interface with a sane one -- a painful change, because a lot of code was affected.

*127* Paging uses struct buffer_head so we don't have to rewrite every block device driver out there. It also assumes the VM page size is greater than or equal to the filesystem block size; modern drives use 32K blocks, but the VM page layer assumes 4K blocks. The buffer cache is gradually being made less significant, so that struct buffer_head can be phased out.
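The malloc tuning mentioned back at *91* is done through mallopt(3) in glibc. A user-space sketch; the 512 KB threshold is an arbitrary example.

    /* Tuning glibc malloc's sbrk-vs-mmap split with mallopt(3). */
    #include <malloc.h>
    #include <stdlib.h>

    int main(void)
    {
            /* Requests at or above the threshold are served by mmap()
             * and returned to the kernel on free(); smaller ones come
             * from the sbrk-grown heap.  Raising it keeps frequent
             * large alloc/free cycles on the cheap sbrk path. */
            mallopt(M_MMAP_THRESHOLD, 512 * 1024);

            char *big = malloc(256 * 1024);  /* below threshold now: sbrk heap */
            char *huge = malloc(1024 * 1024); /* above threshold: mmap */
            free(big);
            free(huge);
            return 0;
    }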
*128* The best way to make mmap work is to use the page cache. But vfat does not use this method. *130* All these functions are in mm_map.c. The only thing a file system must provide is a get_block method to fetch the actual contents given an inode and block number. *131* The I/O scheduling mechanism from the buffer cache is reused for the page cache, even though the page cache is not on the same list. Pages must be cleaned here under memory pressure, so they can be reclaimed for other processes. *132* Memory-mapped files are flushed later than regularly written files, to match application coders' expectations. Exec an ELF binary and, in 2.4, the binary is mmapped to where the executable goes, then dropped into place. *134* "Except we got rid of cylinder groups."

*140* Networking. Unlike everything else, the network may lie to you. You need not be paranoid about the disk drive -- it's not out to get you, and it basically only reacts. Networking happens at arbitrary times. And performance is critical. Linux will never have SysV streams: they solve problem (3) below, but at a large cost in performance. Code in the networking layer is mostly easy to understand, but it is amazingly sensitive, because it's tweaked and tuned heavily for performance.

*141* A pretty good generic list, but for networking these points are critical:
1. Memory copies are expensive. You don't think of memcpy as expensive in most applications, but in networking it is.
2. Do as much as possible _around_ a copy, e.g. copy + checksum (see the sketch at the end of these networking notes). Memory is so slow compared to the processor that we can checksum each byte while waiting for memory to respond, so the checksum is basically free, hidden in the memory latency.
3. Align headers on cache-line boundaries. In practice this means 16- or 32-byte cache lines. With headers cache-aligned, a missed fetch gets 16 bytes, so your structure is exactly one cache read.
4. It does not take long to overflow the network card's buffer. Since you have end-to-end retransmission, under memory or CPU congestion you can just drop a packet on the floor. If you must drop a packet, drop it as early as possible, so you don't waste time processing it further. This is counter-intuitive.
5. Routing is expensive. Do it once and cache the result, just like the TLB in the virtual memory system.

*142* We do have some things arranged as a stack, but it's not as modular and flexible as Unix/BSD. It's faster, though. *143* Queue discipline: control the order in which packets are sent to a device, so that some kinds of packets have higher priority than others. *146* If this changed, everything else would change. *147* No linked list here as BSD has, so it is faster. *148* eth0, ppp0 etc., plus a virtual function table. *149* Queueing discipline can reorganize the queue for best interactivity. *151* Related to the next layer up -- the routing decision. Rate limiting here is separate from the traffic shaper, and redundant. *152* Received packets were placed on a backlog queue in 2.2. 2.4 uses the softnet architecture, so every CPU has its own transmit and receive soft IRQs running in parallel -- cf. the Mindcraft benchmarks. *153* Interrupt binding ties one network card to one CPU. Good for the Mindcraft benchmark, bad for most real life.

*156* IP routing is shared between UDP and TCP, so it is abstracted out. Academically this is unpleasing, but it is a sensible implementation choice to minimize interface and maintenance costs. ARP is handled in a library routine called by the device drivers only when needed, because routing is really only needed by TCP/IP.
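The copy + checksum trick from item 2 of *141* exists as a real primitive in the 2.4 tree. A hedged sketch: the wrapper name is mine, and the exact signature varies by architecture and version.

    /* One pass both moves the bytes AND accumulates the Internet
     * checksum; the adds hide behind the memory stalls. */
    #include <net/checksum.h>

    static unsigned int copy_and_checksum(char *dst, const char *src, int len)
    {
            /* The last argument seeds the checksum accumulator. */
            return csum_partial_copy_nocheck(src, dst, len, 0);
    }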
*157* Alexei -- the mad Russian -- wanted to make Linux into a Cisco replacement. He succeeded surprisingly well for lower-end boxes, but for really high-end stuff, e.g. OC3 &c., Cisco uses dedicated hardware which a PC running Linux of course cannot match.

*158* Fragments have a timer associated with them; if it runs out, you throw them away. Fragmenting is bad because if you lose even one fragment you must retransmit the entire packet.

*159* pppoe added 6 bytes of overhead to every packet, thus blowing the maximum packet size on a lot of routers. The normal case is to send an ICMP packet back to the sender asking it to make its packets smaller. But many dain-bramaged firewalls will not let ICMP packets through, since they can be used for DoS attacks. The result is that the packet appears to be simply dropped, and the sender resends it until timeout. The ISPs solved this by actually changing the MTU of the packets they ship, on the fly, thus messing with content -- a horrible hack, but necessary, because in many aspects of CS there's an "if you touched it last, it's your fault" theory. In Linux you can actually set a per-route MTU, as well as a per-device one, which is useful in this case.

*160* Every fragment consumes kernel memory, so we must have a rate limiter. *162* Most folks use the Sun route(8) command, but the iproute(8) command has far more flexibility. You can use it to route by packet source as well as packet destination, for example. *164* Somewhat slow on lookup, but frequent queries to this table are cached. *166* A horrid hack that works. Rusty separated that from net filtering.

*171* TCP hangs onto a packet for 30-40 seconds. It also has a congestion window governing how many bytes of outgoing data can be outstanding without acknowledgement. Van Jacobson did a lot of this, based on hydrodynamics. TCP optimizes the use of a shared network automagically, since everyone throttles back on congestion. An evil TCP/IP implementation might exploit this by being more aggressive than the others, soaking up whatever bandwidth is left. Figuring out router technology to detect and penalize cheaters is an active area of research right now.

*175* Provides extra knobs into various kernel parameters -- how long to wait before closing out a journal transaction, for instance. Also turns on debugging &c. The switches turn up in /proc/sys.

*176* A great way to speed up the edit/compile/debug cycle on a device driver is to develop it as a module (a skeleton follows below). That way you don't have to stop and start the kernel most of the time; you can just unload the module. You can also use a sacrificial machine, optimized for fast boot by hardware and configuration, with a tftp-bootable floppy that fetches the kernel from your development machine.

*178* printk will not work very, very early in the boot sequence (before virtual memory); the messages get saved up and printed when the system is up (assuming it comes up). Linus is prejudiced against debuggers because it's easy to use them sloppily: you see that x = 3 where it should be 4, and rather than figure out the exact cause, you simply increment it just before. This is bad. But there are some good kernel debuggers around. kdb is an in-kernel debugger patch, and there's also a way to connect gdb to a kernel running on another machine; in that case, you get full symbolic debugging, with source line numbering and everything. Also, with user-mode Linux you can run a kernel as an ordinary Linux user process and simply gdb it. But user-mode Linux won't work with device drivers.
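The develop-as-a-module workflow from *176*, as a minimal 2.4-style skeleton ("hello" is a placeholder name):

    /* Minimal 2.4-style module skeleton. */
    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/init.h>

    static int __init hello_init(void)
    {
            printk(KERN_INFO "hello: loaded\n");
            return 0;               /* nonzero aborts the load */
    }

    static void __exit hello_exit(void)
    {
            printk(KERN_INFO "hello: unloaded\n");
    }

    module_init(hello_init);
    module_exit(hello_exit);
    MODULE_LICENSE("GPL");          /* keeps the "tainted" flag clear */

Build it against the kernel headers, then insmod hello.o to load and rmmod hello to unload -- no reboot anywhere in the cycle.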
Panic early, panic often. The bosses where TOPS-10 was developed decided that customers were complaining too often about debug halts, so they directed the engineers to replace all debug halts with a debug message and have the OS continue to march. The results were disastrous: the operating system staggered on, mortally wounded, often corrupting data. Much better to go down early and hard at the first sign of a problem. You can set an option in /proc/sys to reboot immediately on panic. You can also use hardware and software watchdog timers in production systems.

*179* e.g. there is no standard Linux system call table; we re-use this stuff from the architecture we port from. So on ix86 we use the Minix syscall table, Sparc uses the Solaris one, et cetera.

*180* This is out of date now. Scaling numbers will double with 2.6, but we probably won't go much beyond 32-way, if that.

*******

Poor-mouthed Java in Eric Raymond's "Zen of Unix" talk and found myself talking to Brandon Wiley (Tristero, http://tristero.sourceforge.net) late in the evening. I poured my heart out to him about the inadequacies of Java in a mixed environment, especially wrt our troubles reading the environment and execing programs written in other languages, and he agreed that Java had major weaknesses there. He did mention that loading a new class in a running JVM is tricky but possible, and that he had some GPL code which might help us. He also asked about a job, although he lives in Austin. Could be a cool dude, I thought. Card says brandon@blanu.net, (512) 750-8474, http://blanu.net.