mm1 --- diff/Documentation/Changes 2003-10-09 09:47:33.000000000 +0100 +++ source/Documentation/Changes 2003-12-29 09:30:39.000000000 +0000 @@ -56,7 +56,7 @@ o e2fsprogs 1.29 # tune2fs o jfsutils 1.1.3 # fsck.jfs -V o reiserfsprogs 3.6.3 # reiserfsck -V 2>&1|grep reiserfsprogs -o xfsprogs 2.1.0 # xfs_db -V +o xfsprogs 2.6.0 # xfs_db -V o pcmcia-cs 3.1.21 # cardmgr -V o quota-tools 3.09 # quota -V o PPP 2.4.0 # pppd --version @@ -183,9 +183,8 @@ The latest version of xfsprogs contains mkfs.xfs, xfs_db, and the xfs_repair utilities, among others, for the XFS filesystem. It is architecture independent and any version from 2.0.0 onward should -work correctly with this version of the XFS kernel code. For the new -(v2) log format that has better support for stripe-size aligning on -LVM and MD devices at least xfsprogs 2.1.0 is needed. +work correctly with this version of the XFS kernel code (2.6.0 or +later is recommended, due to some significant improvements). Pcmcia-cs @@ -359,7 +358,7 @@ Xfsprogs -------- -o +o Pcmcia-cs --------- --- diff/Documentation/DocBook/kernel-locking.tmpl 2003-09-30 15:46:10.000000000 +0100 +++ source/Documentation/DocBook/kernel-locking.tmpl 2003-12-29 09:30:39.000000000 +0000 @@ -6,8 +6,7 @@ - Paul - Rusty + Rusty Russell
@@ -18,8 +17,8 @@ - 2000 - Paul Russell + 2003 + Rusty Russell @@ -58,16 +57,17 @@ Welcome, to Rusty's Remarkably Unreliable Guide to Kernel Locking issues. This document describes the locking systems in - the Linux Kernel as we approach 2.4. + the Linux Kernel in 2.6. - It looks like SMP - is here to stay; so everyone hacking on the kernel - these days needs to know the fundamentals of concurrency and locking - for SMP. + With the wide availability of HyperThreading, and preemption in the Linux + Kernel, everyone hacking on the kernel needs to know the + fundamentals of concurrency and locking for + SMP. - + The Problem With Concurrency (Skip this if you know what a Race Condition is). @@ -169,15 +169,23 @@ + + Race Conditions and Critical Regions - This overlap, where what actually happens depends on the - relative timing of multiple tasks, is called a race condition. + This overlap, where the result depends on the + relative timing of multiple tasks, is called a race condition. The piece of code containing the concurrency issue is called a - critical region. And especially since Linux starting running + critical region. And especially since Linux starting running on SMP machines, they became one of the major issues in kernel design and implementation. + Preemption can have the same effect, even if there is only one + CPU: by preempting one task during the critical region, we have + exactly the same race condition. In this case the thread which + preempts might run the critical region itself. + + The solution is to recognize when these simultaneous accesses occur, and use locks to make sure that only one instance can enter the critical region at any time. There are many @@ -185,10 +193,28 @@ And then there are the unfriendly primitives, but I'll pretend they don't exist. - + Locking in the Linux Kernel + + + If I could give you one piece of advice: never sleep with anyone + crazier than yourself. But if I had to give you advice on + locking: keep it simple. + + + + Be reluctant to introduce new locks. + + + + Strangely enough, this last one is the exact reverse of my advice when + you have slept with someone crazier than yourself. + And you should think about getting a big dog. + + + Two Main Types of Kernel Locks: Spinlocks and Semaphores @@ -213,23 +239,32 @@ Neither type of lock is recursive: see - . + . + Locks and Uniprocessor Kernels - For kernels compiled without CONFIG_SMP, spinlocks - do not exist at all. This is an excellent design decision: when - no-one else can run at the same time, there is no reason to - have a lock at all. + For kernels compiled without CONFIG_SMP, and + without CONFIG_PREEMPT spinlocks do not exist at + all. This is an excellent design decision: when no-one else can + run at the same time, there is no reason to have a lock. + + + + If the kernel is compiled without CONFIG_SMP, + but CONFIG_PREEMPT is set, then spinlocks + simply disable preemption, which is sufficient to prevent any + races. For most purposes, we can think of preemption as + equivalent to SMP, and not worry about it separately. You should always test your locking code with CONFIG_SMP - enabled, even if you don't have an SMP test box, because it - will still catch some (simple) kinds of deadlock. + and CONFIG_PREEMPT enabled, even if you don't have an SMP test box, because it + will still catch some kinds of locking bugs. @@ -239,25 +274,6 @@ - - Read/Write Lock Variants - - - Both spinlocks and semaphores have read/write variants: - rwlock_t and struct rw_semaphore. 
- These divide users into two classes: the readers and the writers. If - you are only reading the data, you can get a read lock, but to write to - the data you need the write lock. Many people can hold a read lock, - but a writer must be sole holder. - - - - This means much smoother locking if your code divides up - neatly along reader/writer lines. All the discussions below - also apply to read/write variants. - - - Locking Only In User Context @@ -289,17 +305,26 @@ - Locking Between User Context and BHs + Locking Between User Context and Softirqs - If a bottom half shares + If a softirq shares data with user context, you have two problems. Firstly, the current - user context can be interrupted by a bottom half, and secondly, the + user context can be interrupted by a softirq, and secondly, the critical region could be entered from another CPU. This is where spin_lock_bh() (include/linux/spinlock.h) is - used. It disables bottom halves on that CPU, then grabs the lock. - spin_unlock_bh() does the reverse. + used. It disables softirqs on that CPU, then grabs the lock. + spin_unlock_bh() does the reverse. (The + '_bh' suffix is a historical reference to "Bottom Halves", the + old name for software interrupts. It should really be + called spin_lock_softirq()' in a perfect world). + + + + Note that you can also use spin_lock_irq() + or spin_lock_irqsave() here, which stop + hardware interrupts as well: see . @@ -307,70 +332,41 @@ as well: the spin lock vanishes, and this macro simply becomes local_bh_disable() (include/linux/interrupt.h), which - protects you from the bottom half being run. + protects you from the softirq being run. - Locking Between User Context and Tasklets/Soft IRQs + Locking Between User Context and Tasklets - This is exactly the same as above, because - local_bh_disable() actually disables all - softirqs and tasklets - on that CPU as well. It should really be - called `local_softirq_disable()', but the name has been preserved - for historical reasons. Similarly, - spin_lock_bh() would now be called - spin_lock_softirq() in a perfect world. + This is exactly the same as above, because tasklets are actually run + from a softirq. - - Locking Between Bottom Halves + + Locking Between User Context and Timers - Sometimes a bottom half might want to share data with - another bottom half (especially remember that timers are run - off a bottom half). + This, too, is exactly the same as above, because timers are actually run from + a softirq. From a locking point of view, tasklets and timers + are identical. - - - The Same BH - - - Since a bottom half is never run on two CPUs at once, you - don't need to worry about your bottom half being run twice - at once, even on SMP. - - - - - Different BHs - - - Since only one bottom half ever runs at a time once, you - don't need to worry about race conditions with other bottom - halves. Beware that things might change under you, however, - if someone changes your bottom half to a tasklet. If you - want to make your code future-proof, pretend you're already - running from a tasklet (see below), and doing the extra - locking. Of course, if it's five years before that happens, - you're gonna look like a damn fool. - - - Locking Between Tasklets + Locking Between Tasklets/Timers - Sometimes a tasklet might want to share data with another - tasklet, or a bottom half. + Sometimes a tasklet or timer might want to share data with + another tasklet or timer. 
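The individual cases are covered below. As a quick sketch of the most
common one (hypothetical names, not from the kernel source: two
different tasklets sharing one counter), both sides use plain
spin_lock(), since each already runs in softirq context:

    #include <linux/spinlock.h>
    #include <linux/interrupt.h>

    /* A minimal sketch: two tasklets sharing a counter.  Neither can
       interrupt the other on the same CPU, but they can run at the
       same time on different CPUs, so plain spin_lock() is needed
       (and is sufficient). */
    static spinlock_t counter_lock = SPIN_LOCK_UNLOCKED;
    static unsigned long counter;

    static void first_tasklet(unsigned long data)
    {
            spin_lock(&counter_lock);
            counter++;
            spin_unlock(&counter_lock);
    }

    static void second_tasklet(unsigned long data)
    {
            spin_lock(&counter_lock);
            counter--;
            spin_unlock(&counter_lock);
    }

    static DECLARE_TASKLET(first, first_tasklet, 0);
    static DECLARE_TASKLET(second, second_tasklet, 0);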
- The Same Tasklet + The Same Tasklet/Timer Since a tasklet is never run on two CPUs at once, you don't need to worry about your tasklet being reentrant (running @@ -379,10 +375,10 @@ - Different Tasklets + Different Tasklets/Timers - If another tasklet (or bottom half, such as a timer) wants - to share data with your tasklet, you will both need to use + If another tasklet/timer wants + to share data with your tasklet or timer , you will both need to use spin_lock() and spin_unlock() calls. spin_lock_bh() is @@ -396,8 +392,8 @@ Locking Between Softirqs - Often a softirq might - want to share data with itself, a tasklet, or a bottom half. + Often a softirq might + want to share data with itself or a tasklet/timer. @@ -421,10 +417,10 @@ Different Softirqs - You'll need to use spin_lock() and - spin_unlock() for shared data, whether it - be a timer (which can be running on a different CPU), bottom half, - tasklet or the same or another softirq. + You'll need to use spin_lock() and + spin_unlock() for shared data, whether it + be a timer, tasklet, different softirq or the same or another + softirq: any of them could be running on a different CPU. @@ -434,13 +430,13 @@ Hard IRQ Context - Hardware interrupts usually communicate with a bottom half, + Hardware interrupts usually communicate with a tasklet or softirq. Frequently this involves putting work in a - queue, which the BH/softirq will take out. + queue, which the softirq will take out. - Locking Between Hard IRQ and Softirqs/Tasklets/BHs + Locking Between Hard IRQ and Softirqs/Tasklets If a hardware irq handler shares data with a softirq, you have @@ -453,6 +449,16 @@ + The irq handler does not to use + spin_lock_irq(), because the softirq cannot + run while the irq handler is running: it can use + spin_lock(), which is slightly faster. The + only exception would be if a different hardware irq handler uses + the same lock: spin_lock_irq() will stop + that from interrupting us. + + + This works perfectly for UP as well: the spin lock vanishes, and this macro simply becomes local_irq_disable() (include/asm/smp.h), which @@ -468,60 +474,766 @@ interrupts are already off) and in softirqs (where the irq disabling is required). + + + Note that softirqs (and hence tasklets and timers) are run on + return from hardware interrupts, so + spin_lock_irq() also stops these. In that + sense, spin_lock_irqsave() is the most + general and powerful locking function. + + + + + Locking Between Two Hard IRQ Handlers + + It is rare to have to share data between two IRQ handlers, but + if you do, spin_lock_irqsave() should be + used: it is architecture-specific whether all interrupts are + disabled inside irq handlers themselves. + - - - Common Techniques + + + Cheat Sheet For Locking - This section lists some common dilemmas and the standard - solutions used in the Linux kernel code. If you use these, - people will find your code simpler to understand. + Pete Zaitcev gives the following summary: + + + + If you are in a process context (any syscall) and want to + lock other process out, use a semaphore. You can take a semaphore + and sleep (copy_from_user*( or + kmalloc(x,GFP_KERNEL)). + + + + + Otherwise (== data can be touched in an interrupt), use + spin_lock_irqsave() and + spin_unlock_irqrestore(). + + + + + Avoid holding spinlock for more than 5 lines of code and + across any function call (except accessors like + readb). + + + - - If I could give you one piece of advice: never sleep with anyone - crazier than yourself. 
But if I had to give you advice on - locking: keep it simple. - + + Table of Minimum Requirements - - Lock data, not code. + The following table lists the minimum + locking requirements between various contexts. In some cases, + the same context can only be running on one CPU at a time, so + no locking is required for that context (eg. a particular + thread can only run on one CPU at a time, but if it needs + shares data with another thread, locking is required). - - Be reluctant to introduce new locks. + Remember the advice above: you can always use + spin_lock_irqsave(), which is a superset + of all other spinlock primitives. + +Table of Locking Requirements + + + + +IRQ Handler A +IRQ Handler B +Softirq A +Softirq B +Tasklet A +Tasklet B +Timer A +Timer B +User Context A +User Context B + + + +IRQ Handler A +None + + + +IRQ Handler B +spin_lock_irqsave +None + + + +Softirq A +spin_lock_irq +spin_lock_irq +spin_lock + + + +Softirq B +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock + + + +Tasklet A +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock +None + + + +Tasklet B +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock +spin_lock +None + + + +Timer A +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock +spin_lock +spin_lock +None + + + +Timer B +spin_lock_irq +spin_lock_irq +spin_lock +spin_lock +spin_lock +spin_lock +spin_lock +None + + + +User Context A +spin_lock_irq +spin_lock_irq +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +None + + + +User Context B +spin_lock_irq +spin_lock_irq +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +spin_lock_bh +down_interruptible +None + + + + +
+
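One entry the examples below do not demonstrate is the bottom-right
corner of the table: two user contexts sharing data. A minimal sketch
(hypothetical names, not kernel code) takes a semaphore and, since it
may sleep while waiting, checks for signals:

    #include <asm/semaphore.h>
    #include <asm/errno.h>

    /* Two user contexts sharing shared_count: a semaphore is enough,
       and we are allowed to sleep while holding it.
       down_interruptible() fails if a signal arrives while we are
       waiting for the semaphore. */
    static DECLARE_MUTEX(count_sem);
    static int shared_count;

    int bump_count(void)
    {
            if (down_interruptible(&count_sem))
                    return -EINTR;
            shared_count++;
            up(&count_sem);
            return 0;
    }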
+ + + Common Examples + +Let's step through a simple example: a cache of number to name +mappings. The cache keeps a count of how often each of the objects is +used, and when it gets full, throws out the least used one. + + + + + All In User Context + +For our first example, we assume that all operations are in user +context (ie. from system calls), so we can sleep. This means we can +use a semaphore to protect the cache and all the objects within +it. Here's the code: + - - Strangely enough, this is the exact reverse of my advice when - you have slept with someone crazier than yourself. - + +#include <linux/list.h> +#include <linux/slab.h> +#include <linux/string.h> +#include <asm/semaphore.h> +#include <asm/errno.h> + +struct object +{ + struct list_head list; + int id; + char name[32]; + int popularity; +}; + +/* Protects the cache, cache_num, and the objects within it */ +static DECLARE_MUTEX(cache_lock); +static LIST_HEAD(cache); +static unsigned int cache_num = 0; +#define MAX_CACHE_SIZE 10 + +/* Must be holding cache_lock */ +static struct object *__cache_find(int id) +{ + struct object *i; + + list_for_each_entry(i, &cache, list) + if (i->id == id) { + i->popularity++; + return i; + } + return NULL; +} - - No Writers in Interrupt Context +/* Must be holding cache_lock */ +static void __cache_delete(struct object *obj) +{ + BUG_ON(!obj); + list_del(&obj->list); + kfree(obj); + cache_num--; +} + +/* Must be holding cache_lock */ +static void __cache_add(struct object *obj) +{ + list_add(&obj->list, &cache); + if (++cache_num > MAX_CACHE_SIZE) { + struct object *i, *outcast = NULL; + list_for_each_entry(i, &cache, list) { + if (!outcast || i->popularity < outcast->popularity) + outcast = i; + } + __cache_delete(outcast); + } +} - - There is a fairly common case where an interrupt handler needs - access to a critical region, but does not need write access. - In this case, you do not need to use - read_lock_irq(), but only - read_lock() everywhere (since if an interrupt - occurs, the irq handler will only try to grab a read lock, which - won't deadlock). You will still need to use - write_lock_irq(). +int cache_add(int id, const char *name) +{ + struct object *obj; + + if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) + return -ENOMEM; + + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; + + down(&cache_lock); + __cache_add(obj); + up(&cache_lock); + return 0; +} + +void cache_delete(int id) +{ + down(&cache_lock); + __cache_delete(__cache_find(id)); + up(&cache_lock); +} + +int cache_find(int id, char *name) +{ + struct object *obj; + int ret = -ENOENT; + + down(&cache_lock); + obj = __cache_find(id); + if (obj) { + ret = 0; + strcpy(name, obj->name); + } + up(&cache_lock); + return ret; +} + + + +Note that we always make sure we have the cache_lock when we add, +delete, or look up the cache: both the cache infrastructure itself and +the contents of the objects are protected by the lock. In this case +it's easy, since we copy the data for the user, and never let them +access the objects directly. + + +There is a slight (and common) optimization here: in +cache_add we set up the fields of the object +before grabbing the lock. This is safe, as no-one else can access it +until we put it in cache. + + + + + Accessing From Interrupt Context + +Now consider the case where cache_find can be +called from interrupt context: either a hardware interrupt or a +softirq. An example would be a timer which deletes object from the +cache. 
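(For concreteness, such a timer might look like the sketch below; the
function names and the one-second timeout are hypothetical, not part
of the cache code.)

    #include <linux/timer.h>

    /* Hypothetical expiry timer.  The timer function runs in softirq
       context, so cache_delete() must be safe to call from there:
       that is exactly the change made below. */
    static struct timer_list expiry_timer;

    static void cache_expiry(unsigned long id)
    {
            cache_delete((int)id);
    }

    static void set_expiry(int id)
    {
            init_timer(&expiry_timer);
            expiry_timer.function = cache_expiry;
            expiry_timer.data = (unsigned long)id;
            expiry_timer.expires = jiffies + HZ;    /* about one second */
            add_timer(&expiry_timer);
    }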
+ + +The change is shown below, in standard patch format: the +- are lines which are taken away, and the ++ are lines which are added. + + +--- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100 ++++ cache.c.interrupt 2003-12-09 14:07:49.000000000 +1100 +@@ -12,7 +12,7 @@ + int popularity; + }; + +-static DECLARE_MUTEX(cache_lock); ++static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED; + static LIST_HEAD(cache); + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 +@@ -55,6 +55,7 @@ + int cache_add(int id, const char *name) + { + struct object *obj; ++ unsigned long flags; + + if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL) + return -ENOMEM; +@@ -63,30 +64,33 @@ + obj->id = id; + obj->popularity = 0; + +- down(&cache_lock); ++ spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +- up(&cache_lock); ++ spin_unlock_irqrestore(&cache_lock, flags); + return 0; + } + + void cache_delete(int id) + { +- down(&cache_lock); ++ unsigned long flags; ++ ++ spin_lock_irqsave(&cache_lock, flags); + __cache_delete(__cache_find(id)); +- up(&cache_lock); ++ spin_unlock_irqrestore(&cache_lock, flags); + } + + int cache_find(int id, char *name) + { + struct object *obj; + int ret = -ENOENT; ++ unsigned long flags; + +- down(&cache_lock); ++ spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); + if (obj) { + ret = 0; + strcpy(name, obj->name); + } +- up(&cache_lock); ++ spin_unlock_irqrestore(&cache_lock, flags); + return ret; + } + + + +Note that the spin_lock_irqsave will turn off +interrupts if they are on, otherwise does nothing (if we are already +in an interrupt handler), hence these functions are safe to call from +any context. + + +Unfortunately, cache_add calls +kmalloc with the GFP_KERNEL +flag, which is only legal in user context. I have assumed that +cache_add is still only called in user context, +otherwise this should become a parameter to +cache_add. + + + + Exposing Objects Outside This File + +If our objects contained more information, it might not be sufficient +to copy the information in and out: other parts of the code might want +to keep pointers to these objects, for example, rather than looking up +the id every time. This produces two problems. + + +The first problem is that we use the cache_lock to +protect objects: we'd need to make this non-static so the rest of the +code can use it. This makes locking trickier, as it is no longer all +in one place. + + +The second problem is the lifetime problem: if another structure keeps +a pointer to an object, it presumably expects that pointer to remain +valid. Unfortunately, this is only guaranteed while you hold the +lock, otherwise someone might call cache_delete +and even worse, add another object, re-using the same address. + + +As there is only one lock, you can't hold it forever: no-one else would +get any work done. + + +The solution to this problem is to use a reference count: everyone who +has a pointer to the object increases it when they first get the +object, and drops the reference count when they're finished with it. +Whoever drops it to zero knows it is unused, and can actually delete it. 
+ + +Here is the code: + + + +--- cache.c.interrupt 2003-12-09 14:25:43.000000000 +1100 ++++ cache.c.refcnt 2003-12-09 14:33:05.000000000 +1100 +@@ -7,6 +7,7 @@ + struct object + { + struct list_head list; ++ unsigned int refcnt; + int id; + char name[32]; + int popularity; +@@ -17,6 +18,35 @@ + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + ++static void __object_put(struct object *obj) ++{ ++ if (--obj->refcnt == 0) ++ kfree(obj); ++} ++ ++static void __object_get(struct object *obj) ++{ ++ obj->refcnt++; ++} ++ ++void object_put(struct object *obj) ++{ ++ unsigned long flags; ++ ++ spin_lock_irqsave(&cache_lock, flags); ++ __object_put(obj); ++ spin_unlock_irqrestore(&cache_lock, flags); ++} ++ ++void object_get(struct object *obj) ++{ ++ unsigned long flags; ++ ++ spin_lock_irqsave(&cache_lock, flags); ++ __object_get(obj); ++ spin_unlock_irqrestore(&cache_lock, flags); ++} ++ + /* Must be holding cache_lock */ + static struct object *__cache_find(int id) + { +@@ -35,6 +65,7 @@ + { + BUG_ON(!obj); + list_del(&obj->list); ++ __object_put(obj); + cache_num--; + } + +@@ -63,6 +94,7 @@ + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; ++ obj->refcnt = 1; /* The cache holds a reference */ + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +@@ -79,18 +111,15 @@ + spin_unlock_irqrestore(&cache_lock, flags); + } + +-int cache_find(int id, char *name) ++struct object *cache_find(int id) + { + struct object *obj; +- int ret = -ENOENT; + unsigned long flags; + + spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); +- if (obj) { +- ret = 0; +- strcpy(name, obj->name); +- } ++ if (obj) ++ __object_get(obj); + spin_unlock_irqrestore(&cache_lock, flags); +- return ret; ++ return obj; + } + + + +We encapsulate the reference counting in the standard 'get' and 'put' +functions. Now we can return the object itself from +cache_find which has the advantage that the user +can now sleep holding the object (eg. to +copy_to_user to name to userspace). + + +The other point to note is that I said a reference should be held for +every pointer to the object: thus the reference count is 1 when first +inserted into the cache. In some versions the framework does not hold +a reference count, but they are more complicated. + + + + Using Atomic Operations For The Reference Count + +In practice, atomic_t would usually be used for +refcnt. There are a number of atomic +operations defined in + +include/asm/atomic.h: these are +guaranteed to be seen atomically from all CPUs in the system, so no +lock is required. In this case, it is simpler than using spinlocks, +although for anything non-trivial using spinlocks is clearer. The +atomic_inc and +atomic_dec_and_test are used instead of the +standard increment and decrement operators, and the lock is no longer +used to protect the reference count itself. 
+ + + +--- cache.c.refcnt 2003-12-09 15:00:35.000000000 +1100 ++++ cache.c.refcnt-atomic 2003-12-11 15:49:42.000000000 +1100 +@@ -7,7 +7,7 @@ + struct object + { + struct list_head list; +- unsigned int refcnt; ++ atomic_t refcnt; + int id; + char name[32]; + int popularity; +@@ -18,33 +18,15 @@ + static unsigned int cache_num = 0; + #define MAX_CACHE_SIZE 10 + +-static void __object_put(struct object *obj) +-{ +- if (--obj->refcnt == 0) +- kfree(obj); +-} +- +-static void __object_get(struct object *obj) +-{ +- obj->refcnt++; +-} +- + void object_put(struct object *obj) + { +- unsigned long flags; +- +- spin_lock_irqsave(&cache_lock, flags); +- __object_put(obj); +- spin_unlock_irqrestore(&cache_lock, flags); ++ if (atomic_dec_and_test(&obj->refcnt)) ++ kfree(obj); + } + + void object_get(struct object *obj) + { +- unsigned long flags; +- +- spin_lock_irqsave(&cache_lock, flags); +- __object_get(obj); +- spin_unlock_irqrestore(&cache_lock, flags); ++ atomic_inc(&obj->refcnt); + } + + /* Must be holding cache_lock */ +@@ -65,7 +47,7 @@ + { + BUG_ON(!obj); + list_del(&obj->list); +- __object_put(obj); ++ object_put(obj); + cache_num--; + } + +@@ -94,7 +76,7 @@ + strlcpy(obj->name, name, sizeof(obj->name)); + obj->id = id; + obj->popularity = 0; +- obj->refcnt = 1; /* The cache holds a reference */ ++ atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +@@ -119,7 +101,7 @@ + spin_lock_irqsave(&cache_lock, flags); + obj = __cache_find(id); + if (obj) +- __object_get(obj); ++ object_get(obj); + spin_unlock_irqrestore(&cache_lock, flags); + return obj; + } + + + + + Protecting The Objects Themselves + +In these examples, we assumed that the objects (except the reference +counts) never changed once they are created. If we wanted to allow +the name to change, there are three possibilities: + + + +You can make cache_lock non-static, and tell people +to grab that lock before changing the name in any object. + + + + +You can provide a cache_obj_rename which grabs +this lock and changes the name for the caller, and tell everyone to +use that function. + + + + +You can make the cache_lock protect only the cache +itself, and use another lock to protect the name. + + + - - Similar logic applies to locking between softirqs/tasklets/BHs - which never need a write lock, and user context: - read_lock() and - write_lock_bh(). - - + +Theoretically, you can make the locks as fine-grained as one lock for +every field, for every object. In practice, the most common variants +are: + + + + +One lock which protects the infrastructure (the cache +list in this example) and all the objects. This is what we have done +so far. + + + + +One lock which protects the infrastructure (including the list +pointers inside the objects), and one lock inside the object which +protects the rest of that object. + + + + +Multiple locks to protect the infrastructure (eg. one lock per hash +chain), possibly with a separate per-object lock. + + + - + +Here is the "lock-per-object" implementation: + + +--- cache.c.refcnt-atomic 2003-12-11 15:50:54.000000000 +1100 ++++ cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 +@@ -6,11 +6,17 @@ + + struct object + { ++ /* These two protected by cache_lock. */ + struct list_head list; ++ int popularity; ++ + atomic_t refcnt; ++ ++ /* Doesn't change once created. 
*/ + int id; ++ ++ spinlock_t lock; /* Protects the name */ + char name[32]; +- int popularity; + }; + + static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED; +@@ -77,6 +84,7 @@ + obj->id = id; + obj->popularity = 0; + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ ++ spin_lock_init(&obj->lock); + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); + + + +Note that I decide that the popularity +count should be protected by the cache_lock rather +than the per-object lock: this is because it (like the +struct list_head inside the object) is +logically part of the infrastructure. This way, I don't need to grab +the lock of every object in __cache_add when +seeking the least popular. + + + +I also decided that the id member is +unchangeable, so I don't need to grab each object lock in +__cache_find() to examine the +id: the object lock is only used by a +caller who wants to read or write the name +field. + + + +Note also that I added a comment describing what data was protected by +which locks. This is extremely important, as it describes the runtime +behavior of the code, and can be hard to gain from just reading. And +as Alan Cox says, Lock data, not code. + + + + + Common Problems + Deadlock: Simple and Advanced @@ -535,10 +1247,10 @@ For a slightly more complex case, imagine you have a region - shared by a bottom half and user context. If you use a + shared by a softirq and user context. If you use a spin_lock() call to protect it, it is - possible that the user context will be interrupted by the bottom - half while it holds the lock, and the bottom half will then spin + possible that the user context will be interrupted by the softirq + while it holds the lock, and the softirq will then spin forever trying to get the same lock. @@ -558,7 +1270,7 @@ - A more complex problem is the so-called `deadly embrace', + A more complex problem is the so-called 'deadly embrace', involving two or more locks. Say you have a hash table: each entry in the table is a spinlock, and a chain of hashed objects. Inside a softirq handler, you sometimes want to @@ -606,7 +1318,7 @@ their lock. It will look, smell, and feel like a crash. - + Preventing Deadlock @@ -634,7 +1346,6 @@ will do?). Remember, the other programmers are out to get you, so don't do this. - Overzealous Prevention Of Deadlocks @@ -650,266 +1361,481 @@ If you don't see why, please stay the fuck away from my code. - +
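For the hash-table example above, one common way to honour a fixed
locking order when two chain locks must be held at once is to order
them by address. This is only a sketch (the helper name is
hypothetical), but it shows the idea:

    #include <linux/spinlock.h>

    /* Take two distinct chain locks in a fixed (address) order, so
       every code path agrees which lock comes first and the deadly
       embrace described above cannot happen.  Assumes a != b. */
    static void lock_two_chains(spinlock_t *a, spinlock_t *b)
    {
            if (a < b) {
                    spin_lock(a);
                    spin_lock(b);
            } else {
                    spin_lock(b);
                    spin_lock(a);
            }
    }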
- - Per-CPU Data - - - A great technique for avoiding locking which is used fairly - widely is to duplicate information for each CPU. For example, - if you wanted to keep a count of a common condition, you could - use a spin lock and a single counter. Nice and simple. - + + Racing Timers: A Kernel Pastime - If that was too slow [it's probably not], you could instead - use a counter for each CPU [don't], then none of them need an - exclusive lock [you're wasting your time here]. To make sure - the CPUs don't have to synchronize caches all the time, align - the counters to cache boundaries by appending - `__cacheline_aligned' to the declaration - (include/linux/cache.h). - [Can't you think of anything better to do?] + Timers can produce their own special problems with races. + Consider a collection of objects (list, hash, etc) where each + object has a timer which is due to destroy it. - They will need a read lock to access their own counters, - however. That way you can use a write lock to grant exclusive - access to all of them at once, to tally them up. + If you want to destroy the entire collection (say on module + removal), you might do the following: - - - Big Reader Locks + + /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE + HUNGARIAN NOTATION */ + spin_lock_bh(&list_lock); - - A classic example of per-CPU information is Ingo's `big - reader' locks - (linux/include/brlock.h). These - use the Per-CPU Data techniques described above to create a lock which - is very fast to get a read lock, but agonizingly slow for a write - lock. - + while (list) { + struct foo *next = list->next; + del_timer(&list->timer); + kfree(list); + list = next; + } + + spin_unlock_bh(&list_lock); + - Fortunately, there are a limited number of these locks - available: you have to go through a strict interview process - to get one. + Sooner or later, this will crash on SMP, because a timer can + have just gone off before the spin_lock_bh(), + and it will only get the lock after we + spin_unlock_bh(), and then try to free + the element (which has already been freed!). - - - - Avoiding Locks: Read And Write Ordering - Sometimes it is possible to avoid locking. Consider the - following case from the 2.2 firewall code, which inserted an - element into a single linked list in user context: + This can be avoided by checking the result of + del_timer(): if it returns + 1, the timer has been deleted. + If 0, it means (in this + case) that it is currently running, so we can do: - new->next = i->next; - i->next = new; - + retry: + spin_lock_bh(&list_lock); - - Here the author (Alan Cox, who knows what he's doing) assumes - that the pointer assignments are atomic. This is important, - because networking packets would traverse this list on bottom - halves without a lock. Depending on their exact timing, they - would either see the new element in the list with a valid - next pointer, or it would not be in the - list yet. A lock is still required against other CPUs inserting - or deleting from the list, of course. - + while (list) { + struct foo *next = list->next; + if (!del_timer(&list->timer)) { + /* Give timer a chance to delete this */ + spin_unlock_bh(&list_lock); + goto retry; + } + kfree(list); + list = next; + } - - Of course, the writes must be in this - order, otherwise the new element appears in the list with an - invalid next pointer, and any other - CPU iterating at the wrong time will jump through it into garbage. 
- Because modern CPUs reorder, Alan's code actually read as follows: - - - - new->next = i->next; - wmb(); - i->next = new; + spin_unlock_bh(&list_lock); - The wmb() is a write memory barrier - (include/asm/system.h): neither - the compiler nor the CPU will allow any writes to memory after the - wmb() to be visible to other hardware - before any of the writes before the wmb(). + Another common problem is deleting timers which restart + themselves (by calling add_timer() at the end + of their timer function). Because this is a fairly common case + which is prone to races, you should use del_timer_sync() + (include/linux/timer.h) + to handle this case. It returns the number of times the timer + had to be deleted before we finally stopped it from adding itself back + in. + - - As i386 does not do write reordering, this bug would never - show up on that platform. On other SMP platforms, however, it - will. - + + The Fucked Up Sparc - There is also rmb() for read ordering: to ensure - any previous variable reads occur before following reads. The simple - mb() macro combines both - rmb() and wmb(). + Alan Cox says the irq disable/enable is in the register + window on a sparc. Andi Kleen says when you do + restore_flags in a different function you mess up all the + register windows. - Some atomic operations are defined to act as a memory barrier - (ie. as per the mb() macro, but if in - doubt, be explicit. - - Also, - spinlock operations act as partial barriers: operations after - gaining a spinlock will never be moved to precede the - spin_lock() call, and operations before - releasing a spinlock will never be moved after the - spin_unlock() call. - + So never pass the flags word set by + spin_lock_irqsave() and brethren to another + function (unless it's declared inline). Usually no-one + does this, but now you've been warned. Dave Miller can never do + anything in a straightforward manner (I can say that, because I have + pictures of him and a certain PowerPC maintainer in a compromising + position). - - Avoiding Locks: Atomic Operations + + + + Locking Speed - There are a number of atomic operations defined in - include/asm/atomic.h: these - are guaranteed to be seen atomically from all CPUs in the system, thus - avoiding races. If your shared data consists of a single counter, say, - these operations might be simpler than using spinlocks (although for - anything non-trivial using spinlocks is clearer). +There are three main things to worry about when considering speed of +some code which does locking. First is concurrency: how many things +are going to be waiting while someone else is holding a lock. Second +is the time taken to actually acquire and release an uncontended lock. +Third is using fewer, or smarter locks. I'm assuming that the lock is +used fairly often: otherwise, you wouldn't be concerned about +efficiency. + + +Concurrency depends on how long the lock is usually held: you should +hold the lock for as long as needed, but no longer. In the cache +example, we always create the object without the lock held, and then +grab the lock only when we are ready to insert it in the list. + + +Acquisition times depend on how much damage the lock operations do to +the pipeline (pipeline stalls) and how likely it is that this CPU was +the last one to grab the lock (ie. is the lock cache-hot for this +CPU): on a machine with more CPUs, this likelihood drops fast. 
+Consider a 700MHz Intel Pentium III: an instruction takes about 0.7ns, +an atomic increment takes about 58ns, a lock which is cache-hot on +this CPU takes 160ns, and a cacheline transfer from another CPU takes +an additional 170 to 360ns. (These figures from Paul McKenney's + Linux +Journal RCU article). + + +These two aims conflict: holding a lock for a short time might be done +by splitting locks into parts (such as in our final per-object-lock +example), but this increases the number of lock acquisitions, and the +results are often slower than having a single lock. This is another +reason to advocate locking simplicity. + + +The third concern is addressed below: there are some methods to reduce +the amount of locking which needs to be done. + + + + Read/Write Lock Variants + + + Both spinlocks and semaphores have read/write variants: + rwlock_t and struct rw_semaphore. + These divide users into two classes: the readers and the writers. If + you are only reading the data, you can get a read lock, but to write to + the data you need the write lock. Many people can hold a read lock, + but a writer must be sole holder. - - Note that the atomic operations do in general not act as memory - barriers. Instead you can insert a memory barrier before or - after atomic_inc() or - atomic_dec() by inserting - smp_mb__before_atomic_inc(), - smp_mb__after_atomic_inc(), - smp_mb__before_atomic_dec() or - smp_mb__after_atomic_dec() - respectively. The advantage of using those macros instead of - smp_mb() is, that they are cheaper on some - platforms. - + + If your code divides neatly along reader/writer lines (as our + cache code does), and the lock is held by readers for + significant lengths of time, using these locks can help. They + are slightly slower than the normal locks though, so in practice + rwlock_t is not usually worthwhile. - - Protecting A Collection of Objects: Reference Counts + + Avoiding Locks: Read Copy Update - Locking a collection of objects is fairly easy: you get a - single spinlock, and you make sure you grab it before - searching, adding or deleting an object. + There is a special method of read/write locking called Read Copy + Update. Using RCU, the readers can avoid taking a lock + altogether: as we expect our cache to be read more often than + updated (otherwise the cache is a waste of time), it is a + candidate for this optimization. - The purpose of this lock is not to protect the individual - objects: you might have a separate lock inside each one for - that. It is to protect the data structure - containing the objects from race conditions. Often - the same lock is used to protect the contents of all the - objects as well, for simplicity, but they are inherently - orthogonal (and many other big words designed to confuse). + How do we get rid of read locks? Getting rid of read locks + means that writers may be changing the list underneath the + readers. That is actually quite simple: we can read a linked + list while an element is being added if the writer adds the + element very carefully. For example, adding + new to a single linked list called + list: + + new->next = list->next; + wmb(); + list->next = new; + + - Changing this to a read-write lock will often help markedly if - reads are far more common that writes. If not, there is - another approach you can use to reduce the time the lock is - held: reference counts. + The wmb() is a write memory barrier. 
It + ensures that the first operation (setting the new element's + next pointer) is complete and will be seen by + all CPUs, before the second operation is (putting the new + element into the list). This is important, since modern + compilers and modern CPUs can both reorder instructions unless + told otherwise: we want a reader to either not see the new + element at all, or see the new element with the + next pointer correctly pointing at the rest of + the list. + + + Fortunately, there is a function to do this for standard + struct list_head lists: + list_add_rcu() + (include/linux/list.h). + + + Removing an element from the list is even simpler: we replace + the pointer to the old element with a pointer to its successor, + and readers will either see it, or skip over it. + + list->next = old->next; + + + There is list_del_rcu() + (include/linux/list.h) which does this (the + normal version poisons the old object, which we don't want). + + + The reader must also be careful: some CPUs can look through the + next pointer to start reading the contents of + the next element early, but don't realize that the pre-fetched + contents is wrong when the next pointer changes + underneath them. Once again, there is a + list_for_each_entry_rcu() + (include/linux/list.h) to help you. Of + course, writers can just use + list_for_each_entry(), since there cannot + be two simultaneous writers. + + + Our final dilemma is this: when can we actually destroy the + removed element? Remember, a reader might be stepping through + this element in the list right now: it we free this element and + the next pointer changes, the reader will jump + off into garbage and crash. We need to wait until we know that + all the readers who were traversing the list when we deleted the + element are finished. We use call_rcu() to + register a callback which will actually destroy the object once + the readers are finished. + + + But how does Read Copy Update know when the readers are + finished? The method is this: firstly, the readers always + traverse the list inside + rcu_read_lock()/rcu_read_unlock() + pairs: these simply disable preemption so the reader won't go to + sleep while reading the list. + + + RCU then waits until every other CPU has slept at least once: + since readers cannot sleep, we know that any readers which were + traversing the list during the deletion are finished, and the + callback is triggered. The real Read Copy Update code is a + little more optimized than this, but this is the fundamental + idea. + + + +--- cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100 ++++ cache.c.rcupdate 2003-12-11 17:55:14.000000000 +1100 +@@ -1,15 +1,18 @@ + #include <linux/list.h> + #include <linux/slab.h> + #include <linux/string.h> ++#include <linux/rcupdate.h> + #include <asm/semaphore.h> + #include <asm/errno.h> + + struct object + { +- /* These two protected by cache_lock. */ ++ /* This is protected by RCU */ + struct list_head list; + int popularity; + ++ struct rcu_head rcu; ++ + atomic_t refcnt; + + /* Doesn't change once created. */ +@@ -40,7 +43,7 @@ + { + struct object *i; + +- list_for_each_entry(i, &cache, list) { ++ list_for_each_entry_rcu(i, &cache, list) { + if (i->id == id) { + i->popularity++; + return i; +@@ -49,19 +52,25 @@ + return NULL; + } + ++/* Final discard done once we know no readers are looking. 
*/ ++static void cache_delete_rcu(void *arg) ++{ ++ object_put(arg); ++} ++ + /* Must be holding cache_lock */ + static void __cache_delete(struct object *obj) + { + BUG_ON(!obj); +- list_del(&obj->list); +- object_put(obj); ++ list_del_rcu(&obj->list); + cache_num--; ++ call_rcu(&obj->rcu, cache_delete_rcu, obj); + } + + /* Must be holding cache_lock */ + static void __cache_add(struct object *obj) + { +- list_add(&obj->list, &cache); ++ list_add_rcu(&obj->list, &cache); + if (++cache_num > MAX_CACHE_SIZE) { + struct object *i, *outcast = NULL; + list_for_each_entry(i, &cache, list) { +@@ -85,6 +94,7 @@ + obj->popularity = 0; + atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ + spin_lock_init(&obj->lock); ++ INIT_RCU_HEAD(&obj->rcu); + + spin_lock_irqsave(&cache_lock, flags); + __cache_add(obj); +@@ -104,12 +114,11 @@ + struct object *cache_find(int id) + { + struct object *obj; +- unsigned long flags; + +- spin_lock_irqsave(&cache_lock, flags); ++ rcu_read_lock(); + obj = __cache_find(id); + if (obj) + object_get(obj); +- spin_unlock_irqrestore(&cache_lock, flags); ++ rcu_read_unlock(); + return obj; + } + + + +Note that the reader will alter the +popularity member in +__cache_find(), and now it doesn't hold a lock. +One solution would be to make it an atomic_t, but for +this usage, we don't really care about races: an approximate result is +good enough, so I didn't change it. + + + +The result is that cache_find() requires no +synchronization with any other functions, so is almost as fast on SMP +as it would be on UP. + + + +There is a furthur optimization possible here: remember our original +cache code, where there were no reference counts and the caller simply +held the lock whenever using the object? This is still possible: if +you hold the lock, noone can delete the object, so you don't need to +get and put the reference count. + + + +Now, because the 'read lock' in RCU is simply disabling preemption, a +caller which always has preemption disabled between calling +cache_find() and +object_put() does not need to actually get and +put the reference count: we could expose +__cache_find() by making it non-static, and +such callers could simply call that. + + +The benefit here is that the reference count is not written to: the +object is not altered in any way, which is much faster on SMP +machines due to caching. + + + + + Per-CPU Data - In this approach, an object has an owner, who sets the - reference count to one. Whenever you get a pointer to the - object, you increment the reference count (a `get' operation). - Whenever you relinquish a pointer, you decrement the reference - count (a `put' operation). When the owner wants to destroy - it, they mark it dead, and do a put. + Another technique for avoiding locking which is used fairly + widely is to duplicate information for each CPU. For example, + if you wanted to keep a count of a common condition, you could + use a spin lock and a single counter. Nice and simple. - Whoever drops the reference count to zero (usually implemented - with atomic_dec_and_test()) actually cleans - up and frees the object. + If that was too slow (it's usually not, but if you've got a + really big machine to test on and can show that it is), you + could instead use a counter for each CPU, then none of them need + an exclusive lock. See DEFINE_PER_CPU(), + get_cpu_var() and + put_cpu_var() + (include/linux/percpu.h). 
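A minimal counter sketch using those interfaces (hypothetical names;
as noted below, the total can only be approximate):

    #include <linux/percpu.h>

    /* Each CPU increments only its own copy: get_cpu_var() disables
       preemption so we cannot change CPUs mid-update, and
       put_cpu_var() re-enables it.  No lock is needed on this path. */
    DEFINE_PER_CPU(unsigned long, hit_count);

    static void count_hit(void)
    {
            get_cpu_var(hit_count)++;
            put_cpu_var(hit_count);
    }

    /* Summing the copies gives only an approximate total: other CPUs
       may be updating theirs while we read. */
    static unsigned long total_hits(void)
    {
            unsigned long sum = 0;
            int cpu;

            for (cpu = 0; cpu < NR_CPUS; cpu++)
                    if (cpu_online(cpu))
                            sum += per_cpu(hit_count, cpu);
            return sum;
    }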
- This means that you are guaranteed that the object won't - vanish underneath you, even though you no longer have a lock - for the collection. + Of particular use for simple per-cpu counters is the + local_t type, and the + cpu_local_inc() and related functions, + which are more efficient than simple code on some architectures + (include/asm/local.h). - Here's some skeleton code: + Note that there is no simple, reliable way of getting an exact + value of such a counter, without introducing more locks. This + is not a problem for some uses. + - - void create_foo(struct foo *x) - { - atomic_set(&x->use, 1); - spin_lock_bh(&list_lock); - ... insert in list ... - spin_unlock_bh(&list_lock); - } - - struct foo *get_foo(int desc) - { - struct foo *ret; - - spin_lock_bh(&list_lock); - ... find in list ... - if (ret) atomic_inc(&ret->use); - spin_unlock_bh(&list_lock); - - return ret; - } + + Data Which Mostly Used By An IRQ Handler - void put_foo(struct foo *x) - { - if (atomic_dec_and_test(&x->use)) - kfree(foo); - } + + If data is always accessed from within the same IRQ handler, you + don't need a lock at all: the kernel already guarantees that the + irq handler will not run simultaneously on multiple CPUs. + + + Manfred Spraul points out that you can still do this, even if + the data is very occasionally accessed in user context or + softirqs/tasklets. The irq handler doesn't use a lock, and + all other accesses are done as so: + + + + spin_lock(&lock); + disable_irq(irq); + ... + enable_irq(irq); + spin_unlock(&lock); + + + The disable_irq() prevents the irq handler + from running (and waits for it to finish if it's currently + running on other CPUs). The spinlock prevents any other + accesses happening at the same time. Naturally, this is slower + than just a spin_lock_irq() call, so it + only makes sense if this type of access happens extremely + rarely. + + + - void destroy_foo(struct foo *x) - { - spin_lock_bh(&list_lock); - ... remove from list ... - spin_unlock_bh(&list_lock); + + What Functions Are Safe To Call From Interrupts? - put_foo(x); - } - + + Many functions in the kernel sleep (ie. call schedule()) + directly or indirectly: you can never call them while holding a + spinlock, or with preemption disabled. This also means you need + to be in user context: calling them from an interrupt is illegal. + - - Macros To Help You - - There are a set of debugging macros tucked inside - include/linux/netfilter_ipv4/lockhelp.h - and listhelp.h: these are very - useful for ensuring that locks are held in the right places to protect - infrastructure. - - - - - - Things Which Sleep + + Some Functions Which Sleep - You can never call the following routines while holding a - spinlock, as they may sleep. This also means you need to be in - user context. + The most common ones are listed below, but you usually have to + read the code to find out if other calls are safe. If everyone + else who calls it can sleep, you probably need to be able to + sleep, too. In particular, registration and deregistration + functions usually expect to be called from user context, and can + sleep. @@ -961,106 +1887,31 @@ - - printk() can be called in - any context, interestingly enough. - - - - - The Fucked Up Sparc + + Some Functions Which Don't Sleep - Alan Cox says the irq disable/enable is in the register - window on a sparc. Andi Kleen says when you do - restore_flags in a different function you mess up all the - register windows. 
- - - - So never pass the flags word set by - spin_lock_irqsave() and brethren to another - function (unless it's declared inline. Usually no-one - does this, but now you've been warned. Dave Miller can never do - anything in a straightforward manner (I can say that, because I have - pictures of him and a certain PowerPC maintainer in a compromising - position). - - - - - Racing Timers: A Kernel Pastime - - - Timers can produce their own special problems with races. - Consider a collection of objects (list, hash, etc) where each - object has a timer which is due to destroy it. + Some functions are safe to call from any context, or holding + almost any lock. - - If you want to destroy the entire collection (say on module - removal), you might do the following: - - - - /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE - HUNGARIAN NOTATION */ - spin_lock_bh(&list_lock); - - while (list) { - struct foo *next = list->next; - del_timer(&list->timer); - kfree(list); - list = next; - } - - spin_unlock_bh(&list_lock); - - - - Sooner or later, this will crash on SMP, because a timer can - have just gone off before the spin_lock_bh(), - and it will only get the lock after we - spin_unlock_bh(), and then try to free - the element (which has already been freed!). - - - - This can be avoided by checking the result of - del_timer(): if it returns - 1, the timer has been deleted. - If 0, it means (in this - case) that it is currently running, so we can do: - - - - retry: - spin_lock_bh(&list_lock); - - while (list) { - struct foo *next = list->next; - if (!del_timer(&list->timer)) { - /* Give timer a chance to delete this */ - spin_unlock_bh(&list_lock); - goto retry; - } - kfree(list); - list = next; - } - - spin_unlock_bh(&list_lock); - - - - Another common problem is deleting timers which restart - themselves (by calling add_timer() at the end - of their timer function). Because this is a fairly common case - which is prone to races, you should use del_timer_sync() - (include/linux/timer.h) - to handle this case. It returns the number of times the timer - had to be deleted before we finally stopped it from adding itself back - in. - + + + + printk() + + + + + kfree() + + + + + add_timer() and del_timer() + + + @@ -1101,8 +1952,9 @@ Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul - Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul and Tim - Waugh for proofreading, correcting, flaming, commenting. + Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim + Waugh, Pete Zaitcev, James Morris, Robert Love, Paul McKenney, + John Ashby for proofreading, correcting, flaming, commenting. @@ -1113,12 +1965,27 @@ Glossary + + preemption + + + Prior to 2.5, or when CONFIG_PREEMPT is + unset, processes in user context inside the kernel would not + preempt each other (ie. you had that CPU until you have it up, + except for interrupts). With the addition of + CONFIG_PREEMPT in 2.5.4, this changed: when + in user context, higher priority tasks can "cut in": spinlocks + were changed to disable preemption, even on UP. + + + + bh Bottom Half: for historical reasons, functions with - `_bh' in them often now refer to any software interrupt, e.g. + '_bh' in them often now refer to any software interrupt, e.g. spin_lock_bh() blocks any software interrupt on the current CPU. Bottom halves are deprecated, and will eventually be replaced by tasklets. Only one bottom half will be @@ -1132,8 +1999,7 @@ Hardware interrupt request. 
in_irq() returns - true in a hardware interrupt handler (it - also returns true when interrupts are blocked). + true in a hardware interrupt handler. @@ -1144,8 +2010,7 @@ Not user context: processing a hardware irq or software irq. Indicated by the in_interrupt() macro - returning true (although it also - returns true when interrupts or BHs are blocked). + returning true. @@ -1161,35 +2026,40 @@ - softirq + Software Interrupt / softirq - Strictly speaking, one of up to 32 enumerated software + Software interrupt handler. in_irq() returns + false; in_softirq() + returns true. Tasklets and softirqs + both fall into the category of 'software interrupts'. + + + Strictly speaking a softirq is one of up to 32 enumerated software interrupts which can run on multiple CPUs at once. - Sometimes used to refer to tasklets and bottom halves as + Sometimes used to refer to tasklets as well (ie. all software interrupts). - - Software Interrupt / Software IRQ + + tasklet - Software interrupt handler. in_irq() returns - false; in_softirq() - returns true. Tasklets, softirqs and - bottom halves all fall into the category of `software interrupts'. + A dynamically-registrable software interrupt, + which is guaranteed to only run on one CPU at a time. - - tasklet + + timer - A dynamically-registrable software interrupt, - which is guaranteed to only run on one CPU at a time. + A dynamically-registrable software interrupt, which is run at + (or close to) a given time. When running, it is just like a + tasklet (in fact, they are called from the TIMER_SOFTIRQ). @@ -1207,10 +2077,11 @@ User Context - The kernel executing on behalf of a particular - process or kernel thread (given by the current() - macro.) Not to be confused with userspace. Can be interrupted by - software or hardware interrupts. + The kernel executing on behalf of a particular process (ie. a + system call or trap) or kernel thread. You can tell which + process with the current macro.) Not to + be confused with userspace. Can be interrupted by software or + hardware interrupts. --- diff/Documentation/SubmittingDrivers 2003-07-11 09:39:49.000000000 +0100 +++ source/Documentation/SubmittingDrivers 2003-12-29 09:30:39.000000000 +0000 @@ -34,14 +34,13 @@ maintainer then please contact Alan Cox Linux 2.4: - The same rules apply as 2.2 but this kernel tree is under active - development. The final contact point for Linux 2.4 submissions is - Marcelo Tosatti . + The same rules apply as 2.2. The final contact point for Linux 2.4 + submissions is Marcelo Tosatti . -Linux 2.5: +Linux 2.6: The same rules apply as 2.4 except that you should follow linux-kernel - to track changes in API's. The final contact point for Linux 2.5 - submissions is Linus Torvalds . + to track changes in API's. The final contact point for Linux 2.6 + submissions is Andrew Morton . 
What Criteria Determine Acceptance ---------------------------------- --- diff/Documentation/filesystems/Locking 2003-08-26 10:00:51.000000000 +0100 +++ source/Documentation/filesystems/Locking 2003-12-29 09:30:39.000000000 +0000 @@ -420,7 +420,7 @@ prototypes: void (*open)(struct vm_area_struct*); void (*close)(struct vm_area_struct*); - struct page *(*nopage)(struct vm_area_struct*, unsigned long, int); + struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *); locking rules: BKL mmap_sem --- diff/Documentation/filesystems/proc.txt 2003-09-17 12:28:01.000000000 +0100 +++ source/Documentation/filesystems/proc.txt 2003-12-29 09:30:39.000000000 +0000 @@ -900,6 +900,15 @@ Every mounted file system needs a super block, so if you plan to mount lots of file systems, you may want to increase these numbers. +aio-nr and aio-max-nr +--------------------- + +aio-nr is the running total of the number of events specified on the +io_setup system call for all currently active aio contexts. If aio-nr +reaches aio-max-nr then io_setup will fail with EAGAIN. Note that +raising aio-max-nr does not result in the pre-allocation or re-sizing +of any kernel data structures. + 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats ----------------------------------------------------------- --- diff/Documentation/filesystems/xfs.txt 2003-10-27 09:20:36.000000000 +0000 +++ source/Documentation/filesystems/xfs.txt 2003-12-29 09:30:39.000000000 +0000 @@ -29,10 +29,11 @@ The preferred buffered I/O size can also be altered on an individual file basis using the ioctl(2) system call. - ikeep + ikeep/noikeep When inode clusters are emptied of inodes, keep them around - on the disk, this is the old XFS behavior. Default is now to - return the inode cluster to the free space pool. + on the disk (ikeep) - this is the traditional XFS behaviour + and is still the default for now. Using the noikeep option, + inode clusters are returned to the free space pool. logbufs=value Set the number of in-memory log buffers. Valid numbers range @@ -75,6 +76,10 @@ Filesystems mounted "norecovery" must be mounted read-only or the mount will fail. + nouuid + Don't check for double mounted file systems using the file system uuid. + This is useful to mount LVM snapshot volumes. + osyncisosync Make O_SYNC writes implement true O_SYNC. WITHOUT this option, Linux XFS behaves as if an "osyncisdsync" option is used, @@ -108,10 +113,6 @@ The "swidth" option is required if the "sunit" option has been specified, and must be a multiple of the "sunit" value. - nouuid - Don't check for double mounted file systems using the file system uuid. - This is useful to mount LVM snapshot volumes. - sysctls ======= @@ -139,14 +140,14 @@ Causes certain error conditions to call BUG(). Value is a bitmask; AND together the tags which represent errors which should cause panics: - XFS_NO_PTAG 0LL - XFS_PTAG_IFLUSH 0x0000000000000001LL - XFS_PTAG_LOGRES 0x0000000000000002LL - XFS_PTAG_AILDELETE 0x0000000000000004LL - XFS_PTAG_ERROR_REPORT 0x0000000000000008LL - XFS_PTAG_SHUTDOWN_CORRUPT 0x0000000000000010LL - XFS_PTAG_SHUTDOWN_IOERROR 0x0000000000000020LL - XFS_PTAG_SHUTDOWN_LOGERROR 0x0000000000000040LL + XFS_NO_PTAG 0 + XFS_PTAG_IFLUSH 0x00000001 + XFS_PTAG_LOGRES 0x00000002 + XFS_PTAG_AILDELETE 0x00000004 + XFS_PTAG_ERROR_REPORT 0x00000008 + XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010 + XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 + XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040 This option is intended for debugging only. 
@@ -165,6 +166,21 @@ Controls whether unprivileged users can use chown to "give away" a file to another user. + fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "sync" flag set + by the chattr(1) command on a directory to be + inherited by files in that directory. + + fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nodump" flag set + by the chattr(1) command on a directory to be + inherited by files in that directory. + + fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "noatime" flag set + by the chattr(1) command on a directory to be + inherited by files in that directory. + vm.pagebuf.stats_clear (Min: 0 Default: 0 Max: 1) Setting this to "1" clears accumulated pagebuf statistics in /proc/fs/pagebuf/stat. It then immediately reset to "0". --- diff/Documentation/i386/zero-page.txt 2002-11-11 11:09:30.000000000 +0000 +++ source/Documentation/i386/zero-page.txt 2003-12-29 09:30:39.000000000 +0000 @@ -22,6 +22,7 @@ 0x90000 + contents of CL_OFFSET (only taken, when CL_MAGIC = 0xA33F) 0x40 20 bytes struct apm_bios_info, APM_BIOS_INFO + 0x60 16 bytes Intel SpeedStep (IST) BIOS support information 0x80 16 bytes hd0-disk-parameter from intvector 0x41 0x90 16 bytes hd1-disk-parameter from intvector 0x46 --- diff/Documentation/kernel-parameters.txt 2003-10-09 09:47:33.000000000 +0100 +++ source/Documentation/kernel-parameters.txt 2003-12-29 09:30:39.000000000 +0000 @@ -24,6 +24,7 @@ HW Appropriate hardware is enabled. IA-32 IA-32 aka i386 architecture is enabled. IA-64 IA-64 architecture is enabled. + IOSCHED More than one I/O scheduler is enabled. IP_PNP IP DCHP, BOOTP, or RARP is enabled. ISAPNP ISA PnP code is enabled. ISDN Appropriate ISDN support is enabled. @@ -91,6 +92,23 @@ ht -- run only enough ACPI to enable Hyper Threading See also Documentation/pm.txt. + acpi_pic_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode + Format: { level | edge } + level Force PIC-mode SCI to Level Trigger (default) + edge Force PIC-mode SCI to Edge Trigger + + acpi_irq_balance [HW,ACPI] ACPI will balance active IRQs + default in APIC mode + + acpi_irq_nobalance [HW,ACPI] ACPI will not move active IRQs (default) + default in PIC mode + + acpi_irq_pci= [HW,ACPI] If irq_balance, clear listed IRQs for use by PCI + Format: ,... + + acpi_irq_isa= [HW,ACPI] If irq_balance, mark listed IRQs used by ISA + Format: ,... + ad1816= [HW,OSS] Format: ,,, See also Documentation/sound/oss/AD1816. @@ -214,7 +232,7 @@ Forces specified timesource (if avaliable) to be used when calculating gettimeofday(). If specicified timesource is not avalible, it defaults to PIT. - Format: { pit | tsc | cyclone | ... } + Format: { pit | tsc | cyclone | pmtmr } hpet= [IA-32,HPET] option to disable HPET and use PIT. Format: disable @@ -303,6 +321,10 @@ See comment before function elanfreq_setup() in arch/i386/kernel/cpu/cpufreq/elanfreq.c. + elevator= [IOSCHED] + Format: {"as"|"cfq"|"deadline"|"noop"} + See Documentation/as-iosched.txt for details + es1370= [HW,OSS] Format: [,] See also header of sound/oss/es1370.c. @@ -790,7 +812,8 @@ before loading. See Documentation/ramdisk.txt. - psmouse_noext [HW,MOUSE] Disable probing for PS2 mouse protocol extensions + psmouse_proto= [HW,MOUSE] Highest PS2 mouse protocol extension to + probe for (bare|imps|exps).
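As a concrete (hypothetical) example combining the new parameters documented above, a boot command line such as

	elevator=deadline acpi_pic_sci=edge acpi_irq_balance acpi_irq_isa=4,11

would select the deadline I/O scheduler, force the PIC-mode SCI to edge trigger, enable ACPI IRQ balancing, and mark IRQs 4 and 11 as used by ISA.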
psmouse_resetafter= [HW,MOUSE] Try to reset Synaptics Touchpad after so many --- diff/Documentation/kobject.txt 2003-09-17 12:28:01.000000000 +0100 +++ source/Documentation/kobject.txt 2003-12-29 09:30:39.000000000 +0000 @@ -2,7 +2,7 @@ Patrick Mochel -Updated: 3 June 2003 +Updated: 12 November 2003 Copyright (c) 2003 Patrick Mochel @@ -128,7 +128,33 @@ (like the networking layer). -1.4 sysfs +1.4 Order dependencies + +Fields in a kobject must be initialized before they are used, as +indicated in this table: + + k_name Before kobject_add() (1) + name Before kobject_add() (1) + refcount Before kobject_init() (2) + entry Set by kobject_init() + parent Before kobject_add() (3) + kset Before kobject_init() (4) + ktype Before kobject_add() (4) + dentry Set by kobject_add() + +(1) If k_name isn't already set when kobject_add() is called, +it will be set to name. + +(2) Although kobject_init() sets refcount to 1, a warning will be logged +if refcount is not equal to 0 beforehand. + +(3) If parent isn't already set when kobject_add() is called, +it will be set to kset's embedded kobject. + +(4) kset and ktype are optional. If they are used, they should be set +at the times indicated. + +1.5 sysfs Each kobject receives a directory in sysfs. This directory is created under the kobject's parent directory. @@ -210,7 +236,25 @@ name. The kobject, if found, is returned. -2.3 sysfs +2.3 Order dependencies + +Fields in a kset must be initialized before they are used, as indicated +in this table: + + subsys Before kset_add() + ktype Before kset_add() (1) + list Set by kset_init() + kobj Before kset_init() (2) + hotplug_ops Before kset_add() (1) + +(1) ktype and hotplug_ops are optional. If they are used, they should +be set at the times indicated. + +(2) kobj is passed to kobject_init() during kset_init() and to +kobject_add() during kset_add(); it must be initialized accordingly. + + +2.4 sysfs ksets are represented in sysfs when their embedded kobjects are registered. They follow the same rules of parenting, with one @@ -352,7 +396,21 @@ - Sets obj->subsys.kset.kobj.kset to the subsystem's embedded kset. -4.4 sysfs +4.4 Order dependencies + +Fields in a subsystem must be initialized before they are used, +as indicated in this table: + + kset Before subsystem_register() (1) + rwsem Set by subsystem_register() + +(1) kset is minimally initialized by the decl_subsys macro. It is +passed to kset_init() and kset_add() by subsystem_register(). If its +subsys member isn't already set, subsystem_register() sets it to the +containing subsystem. + + +4.5 sysfs subsystems are represented in sysfs via their embedded kobjects. They follow the same rules as previously mentioned with no exceptions. They --- diff/Documentation/scsi/BusLogic.txt 2003-08-20 14:16:23.000000000 +0100 +++ source/Documentation/scsi/BusLogic.txt 2003-12-29 09:30:39.000000000 +0000 @@ -577,7 +577,7 @@ INSMOD Loadable Kernel Module Installation Facility: insmod BusLogic.o \ - 'BusLogic_Options="QueueDepth:[,7,15];QueueDepth:31,BusSettleTime:30"' + 'BusLogic="QueueDepth:[,7,15];QueueDepth:31,BusSettleTime:30"' NOTE: Module Utilities 2.1.71 or later is required for correct parsing of driver options containing commas. --- diff/Documentation/sysctl/fs.txt 2003-01-02 10:43:02.000000000 +0000 +++ source/Documentation/sysctl/fs.txt 2003-12-29 09:30:39.000000000 +0000 @@ -138,3 +138,13 @@ can have. You only need to increase super-max if you need to mount more filesystems than the current value in super-max allows you to.
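A minimal sketch (illustrative only, not part of this patch) of the kobject ordering rules tabulated above, assuming the parent kset has already been registered:

/*
 * Illustrative sketch: fields consumed by kobject_init() (kset,
 * refcount) are set before it; fields consumed by kobject_add()
 * (name, parent) may be set between init and add.  parent is left
 * NULL here, so kobject_add() falls back to the kset's embedded
 * kobject.  Error handling is omitted for brevity.
 */
#include <linux/kobject.h>
#include <linux/string.h>

static struct kobject example_kobj;

static int example_register(struct kset *parent_kset)
{
	memset(&example_kobj, 0, sizeof(example_kobj)); /* refcount starts at 0 */
	example_kobj.kset = parent_kset;                /* before kobject_init() */
	kobject_init(&example_kobj);                    /* sets refcount to 1 */
	kobject_set_name(&example_kobj, "example");     /* before kobject_add() */
	return kobject_add(&example_kobj);
}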
+ +============================================================== + +aio-nr & aio-max-nr: + +aio-nr shows the current system-wide number of asynchronous io +requests. aio-max-nr allows you to change the maximum value +aio-nr can grow to. + +============================================================== --- diff/Documentation/watchdog/watchdog-api.txt 2003-09-17 12:28:01.000000000 +0100 +++ source/Documentation/watchdog/watchdog-api.txt 2003-12-29 09:30:39.000000000 +0000 @@ -358,6 +358,15 @@ No bits set in GETSUPPORT +w83627hf_wdt.c -- w83627hf watchdog + + Timeout that defaults to 60 seconds, supports SETTIMEOUT. + + Supports CONFIG_WATCHDOG_NOWAYOUT + + GETSUPPORT returns WDIOF_KEEPALIVEPING and WDIOF_SETTIMEOUT. + The GETSTATUS call returns whether the device is open or not. + wdt.c -- ICS WDT500/501 ISA and wdt_pci.c -- ICS WDT500/501 PCI --- diff/MAINTAINERS 2003-11-25 15:24:57.000000000 +0000 +++ source/MAINTAINERS 2003-12-29 09:30:42.000000000 +0000 @@ -73,7 +73,7 @@ 3C359 NETWORK DRIVER P: Mike Phillips M: mikep@linuxtr.net -L: linux-net@vger.rutgers.edu +L: linux-net@vger.kernel.org L: linux-tr@linuxtr.net W: http://www.linuxtr.net S: Maintained @@ -929,8 +929,9 @@ S: Maintained SN-IA64 (Itanium) SUB-PLATFORM -P: John Hesterberg -M: jh@sgi.com +P: Jesse Barnes +M: jbarnes@sgi.com +L: linux-altix@sgi.com L: linux-ia64@linuxia64.org W: http://www.sgi.com/altix S: Maintained @@ -1154,6 +1155,12 @@ W: http://developer.osdl.org/rddunlap/kj-patches/ S: Maintained +KGDB FOR I386 PLATFORM +P: George Anzinger +M: george@mvista.com +L: linux-net@vger.kernel.org +S: Supported + KERNEL NFSD P: Neil Brown M: neilb@cse.unsw.edu.au @@ -1234,8 +1241,8 @@ S: Maintained LSILOGIC/SYMBIOS/NCR 53C8XX and 53C1010 PCI-SCSI drivers -P: Gerard Roudier -M: groudier@free.fr +P: Matthew Wilcox +M: matthew@wil.cx L: linux-scsi@vger.kernel.org S: Maintained @@ -1468,7 +1475,7 @@ P: Willem Riede M: osst@riede.org L: osst@linux1.onstream.nl -L: linux-scsi@vger.rutgers.edu +L: linux-scsi@vger.kernel.org S: Maintained OPL3-SA2, SA3, and SAx DRIVER @@ -1556,11 +1563,8 @@ S: Maintained PCMCIA SUBSYSTEM -P: David Hinds -M: dahinds@users.sourceforge.net -L: linux-kernel@vger.kernel.org -W: http://pcmcia-cs.sourceforge.net -S: Maintained +L: http://lists.infradead.org/mailman/listinfo/linux-pcmcia +S: Unmaintained PCNET32 NETWORK DRIVER P: Thomas Bogendörfer --- diff/Makefile 2003-12-19 09:51:11.000000000 +0000 +++ source/Makefile 2003-12-29 09:30:42.000000000 +0000 @@ -1,7 +1,7 @@ VERSION = 2 PATCHLEVEL = 6 SUBLEVEL = 0 -EXTRAVERSION = +EXTRAVERSION = -mm1 # *DOCUMENTATION* # To see a list of typical targets execute "make help" @@ -275,7 +275,7 @@ CPPFLAGS := -D__KERNEL__ -Iinclude \ $(if $(KBUILD_SRC),-Iinclude2 -I$(srctree)/include) -CFLAGS := -Wall -Wstrict-prototypes -Wno-trigraphs -O2 \ +CFLAGS := -Wall -Wstrict-prototypes -Wno-trigraphs \ -fno-strict-aliasing -fno-common AFLAGS := -D__ASSEMBLY__ @@ -431,6 +431,12 @@ # --------------------------------------------------------------------------- +ifdef CONFIG_CC_OPTIMIZE_FOR_SIZE +CFLAGS += -Os +else +CFLAGS += -O2 +endif + ifndef CONFIG_FRAME_POINTER CFLAGS += -fomit-frame-pointer endif --- diff/README 2003-10-09 09:47:33.000000000 +0100 +++ source/README 2003-12-29 09:30:42.000000000 +0000 @@ -119,7 +119,7 @@ cd /usr/src/linux-2.6.N make O=/home/name/build/kernel menuconfig make O=/home/name/build/kernel - sudo make O=/home/name/build/kernel install_modules install + sudo make O=/home/name/build/kernel modules_install install Please note: If the 'O=output/dir' option
is used then it must be used for all invocations of make. --- diff/arch/alpha/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/alpha/kernel/irq.c 2003-12-29 09:30:38.000000000 +0000 @@ -252,9 +252,11 @@ irq_affinity_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - if (count < HEX_DIGITS+1) + int len = cpumask_snprintf(page, count, irq_affinity[(long)data]); + if (count - len < 2) return -EINVAL; - return sprintf (page, "%016lx\n", irq_affinity[(long)data]); + len += sprintf(page + len, "\n"); + return len; } static unsigned int @@ -331,10 +333,11 @@ prof_cpu_mask_read_proc(char *page, char **start, off_t off, int count, int *eof, void *data) { - unsigned long *mask = (unsigned long *) data; - if (count < HEX_DIGITS+1) + int len = cpumask_snprintf(page, count, *(cpumask_t *)data); + if (count - len < 2) return -EINVAL; - return sprintf (page, "%016lx\n", *mask); + len += sprintf(page + len, "\n"); + return len; } static int @@ -529,19 +532,21 @@ #ifdef CONFIG_SMP int j; #endif - int i; + int i = *(loff_t *) v; struct irqaction * action; unsigned long flags; #ifdef CONFIG_SMP - seq_puts(p, " "); - for (i = 0; i < NR_CPUS; i++) - if (cpu_online(i)) - seq_printf(p, "CPU%d ", i); - seq_putc(p, '\n'); + if (i == 0) { + seq_puts(p, " "); + for (i = 0; i < NR_CPUS; i++) + if (cpu_online(i)) + seq_printf(p, "CPU%d ", i); + seq_putc(p, '\n'); + } #endif - for (i = 0; i < ACTUAL_NR_IRQS; i++) { + if (i < ACTUAL_NR_IRQS) { spin_lock_irqsave(&irq_desc[i].lock, flags); action = irq_desc[i].action; if (!action) @@ -568,15 +573,16 @@ seq_putc(p, '\n'); unlock: spin_unlock_irqrestore(&irq_desc[i].lock, flags); - } + } else if (i == ACTUAL_NR_IRQS) { #ifdef CONFIG_SMP - seq_puts(p, "IPI: "); - for (i = 0; i < NR_CPUS; i++) - if (cpu_online(i)) - seq_printf(p, "%10lu ", cpu_data[i].ipi_count); - seq_putc(p, '\n'); + seq_puts(p, "IPI: "); + for (i = 0; i < NR_CPUS; i++) + if (cpu_online(i)) + seq_printf(p, "%10lu ", cpu_data[i].ipi_count); + seq_putc(p, '\n'); #endif - seq_printf(p, "ERR: %10lu\n", irq_err_count); + seq_printf(p, "ERR: %10lu\n", irq_err_count); + } return 0; } --- diff/arch/alpha/kernel/traps.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/alpha/kernel/traps.c 2003-12-29 09:30:38.000000000 +0000 @@ -636,6 +636,7 @@ lock_kernel(); printk("Bad unaligned kernel access at %016lx: %p %lx %ld\n", pc, va, opcode, reg); + dump_stack(); do_exit(SIGSEGV); got_exception: --- diff/arch/arm/Makefile 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/arm/Makefile 2003-12-29 09:30:38.000000000 +0000 @@ -14,8 +14,6 @@ GZFLAGS :=-9 #CFLAGS +=-pipe -CFLAGS :=$(CFLAGS:-O2=-Os) - ifeq ($(CONFIG_FRAME_POINTER),y) CFLAGS +=-fno-omit-frame-pointer -mapcs -mno-sched-prolog endif --- diff/arch/arm/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/arm/kernel/irq.c 2003-12-29 09:30:38.000000000 +0000 @@ -169,11 +169,11 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(loff_t *) v; struct irqaction * action; unsigned long flags; - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { spin_lock_irqsave(&irq_controller_lock, flags); action = irq_desc[i].action; if (!action) @@ -187,12 +187,12 @@ seq_putc(p, '\n'); unlock: spin_unlock_irqrestore(&irq_controller_lock, flags); - } - + } else if (i == NR_IRQS) { #ifdef CONFIG_ARCH_ACORN - show_fiq_list(p, v); + show_fiq_list(p, v); #endif - seq_printf(p, "Err: %10lu\n", irq_err_count); + seq_printf(p, "Err: %10lu\n", irq_err_count); + } return 0; } --- 
diff/arch/arm26/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/arm26/kernel/irq.c 2003-12-29 09:30:38.000000000 +0000 @@ -135,10 +135,10 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(loff_t *) v; struct irqaction * action; - for (i = 0 ; i < NR_IRQS ; i++) { + if (i < NR_IRQS) { action = irq_desc[i].action; if (!action) continue; @@ -148,10 +148,10 @@ seq_printf(p, ", %s", action->name); } seq_putc(p, '\n'); + } else if (i == NR_IRQS) { + show_fiq_list(p, v); + seq_printf(p, "Err: %10lu\n", irq_err_count); } - - show_fiq_list(p, v); - seq_printf(p, "Err: %10lu\n", irq_err_count); return 0; } --- diff/arch/cris/kernel/irq.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/cris/kernel/irq.c 2003-12-29 09:30:38.000000000 +0000 @@ -89,11 +89,11 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(loff_t *) v; struct irqaction * action; unsigned long flags; - for (i = 0; i < NR_IRQS; i++) { + if (i < NR_IRQS) { local_irq_save(flags); action = irq_action[i]; if (!action) --- diff/arch/h8300/Kconfig 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/h8300/Kconfig 2003-12-29 09:30:38.000000000 +0000 @@ -5,6 +5,10 @@ mainmenu "uClinux/h8300 (w/o MMU) Kernel Configuration" +config H8300 + bool + default y + config MMU bool default n --- diff/arch/h8300/Makefile 2003-08-26 10:00:51.000000000 +0100 +++ source/arch/h8300/Makefile 2003-12-29 09:30:38.000000000 +0000 @@ -34,7 +34,7 @@ ldflags-$(CONFIG_CPU_H8S) := -mh8300self CFLAGS += $(cflags-y) -CFLAGS += -mint32 -fno-builtin -Os +CFLAGS += -mint32 -fno-builtin CFLAGS += -g CFLAGS += -D__linux__ CFLAGS += -DUTS_SYSNAME=\"uClinux\" --- diff/arch/h8300/platform/h8300h/ints.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/h8300/platform/h8300h/ints.c 2003-12-29 09:30:38.000000000 +0000 @@ -228,9 +228,9 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(loff_t *) v; - for (i = 0; i < NR_IRQS; i++) { + if (i < NR_IRQS) { if (irq_list[i]) { seq_printf(p, "%3d: %10u ",i,irq_list[i]->count); seq_printf(p, "%s\n", irq_list[i]->devname); --- diff/arch/h8300/platform/h8s/ints.c 2003-10-27 09:20:36.000000000 +0000 +++ source/arch/h8300/platform/h8s/ints.c 2003-12-29 09:30:38.000000000 +0000 @@ -280,9 +280,9 @@ int show_interrupts(struct seq_file *p, void *v) { - int i; + int i = *(loff_t *) v; - for (i = 0; i < NR_IRQS; i++) { + if (i < NR_IRQS) { if (irq_list[i]) { seq_printf(p, "%3d: %10u ",i,irq_list[i]->count); seq_printf(p, "%s\n", irq_list[i]->devname); --- diff/arch/i386/Kconfig 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/Kconfig 2003-12-29 09:30:38.000000000 +0000 @@ -115,10 +115,15 @@ default y depends on NUMA && (X86_SUMMIT || X86_GENERICARCH) +config X86_SUMMIT_NUMA + bool + default y + depends on NUMA && (X86_SUMMIT || X86_GENERICARCH) + config X86_CYCLONE_TIMER - bool - default y - depends on X86_SUMMIT || X86_GENERICARCH + bool + default y + depends on X86_SUMMIT || X86_GENERICARCH config ES7000_CLUSTERED_APIC bool @@ -397,6 +402,54 @@ depends on MWINCHIP3D || MWINCHIP2 || MWINCHIPC6 default y +config X86_4G + bool "4 GB kernel-space and 4 GB user-space virtual memory support" + help + This option is only useful for systems that have more than 1 GB + of RAM. + + The default kernel VM layout leaves 1 GB of virtual memory for + kernel-space mappings, and 3 GB of VM for user-space applications. + This option ups both the kernel-space VM and the user-space VM to + 4 GB. 
+ + The cost of this option is additional TLB flushes done at + system-entry points that transition from user-mode into kernel-mode. + I.e. system calls and page faults, and IRQs that interrupt user-mode + code. There's also additional overhead to kernel operations that copy + memory to/from user-space. The overhead from this is hard to predict and + depends on the workload - it can be anything from no visible overhead + to 20-30% overhead. A good rule of thumb is to assume a runtime + overhead of 20%. + + The upside is the much increased kernel-space VM, which more than + quadruples the maximum amount of RAM supported. Kernels compiled with + this option boot on 64GB of RAM and still have more than 3.1 GB of + 'lowmem' left. Another bonus is that highmem IO bouncing decreases, + if used with drivers that still use bounce-buffers. + + There's also a 33% increase in user-space VM size - database + applications might see a boost from this. + + But the cost of the TLB flushes and the runtime overhead has to be + weighed against the bonuses offered by the larger VM spaces. The + dividing line depends on the actual workload - there might be 4 GB + systems that benefit from this option. Systems with less than 4 GB + of RAM will rarely see a benefit from this option - but it's not + out of the question; the exact circumstances have to be considered. + +config X86_SWITCH_PAGETABLES + def_bool X86_4G + +config X86_4G_VM_LAYOUT + def_bool X86_4G + +config X86_UACCESS_INDIRECT + def_bool X86_4G + +config X86_HIGH_ENTRY + def_bool X86_4G + config HPET_TIMER bool "HPET Timer Support" help @@ -784,6 +837,25 @@ See for more information. +config EFI + bool "Boot from EFI support (EXPERIMENTAL)" + depends on ACPI + default n + ---help--- + + This enables the kernel to boot on EFI platforms using + system configuration information passed to it from the firmware. + This also enables the kernel to use any EFI runtime services that are + available (such as the EFI variable services). + + This option is only useful on systems that have EFI firmware + and will result in a kernel image that is ~8k larger. In addition, + you must use the latest ELILO loader available at + ftp.hpl.hp.com/pub/linux-ia64/ in order to take advantage of kernel + initialization using EFI information (neither GRUB nor LILO know + anything about EFI). However, even with this option, the resultant + kernel should continue to boot on existing non-EFI platforms. + config HAVE_DEC_LOCK bool depends on (SMP || PREEMPT) && X86_CMPXCHG @@ -793,7 +865,7 @@ # Summit needs it only when NUMA is on config BOOT_IOREMAP bool - depends on ((X86_SUMMIT || X86_GENERICARCH) && NUMA) + depends on (((X86_SUMMIT || X86_GENERICARCH) && NUMA) || (X86 && EFI)) default y endmenu @@ -1030,6 +1102,25 @@ depends on PCI && ((PCI_GODIRECT || PCI_GOANY) || X86_VISWS) default y +config PCI_USE_VECTOR + bool "Vector-based interrupt indexing" + depends on X86_LOCAL_APIC + default n + help + This replaces the existing IRQ-based interrupt indexing scheme + with a vector-based indexing scheme. The advantages of vector-based + over IRQ-based indexing are listed below: + 1) Support MSI implementation. + 2) Support future IOxAPIC hotplug + + Note that this enables MSI, Message Signaled Interrupt, on all + MSI-capable device functions detected if users also install the + MSI patch. Message Signaled Interrupt enables an MSI-capable + hardware device to send an inbound Memory Write on its PCI bus + instead of asserting an IRQ signal on its device IRQ pin.
+ + If you don't know what to do here, say N. + source "drivers/pci/Kconfig" config ISA @@ -1187,6 +1278,15 @@ This results in a large slowdown, but helps to find certain types of memory corruptions. +config SPINLINE + bool "Spinlock inlining" + depends on DEBUG_KERNEL + help + This will change spinlocks from out of line to inline, making them + account cost to the callers in readprofile, rather than the lock + itself (as ".text.lock.filename"). This can be helpful for finding + the callers of locks. + config DEBUG_HIGHMEM bool "Highmem debugging" depends on DEBUG_KERNEL && HIGHMEM @@ -1203,20 +1303,208 @@ Say Y here only if you plan to use gdb to debug the kernel. If you don't debug the kernel, you can say N. +config LOCKMETER + bool "Kernel lock metering" + depends on SMP + help + Say Y to enable kernel lock metering, which adds overhead to SMP locks, + but allows you to see various statistics using the lockstat command. + config DEBUG_SPINLOCK_SLEEP bool "Sleep-inside-spinlock checking" help If you say Y here, various routines which may sleep will become very noisy if they are called with a spinlock held. +config KGDB + bool "Include kgdb kernel debugger" + depends on DEBUG_KERNEL + help + If you say Y here, the system will be compiled with the debug + option (-g) and a debugging stub will be included in the + kernel. This stub communicates with gdb on another (host) + computer via a serial port. The host computer should have + access to the kernel binary file (vmlinux) and a serial port + that is connected to the target machine. Gdb can be made to + configure the serial port or you can use stty and setserial to + do this. See the 'target' command in gdb. This option also + configures in the ability to request a breakpoint early in the + boot process. To request the breakpoint just include 'kgdb' + as a boot option when booting the target machine. The system + will then break as soon as it looks at the boot options. This + option also installs a breakpoint in panic and sends any + kernel faults to the debugger. For more information see the + Documentation/i386/kgdb.txt file. + +choice + depends on KGDB + prompt "Debug serial port BAUD" + default KGDB_115200BAUD + help + Gdb and the kernel stub need to agree on the baud rate to be + used. Some systems (x86 family at this writing) allow this to + be configured. + +config KGDB_9600BAUD + bool "9600" + +config KGDB_19200BAUD + bool "19200" + +config KGDB_38400BAUD + bool "38400" + +config KGDB_57600BAUD + bool "57600" + +config KGDB_115200BAUD + bool "115200" +endchoice + +config KGDB_PORT + hex "hex I/O port address of the debug serial port" + depends on KGDB + default 3f8 + help + Some systems (x86 family at this writing) allow the port + address to be configured. The number entered is assumed to be + hex, don't put 0x in front of it. The standard addresses are: + COM1 3f8, irq 4 and COM2 2f8, irq 3. Setserial /dev/ttySx + will tell you what you have. It is good to test the serial + connection with a live system before trying to debug. + +config KGDB_IRQ + int "IRQ of the debug serial port" + depends on KGDB + default 4 + help + This is the irq for the debug port. If everything is working + correctly and the kernel has interrupts enabled, a control-C + sent to the port should cause a break into the kernel debug stub. + +config DEBUG_INFO + bool + depends on KGDB + default y + +config KGDB_MORE + bool "Add any additional compile options" + depends on KGDB + default n + help + Saying yes here turns on the ability to enter additional + compile options.
+ + +config KGDB_OPTIONS + depends on KGDB_MORE + string "Additional compile arguments" + default "-O1" + help + This option allows you to enter additional compile options for + the whole kernel compile. Each platform will have a default + that seems right for it. For example on PPC "-ggdb -O1", and + for i386 "-O1". Note that by configuring KGDB "-g" is already + turned on. In addition, on i386 platforms + "-fomit-frame-pointer" is deleted from the standard compile + options. + +config NO_KGDB_CPUS + int "Number of CPUs" + depends on KGDB && SMP + default NR_CPUS + help + + This option sets the number of cpus for kgdb ONLY. It is used + to prune some internal structures so they look "nice" when + displayed with gdb. This is to overcome possibly larger + numbers that may have been entered above. Enter the real + number to get nice clean kgdb_info displays. + +config KGDB_TS + bool "Enable kgdb time stamp macros?" + depends on KGDB + default n + help + Kgdb event macros allow you to instrument your code with calls + to the kgdb event recording function. The event log may be + examined with gdb at a break point. Turning on this + capability also allows you to choose how many events to + keep. Kgdb always keeps the latest events. + +choice + depends on KGDB_TS + prompt "Max number of time stamps to save?" + default KGDB_TS_128 + +config KGDB_TS_64 + bool "64" + +config KGDB_TS_128 + bool "128" + +config KGDB_TS_256 + bool "256" + +config KGDB_TS_512 + bool "512" + +config KGDB_TS_1024 + bool "1024" + +endchoice + +config STACK_OVERFLOW_TEST + bool "Turn on kernel stack overflow testing?" + depends on KGDB + default n + help + This option enables code in the front-line interrupt handlers + to check for kernel stack overflow on interrupts and system + calls. This is part of the kgdb code on x86 systems. + +config KGDB_CONSOLE + bool "Enable serial console through kgdb port" + depends on KGDB + default n + help + This option enables the command line "console=kgdb" option. + When the system is booted with this option on the command line, + all kernel printk output is sent to gdb (as well as to other + consoles). For this to work gdb must be connected. For this + reason, this command line option will generate a breakpoint if + gdb has not yet connected. After the gdb continue command is + given, all pent-up console output will be printed by gdb on the + host machine. Neither this option nor KGDB requires the + serial driver to be configured. + +config KGDB_SYSRQ + bool "Turn on SysRq 'G' command to do a break?" + depends on KGDB + default y + help + This option adds a command to the SysRq code that allows + you to enter SysRq-G, which generates a breakpoint into the KGDB + stub. This will work if the keyboard is alive and can + interrupt the system. Because of constraints on when the + serial port interrupt can be enabled, this code may allow you + to interrupt the system before the serial port control-C is + available. Just say yes here. + config FRAME_POINTER bool "Compile the kernel with frame pointers" + default KGDB help If you say Y here the resulting kernel image will be slightly larger and slower, but it will give very useful debugging information. If you don't debug the kernel, you can say N, but we may not be able to solve problems without frame pointers.
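Putting the options above together, a plausible .config fragment (illustrative only; the option names are taken verbatim from the Kconfig entries in this patch) for debugging over COM1 at 115200 baud:

	CONFIG_KGDB=y
	CONFIG_KGDB_115200BAUD=y
	CONFIG_KGDB_PORT=3f8
	CONFIG_KGDB_IRQ=4
	CONFIG_KGDB_SYSRQ=y

The target is then booted with the 'kgdb' boot option to break early, and the host attaches on the corresponding serial port with gdb's 'target remote' command.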
+config MAGIC_SYSRQ + bool + depends on KGDB_SYSRQ + default y + config X86_EXTRA_IRQS bool depends on X86_LOCAL_APIC || X86_VOYAGER --- diff/arch/i386/Makefile 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/Makefile 2003-12-29 09:30:38.000000000 +0000 @@ -84,6 +84,9 @@ # default subarch .h files mflags-y += -Iinclude/asm-i386/mach-default +mflags-$(CONFIG_KGDB) += -gdwarf-2 +mflags-$(CONFIG_KGDB_MORE) += $(shell echo $(CONFIG_KGDB_OPTIONS) | sed -e 's/"//g') + head-y := arch/i386/kernel/head.o arch/i386/kernel/init_task.o libs-y += arch/i386/lib/ --- diff/arch/i386/boot/setup.S 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/boot/setup.S 2003-12-29 09:30:38.000000000 +0000 @@ -162,7 +162,7 @@ # can be located anywhere in # low memory 0x10000 or higher. -ramdisk_max: .long MAXMEM-1 # (Header version 0x0203 or later) +ramdisk_max: .long __MAXMEM-1 # (Header version 0x0203 or later) # The highest safe address for # the contents of an initrd --- diff/arch/i386/kernel/Makefile 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/Makefile 2003-12-29 09:30:38.000000000 +0000 @@ -7,13 +7,14 @@ obj-y := process.o semaphore.o signal.o entry.o traps.o irq.o vm86.o \ ptrace.o i8259.o ioport.o ldt.o setup.o time.o sys_i386.o \ pci-dma.o i386_ksyms.o i387.o dmi_scan.o bootflag.o \ - doublefault.o + doublefault.o entry_trampoline.o obj-y += cpu/ obj-y += timers/ obj-$(CONFIG_ACPI_BOOT) += acpi/ obj-$(CONFIG_X86_BIOS_REBOOT) += reboot.o obj-$(CONFIG_MCA) += mca.o +obj-$(CONFIG_KGDB) += kgdb_stub.o obj-$(CONFIG_X86_MSR) += msr.o obj-$(CONFIG_X86_CPUID) += cpuid.o obj-$(CONFIG_MICROCODE) += microcode.o @@ -24,12 +25,13 @@ obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o obj-$(CONFIG_X86_IO_APIC) += io_apic.o obj-$(CONFIG_X86_NUMAQ) += numaq.o -obj-$(CONFIG_X86_SUMMIT) += summit.o +obj-$(CONFIG_X86_SUMMIT_NUMA) += summit.o obj-$(CONFIG_EDD) += edd.o obj-$(CONFIG_MODULES) += module.o obj-y += sysenter.o vsyscall.o obj-$(CONFIG_ACPI_SRAT) += srat.o obj-$(CONFIG_HPET_TIMER) += time_hpet.o +obj-$(CONFIG_EFI) += efi.o efi_stub.o EXTRA_AFLAGS := -traditional --- diff/arch/i386/kernel/acpi/boot.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/kernel/acpi/boot.c 2003-12-29 09:30:38.000000000 +0000 @@ -26,6 +26,7 @@ #include #include #include +#include #include #include #include @@ -40,9 +41,8 @@ #define PREFIX "ACPI: " -extern int acpi_disabled; -extern int acpi_irq; -extern int acpi_ht; +int acpi_noirq __initdata = 0; /* skip ACPI IRQ initialization */ +int acpi_ht __initdata = 1; /* enable HT */ int acpi_lapic = 0; int acpi_ioapic = 0; @@ -249,29 +249,66 @@ #ifdef CONFIG_ACPI_BUS /* - * Set specified PIC IRQ to level triggered mode. + * "acpi_pic_sci=level" (current default) + * programs the PIC-mode SCI to Level Trigger. + * (NO-OP if the BIOS set Level Trigger already) + * + * If a PIC-mode SCI is not recognized or gives spurious IRQ7's, + * it may require Edge Trigger -- use "acpi_pic_sci=edge" + * (NO-OP if the BIOS set Edge Trigger already) * * Port 0x4d0-4d1 are ECLR1 and ECLR2, the Edge/Level Control Registers * for the 8259 PIC. bit[n] = 1 means irq[n] is Level, otherwise Edge. * ECLR1 is IRQ's 0-7 (IRQ 0, 1, 2 must be 0) * ECLR2 is IRQ's 8-15 (IRQ 8, 13 must be 0) - * - * As the BIOS should have done this for us, - * print a warning if the IRQ wasn't already set to level.
*/ -void acpi_pic_set_level_irq(unsigned int irq) +static int __initdata acpi_pic_sci_trigger; /* 0: level, 1: edge */ + +void __init +acpi_pic_sci_set_trigger(unsigned int irq) { unsigned char mask = 1 << (irq & 7); unsigned int port = 0x4d0 + (irq >> 3); unsigned char val = inb(port); + + printk(PREFIX "IRQ%d SCI:", irq); if (!(val & mask)) { - printk(KERN_WARNING PREFIX "IRQ %d was Edge Triggered, " - "setting to Level Triggerd\n", irq); - outb(val | mask, port); + printk(" Edge"); + + if (!acpi_pic_sci_trigger) { + printk(" set to Level"); + outb(val | mask, port); + } + } else { + printk(" Level"); + + if (acpi_pic_sci_trigger) { + printk(" set to Edge"); + outb(val | mask, port); + } + } + printk(" Trigger.\n"); +} + +int __init +acpi_pic_sci_setup(char *str) +{ + while (str && *str) { + if (strncmp(str, "level", 5) == 0) + acpi_pic_sci_trigger = 0; /* force level trigger */ + if (strncmp(str, "edge", 4) == 0) + acpi_pic_sci_trigger = 1; /* force edge trigger */ + str = strchr(str, ','); + if (str) + str += strspn(str, ", \t"); } + return 1; } + +__setup("acpi_pic_sci=", acpi_pic_sci_setup); + #endif /* CONFIG_ACPI_BUS */ @@ -326,11 +363,48 @@ } #endif +/* detect the location of the ACPI PM Timer */ +#ifdef CONFIG_X86_PM_TIMER +extern u32 pmtmr_ioport; + +static int __init acpi_parse_fadt(unsigned long phys, unsigned long size) +{ + struct fadt_descriptor_rev2 *fadt =0; + + fadt = (struct fadt_descriptor_rev2*) __acpi_map_table(phys,size); + if(!fadt) { + printk(KERN_WARNING PREFIX "Unable to map FADT\n"); + return 0; + } + + if (fadt->revision >= FADT2_REVISION_ID) { + /* FADT rev. 2 */ + if (fadt->xpm_tmr_blk.address_space_id != ACPI_ADR_SPACE_SYSTEM_IO) + return 0; + + pmtmr_ioport = fadt->xpm_tmr_blk.address; + } else { + /* FADT rev. 1 */ + pmtmr_ioport = fadt->V1_pm_tmr_blk; + } + if (pmtmr_ioport) + printk(KERN_INFO PREFIX "PM-Timer IO Port: %#x\n", pmtmr_ioport); + return 0; +} +#endif + + unsigned long __init acpi_find_rsdp (void) { unsigned long rsdp_phys = 0; + if (efi_enabled) { + if (efi.acpi20) + return __pa(efi.acpi20); + else if (efi.acpi) + return __pa(efi.acpi); + } /* * Scan memory looking for the RSDP signature. First search EBDA (low * memory) paragraphs and then search upper memory (E0000-FFFFF). @@ -380,8 +454,10 @@ * Initialize the ACPI boot-time table parser. */ result = acpi_table_init(); - if (result) + if (result) { + acpi_disabled = 1; return result; + } result = acpi_blacklisted(); if (result) { @@ -462,7 +538,7 @@ * If MPS is present, it will handle them, * otherwise the system will stay in PIC mode */ - if (acpi_disabled || !acpi_irq) { + if (acpi_disabled || acpi_noirq) { return 1; } @@ -504,6 +580,8 @@ acpi_irq_model = ACPI_IRQ_MODEL_IOAPIC; + acpi_irq_balance_set(NULL); + acpi_ioapic = 1; #endif /* CONFIG_X86_IO_APIC && CONFIG_ACPI_INTERPRETER */ @@ -519,5 +597,9 @@ acpi_table_parse(ACPI_HPET, acpi_parse_hpet); #endif +#ifdef CONFIG_X86_PM_TIMER + acpi_table_parse(ACPI_FADT, acpi_parse_fadt); +#endif + return 0; } --- diff/arch/i386/kernel/asm-offsets.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/asm-offsets.c 2003-12-29 09:30:38.000000000 +0000 @@ -4,9 +4,11 @@ * to extract and format the required data. 
*/ +#include #include #include #include "sigframe.h" +#include #define DEFINE(sym, val) \ asm volatile("\n->" #sym " %0 " #val : : "i" (val)) @@ -28,4 +30,17 @@ DEFINE(RT_SIGFRAME_sigcontext, offsetof (struct rt_sigframe, uc.uc_mcontext)); + DEFINE(TI_task, offsetof (struct thread_info, task)); + DEFINE(TI_exec_domain, offsetof (struct thread_info, exec_domain)); + DEFINE(TI_flags, offsetof (struct thread_info, flags)); + DEFINE(TI_preempt_count, offsetof (struct thread_info, preempt_count)); + DEFINE(TI_addr_limit, offsetof (struct thread_info, addr_limit)); + DEFINE(TI_real_stack, offsetof (struct thread_info, real_stack)); + DEFINE(TI_virtual_stack, offsetof (struct thread_info, virtual_stack)); + DEFINE(TI_user_pgd, offsetof (struct thread_info, user_pgd)); + + DEFINE(FIX_ENTRY_TRAMPOLINE_0_addr, __fix_to_virt(FIX_ENTRY_TRAMPOLINE_0)); + DEFINE(FIX_VSYSCALL_addr, __fix_to_virt(FIX_VSYSCALL)); + DEFINE(PAGE_SIZE_asm, PAGE_SIZE); + DEFINE(task_thread_db7, offsetof (struct task_struct, thread.debugreg[7])); } --- diff/arch/i386/kernel/cpu/common.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/cpu/common.c 2003-12-29 09:30:38.000000000 +0000 @@ -510,16 +510,20 @@ BUG(); enter_lazy_tlb(&init_mm, current); - load_esp0(t, thread->esp0); + load_esp0(t, thread); set_tss_desc(cpu,t); cpu_gdt_table[cpu][GDT_ENTRY_TSS].b &= 0xfffffdff; load_TR_desc(); - load_LDT(&init_mm.context); + if (cpu) + load_LDT(&init_mm.context); /* Set up doublefault TSS pointer in the GDT */ __set_tss_desc(cpu, GDT_ENTRY_DOUBLEFAULT_TSS, &doublefault_tss); cpu_gdt_table[cpu][GDT_ENTRY_DOUBLEFAULT_TSS].b &= 0xfffffdff; + if (cpu) + trap_init_virtual_GDT(); + /* Clear %fs and %gs. */ asm volatile ("xorl %eax, %eax; movl %eax, %fs; movl %eax, %gs"); --- diff/arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c 2003-09-17 12:28:01.000000000 +0100 +++ source/arch/i386/kernel/cpu/cpufreq/speedstep-centrino.c 2003-12-29 09:30:38.000000000 +0000 @@ -73,6 +73,16 @@ { .frequency = CPUFREQ_TABLE_END } }; +/* Ultra Low Voltage Intel Pentium M processor 1000MHz */ +static struct cpufreq_frequency_table op_1000[] = + { + OP(600, 844), + OP(800, 972), + OP(900, 988), + OP(1000, 1004), + { .frequency = CPUFREQ_TABLE_END } + }; + /* Low Voltage Intel Pentium M processor 1.10GHz */ static struct cpufreq_frequency_table op_1100[] = { @@ -165,6 +175,7 @@ static const struct cpu_model models[] = { _CPU( 900, " 900"), + CPU(1000), CPU(1100), CPU(1200), CPU(1300), --- diff/arch/i386/kernel/cpu/intel.c 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/kernel/cpu/intel.c 2003-12-29 09:30:38.000000000 +0000 @@ -8,11 +8,10 @@ #include #include #include +#include #include "cpu.h" -extern int trap_init_f00f_bug(void); - #ifdef CONFIG_X86_INTEL_USERCOPY /* * Alignment at which movsl is preferred for bulk memory copies. @@ -157,7 +156,7 @@ c->f00f_bug = 1; if ( !f00f_workaround_enabled ) { - trap_init_f00f_bug(); + trap_init_virtual_IDT(); printk(KERN_NOTICE "Intel Pentium with F0 0F bug - workaround enabled.\n"); f00f_workaround_enabled = 1; } @@ -240,6 +239,12 @@ /* SEP CPUID bug: Pentium Pro reports SEP but doesn't have it until model 3 mask 3 */ if ((c->x86<<8 | c->x86_model<<4 | c->x86_mask) < 0x633) clear_bit(X86_FEATURE_SEP, c->x86_capability); + /* + * FIXME: SEP is disabled for 4G/4G for now: + */ +#ifdef CONFIG_X86_HIGH_ENTRY + clear_bit(X86_FEATURE_SEP, c->x86_capability); +#endif /* Names for the Pentium II/Celeron processors detectable only by also checking the cache size. 
--- diff/arch/i386/kernel/dmi_scan.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/dmi_scan.c 2003-12-29 09:30:38.000000000 +0000 @@ -6,6 +6,7 @@ #include #include #include +#include #include #include #include @@ -16,6 +17,7 @@ int is_sony_vaio_laptop; int is_unsafe_smbus; +int es7000_plat = 0; struct dmi_header { @@ -504,6 +506,7 @@ } +#ifdef CONFIG_ACPI_BOOT extern int acpi_disabled, acpi_force; static __init __attribute__((unused)) int acpi_disable(struct dmi_blacklist *d) @@ -518,8 +521,6 @@ return 0; } - -#ifdef CONFIG_ACPI_BOOT extern int acpi_ht; /* @@ -542,10 +543,8 @@ #ifdef CONFIG_ACPI_PCI static __init int disable_acpi_pci(struct dmi_blacklist *d) { - extern __init void pci_disable_acpi(void) ; - printk(KERN_NOTICE "%s detected: force use of pci=noacpi\n", d->ident); - pci_disable_acpi(); + acpi_noirq_set(); return 0; } #endif @@ -1011,6 +1010,7 @@ printk(KERN_NOTICE "ACPI disabled because your bios is from %s and too old\n", s); printk(KERN_NOTICE "You can enable it with acpi=force\n"); acpi_disabled = 1; + acpi_ht = 0; } } } --- diff/arch/i386/kernel/doublefault.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/doublefault.c 2003-12-29 09:30:38.000000000 +0000 @@ -7,12 +7,13 @@ #include #include #include +#include #define DOUBLEFAULT_STACKSIZE (1024) static unsigned long doublefault_stack[DOUBLEFAULT_STACKSIZE]; #define STACK_START (unsigned long)(doublefault_stack+DOUBLEFAULT_STACKSIZE) -#define ptr_ok(x) ((x) > 0xc0000000 && (x) < 0xc1000000) +#define ptr_ok(x) (((x) > __PAGE_OFFSET && (x) < (__PAGE_OFFSET + 0x01000000)) || ((x) >= FIXADDR_START)) static void doublefault_fn(void) { @@ -38,8 +39,8 @@ printk("eax = %08lx, ebx = %08lx, ecx = %08lx, edx = %08lx\n", t->eax, t->ebx, t->ecx, t->edx); - printk("esi = %08lx, edi = %08lx\n", - t->esi, t->edi); + printk("esi = %08lx, edi = %08lx, ebp = %08lx\n", + t->esi, t->edi, t->ebp); } } --- diff/arch/i386/kernel/entry.S 2003-11-25 15:24:57.000000000 +0000 +++ source/arch/i386/kernel/entry.S 2003-12-29 09:30:38.000000000 +0000 @@ -43,11 +43,25 @@ #include #include #include +#include #include #include +#include #include #include #include "irq_vectors.h" + /* We do not recover from a stack overflow, but at least + * we know it happened and should be able to track it down. + */ +#ifdef CONFIG_STACK_OVERFLOW_TEST +#define STACK_OVERFLOW_TEST \ + testl $7680,%esp; \ + jnz 10f; \ + call stack_overflow; \ +10: +#else +#define STACK_OVERFLOW_TEST +#endif #define nr_syscalls ((syscall_table_size)/4) @@ -87,7 +101,102 @@ #define resume_kernel restore_all #endif -#define SAVE_ALL \ +#ifdef CONFIG_X86_HIGH_ENTRY + +#ifdef CONFIG_X86_SWITCH_PAGETABLES + +#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) +/* + * If task is preempted in __SWITCH_KERNELSPACE, and moved to another cpu, + * __switch_to repoints %esp to the appropriate virtual stack; but %ebp is + * left stale, so we must check whether to repeat the real stack calculation. 
+ */ +#define repeat_if_esp_changed \ + xorl %esp, %ebp; \ + testl $0xffffe000, %ebp; \ + jnz 0b +#else +#define repeat_if_esp_changed +#endif + +/* clobbers ebx, edx and ebp */ + +#define __SWITCH_KERNELSPACE \ + cmpl $0xff000000, %esp; \ + jb 1f; \ + \ + /* \ + * switch pagetables and load the real stack, \ + * keep the stack offset: \ + */ \ + \ + movl $swapper_pg_dir-__PAGE_OFFSET, %edx; \ + \ + /* GET_THREAD_INFO(%ebp) intermixed */ \ +0: \ + movl %esp, %ebp; \ + movl %esp, %ebx; \ + andl $0xffffe000, %ebp; \ + andl $0x00001fff, %ebx; \ + orl TI_real_stack(%ebp), %ebx; \ + repeat_if_esp_changed; \ + \ + movl %edx, %cr3; \ + movl %ebx, %esp; \ +1: + +#endif + + +#define __SWITCH_USERSPACE \ + /* interrupted any of the user return paths? */ \ + \ + movl EIP(%esp), %eax; \ + \ + cmpl $int80_ret_start_marker, %eax; \ + jb 33f; /* nope - continue with sysexit check */\ + cmpl $int80_ret_end_marker, %eax; \ + jb 22f; /* yes - switch to virtual stack */ \ +33: \ + cmpl $sysexit_ret_start_marker, %eax; \ + jb 44f; /* nope - continue with user check */ \ + cmpl $sysexit_ret_end_marker, %eax; \ + jb 22f; /* yes - switch to virtual stack */ \ + /* return to userspace? */ \ +44: \ + movl EFLAGS(%esp),%ecx; \ + movb CS(%esp),%cl; \ + testl $(VM_MASK | 3),%ecx; \ + jz 2f; \ +22: \ + /* \ + * switch to the virtual stack, then switch to \ + * the userspace pagetables. \ + */ \ + \ + GET_THREAD_INFO(%ebp); \ + movl TI_virtual_stack(%ebp), %edx; \ + movl TI_user_pgd(%ebp), %ecx; \ + \ + movl %esp, %ebx; \ + andl $0x1fff, %ebx; \ + orl %ebx, %edx; \ +int80_ret_start_marker: \ + movl %edx, %esp; \ + movl %ecx, %cr3; \ + \ + __RESTORE_ALL; \ +int80_ret_end_marker: \ +2: + +#else /* !CONFIG_X86_HIGH_ENTRY */ + +#define __SWITCH_KERNELSPACE +#define __SWITCH_USERSPACE + +#endif + +#define __SAVE_ALL \ cld; \ pushl %es; \ pushl %ds; \ @@ -102,7 +211,7 @@ movl %edx, %ds; \ movl %edx, %es; -#define RESTORE_INT_REGS \ +#define __RESTORE_INT_REGS \ popl %ebx; \ popl %ecx; \ popl %edx; \ @@ -111,29 +220,28 @@ popl %ebp; \ popl %eax -#define RESTORE_REGS \ - RESTORE_INT_REGS; \ -1: popl %ds; \ -2: popl %es; \ +#define __RESTORE_REGS \ + __RESTORE_INT_REGS; \ +111: popl %ds; \ +222: popl %es; \ .section .fixup,"ax"; \ -3: movl $0,(%esp); \ - jmp 1b; \ -4: movl $0,(%esp); \ - jmp 2b; \ +444: movl $0,(%esp); \ + jmp 111b; \ +555: movl $0,(%esp); \ + jmp 222b; \ .previous; \ .section __ex_table,"a";\ .align 4; \ - .long 1b,3b; \ - .long 2b,4b; \ + .long 111b,444b;\ + .long 222b,555b;\ .previous - -#define RESTORE_ALL \ - RESTORE_REGS \ +#define __RESTORE_ALL \ + __RESTORE_REGS \ addl $4, %esp; \ -1: iret; \ +333: iret; \ .section .fixup,"ax"; \ -2: sti; \ +666: sti; \ movl $(__USER_DS), %edx; \ movl %edx, %ds; \ movl %edx, %es; \ @@ -142,10 +250,19 @@ .previous; \ .section __ex_table,"a";\ .align 4; \ - .long 1b,2b; \ + .long 333b,666b;\ .previous +#define SAVE_ALL \ + __SAVE_ALL; \ + __SWITCH_KERNELSPACE; \ + STACK_OVERFLOW_TEST; + +#define RESTORE_ALL \ + __SWITCH_USERSPACE; \ + __RESTORE_ALL; +.section .entry.text,"ax" ENTRY(lcall7) pushfl # We get a different stack layout with call @@ -163,7 +280,7 @@ movl %edx,EIP(%ebp) # Now we move them to their "normal" places movl %ecx,CS(%ebp) # andl $-8192, %ebp # GET_THREAD_INFO - movl TI_EXEC_DOMAIN(%ebp), %edx # Get the execution domain + movl TI_exec_domain(%ebp), %edx # Get the execution domain call *4(%edx) # Call the lcall7 handler for the domain addl $4, %esp popl %eax @@ -208,7 +325,7 @@ cli # make sure we don't miss an interrupt # setting need_resched or 
sigpending # between sampling and the iret - movl TI_FLAGS(%ebp), %ecx + movl TI_flags(%ebp), %ecx andl $_TIF_WORK_MASK, %ecx # is there any work to be done on # int/exception return? jne work_pending @@ -216,18 +333,18 @@ #ifdef CONFIG_PREEMPT ENTRY(resume_kernel) - cmpl $0,TI_PRE_COUNT(%ebp) # non-zero preempt_count ? + cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? jnz restore_all need_resched: - movl TI_FLAGS(%ebp), %ecx # need_resched set ? + movl TI_flags(%ebp), %ecx # need_resched set ? testb $_TIF_NEED_RESCHED, %cl jz restore_all testl $IF_MASK,EFLAGS(%esp) # interrupts off (exception path) ? jz restore_all - movl $PREEMPT_ACTIVE,TI_PRE_COUNT(%ebp) + movl $PREEMPT_ACTIVE,TI_preempt_count(%ebp) sti call schedule - movl $0,TI_PRE_COUNT(%ebp) + movl $0,TI_preempt_count(%ebp) cli jmp need_resched #endif @@ -246,37 +363,50 @@ pushl $(__USER_CS) pushl $SYSENTER_RETURN -/* - * Load the potential sixth argument from user stack. - * Careful about security. - */ - cmpl $__PAGE_OFFSET-3,%ebp - jae syscall_fault -1: movl (%ebp),%ebp -.section __ex_table,"a" - .align 4 - .long 1b,syscall_fault -.previous - pushl %eax SAVE_ALL GET_THREAD_INFO(%ebp) cmpl $(nr_syscalls), %eax jae syscall_badsys - testb $_TIF_SYSCALL_TRACE,TI_FLAGS(%ebp) + testb $_TIF_SYSCALL_TRACE,TI_flags(%ebp) jnz syscall_trace_entry call *sys_call_table(,%eax,4) movl %eax,EAX(%esp) cli - movl TI_FLAGS(%ebp), %ecx + movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx jne syscall_exit_work + +#ifdef CONFIG_X86_SWITCH_PAGETABLES + + GET_THREAD_INFO(%ebp) + movl TI_virtual_stack(%ebp), %edx + movl TI_user_pgd(%ebp), %ecx + movl %esp, %ebx + andl $0x1fff, %ebx + orl %ebx, %edx +sysexit_ret_start_marker: + movl %edx, %esp + movl %ecx, %cr3 +#endif + /* + * only ebx is not restored by the userspace sysenter vsyscall + * code, it assumes it to be callee-saved. + */ + movl EBX(%esp), %ebx + /* if something modifies registers it must also disable sysexit */ + movl EIP(%esp), %edx movl OLDESP(%esp), %ecx + sti sysexit +#ifdef CONFIG_X86_SWITCH_PAGETABLES +sysexit_ret_end_marker: + nop +#endif # system call handler stub @@ -287,7 +417,7 @@ cmpl $(nr_syscalls), %eax jae syscall_badsys # system call tracing in operation - testb $_TIF_SYSCALL_TRACE,TI_FLAGS(%ebp) + testb $_TIF_SYSCALL_TRACE,TI_flags(%ebp) jnz syscall_trace_entry syscall_call: call *sys_call_table(,%eax,4) @@ -296,10 +426,23 @@ cli # make sure we don't miss an interrupt # setting need_resched or sigpending # between sampling and the iret - movl TI_FLAGS(%ebp), %ecx + movl TI_flags(%ebp), %ecx testw $_TIF_ALLWORK_MASK, %cx # current->work jne syscall_exit_work restore_all: +#ifdef CONFIG_TRAP_BAD_SYSCALL_EXITS + movl EFLAGS(%esp), %eax # mix EFLAGS and CS + movb CS(%esp), %al + testl $(VM_MASK | 3), %eax + jz resume_kernelX # returning to kernel or vm86-space + + cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ? + jz resume_kernelX + + int $3 + +resume_kernelX: +#endif RESTORE_ALL # perform work that needs to be done immediately before resumption @@ -312,7 +455,7 @@ cli # make sure we don't miss an interrupt # setting need_resched or sigpending # between sampling and the iret - movl TI_FLAGS(%ebp), %ecx + movl TI_flags(%ebp), %ecx andl $_TIF_WORK_MASK, %ecx # is there any work to be done other # than syscall tracing? 
jz restore_all @@ -327,6 +470,22 @@ # vm86-space xorl %edx, %edx call do_notify_resume + +#if CONFIG_X86_HIGH_ENTRY + /* + * Reload db7 if necessary: + */ + movl TI_flags(%ebp), %ecx + testb $_TIF_DB7, %cl + jnz work_db7 + + jmp restore_all + +work_db7: + movl TI_task(%ebp), %edx; + movl task_thread_db7(%edx), %edx; + movl %edx, %db7; +#endif jmp restore_all ALIGN @@ -382,7 +541,7 @@ */ .data ENTRY(interrupt) -.text +.previous vector=0 ENTRY(irq_entries_start) @@ -392,7 +551,7 @@ jmp common_interrupt .data .long 1b -.text +.previous vector=vector+1 .endr @@ -433,12 +592,17 @@ movl ES(%esp), %edi # get the function address movl %eax, ORIG_EAX(%esp) movl %ecx, ES(%esp) - movl %esp, %edx pushl %esi # push the error code - pushl %edx # push the pt_regs pointer movl $(__USER_DS), %edx movl %edx, %ds movl %edx, %es + +/* clobbers edx, ebx and ebp */ + __SWITCH_KERNELSPACE + + leal 4(%esp), %edx # prepare pt_regs + pushl %edx # push pt_regs + call *%edi addl $8, %esp jmp ret_from_exception @@ -529,7 +693,7 @@ pushl %edx call do_nmi addl $8, %esp - RESTORE_ALL + jmp restore_all nmi_stack_fixup: FIX_STACK(12,nmi_stack_correct, 1) @@ -606,6 +770,8 @@ pushl $do_spurious_interrupt_bug jmp error_code +.previous + .data ENTRY(sys_call_table) .long sys_restart_syscall /* 0 - old "setup()" system call, used for restarting */ --- diff/arch/i386/kernel/head.S 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/head.S 2003-12-29 09:30:38.000000000 +0000 @@ -16,6 +16,7 @@ #include #include #include +#include #define OLD_CL_MAGIC_ADDR 0x90020 #define OLD_CL_MAGIC 0xA33F @@ -330,7 +331,7 @@ /* This is the default interrupt "handler" :-) */ int_msg: - .asciz "Unknown interrupt\n" + .asciz "Unknown interrupt or fault at EIP %p %p %p\n" ALIGN ignore_int: cld @@ -342,9 +343,17 @@ movl $(__KERNEL_DS),%eax movl %eax,%ds movl %eax,%es + pushl 16(%esp) + pushl 24(%esp) + pushl 32(%esp) + pushl 40(%esp) pushl $int_msg call printk popl %eax + popl %eax + popl %eax + popl %eax + popl %eax popl %ds popl %es popl %edx @@ -377,23 +386,27 @@ .fill NR_CPUS-1,8,0 # space for the other GDT descriptors /* - * This is initialized to create an identity-mapping at 0-8M (for bootup - * purposes) and another mapping of the 0-8M area at virtual address + * This is initialized to create an identity-mapping at 0-16M (for bootup + * purposes) and another mapping of the 0-16M area at virtual address * PAGE_OFFSET. */ .org 0x1000 ENTRY(swapper_pg_dir) .long 0x00102007 .long 0x00103007 - .fill BOOT_USER_PGD_PTRS-2,4,0 - /* default: 766 entries */ + .long 0x00104007 + .long 0x00105007 + .fill BOOT_USER_PGD_PTRS-4,4,0 + /* default: 764 entries */ .long 0x00102007 .long 0x00103007 - /* default: 254 entries */ - .fill BOOT_KERNEL_PGD_PTRS-2,4,0 + .long 0x00104007 + .long 0x00105007 + /* default: 252 entries */ + .fill BOOT_KERNEL_PGD_PTRS-4,4,0 /* - * The page tables are initialized to only 8MB here - the final page + * The page tables are initialized to only 16MB here - the final page * tables are set up later depending on memory size. */ .org 0x2000 @@ -402,15 +415,21 @@ .org 0x3000 ENTRY(pg1) +.org 0x4000 +ENTRY(pg2) + +.org 0x5000 +ENTRY(pg3) + /* * empty_zero_page must immediately follow the page tables ! (The * initialization loop counts until empty_zero_page) */ -.org 0x4000 +.org 0x6000 ENTRY(empty_zero_page) -.org 0x5000 +.org 0x7000 /* * Real beginning of normal "text" segment @@ -419,12 +438,12 @@ ENTRY(_stext) /* - * This starts the data section. 
Note that the above is all - * in the text section because it has alignment requirements - * that we cannot fulfill any other way. + * This starts the data section. */ .data +.align PAGE_SIZE_asm + /* * The Global Descriptor Table contains 28 quadwords, per-CPU. */ @@ -439,7 +458,9 @@ .quad 0x00cf9a000000ffff /* kernel 4GB code at 0x00000000 */ .quad 0x00cf92000000ffff /* kernel 4GB data at 0x00000000 */ #endif - .align L1_CACHE_BYTES + +.align PAGE_SIZE_asm + ENTRY(cpu_gdt_table) .quad 0x0000000000000000 /* NULL descriptor */ .quad 0x0000000000000000 /* 0x0b reserved */ --- diff/arch/i386/kernel/i386_ksyms.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/i386_ksyms.c 2003-12-29 09:30:38.000000000 +0000 @@ -98,7 +98,6 @@ EXPORT_SYMBOL_NOVERS(__down_failed_trylock); EXPORT_SYMBOL_NOVERS(__up_wakeup); /* Networking helper routines. */ -EXPORT_SYMBOL(csum_partial_copy_generic); /* Delay loops */ EXPORT_SYMBOL(__ndelay); EXPORT_SYMBOL(__udelay); @@ -112,13 +111,17 @@ EXPORT_SYMBOL(strpbrk); EXPORT_SYMBOL(strstr); +#if !defined(CONFIG_X86_UACCESS_INDIRECT) EXPORT_SYMBOL(strncpy_from_user); -EXPORT_SYMBOL(__strncpy_from_user); +EXPORT_SYMBOL(__direct_strncpy_from_user); EXPORT_SYMBOL(clear_user); EXPORT_SYMBOL(__clear_user); EXPORT_SYMBOL(__copy_from_user_ll); EXPORT_SYMBOL(__copy_to_user_ll); EXPORT_SYMBOL(strnlen_user); +#else /* CONFIG_X86_UACCESS_INDIRECT */ +EXPORT_SYMBOL(direct_csum_partial_copy_generic); +#endif EXPORT_SYMBOL(dma_alloc_coherent); EXPORT_SYMBOL(dma_free_coherent); --- diff/arch/i386/kernel/i387.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/i387.c 2003-12-29 09:30:38.000000000 +0000 @@ -218,6 +218,7 @@ static int convert_fxsr_to_user( struct _fpstate __user *buf, struct i387_fxsave_struct *fxsave ) { + struct _fpreg tmp[8]; /* 80 bytes scratch area */ unsigned long env[7]; struct _fpreg __user *to; struct _fpxreg *from; @@ -234,23 +235,25 @@ if ( __copy_to_user( buf, env, 7 * sizeof(unsigned long) ) ) return 1; - to = &buf->_st[0]; + to = tmp; from = (struct _fpxreg *) &fxsave->st_space[0]; for ( i = 0 ; i < 8 ; i++, to++, from++ ) { unsigned long *t = (unsigned long *)to; unsigned long *f = (unsigned long *)from; - if (__put_user(*f, t) || - __put_user(*(f + 1), t + 1) || - __put_user(from->exponent, &to->exponent)) - return 1; + *t = *f; + *(t + 1) = *(f+1); + to->exponent = from->exponent; } + if (copy_to_user(buf->_st, tmp, sizeof(struct _fpreg [8]))) + return 1; return 0; } static int convert_fxsr_from_user( struct i387_fxsave_struct *fxsave, struct _fpstate __user *buf ) { + struct _fpreg tmp[8]; /* 80 bytes scratch area */ unsigned long env[7]; struct _fpxreg *to; struct _fpreg __user *from; @@ -258,6 +261,8 @@ if ( __copy_from_user( env, buf, 7 * sizeof(long) ) ) return 1; + if (copy_from_user(tmp, buf->_st, sizeof(struct _fpreg [8]))) + return 1; fxsave->cwd = (unsigned short)(env[0] & 0xffff); fxsave->swd = (unsigned short)(env[1] & 0xffff); @@ -269,15 +274,14 @@ fxsave->fos = env[6]; to = (struct _fpxreg *) &fxsave->st_space[0]; - from = &buf->_st[0]; + from = tmp; for ( i = 0 ; i < 8 ; i++, to++, from++ ) { unsigned long *t = (unsigned long *)to; unsigned long *f = (unsigned long *)from; - if (__get_user(*t, f) || - __get_user(*(t + 1), f + 1) || - __get_user(to->exponent, &from->exponent)) - return 1; + *t = *f; + *(t + 1) = *(f + 1); + to->exponent = from->exponent; } return 0; } --- diff/arch/i386/kernel/i8259.c 2003-06-30 10:07:18.000000000 +0100 +++ source/arch/i386/kernel/i8259.c 2003-12-29 09:30:38.000000000 +0000 @@ 
-419,8 +419,10 @@ * us. (some of these will be overridden and become * 'special' SMP interrupts) */ - for (i = 0; i < NR_IRQS; i++) { + for (i = 0; i < (NR_VECTORS - FIRST_EXTERNAL_VECTOR); i++) { int vector = FIRST_EXTERNAL_VECTOR + i; + if (i >= NR_IRQS) + break; if (vector != SYSCALL_VECTOR) set_intr_gate(vector, interrupt[i]); } --- diff/arch/i386/kernel/init_task.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/init_task.c 2003-12-29 09:30:38.000000000 +0000 @@ -26,7 +26,7 @@ */ union thread_union init_thread_union __attribute__((__section__(".data.init_task"))) = - { INIT_THREAD_INFO(init_task) }; + { INIT_THREAD_INFO(init_task, init_thread_union) }; /* * Initial task structure. @@ -44,5 +44,5 @@ * section. Since TSS's are completely CPU-local, we want them * on exact cacheline boundaries, to eliminate cacheline ping-pong. */ -struct tss_struct init_tss[NR_CPUS] __cacheline_aligned = { [0 ... NR_CPUS-1] = INIT_TSS }; +struct tss_struct init_tss[NR_CPUS] __attribute__((__section__(".data.tss"))) = { [0 ... NR_CPUS-1] = INIT_TSS }; --- diff/arch/i386/kernel/io_apic.c 2003-10-27 09:20:43.000000000 +0000 +++ source/arch/i386/kernel/io_apic.c 2003-12-29 09:30:38.000000000 +0000 @@ -76,6 +76,14 @@ int apic, pin, next; } irq_2_pin[PIN_MAP_SIZE]; +#ifdef CONFIG_PCI_USE_VECTOR +int vector_irq[NR_IRQS] = { [0 ... NR_IRQS -1] = -1}; +#define vector_to_irq(vector) \ + (platform_legacy_irq(vector) ? vector : vector_irq[vector]) +#else +#define vector_to_irq(vector) (vector) +#endif + /* * The common case is 1:1 IRQ<->pin mappings. Sometimes there are * shared ISA-space IRQs, so we have to support them. We are super @@ -249,7 +257,7 @@ clear_IO_APIC_pin(apic, pin); } -static void set_ioapic_affinity(unsigned int irq, cpumask_t cpumask) +static void set_ioapic_affinity_irq(unsigned int irq, cpumask_t cpumask) { unsigned long flags; int pin; @@ -288,7 +296,7 @@ extern cpumask_t irq_affinity[NR_IRQS]; -static cpumask_t __cacheline_aligned pending_irq_balance_cpumask[NR_IRQS]; +cpumask_t __cacheline_aligned pending_irq_balance_cpumask[NR_IRQS]; #define IRQBALANCE_CHECK_ARCH -999 static int irqbalance_disabled = IRQBALANCE_CHECK_ARCH; @@ -670,13 +678,11 @@ __setup("noirqbalance", irqbalance_disable); -static void set_ioapic_affinity(unsigned int irq, cpumask_t mask); - static inline void move_irq(int irq) { /* note - we hold the desc->lock */ if (unlikely(!cpus_empty(pending_irq_balance_cpumask[irq]))) { - set_ioapic_affinity(irq, pending_irq_balance_cpumask[irq]); + set_ioapic_affinity_irq(irq, pending_irq_balance_cpumask[irq]); cpus_clear(pending_irq_balance_cpumask[irq]); } } @@ -853,7 +859,7 @@ if (irq_entry == -1) continue; irq = pin_2_irq(irq_entry, ioapic, pin); - set_ioapic_affinity(irq, mask); + set_ioapic_affinity_irq(irq, mask); } } @@ -1141,7 +1147,8 @@ /* irq_vectors is indexed by the sum of all RTEs in all I/O APICs. 
*/ u8 irq_vector[NR_IRQ_VECTORS] = { FIRST_DEVICE_VECTOR , 0 }; -static int __init assign_irq_vector(int irq) +#ifndef CONFIG_PCI_USE_VECTOR +int __init assign_irq_vector(int irq) { static int current_vector = FIRST_DEVICE_VECTOR, offset = 0; BUG_ON(irq >= NR_IRQ_VECTORS); @@ -1158,11 +1165,36 @@ } IO_APIC_VECTOR(irq) = current_vector; + return current_vector; } +#endif + +static struct hw_interrupt_type ioapic_level_type; +static struct hw_interrupt_type ioapic_edge_type; -static struct hw_interrupt_type ioapic_level_irq_type; -static struct hw_interrupt_type ioapic_edge_irq_type; +#define IOAPIC_AUTO -1 +#define IOAPIC_EDGE 0 +#define IOAPIC_LEVEL 1 + +static inline void ioapic_register_intr(int irq, int vector, unsigned long trigger) +{ + if (use_pci_vector() && !platform_legacy_irq(irq)) { + if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) || + trigger == IOAPIC_LEVEL) + irq_desc[vector].handler = &ioapic_level_type; + else + irq_desc[vector].handler = &ioapic_edge_type; + set_intr_gate(vector, interrupt[vector]); + } else { + if ((trigger == IOAPIC_AUTO && IO_APIC_irq_trigger(irq)) || + trigger == IOAPIC_LEVEL) + irq_desc[irq].handler = &ioapic_level_type; + else + irq_desc[irq].handler = &ioapic_edge_type; + set_intr_gate(vector, interrupt[irq]); + } +} void __init setup_IO_APIC_irqs(void) { @@ -1220,13 +1252,7 @@ if (IO_APIC_IRQ(irq)) { vector = assign_irq_vector(irq); entry.vector = vector; - - if (IO_APIC_irq_trigger(irq)) - irq_desc[irq].handler = &ioapic_level_irq_type; - else - irq_desc[irq].handler = &ioapic_edge_irq_type; - - set_intr_gate(vector, interrupt[irq]); + ioapic_register_intr(irq, vector, IOAPIC_AUTO); if (!apic && (irq < 16)) disable_8259A_irq(irq); @@ -1273,7 +1299,7 @@ * The timer IRQ doesn't have to know that behind the * scene we have a 8259A-master in AEOI mode ... */ - irq_desc[0].handler = &ioapic_edge_irq_type; + irq_desc[0].handler = &ioapic_edge_type; /* * Add it to the IO-APIC irq-routing table: @@ -1624,10 +1650,6 @@ unsigned char old_id; unsigned long flags; - if (acpi_ioapic) - /* This gets done during IOAPIC enumeration for ACPI. */ - return; - /* * This is broken; anything with a real cpu count has to * circumvent this idiocy regardless. @@ -1763,9 +1785,6 @@ * that was delayed but this is now handled in the device * independent code. */ -#define enable_edge_ioapic_irq unmask_IO_APIC_irq - -static void disable_edge_ioapic_irq (unsigned int irq) { /* nothing */ } /* * Starting up a edge-triggered IO-APIC interrupt is @@ -1776,7 +1795,6 @@ * This is not complete - we should be able to fake * an edge even if it isn't on the 8259A... */ - static unsigned int startup_edge_ioapic_irq(unsigned int irq) { int was_pending = 0; @@ -1794,8 +1812,6 @@ return was_pending; } -#define shutdown_edge_ioapic_irq disable_edge_ioapic_irq - /* * Once we have recorded IRQ_PENDING already, we can mask the * interrupt for real. This prevents IRQ storms from unhandled @@ -1810,9 +1826,6 @@ ack_APIC_irq(); } -static void end_edge_ioapic_irq (unsigned int i) { /* nothing */ } - - /* * Level triggered interrupts can just be masked, * and shutting down and starting up the interrupt @@ -1834,10 +1847,6 @@ return 0; /* don't check for pending */ } -#define shutdown_level_ioapic_irq mask_IO_APIC_irq -#define enable_level_ioapic_irq unmask_IO_APIC_irq -#define disable_level_ioapic_irq mask_IO_APIC_irq - static void end_level_ioapic_irq (unsigned int irq) { unsigned long v; @@ -1864,6 +1873,7 @@ * The idea is from Manfred Spraul. 
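 *
 * [Editor's note, a worked illustration of the trick: the local APIC
 * TMR is a 256-bit register bank laid out 32 bits per 16-byte stride,
 * so the "was vector v accepted as level-triggered?" bit can be
 * sampled as
 *
 *	unsigned long w = apic_read(APIC_TMR + ((v & ~0x1f) >> 1));
 *	int level = (w >> (v & 0x1f)) & 1;
 *
 * where (v & ~0x1f) >> 1 is just (v / 32) * 0x10, the byte offset of
 * the 32-bit word holding the bit. The line added above samples it
 * before the ack, while the bit is still valid.]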
--macro */ i = IO_APIC_VECTOR(irq); + v = apic_read(APIC_TMR + ((i & ~0x1f) >> 1)); ack_APIC_irq(); @@ -1898,7 +1908,57 @@ } } -static void mask_and_ack_level_ioapic_irq (unsigned int irq) { /* nothing */ } +#ifdef CONFIG_PCI_USE_VECTOR +static unsigned int startup_edge_ioapic_vector(unsigned int vector) +{ + int irq = vector_to_irq(vector); + + return startup_edge_ioapic_irq(irq); +} + +static void ack_edge_ioapic_vector(unsigned int vector) +{ + int irq = vector_to_irq(vector); + + ack_edge_ioapic_irq(irq); +} + +static unsigned int startup_level_ioapic_vector (unsigned int vector) +{ + int irq = vector_to_irq(vector); + + return startup_level_ioapic_irq (irq); +} + +static void end_level_ioapic_vector (unsigned int vector) +{ + int irq = vector_to_irq(vector); + + end_level_ioapic_irq(irq); +} + +static void mask_IO_APIC_vector (unsigned int vector) +{ + int irq = vector_to_irq(vector); + + mask_IO_APIC_irq(irq); +} + +static void unmask_IO_APIC_vector (unsigned int vector) +{ + int irq = vector_to_irq(vector); + + unmask_IO_APIC_irq(irq); +} + +static void set_ioapic_affinity_vector (unsigned int vector, + cpumask_t cpu_mask) +{ + int irq = vector_to_irq(vector); + + set_ioapic_affinity_irq(irq, cpu_mask); +} +#endif /* * Level and edge triggered IO-APIC interrupts need different handling, @@ -1908,26 +1968,25 @@ * edge-triggered handler, without risking IRQ storms and other ugly * races. */ - -static struct hw_interrupt_type ioapic_edge_irq_type = { +static struct hw_interrupt_type ioapic_edge_type = { .typename = "IO-APIC-edge", - .startup = startup_edge_ioapic_irq, - .shutdown = shutdown_edge_ioapic_irq, - .enable = enable_edge_ioapic_irq, - .disable = disable_edge_ioapic_irq, - .ack = ack_edge_ioapic_irq, - .end = end_edge_ioapic_irq, + .startup = startup_edge_ioapic, + .shutdown = shutdown_edge_ioapic, + .enable = enable_edge_ioapic, + .disable = disable_edge_ioapic, + .ack = ack_edge_ioapic, + .end = end_edge_ioapic, .set_affinity = set_ioapic_affinity, }; -static struct hw_interrupt_type ioapic_level_irq_type = { +static struct hw_interrupt_type ioapic_level_type = { .typename = "IO-APIC-level", - .startup = startup_level_ioapic_irq, - .shutdown = shutdown_level_ioapic_irq, - .enable = enable_level_ioapic_irq, - .disable = disable_level_ioapic_irq, - .ack = mask_and_ack_level_ioapic_irq, - .end = end_level_ioapic_irq, + .startup = startup_level_ioapic, + .shutdown = shutdown_level_ioapic, + .enable = enable_level_ioapic, + .disable = disable_level_ioapic, + .ack = mask_and_ack_level_ioapic, + .end = end_level_ioapic, .set_affinity = set_ioapic_affinity, }; @@ -1947,7 +2006,13 @@ * 0x80, because int 0x80 is hm, kind of importantish. ;) */ for (irq = 0; irq < NR_IRQS ; irq++) { - if (IO_APIC_IRQ(irq) && !IO_APIC_VECTOR(irq)) { + int tmp = irq; + if (use_pci_vector()) { + if (!platform_legacy_irq(tmp)) + if ((tmp = vector_to_irq(tmp)) == -1) + continue; + } + if (IO_APIC_IRQ(tmp) && !IO_APIC_VECTOR(tmp)) { /* * Hmm.. We don't have an entry for this, * so default to an old-fashioned 8259 @@ -2217,12 +2282,14 @@ /* * Set up IO-APIC IRQ routing. 
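 *
 * [Editor's note: the .startup/.shutdown/.ack/.end initializers above
 * use suffix-less names (startup_edge_ioapic and friends) whose
 * definitions are not in any hunk shown here. Presumably a header maps
 * each name onto the _irq or _vector flavour; a hedged reconstruction,
 * not confirmed by this patch:
 *
 *	#ifdef CONFIG_PCI_USE_VECTOR
 *	#define startup_edge_ioapic	startup_edge_ioapic_vector
 *	#define ack_edge_ioapic		ack_edge_ioapic_vector
 *	#else
 *	#define startup_edge_ioapic	startup_edge_ioapic_irq
 *	#define ack_edge_ioapic		ack_edge_ioapic_irq
 *	#endif
 *
 * with each _vector wrapper costing one vector_to_irq() lookup before
 * delegating to the _irq implementation.]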
*/ - setup_ioapic_ids_from_mpc(); + if (!acpi_ioapic) + setup_ioapic_ids_from_mpc(); sync_Arb_IDs(); setup_IO_APIC_irqs(); init_IO_APIC_traps(); check_timer(); - print_IO_APIC(); + if (!acpi_ioapic) + print_IO_APIC(); } /* @@ -2379,10 +2446,12 @@ "IRQ %d Mode:%i Active:%i)\n", ioapic, mp_ioapics[ioapic].mpc_apicid, pin, entry.vector, irq, edge_level, active_high_low); + if (use_pci_vector() && !platform_legacy_irq(irq)) + irq = IO_APIC_VECTOR(irq); if (edge_level) { - irq_desc[irq].handler = &ioapic_level_irq_type; + irq_desc[irq].handler = &ioapic_level_type; } else { - irq_desc[irq].handler = &ioapic_edge_irq_type; + irq_desc[irq].handler = &ioapic_edge_type; } set_intr_gate(entry.vector, interrupt[irq]); --- diff/arch/i386/kernel/irq.c 2003-10-27 09:20:43.000000000 +0000 +++ source/arch/i386/kernel/irq.c 2003-12-29 09:30:38.000000000 +0000 @@ -45,6 +45,7 @@ #include #include #include +#include /* * Linux has a controller-independent x86 interrupt architecture. @@ -138,17 +139,19 @@ int show_interrupts(struct seq_file *p, void *v) { - int i, j; + int i = *(loff_t *) v, j; struct irqaction * action; unsigned long flags; - seq_printf(p, " "); - for (j=0; j HEX_DIGITS) - count = HEX_DIGITS; - if (copy_from_user(hexnum, buffer, count)) - return -EFAULT; - - /* - * Parse the first HEX_DIGITS characters as a hex string, any non-hex char - * is end-of-string. '00e1', 'e1', '00E1', 'E1' are all the same. - */ - - for (i = 0; i < count; i++) { - unsigned int c = hexnum[i]; - int k; - - switch (c) { - case '0' ... '9': c -= '0'; break; - case 'a' ... 'f': c -= 'a'-10; break; - case 'A' ... 'F': c -= 'A'-10; break; - default: - goto out; - } - cpus_shift_left(value, value, 4); - for (k = 0; k < 4; ++k) - if (test_bit(k, (unsigned long *)&c)) - cpu_set(k, value); - } -out: - *ret = value; - return 0; -} - #ifdef CONFIG_SMP static struct proc_dir_entry *smp_affinity_entry[NR_IRQS]; @@ -949,20 +925,10 @@ static int irq_affinity_read_proc(char *page, char **start, off_t off, int count, int *eof, void *data) { - int k, len; - cpumask_t tmp = irq_affinity[(long)data]; - - if (count < HEX_DIGITS+1) + int len = cpumask_snprintf(page, count, irq_affinity[(long)data]); + if (count - len < 2) return -EINVAL; - - len = 0; - for (k = 0; k < sizeof(cpumask_t)/sizeof(u16); ++k) { - int j = sprintf(page, "%04hx", (u16)cpus_coerce(tmp)); - len += j; - page += j; - cpus_shift_right(tmp, tmp, 16); - } - len += sprintf(page, "\n"); + len += sprintf(page + len, "\n"); return len; } @@ -975,7 +941,7 @@ if (!irq_desc[irq].handler->set_affinity) return -EIO; - err = parse_hex_value(buffer, count, &new_value); + err = cpumask_parse(buffer, count, new_value); if (err) return err; @@ -1000,10 +966,11 @@ static int prof_cpu_mask_read_proc (char *page, char **start, off_t off, int count, int *eof, void *data) { - unsigned long *mask = (unsigned long *) data; - if (count < HEX_DIGITS+1) + int len = cpumask_snprintf(page, count, *(cpumask_t *)data); + if (count - len < 2) return -EINVAL; - return sprintf (page, "%08lx\n", *mask); + len += sprintf(page + len, "\n"); + return len; } static int prof_cpu_mask_write_proc (struct file *file, const char __user *buffer, @@ -1013,7 +980,7 @@ unsigned long full_count = count, err; cpumask_t new_value; - err = parse_hex_value(buffer, count, &new_value); + err = cpumask_parse(buffer, count, new_value); if (err) return err; --- diff/arch/i386/kernel/ldt.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/ldt.c 2003-12-29 09:30:38.000000000 +0000 @@ -2,7 +2,7 @@ * 
linux/kernel/ldt.c * * Copyright (C) 1992 Krishna Balasubramanian and Linus Torvalds - * Copyright (C) 1999 Ingo Molnar + * Copyright (C) 1999, 2003 Ingo Molnar */ #include @@ -18,6 +18,8 @@ #include #include #include +#include +#include #ifdef CONFIG_SMP /* avoids "defined but not used" warnig */ static void flush_ldt(void *null) @@ -29,34 +31,31 @@ static int alloc_ldt(mm_context_t *pc, int mincount, int reload) { - void *oldldt; - void *newldt; - int oldsize; + int oldsize, newsize, i; if (mincount <= pc->size) return 0; + /* + * LDT got larger - reallocate if necessary. + */ oldsize = pc->size; mincount = (mincount+511)&(~511); - if (mincount*LDT_ENTRY_SIZE > PAGE_SIZE) - newldt = vmalloc(mincount*LDT_ENTRY_SIZE); - else - newldt = kmalloc(mincount*LDT_ENTRY_SIZE, GFP_KERNEL); - - if (!newldt) - return -ENOMEM; - - if (oldsize) - memcpy(newldt, pc->ldt, oldsize*LDT_ENTRY_SIZE); - oldldt = pc->ldt; - memset(newldt+oldsize*LDT_ENTRY_SIZE, 0, (mincount-oldsize)*LDT_ENTRY_SIZE); - pc->ldt = newldt; - wmb(); + newsize = mincount*LDT_ENTRY_SIZE; + for (i = 0; i < newsize; i += PAGE_SIZE) { + int nr = i/PAGE_SIZE; + BUG_ON(i >= 64*1024); + if (!pc->ldt_pages[nr]) { + pc->ldt_pages[nr] = alloc_page(GFP_HIGHUSER); + if (!pc->ldt_pages[nr]) + return -ENOMEM; + clear_highpage(pc->ldt_pages[nr]); + } + } pc->size = mincount; - wmb(); - if (reload) { #ifdef CONFIG_SMP cpumask_t mask; + preempt_disable(); load_LDT(pc); mask = cpumask_of_cpu(smp_processor_id()); @@ -67,21 +66,20 @@ load_LDT(pc); #endif } - if (oldsize) { - if (oldsize*LDT_ENTRY_SIZE > PAGE_SIZE) - vfree(oldldt); - else - kfree(oldldt); - } return 0; } static inline int copy_ldt(mm_context_t *new, mm_context_t *old) { - int err = alloc_ldt(new, old->size, 0); - if (err < 0) + int i, err, size = old->size, nr_pages = (size*LDT_ENTRY_SIZE + PAGE_SIZE-1)/PAGE_SIZE; + + err = alloc_ldt(new, size, 0); + if (err < 0) { + new->size = 0; return err; - memcpy(new->ldt, old->ldt, old->size*LDT_ENTRY_SIZE); + } + for (i = 0; i < nr_pages; i++) + copy_user_highpage(new->ldt_pages[i], old->ldt_pages[i], 0); return 0; } @@ -96,6 +94,7 @@ init_MUTEX(&mm->context.sem); mm->context.size = 0; + memset(mm->context.ldt_pages, 0, sizeof(struct page *) * MAX_LDT_PAGES); old_mm = current->mm; if (old_mm && old_mm->context.size > 0) { down(&old_mm->context.sem); @@ -107,23 +106,21 @@ /* * No need to lock the MM as we are the last user + * Do not touch the ldt register, we are already + * in the next thread. */ void destroy_context(struct mm_struct *mm) { - if (mm->context.size) { - if (mm == current->active_mm) - clear_LDT(); - if (mm->context.size*LDT_ENTRY_SIZE > PAGE_SIZE) - vfree(mm->context.ldt); - else - kfree(mm->context.ldt); - mm->context.size = 0; - } + int i, nr_pages = (mm->context.size*LDT_ENTRY_SIZE + PAGE_SIZE-1) / PAGE_SIZE; + + for (i = 0; i < nr_pages; i++) + __free_page(mm->context.ldt_pages[i]); + mm->context.size = 0; } static int read_ldt(void __user * ptr, unsigned long bytecount) { - int err; + int err, i; unsigned long size; struct mm_struct * mm = current->mm; @@ -138,8 +135,25 @@ size = bytecount; err = 0; - if (copy_to_user(ptr, mm->context.ldt, size)) - err = -EFAULT; + /* + * This is necessary just in case we got here straight from a + * context-switch where the ptes were set but no tlb flush + * was done yet. We rather avoid doing a TLB flush in the + * context-switch path and do it here instead. 
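+ *
+ * (Editor's note, illustrative only: the LDT pages are now highmem
+ * pages reached through kmap()/fixmap mappings, so a stale TLB entry
+ * here could expose the previous task's LDT. After the global flush,
+ * the loop below copies at most one page per iteration; the effective
+ * per-iteration length is
+ *
+ *	bytes = min(size - i, (unsigned long) PAGE_SIZE);
+ *
+ * and the copy_to_user() length must be clamped to that value.)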
+ */
+	__flush_tlb_global();
+
+	for (i = 0; i < size; i += PAGE_SIZE) {
+		int nr = i / PAGE_SIZE, bytes;
+		char *kaddr = kmap(mm->context.ldt_pages[nr]);
+
+		bytes = size - i;
+		if (bytes > PAGE_SIZE)
+			bytes = PAGE_SIZE;
+		if (copy_to_user(ptr + i, kaddr, bytes))
+			err = -EFAULT;
+		kunmap(mm->context.ldt_pages[nr]);
+	}
 	up(&mm->context.sem);
 	if (err < 0)
 		return err;
@@ -158,7 +172,7 @@
 	err = 0;
 	address = &default_ldt[0];
-	size = 5*sizeof(struct desc_struct);
+	size = 5*LDT_ENTRY_SIZE;
 	if (size > bytecount)
 		size = bytecount;
@@ -200,7 +214,15 @@
 		goto out_unlock;
 	}
-	lp = (__u32 *) ((ldt_info.entry_number << 3) + (char *) mm->context.ldt);
+	/*
+	 * No rescheduling allowed from this point to the install.
+	 *
+	 * We do a TLB flush for the same reason as in the read_ldt() path.
+	 */
+	preempt_disable();
+	__flush_tlb_global();
+	lp = (__u32 *) ((ldt_info.entry_number << 3) +
+			(char *) __kmap_atomic_vaddr(KM_LDT_PAGE0));
 	/* Allow LDTs to be cleared by the user. */
 	if (ldt_info.base_addr == 0 && ldt_info.limit == 0) {
@@ -221,6 +243,7 @@
 	*lp = entry_1;
 	*(lp+1) = entry_2;
 	error = 0;
+	preempt_enable();
 out_unlock:
 	up(&mm->context.sem);
@@ -248,3 +271,26 @@
 	}
 	return ret;
 }
+
+/*
+ * load one particular LDT into the current CPU
+ */
+void load_LDT_nolock(mm_context_t *pc, int cpu)
+{
+	struct page **pages = pc->ldt_pages;
+	int count = pc->size;
+	int nr_pages, i;
+
+	if (likely(!count)) {
+		pages = &default_ldt_page;
+		count = 5;
+	}
+	nr_pages = (count*LDT_ENTRY_SIZE + PAGE_SIZE-1) / PAGE_SIZE;
+
+	for (i = 0; i < nr_pages; i++) {
+		__kunmap_atomic_type(KM_LDT_PAGE0 - i);
+		__kmap_atomic(pages[i], KM_LDT_PAGE0 - i);
+	}
+	set_ldt_desc(cpu, (void *)__kmap_atomic_vaddr(KM_LDT_PAGE0), count);
+	load_LDT_desc();
+}
--- diff/arch/i386/kernel/mpparse.c	2003-11-25 15:24:57.000000000 +0000
+++ source/arch/i386/kernel/mpparse.c	2003-12-29 09:30:38.000000000 +0000
@@ -668,7 +668,7 @@
 	 * Read the physical hardware table.  Anything here will
 	 * override the defaults.
 	 */
-	if (!smp_read_mpc((void *)mpf->mpf_physptr)) {
+	if (!smp_read_mpc((void *)phys_to_virt(mpf->mpf_physptr))) {
 		smp_found_config = 0;
 		printk(KERN_ERR "BIOS bug, MP table errors detected!...\n");
 		printk(KERN_ERR "... disabling SMP support. (tell your hw vendor)\n");
@@ -962,7 +962,8 @@
 	 */
 	for (i = 0; i < mp_irq_entries; i++) {
 		if ((mp_irqs[i].mpc_dstapic == intsrc.mpc_dstapic)
-			&& (mp_irqs[i].mpc_srcbusirq == intsrc.mpc_srcbusirq)) {
+			&& (mp_irqs[i].mpc_srcbusirq == intsrc.mpc_srcbusirq)
+			&& (mp_irqs[i].mpc_irqtype == intsrc.mpc_irqtype)) {
 			mp_irqs[i] = intsrc;
 			found = 1;
 			break;
@@ -1081,8 +1082,14 @@
 	ioapic_pin = irq - mp_ioapic_routing[ioapic].irq_start;
+	/*
+	 * MPS INTI flags:
+	 *  trigger: 0=default, 1=edge, 3=level
+	 *  polarity: 0=default, 1=high, 3=low
+	 * Per ACPI spec, default for SCI means level/low.
+	 */
 	io_apic_set_pci_routing(ioapic, ioapic_pin, irq,
-		(flags.trigger >> 1) , (flags.polarity >> 1));
+		(flags.trigger == 1 ? 0 : 1), (flags.polarity == 1 ? 0 : 1));
 }
 #ifdef CONFIG_ACPI_PCI
@@ -1129,8 +1136,11 @@
 			continue;
 		ioapic_pin = irq - mp_ioapic_routing[ioapic].irq_start;
-		if (!ioapic && (irq < 16))
-			irq += 16;
+		if (es7000_plat) {
+			if (!ioapic && (irq < 16))
+				irq += 16;
+		}
+
 		/*
 		 * Avoid pin reprogramming.
PRTs typically include entries * with redundant pin->irq mappings (but unique PCI devices); @@ -1147,21 +1157,29 @@ if ((1<irq = irq; + if (use_pci_vector() && !platform_legacy_irq(irq)) + irq = IO_APIC_VECTOR(irq); + entry->irq = irq; continue; } mp_ioapic_routing[ioapic].pin_programmed[idx] |= (1<irq = irq; - + if (!io_apic_set_pci_routing(ioapic, ioapic_pin, irq, edge_level, active_high_low)) { + if (use_pci_vector() && !platform_legacy_irq(irq)) + irq = IO_APIC_VECTOR(irq); + entry->irq = irq; + } printk(KERN_DEBUG "%02x:%02x:%02x[%c] -> %d-%d -> IRQ %d\n", entry->id.segment, entry->id.bus, entry->id.device, ('A' + entry->pin), mp_ioapic_routing[ioapic].apic_id, ioapic_pin, entry->irq); } + + print_IO_APIC(); + + return; } #endif /*CONFIG_ACPI_PCI*/ --- diff/arch/i386/kernel/nmi.c 2003-10-09 09:47:16.000000000 +0100 +++ source/arch/i386/kernel/nmi.c 2003-12-29 09:30:38.000000000 +0000 @@ -31,7 +31,16 @@ #include #include +#ifdef CONFIG_KGDB +#include +#ifdef CONFIG_SMP +unsigned int nmi_watchdog = NMI_IO_APIC; +#else +unsigned int nmi_watchdog = NMI_LOCAL_APIC; +#endif +#else unsigned int nmi_watchdog = NMI_NONE; +#endif static unsigned int nmi_hz = HZ; unsigned int nmi_perfctr_msr; /* the MSR to reset in NMI handler */ extern void show_registers(struct pt_regs *regs); @@ -408,6 +417,9 @@ for (i = 0; i < NR_CPUS; i++) alert_counter[i] = 0; } +#ifdef CONFIG_KGDB +int tune_watchdog = 5*HZ; +#endif void nmi_watchdog_tick (struct pt_regs * regs) { @@ -421,12 +433,24 @@ sum = irq_stat[cpu].apic_timer_irqs; +#ifdef CONFIG_KGDB + if (! in_kgdb(regs) && last_irq_sums[cpu] == sum ) { + +#else if (last_irq_sums[cpu] == sum) { +#endif /* * Ayiee, looks like this CPU is stuck ... * wait a few IRQs (5 seconds) before doing the oops ... */ alert_counter[cpu]++; +#ifdef CONFIG_KGDB + if (alert_counter[cpu] == tune_watchdog) { + kgdb_handle_exception(2, SIGPWR, 0, regs); + last_irq_sums[cpu] = sum; + alert_counter[cpu] = 0; + } +#endif if (alert_counter[cpu] == 5*nmi_hz) { spin_lock(&nmi_print_lock); /* --- diff/arch/i386/kernel/process.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/process.c 2003-12-29 09:30:38.000000000 +0000 @@ -47,6 +47,7 @@ #include #include #include +#include #ifdef CONFIG_MATH_EMULATION #include #endif @@ -302,6 +303,9 @@ struct task_struct *tsk = current; memset(tsk->thread.debugreg, 0, sizeof(unsigned long)*8); +#ifdef CONFIG_X86_HIGH_ENTRY + clear_thread_flag(TIF_DB7); +#endif memset(tsk->thread.tls_array, 0, sizeof(tsk->thread.tls_array)); /* * Forget coprocessor state.. @@ -315,9 +319,8 @@ if (dead_task->mm) { // temporary debugging check if (dead_task->mm->context.size) { - printk("WARNING: dead process %8s still has LDT? <%p/%d>\n", + printk("WARNING: dead process %8s still has LDT? <%d>\n", dead_task->comm, - dead_task->mm->context.ldt, dead_task->mm->context.size); BUG(); } @@ -352,7 +355,17 @@ p->thread.esp = (unsigned long) childregs; p->thread.esp0 = (unsigned long) (childregs+1); + /* + * get the two stack pages, for the virtual stack. + * + * IMPORTANT: this code relies on the fact that the task + * structure is an 8K aligned piece of physical memory. 
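+ *
+ * (Editor's aside: the 8K alignment is what makes the later %esp fixup
+ * in __switch_to() a pure bit operation -- keep the stack offset, swap
+ * in the virtual base:
+ *
+ *	esp = (esp & (THREAD_SIZE - 1))
+ *		| (unsigned long) thread_info->virtual_stack;
+ *
+ * which is exactly the CONFIG_PREEMPT/CONFIG_SMP fixup further down in
+ * this file.)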
+ */ + p->thread.stack_page0 = virt_to_page((unsigned long)p->thread_info); + p->thread.stack_page1 = virt_to_page((unsigned long)p->thread_info + PAGE_SIZE); + p->thread.eip = (unsigned long) ret_from_fork; + p->thread_info->real_stack = p->thread_info; savesegment(fs,p->thread.fs); savesegment(gs,p->thread.gs); @@ -504,10 +517,41 @@ __unlazy_fpu(prev_p); +#ifdef CONFIG_X86_HIGH_ENTRY + /* + * Set the ptes of the virtual stack. (NOTE: a one-page TLB flush is + * needed because otherwise NMIs could interrupt the + * user-return code with a virtual stack and stale TLBs.) + */ + __kunmap_atomic_type(KM_VSTACK0); + __kunmap_atomic_type(KM_VSTACK1); + __kmap_atomic(next->stack_page0, KM_VSTACK0); + __kmap_atomic(next->stack_page1, KM_VSTACK1); + + /* + * NOTE: here we rely on the task being the stack as well + */ + next_p->thread_info->virtual_stack = + (void *)__kmap_atomic_vaddr(KM_VSTACK0); + +#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP) + /* + * If next was preempted on entry from userspace to kernel, + * and now it's on a different cpu, we need to adjust %esp. + * This assumes that entry.S does not copy %esp while on the + * virtual stack (with interrupts enabled): which is so, + * except within __SWITCH_KERNELSPACE itself. + */ + if (unlikely(next->esp >= TASK_SIZE)) { + next->esp &= THREAD_SIZE - 1; + next->esp |= (unsigned long) next_p->thread_info->virtual_stack; + } +#endif +#endif /* * Reload esp0, LDT and the page table pointer: */ - load_esp0(tss, next->esp0); + load_virtual_esp0(tss, next_p); /* * Load the per-thread Thread-Local Storage descriptor. --- diff/arch/i386/kernel/reboot.c 2003-10-09 09:47:33.000000000 +0100 +++ source/arch/i386/kernel/reboot.c 2003-12-29 09:30:38.000000000 +0000 @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include "mach_reboot.h" @@ -154,12 +155,11 @@ CMOS_WRITE(0x00, 0x8f); spin_unlock_irqrestore(&rtc_lock, flags); - /* Remap the kernel at virtual address zero, as well as offset zero - from the kernel segment. This assumes the kernel segment starts at - virtual address PAGE_OFFSET. */ - - memcpy (swapper_pg_dir, swapper_pg_dir + USER_PGD_PTRS, - sizeof (swapper_pg_dir [0]) * KERNEL_PGD_PTRS); + /* + * Remap the first 16 MB of RAM (which includes the kernel image) + * at virtual address zero: + */ + setup_identity_mappings(swapper_pg_dir, 0, 16*1024*1024); /* * Use `swapper_pg_dir' as our page directory. @@ -263,7 +263,12 @@ disable_IO_APIC(); #endif - if(!reboot_thru_bios) { + if (!reboot_thru_bios) { + if (efi_enabled) { + efi.reset_system(EFI_RESET_COLD, EFI_SUCCESS, 0, 0); + __asm__ __volatile__("lidt %0": :"m" (no_idt)); + __asm__ __volatile__("int3"); + } /* rebooting needs to touch the page at absolute addr 0 */ *((unsigned short *)__va(0x472)) = reboot_mode; for (;;) { @@ -273,6 +278,8 @@ __asm__ __volatile__("int3"); } } + if (efi_enabled) + efi.reset_system(EFI_RESET_WARM, EFI_SUCCESS, 0, 0); machine_real_restart(jump_to_bios, sizeof(jump_to_bios)); } @@ -287,6 +294,8 @@ void machine_power_off(void) { + if (efi_enabled) + efi.reset_system(EFI_RESET_SHUTDOWN, EFI_SUCCESS, 0, 0); if (pm_power_off) pm_power_off(); } --- diff/arch/i386/kernel/setup.c 2003-10-27 09:20:43.000000000 +0000 +++ source/arch/i386/kernel/setup.c 2003-12-29 09:30:38.000000000 +0000 @@ -36,6 +36,8 @@ #include #include #include +#include +#include #include
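/*
 * [Editor's note on the reboot.c hunks above -- a hedged sketch of the
 * reset precedence they create; restart_sketch() is an invented name
 * and the body is illustrative, but the efi.reset_system() calls are
 * taken verbatim from the patch:
 *
 *	static void restart_sketch(void)
 *	{
 *		if (!reboot_thru_bios) {
 *			if (efi_enabled)	// firmware service first
 *				efi.reset_system(EFI_RESET_COLD,
 *						 EFI_SUCCESS, 0, 0);
 *			// else: keyboard-controller reset, then a forced
 *			// triple fault through an empty IDT
 *		}
 *		if (efi_enabled)		// BIOS path: warm EFI reset
 *			efi.reset_system(EFI_RESET_WARM,
 *					 EFI_SUCCESS, 0, 0);
 *		machine_real_restart(jump_to_bios, sizeof(jump_to_bios));
 *	}
 *
 * EFI gets first chance on both paths; the legacy real-mode BIOS
 * restart stays as the final fallback.]
 */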