|
From Jan Bottorff (this post was only available on
Google via Cache. Since it was so impressively
well-written and comprehensive, I've replicated
it here):
There are a variety of hardware sources that can
assert the NMI signal, including:
1) NMI signal on the ISA bus
2) assorted transfer errors on the PCI bus (it's not totally clear how these
make it to the interrupt controller)
3) a watchdog timer often can assert NMI
4) the internal APIC can generate a processor NMI
5) processor support chipsets can assert NMI on memory parity errors and
such (may invoke SMM interrupt)
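As one concrete illustration of sources 1 and 5: on PC/AT-compatible
hardware, System Control Port B (I/O port 0x61) has traditionally latched
two NMI reason bits, bit 7 for a memory parity error (documented as SERR#
on PCI-era chipsets) and bit 6 for an ISA I/O channel check. The sketch
below only decodes a status byte that has already been read; the names
nmi_reason and nmi_decode_port_b are made up for illustration, and reading
the port itself is platform-specific and left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Read-side bits of PC/AT System Control Port B (I/O port 0x61). */
#define NMI_REASON_PARITY 0x80u /* bit 7: memory parity error, or SERR# on PCI-era chipsets */
#define NMI_REASON_IOCHK  0x40u /* bit 6: I/O channel check (IOCHK#) from the ISA bus */

/* Hypothetical decoded form of the status byte. */
typedef struct nmi_reason {
    bool parity_or_serr;
    bool io_check;
    bool unknown; /* neither bit set: the source must be found elsewhere */
} nmi_reason;

/* Decode a status byte previously read from port 0x61. */
nmi_reason nmi_decode_port_b(uint8_t status)
{
    nmi_reason r;
    r.parity_or_serr = (status & NMI_REASON_PARITY) != 0;
    r.io_check = (status & NMI_REASON_IOCHK) != 0;
    r.unknown = !r.parity_or_serr && !r.io_check;
    return r;
}
```

When neither bit is set, the NMI came from somewhere this port does not
report (a watchdog or the APIC, for example), which is why the OS has to
keep looking.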
I believe when the OS gets an NMI, it has to go around looking for
potential reasons, by checking hardware it knows about. The OS also
probably doesn't know how to figure out some of the sources. For example, a
watchdog timer expiration may be from proprietary hardware (which often
asserts reset instead of NMI, because the system knows what to do).
To understand memory errors, the OS would need to know
details about the specific support chipset. Drivers also really should have
an entry point for "did you cause this NMI error" and "please fix your NMI
error". We could probably have a little religious war here about what
should happen on a serious hardware fault, like a bus error. Almost always,
if you access an I/O port address that isn't claimed by anybody, all that
happens is the transfer is thrown away or filled with a dummy value. The
hardware (especially PCI busses) knows something was wrong.
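The two driver entry points suggested above ("did you cause this NMI
error" and "please fix your NMI error") could look roughly like the sketch
below. This is not an NT interface or any real OS API; nmi_source,
nmi_register, nmi_dispatch and the fake_device helper are all hypothetical
names, and a real handler would run under NMI-context restrictions that
this sketch ignores.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-driver NMI hooks. */
typedef struct nmi_source {
    const char *name;
    bool (*caused_nmi)(void *ctx); /* "did you cause this NMI error?" */
    bool (*fix_nmi)(void *ctx);    /* "please fix your NMI error"; false if it cannot */
    void *ctx;
} nmi_source;

typedef enum { NMI_UNKNOWN, NMI_RECOVERED, NMI_FATAL } nmi_result;

#define MAX_NMI_SOURCES 16
static nmi_source g_sources[MAX_NMI_SOURCES];
static size_t g_source_count;

/* Add a driver to the list the OS polls on NMI; false if the table is full. */
bool nmi_register(nmi_source src)
{
    if (g_source_count >= MAX_NMI_SOURCES)
        return false;
    g_sources[g_source_count++] = src;
    return true;
}

/* Ask every registered driver whether it raised the NMI and, if so, to fix it. */
nmi_result nmi_dispatch(void)
{
    nmi_result result = NMI_UNKNOWN;
    for (size_t i = 0; i < g_source_count; i++) {
        nmi_source *s = &g_sources[i];
        if (!s->caused_nmi(s->ctx))
            continue;
        if (!s->fix_nmi(s->ctx))
            return NMI_FATAL; /* a claimed but unfixable error */
        result = NMI_RECOVERED;
    }
    return result; /* NMI_UNKNOWN: nobody claimed it */
}

/* Stand-in device used only to exercise the dispatcher. */
typedef struct fake_device {
    bool error_pending;
    bool fixable;
} fake_device;

bool fake_caused(void *ctx)
{
    return ((fake_device *)ctx)->error_pending;
}

bool fake_fix(void *ctx)
{
    fake_device *d = ctx;
    if (!d->fixable)
        return false;
    d->error_pending = false; /* acknowledge and clear the error */
    return true;
}
```

The NMI_UNKNOWN result corresponds to the case described above where the
OS does not know how to figure out the source; what to do then (ignore,
log, or halt) is exactly the policy question open to debate.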
My belief is the OS really should have a way to interface with all these
hardware devices. On a non-correctable memory error especially, I believe
the OS could ask the memory controller the exact location of the failure,
and terminate just the application with the bad memory, allowing everything
else to keep running. For ECC memory, the OS really should trap on a
correctable error and "scrub" the memory. If the error is repeatable, it
should take that physical page out of service, and migrate its corrected
contents to a good page. Also log the memory error to the event log. I
don't know if NT is able to do any of this. I've never seen any messages in
the event log saying it noticed a correctable memory error.
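The memory-error policy described above can be sketched as a small
decision function: a non-correctable error terminates only the owner of
the bad memory, a first correctable error on a page is scrubbed, and a
repeat on the same page retires it. Everything here is an assumption made
for illustration: the 4 KiB page size, the threshold of two errors, the
fixed-size table, and the names ecc_action and ecc_handle_error. The
actual scrubbing, page migration, and event logging are left to the
caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT       12 /* assumes 4 KiB pages */
#define RETIRE_THRESHOLD 2  /* assumed: second error on a page counts as "repeatable" */
#define MAX_TRACKED      64 /* assumed size of the per-page error table */

typedef enum {
    ECC_SCRUBBED,     /* rewrite the corrected data and carry on */
    ECC_PAGE_RETIRED, /* migrate the contents and take the page out of service */
    ECC_KILL_OWNER    /* non-correctable: terminate only the affected application */
} ecc_action;

typedef struct page_err {
    uint64_t pfn;
    unsigned count;
    bool used;
} page_err;

static page_err g_table[MAX_TRACKED];

/* Find the entry for a page frame, claiming a free slot if it is new. */
static page_err *lookup(uint64_t pfn)
{
    page_err *free_slot = NULL;
    for (size_t i = 0; i < MAX_TRACKED; i++) {
        if (g_table[i].used && g_table[i].pfn == pfn)
            return &g_table[i];
        if (!g_table[i].used && free_slot == NULL)
            free_slot = &g_table[i];
    }
    if (free_slot != NULL) {
        free_slot->used = true;
        free_slot->pfn = pfn;
        free_slot->count = 0;
    }
    return free_slot; /* NULL when the table is full */
}

/* Decide what to do about an error the memory controller reported at phys_addr. */
ecc_action ecc_handle_error(uint64_t phys_addr, bool correctable)
{
    if (!correctable)
        return ECC_KILL_OWNER;

    page_err *e = lookup(phys_addr >> PAGE_SHIFT);
    if (e == NULL)
        return ECC_SCRUBBED; /* cannot track history: fall back to scrubbing */

    e->count++;
    return e->count >= RETIRE_THRESHOLD ? ECC_PAGE_RETIRED : ECC_SCRUBBED;
}
```

This depends on the memory controller being able to report the failing
physical address, which is the chipset-specific knowledge the OS would
need.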
Handling of these hardware errors is probably one of the differences
between NT and a higher end OS. I would almost bet money that an access to
a nonexistent bus address on an IBM mainframe causes some sort of error
processing to happen in the OS. It seems like Microsoft will have to
address this issue eventually. PCs have tended to gloss over these little
details, in the name of generic hardware compatibility.
- Jan
|