NMI Explained

From Jan Bottorff (this post was only available on Google via Cache. Since it was so impressively well-written and comprehensive, I've replicated it here):

There are a variety of hardware sources that can assert the NMI signal, including:

1) NMI signal on the ISA bus

2) assorted transfer errors on the PCI bus (it's not totally clear how these make it to the interrupt controller)

3) a watchdog timer often can assert NMI

4) the internal APIC can generate a processor NMI

5) processor support chipsets can assert NMI on memory parity errors and such (may invoke an SMM interrupt)

I believe when the OS gets an NMI, it has to go around looking for potential reasons by checking the hardware it knows about. The OS also probably doesn't know how to figure out some of the sources. For example, a watchdog timer expiration may come from proprietary hardware (which often asserts reset instead of NMI, because the system knows what to do). To understand memory errors, the OS would need to know details about the specific support chipset. Drivers also really should have an entry point for "did you cause this NMI error" and "please fix your NMI error". We could probably have a little religious war here about what should happen on a serious hardware fault, like a bus error. Almost always, if you access an I/O port address that isn't claimed by anybody, all that happens is the transfer is thrown away or filled with a dummy value. The hardware (especially PCI busses) knows something was wrong.
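To make the two driver entry points suggested above concrete, here is a minimal sketch in C of how an OS might poll registered sources when an NMI arrives. Everything in it is an assumption made for illustration: the `nmi_source` structure, the `caused_nmi` and `fix_nmi` hooks, `nmi_dispatch`, and the simulated `fake_device` are hypothetical names, not an interface that NT or any other OS is known to expose.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-driver NMI hooks (illustration only, not a real OS API). */
typedef struct nmi_source {
    const char *name;
    bool (*caused_nmi)(void *ctx); /* "did you cause this NMI error?" */
    bool (*fix_nmi)(void *ctx);    /* "please fix your NMI error"; true if recovered */
    void *ctx;                     /* driver-private state */
} nmi_source;

typedef enum { NMI_UNKNOWN, NMI_RECOVERED, NMI_FATAL } nmi_result;

/*
 * Ask every registered source whether it raised the NMI.
 * - No source claims it:            NMI_UNKNOWN
 * - Every claimant recovers:        NMI_RECOVERED
 * - Any claimant cannot recover:    NMI_FATAL (stop at the first failure)
 */
nmi_result nmi_dispatch(const nmi_source *sources, size_t count)
{
    nmi_result result = NMI_UNKNOWN;

    for (size_t i = 0; i < count; i++) {
        const nmi_source *s = &sources[i];
        assert(s->caused_nmi != NULL);

        if (!s->caused_nmi(s->ctx))
            continue; /* not this device */

        if (s->fix_nmi == NULL || !s->fix_nmi(s->ctx))
            return NMI_FATAL;

        result = NMI_RECOVERED;
    }
    return result;
}

/* Simulated device standing in for real hardware status registers. */
typedef struct {
    bool error_latched; /* device has an error pending */
    bool recoverable;   /* whether the simulated error can be cleared */
} fake_device;

bool fake_caused_nmi(void *ctx)
{
    return ((fake_device *)ctx)->error_latched;
}

bool fake_fix_nmi(void *ctx)
{
    fake_device *d = ctx;
    if (!d->recoverable)
        return false;
    d->error_latched = false; /* clear the latched error */
    return true;
}
```

A caller would build an array of `nmi_source` entries, one per driver, and pass it to `nmi_dispatch` from the NMI handler. The `NMI_UNKNOWN` result corresponds to the case described above where the OS cannot figure out the source; what to do then is exactly the policy question the "religious war" would be about.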

My belief is the OS really should have a way to interface with all these hardware devices. On a non-correctable memory error especially, I believe the OS could ask the memory controller the exact location of the failure, and terminate just the application with the bad memory, allowing everything else to keep running. For ECC memory, the OS really should trap on a correctable error and "scrub" the memory. If the error is repeatable, it should take that physical page out of service and migrate its corrected contents to a good page. It should also log the memory error to the event log. I don't know if NT is able to do any of this. I've never seen any messages in the event log saying it noticed a correctable memory error.
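The memory-error policy described above can be sketched as a small decision function in C. This is only an illustration under stated assumptions: `ecc_tracker`, `ecc_handle_error`, and the action names are hypothetical, the memory controller is assumed to report the failing physical page number, and "repeatable" is assumed to mean a second correctable error on the same page (`RETIRE_THRESHOLD`). A real handler would also perform the scrub, the page migration, and the event-log write, which are left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_TRACKED_PAGES 64
#define RETIRE_THRESHOLD  2 /* assumed: second error on a page means "repeatable" */

typedef enum {
    ECC_SCRUBBED,     /* correctable: rewrite the corrected data in place */
    ECC_PAGE_RETIRED, /* repeatable: migrate contents, take page out of service */
    ECC_KILL_OWNER    /* non-correctable: terminate only the owning application */
} ecc_action;

typedef struct {
    uint64_t page;            /* physical page number reported by the controller */
    unsigned corrected_count; /* correctable errors seen on this page */
    bool     retired;
} page_record;

typedef struct {
    page_record pages[MAX_TRACKED_PAGES];
    size_t      used;
} ecc_tracker;

/* Look up the record for a page, creating it if there is room. */
static page_record *find_or_add(ecc_tracker *t, uint64_t page)
{
    for (size_t i = 0; i < t->used; i++)
        if (t->pages[i].page == page)
            return &t->pages[i];

    if (t->used == MAX_TRACKED_PAGES)
        return NULL; /* table full */

    page_record *r = &t->pages[t->used++];
    r->page = page;
    r->corrected_count = 0;
    r->retired = false;
    return r;
}

/* Decide what to do about one reported memory error. */
ecc_action ecc_handle_error(ecc_tracker *t, uint64_t page, bool correctable)
{
    assert(t != NULL);

    if (!correctable)
        return ECC_KILL_OWNER;

    page_record *r = find_or_add(t, page);
    if (r == NULL)
        return ECC_SCRUBBED; /* cannot track history; still scrub */

    r->corrected_count++;
    if (r->corrected_count >= RETIRE_THRESHOLD) {
        r->retired = true;
        return ECC_PAGE_RETIRED;
    }
    return ECC_SCRUBBED;
}
```

Keeping the decision separate from the actions makes the policy easy to test without hardware: the caller feeds in (page, correctable) pairs and acts on the returned `ecc_action`.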

Handling of these hardware errors is probably one of the differences between NT and a higher-end OS. I would almost bet money that an access to a nonexistent bus address on an IBM mainframe causes some sort of error processing to happen in the OS. It seems like Microsoft will have to address this issue eventually. PCs have tended to gloss over these little details in the name of generic hardware compatibility.

- Jan