The Linux EDAC project comprises a series of Linux kernel modules that make use of the error detection facilities of computer hardware. Currently, hardware which detects the following errors is supported:

Oh dear, my laptop sound device seems to be broken! As you can see from the above, PCI error checking is turned off by default and needs to be turned on using the "echo" statement above.

Please check the possibilities listed here, and elsewhere on this wiki, before you either open a new bug report or post to the mailing list.

If you think you've found a bug, please search the EDAC Bugzilla to see if it has already been reported. If it has, you can add yourself to the cc list for that bug so that you are automatically informed of updates.

Most of the EDAC developers keep an eye on the EDAC mailing list hosted by Sourceforge to a greater or lesser extent. Please remember, though, that not many of them work on EDAC as part of their job, and those who do are paid to keep their employer's systems running. So check the Wiki, the bug database, and the mailing list archives (for both the current and the previous mailing lists) for your problem first.

If you have exhausted these possibilities, then by all means post to the mailing list. If you get a reply, or find out things which weren't known before, please add the information to this Wiki in order to help others.

There is a userspace API via sysfs in 2. If you want a more recent version than the one in your current kernel, you can download a quilt stack from the Sourceforge download page (see below), or get one by anonymous SVN checkout:

Prior to May, things can be found in CVS. See the Sourceforge main page for CVS information. You will need a recent Linux kernel tree to apply the patches to.

Or, if you just want to have a look at the recent changes, you can browse the SVN at:

The Bluesmoke code was created by Thayne Harbaugh.

Please see the individual driver pages for information on supported revisions, motherboard-specific information, etc.

You can help by working out the relationship for your hardware, and adding the info to the MemorySlotLabels page.

PCI parity error reporting facilities are included in the PCI specification, and the majority of add-in cards and chips, whether used in add-in or on-motherboard designs, support PCI parity error detection and reporting. The driver currently only supports error detection via polling.
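As a rough illustration of the sysfs userspace API, the sketch below reads a few EDAC error counters from userspace. The directory layout and attribute names (`mc/mcN/ce_count`, `mc/mcN/ue_count`, `pci/pci_parity_count`) and the default root path are assumptions based on common kernel versions and may differ on your system. The helper names `read_counter` and `read_edac_counts` are hypothetical.

```python
from pathlib import Path

# Assumed default location of the EDAC sysfs tree; adjust for your kernel.
EDAC_ROOT = Path("/sys/devices/system/edac")


def read_counter(path):
    """Return the integer stored in a sysfs attribute, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def read_edac_counts(root=EDAC_ROOT):
    """Collect memory-controller and PCI parity counters found under root."""
    root = Path(root)
    counts = {}
    # Assumed layout: one directory per memory controller (mc0, mc1, ...).
    for mc in sorted(root.glob("mc/mc[0-9]*")):
        counts[mc.name] = {
            "ce_count": read_counter(mc / "ce_count"),  # corrected errors
            "ue_count": read_counter(mc / "ue_count"),  # uncorrected errors
        }
    # Assumed attribute holding the number of PCI parity errors seen so far.
    counts["pci_parity_count"] = read_counter(root / "pci" / "pci_parity_count")
    return counts


if __name__ == "__main__":
    print(read_edac_counts())
```

Attributes that are missing or unreadable are reported as `None` rather than raising, so the same script can be run on machines where only some EDAC drivers are loaded.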

Polling all of the PCI devices' error status registers can be time consuming, especially on machines which have many devices.

In the interest of creating a common ground for discussion, terms and their definitions will be established.

Memory devices: The individual DRAM chips on a memory stick. These devices commonly output 4 and 8 bits each (x4, x8).

Memory stick: A printed circuit board that aggregates multiple memory devices in parallel.

Socket: A physical connector on the motherboard that accepts a single memory stick.

Channel: A memory controller channel, responsible for communicating with a group of DIMMs. Each channel has its own independent control (command) and data bus, and can be used independently or grouped with other channels.

Branch: Typically, it contains two channels. Two channels on the same branch can be used in single mode or in lockstep mode. When lockstep is enabled, the cacheline is doubled, but it generally brings some performance penalty. Also, it is generally not possible to point to just one memory stick when an error occurs, as the error correction code is calculated using two DIMMs instead of one. Due to that, it is capable of correcting more errors than in single mode.

Single-channel: The data accessed by the memory controller is contained in one DIMM only.

Double-channel: The data accessed by the memory controller is interlaced across two DIMMs, accessed at the same time.

Chip-select row: Common chip-select rows for single channel are 64 bits, for dual channel 128 bits. It may not be visible to the memory controller, as some DIMM types have a memory buffer that can hide direct access to it from the memory controller.

Single-ranked stick: A single-ranked stick has 1 chip-select row of memory. Motherboards commonly drive two chip-select pins to a memory stick. A single-ranked stick will occupy only one of those rows. The other will be unused.

Double-ranked stick: A double-ranked stick has two chip-select rows which access different sets of memory devices. The two rows cannot be accessed concurrently.

Double-sided stick: A double-sided stick has two chip-select rows which access different sets of memory devices.

Error-correcting code memory (ECC memory) is a type of computer data storage that can detect and correct the most common kinds of internal data corruption.

ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as for scientific or financial computing.

Typically, ECC memory maintains a memory system immune to single-bit errors: the data that is read from each word is always the same as the data that had been written to it, even if one of the bits actually stored has been flipped to the wrong state.
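To make the single-bit immunity concrete, here is a minimal sketch of a single-error-correcting, double-error-detecting (SECDED) code. It uses an extended Hamming(8,4) code on 4-bit values purely for illustration. Real ECC memory works on much wider words, and the source does not say which code any particular controller uses. The function names `encode` and `decode` are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (a list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]  # d[0] is the least significant bit
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2  # 0 when the total parity is still even
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume exactly one, located at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even total parity: two bits were flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of `encode(v)` and passing the result to `decode` returns `v` with the status `'corrected'`. Flipping any two bits yields `'uncorrectable'`, which mirrors the correctable and uncorrectable error counts that EDAC reports.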

ECC protects against undetected memory data corruption, and is used in computers where such corruption is unacceptable, for example in some scientific and financial computing applications, or in file servers. ECC also reduces the number of crashes, which are especially unacceptable in multi-user server applications and maximum-availability systems.

Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research has shown that the majority of one-off soft errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write to them.

As an example, the spacecraft Cassini—Huygenslaunched incontained two identical flight recorders, each with 2. Thanks to built-in EDAC functionality, spacecraft's engineering telemetry reported the number of correctable single-bit-per-word errors and uncorrectable double-bit-per-word errors. During the first 2. However, on November 6,during the first month in space, the number of errors increased by more than a factor of four for that single day.

This was attributed to a solar particle event that had been detected by the satellite GOES 9.

There was some concern that as DRAM density increases further, and thus the components on chips get smaller, while at the same time operating voltages continue to fall, DRAM chips will be affected by such radiation more frequently, since lower-energy particles will be able to change a memory cell's state.

Recent studies [6] show that single-event upsets due to cosmic radiation have been dropping dramatically with process geometry, and that previous concerns over increasing bit cell error rates are unfounded.

The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a crash or to corruption of data; in large-scale production sites, memory errors are one of the most common hardware causes of machine crashes.

A simulation study showed that, for a web browser, only a small fraction of memory errors caused data corruption, although, as many memory errors are intermittent and correlated, the effects of memory errors were greater than would be expected for independent soft errors.

Some tests conclude that the isolation of DRAM memory cells can be circumvented by unintended side effects of specially crafted accesses to adjacent cells. Thus, accessing data stored in DRAM causes memory cells to leak their charges and interact electrically, as a result of high cell density in modern memory, altering the content of nearby memory rows that actually were not addressed in the original memory access.

This effect is known as row hammer, and it has also been used in some privilege escalation computer security exploits.

In information theory and coding theory, with applications in computer science and telecommunication, error detection and correction (or error control) are techniques that enable reliable delivery of digital data over unreliable communication channels. Many communication channels are subject to channel noise, and thus errors may be introduced during transmission from the source to a receiver.

Error detection techniques allow detecting such errors, while error correction enables reconstruction of the original data in many cases. Error detection is the detection of errors caused by noise or other impairments during transmission from the transmitter to the receiver. Error correction is the detection of errors and reconstruction of the original, error-free data.

The modern development of error correction codes is credited to Richard Hamming. All error-detection and correction schemes add some redundancy to a message. Error-detection and correction schemes can be either systematic or non-systematic.

In a systematic scheme, the transmitter sends the original data and attaches a fixed number of check bits (or parity data), which are derived from the data bits by some deterministic algorithm. If only error detection is required, a receiver can simply apply the same algorithm to the received data bits and compare its output with the received check bits; if the values do not match, an error has occurred at some point during the transmission.
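The systematic approach can be sketched in a few lines. The example below assumes a very simple deterministic algorithm, a single check byte formed by XOR-ing all data bytes together. This choice is illustrative and not taken from the source. The names `check_byte`, `transmit` and `verify` are hypothetical.

```python
from functools import reduce


def check_byte(data: bytes) -> int:
    """Derive one check byte from the data by XOR-ing all bytes together."""
    return reduce(lambda acc, byte: acc ^ byte, data, 0)


def transmit(data: bytes) -> bytes:
    """Systematic encoding: the original data is sent unchanged, check byte appended."""
    return data + bytes([check_byte(data)])


def verify(frame: bytes) -> bool:
    """Receiver side: recompute the check byte and compare it with the received one."""
    if not frame:
        return False
    data, received = frame[:-1], frame[-1]
    return check_byte(data) == received
```

A frame produced by `transmit` passes `verify`, while the same frame with any single bit flipped fails it. Some multi-bit errors cancel out under XOR and go unnoticed, which is one reason stronger check algorithms are used in practice.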

In a system that uses a non-systematic code, the original message is transformed into an encoded message carrying the same information and that has at least as many bits as the original message. Good error control performance requires the scheme to be selected based on the characteristics of the communication channel.

Common channel models include memoryless models where errors occur randomly and with a certain probability, and dynamic models where errors occur primarily in bursts.
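The difference between the two kinds of channel model can be illustrated with a small simulation. The sketch below assumes a binary symmetric channel as the memoryless model and a two-state (Gilbert-Elliott style) model for bursts. Both are common textbook choices that the source does not name, and the function and parameter names are hypothetical.

```python
import random


def memoryless_errors(n, p, rng):
    """Memoryless model: each of n bits is hit independently with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def bursty_errors(n, p_enter_bad, p_leave_bad, p_err_bad, rng):
    """Two-state model: errors only occur while the channel is in its 'bad' state."""
    errors = []
    bad = False
    for _ in range(n):
        if bad:
            errors.append(1 if rng.random() < p_err_bad else 0)
            if rng.random() < p_leave_bad:
                bad = False  # the burst ends
        else:
            errors.append(0)
            if rng.random() < p_enter_bad:
                bad = True   # a burst begins with the next bit
    return errors


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so repeated runs give the same pattern
    print("".join(map(str, memoryless_errors(60, 0.1, rng))))
    print("".join(map(str, bursty_errors(60, 0.05, 0.3, 0.8, rng))))
```

In the printed patterns, a 1 marks a corrupted bit. With parameters like these, the first model tends to scatter errors, while the second tends to cluster them into runs.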

Some codes can also be suitable for a mixture of random errors and burst errors. If the channel characteristics cannot be determined, or are highly variable, an error-detection scheme may be combined with a system for retransmissions of erroneous data.

This is known as automatic repeat request (ARQ), and is most notably used in the Internet. An alternate approach for error control is hybrid automatic repeat request (HARQ), which is a combination of ARQ and error-correction coding.

There are three major types of error correction.

An acknowledgment is a message sent by the receiver to indicate that it has correctly received a data frame. Usually, when the transmitter does not receive the acknowledgment before the timeout occurs, it retransmits the frame.

ARQ is appropriate if the communication channel has varying or unknown capacity, such as is the case on the Internet. However, ARQ requires the availability of a back channel, results in possibly increased latency due to retransmissions, and requires the maintenance of buffers and timers for retransmissions, which in the case of network congestion can put a strain on the server and overall network capacity.
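The retransmission logic can be sketched as a stop-and-wait loop. This is a simplified model, not a description of any particular protocol: each attempt is reduced to a single boolean meaning "the frame and its acknowledgment both arrived before the timeout", and sequence numbers and duplicate handling are left out. The name `stop_and_wait` and its parameters are hypothetical.

```python
def stop_and_wait(frames, round_trip_ok, max_tries=5):
    """Send frames one at a time, retransmitting until each is acknowledged.

    round_trip_ok is an iterator of booleans, one per transmission attempt:
    True means the acknowledgment came back before the timeout.
    Returns (delivered_frames, total_transmissions).
    """
    delivered = []
    transmissions = 0
    for frame in frames:
        for _ in range(max_tries):
            transmissions += 1
            if next(round_trip_ok):
                delivered.append(frame)  # acknowledged, move on to the next frame
                break
            # No acknowledgment before the timeout: loop around and retransmit.
        else:
            raise RuntimeError(f"gave up on frame {frame!r} after {max_tries} tries")
    return delivered, transmissions
```

For example, sending three frames over a channel whose attempts succeed in the pattern `True, False, True, False, False, True` delivers all three frames in order using six transmissions. The extra transmissions are the latency and bandwidth cost that ARQ pays for reliability.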

Forward error correction (FEC) is a process of adding redundant data, such as an error-correcting code (ECC), to a message so that it can be recovered by a receiver even when a number of errors (up to the capability of the code being used) were introduced, either during transmission or in storage. Since the receiver does not have to ask the sender for retransmission of the data, a back channel is not required in forward error correction, and it is therefore suitable for simplex communication such as broadcasting.
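As a minimal sketch of forward error correction, the example below uses a repetition code: every bit is sent `n` times and the receiver takes a majority vote. The repetition code is chosen only because it is the simplest possible example; the source does not single it out. The names `fec_encode` and `fec_decode` are hypothetical.

```python
def fec_encode(bits, n=3):
    """Repeat every bit n times (n should be odd so that votes cannot tie)."""
    return [bit for bit in bits for _ in range(n)]


def fec_decode(coded, n=3):
    """Recover each bit by majority vote over its block of n received copies."""
    return [
        1 if 2 * sum(coded[i:i + n]) > n else 0
        for i in range(0, len(coded), n)
    ]
```

With `n=3`, the receiver recovers the original bits without any retransmission as long as at most one copy in each block of three is corrupted. The price is a threefold increase in transmitted data, which is why practical systems use more efficient codes.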

Error-correcting codes are frequently used in lower-layer communication, as well as for reliable storage in media such as CDs, DVDs, hard disks, and RAM.

Error-correcting codes are usually distinguished between convolutional codes and block codes.

Shannon's theorem is an important theorem in forward error correction, and describes the maximum information rate at which reliable communication is possible over a channel that has a certain error probability or signal-to-noise ratio (SNR).

This strict upper limit is expressed in terms of the channel capacity. More specifically, the theorem says that there exist codes such that with increasing encoding length the probability of error on a discrete memoryless channel can be made arbitrarily small, provided that the code rate is smaller than the channel capacity.

The actual maximum code rate allowed depends on the error-correcting code used, and may be lower. This is because Shannon's proof was only of existential nature, and did not show how to construct codes which are both optimal and have efficient encoding and decoding algorithms.
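The capacity limit can be computed directly for two standard channel models. The sketch below assumes the binary symmetric channel, whose capacity is 1 - H(p) bits per channel use for bit-error probability p, and the band-limited Gaussian channel, whose capacity follows the Shannon-Hartley formula B * log2(1 + SNR). Neither model is named in the text above, and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary event that occurs with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bsc_capacity(p):
    """Capacity (bits per channel use) of a binary symmetric channel."""
    return 1.0 - binary_entropy(p)


def awgn_capacity(bandwidth_hz, snr_linear):
    """Shannon-Hartley capacity in bits per second; the SNR is linear, not in dB."""
    return bandwidth_hz * math.log2(1 + snr_linear)


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.1, 0.5):
        print(f"p = {p}: reliable code rates must stay below {bsc_capacity(p):.3f}")
```

A channel that flips bits half of the time has capacity zero, so no code rate is achievable, while an error-free channel has capacity one bit per use.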

There are two basic approaches [5]. The latter approach is particularly attractive on an erasure channel when using a rateless erasure code.

Error detection is most commonly realized using a suitable hash function (or specifically, a checksum, cyclic redundancy check, or other algorithm). A hash function adds a fixed-length tag to a message, which enables receivers to verify the delivered message by recomputing the tag and comparing it with the one provided.
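The tag-and-verify pattern can be sketched with a cyclic redundancy check. The example uses the CRC-32 function from Python's standard `zlib` module as the tag algorithm. Appending the tag as four big-endian bytes is an assumption made for this sketch, and the names `tag_message` and `check_message` are hypothetical.

```python
import zlib


def tag_message(message: bytes) -> bytes:
    """Append a fixed-length (4-byte) CRC-32 tag to the message."""
    return message + zlib.crc32(message).to_bytes(4, "big")


def check_message(frame: bytes):
    """Recompute the tag; return the message if it matches, otherwise None."""
    if len(frame) < 4:
        return None
    message, tag = frame[:-4], frame[-4:]
    if zlib.crc32(message).to_bytes(4, "big") != tag:
        return None
    return message
```

A frame built by `tag_message` is returned intact by `check_message`, while a frame with a flipped bit is rejected. A CRC only detects accidental corruption and offers no protection against deliberate tampering.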

