Maker Pro

EEPROM checksum error

Robert Monsen

Jan 1, 1970
Charles said:
For you, this might be a design issue. As power ramps up/ramps down,
different components react differently. Some have internal caps that
make them hold state a little longer than others, or are just more
sensitive to power supply levels. Think about the programming sequence.
What could provide it in your circuit? What could PREVENT it in your
circuit?

I have often found that startup conditions are not fully considered in
design. You just assume that the power comes up all at once, smoothly.
In reality, different voltage rails come up differently. Filter caps
take time to charge up to voltage. Good design takes that into account,
sometimes adding POR (power-on reset) circuits to make sure that power
is steady before starting things up, and quick shutdown sequences to
turn everything off before the power goes below limits. It's like
preventing race conditions and logic glitches. Sometimes, you just have
to take a good look at the failure modes...

Typically, parallel EEPROMs have active-low write enable and chip select
pins. If those pins are slow to come up, I guess it can cause problems.

However, one typical example is the Atmel parallel EEPROMs. They require
a 0xAA, followed by a 0x55, to be written to special addresses before
they'll go into write-enabled mode. The address pins need to be set to
0x1555 for the 0xAA, and 0x0AAA for the 0x55. Getting this to happen
because of a startup glitch seems incredibly far-fetched.

There is also a set of hardware features that protect against
inadvertent writes. The AT28BV64B has a power-on delay of 10 ms after Vcc
comes up; thus, these glitches would have to come at least 10 ms after
power-up. If OE is low, writes are inhibited, and if either CE or WE is
high, writes are inhibited; thus, the glitch would have to pull both CE
and WE low while leaving OE high. Also, pulses of less than 15 ns on
either WE or CE won't initiate a write cycle. After this dance, one has
to go through the software programming sequence to get it to really go
into write mode.

Again, startup writes to these things sound incredibly far-fetched.
Perhaps after trillions of power cycles inducing random noise (1,000,000
monkeys?).
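
A rough back-of-envelope check of that guess, as a sketch only. It assumes (purely for illustration, none of this is from the datasheet) that every spurious write cycle carries a uniformly random address and data byte, that the part has 13 address lines (8K x 8), and that only the two-write unlock described above has to be hit by chance:

```python
# Back-of-envelope odds of noise reproducing the two-write unlock sequence.
# Assumptions (illustrative, not datasheet values): each spurious write cycle
# carries a uniformly random address and data byte, and cycles are independent.
ADDRESS_BITS = 13   # 8K x 8 part, so 13 address lines (assumed)
DATA_BITS = 8

# Chance that one random write hits one specific (address, data) pair.
p_one_write = 1 / 2 ** (ADDRESS_BITS + DATA_BITS)

# Two specific writes in a row: 0xAA to 0x1555, then 0x55 to 0x0AAA.
p_unlock = p_one_write ** 2

# Average number of random two-write bursts before one matches.
expected_bursts = 1 / p_unlock

print(f"p(unlock per burst) = {p_unlock:.3e}")
print(f"expected bursts     = {expected_bursts:.3e}")
```

Under those assumptions the expected count is 2^42, a few trillion bursts, and that is before the power-on delay and the OE/CE/WE inhibits are taken into account.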

I'm guessing your example EEPROM wasn't one of these, and was protected
in some other way. Either that, or the chip firmware was flawed, and the
sequence above could be circumvented in some situations.

The OP hasn't indicated what EEPROM he is using, or what kind of usage
(whether it's getting programmed in the field or not). Thus, we are all
speculating without any real information.

--
Regards,
Robert Monsen

"Your Highness, I have no need of this hypothesis."
- Pierre Laplace (1749-1827), to Napoleon,
on why his works on celestial mechanics make no mention of God.
 
Dummy

Jan 1, 1970
An Atmel AT25128 EEPROM is used in the radio. The radio was programmed
in the field, but the EEPROM checksum error did not appear right after
programming. It happened during normal operation of the radio. Of
course, the error message won't appear while the radio is operating,
because the checksum is only checked at radio startup. The EEPROM is
both read and written while the radio is operating.
 
Robert Monsen

Jan 1, 1970
OK, this is an SPI serial EEPROM. It has various levels of hardware
protection, but it also has a software write disable command. You first
enable writes with a write enable command, make your change, and then
disable writes again. Are you disabling it after each write?

The SPI interface is such that if there is noise on the clock pin, it's
conceivable that, if you left the thing selected and in write mode,
it could write a random byte. The write opcode is the cleverly chosen
0000x010. Clocking the thing enough times to clock in the address and
data (then raising CS at just the right time) might cause problems.

However, I'm still guessing it is a software problem: you are writing
the bad data. You can prove this to yourself by keeping a shadow copy of
the data elsewhere in the EEPROM, assuming there is enough room. If it's
bad in both places, you are writing it.

Also, disable writes after you change the data, if you aren't already
doing it.
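
The enable, write, disable sequence can be sketched against a toy model. Everything here is an assumption for illustration: `FakeSpiEeprom` and `guarded_write` are hypothetical names, one `transfer()` call stands for one CS-low frame, and the model skips the status polling a real part needs while its write cycle completes. Check the datasheet for how the real device handles its write-enable latch.

```python
# Simplified model of an SPI EEPROM's write-enable latch, for illustration.
WREN, WRDI, READ, WRITE = 0x06, 0x04, 0x03, 0x02  # usual 25-series opcodes


class FakeSpiEeprom:
    """Hypothetical stand-in for the real part: one call = one CS-low frame."""

    def __init__(self, size=16384):
        self.mem = bytearray([0xFF] * size)
        self.write_enabled = False  # write-enable latch, cleared at power-up

    def transfer(self, frame):
        opcode = frame[0]
        if opcode == WREN:
            self.write_enabled = True
        elif opcode == WRDI:
            self.write_enabled = False
        elif opcode == WRITE and self.write_enabled:
            addr = (frame[1] << 8) | frame[2]
            self.mem[addr] = frame[3]
        elif opcode == READ:
            addr = (frame[1] << 8) | frame[2]
            return self.mem[addr]
        return None  # a WRITE with the latch clear is silently ignored


def guarded_write(dev, addr, value):
    """Enable, write one byte, disable again, then verify by reading back."""
    hi, lo = (addr >> 8) & 0xFF, addr & 0xFF
    dev.transfer([WREN])
    dev.transfer([WRITE, hi, lo, value])
    dev.transfer([WRDI])  # never leave the latch set between writes
    return dev.transfer([READ, hi, lo]) == value


dev = FakeSpiEeprom()
dev.transfer([WRITE, 0x00, 0x10, 0x42])   # stray write while the latch is clear
stray_blocked = dev.mem[0x10] == 0xFF     # the model ignored it
ok = guarded_write(dev, 0x0010, 0x42)     # the guarded write goes through
```

The point of the sketch is the ordering: with the latch cleared between writes, a stray write frame has nothing to act on.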

One other idea is that if you continually rewrite the same cell, it can
run out of lives. The cells will only last for a certain (large) number
of writes. Flash file systems typically rotate the flash pages to
prevent this sort of problem. If, for example, you are putting the
checksum in the same place, and rewriting it again and again on a moment
to moment basis, you can cause the cell to fail.
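
To get a feel for the numbers, here is a small worked estimate. Both figures are assumptions picked for illustration, not values from the datasheet or from the radio's firmware:

```python
# Rough lifetime of one EEPROM cell that is rewritten at a fixed rate.
# Both figures below are assumptions for illustration, not datasheet values.
ENDURANCE_CYCLES = 100_000      # assumed rated write cycles per cell
WRITES_PER_HOUR = 60            # assumed: checksum rewritten once a minute

hours_to_wear_out = ENDURANCE_CYCLES / WRITES_PER_HOUR
days_to_wear_out = hours_to_wear_out / 24

# Rotating the record across N slots spreads the wear N ways.
SLOTS = 16
days_with_rotation = days_to_wear_out * SLOTS

print(f"single location: {days_to_wear_out:.0f} days")
print(f"{SLOTS} rotating slots: {days_with_rotation:.0f} days")
```

With these made-up figures a single location is used up after roughly 69 days of continuous operation, which is why the write rate matters more than the size of the endurance rating.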

 
Pooh Bear

Jan 1, 1970
So the problem most likely happens when the user powers off the radio whilst the EEPROM is being
written to! The supply rails fall in an uncontrolled manner.

That creates questionable data and hence a checksum error.

I got over this one by using multiple writes to different locations and comparing contents.


Graham
 
Rich Webb

Jan 1, 1970
Or, write to the EEPROM as a circular buffer (assuming that the length
of the information to be stored is less than half the EEPROM length). On
bootup, fetch the highest indexed data block that has a valid checksum.
If the user does a store-to-EEPROM and then quickly turns off the power,
the startup values revert to the last good set.
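
A minimal sketch of that circular-buffer idea, using a `bytearray` as a stand-in for the EEPROM. The slot layout (sequence number, payload, checksum), the sizes, and all function names are assumptions made up for the example. The inverted byte sum is only there to keep it short, and sequence-number wraparound is ignored.

```python
import struct

PAYLOAD_SIZE = 8
SLOT_SIZE = 4 + PAYLOAD_SIZE + 2    # sequence number + payload + checksum
NUM_SLOTS = 4


def checksum(body):
    """Inverted 16-bit byte sum; a CRC would be a stronger choice in practice."""
    return ~sum(body) & 0xFFFF


def pack_slot(seq, payload):
    body = struct.pack("<I", seq) + payload
    return body + struct.pack("<H", checksum(body))


def unpack_slot(raw):
    """Return (seq, payload), or None if the checksum does not match."""
    body = raw[:-2]
    (stored,) = struct.unpack("<H", raw[-2:])
    if checksum(body) != stored:
        return None
    (seq,) = struct.unpack("<I", body[:4])
    return seq, body[4:]


def load_latest(eeprom):
    """Scan every slot; keep the valid one with the highest sequence number."""
    best = None
    for i in range(NUM_SLOTS):
        rec = unpack_slot(bytes(eeprom[i * SLOT_SIZE:(i + 1) * SLOT_SIZE]))
        if rec is not None and (best is None or rec[0] > best[0]):
            best = rec
    return best


def store(eeprom, payload):
    """Write the next record into the next slot, leaving older slots intact."""
    latest = load_latest(eeprom)
    seq = 0 if latest is None else latest[0] + 1
    slot = seq % NUM_SLOTS
    eeprom[slot * SLOT_SIZE:(slot + 1) * SLOT_SIZE] = pack_slot(seq, payload)


eeprom = bytearray(b"\xff" * SLOT_SIZE * NUM_SLOTS)   # erased device
store(eeprom, b"settingA")
store(eeprom, b"settingB")
# Simulate power loss partway through the third write: slot 2 is half written.
torn = pack_slot(2, b"settingC")
eeprom[2 * SLOT_SIZE:2 * SLOT_SIZE + 6] = torn[:6]
recovered = load_latest(eeprom)    # falls back to the last complete record
```

Because a new record never overwrites the most recent good one, an interrupted write costs only the newest update.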
 
Dummy

Jan 1, 1970
How do I introduce glitches onto the supply line?
A pulse generator can be used, but it doesn't have enough current
capability to power the radio. I coupled the pulse through a 0.1 uF
cap onto the supply line, but the amplitude was greatly reduced when
measured at the radio's supply line.
 
Fred Bloggs

Jan 1, 1970
He needs to keep a read-verify table in EEPROM that records that each
write was verified by read-back. Then a checksum error does not fail the
power-on self-test when the corresponding verify bit was not set. He can
encode the verify record any number of ways to put the probability of a
false verify bit at 1e-24 if he wants.
 
Fred Bloggs

Jan 1, 1970
Robert said:
* A company I used to work for decided to save 10 cents a RAM chip and
forgo individual testing of the chips by the manufacturer. Sadly, it
turned out that those chips were bad 5 to 10 percent of the time. They
were selling high availability purple ethernet switches for hundreds of
thousands of dollars each. The engineer responsible was of course
promoted to VP, and given vast new responsibilities.

Isn't that typical? And we're supposed to feel sorry for these people
losing their jobs to overseas operations? No way. May they die a slow
death and never return. The US blew it.
 
Robert Monsen

Jan 1, 1970
Right, there are also error-correcting codes that can be used. Total
redundancy, with three copies, is the ultimate in error-correcting
codes. I believe I recall that the space shuttle software uses that
scheme to verify results; however, they use 3 independently
designed and tested systems, and a voting scheme.
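
For stored data, the three-copy idea can be sketched as a 2-of-3 vote. This is only an illustration of voting among three EEPROM copies, not a description of the shuttle's scheme; the function name `vote` and the sample bytes are made up for the example.

```python
def vote(a, b, c):
    """Bitwise 2-of-3 majority over three equal-length copies."""
    out = bytearray()
    for x, y, z in zip(a, b, c):
        # A bit is set in the result if at least two copies have it set.
        out.append((x & y) | (x & z) | (y & z))
    return bytes(out)


good = b"\x12\x34\x56\x78"
corrupt = b"\x12\xff\x56\x78"   # one copy damaged by an interrupted write
repaired = vote(good, corrupt, good)
```

The vote recovers the data as long as any given bit is wrong in at most one of the three copies, at the cost of three times the storage and three times the writes.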

 