On one large production deployment, about 0.002% of devices somehow didn't complete a 128 KB write to the U-Boot environment block: the final 32 KB of the block was all zeros. Redundant U-Boot environment blocks make this non-fatal, but you still lose a change. Without a redundant U-Boot environment, you'll be stuck, since Nerves.Runtime.KVBackend.UBootEnv always requires a successful read before making a change.
This issue is actually easy to detect if the A/B partition choice is stored in the U-Boot environment block, since a firmware update will see something unexpected and fail. Unfortunately, the error message isn't great right now: for us it shows up as an incorrect product/architecture error.
The workaround is to write the cached key-value data back to the U-Boot environment block:
UBootEnv.write(Nerves.Runtime.KV.get_all())
There should be a better way of detecting and correcting this automatically, since when the U-Boot environment isn't right, the device is in a fragile place where the wrong thing could happen on the next reboot. If not using redundant environment blocks, I'd expect this to brick the device due to such important info being lost.
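As a starting point for automatic detection, here is a minimal sketch of a check that could run against the raw bytes of an environment block. It assumes the common non-redundant layout (a 4-byte little-endian CRC32 followed by the data it covers); a redundant block has an extra flags byte after the CRC that would need to be skipped. The module and function names (EnvBlockCheck, crc_valid?/1, zero_tail?/2) are hypothetical and not part of Nerves.Runtime or UBootEnv.

```elixir
defmodule EnvBlockCheck do
  @moduledoc false

  # Assumption: non-redundant layout, i.e. a 4-byte little-endian CRC32
  # followed by the payload that the CRC covers.
  def crc_valid?(<<expected::little-32, payload::binary>>) do
    :erlang.crc32(payload) == expected
  end

  # Anything shorter than the CRC header cannot be a valid block.
  def crc_valid?(_other), do: false

  # True when the last `tail_size` bytes of the block are all zeros, which
  # is the symptom seen on the affected devices (a zeroed final 32 KB).
  def zero_tail?(block, tail_size)
      when is_binary(block) and is_integer(tail_size) and tail_size > 0 and
             byte_size(block) >= tail_size do
    tail = binary_part(block, byte_size(block) - tail_size, tail_size)
    tail == :binary.copy(<<0>>, tail_size)
  end

  def zero_tail?(_block, _tail_size), do: false
end
```

If crc_valid?/1 fails on the block while Nerves.Runtime.KV still holds good cached data, that would be the moment to apply the workaround above. How the raw block is read (device path, offset, size) depends on the system's configuration and is left out here.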