Fixing intermittent invalid checksum messages #160
# Problem

The ESP32 starts reading UART mid-message after boot. Since the LG CN-REMO protocol has no start-of-frame marker, the 13-byte framing window gets permanently shifted. This causes checksum errors intermittently.

## Fix: Two changes

### 1: Startup re-flush (prevents the problem)

Added a `startup_flush_done_` member variable and a guard at the top of `update()` that re-flushes the UART buffer on the first call. This clears bytes that accumulated during the 10-second setup-to-update gap (root cause).

### 2: Sliding window (recovers from misalignment)

On checksum failure, instead of discarding all 13 bytes, drops the oldest byte and slides the remaining 12 bytes left. The next incoming byte completes a new 13-byte window for immediate re-evaluation. Converges within a single `update()` cycle since multiple bytes are typically buffered.

Uses the existing `calc_checksum()` function, so no checksum logic is duplicated.
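The startup re-flush in change 1 can be pictured with a minimal, self-contained sketch. Only `startup_flush_done_` and `update()` come from the description above; `FakeUart`, `Controller`, and the byte counter are hypothetical stand-ins, not the ESPHome UART API or the component's real class, and the actual guard in the PR may look different.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Hypothetical stand-in for the UART receive buffer (not the ESPHome API).
struct FakeUart {
  std::deque<uint8_t> rx;
  bool available() const { return !rx.empty(); }
  uint8_t read() {
    uint8_t b = rx.front();
    rx.pop_front();
    return b;
  }
};

// Hypothetical controller; only update() and startup_flush_done_ are named in the PR.
class Controller {
 public:
  explicit Controller(FakeUart& uart) : uart_(uart) {}

  void update() {
    if (!startup_flush_done_) {
      // First call only: drop everything that piled up between setup and now.
      while (uart_.available()) uart_.read();
      startup_flush_done_ = true;
    }
    // Stand-in for normal message processing: just count the bytes consumed.
    while (uart_.available()) {
      uart_.read();
      processed_++;
    }
  }

  std::size_t processed() const { return processed_; }

 private:
  FakeUart& uart_;
  bool startup_flush_done_ = false;
  std::size_t processed_ = 0;
};
```

Bytes that were already buffered before the first `update()` are thrown away; anything arriving afterwards goes through the normal path.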
|
My LG system is currently broken for unrelated reasons, so I can't test this myself at the moment, but if this works, that's great to see! Interestingly, I specifically see the checksum problem on my LMN079HVT indoor units, but do not see the problem on my LSN090HSV5 and KNSAL091A indoor units.
|
I have a few units, but two of them, MT11R.NU1 and MT09R.NU1, refused to work until I fixed the buffer start logic. I've tested it for a couple of days so far and it seems to be OK. I hope the code change is generally non-destructive and doesn't change anything for other unit types. One other small concern: it would probably be good to fix those deprecation warnings at some point, before things break in about 6-7 months. I've looked into dynamic hiding of unused entities, and the easiest fix might be to use set_disabled_by_default for all non-supported device capabilities. It won't hide them as nicely as is done now, but at least it would not keep them floating around and would move them to the disabled entities list. I haven't tried it myself and wanted to check whether this is something you have considered as a potential fix. Another alternative is to call register_sensor dynamically for all capabilities that are supported, but that feels like a lot more work.
|
My LG system has been repaired, so I tried this out. From the description, I thought that the "Startup re-flush" would fix the original problem, but it looks like I'm seeing lots of "Checksum mismatch" messages after startup. In some cases there are 13 "Checksum mismatch" messages in a row. Since the expected messages are 13 bytes, I don't know whether that's a bug of some sort, although at a quick glance the PR seems reasonable to me.
When there is a checksum mismatch, it may be useful to log the actual message. In my case, I can't tell what data is being discarded, whereas before, the message with a bad checksum would be logged.
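As a rough illustration of that suggestion, a helper along these lines could render the discarded bytes in the same `XX.XX (n)` style as the existing "received" log lines. `format_hex` is a hypothetical name for this sketch; the component may already have its own formatter.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats bytes as "00.F9.A8 (3)", mirroring the
// style of the "received ..." log lines quoted in this thread.
std::string format_hex(const uint8_t* data, std::size_t len) {
  std::string out;
  char tmp[4];
  for (std::size_t i = 0; i < len; i++) {
    std::snprintf(tmp, sizeof(tmp), "%02X", static_cast<unsigned>(data[i]));
    if (i > 0) out += '.';
    out += tmp;
  }
  out += " (" + std::to_string(len) + ")";
  return out;
}
```

The resulting string could then be passed to whatever warning macro the mismatch path already uses, so the dropped window stays visible in the logs.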
|
I went back to JanM321:main, took a look at my logs, and see:
[00:19:58.373][D][lg-controller:954]: received 00.00.00.00.00.00.00.00.00.00.00.00.00 (13)
I assume that with pkhodak:main, such padding messages are the source of the 13 consecutive "Checksum mismatch" messages, so it would also be useful to account for such messages, rather than warning and shifting 13 times.
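One possible way to account for such padding, sketched here with a hypothetical helper `is_all_zero` that is not part of the PR, is to test a full 13-byte window for all zeros before running the checksum, and to discard it in one step instead of sliding byte by byte.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when every byte in the window is 0x00,
// as in the all-zero 13-byte message seen in the log above.
bool is_all_zero(const uint8_t* buf, std::size_t len) {
  return std::all_of(buf, buf + len, [](uint8_t b) { return b == 0; });
}
```

A caller could check `is_all_zero(buffer, 13)` first and reset its buffer quietly when it returns true, so that only non-padding windows reach the warning and shift path.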
|
I'm working on https://github.com/kchen/esphome-lg-controller/tree/handle-corrupted-messages, which may be a better fix to resolve the problem in the original description. I'm not quite ready for a PR yet (planning to add one more feature and get some more testing in), but I expect to create a PR in the next few days.
Problem
The ESP32 starts reading UART mid-message after boot. Since the LG CN-REMO protocol has no start-of-frame marker, the 13-byte framing window gets permanently shifted. This causes checksum errors intermittently. I have 3 LG indoor units with wall controls running in slave mode, and 2 of them were nearly always going into an infinite loop of messages like these:
[10:56:54.029][D][lg-controller:954]: received 00.00.00.00.00.00.00.00.00.00.F9.A8.20 (13)
[10:56:54.029][E][lg-controller:964]: invalid checksum 00.00.00.00.00.00.00.00.00.00.F9.A8.20 (13)
[10:56:54.033][D][lg-controller:954]: received 00.00.00.00.25.14.00.00.00.00.54.A8.20 (13)
[10:56:54.036][E][lg-controller:964]: invalid checksum 00.00.00.00.25.14.00.00.00.00.54.A8.20 (13)
[10:56:54.040][D][lg-controller:954]: received 00.00.00.00.25.14.00.00.00.00.54.AC.00 (13)
[10:56:54.044][E][lg-controller:964]: invalid checksum 00.00.00.00.25.14.00.00.00.00.54.AC.00 (13)
[10:56:54.047][D][lg-controller:954]: received 00.00.00.00.00.00.00.00.00.00.F9.AC.00 (13)
[10:56:54.051][E][lg-controller:964]: invalid checksum 00.00.00.00.00.00.00.00.00.00.F9.AC.00 (13)
[10:57:00.026][D][lg-controller:1360]: update
The fix below sorted it out and the messages are typically synced within a few cycles.
Fix: Two changes
1: Startup re-flush (prevents the problem)
Added a `startup_flush_done_` member variable and a guard at the top of `update()` that re-flushes the UART buffer on the first call. This clears bytes that accumulated during the 10-second setup-to-update gap (root cause).
2: Sliding window (recovers from misalignment)
On checksum failure, instead of discarding all 13 bytes, drops the oldest byte and slides the remaining 12 bytes left. The next incoming byte completes a new 13-byte window for immediate re-evaluation. Converges within a single `update()` cycle since multiple bytes are typically buffered.
Uses the existing `calc_checksum()` function, so no checksum logic is duplicated.
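The sliding-window recovery can be sketched in isolation as follows. This is a simplified model, not the PR's code: `FrameSync`, `Frame`, and `decode` are hypothetical names, and the checksum shown (sum of the first 12 bytes, XOR `0x55`) is an assumption that happens to agree with the bytes in the logs above (for example, `AC.00.00.00.00.00.00.00.00.00.00.00.F9` validates, and the logged window `00.00.00.00.00.00.00.00.00.00.F9.AC.00` is that frame read two bytes late). The component's real `calc_checksum()` may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kMsgLen = 13;
using Frame = std::array<uint8_t, kMsgLen>;

// ASSUMED checksum: sum of the first 12 bytes, XOR 0x55.
uint8_t calc_checksum(const uint8_t* buf) {
  uint32_t sum = 0;
  for (std::size_t i = 0; i < kMsgLen - 1; i++) sum += buf[i];
  return static_cast<uint8_t>((sum & 0xFF) ^ 0x55);
}

// Hypothetical resynchronizer: accepts one byte at a time.
class FrameSync {
 public:
  // Returns true and fills `out` when the window holds a checksum-valid frame.
  bool feed(uint8_t byte, Frame& out) {
    buf_[len_++] = byte;
    if (len_ < kMsgLen) return false;  // window not full yet
    if (calc_checksum(buf_.data()) == buf_[kMsgLen - 1]) {
      out = buf_;
      len_ = 0;  // aligned: start a fresh window
      return true;
    }
    // Mismatch: drop the oldest byte and slide the other 12 left,
    // so the next incoming byte completes a new 13-byte window.
    std::memmove(buf_.data(), buf_.data() + 1, kMsgLen - 1);
    len_ = kMsgLen - 1;
    return false;
  }

 private:
  Frame buf_{};
  std::size_t len_ = 0;
};

// Convenience wrapper: runs a whole byte stream through FrameSync.
std::vector<Frame> decode(const std::vector<uint8_t>& stream) {
  FrameSync sync;
  std::vector<Frame> frames;
  Frame f{};
  for (uint8_t b : stream) {
    if (sync.feed(b, f)) frames.push_back(f);
  }
  return frames;
}
```

Feeding a stream that begins partway through a message lets `decode` lock onto the next frame boundary after a handful of one-byte shifts, rather than staying shifted indefinitely. An 8-bit checksum can also match a misaligned window by coincidence, so a match is strong evidence of alignment but not proof.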