Comment by thrdbndndn


I don't get the fix.

Why would reading it multiple times fix the issue?

Is it just because reading takes time, so reading multiple times lets enough time pass between the write and the read? If so, it sounds like a worse solution than just extending the waiting delay, like the author did initially.

If not, then I would like to know the reason.

(Needless to say, a great article!)

The article says that the buggy timer has 2 different methods for reading.

When reading directly, the value may be completely wrong, because the timer increments continuously and the updating of its bits is not synchronized with the read signal. Therefore any bit in the value that is read may be wrong, because it may have been sampled exactly during a transition between valid values.

The workaround in this case is to read multiple times and accept as good a value that is approximately the same for multiple reads. The more significant bits of the timer value change much less frequently than the least significant bits, so on most read attempts only a few bits can be wrong. Only seldom is the read value complete garbage, and in that case comparing it with the other read values will reject it.
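
A rough sketch of that first workaround, the re-read-until-stable approach (the register offset and helper name below are made up for illustration, not taken from the actual PXA168 driver):

```c
#include <linux/io.h>
#include <linux/types.h>

#define HYP_TMR_COUNT 0x28  /* hypothetical free-running counter offset */

/*
 * Re-read the counter until two successive reads agree, so that a value
 * sampled in the middle of a bit transition gets discarded.
 */
static u32 timer_read_stable(void __iomem *base)
{
    u32 prev, cur;

    prev = readl(base + HYP_TMR_COUNT);
    for (;;) {
        cur = readl(base + HYP_TMR_COUNT);
        if (cur == prev)
            return cur;
        prev = cur;
    }
}
```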

The second reading method was to use a separate capture register. After giving a timer capture command, reading an unchanging value from the capture register should have caused no problems. Except that in this buggy timer, it is unpredictable when the capture is actually completed. This requires the insertion of an empirically determined delay time before reading the capture register, hopefully allowing enough time for the capture to be complete.
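
And a correspondingly rough sketch of the capture-register method, with the empirically determined wait in the middle (again, the register offsets and the delay count are placeholders, not the real driver's values):

```c
#include <linux/io.h>
#include <linux/types.h>
#include <asm/processor.h>  /* cpu_relax() */

#define HYP_TMR_CAPTURE_REQ 0x40  /* hypothetical capture-request register */
#define HYP_TMR_CAPTURE_VAL 0x44  /* hypothetical captured-value register  */

static u32 timer_read_captured(void __iomem *base)
{
    int delay = 100;  /* empirically chosen, hopefully long enough */

    /* Ask the timer to latch its current count. */
    writel(1, base + HYP_TMR_CAPTURE_REQ);

    /* The capture completes some unspecified time later. */
    while (delay--)
        cpu_relax();

    return readl(base + HYP_TMR_CAPTURE_VAL);
}
```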

  • > The workaround in this case is to read multiple times and accept as good a value that is approximately the same for multiple reads.

    It's only incrementing at 3.25MHz, right? Shouldn't you be able to get exactly the same value for multiple reads? That seems both simpler and faster than using this very slow capture register, but maybe I'm missing something.

    • In this specific case, yes: if neither of two successive readings is corrupted and you did not straddle a transition, they should be the same.

      In general, when reading a timer that increments faster, you may want to mask some of the least significant bits, to ensure that you can have the same values on successive readings.
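
      For example, a comparison that tolerates the low bits ticking between reads could look like this (the mask width is just an example):

      ```c
      #include <linux/types.h>

      /* Ignore the low 3 bits when deciding whether two reads "agree". */
      #define HYP_TMR_LSB_MASK 0x7

      static bool reads_agree(u32 a, u32 b)
      {
          return (a & ~HYP_TMR_LSB_MASK) == (b & ~HYP_TMR_LSB_MASK);
      }
      ```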

Author here. Thanks! I believe the register reads are just extending the delay, although the new approach does have a side effect of reading from the hardware multiple times. I don't think the multiple reads really matter though.

I went with the multiple reads because that's what Marvell's own kernel fork does. My reasoning was that people have been using their fork, not only on the PXA168, but on the newer PXAxxxx series, so it would be best to retain Marvell's approach. I could have just increased the delay loop, but I didn't have any way of knowing if the delay I chose would be correct on newer PXAxxx models as well, like the chip used in the OLPC. Really wish they had more/better documentation!

It's possible that actually reading the register takes (significantly) more time than an empty countdown loop. A somewhat extreme example of that would be on x86, where accessing legacy I/O ports for e.g. the timer goes through a much lower-clocked emulated ISA bus.

However, a more likely explanation is the use of "volatile" (which only appears in the working version of the code). Without it, the compiler might even have completely removed the loop?

  • > However, a more likely explanation is the use of "volatile" (which only appears in the working version of the code). Without it, the compiler might even have completely removed the loop?

    No, because the loop calls cpu_relax(), which is a compiler barrier. It cannot be optimized away.

    And yes, reading via the memory bus is much, much slower than a barrier. It's absolutely likely that reading 4 times from main memory on such an old embedded system takes several hundred cycles.

    • From what I understand, the timer registers should be on the APB(1) bus, which operates at a fixed 26 MHz clock. That is much closer to the scale of the fast timer clocks than cpu_relax() and the main CPU clock, which runs somewhere in the range of 0.5-1 GHz and potentially does some dynamic frequency scaling for power saving.

      The silliest part of this mess is that the 26 MHz clock for the APB1 bus is derived from the same source as the 13 MHz, 6.5 MHz, 3.25 MHz, and 1 MHz clocks usable by the fast timers.
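
      Back-of-the-envelope on that scale difference (the 800 MHz CPU clock below is just an assumed example within that range):

      ```c
      #include <stdio.h>

      int main(void)
      {
          const double timer_hz = 3.25e6;  /* fast timer clock              */
          const double apb_hz   = 26e6;    /* APB1 bus clock                */
          const double cpu_hz   = 800e6;   /* assumed CPU clock, not a spec */

          /* How many CPU cycles pass per tick of each slower clock? */
          printf("CPU cycles per timer tick: ~%.0f\n", cpu_hz / timer_hz); /* ~246 */
          printf("CPU cycles per APB cycle:  ~%.0f\n", cpu_hz / apb_hz);   /* ~31  */
          return 0;
      }
      ```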

    • You're right, didn't account for that. Though even when declared volatile, the counter variable would be on the stack, and thus already in the CPU cache (at least 32K according to the datasheet)?

      Looking at the assembly code for both versions of this delay loop might clear it up.
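
      For reference, the two variants of the delay look roughly like this at the C level (a sketch of the pattern, not the literal kernel code):

      ```c
      #include <linux/io.h>
      #include <asm/processor.h>  /* cpu_relax() */

      /*
       * Variant 1: counted busy loop. cpu_relax() is a compiler barrier,
       * so the loop is not optimized away, but its duration scales with
       * the CPU clock (and any frequency scaling).
       */
      static void delay_by_loop(unsigned int loops)
      {
          while (loops--)
              cpu_relax();
      }

      /*
       * Variant 2: dummy reads of a timer register. Each read has to
       * cross the slower peripheral bus, so the wait is paced by the
       * bus/timer clock rather than by the CPU clock.
       */
      static void delay_by_reads(void __iomem *reg, unsigned int reads)
      {
          while (reads--)
              (void)readl(reg);
      }
      ```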


Karliss above found docs which mention:

> This request requires up to three timer clock cycles. If the selected timer is working at slow clock, the request could take longer.

Let's ignore the weirdly ambiguous second sentence and say, for pedagogical purposes, that it takes up to three timer clock cycles, full stop. Timer clock cycles aren't CPU clock cycles, so we can't just do `nop; nop; nop;`. How do we wait three timer clock cycles? Well, a timer register read is handled by the timer peripheral, which runs at the timer clock, so reading (or writing) a timer register will take at least until the end of the next timer clock cycle.

This is a very common pattern when dealing with memory mapped peripheral registers.
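
If that's right, the multiple reads in the fix act as a wait that is paced by the timer block itself rather than by the CPU. A hedged sketch of that shape (register offset and read count are illustrative only):

```c
#include <linux/io.h>
#include <linux/types.h>

#define HYP_TMR_CAPTURE 0x40  /* hypothetical capture register offset */

static u32 timer_read_paced(void __iomem *base)
{
    u32 val = 0;
    int i;

    /* Request a capture of the free-running count. */
    writel(1, base + HYP_TMR_CAPTURE);

    /*
     * Each read is completed by the timer peripheral at its own clock,
     * so a few reads wait out the "up to three timer clock cycles"
     * from the docs; keep the value returned by the last one.
     */
    for (i = 0; i < 4; i++)
        val = readl(base + HYP_TMR_CAPTURE);

    return val;
}
```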

---

I'm making some reasonable assumptions about how the clock peripheral works. I haven't actually dug into the Marvell documentation.

> Is it just because reading takes time, so reading multiple times lets enough time pass between the write and the read?

Yes.

> If so, it sounds like a worse solution than just extending the waiting delay, like the author did initially.

Yeah, it's a judgement call. Previously, the code called cpu_relax() for waiting, which also depends on how cpu_relax() is defined (it can be simply a NOP or barrier(), for instance). Reading the timer register may have the advantage that it depends on the actual memory bus speed, but I wouldn't know for sure. Hardware at that level is just messy, and niche platforms especially have their fair share of bugs where you need ugly workarounds like these.

What I'm rather wondering is why they didn't try the other solution mentioned by the manufacturer: reading the timer directly two times and comparing the values, until you get a stable output.