Comment by veltas

6 months ago

Reading 4 times doesn't feel like a necessarily portable solution: there may be more versions at different speeds and with different I/O architectures, it's unclear how this will behave under more load, and the original change may have been made to fix some other performance problem OP is not aware of. That said, I'm not sure what else can be done. Unfortunately many vendors like Marvell can seriously under-document crucial features like this. If anything, it would be good to put some of this info in the comment itself. It's not very elegant, but how else are we practically meant to keep track of this? Is the mailing list part of the documentation?

Doesn't look like there's a lot of discussion on the mailing list, but I don't know if I'm reading the thread view correctly.

8 comments

veltas

adrian_b 6 months ago

This is a workaround for a hardware bug of a certain CPU.

Therefore it cannot really be portable, because other timers in other devices will have different memory maps and different commands for reading.

The fault is with the designers of these timers, who have failed to provide a reliable way to read their value.

It is hard to believe that this still happens in this century, because reading correct values even though the timer is incremented or decremented continuously is an essential goal in the design of any timer that may be read, and how to do it has been well known for more than 3 quarters of a century.

The only way to make such a workaround somewhat portable is to parametrize it, e.g. with the number of retries for direct reading or with the delay time when reading the auxiliary register. This may be portable between different revisions of the same buggy timer, but the buggy timers in other unrelated CPU designs will need different workarounds anyway.
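A parametrized workaround of that kind might look roughly like the sketch below. It runs against a simulated timer rather than real hardware, and everything in it (`timer_quirk`, `fake_timer`, the function names, the idea that a capture "lands" after a fixed number of reads) is an assumption made for illustration, not the actual driver code or the documented behaviour of the chip.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-revision tuning knob; not taken from any real driver. */
struct timer_quirk {
    unsigned capture_reads;  /* reads of the capture register after the command */
};

/* Simulated timer: a new capture becomes visible only after `latency`
 * reads of the capture register (assumed to be at least 1). */
struct fake_timer {
    uint32_t counter;        /* free-running count */
    uint32_t capture;        /* value currently visible to software */
    uint32_t pending_value;  /* value waiting to become visible */
    unsigned latency;        /* reads needed before a capture lands */
    unsigned pending;        /* reads left until pending_value lands */
};

static void fake_capture_cmd(struct fake_timer *t)
{
    assert(t->latency >= 1);
    t->pending_value = t->counter;
    t->pending = t->latency;
}

static uint32_t fake_capture_read(struct fake_timer *t)
{
    if (t->pending > 0 && --t->pending == 0)
        t->capture = t->pending_value;
    return t->capture;
}

/* Issue a capture, then read the capture register q->capture_reads
 * times and return the last value seen. */
static uint32_t timer_read(struct fake_timer *t, const struct timer_quirk *q)
{
    uint32_t val = 0;

    assert(q->capture_reads >= 1);
    fake_capture_cmd(t);
    for (unsigned i = 0; i < q->capture_reads; i++)
        val = fake_capture_read(t);
    return val;
}
```

In this model a different revision would only need a different `capture_reads`. Note that if the count is too small for the simulated latency, `timer_read` silently returns the stale value, which is exactly the weakness of any guessed retry count.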

stkdump 6 months ago
> how to do it has been well known for more than 3 quarters of century
Don't leave me hanging! How to do it?
- adrian_b 6 months ago
  
  Direct reading without the risk of reading incorrect values is possible only when the timer is implemented using a synchronous counter instead of an asynchronous counter. The synchronous counter must be fast enough to settle to a stable, correct value by the time it is read, and the read signal must be synchronized with the timer clock signal.
  Synchronous counters are more expensive in die area than asynchronous counters, especially at high clock frequencies. Moreover, it may be difficult to also synchronize the read signal with the timer clock. Therefore a second solution, which uses a separate capture register for reading the timer value, may be preferable.
  This was implemented in the timer described in TFA, but it was done in a wrong way.
  The capture register must either ensure that the capture is already complete by the time when it is possible to read its value after giving a capture command, or it must have some extra bit that indicates when its value is valid.
  In the latter case, one can read the capture register until the valid bit is on, with complete certainty that the final value is correct.
  When adding some arbitrary delay between the capture command and reading the capture register, you can never be certain that the delay value is good.
  Even when the chosen delay is 100% effective during testing, it can result in failures on other computers or when the ambient temperature is different.
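The valid-bit scheme described above could be sketched as follows. This is a minimal simulation, not the timer from the article: the register layout (bit 31 used as a valid flag), `fake_capture_timer`, and all function names are assumptions invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 31 = "value is valid", bits 30..0 = captured count. */
#define CAP_VALID      (UINT32_C(1) << 31)
#define CAP_VALUE_MASK (CAP_VALID - 1)

/* Simulated hardware: after a capture command the register reports
 * "not valid" for `latency` reads, then the latched value. */
struct fake_capture_timer {
    uint32_t counter;     /* free-running count */
    uint32_t latched;     /* value taken at the capture command */
    unsigned latency;     /* reads that report "not ready" after a capture */
    unsigned busy_reads;  /* "not ready" reads still outstanding */
};

static void capture_cmd(struct fake_capture_timer *t)
{
    t->latched = t->counter & CAP_VALUE_MASK;
    t->busy_reads = t->latency;
}

static uint32_t capture_reg_read(struct fake_capture_timer *t)
{
    if (t->busy_reads > 0) {
        t->busy_reads--;
        return 0;                    /* valid bit clear */
    }
    return CAP_VALID | t->latched;
}

/* Poll until the valid bit is set. The poll limit only guards against
 * dead hardware; it is not a tuned delay. Returns false on timeout and
 * leaves *out untouched in that case. */
static bool timer_read_valid(struct fake_capture_timer *t,
                             unsigned max_polls, uint32_t *out)
{
    assert(max_polls > 0);
    capture_cmd(t);
    for (unsigned i = 0; i < max_polls; i++) {
        uint32_t raw = capture_reg_read(t);
        if (raw & CAP_VALID) {
            *out = raw & CAP_VALUE_MASK;
            return true;
        }
    }
    return false;
}
```

The point of the design is that the caller never has to know how long the capture takes: a slow capture just costs more polls, and a failure is reported instead of being returned as a plausible-looking stale value.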
veltas 6 months ago

> This is a workaround for a hardware bug of a certain CPU.
What about different variants, revisions, and speeds of this CPU?

Karliss 6 months ago

The related part of the doc has one more note: "This request requires up to three timer clock cycles. If the selected timer is working at slow clock, the request could take longer." From the way the doc is formatted it's not fully clear what "this request" refers to. It might explain where the 3-5 attempts come from, and that the number might not be pulled completely out of thin air. But the part about taking up to three clock cycles, yet sometimes more, makes it impossible to have a "proper" solution without guesswork or further clarification from the vendor.

The "working at slow clock" part might explain why some other implementations had a different code path for 32.768 kHz clocks. According to the docs there are two available clock sources, "Fast clock" and "32768 Hz", which could mean that "slow clock" refers to specific hardware functionality and is not just a vague phrase.

As for portability concerns, this is already low-level, hardware-specific register access. If Marvell releases a new SoC, not only is there no assurance that it will require the same timing, it might as well have a different set of registers that require a completely different read and setup procedure, not just different timing.

One thing that slightly confuses me: the old implementation had 100 cycles of "cpu_relax()", which is unrelated to the specific timer clock, but so is reading the TMR_CVWR register. Since 3-5 cycles of that worked better than 100 cycles of cpu_relax, it clearly takes more time, unless the cpu_relax part got completely optimized out. At least I didn't find any references mentioning that the timer clock affects the read time of TMR_CVWR.
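To put the doc's "up to three timer clock cycles" into wall-clock terms, here is a small worked calculation. Only the 32768 Hz figure comes from the docs quoted above; the helper name `timer_cycles_to_ns` is made up for the example, and the "Fast clock" frequency is not given in this thread, so it is left out.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC UINT64_C(1000000000)

/* Time taken by `cycles` periods of a clock running at `hz`, in
 * nanoseconds, rounded up so that a wait based on it is never too short. */
static uint64_t timer_cycles_to_ns(uint64_t cycles, uint64_t hz)
{
    assert(hz > 0);
    return (cycles * NSEC_PER_SEC + hz - 1) / hz;
}
```

`timer_cycles_to_ns(3, 32768)` comes out at 91553 ns, so on the slow clock three timer cycles is roughly 91.6 µs. A short fixed spin loop may well not cover that, whereas a handful of register reads might, if each read has to wait on the timer's side of the bus; that last part is a guess, not something the docs state.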

veltas 6 months ago
It sounds like this is an old CPU(?), so no need to worry about the future here.
> I didn't find any references mentioning that timer clock affects read time of TMR_CVWR.
Reading the register might be related to the timer's internal clock, as it would have to wait for the timer's bus to respond. This is essentially implied if Marvell recommend re-reading this register, or if their reference implementation did so. My main complaint is it's all guesswork, because Marvell's docs aren't that good.
- MBCook 6 months ago
  
  The Chumby hardware I’m thinking of is from 2010 or so. So if that’s it, it would certainly be old. And it would explain a possible relation with the OLPC having a similar chip.
  https://en.wikipedia.org/wiki/Chumby

_nalply 6 months ago

I also wondered about this, but there's a crucial difference, no idea if it matters: in that loop it reads the register, so the register is read at least 4 times.