Timeout on init of multiple SMuRF carriers simultaneously


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Component/s: FW, SW
    • Environment:

      I've only observed this issue on one specific SMuRF crate + server node at the LAT; other crates I've tried have not run into it. It occurs when I start up all 6 slots simultaneously, with at least two or three of the slots crashing during initialization. I haven't seen it happen when I start them up individually (although I have a smaller sample size for that).
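To compare the two start-up modes with a similar sample size, something like the following could be used. This is a minimal sketch, not the actual start-up procedure: `run_slots` and `placeholder_command` are hypothetical names, the slot indices are arbitrary, and the placeholder command must be replaced with whatever actually launches a slot.

```python
import subprocess
import sys


def run_slots(commands, simultaneous=True):
    """Run one command per slot and return {slot: return code}.

    commands: dict mapping a slot label to an argv list.
    simultaneous: start every process before waiting on any of them;
                  otherwise run them one after another.
    """
    results = {}
    if simultaneous:
        # Start all processes first, then wait for each to exit.
        procs = {slot: subprocess.Popen(cmd) for slot, cmd in commands.items()}
        for slot, proc in procs.items():
            results[slot] = proc.wait()
    else:
        # Run each process to completion before starting the next.
        for slot, cmd in commands.items():
            results[slot] = subprocess.run(cmd).returncode
    return results


def placeholder_command(slot):
    # Hypothetical stand-in: replace with the real per-slot start-up command.
    return [sys.executable, "-c", "import sys; sys.exit(0)"]


if __name__ == "__main__":
    commands = {slot: placeholder_command(slot) for slot in range(6)}
    for mode in (True, False):
        codes = run_slots(commands, simultaneous=mode)
        failed = [slot for slot, code in codes.items() if code != 0]
        print("simultaneous" if mode else "sequential", "failed slots:", failed)
```

Running both modes the same number of times would show whether the failure really depends on simultaneous start-up or whether the individual case just has too few trials.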

      The failure is a timeout in the `Root._read` call, raised as a `rogue.GeneralError`. Typically, the traceback looks something like the following.

       

      Traceback (most recent call last):
        File "/usr/local/src/smurf-streamer/scripts/stream.py", line 140, in <module>
          main()
        File "/usr/local/src/smurf-streamer/scripts/stream.py", line 133, in main
          with CmbRoot(pcie=pcie, **root_kwargs):
        File "/usr/local/src/rogue/python/pyrogue/_Root.py", line 174, in __enter__
          self.start()
        File "/usr/local/src/pysmurf/python/pysmurf/core/roots/Common.py", line 189, in start
          pyrogue.Root.start(self)
        File "/usr/local/src/rogue/python/pyrogue/_Root.py", line 420, in start
          self._read()
        File "/usr/local/src/rogue/python/pyrogue/_Root.py", line 732, in _read
          self.checkBlocks(recurse=True)
        File "/usr/local/src/rogue/python/pyrogue/_Device.py", line 646, in checkBlocks
          value.checkBlocks(recurse=True, **kwargs)
        File "/usr/local/src/rogue/python/pyrogue/_Device.py", line 646, in checkBlocks
          value.checkBlocks(recurse=True, **kwargs)
        File "/usr/local/src/rogue/python/pyrogue/_Device.py", line 646, in checkBlocks
          value.checkBlocks(recurse=True, **kwargs)
        [Previous line repeated 3 more times]
        File "/usr/local/src/rogue/python/pyrogue/_Device.py", line 642, in checkBlocks
          pr.checkTransaction(block, **kwargs)
        File "/usr/local/src/rogue/python/pyrogue/_Block.py", line 71, in checkTransaction
          block._checkTransaction()
      rogue.GeneralError: Block::checkTransaction: General Error: Transaction error for block AMCc.FpgaTopLevel.AppTop.AppCore.RtmCryoDet.LutCtrl.Lut[1].MEM[244] with address 0x823203d0. Error Timeout (5.000000s) waiting for register transaction 7205 message response. 

      The precise register that times out varies, however.
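Since the failing register differs between runs, it may help to tally which blocks time out across logs. The sketch below parses the error line shown in the traceback above; the regular expression is written against that one example only, so it is an assumption that other occurrences follow the same format. `parse_timeout` and `tally_blocks` are hypothetical helper names.

```python
import re
from collections import Counter

# Pattern based on the single error message quoted above (assumed format).
TIMEOUT_RE = re.compile(
    r"Transaction error for block (?P<block>\S+) "
    r"with address (?P<address>0x[0-9a-fA-F]+)\. "
    r"Error Timeout \((?P<timeout>[0-9.]+)s\) "
    r"waiting for register transaction (?P<txn>\d+)"
)


def parse_timeout(line):
    """Return the fields of a timeout error line, or None if it does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    return {
        "block": match["block"],
        "address": int(match["address"], 16),
        "timeout_s": float(match["timeout"]),
        "transaction": int(match["txn"]),
    }


def tally_blocks(lines):
    """Count how often each block path appears in timeout errors."""
    counts = Counter()
    for line in lines:
        parsed = parse_timeout(line)
        if parsed is not None:
            counts[parsed["block"]] += 1
    return counts
```

Feeding the log lines from every crashed slot into `tally_blocks` would show whether the timeouts cluster on particular devices or are spread across the register tree.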

       

      The timeout occurs at some step after the read, when `checkBlocks` is called on the entire tree. In this case `checkTransaction` should not be doing anything, since only read operations have been performed, so it must be failing while waiting on an existing transaction to complete. I don't understand how it gets into this state: from reading the code, it looks like each transaction waits for existing ones to finish (`waitTransaction(0)` at the start of the `startTransaction` function). So it would have to be the very last one that times out? I don't think I understand the rogue code well enough to draw that conclusion. The timeout is set to 5 s.
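The reasoning above can be pictured with a toy model. This is not rogue code: `ToyBlock` and its methods are hypothetical, and the model simply assumes the behavior described in the paragraph, namely that starting a transaction first waits for any transaction already in flight.

```python
class ToyBlock:
    """Toy stand-in for a register block (assumed behavior, not rogue's)."""

    def __init__(self, name):
        self.name = name
        self.pending = None  # id of the in-flight transaction, if any
        self.waits = []      # record of (stage, transaction id) for each wait

    def wait_transaction(self, stage):
        # Waiting only has an effect if a transaction is still in flight.
        if self.pending is not None:
            self.waits.append((stage, self.pending))
            self.pending = None

    def start_transaction(self, txn_id):
        # Assumption: wait for the previous transaction before starting a new one.
        self.wait_transaction("start")
        self.pending = txn_id

    def check_transaction(self):
        # Waits on whatever is still in flight when the check runs.
        self.wait_transaction("check")


def outstanding_at_check(n_reads):
    """Return the transaction ids that the final check has to wait on."""
    block = ToyBlock("toy")
    for txn_id in range(1, n_reads + 1):
        block.start_transaction(txn_id)
    block.check_transaction()
    return [txn for stage, txn in block.waits if stage == "check"]
```

Under this assumption, the check stage only ever waits on the most recently started transaction, which matches the guess that it would have to be the very last one that times out. Whether rogue actually behaves this way is exactly the open question.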

      Since reverting to the old version of the code (based on rogue 4), I have not encountered this issue, despite trying multiple times to reproduce it.

              Assignee:
              Unassigned
              Reporter:
              Pinsonneault-Marotte, Tristan