>>>getInfo.sh on the SBC window.
A typical output is here. The line dma_interrupts_OK=6 means that 6 events have been successfully transferred to the SBC. Also check that the following are true:
dma_interrupts_ERR=0
suspicious_wc_cnt=0
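The three counters above can be checked mechanically. The following is a minimal sketch, assuming that getInfo.sh prints its counters as plain key=value lines as quoted above; the function names are illustrative and not part of the STT tools.

```python
def parse_getinfo(text):
    """Collect integer key=value counters from getInfo.sh-style output."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        # Keep only lines of the form name=<integer>; skip everything else.
        if sep and value.strip().lstrip("-").isdigit():
            counters[key.strip()] = int(value)
    return counters


def dma_transfer_ok(counters):
    """True when events reached the SBC and both error counters are zero."""
    return (
        counters.get("dma_interrupts_OK", 0) > 0
        and counters.get("dma_interrupts_ERR", 0) == 0
        and counters.get("suspicious_wc_cnt", 0) == 0
    )
```

To use it, save the getInfo.sh output to a file, read the file into a string and pass it through parse_getinfo and then dma_transfer_ok.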
Download the FRC in run mode:
>>>frc_config_run.py stt0
This downloads the FRC in x70 in run-mode. Similarly, do frc_config_run.py stt2 to download the FRC in x72, etc.
>>>frcoff.py stt0
If the number of words/blocks is not increasing, make sure again that there is data flow and check whether the TFCs are hung in any way. If they are, both L2 and L3 are hung.
If the number of words/blocks is increasing, it implies that only L3 is hung. Do the following:
frc_status.py stt0
A typical status should look like this. Many of the numbers might be different, of course. The important thing to note is whether any of the BC status registers shows up as 0xffffffff. This would indicate that the PCI bus on that card (FRC/STC/TFC) has hung and that this is the reason for the crash. There are three cases here:
(i) The PCI bus on an STC or TFC is hung. In that case do a mb_reset.py stt0 slot# for each slot in which an affected card sits. Using the download GUI, choose the option that does not download all the cards on a reboot. Then reboot the cpu by typing reboot on the IOC terminal.
Then check the status of that card. Do a python xstc_mon.py stt0 slot# and also a python STCchk.py stt0 0 if it's an STC, or a tfc00 (or tfc01) debug if it's a TFC.
Reinitialize all the cards.
>>>tfc00p init allskips daq (or tfc00n init allskips daq depending on the magnet polarity). The same for TFC1.
>>> python stc_reset_all.py stt0
>>> frc_config_run.py stt0
Start the system running again by doing a frcon.py stt0. If there are still no events getting into the SBC and you've done all the checks listed above again, then the crate might have to be power-cycled.
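The re-initialization and restart sequence above can be written down as a small script. This is only a sketch: the helper names are made up, the tfc01p/tfc01n form for TFC1 is an assumption based on "the same for TFC1", and the commands are only printed unless dry_run is switched off.

```python
import subprocess


def reinit_commands(crate="stt0", polarity="p"):
    """Return the reinitialize-and-restart commands in the order given above."""
    if polarity not in ("p", "n"):
        raise ValueError("polarity must be 'p' or 'n'")
    return [
        ["tfc00" + polarity, "init", "allskips", "daq"],
        ["tfc01" + polarity, "init", "allskips", "daq"],  # assumed TFC1 form
        ["python", "stc_reset_all.py", crate],
        ["frc_config_run.py", crate],
        ["frcon.py", crate],  # start the system running again
    ]


def run_sequence(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing command.
            subprocess.run(cmd, check=True)
```

Calling run_sequence(reinit_commands("stt0", "n")) prints the sequence for the other magnet polarity without touching the crate.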
(ii) The PCI bus on the FRC is hung.
You may notice that doing a getInfo.sh shows dma_interrupts_ERR=1. Reset the PCI bus on the FRC by doing a mb_reset.py stt0 13. Then use the download GUI to choose the option that does not download all cards on a reboot, and reboot the cpu. If the run was still going on during this period, do a frcoff.py stt0 after the reboot finishes, dump the frc_mon.py information, then reinitialize all cards as above and start again. If L3 readout still does not work, power-cycle the crate and start fresh.
(iii) If none of the PCI buses has crashed, do the following:
If it is an STC Read, follow the instructions in Problems.txt.
In a nutshell, check the TFC status dump and make a note of which STC (mipsite, channel#) has fewer blocks than the others. Do a status dump for that STC: do a python xstc_mon.py stt0 slot# and also a python STCchk.py stt0 0. Re-initialize all the cards as described above (without rebooting or power-cycling) and start again. If the TFC crashes in STC Read again, and especially if it's the same STC again, power-cycle the crate and start again. If the problem persists with the same STC, ensure that the LVDS cables from that STC into the corresponding TFC are plugged in properly.
If the TFC hangs in FRC Read, read out the diagnostic LRB on the FRC by doing:
>>>frc_lrb_read.py stt0
The output will be found in the file outputfiles/LRB_output.txt. A typical output is here. Check whether the first CTT header word (0x2101030d) is present or not. To fully understand the data format, please look at Hal's T/R document.
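The header-word check can be done with a short search. This sketch assumes that LRB_output.txt is a text file containing the words in hexadecimal, with or without a 0x prefix; the function name is illustrative.

```python
import re

# First CTT header word quoted above; matched with or without the 0x prefix.
CTT_HEADER = re.compile(r"\b(?:0x)?2101030d\b", re.IGNORECASE)


def has_ctt_header(lrb_text):
    """True if the first CTT header word appears anywhere in the LRB dump."""
    return CTT_HEADER.search(lrb_text) is not None
```

To use it, read outputfiles/LRB_output.txt into a string and pass it to has_ctt_header. The check only reports presence; it does not validate the rest of the data format.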
In all TFC non-pci bus hang cases, do a TFC debug dump (there is no need to reboot since the bus is not hung). Ask for an sclinit and start again. If that does not work, reinitialize all cards and start again.
FRC: frc_l3_read.py stt0 > l3out.txt
The output should look like this. If, instead, you see only one word repeated over and over again (e.g., like this), it implies that there is no data in the FRC L3 fifo.
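The "one word repeated over and over" signature is easy to test for. The following is a minimal sketch, assuming l3out.txt holds nothing but whitespace-separated data words; if the real dump carries labels or indices, strip them first. The function name is hypothetical.

```python
def fifo_looks_empty(dump_text, min_words=2):
    """True when the dump is a single word repeated at least min_words times."""
    words = dump_text.split()
    # One distinct value across the whole dump means no real event data.
    return len(words) >= min_words and len(set(words)) == 1
```

Read l3out.txt into a string and pass it to fifo_looks_empty; a True result matches the empty-fifo case described above.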
STC: python xstc_mon.py stt0 slot#
Check the line ADVERTISING L3 DATA: if its status is yes, it implies that that particular STC has L3 data. Do the same for all STCs.
TFC: tfc00p status.
The L3FIFO counter should have a non-zero value if there are events in the TFC's L3 fifo. Do the same for TFC1.
If none of the 12 cards is advertising L3 data, it might mean that all the cards had their data successfully transferred to the BC and that the crash happened at a later stage. If exactly one of the 12 cards is still advertising L3 data while the others are not, then that card is responsible for the L3 readout hang. Please make a note of that.
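Once the advertising status of every card has been noted by hand, the reasoning above can be sketched as follows. The card names and the wording of the returned messages are invented for illustration; the case of several cards still advertising is not covered by the text above and is reported as inconclusive.

```python
def diagnose_l3(advertising):
    """advertising maps a card name to True if it still advertises L3 data."""
    still = sorted(name for name, flag in advertising.items() if flag)
    if not still:
        # All data reached the BC, so the crash happened at a later stage.
        return "none advertising: crash is at a later stage"
    if len(still) == 1:
        return "suspect card: " + still[0]
    return "inconclusive: " + ", ".join(still)
```

For example, diagnose_l3({"FRC": False, "TFC0": False, "STC_slot5": True}) names STC_slot5 as the suspect card.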
>>>bc_mon.py stt0 slot#
A typical output looks like this for when there is no problem with the BC.
Note: If the PCI bus has hung on any MB, you will need to reset the MB first (mb_reset.py), remove the download of all cards from the start-up script and then reboot the cpu before you do the debug for that card. The frc_status.py command should tell you right away if the PCI3 bus on any MB has crashed, so that command should be the first one to use.
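A quick way to spot a hung bus in a saved frc_status.py dump is to look for 0xffffffff. This is only a sketch: the layout of the dump is not assumed beyond being line-oriented text, so the helper returns every line containing that value and it is up to the reader to confirm that the line is a BC status register.

```python
import re

# Exactly eight f's after 0x, in either case.
HUNG_VALUE = re.compile(r"0xffffffff\b", re.IGNORECASE)


def hung_lines(status_text):
    """Return the lines of a status dump that contain 0xffffffff."""
    return [line.strip() for line in status_text.splitlines()
            if HUNG_VALUE.search(line)]
```

An empty list means no register in the dump reads 0xffffffff.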
frc_status.py stt0
frc_mon.py stt0
tfc00p debug
tfc01p debug
python xstc_mon.py stt0 slot# (for all slots)
python STCchk.py stt0 0
bc_mon.py stt0 slot# (for all slots)
frc_lrb_read.py stt0
python stc_l3_read.py stt0 slot#
>>>reset_all.sh stop
>>>reset_all.sh start
You should prime the SBC again (as described above) if you've reset it.
We should try and note which of the following is the problem when we think we have a CTT input problem. This will help the CTT group to debug their firmware and hence provide us more stable inputs.
(1) BX/TURN mismatch between SCL and CTT
(2) Missing EOE or BOE
(3) Totally missing CTT events
(4) Data corruption
We can get information on the first two problems by doing a frc_mon.py (the top two lines in the dump: BOE/EOE MISSING ERROR, BX/TURN MISMATCH ERROR). In my experience, an sclinit has almost always solved this sort of problem, but we need to gather more statistics.
The third problem above occurs when the DAV goes missing totally. The typical signature is that we go L1 and L2 busy with the following set in frc_mon.py:
L1_AF_inBM
L2_AF_inBM
These, of course, might be set due to STT problems as well, so we have to be careful in reaching the conclusion that it's a CTT problem. Looking at the DAV signal is the more conclusive check. This problem is solved by either an sclinit or a FixCTT, depending on what the cause of the problem was in the first place.
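To help with the bookkeeping, the mapping from frc_mon.py flags to the problem list above can be sketched as below. The flag strings are the ones quoted in this section; how they are extracted from the dump is left to the reader, and the function name is hypothetical. Data corruption, item (4), has no flag named here and is not covered.

```python
def ctt_candidates(flags):
    """Map the set of flags seen in frc_mon.py to candidate CTT problems."""
    flags = set(flags)
    found = []
    if "BX/TURN MISMATCH ERROR" in flags:
        found.append("(1) BX/TURN mismatch between SCL and CTT")
    if "BOE/EOE MISSING ERROR" in flags:
        found.append("(2) Missing EOE or BOE")
    if {"L1_AF_inBM", "L2_AF_inBM"} <= flags:
        # These bits can also be set by STT problems, so this is only a hint.
        found.append("(3) Possibly missing CTT events; confirm with DAV")
    return found
```

An empty result does not rule out a CTT problem; it only means that none of the listed flags was seen.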
Follow all the steps for L3 Readout in non-VBD mode for getting started and running the system and troubleshooting. The parallel cable that sends information back to the SCL hub should always be plugged in. In addition, the following steps have to be taken in order to include the STT crates in VBD mode.