[BLAST_SHIFTS] Trouble last night

From: Tancredi Botto (tancredi@mitlns.mit.edu)
Date: Wed Nov 06 2002 - 11:26:36 EST


I think that what happened last night can be explained. There were two
problems. The first problem had to do with coda:

a) After the power failure (Nov 4 14:51), when coda was restarted and
   reloaded its database, it started with run # 2330. At that time
   the run number count should have been 2512. Run 2511 was taken Monday
   early afternoon, before the power failure.

b) Coda simply overwrote existing data. From Nov 4 22:48 (Monday night) until
   Nov 5 22:09 (yesterday) we re-took runs # 2330-2345. Luckily the overwritten
   originals were only paddle-scintillator runs, but they are (forever) lost.
    
c) Thanks to Doug, who (incidentally) noticed it!

* Action:
_ I do not understand why coda restarted from run # 2330; I am trying to find
  out how that happened. But I know it happened with the power failure.

_ I am also trying to understand how to prevent coda from overwriting data.
  That is clearly our first line of defense. More on this later (one possible
  interim guard is sketched right after this list). However, the run count is
  now correct and I have no reason to believe it will jump again.

_ Also, the data were automatically transferred to the spuds. We cannot avoid
  overwriting with scp because we want to keep updating the files for the
  online analysis. Hence, runs 2330-2345 were also lost on the spuds.
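
As a stop-gap while I sort out the CODA side, a small pre-run check could at
least warn us before anything gets clobbered. This is a rough sketch only:
the data directory, file-naming pattern and script name are my assumptions,
not the actual CODA configuration. It simply refuses to bless a run whose
data file already exists on dblast07:

  #!/usr/bin/env python
  # check_run.py -- hypothetical pre-run guard (paths and naming are assumptions)
  # Exit non-zero if a data file for the requested run number already exists,
  # so a wrong run-number count cannot silently overwrite earlier data.
  import os
  import sys

  DATA_DIR = "/data"               # assumed location of the run files on dblast07
  FILE_PATTERN = "run_%05d.dat"    # assumed file-naming convention

  def run_exists(run_number):
      """True if a data file for this run number is already on disk."""
      return os.path.exists(os.path.join(DATA_DIR, FILE_PATTERN % run_number))

  if __name__ == "__main__":
      run = int(sys.argv[1])
      if run_exists(run):
          print("REFUSING to start run %d: a data file already exists" % run)
          sys.exit(1)
      print("run %d looks safe to start" % run)

Something like "python check_run.py 2512" before hitting start in runcontrol;
if CODA itself can be told never to clobber an existing file, that would of
course be the cleaner fix.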

---------------

The second problem had to do with data transfer. Every 3 minutes there
is an automatic check of the data on dblast07 and the spuds.

Yesterday, for the first time, we took two very large runs, run # 2343 (at
21:48) and run # 2602 (Nov 5 23:16). Both are close to the 1M-event mark
and both are about 1.1 GB in size.

Just now (while we are not taking data) it took 2 min 58 s and 2 min 46 s,
respectively, to transfer files of that size, which is very close to the
3-minute cron interval. As Nikolas correctly diagnosed, we were stuck in an
infinite loop of overlapping transfers. Stopping data taking (runcontrol)
probably gave the cron jobs enough breathing room to finish.
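
For the record, a lock file around the transfer would also keep the cron
cycles from stacking up whenever a single copy runs long. A minimal sketch,
assuming the transfer is driven by a script we can wrap (the lock path,
source directory and spud destination below are illustrative, not our
actual setup):

  #!/usr/bin/env python
  # transfer_once.py -- illustrative cron wrapper (paths and hosts are assumptions)
  # If the previous transfer is still running, skip this cycle instead of
  # starting yet another scp on top of it.
  import fcntl
  import subprocess
  import sys

  LOCK_FILE = "/tmp/blast_transfer.lock"   # assumed lock-file location
  SRC = "/data/"                           # assumed CODA output directory on dblast07
  DEST = "spud1:/data/"                    # illustrative spud destination

  lock = open(LOCK_FILE, "w")
  try:
      fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
  except IOError:
      # The previous cycle still holds the lock: quietly skip this one.
      sys.exit(0)

  # The actual copy; scp is what the cron job uses today.
  subprocess.call(["scp", "-rp", SRC, DEST])

With such a guard a slow transfer simply delays the next cycle instead of
multiplying scp processes on dblast07.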

That infinite loop could have been enough to completely jam dblast07's memory.
One symptom was that the trigger program was not able to load .so objects.
Although you cannot interrupt the data transfer without administrator
privileges, I think there is a solution:

A) The CODA max event limit should be 500,000. Runs automatically stop
   after that.

B) The data transfer interval has been increased to 5 minutes.

So we have ca. 66% more time to transfer files that are approximately 50%
smaller in size. That should be enough.
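
A quick back-of-the-envelope check of that, assuming the transfer time scales
roughly linearly with file size:

  # rough headroom estimate (assumption: transfer time ~ file size)
  old_interval = 3.0                    # minutes between cron transfers, before
  new_interval = 5.0                    # minutes between cron transfers, now
  full_run_transfer = 3.0               # ~3 min for a 1M-event, 1.1 GB run
  capped_run_transfer = full_run_transfer * 0.5   # runs now capped at 500,000 events

  print("extra time per cycle: %.0f%%"
        % (100 * (new_interval - old_interval) / old_interval))
  print("expected transfer: ~%.1f min inside a %.0f-minute window"
        % (capped_run_transfer, new_interval))

So even a run right at the new event cap should finish copying with a few
minutes to spare before the next cron cycle starts.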

Because of the data transfer problem we lost about 4 hrs of beam last
night (but we lost no data).

Regards,
-- tancredi

________________________________________________________________________________
Tancredi Botto, phone: +1-617-253-9204 mobile: +1-978-490-4124
research scientist MIT/Bates, 21 Manning Av Middleton MA, 01949
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On Wed, 6 Nov 2002, Nikolas C Meitanis wrote:

> Problems inherited from the previous shifts kept multiplying like
> mushrooms in a moist tree shade. Scores of scp processes from dblast07 to
> the spuds were eating up the memory, and data acquisition was hindered, to
> put it mildly. The Cross MLU did not seem to be working properly,
> so we spent time in the Dtunnel playing with cables, which seemed to fix
> things. We found out that the Right Wchamber crate was having some
> problems. We cycled it and put both crates on standby for the rest of the
> night. Manually killing all scp processes, while they kept popping up,
> seems to have fixed the overloading problem with dblast07. The problem was
> with the 2 oversized runs 2343 and 2602 taken by the previous shift.
>
> When the problems were fixed, we finished the 65 mV threshold run and
> raised the voltages by 10%, then took long runs at 25, 45 and 65 mV as
> indicated in the logbook.
> We could not analyze the data for efficiencies on the Cerenkovs since
> ntuple.C needs to be fixed.
>
> Analysis of previous runs for Cerenkov efficiency seems to agree within
> the error bars with the analysis on page 136 of logbook #4 for
> Cerenkov Left_0. Our results are
> presented in the logbook for comments. The cuts are clearly indicated.
> We were unable to compare our cuts with those of others, even though we did
> read the recent emails concerning the issue.
>
> Tavi Filoti, nm
>
>
>


