Re: [BLAST_SHIFTS] Trouble last night

From: Chris Crawford (chris2@lns.mit.edu)
Date: Wed Nov 06 2002 - 12:35:09 EST


Tancredi Botto wrote:
>
> I think that what happened last night can be explained. There were two
> problems. The first problem had to do with coda
>

Sounds like when coda crashed, it didn't have a chance to update the
last run. This happened once before in July. Maybe we could have a
script run at the end of each run which forces coda to update its
database, or simply write-protect the most recent run.
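
Something along these lines could be hooked into the end-of-run transition.
This is only a sketch; the data directory, the run_*.dat naming, and picking
the newest file by mtime are my assumptions, not the actual dblast07 layout:

#!/usr/bin/env python
# Sketch of an end-of-run hook: write-protect the file of the run that just
# ended, so a restarted coda with a stale run number cannot silently
# overwrite it.  Paths and file naming are assumptions.

import glob
import os
import stat
import sys

DATA_DIR = "/data"   # hypothetical location of the CODA output files

def protect_latest_run(data_dir=DATA_DIR):
    """Clear the write bits on the most recently modified run file."""
    run_files = glob.glob(os.path.join(data_dir, "run_*.dat"))
    if not run_files:
        print("no run files found in %s" % data_dir)
        return
    latest = max(run_files, key=os.path.getmtime)
    mode = os.stat(latest).st_mode
    # drop write permission for user, group and other
    os.chmod(latest, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
    print("write-protected %s" % latest)

if __name__ == "__main__":
    protect_latest_run(sys.argv[1] if len(sys.argv) > 1 else DATA_DIR)

Forcing coda to flush its run database would still have to go through
whatever hook coda itself provides; the chmod alone would at least turn a
silent overwrite into an error.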

> a) After the power failure (Nov 4 14:51), when coda was restarted and
> reloaded its database, it started with run # 2330. At that time
> the run number should have been 2512. Run 2511 was taken Monday
> early afternoon, before the power failure.
>
> b) Coda simply overwrote existing data. From Nov 4 22:48 (Monday night) until
> Nov 5 22:09 (yesterday) we re-took runs # 2330-2345. Luckily the overwritten
> runs were only paddle scintillator runs, but they are lost (forever).
>
> c) thanks to Doug, who (incidentally) noticed it!
>
> * Action:
> _ I do not understand why coda re-started from # 2330; I am trying to find
> out how that happened, but I know it happened with the power failure.
>
> _ I am also trying to understand how to prevent coda from overwriting data.
> That is clearly our first line of defense. More on this later. However, the
> run count is now correct and we have no reason to believe it will jump again.
>
> _ Also, the data was automatically transferred to the spuds. We cannot avoid
> overwriting with scp because we want to keep updating files for the
> online analysis. Hence, runs 2330-2345 were also lost on the spuds.
>
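
One possibility for the spud side (again just a sketch -- the file layout and
the assumption that the newest local file is the run still in progress are
mine): keep re-copying the active run so the online analysis stays current,
but refuse to clobber a run that is already closed out:

#!/usr/bin/env python
# Sketch of a transfer guard for dblast07 -> spuds: keep updating the file of
# the run still being written, but never overwrite a closed-out run that is
# already on the spuds.  Directory layout, file names and the "newest file =
# active run" assumption are placeholders, not the real setup.

import glob
import os
import subprocess

LOCAL_DIR = "/data"          # hypothetical CODA output directory on dblast07
REMOTE_HOST = "spud1"        # hypothetical spud
REMOTE_DIR = "/data"         # hypothetical destination directory

def active_run(local_dir=LOCAL_DIR):
    """Assume the most recently modified file belongs to the run in progress."""
    files = glob.glob(os.path.join(local_dir, "run_*.dat"))
    return max(files, key=os.path.getmtime) if files else None

def transfer():
    current = active_run()
    for path in sorted(glob.glob(os.path.join(LOCAL_DIR, "run_*.dat"))):
        name = os.path.basename(path)
        remote_path = os.path.join(REMOTE_DIR, name)
        already_there = subprocess.call(
            ["ssh", REMOTE_HOST, "test", "-e", remote_path]) == 0
        if already_there and path != current:
            continue     # closed-out run already on the spud: leave it alone
        subprocess.call(["scp", path, "%s:%s" % (REMOTE_HOST, remote_path)])

if __name__ == "__main__":
    transfer()

That would not have saved the originals on dblast07, but it would have kept
the spud copies of runs 2330-2345 from being clobbered.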
> ---------------
>
> The second problem had to do with data transfer. Every 3 minutes there
> is an automatic check of the data on dblast07 and the spuds.
>
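
If that check can fire while the previous transfer is still going (which is
what jammed things last night), the cron job could simply skip a cycle when a
lock is already held. A rough sketch -- the lock path and the copy command
are placeholders:

#!/usr/bin/env python
# Sketch of a cron-job guard: if the previous dblast07 -> spuds transfer is
# still running, skip this cycle instead of stacking another scp on top.
# The lock path and the copy command are placeholders.

import fcntl
import subprocess
import sys

LOCKFILE = "/tmp/blast_transfer.lock"   # hypothetical lock location

def main():
    lock = open(LOCKFILE, "w")
    try:
        # non-blocking: raises immediately if another transfer holds the lock
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        print("previous transfer still running, skipping this cycle")
        return 0
    try:
        # placeholder for the real copy to the spuds
        return subprocess.call(["scp", "/data/run_current.dat", "spud1:/data/"])
    finally:
        fcntl.flock(lock, fcntl.LOCK_UN)

if __name__ == "__main__":
    sys.exit(main())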

  I still think that the elegant solution would be to install an event
recorder on the spuds. That way data would be written to the spuds
instantaneously, and we would not have to wait 5 minutes after the run
to analyze it.
  We already have to put up with the inconvenience of having coda split
up into small parts (et, eb, er, ...), so we might as well turn it to
our advantage.
--jmtb, chris

> Yesterday, for the first time, we took two very large runs, run # 2343 (at
> 21:48) and run # 2602 (Nov 5 23:16). Both are close to the 1M events mark
> and both are about 1.1 Gbytes in size.
>
> Just now (while we are not taking data) it took 2 min 58 s and 2 min 46 s
> respectively to transfer files of that size, which is very close to the
> 3-minute check interval. As Nikolas correctly diagnosed, we were stuck in
> an infinite loop. Stopping data taking and runcontrol probably gave the
> cron jobs enough breathing room to finish.
>
> That infinite loop could have been enough to completely jam dblast07's memory.
> One symptom was that the trigger program was not able to load .so objects.
> Although you cannot interrupt the data transfer without administrator
> privileges, I think there is a solution:
>
> A) The CODA max event limit should be 500,000. Runs automatically stop
> after that.
>
> B) The data transfer interval has been increased to 5 minutes,
>
> so we have about 66% more time to transfer files that are approximately
> 50% smaller in size. That should be enough.
>
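
For what it's worth, plugging in the numbers above (and assuming the transfer
time scales linearly with the file size):

# Rough check of the new margin, using the numbers quoted above and assuming
# the transfer time scales linearly with the file size.

old_interval = 3 * 60                  # s, old check interval
new_interval = 5 * 60                  # s, new check interval
full_run = 178                         # s to copy a ~1.1 Gbyte, ~1M event run
half_run = full_run / 2.0              # ~500k events -> roughly half the size

print("old margin per cycle: %4.0f s" % (old_interval - full_run))    # ~2 s
print("new margin per cycle: %4.0f s" % (new_interval - half_run))    # ~211 s
print("extra time per cycle: %4.0f %%" %
      (100.0 * (new_interval - old_interval) / old_interval))         # ~67 %

So the 2-second margin that got us into the loop becomes about three and a
half minutes of slack.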
> Because of the data transfer problem we lost about 4 hrs of beam last
> night (but we lost no data).
>
> Regards,
> -- tancredi
>
> ________________________________________________________________________________
> Tancredi Botto, phone: +1-617-253-9204 mobile: +1-978-490-4124
> research scientist MIT/Bates, 21 Manning Av Middleton MA, 01949
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> On Wed, 6 Nov 2002, Nikolas C Meitanis wrote:
>
> > Problems inherited from the previous shifts kept multiplying like
> > mushrooms in a moist tree shade. Scores of scp processes from dblast07 to
> > the spuds were eating up the memory and data acquisition was hindered, to
> > put it mildly. The Cross MLU did not seem to be working properly
> > so we spent time in the Dtunnel playing with cables, which seemed to fix
> > things. We found out that the Right Wchamber crate was having some
> > problems. We cycled it and put both crates on standby for the rest of the
> > night. Manually killing all scp processes, while they kept popping up,
> > seems to have fixed the overloading problem with dblast07. The problem was
> > with the 2 oversized runs 2343 and 2602 taken by the previous shift.
> >
> > When the problems were fixed, we finished the 65mV threshold run and
> > raised the voltages by 10%, then took long runs at 25, 45, 65 mV as
> > indicated in the logbook.
> > We could not analyze the data for efficiencies on the Cerenkovs since
> > ntuple.C needs to be fixed.
> >
> > Analysis of previous runs for Cerenkov efficiency seems to agree within
> > the error bars with an analysis on page 136 of logbook #4 for
> > Cerenkov Left_0. Our results are
> > presented in the logbook for comments. The cuts are clearly indicated.
> > We were unable to compare our cuts with those of others, even though we
> > did read the recent emails concerning the issue.
> >
> > Tavi Filoti, nm
> >
> >
> >


