Re: [BLAST_ANAWARE] [BLAST_SHIFTS] autocruncher double crunch 100kC worth of data affected

From: Chi Zhang (zhangchi@MIT.EDU)
Date: Sat Oct 16 2004 - 21:58:46 EDT


Hi Taylan and Adrian,

First, the problem seems to pertain only to the BLAST online crunching, so
the recrunch should be fine, as seen.

Second, the problem is universal for the files crunched online. It just so
happened that the part of the status list I copied was accompanied by
other problems. I sampled one run every 100 runs, with a few more around
11295 and a few more around 11905. All bad.

Since you bring it up, I wrote a small program to check this and just
committed it into CVS. It is called checkdst.C; you can easily run it
anywhere libBlast.so is loadable.

It checks for repeated event numbers in the dst event tree, and whenever it
finds one it yells out "##### BAAAAAAAAAAAAAAAAAD". And boy, the result is
ugly: pretty much EVERY run is double crunched. From 11298 to 11771,
351 out of 354 ABS D2 runs had repeated events among the first 10
entries. Something bad happened to the cruncher or its environment.
Period.
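
In case you do not want to check it out from CVS, the logic is roughly
this. A minimal sketch in plain ROOT, assuming the event tree is named
"dst" and the fNEvent leaf can be read without libBlast.so loaded; the
macro and default file names here are placeholders, and the real
checkdst.C in CVS is the authoritative version:

// checkdst_sketch.C -- minimal duplicate-event check (sketch only).
#include "TFile.h"
#include "TTree.h"
#include <cstdio>

void checkdst_sketch(const char* fname = "dst-11298.root", int nCheck = 10)
{
   TFile f(fname);
   if (f.IsZombie()) { printf("cannot open %s\n", fname); return; }

   TTree* dst = (TTree*)f.Get("dst");   // assumes the tree is named "dst"
   if (!dst) { printf("no dst tree in %s\n", fname); return; }

   // Pull fNEvent for the first nCheck entries without drawing anything.
   Long64_t n = dst->Draw("fNEvent", "", "goff", nCheck, 0);
   double* ev = dst->GetV1();

   // Yell if any CODA event number repeats among those entries.
   for (Long64_t i = 0; i < n; i++)
      for (Long64_t j = i + 1; j < n; j++)
         if (ev[i] == ev[j]) {
            printf("##### BAAAAAAAAAAAAAAAAAD (event %.0f repeats)\n", ev[i]);
            return;
         }
   printf("%s: first %lld entries look clean\n", fname, (long long)n);
}

You would run it like: root -l -q 'checkdst_sketch.C("dst-11298.root")'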

BTW, I also tested the script on runs crunched before (11003-11247); they
passed the test, so the script itself does not seem to give false alarms.

Yes, we should be more careful, but I also don't think it is fair to
require the on-shift people to keep track of the gazillion things, many of
which only the experts know about.

About large-scale panic: I am simply more irritable than I was two years ago.

Chi

On Sat, 16 Oct 2004, Taylan Akdogan wrote:

> Hi Chi,
>
> I suggest checking each run before concluding that "all" of them
> are this way. I saw the auto cruncher submitting multiple processes
> for the same run; as an example, it submitted 6 jobs for a single
> run. However, my experience during the shifts showed that this
> does not happen at a very high rate.
>
> For the portions of the status file you attached below: we had a
> full analysis disk on Friday morning, and several jobs crashed for
> this reason (code 30s), so I restarted them. It was hard to keep
> track of which runs were crunching and which had crashed, so I
> sorted that portion of the status file by run number. So, although
> the entries appear one after another, they were not submitted at
> the same time.
>
> So, it is better to check the runs one by one before recrunching
> all of them. I don't think all of them are bad; probably most of
> them are fine. But, yes, we should be careful with the auto
> cruncher...
>
> There is no need for large-scale panic!
>
> Taylan
>
>
> ---=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=---
> Taylan Akdogan Massachusetts Institute of Technology
> akdogan@mit.edu Department of Physics
> Phn:+1-617-258-0801 Laboratory for Nuclear Science
> Fax:+1-617-258-5440 Room 26-402b
> ---=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=---
>
> On Sat, 16 Oct 2004, Chi Zhang wrote:
>
> >
> > Hi all,
> >
> > sorry to break this bad news, but ever since September 22nd, the
> > auto-cruncher has been crunching the same runs multiple times.
> >
> > The symptom: in status_list.txt, multiple entries for the same run
> > appear, and they are ON DIFFERENT CPUs!!!!!!!!!! When a dst is opened
> > and the command dst->Scan("fNEvent") is issued in root, one can see the
> > same CODA event number appear multiple times!!!!!!!!! This continues
> > until all but one lrn crashes out.
> >
> > see this section of status_list.txt:
> > 11922 spud2.bates.daq 30
> > 11923 bud06.bates.daq 1
> > 11923 spud4.bates.daq 30
> > 11924 bud23.bates.daq 1
> > 11924 spud1.bates.daq 30
> > 11925 spud1.bates.daq 30
> > 11925 spud3.bates.daq 1
> > 11926 spud2.bates.daq 30
> > 11927 spud3.bates.daq 1
> > 11927 spud5.bates.daq 30
> > 11928 bud22.bates.daq 1
> >
> > The last run crunched normally is run 11297, finished at 3:05 on Sep
> > 22nd. The following and later runs are all crunched multiple times
> > and, unfortunately, at the same time:
> >
> > 143635915 Sep 22 03:50 /net/data/4/Analysis/data//dst-11296.root
> > 249868274 Sep 22 07:22 /net/data/4/Analysis/data//dst-11298.root
> > 253659856 Sep 22 07:36 /net/data/4/Analysis/data//dst-11293.root
> > 251871484 Sep 22 07:56 /net/data/4/Analysis/data//dst-11295.root
> > 253803099 Sep 22 08:19 /net/data/4/Analysis/data//dst-11294.root
> > 254026042 Sep 22 22:12 /net/data/4/Analysis/data//dst-11299.root
> >
> > I stopped the cruncher daemon on dlbast09, and there does not seem to
> > be another cruncher running at the same time, since the runlist is
> > modified only by elog, not by the cruncher (run numbers are being
> > written in, not taken out).
> >
> > All these runs up to 11960 will have to be recrunched with
> > lrn!!!!!!!!!!!!!!!!!!!
> >
> > For people going to Chicago: we need to figure out what we shall
> > present. For people who went to Trieste: I hope your "PRELIMINARY"
> > stamps are BIG enough.
> >
> > Chi
> >
> >
> > keywords: FAILURE
> >
> > P.S. I don't have the stomach to debug the cruncher; I turned it off
> > and am crunching runs from 11962 onward manually. Cruncher experts,
> > please investigate.
> >
>
>


