Re: [BLAST_ANAWARE] [BLAST_SHIFTS] autocruncher double crunch 100kC worth of data affected

From: Taylan Akdogan (akdogan@MIT.EDU)
Date: Sat Oct 16 2004 - 20:50:42 EDT


Hi Chi,

I suggest checking each run before concluding that "all" of them
are this way. I have seen the auto-cruncher submit multiple
processes for the same run; as an example, it once submitted 6
jobs for a single run. However, my experience during the shifts
was that this does not happen at a very high rate.

Regarding the portion of the status file you attached below: the
analysis disk filled up on Friday morning and several jobs
crashed for that reason (the code 30s), so I restarted them. It
was hard to keep track of which runs were crunching and which
had crashed, so I sorted that portion of the status file by run
number. So, although those entries appear one after another,
they were not submitted at the same time.

So, it is better to check the runs one by one before recrunching
all of them. I don't think all of them are bad; most of them are
probably fine. But yes, we should be careful with the
auto-cruncher...
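
To see which runs actually have duplicate entries, something like
the short program below could be run over status_list.txt. This is
only a rough, untested sketch; it assumes each line holds the run
number, the host, and the status code, in that order, as in the
excerpt you attached.

// dup_runs.cc -- list runs with more than one entry in status_list.txt
#include <cstdio>
#include <fstream>
#include <map>
#include <string>

int main()
{
    std::ifstream in("status_list.txt");
    std::map<int, int> count;          // run number -> number of entries
    int run, code;
    std::string host;

    // each line: <run> <host> <status code>
    while (in >> run >> host >> code)
        ++count[run];

    for (std::map<int, int>::iterator it = count.begin();
         it != count.end(); ++it)
        if (it->second > 1)
            std::printf("run %d: %d entries\n", it->first, it->second);
    return 0;
}

Only the runs that show up more than once would then need a closer
look before recrunching.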

There is no need for large-scale panic!

Taylan

---=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=---
Taylan Akdogan           Massachusetts Institute of Technology
akdogan@mit.edu          Department of Physics
Phn:+1-617-258-0801      Laboratory for Nuclear Science
Fax:+1-617-258-5440      Room 26-402b
---=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=---

On Sat, 16 Oct 2004, Chi Zhang wrote:

>
> Hi all,
>
> Sorry to break this bad news, but ever since September 22nd the
> auto-cruncher has been crunching the same runs multiple times.
>
> The symptom: in status_list.txt, multiple entries appear for the same
> run, and they are ON DIFFERENT CPUs! When a dst is opened and the
> command dst->Scan("fNEvent") is issued in root, one can see the same
> CODA event number appearing multiple times! This continues until all
> but one lrn has crashed out.
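>
> (As a check of a given dst, something along the lines of the macro
> below could be used -- a rough, untested sketch; it assumes the tree
> is named "dst" and has an fNEvent leaf, as in the Scan() command
> above.)
>
> // dup_check.C -- count repeated fNEvent values in one dst file
> #include "TFile.h"
> #include "TTree.h"
> #include "TLeaf.h"
> #include <set>
> #include <cstdio>
>
> void dup_check(const char* fname = "dst-11298.root")
> {
>    TFile f(fname);
>    TTree* dst = (TTree*)f.Get("dst");
>    TLeaf* leaf = dst->GetLeaf("fNEvent");
>    std::set<Long64_t> seen;
>    Long64_t ndup = 0;
>
>    for (Long64_t i = 0; i < dst->GetEntries(); ++i) {
>       dst->GetEntry(i);
>       // an event number we have already seen means a double crunch
>       if (!seen.insert((Long64_t)leaf->GetValue()).second)
>          ++ndup;
>    }
>    std::printf("%s: %lld duplicated fNEvent values\n", fname, ndup);
> }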
>
> See this section of status_list.txt:
> 11922 spud2.bates.daq 30
> 11923 bud06.bates.daq 1
> 11923 spud4.bates.daq 30
> 11924 bud23.bates.daq 1
> 11924 spud1.bates.daq 30
> 11925 spud1.bates.daq 30
> 11925 spud3.bates.daq 1
> 11926 spud2.bates.daq 30
> 11927 spud3.bates.daq 1
> 11927 spud5.bates.daq 30
> 11928 bud22.bates.daq 1
>
> The last run crunched normally was run 11297, which finished at 3:05
> on Sep 22nd. The following runs and all later ones were crunched
> multiple times, and unfortunately at the same time:
>
> 143635915 Sep 22 03:50 /net/data/4/Analysis/data//dst-11296.root
> 249868274 Sep 22 07:22 /net/data/4/Analysis/data//dst-11298.root
> 253659856 Sep 22 07:36 /net/data/4/Analysis/data//dst-11293.root
> 251871484 Sep 22 07:56 /net/data/4/Analysis/data//dst-11295.root
> 253803099 Sep 22 08:19 /net/data/4/Analysis/data//dst-11294.root
> 254026042 Sep 22 22:12 /net/data/4/Analysis/data//dst-11299.root
>
> I stopped the cruncher daemon on dlbast09. There does not seem to be
> another cruncher running at the same time, since the runlist is being
> modified only by elog, not by the cruncher (run numbers are being
> written in, not taken out).
>
> All these runs, up to 11960, will have to be recrunched with lrn!
>
> For the people going to Chicago, we need to figure out what we shall
> present. For the people who went to Trieste, I hope your "PRELIMINARY"
> stamps are BIG enough.
>
> Chi
>
>
> keywords: FAILURE
>
> P.S. I don't have the stomach to debug the cruncher; I turned it off
> and am crunching runs from 11962 onward manually. Cruncher experts,
> please investigate.
>


