[BLAST_SHIFTS] Shift summary 09/11/2003 B (4-8)

From: Electronic Log Book (elog@blast05.lns.mit.edu)
Date: Fri Sep 12 2003 - 08:05:22 EDT


Operators: milner zhangchi ----

4:00am, continued to battle the data transfer problem. Rebooted spud4 and 5. A soft reboot from the terminal failed on spud5: it could not mount /home/blast. Took the key and pushed the button. RAID5 ran for a long time.

problem NOT solved.

Data are not being transferred onto spud4. Confirmed that data are stored on dblast07 and transferred to spud5. The number of data transfer processes on spud4 and 5 seems to be growing constantly.

Finally spotted the problem at 7:20am:
run 2477 has a huge file size: 2.14G. All other transfer jobs on spud4 are blocked by the 30+ copy jobs copying 2477 from dblast07.
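(A rough sketch of how a blocker like this could be spotted next time, by counting how many copy processes reference each run number; the ps/scp command-line pattern and the 4-digit run-directory assumption are mine, not our actual tooling:)

    import re
    import subprocess
    from collections import Counter

    def copy_jobs_per_run():
        # list every process with its full command line
        ps = subprocess.run(["ps", "-eo", "pid,args"],
                            capture_output=True, text=True).stdout
        counts = Counter()
        for line in ps.splitlines():
            if "scp" in line and "dblast07" in line:
                # assume the run number appears as a 4-digit directory in the copied path
                m = re.search(r"/(\d{4})\b", line)
                if m:
                    counts[m.group(1)] += 1
        return counts

    if __name__ == "__main__":
        for run, n in copy_jobs_per_run().most_common():
            print("run %s: %d copy jobs" % (run, n))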

moved 2477 from dblast07:$DATADIR to dblast07:/scratch/dblast07/blast/data/2477/
Hope all the jobs copying 2477 will die down gradually and data flow will return to normal.

It is also worth noting that many of the runs have more than one transfer job copying them from dblast07 to spud4/5. I am not sure what the consequence of this is. Maybe the time interval between transfer jobs should be made larger; a sketch of one way to avoid overlapping copies follows.
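(A minimal sketch of keeping the periodic transfer job from starting a second copy of a run that is already in flight, using a per-run lock file. The lock directory, destination host/path, and scp invocation are assumptions for illustration, not the real transfer script:)

    import os
    import subprocess

    LOCK_DIR = "/tmp/blast_transfer_locks"   # assumed lock location
    DEST = "spud4:/data/blast/"              # assumed destination

    def transfer_run(run_path):
        os.makedirs(LOCK_DIR, exist_ok=True)
        lock = os.path.join(LOCK_DIR, os.path.basename(run_path) + ".lock")
        try:
            # O_EXCL makes lock creation atomic; it fails if a copy is already in flight
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False  # another job is already copying this run
        try:
            subprocess.run(["scp", run_path, DEST], check=True)
        finally:
            os.close(fd)
            os.remove(lock)
        return True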

Recommend making the daqstore user's password available in the counting bay so spurious scp's can be killed if a similar situation happens in the future.
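(If the password is available, something along these lines could be used to kill the stray copies by hand. This is only a sketch; the run-number pattern "2477" is an example, and the match on the ps output is an assumption:)

    import os
    import signal
    import subprocess

    def kill_copies(pattern="2477"):
        # find scp processes whose command line mentions the offending run
        ps = subprocess.run(["ps", "-eo", "pid,args"],
                            capture_output=True, text=True).stdout
        for line in ps.splitlines():
            if "scp" in line and pattern in line:
                pid = int(line.split(None, 1)[0])
                os.kill(pid, signal.SIGTERM)  # ask the copy job to exit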

It may also be worth investigating how such a humongous data file was collected.

6:14am. Beam tripped and took down 3 out of 4 HV crates. After that, had problems enabling auto ring fill: the CCR or the counting bay can enable it individually when the other is disabled, but pressing enable when the other is already enabled disables both immediately. This happens with HV both in standby and in operate. CCR recovered in 15 minutes.

During the entire shift: CODA (ET) hung quite a few times.
WC L6 tripped a dozen times; did NOT lower its HV.

Otherwise, smooth running. :)

 


