E&IG useful documents: TCS crash events

GS TCS crash troubleshooting spreadsheet

December 2012 event
Below you will find the email thread of the 2012 TCS crash event.
Hola Arturo,

FYI.

The following email thread shows the type of problems we experienced
with the s/w versions. We had the very same behavior, but with different
symptoms, on the Feb run. In both cases the TCS version was the one being
loaded/unloaded/loaded... after a huge amount of h/w board swapping,
config. changes, etc.

Saludos,

Ramón.

===========================================================================================================

From: Ramon Galvez
Sent: Wednesday, December 05, 2012 12:42 PM
To: Maxime Boccas
Cc: Gustavo Arriagada; Rolando Rogers; Andrew Serio; Cristian Urrutia;
Benoit Neichel; Rodrigo Carrasco; Javier Luhrs; Pablo Diaz; Gaston
Gausachs; Fabrice Vidal
Subject: Re: NGSWFS Status

Hola Maxime,

We did the swap of version on Monday and we have had no issues so far with the new one. We have no clear understanding of what may have caused the problem we had this run. Since the only change from the previous run ( Oct 30 --> Nov 03 ) was the TCS s/w version, we reinstalled the version used on that run, and since then we have had no issues. However, as mentioned, going back to the new one ( Monday morning ) has not given problems either, though it was tested w/o offloads and w/o closing loops, so more tests can be done under those conditions.

On the h/w side we have a complete pwr. & grndg. prev. maintenance scheduled for this subsystem to be carried out tomorrow morning.

It is very important to say that the NGSWFS probes have been used extensively in all the runs we have had since 2010 with the highly demanding Probe Mapping - I have seen demands of ~10nm ! - and all that time it has been successful. I have talked to Benoit about this and we think that the demand rate is unnecessarily high. We do have some minor timeouts - that are easily solved on the fly - that we are working on.

The weird behavior we saw on Friday and on Saturday was completely different, i.e. the OMS58 controller (8 axis controll.) was completely lost and this was highly intermittent making it very difficult to diagnose. We have had no more problems since Sunday and a lot of Probe Mapping was successfully done that night. Maybe Benoit wants to comment on this.

Saludos,

Ramón.

On 05-12-2012 12:08, Maxime Boccas wrote:

> I was a bit behind the news with my last email asking about this.... It seems confirmed there was a SW version issue then? Shall we review our change control protocol?
> Thanks
> M
>
> -----Original Message-----
> From: Gustavo Arriagada
> Sent: Monday, December 03, 2012 11:09 AM
> To: Ramon Galvez; Rolando Rogers
> Cc: Cristian Urrutia; Benoit Neichel; Rodrigo Carrasco; Javier Luhrs;
> Pablo Diaz; Maxime Boccas; Gaston Gausachs; Fabrice Vidal
> Subject: RE: NGSWFS Status
>
> This is very encouraging news, good job everyone!!! Hopefully today's tests will confirm your suspicions.
>
> Saludos
>
> Gustavo
>
>
> ________________________________________
> From: Ramon Galvez
> Sent: Monday, December 03, 2012 1:24 AM
> To: Gustavo Arriagada; Rolando Rogers
> Cc: Cristian Urrutia; Benoit Neichel; Rodrigo Carrasco; Javier Luhrs;
> Pablo Diaz
> Subject: NGSWFS Status
>
> Hola Rolando, Gustavo and all,
>
> Today we had to come up to work on the NGSWFS issues that were reported by Benoit. As I've mentioned to some of you, I saw similar problems during h/w & s/w development, but that was about 4 years ago. Since then, this is the first time that this type of highly intermittent issue has shown up, mainly while in Probe Tracking Mode. All this started on this particular run; nothing like this was seen on the run we had on Oct 30 --> Nov 03 or in any run since we started with the Commissioning Runs back in 2010.
>
> As reported, we have had the following issues :
>
> 1. Sudden stalled mechanism, i.e. on Friday night P2Y stalled at -1.4[mm] while going from 0.0 --> 14[mm].
> 2. Unable to do a very basic Index on some probes.
> 3. Spiral routine getting stalled by a Hardware Sensor Limit read by the s/w (not a hard Hardware Pwr. Shutdown Limit). Today we had P2Y reaching one of those limits with no explanation and we verified with Cristian that it was really reaching it physically. This was after having a demand from the TCS due to a CWFS2 following selection.
>
> To rule out the s/w TCS version in use, I asked to have the version we used on the aforementioned previous run installed. Once Rodrigo was done with his GSAOI tasks it was remotely installed by Javier @ ~11pm. Then the night time crew (Benoit, Rodrigo, Drew) started to use the Tracking Mode for real science Probe Mapping on 3 different targets. No issues have shown up; this means continuous use of all 3 probes in Tracking Mode for 2 hrs, and it is still on-going.
>
> We (Ramón and Cristián) want to run more tests of this kind during daytime, so I entered an urgent TR for tomorrow to continue the tests before changing to the new TCS version for further tests by the SOS Group.
>
> Saludos,
>
> Ramón, Cristián
>
Saludos,

Ramón.

--
Ramon L. Galvez
Senior Electronics Engineer
Gemini Observatory
www.gemini.edu
Phone : +56 51 205678
Recept. : +56 51 205600
Fax : +56 51 205655

February 2013 event

The troubleshooting step by step

The team:

Ramon Galvez, Vanessa Montes, John White, Pedro Gigoux, Roberto Rojas, Cristian Urrutia, Chris Morrison, Cristian Silva, Jose Varas, Benoit Neichel, Ariel Lopez, Pedro Ojeda, Herman Diaz

May 12-13 event

The team:

Ramon Galvez, Roberto Rojas, Cristian Urrutia, Chris Morrison, Cristian Silva, Jose Varas, Arturo Nunez, William Rambold, Javier Luhrs, Gustavo Arriagada

SW and IS tools compilation document
In the link above you will find all the tools SW G and IS G have identified to be used during this troubleshooting process.

May 12-13 TCS crash troubleshooting plans
In the link above you will find the troubleshooting plans that each group is proposing.

2 comments:

Gustavo A, May 13, 2013 at 1:13 PM

Hi

Today I also talked to Angelic about this problem, and interestingly they
have seen similar problems at GN, and found out that too many clients
were connected to the TCS - this can be checked with the casr
command as described on the page. As we started software testing, it is
possible that people have left additional DM screens/CA clients running.

We definitely need to ask people to close DM screens that are not being
used, so the TCS is not swamped with unnecessary CA requests.

Cristian/Roberto, could you run casr on the TCS to get an estimate of the
"normal" number of connections we have in the TCS? At GN, that number is
~3700; when the problem showed up there, it went up to 5000+. Look at this
tool that Tom put together to help diagnose this particular situation:

http://hbfgealabs.hi.gemini.edu/~tcumming/casr.php

This only works for GN - it archives casr output every hour or so, so you
can use it to identify changes. I'm working with Angelic to enable these
things for GS. Gustavo, these are tools that probably John mentioned to
you.

Cheers

--
Arturo
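
The casr check described in this comment could be scripted roughly as below. This is only a sketch: it assumes the report has already been captured as text and that each client line looks like `User="name", ..., N Channels`, which may differ between EPICS base versions. The sample report, host names, addresses and the 25% margin are hypothetical; only the ~3700 and 5000+ figures come from the comment.

```python
import re

# Assumed shape of one client line in a casr report. The real format depends
# on the EPICS base version, so this pattern is an assumption to adjust.
CIRCUIT_RE = re.compile(r'User="(?P<user>[^"]*)".*?(?P<channels>\d+)\s+Channels')


def summarize_casr(report_text):
    """Return (total_channels, {user: channels}) for a captured casr report."""
    per_user = {}
    for line in report_text.splitlines():
        match = CIRCUIT_RE.search(line)
        if match is None:
            continue  # header lines and anything that is not a client circuit
        user = match.group("user")
        per_user[user] = per_user.get(user, 0) + int(match.group("channels"))
    return sum(per_user.values()), per_user


def over_budget(total_channels, baseline, margin=0.25):
    """True when the total exceeds the baseline by more than `margin` (hypothetical rule)."""
    return total_channels > baseline * (1.0 + margin)


# Hypothetical sample report, not real TCS output.
SAMPLE = """Channel Access Server V4.11
Connected circuits:
TCP 10.0.0.11:40112(hostA): User="tcc", V4.11, 1000 Channels, Priority=0
TCP 10.0.0.12:40345(hostB): User="seqexec", V4.11, 500 Channels, Priority=0
TCP 10.0.0.13:40771(hostC): User="dm", V4.11, 10 Channels, Priority=0
"""

total, per_user = summarize_casr(SAMPLE)
print(total, per_user)
# GN figures from the comment: ~3700 is normal, 5000+ when the problem appeared.
print(over_budget(5000, baseline=3700))
```

Running something like this against an hourly capture would give the same kind of trend the GN page shows, once a GS baseline is known.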

Gustavo A, May 14, 2013 at 8:32 AM

Hello

More information based on experiments we ran yesterday with Javier and Cristian.

The systems that load the TCS the most are the Tcl/Tk applications (TCC, seqexec, TSD). The TCC alone opens ~1000 channels to the TCS, the seqexec ~500 and the TSD ~300. DM screens behave somewhat better, because they only open the channels they need (one for each EPICS value being monitored, i.e., a screen with 10 values keeps 10 channels open). Of course, many DM screens will eventually saturate the TCS, so it is still a good idea to keep them under control. Therefore: avoid keeping TCC, TSD or seqexec sessions open when they are not needed! I would recommend always shutting them down when you finish using them, to avoid problems.

A correction to my wording (which I just noticed in the email I wrote yesterday). The connections I am talking about are Channel Access connections, not IP connections, so they do not translate directly into sockets (I don't know why I thought that yesterday, but if that were the case, it would be impossible to scale EPICS applications). Each process keeps only one socket open to the VME (so don't worry Chris, we are not saturating the network _that way_ :)).

However, what still holds is that the more connections there are between the TCS and its clients (number of open channels), the more processing is required on the VME CPU, which eventually saturates it, and we see the effect as the WSOD.

We are now going to monitor this during normal operations to determine what a "healthy" number is for the TCS. Last night when we left, we shut down the applications that were running on sbfcon02; at that moment the TCS was handling ~4800 channels, and we had no problems during the night.

Regards

--
Arturo
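
As a rough illustration of the per-application figures quoted in the comment above (TCC ~1000, seqexec ~500, TSD ~300 channels, plus one channel per monitored value on a DM screen), the expected load could be estimated as below. This is only a sketch: the session counts and DM screen sizes in the example are hypothetical, and the per-application numbers are approximate.

```python
# Approximate channels opened per session, taken from the figures in the comment.
CHANNELS_PER_SESSION = {"TCC": 1000, "seqexec": 500, "TSD": 300}


def estimate_channels(sessions, dm_screen_values=()):
    """Estimate how many CA channels the given clients hold open on the TCS.

    sessions: mapping of application name -> number of open sessions.
    dm_screen_values: number of monitored EPICS values on each open DM screen
                      (one channel per monitored value).
    """
    total = sum(CHANNELS_PER_SESSION[app] * count for app, count in sessions.items())
    total += sum(dm_screen_values)
    return total


# Hypothetical situation: two TCC sessions left open, one seqexec, one TSD,
# and three DM screens showing 10, 25 and 40 values.
open_sessions = {"TCC": 2, "seqexec": 1, "TSD": 1}
estimate = estimate_channels(open_sessions, dm_screen_values=[10, 25, 40])
print(estimate)
```

In this example, closing the one TCC session that is not needed would free about 1000 channels, far more than closing all three DM screens.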

Sunday, February 24, 2013
