Rumba decides to spontaneously reboot


Today's post contributed by Cluster spacecraft operations manager Bruno Sousa.

Another day in the life of an operations team

On 23 November, at around 19:00 CET, the Spacecraft Controller (Spacon) on shift for the four satellites of the Cluster II mission had just initiated a ground contact from ESA’s ground station at Villafranca, Spain. He noted that the signal arriving at the spacecraft was weak but, given that the spacecraft was able to lock onto it anyway, he proceeded with the contact and prepared to download the scientific data stored onboard. As the first commands were uplinked (including a change in the telemetry bit rate), the spacecraft suddenly went mute. No more telemetry was received.

Cluster team in the DCR. Credit: ESA


After checking for potential problems with the ground segment, including a possible misconfiguration of the station as a result of the change in telemetry mode, the Spacon contacted the on-call operations engineer, who quickly came into the control room. Together they initiated the contingency procedure that covers “loss of telemetry”. The procedure included switching on the on-board radio transponder, and this immediately restored the flow of telemetry.

After a preliminary analysis of the data the spacecraft was generating, we determined that its status was consistent with a ‘reboot’ (a software ‘flag’ that disables further reboots was set, indicating that one had just taken place). The spacecraft stores, in a protected area of memory, a list of events generated by the software around the time of the reboot. Once we downloaded these, we could immediately tell that the software had entered an exception clause while decoding a command.

The last time we had seen something like this was back in 2010, when a very similar occurrence took place. Luckily, we keep thorough records of all such occurrences, including all the investigation work done to get to the source of the problem.

Those earlier investigations had determined that, upon reception of a command (and only one of a very specific type), the software validates it. If it finds that the command is reported as having ‘size 0’ (indicating that the command has become corrupted), it immediately triggers a reboot of the on-board processor. This happens before the function has a chance to validate the command’s checksum, which would have identified the corruption and caused the command to simply be rejected.
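The ordering problem can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Cluster flight software: the `Command` layout, the 8-bit `frame_checksum`, the `ProcessorReboot` exception and both handler functions are hypothetical names invented for this sketch, and the real checksum algorithm is not described in this post. The second handler is shown purely for contrast, to make the effect of the check order visible.

```python
from dataclasses import dataclass


class ProcessorReboot(Exception):
    """Hypothetical stand-in for the on-board processor rebooting."""


@dataclass
class Command:
    size: int        # declared size field, as received
    payload: bytes   # command data, as received
    checksum: int    # checksum transmitted along with the command


def frame_checksum(size: int, payload: bytes) -> int:
    # Illustrative 8-bit sum over the size field and payload (an assumption;
    # the actual on-board algorithm is not given in the post).
    return (size + sum(payload)) % 256


def handle_as_described(cmd: Command) -> str:
    # Order described in the post: the size test runs first,
    # so a command reporting size 0 reboots the processor.
    if cmd.size == 0:
        raise ProcessorReboot("command reported size 0")
    if frame_checksum(cmd.size, cmd.payload) != cmd.checksum:
        return "rejected"
    return "accepted"


def handle_checksum_first(cmd: Command) -> str:
    # Contrasting order: the checksum is verified before the size is inspected,
    # so a corrupted command is rejected without any reboot.
    if frame_checksum(cmd.size, cmd.payload) != cmd.checksum:
        return "rejected"
    if cmd.size == 0:
        raise ProcessorReboot("command reported size 0")
    return "accepted"


# A valid command, and the same command with its size field corrupted to 0.
good = Command(size=3, payload=b"\x01\x02\x03", checksum=9)
corrupted = Command(size=0, payload=b"\x01\x02\x03", checksum=9)
```

With these definitions, `handle_checksum_first(corrupted)` returns `"rejected"`, while `handle_as_described(corrupted)` raises `ProcessorReboot`, which mirrors the behaviour seen on the spacecraft.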

The corruption of the command was most likely due to the poor signal strength received onboard. We could not determine why the signal was so weak. We did note that, once we switched from the Villafranca station to the Maspalomas station, the signal improved quite significantly, but we could not identify any problem with the first station.

Maspalomas station. Credit: ESA/F. Macia


Following the reboot, the spacecraft remained in the so-called ‘Nominal Survival Mode’, which includes leaving the transponder off; this is why we saw no more telemetry from the spacecraft. Another consequence of the reboot is that the onboard solid-state memory, where the science data is recorded, is also switched off. Because the technology for this hardware dates from the mid-1990s, a switch-off means that all data stored there at the time is lost; in this case that represented more than 40 hours of recorded data, which unfortunately cannot be recovered.

The Cluster flight control team communicates avidly via mobile text messaging, and soon the on-call engineer was joined by two other colleagues who came in to help. When running long and complicated procedures, it’s always handy to have another pair of eyes looking over your shoulder so that you don’t forget anything.

Together, they swiftly proceeded to recover the nominal configuration for the spacecraft, including reactivating most of the payloads. As the visibility from the ground station was coming to an end, the team had to select the payloads that could still be reactivated within the available time, with the remaining activations carried out during the following station pass. At around 01:00, the team concluded its intervention and the engineers went home for a well-deserved rest.

And that was another day in the life of an operations team!



from Rocket Science http://ift.tt/2gHU9NV
