Some hardware tests run for a very long time, 2,000,000 cycles or more in reliability or life tests. These tests run for months and have to be monitored. It's not nice to lose 10k cycles because one of the units decided to stop for some reason on Friday at 17:05. Or because a unit stopped 10 minutes after the single manual check of the day (normally the units run for weeks without a problem).

The classic way is to check all the units manually, maybe several times a day and even on weekends. This is time-consuming, since you can easily have 20 units running at a time, and to be honest it's not that interesting to do either. ;-)

But if you happen to have a nice interface for asking the units for their state and even sending them commands, e.g. an XML-based interface over an SSH network connection, you can automate a lot of this checking.
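To give an idea of what the automated check has to do with a unit's answer, here is a minimal sketch in Python. The XML layout (a `unit` element with a `name` attribute and `state` and `cycles` children), the unit name `rig-07` and the names `UnitStatus` and `parse_status` are all made up for illustration; a real unit will have its own schema, and fetching the reply over the network is left out.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class UnitStatus:
    name: str
    state: str
    cycles: int


def parse_status(xml_text: str) -> UnitStatus:
    """Turn one (hypothetical) XML status reply into a UnitStatus."""
    root = ET.fromstring(xml_text)
    return UnitStatus(
        name=root.get("name", "unknown"),
        # Missing elements fall back to defaults instead of raising.
        state=root.findtext("state", default="unknown"),
        cycles=int(root.findtext("cycles", default="0")),
    )


# Assumed example reply; not the format of any particular device.
SAMPLE_REPLY = (
    "<unit name='rig-07'>"
    "<state>running</state>"
    "<cycles>1234567</cycles>"
    "</unit>"
)

status = parse_status(SAMPLE_REPLY)
```

Once every unit's reply is reduced to a small record like this, comparing states and cycle counts between two polls is simple.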

And with some additional coding you can get:

  • a nice daily status email
  • a web page with the current status (updated only every 30 minutes, because we do not want to stress the units too much)
  • an emergency email on Friday at 17:30 with the logs of the troubled unit, so you can decide whether to do a remote check and just restart something, return to the office, or wait until Monday because it's completely burned down.
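The choice between the daily status email and the emergency email could be sketched roughly like this in Python. The state strings, the unit names and the functions `find_troubled` and `build_report` are assumptions for illustration only; the sketch just builds the message, and attaching logs or sending it is left out.

```python
from email.message import EmailMessage


def find_troubled(statuses: dict[str, str]) -> list[str]:
    """Return the names of all units that are not in the 'running' state."""
    return sorted(name for name, state in statuses.items() if state != "running")


def build_report(statuses: dict[str, str], sender: str, recipient: str) -> EmailMessage:
    """Build an emergency mail if any unit is in trouble, else a daily summary."""
    troubled = find_troubled(statuses)
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    if troubled:
        msg["Subject"] = f"EMERGENCY: {len(troubled)} unit(s) need attention"
    else:
        msg["Subject"] = f"Daily status: all {len(statuses)} units running"
    # One line per unit, so the mail body doubles as a quick overview.
    lines = [f"{name}: {state}" for name, state in sorted(statuses.items())]
    msg.set_content("\n".join(lines))
    return msg


# Hypothetical poll result: unit name -> reported state.
example = {"rig-01": "running", "rig-02": "stopped"}
report = build_report(example, "monitor@example.com", "me@example.com")
```

The resulting message can be handed to `smtplib.SMTP.send_message`, and the same per-unit lines could be written into the status web page.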

This saves time, which can be spent on other (more challenging) tasks.
