Friday, March 03, 2006

It just gets worse....

A wise man told me last week that I should be careful what I wished for.

I should have listened.

This week there were two major incidents.

The first happened on Wednesday.
We'd just got to the final stage of upgrading our app on the first test run and were preparing to run the final set of scripts.

Then the lights went out.
And the PCs on all the desks went off.

Two of the engineers and I went running into the computer room to check on the servers.

I was trying to explain that I needed to bring the production databases down cleanly, as quickly as possible, as we didn't have the auto-shutdown software working. As there was no power to the user PCs, they couldn't access the systems anyway so it wouldn't be an issue.

Then someone issued the fatal line:
'It's OK. We've probably got at least 30 minutes or so on the UPS boxes'.
The words had barely left his mouth when, yes, you've guessed it, the UPS boxes failed one after the other.


Turns out that some idiot in a JCB on the adjacent construction site had barged straight through all of the cables supplying the Business Park.
The power was out for almost two hours and it took us practically the rest of the afternoon to get everything back up and running as normal.

Luckily (touch wood, cross your fingers, whatever you do for luck), there was no serious damage done.

So, on Thursday we made another attempt to run the final set of scripts.

While we were preparing to do this, I had one of the engineers installing two brand new servers on which we were going to build what will become the new production system.
The idea being that this current round of testing is to ensure the process works, then we perform the same process again on the new servers.

At Go-Live, the new servers will become the new production system and the existing production system will initially continue to exist as an archive/contingency plan, before being relegated to a test system.

Anyway, we set the script away, and it had been running for about 45 minutes of an estimated running time of 1 hour 20 minutes.

Suddenly, my connection dropped.
I could still ping the server, but could no longer create a terminal services connection.

A quick jog round to the computer room to check on the server and I soon discovered what had happened.

There, displayed on the console of the server I was working on, was a message highlighting a problem with the IP address.

A chat with the engineer installing the servers revealed that out of the two IP addresses he'd been allocated for the new servers, one of them was the IP address of the existing test system.


To top it all off, the script that had been running at the time the connection dropped was provided by the 3rd party who provided the application.

The script looked at a table X that contained both current and historic data.

It created 3 temporary tables, then inserted rows into them based on a select from X.
Once the insert statement completed, the rows were then deleted from table X.
The data in the temporary tables was then updated to reflect the new accounting structure and inserted back into table X and finally deleted from the temporary tables.

After we got the IP issue sorted, we then had to find out at what point the script had stopped and clean it all up.

There was no data in the temporary tables.

The data was no longer in table X.

We'd lost it.

Luckily, I'm very cautious and had exported the table before running the script, but we lost the rest of the afternoon waiting for it to import.
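
For what it's worth, the copy-out, delete, update and re-insert steps described above can be wrapped in a single transaction, so that a dropped connection part-way through rolls everything back instead of leaving table X empty. What follows is only a rough sketch of that idea and not the vendor's script. It assumes SQLite via Python's sqlite3 module purely for illustration, and the table x, the work table x_work, the columns and the 'NEW-' prefix are all made-up names. How a real database treats temporary tables and commits may differ.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for table X (hypothetical columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE x (id INTEGER, account TEXT, amount REAL, historic INTEGER)"
    )
    conn.executemany(
        "INSERT INTO x VALUES (?, ?, ?, ?)",
        [(1, "A100", 10.0, 1), (2, "A200", 20.0, 1), (3, "A300", 30.0, 0)],
    )
    conn.commit()
    return conn


def restructure(conn, fail_midway=False):
    """Move historic rows out of x, update them, and put them back.

    All data changes happen inside one transaction, so an interruption
    leaves x exactly as it was. Returns True on success, False on failure.
    """
    # The work table is created up front, outside the data transaction.
    conn.execute(
        "CREATE TEMP TABLE IF NOT EXISTS x_work "
        "(id INTEGER, account TEXT, amount REAL)"
    )
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "INSERT INTO x_work "
                "SELECT id, account, amount FROM x WHERE historic = 1"
            )
            conn.execute("DELETE FROM x WHERE historic = 1")
            if fail_midway:
                # Stand-in for the connection dropping mid-script.
                raise ConnectionError("simulated dropped connection")
            # Hypothetical "new accounting structure": prefix the account code.
            conn.execute("UPDATE x_work SET account = 'NEW-' || account")
            conn.execute("INSERT INTO x SELECT id, account, amount, 1 FROM x_work")
            conn.execute("DELETE FROM x_work")
    except ConnectionError:
        return False
    return True


def accounts(conn):
    """Return the account codes in x, ordered by id."""
    return [row[0] for row in conn.execute("SELECT account FROM x ORDER BY id")]
```

Calling restructure(conn, fail_midway=True) on the demo database returns False and leaves all three rows in x untouched; a later restructure(conn) then completes the change. Taking an export first, as I did, is still worth doing either way.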

I may have to rename this blog 'The Disaster Diary'.


Anonymous said...

Fab stuff. Sorry but this posting better remain anonymous.

We are having a standby generator installed at the moment. This will complement our UPS system which was installed 12 months ago. The specification of the UPS is such that it can provide 30 minutes of power - more than enough to cover the gap between loss of mains power and the stand-by generator kicking in.

We were promised the stand-by generator would be installed at the same time as the UPS, but as expected it never materialised.

Meantime we've suffered a number of power failures, each of which has resulted in uncontrolled shutdowns; it takes us 50 minutes to close all services cleanly. The company secretary (since retired, thankfully) took it upon himself to blame us for not highlighting the risks we faced in the event of a power failure. He of course was the nice man blocking the promised finances to cover the costs of the stand-by generator....

Sometimes I think Dilbert is just a story....

Friday, March 03, 2006 9:33:00 pm  
shrek said...

ah, darlin'. welcome to the wonderful world of IT.;-)

i can tell a bunch of those stories myself. and will at Hotsos next week. i wonder if i can remember how to be a DBA instead of a data loader?;-)

Friday, March 03, 2006 10:34:00 pm  
Bill S. said...

You have my sympathies (and my empathies as well ;-D).

For future reference:



Monday, March 06, 2006 2:57:00 pm  
