Friday, 13 November 2015

Longford Refinery Disaster - operator error or KM failure?

This book, about the explosion at Esso’s Longford refinery in Australia, is a sobering account of a fatal disaster, and can also be read as the story of an ineffective Knowledge Supply Chain.  It illustrates the potentially appalling consequences of failing to supply operators with the knowledge they need to operate in a high-risk environment.


"Operator error" might be a first pass conclusion when something goes wrong, but very often you need to look deeper, and understand why the operator made an error.

Did they have all the knowledge they needed to make decisions? Did they have training? Did they have access to expertise?

The story of the Longford refinery disaster explores this question. The disaster is described in Wikipedia as follows:
During the morning of Friday 25 September 1998, a pump supplying heated lean oil to heat exchanger GP905 in Gas Plant No. 1 went offline for four hours, due to an increase in flow from the Marlin Gas Field which caused an overflow of condensate in the absorber. A heat exchanger is a vessel that allows the transfer of heat from a hot stream to a cold stream, and so does not operate at a single temperature, but experiences a range of temperatures throughout the vessel. Temperatures throughout GP905 normally ranged from 60 °C to 230 °C (140 °F to 446 °F). Investigators estimated that, due to the failure of the lean oil pump, parts of GP905 experienced temperatures as low as −48 °C (−54 °F). Ice had formed on the unit, and it was decided to resume pumping heated lean oil in to thaw it.
When the lean oil pump resumed operation, it pumped oil into the heat exchanger at 230 °C (446 °F) - the temperature differential caused a brittle fracture in the exchanger (GP905) at 12:26 pm. About 10 metric tonnes of hydrocarbon vapour were immediately vented from the rupture. A vapour cloud formed and drifted downwind. When it reached a set of heaters 170 metres away, it ignited. This caused a deflagration (a burning vapour cloud). The flame front burnt its way through the vapour cloud, without causing an explosion. When the flamefront reached the rupture in the heat exchanger, a fierce jet fire developed that lasted for two days ...
Peter Wilson and John Lowery were killed in the accident and eight others were injured.....Esso blamed the accident on worker negligence, in particular Jim Ward, one of the panel workers on duty on the day of the explosion.  The findings of the Royal Commission, however, cleared Ward of any negligence or wrong-doing. Instead, the Commission found Esso fully responsible for the accident:
So what might cause apparent "worker negligence" (aka operator error) in cases like this?

The disaster happened when hot oil was pumped into the cold exchanger, which was the wrong thing to do - but why did the operators do this? The book mentions what it calls "latent conditions" which can cause operators to make poor decisions, such as "poor design, gaps in supervision, undetected manufacturing defects or maintenance failures, unworkable procedures, clumsy automation, shortfalls in training, less than adequate tools and equipment (which) may be present for many years before they combine with local circumstances and activate failures to penetrate the system's many layers of defences".

If an operator does not have the correct training, or the correct procedures, then you could argue that they do not have the knowledge to make the correct decision, and so may end up making mistakes not through error or negligence, but through ignorance. 

If they do not have the knowledge they need to make an effective decision, then this could be seen to be a failure of the knowledge management system, for not providing the operators with the knowledge they need to avoid the error, to make the correct decision, or to take the necessary preventative action when things go wrong.

In knowledge management terms, the investigative commission found these three contributory factors (again, according to Wikipedia), which point to a lack of knowledge on the part of the operators, a lack of access to more skilled knowledge, and a lack of communication of knowledge - all of them potential KM failures:
  • inadequate training of personnel in normal operating procedures of a hazardous process;
  • the relocation of plant engineers to Melbourne had reduced the quality of supervision at the plant;
  • poor communication between shifts meant that the pump shutdown was not communicated to the following shift.

The following quote from the book is a statement from the operator himself, and you can hear from the language he uses that this was way outside his experience and knowledge base.
"Things happened on that day that no one had seen at Longford before. A steel cylinder sprang a leak that let liquid hydrocarbon spill onto the ground. A dribble at first, but then, over the course of the morning it developed into a cascade ... Ice formed on pipework that normally was too hot to touch. Pumps that never stopped, ceased flowing and refused to start. Storage tank liquid levels that were normally stable plummeted ... I was in Control Room One when the first explosion ripped apart a 14-tonne steel vessel, 25 metres from where I was standing. It sent shards of steel, dust, debris and liquid hydrocarbon into the atmosphere".
In a situation like this, where the wrong operational decision can be lethal and operator error through ignorance cannot be allowed, effective knowledge management and an effective knowledge supply chain (in the sense of ensuring that people have access to the knowledge they need, at the time they need it, in order to make correct decisions) is not just a nice-to-have; it's a life saver.

1 comment:

Stuart French said...

I did my apprenticeship at the Longford plant and have been in that control room many times. My next door neighbour was one of those badly injured in the incident, and one of the older tradesmen from my section was on duty (I suspect that was him you quoted).

I had left just a few years before the accident, and at the time it was renowned as the safest workplace in Australia. Every meeting, from a bunch of apprentices to the executive board, started with a five minute session on safety communication and improvement.

Just as I left the business, the tax scheme changed and the company went into cutback mode. There were strikes and the safety meetings slowly dissipated. People became too busy.

This was long before I found my career in KM, but your comments here ring true to me. There were knowledge breakdowns, and the process of moving from expert engineers to detailed procedures for everything (they were even experimenting with Expert Systems while I was there) meant that the new, younger employees only seemed to understand the plant during normal operation. Some may have lacked the chemical and process engineering knowledge to understand why the weird things they saw that day were happening and what the results could be.

My first response to this is to insert event and limit triggers in the workflows and systems that require the operator to call an engineer in to review the situation, both at a local and a system-wide level, so that cascade failures are minimized. But underlying this is the need for that expertise to still be in the business in the first place. With the cutbacks, KM and HR together could have been carefully monitoring what competencies and expertise were required as a minimum to keep the plant safe. I wasn't there, but based on the culture when I left I am not sure this would have been done as well as it could have. They hired such top people that they sometimes relied on them to manage well, and automation was focused more at the operations level than the knowledge level.
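The "limit trigger" idea above can be sketched in a few lines: compare live readings against the plant's normal operating envelope, and escalate to an engineer the moment any reading leaves it. This is a minimal illustration only - the tag names, thresholds and structure are all hypothetical, not taken from any real control system (though the 60-230 °C band echoes the normal GP905 range quoted earlier).

```python
# Minimal sketch of a limit trigger: flag any process reading that falls
# outside its normal operating range, so an engineer review is forced
# before operators improvise. All tags and thresholds are hypothetical.

NORMAL_RANGE = {
    "gp905_temp_c": (60.0, 230.0),     # normal heat-exchanger temperature band
    "absorber_level_pct": (20.0, 80.0), # hypothetical condensate level limits
}

def out_of_limits(readings):
    """Return (tag, value) pairs that breach their normal range."""
    breaches = []
    for tag, value in readings.items():
        lo, hi = NORMAL_RANGE[tag]
        if not lo <= value <= hi:
            breaches.append((tag, value))
    return breaches

def review_required(readings):
    """True when any limit is breached, i.e. call an engineer in."""
    return len(out_of_limits(readings)) > 0

# A reading like the -48 °C the investigators estimated would trip the
# trigger, even though operators had never seen it before:
print(review_required({"gp905_temp_c": -48.0, "absorber_level_pct": 55.0}))
```

The point is not the code but the design choice the commenter describes: the trigger encodes the envelope the plant was designed for, so abnormal states escalate to expertise automatically rather than relying on operators recognising conditions "no one had seen at Longford before".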

Thanks for making me aware of the report Nick.
