Wednesday, 15 September 2010

Operator error, or KM error?

I have been reading a book about the explosion at Esso’s Longford gas plant in Australia. It is a sobering account of a fatal disaster, and it makes some interesting points about "operator error". It got me thinking about whether operator error is a cause of failure, or a symptom of a deeper failure - a failure of the Knowledge Management system.

"Operator error" might be a first-pass conclusion when something goes wrong, but the book suggests that very often you need to look deeper and understand why the operator made the error. Did they have all the knowledge they needed to make decisions? Did they have training? Did they have access to expertise?

Here's what Wikipedia says about the disaster:
During the morning of Friday 25 September 1998, a pump supplying heated lean oil to heat exchanger GP905 in Gas Plant No. 1 went offline for four hours, due to an increase in flow from the Marlin Gas Field which caused an overflow of condensate in the absorber. A heat exchanger is a vessel that allows the transfer of heat from a hot stream to a cold stream, and so does not operate at a single temperature, but experiences a range of temperatures throughout the vessel. Temperatures throughout GP905 normally ranged from 60 °C to 230 °C (140 °F to 446 °F). Investigators estimated that, due to the failure of the lean oil pump, parts of GP905 experienced temperatures as low as −48 °C (−54 °F). Ice had formed on the unit, and it was decided to resume pumping heated lean oil in to thaw it.
When the lean oil pump resumed operation, it pumped oil into the heat exchanger at 230 °C (446 °F) - the temperature differential caused a brittle fracture in the exchanger (GP905) at 12.26pm. About 10 metric tonnes of hydrocarbon vapour were immediately vented from the rupture. A vapour cloud formed and drifted downwind. When it reached a set of heaters 170 metres away, it ignited. This caused a deflagration (a burning vapour cloud). The flame front burnt its way through the vapour cloud, without causing an explosion. When the flamefront reached the rupture in the heat exchanger, a fierce jet fire developed that lasted for two days ......
Peter Wilson and John Lowery were killed in the accident and eight others were injured.....Esso blamed the accident on worker negligence, in particular Jim Ward, one of the panel workers on duty on the day of the explosion.  The findings of the Royal Commission, however, cleared Ward of any negligence or wrong-doing. Instead, the Commission found Esso fully responsible for the accident:
So what might cause apparent "operator error", or "worker negligence" as Wikipedia puts it, in cases like this? The disaster happened when hot oil was pumped into the cold exchanger, which was the wrong thing to do, but why did the operators do this? The book mentions what it calls "latent conditions" which can cause operators to make poor decisions, such as "poor design, gaps in supervision, undetected manufacturing defects or maintenance failures, unworkable procedures, clumsy automation, shortfalls in training, less than adequate tools and equipment (which) may be present for many years before they combine with local circumstances and active failures to penetrate the system's many layers of defences".

If an operator does not have the correct training, or the correct procedures, for example, then they do not have the means to make the correct decision, and so may end up making mistakes through ignorance. If they do not have the knowledge they need to make an effective decision, then any error they make could be argued to be not their fault. The failure could therefore be seen to be a failure of the knowledge management system, for not providing the operators with the knowledge they need to avoid the error, to make the correct decision, or to take the necessary preventative action when things go wrong.

In knowledge management terms, the investigative commission found these three contributory factors (again, according to Wikipedia), which speak to a lack of knowledge on the part of the operators, a lack of access to more expert knowledge, and a lack of communication of knowledge - all of them potential KM failures:
  • inadequate training of personnel in normal operating procedures of a hazardous process;
  • the relocation of plant engineers to Melbourne had reduced the quality of supervision at the plant;
  • poor communication between shifts meant that the pump shutdown was not communicated to the following shift.

The following quote from the book is a statement from the operator himself, and you can hear from the language he uses that this was way outside his experience and knowledge base.
"Things happened on that day that no one had seen at Longford before. A steel cylinder sprang a leak that let liquid hydrocarbon spill onto the ground. A dribble at first, but then, over the course of the morning it developed into a cascade ... Ice formed on pipework that normally was too hot to touch. Pumps that never stopped, ceased flowing and refused to start. Storage tank liquid levels that were normally stable plummeted ... I was in Control Room One when the first explosion ripped apart a 14-tonne steel vessel, 25 metres from where I was standing. It sent shards of steel, dust, debris and liquid hydrocarbon into the atmosphere".
In a situation like this, where operator error can be lethal and operator error through ignorance cannot be allowed, effective knowledge management (in the sense of ensuring that people have access to the knowledge they need, at the time they need it, in order to make correct decisions) is not just a nice-to-have; it's a life saver.
