
Battery Energy Storage Thermal Runaway Containment Case Study

Energy engineering case study on battery energy storage thermal runaway containment, module isolation, off-gas response, SOC service withdrawal, propagation risk, emergency interface, and return-to-service evidence.

A thermal runaway event in a battery energy storage system (BESS) is not only a battery chemistry problem. It is an energy asset problem, a protection problem, a control-system problem, a site-safety problem, and an operations decision. The dangerous engineering error is to treat an alarm as a local maintenance issue while the system is still connected to dispatch commitments, grid-support commands, HVAC operation, emergency ventilation, and nearby energized equipment.

This case study follows a realistic lithium-ion BESS event at a solar-plus-storage site. The event is hypothetical, but the reasoning matches the type of evidence an engineering review board would expect: state of charge, affected energy, thermal trend, off-gas indication, ventilation state, electrical isolation, service withdrawal, emergency interface, and return-to-service hold points.

The central question is:

Should the BESS continue supporting the grid because enough energy remains available, or should it be withdrawn from service because the safety case is no longer valid?

The answer is that dispatch feasibility is not the governing criterion. Once thermal runaway propagation is credible, the asset must be treated as unavailable until isolation, emergency response, inspection, and validation evidence restore the safety case.

Case Context

The site includes a containerized BESS connected to a solar plant and a medium-voltage feeder. The BESS normally supports evening peak shaving, renewable smoothing, and microgrid resilience. The operator receives an alarm from one rack during the afternoon while the grid operator is requesting evening discharge support.

| Item | Value or observation |
| --- | --- |
| BESS nameplate energy | $20\ \text{MWh}$ |
| BESS power rating | $10\ \text{MW}$ |
| Routine SOC operating window | $15\%$ to $90\%$ |
| SOC at first safety alarm | $82\%$ |
| Affected rack nominal energy | $250\ \text{kWh}$ |
| Adjacent rack temperature rise | $9^\circ\text{C}$ in $8\ \text{min}$ |
| Normal rack-to-rack rise limit | $2^\circ\text{C}$ in $10\ \text{min}$ |
| Off-gas sensor indication | 1.8 times alarm threshold |
| Container free volume | $85\ \text{m}^3$ |
| Normal ventilation flow | $4000\ \text{m}^3/\text{h}$ |
| Emergency ventilation flow | $12000\ \text{m}^3/\text{h}$ |
| Requested dispatch | $6\ \text{MW}$ for $1.5\ \text{h}$ |
| Immediate control state | One rack blocked, inverter still available |

The initial operator temptation is understandable: the affected rack is only a small fraction of a 20 MWh asset, and the requested discharge is well below nameplate power. That spreadsheet view is incomplete. The event has already crossed from normal availability management into abnormal safety containment.

Incident Timeline

At 14:12, the battery management system (BMS) records a high-temperature warning in one rack. At 14:15, the off-gas sensor for the same container crosses the alarm threshold. At 14:20, the adjacent rack temperature has risen by $9^\circ\text{C}$ over the preceding 8 minutes. The rack contactor is commanded open, but the site controller still shows the container as partially available for dispatch.

The control-room display therefore contains two conflicting messages:

  1. the energy-management system still sees usable stored energy;
  2. the safety system reports a credible thermal and gas-release event.

The engineering decision must be governed by the second message. A BESS that has lost its safety envelope cannot be offered as grid capacity simply because the inverter can still respond.

Stored Energy in the Affected Rack

The first calculation estimates the energy associated with the affected rack at the event SOC:

$$E_{rack,SOC}=E_{rack,nominal}\times SOC$$

With a $250\ \text{kWh}$ rack at $82\%$ SOC:

$$E_{rack,SOC}=250(0.82)=205\ \text{kWh}$$

This number is not a prediction that all stored energy will be released as heat. It is a scale check. A rack containing about $205\ \text{kWh}$ of stored electrochemical energy is not a small instrumentation fault. It is a high-consequence source term that can affect adjacent modules, gas generation, emergency access, fire exposure, and electrical isolation.

The calculation also prevents a common error: dismissing a single-rack event because it is only $1.25\%$ of a $20\ \text{MWh}$ plant. Consequence is not proportional to fleet percentage. A single rack can create the initiating condition for container-level propagation.
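The scale check above can be reproduced in a few lines. This is a minimal sketch using the case-study values; the function and variable names are illustrative assumptions, not part of any site tooling.

```python
def rack_energy_kwh(nominal_kwh: float, soc: float) -> float:
    """Stored energy of one rack at a given state of charge (fraction 0..1)."""
    if not 0.0 <= soc <= 1.0:
        raise ValueError("soc must be a fraction between 0 and 1")
    return nominal_kwh * soc


# Case-study values: 250 kWh rack, 82% SOC, 20 MWh plant nameplate.
rack_kwh = rack_energy_kwh(250.0, 0.82)
plant_fraction = 250.0 / 20_000.0  # rack nominal energy over plant nameplate

print(f"{rack_kwh:.0f} kWh at event SOC, {plant_fraction:.2%} of nameplate")
```

The two numbers answer different questions: the fraction describes fleet share, while the stored energy describes the size of the source term.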

Propagation Temperature Screen

The adjacent rack temperature rise is:

$$\frac{\Delta T}{\Delta t}=\frac{9^\circ\text{C}}{8\ \text{min}}=1.125^\circ\text{C/min}$$

The normal rack-to-rack rise limit is:

$$\left(\frac{\Delta T}{\Delta t}\right)_{limit}=\frac{2^\circ\text{C}}{10\ \text{min}}=0.2^\circ\text{C/min}$$

The observed rise-rate ratio is therefore:

$$R_T=\frac{1.125}{0.2}=5.625$$

An adjacent rack heating more than five times faster than the normal rack-to-rack screen is a propagation warning, not merely a comfort-cooling deviation. The exact threshold depends on the system design, sensor location, and validation basis, but the engineering interpretation is clear: adjacent equipment is responding to the event.

The correct action is not to wait for visible flame. Thermal propagation decisions should be based on early indicators: abnormal temperature gradient, off-gas, module voltage behavior, contactor state, insulation monitoring, ventilation response, and control interlocks.
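The rise-rate screen can be sketched as a small function. The names are hypothetical, and treating any ratio above 1.0 as a flag is an assumption for illustration; the real threshold depends on the system's validation basis.

```python
def rise_rate_c_per_min(delta_c: float, minutes: float) -> float:
    """Average temperature rise rate in degrees C per minute."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return delta_c / minutes


def rise_rate_ratio(obs_c: float, obs_min: float,
                    limit_c: float, limit_min: float) -> float:
    """Observed rise rate divided by the normal rack-to-rack limit rate."""
    observed = rise_rate_c_per_min(obs_c, obs_min)
    limit = rise_rate_c_per_min(limit_c, limit_min)
    return observed / limit


# Case-study values: 9 C in 8 min against a limit of 2 C in 10 min.
r_t = rise_rate_ratio(9.0, 8.0, 2.0, 10.0)
propagation_warning = r_t > 1.0  # assumed screening rule, not a design threshold

print(f"R_T = {r_t:.3f}, propagation warning: {propagation_warning}")
```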

Ventilation and Off-Gas Response

Ventilation effectiveness can be screened with air changes per hour (ACH), where $Q$ is the ventilation flow and $V$ is the container free volume:

$$ACH=\frac{Q}{V}$$

For normal ventilation:

$$ACH_{normal}=\frac{4000}{85}=47.1\ \text{h}^{-1}$$

For emergency ventilation:

$$ACH_{emergency}=\frac{12000}{85}=141.2\ \text{h}^{-1}$$

The emergency mode therefore gives about

$$\frac{141.2}{47.1}=3.0$$

times the normal air-change rate.

This does not prove that the atmosphere is safe. Air-change calculations are screening calculations; they do not replace gas concentration measurements, ignition-source control, enclosure-specific flow patterns, fire-response procedures, or manufacturer emergency guidance. Ventilation can dilute gas in one region while another region remains stagnant. It can also interact with smoke movement, suppression strategy, and responder access.

The engineering conclusion is narrower and stronger: the ventilation system must transition to the validated emergency state, but ventilation alone is not a return-to-service criterion.
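The air-change screen is simple enough to script. The sketch below uses the case-study flows and free volume; the function name is an assumption, and the result remains a screening figure, not evidence of a safe atmosphere.

```python
def air_changes_per_hour(flow_m3_per_h: float, free_volume_m3: float) -> float:
    """Screening air-change rate: volumetric flow divided by free volume."""
    if free_volume_m3 <= 0:
        raise ValueError("free volume must be positive")
    return flow_m3_per_h / free_volume_m3


FREE_VOLUME_M3 = 85.0  # container free volume from the case context

ach_normal = air_changes_per_hour(4000.0, FREE_VOLUME_M3)
ach_emergency = air_changes_per_hour(12000.0, FREE_VOLUME_M3)
ach_ratio = ach_emergency / ach_normal

print(f"normal {ach_normal:.1f}/h, emergency {ach_emergency:.1f}/h, "
      f"ratio {ach_ratio:.1f}")
```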

Dispatch Feasibility Check

The requested dispatch energy is:

$$E_{request}=P\,t=6(1.5)=9\ \text{MWh}$$

The stored energy at $82\%$ SOC, based on nameplate capacity, is:

$$E_{stored}=20(0.82)=16.4\ \text{MWh}$$

The routine usable energy above the minimum SOC is:

$$E_{above\ min}=20(0.82-0.15)=13.4\ \text{MWh}$$

If the affected rack's $0.205\ \text{MWh}$ of stored energy were simply removed from the available energy estimate:

$$E_{available,screen}=13.4-0.205=13.195\ \text{MWh}$$

Subtracting the rack's full stored energy, rather than only its share above the minimum SOC, keeps the screen slightly conservative. On paper, the requested $9\ \text{MWh}$ discharge still fits. That is exactly why the calculation is useful: it shows that energy adequacy and safety availability are different decisions.

The BESS should still be withdrawn from service. The reason is not lack of energy. The reason is that the abnormal event invalidates the normal operating assumptions behind the dispatch offer: thermal containment, gas management, protection coordination, contactor status, emergency access, and confirmed isolation.
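One way to keep the two decisions separate in software is to return them as separate fields and require both before offering capacity. This is a minimal sketch under that assumption; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchScreen:
    available_mwh: float
    requested_mwh: float
    safety_case_valid: bool

    @property
    def energy_adequate(self) -> bool:
        return self.available_mwh >= self.requested_mwh

    @property
    def may_offer(self) -> bool:
        # Energy adequacy alone never justifies a dispatch offer.
        return self.energy_adequate and self.safety_case_valid


def screen_dispatch(nameplate_mwh: float, soc: float, min_soc: float,
                    isolated_mwh: float, request_mw: float, hours: float,
                    safety_case_valid: bool) -> DispatchScreen:
    """Energy screen above minimum SOC, minus isolated rack energy."""
    available = nameplate_mwh * (soc - min_soc) - isolated_mwh
    return DispatchScreen(available, request_mw * hours, safety_case_valid)


# Case-study values; the safety case is invalid during the event.
result = screen_dispatch(20.0, 0.82, 0.15, 0.205, 6.0, 1.5,
                         safety_case_valid=False)
print(result.energy_adequate, result.may_offer)
```

With the case-study inputs the energy screen passes and the offer is still refused, which is the point of the section.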

Electrical Isolation Decision

The affected rack should be electrically isolated using the designed rack, string, or container isolation architecture. The engineering review should confirm:

  • contactor or breaker open indication;
  • absence of unintended backfeed through auxiliary circuits;
  • inverter command block for the affected string or container;
  • DC insulation monitoring status;
  • overcurrent and ground-fault protection status;
  • energy-management system availability flag forced to unavailable;
  • supervisory-control interlock preventing remote dispatch override.

Isolation must be treated as a verified state, not only as a command. A contactor-open command without auxiliary contact feedback, DC bus confirmation, insulation evidence, and control-system lockout is incomplete.
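The difference between a command and a verified state can be made explicit in a record that lists each confirmation separately. The sketch below mirrors the checklist above; the field names and the example values are illustrative assumptions, not signals from a real controller.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class IsolationEvidence:
    # Each flag is True only when confirmed by feedback, not by command.
    open_indication_confirmed: bool
    no_auxiliary_backfeed: bool
    inverter_command_blocked: bool
    insulation_monitoring_healthy: bool
    protection_status_healthy: bool
    ems_flag_unavailable: bool
    remote_override_locked_out: bool


def missing_evidence(evidence: IsolationEvidence) -> list[str]:
    """Names of confirmations that are still outstanding."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


def isolation_verified(evidence: IsolationEvidence) -> bool:
    return not missing_evidence(evidence)


# Illustrative state: contactor opened, but the EMS still shows availability.
example = IsolationEvidence(
    open_indication_confirmed=True,
    no_auxiliary_backfeed=True,
    inverter_command_blocked=True,
    insulation_monitoring_healthy=True,
    protection_status_healthy=True,
    ems_flag_unavailable=False,
    remote_override_locked_out=False,
)
print(isolation_verified(example), missing_evidence(example))
```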

Service Withdrawal Decision

The site operator should declare the BESS unavailable for dispatch until a defined return-to-service package is approved. A partial-power mode may be technically possible after isolating one rack, but it is not justified while thermal propagation and off-gas evidence remain active.

The correct decision sequence is:

  1. block charge and discharge commands for the affected container or asset section;
  2. electrically isolate the affected rack or string using verified devices;
  3. initiate the site emergency response interface;
  4. place ventilation and suppression systems in the validated emergency state;
  5. preserve BMS, inverter, fire panel, gas sensor, thermal sensor, and SCADA logs;
  6. notify the grid operator that the service is withdrawn for a safety-critical event;
  7. define return-to-service evidence before any recommissioning attempt.

The critical distinction is between capacity derating and safety withdrawal. A failed fan or degraded cell may create a derated operating mode if the safety case remains valid. Active off-gas, abnormal adjacent heating, and suspected propagation require withdrawal.
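The distinction between derating and withdrawal can be sketched as a small classifier. Treating any single safety indicator as sufficient for withdrawal is a conservative assumption made here for illustration; the names are hypothetical.

```python
from enum import Enum


class ServiceState(Enum):
    AVAILABLE = "available"
    DERATED = "derated"
    WITHDRAWN = "withdrawn"


def classify_service_state(off_gas_active: bool,
                           adjacent_heating_abnormal: bool,
                           propagation_suspected: bool,
                           capacity_impaired: bool) -> ServiceState:
    """Safety indicators are checked first and override capacity logic."""
    if off_gas_active or adjacent_heating_abnormal or propagation_suspected:
        return ServiceState.WITHDRAWN
    if capacity_impaired:
        return ServiceState.DERATED
    return ServiceState.AVAILABLE


# Case-study event: all three safety indicators are present.
event_state = classify_service_state(True, True, True, capacity_impaired=True)
# Contrast: a failed fan with the safety case intact.
fan_state = classify_service_state(False, False, False, capacity_impaired=True)
print(event_state.value, fan_state.value)
```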

FMEA and RPN Screen

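A simple failure-mode review can document the change in risk after controls are applied. Use $S$ for severity, $O$ for occurrence, and $D$ for detection. A higher score is worse in each case; for detection, a higher score means the condition is harder to detect. The risk priority number (RPN) is the product:

$$RPN=S \times O \times D$$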

Before containment actions, the review assigns:

| Factor | Value | Rationale |
| --- | --- | --- |
| Severity $S$ | 10 | Potential fire propagation, gas release, equipment damage, and responder hazard. |
| Occurrence $O$ | 3 | The initiating event is present but not yet container-wide. |
| Detection $D$ | 6 | Detection exists, but propagation state and isolation completeness are uncertain. |

The initial risk priority number is:

$$RPN_{initial}=10(3)(6)=180$$

After verified rack isolation, command blocking, emergency ventilation, responder interface, preserved evidence, and no further adjacent temperature rise, the review may assign:

| Factor | Value | Rationale |
| --- | --- | --- |
| Severity $S$ | 10 | The credible consequence remains severe. |
| Occurrence $O$ | 1 | Propagation likelihood is reduced by containment and stable trends. |
| Detection $D$ | 2 | Verified state feedback and sensor trends improve detection confidence. |

The contained-state RPN is:

$$RPN_{contained}=10(1)(2)=20$$

This does not mean the asset is safe to operate. It means the immediate containment state is much lower risk than the initial abnormal state. Return to service still requires root-cause evidence and validation, not only a lower RPN.
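The two RPN values can be checked with a short script. The 1 to 10 bound on each score is an assumption; the case study does not state its scoring scale, and the function name is illustrative.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number: product of the three FMEA scores."""
    for name, score in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= score <= 10:  # assumed scale, not stated in the review
            raise ValueError(f"{name} score must be between 1 and 10")
    return severity * occurrence * detection


rpn_initial = rpn(10, 3, 6)
rpn_contained = rpn(10, 1, 2)

print(f"initial {rpn_initial}, contained {rpn_contained}")
```

Severity stays at 10 in both states, so the reduction comes entirely from occurrence and detection. That is why a lower RPN does not, by itself, support a return to service.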

Return-to-Service Evidence

The return-to-service package should be explicit enough that operations, engineering, safety, and the grid interface make the same decision. Useful evidence includes:

| Evidence item | Why it matters |
| --- | --- |
| BMS event logs | Confirms initiating cell, module, rack, timing, voltage behavior, and protection actions. |
| Temperature and off-gas trends | Shows whether the event stabilized, propagated, or continued after isolation. |
| Verified isolation records | Proves contactor, breaker, inverter block, and DC bus states. |
| Insulation resistance and ground-fault checks | Detects damaged insulation, moisture, conductive residue, or compromised DC paths. |
| Thermal inspection | Identifies hot spots, adjacent damage, and cooling-system impairment. |
| Fire and gas system reset records | Confirms alarms, ventilation, suppression, and emergency panel states are restored. |
| Vendor or qualified engineer inspection | Establishes whether affected equipment can remain installed, must be replaced, or requires quarantine. |
| Root-cause statement | Distinguishes cell defect, overcharge, thermal-management fault, mechanical damage, wiring fault, or sensor fault. |
| Control interlock test | Proves a similar safety alarm blocks dispatch and remote override. |
| Revised availability declaration | States which racks, containers, and services are approved for operation. |

The evidence should also state what is not yet known. For example, a stable off-gas reading after ventilation does not prove the initiating module is undamaged. An open rack contactor does not prove there is no stored energy. A successful low-power inverter test does not prove the thermal safety envelope has been restored.
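A return-to-service package can be tracked as a set of approved evidence items plus an explicit list of open unknowns. In the sketch below, letting any open unknown block approval is an assumption made for illustration; the identifiers are hypothetical labels for the evidence items listed above.

```python
from dataclasses import dataclass, field

REQUIRED_EVIDENCE = (
    "bms_event_logs",
    "temperature_and_off_gas_trends",
    "verified_isolation_records",
    "insulation_and_ground_fault_checks",
    "thermal_inspection",
    "fire_and_gas_system_reset_records",
    "vendor_or_engineer_inspection",
    "root_cause_statement",
    "control_interlock_test",
    "revised_availability_declaration",
)


@dataclass
class ReturnToServicePackage:
    approved_items: set[str] = field(default_factory=set)
    open_unknowns: list[str] = field(default_factory=list)

    def outstanding(self) -> list[str]:
        """Required evidence items that have not been approved yet."""
        return [i for i in REQUIRED_EVIDENCE if i not in self.approved_items]

    def approvable(self) -> bool:
        # Assumed rule: every item approved and no stated unknown left open.
        return not self.outstanding() and not self.open_unknowns


package = ReturnToServicePackage(
    approved_items={"bms_event_logs", "verified_isolation_records"},
    open_unknowns=["initiating module damage not yet assessed"],
)
print(package.approvable(), len(package.outstanding()))
```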

Engineering Lessons

The first lesson is that BESS availability is conditional. A storage asset is available only when the energy, power, thermal, electrical, control, safety, and operational assumptions are all valid. Energy remaining in the battery is not the same as service availability.

The second lesson is that early thermal runaway indicators should be handled as propagation evidence until disproved. Off-gas, adjacent rack heating, abnormal voltage behavior, insulation alarms, and protection actions deserve conservative interpretation.

The third lesson is that emergency response must be integrated into energy operations. The grid operator, site operator, fire panel, BMS, inverter controller, and maintenance team must not maintain separate truths about whether the asset is available.

The final lesson is that return to service is an evidence package, not a timeout. A BESS can be returned only when isolation, root cause, damage assessment, safety-system reset, control interlocks, and operating limits have been verified.
