Battery Energy Storage Thermal Runaway Containment Case Study
Energy engineering case study on battery energy storage thermal runaway containment, module isolation, off-gas response, SOC service withdrawal, propagation risk, emergency interface, and return-to-service evidence.
A battery energy storage system thermal runaway event is not only a battery chemistry problem. It is an energy asset problem, a protection problem, a control-system problem, a site-safety problem, and an operations decision. The dangerous engineering error is to treat an alarm as a local maintenance issue while the system is still connected to dispatch commitments, grid-support commands, HVAC operation, emergency ventilation, and nearby energized equipment.
This case study follows a realistic lithium-ion BESS event at a solar-plus-storage site. The event is hypothetical, but the reasoning matches the type of evidence an engineering review board would expect: state of charge, affected energy, thermal trend, off-gas indication, ventilation state, electrical isolation, service withdrawal, emergency interface, and return-to-service hold points.
The central question is:
Should the BESS continue supporting the grid because enough energy remains available, or should it be withdrawn from service because the safety case is no longer valid?
The answer is that dispatch feasibility is not the governing criterion. Once thermal runaway propagation is credible, the asset must be treated as unavailable until isolation, emergency response, inspection, and validation evidence restore the safety case.
Case Context
The site includes a containerized BESS connected to a solar plant and a medium-voltage feeder. The BESS normally supports evening peak shaving, renewable smoothing, and microgrid resilience. The operator receives an alarm from one rack during the afternoon while the grid operator is requesting evening discharge support.
| Item | Value or observation |
|---|---|
| BESS nameplate energy | $20\ \text{MWh}$ |
| BESS power rating | $10\ \text{MW}$ |
| Routine SOC operating window | $15\%$ to $90\%$ |
| SOC at first safety alarm | $82\%$ |
| Affected rack nominal energy | $250\ \text{kWh}$ |
| Adjacent rack temperature rise | $9^\circ\text{C}$ in $8\ \text{min}$ |
| Normal rack-to-rack rise limit | $2^\circ\text{C}$ in $10\ \text{min}$ |
| Off-gas sensor indication | 1.8 times alarm threshold |
| Container free volume | $85\ \text{m}^3$ |
| Normal ventilation flow | $4000\ \text{m}^3/\text{h}$ |
| Emergency ventilation flow | $12000\ \text{m}^3/\text{h}$ |
| Requested dispatch | $6\ \text{MW}$ for $1.5\ \text{h}$ |
| Immediate control state | One rack blocked, inverter still available |
The initial operator temptation is understandable: the affected rack is only a small fraction of a 20 MWh asset, and the requested discharge is well below nameplate power. That spreadsheet view is incomplete. The event has already crossed from normal availability management into abnormal safety containment.
Incident Timeline
At 14:12, the battery management system records a high-temperature warning in one rack. At 14:15, the off-gas sensor for the same container crosses the alarm threshold. At 14:20, the adjacent rack temperature has risen by $9^\circ\text{C}$ over the preceding 8 minutes. The rack contactor is commanded open, but the site controller still shows the container as partially available for dispatch.
The control-room display therefore contains two conflicting messages:
- the energy-management system still sees usable stored energy;
- the safety system reports a credible thermal and gas-release event.
The engineering decision must be governed by the second message. A BESS that has lost its safety envelope cannot be offered as grid capacity simply because the inverter can still respond.
Stored Energy in the Affected Rack
The first calculation estimates the energy associated with the affected rack at the event SOC:
$$E_{\text{rack}} = E_{\text{rack,nom}} \times \text{SOC}$$
With a $250\ \text{kWh}$ rack at $82\%$ SOC:
$$E_{\text{rack}} = 250\ \text{kWh} \times 0.82 = 205\ \text{kWh}$$
This number is not a prediction that all stored energy will be released as heat. It is a scale check. A rack containing about $205\ \text{kWh}$ of stored electrochemical energy is not a small instrumentation fault. It is a high-consequence source term that can affect adjacent modules, gas generation, emergency access, fire exposure, and electrical isolation.
The calculation also prevents a common error: dismissing a single-rack event because it is only $1.25\%$ of a $20\ \text{MWh}$ plant. Consequence is not proportional to fleet percentage. A single rack can create the initiating condition for container-level propagation.
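The scale check can be reproduced in a few lines. This is a minimal sketch: the function names and parameters are illustrative assumptions and do not correspond to any BMS or EMS interface.

```python
def rack_stored_energy_kwh(nominal_kwh: float, soc: float) -> float:
    """Stored-energy estimate: nominal rack energy times SOC (as a 0..1 fraction)."""
    if not 0.0 <= soc <= 1.0:
        raise ValueError("soc must be a fraction between 0 and 1")
    return nominal_kwh * soc


def fleet_fraction(rack_nominal_kwh: float, plant_nominal_kwh: float) -> float:
    """Share of plant nameplate energy held by one rack."""
    return rack_nominal_kwh / plant_nominal_kwh


# Case values from the table: 250 kWh rack, 82 % SOC, 20 MWh plant.
energy = rack_stored_energy_kwh(250.0, 0.82)
share = fleet_fraction(250.0, 20_000.0)
print(f"Affected rack stored energy: {energy:.0f} kWh")
print(f"Share of plant nameplate energy: {share:.2%}")
```

The two outputs answer different questions. The fleet share describes the commercial exposure, while the stored energy describes the size of the source term.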
Propagation Temperature Screen
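The adjacent rack temperature rise rate is:
$$\dot{T}_{\text{adj}} = \frac{9^\circ\text{C}}{8\ \text{min}} \approx 1.13^\circ\text{C}/\text{min}$$
The normal rack-to-rack rise limit, expressed as a rate, is:
$$\dot{T}_{\text{limit}} = \frac{2^\circ\text{C}}{10\ \text{min}} = 0.20^\circ\text{C}/\text{min}$$
The observed rise-rate ratio is therefore:
$$\frac{\dot{T}_{\text{adj}}}{\dot{T}_{\text{limit}}} = \frac{1.125}{0.20} \approx 5.6$$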
An adjacent rack heating more than five times faster than the normal rack-to-rack screen is a propagation warning, not merely a comfort-cooling deviation. The exact threshold depends on the system design, sensor location, and validation basis, but the engineering interpretation is clear: adjacent equipment is responding to the event.
The correct action is not to wait for visible flame. Thermal propagation decisions should be based on early indicators: abnormal temperature gradient, off-gas, module voltage behavior, contactor state, insulation monitoring, ventilation response, and control interlocks.
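The temperature screen can be sketched as a rate comparison. This is an illustration under stated assumptions: the helper names are hypothetical, and treating any ratio above 1.0 as a flag is an assumption, since the real threshold depends on the system design and its validation basis.

```python
def rise_rate_c_per_min(delta_c: float, minutes: float) -> float:
    """Average temperature rise rate over an observation window."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return delta_c / minutes


def propagation_screen(observed_rate: float, limit_rate: float) -> tuple:
    """Return (ratio, flagged). Flagging above 1.0 is an assumed screening rule."""
    ratio = observed_rate / limit_rate
    return ratio, ratio > 1.0


# Case values: 9 degC in 8 min observed, against a 2 degC in 10 min screen.
observed = rise_rate_c_per_min(9.0, 8.0)
limit = rise_rate_c_per_min(2.0, 10.0)
ratio, flagged = propagation_screen(observed, limit)
print(f"Observed {observed:.3f} degC/min, limit {limit:.2f} degC/min, ratio {ratio:.2f}")
print("Propagation warning" if flagged else "Within screen")
```

A screen like this covers only one of the early indicators listed above. In practice it would sit alongside off-gas, voltage, contactor, and insulation signals, not replace them.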
Ventilation and Off-Gas Response
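Ventilation effectiveness can be screened with air changes per hour, where $Q$ is the ventilation flow and $V$ is the container free volume:
$$\text{ACH} = \frac{Q}{V}$$
For normal ventilation:
$$\text{ACH}_{\text{normal}} = \frac{4000\ \text{m}^3/\text{h}}{85\ \text{m}^3} \approx 47\ \text{h}^{-1}$$
For emergency ventilation:
$$\text{ACH}_{\text{emergency}} = \frac{12000\ \text{m}^3/\text{h}}{85\ \text{m}^3} \approx 141\ \text{h}^{-1}$$
The emergency mode gives about three times the normal air-change rate:
$$\frac{\text{ACH}_{\text{emergency}}}{\text{ACH}_{\text{normal}}} = \frac{12000}{4000} = 3$$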
This does not prove that the atmosphere is safe. Air-change calculations are screening calculations; they do not replace gas concentration measurements, ignition-source control, enclosure-specific flow patterns, fire-response procedures, or manufacturer emergency guidance. Ventilation can dilute gas in one region while another region remains stagnant. It can also interact with smoke movement, suppression strategy, and responder access.
The engineering conclusion is narrower and stronger: the ventilation system must transition to the validated emergency state, but ventilation alone is not a return-to-service criterion.
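A short sketch of the air-change screen follows. The function name is an illustrative assumption, and the calculation assumes ideal, fully mixed air, which a real enclosure will not achieve.

```python
def air_changes_per_hour(flow_m3_per_h: float, free_volume_m3: float) -> float:
    """Idealized air changes per hour: volumetric flow divided by free volume."""
    if free_volume_m3 <= 0:
        raise ValueError("free_volume_m3 must be positive")
    return flow_m3_per_h / free_volume_m3


# Case values: 85 m^3 free volume, 4000 m^3/h normal, 12000 m^3/h emergency.
normal_ach = air_changes_per_hour(4000.0, 85.0)
emergency_ach = air_changes_per_hour(12000.0, 85.0)
print(f"Normal ventilation:    {normal_ach:.1f} air changes per hour")
print(f"Emergency ventilation: {emergency_ach:.1f} air changes per hour")
print(f"Emergency / normal:    {emergency_ach / normal_ach:.1f}")
```

The result is a bulk figure only. It says nothing about stagnant regions or gas concentration, which is why it cannot serve as a return-to-service criterion.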
Dispatch Feasibility Check
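The requested dispatch energy is:
$$E_{\text{req}} = 6\ \text{MW} \times 1.5\ \text{h} = 9\ \text{MWh}$$
The stored nameplate energy at $82\%$ SOC is:
$$E_{\text{stored}} = 20\ \text{MWh} \times 0.82 = 16.4\ \text{MWh}$$
The routine usable energy above the minimum SOC is:
$$E_{\text{usable}} = 20\ \text{MWh} \times (0.82 - 0.15) = 13.4\ \text{MWh}$$
If the affected rack were simply removed from the available energy estimate, conservatively subtracting its full $205\ \text{kWh}$ of stored energy:
$$E_{\text{usable,reduced}} \approx 13.4\ \text{MWh} - 0.205\ \text{MWh} \approx 13.2\ \text{MWh}$$
On paper, the requested $9\ \text{MWh}$ discharge still fits. That is exactly why the calculation is useful: it shows that energy adequacy and safety availability are different decisions.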
The BESS should still be withdrawn from service. The reason is not lack of energy. The reason is that the abnormal event invalidates the normal operating assumptions behind the dispatch offer: thermal containment, gas management, protection coordination, contactor status, emergency access, and confirmed isolation.
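The separation between energy adequacy and safety availability can be made explicit in a dispatch screen. The sketch below is hypothetical: `DispatchScreen` and `usable_energy_mwh` are illustrative names, and subtracting the rack's full 0.205 MWh of stored energy is a conservative assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchScreen:
    requested_mwh: float
    available_mwh: float
    safety_case_valid: bool

    @property
    def energy_adequate(self) -> bool:
        return self.available_mwh >= self.requested_mwh

    @property
    def offer_allowed(self) -> bool:
        # Both conditions are required; energy adequacy alone is never sufficient.
        return self.energy_adequate and self.safety_case_valid


def usable_energy_mwh(nameplate_mwh: float, soc: float, min_soc: float) -> float:
    """Energy between the current SOC and the routine minimum SOC."""
    return nameplate_mwh * max(soc - min_soc, 0.0)


requested = 6.0 * 1.5                          # 6 MW for 1.5 h
usable = usable_energy_mwh(20.0, 0.82, 0.15)   # routine usable energy
available = usable - 0.205                     # assumed: drop the rack's stored energy
screen = DispatchScreen(requested, available, safety_case_valid=False)
print(f"Energy adequate: {screen.energy_adequate}")
print(f"Offer allowed:   {screen.offer_allowed}")
```

Keeping the two results as separate fields means a log or display can show that the energy check passed while the offer was still blocked, which is the situation in this case.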
Electrical Isolation Decision
The affected rack should be electrically isolated using the designed rack, string, or container isolation architecture. The engineering review should confirm:
- contactor or breaker open indication;
- absence of unintended backfeed through auxiliary circuits;
- inverter command block for the affected string or container;
- DC insulation monitoring status;
- overcurrent and ground-fault protection status;
- energy-management system availability flag forced to unavailable;
- supervisory-control interlock preventing remote dispatch override.
Isolation must be treated as a verified state, not only as a command. A contactor-open command without auxiliary contact feedback, DC bus confirmation, insulation evidence, and control-system lockout is incomplete.
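One way to encode "verified state, not only a command" is to require every evidence item before isolation is reported as complete. The sketch below is illustrative: the class and field names are assumptions that mirror the review items above, not signals from any particular BMS or SCADA product.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class IsolationEvidence:
    """Hypothetical evidence flags, one per review item."""
    open_command_issued: bool
    open_indication_confirmed: bool   # auxiliary contact or breaker feedback
    no_auxiliary_backfeed: bool
    inverter_command_blocked: bool
    dc_insulation_ok: bool
    protection_status_ok: bool        # overcurrent and ground-fault protection
    ems_flag_unavailable: bool
    remote_override_locked: bool


def missing_evidence(evidence: IsolationEvidence) -> list:
    """Names of evidence items that are not yet confirmed."""
    return [f.name for f in fields(evidence) if not getattr(evidence, f.name)]


def isolation_verified(evidence: IsolationEvidence) -> bool:
    """Isolation counts as verified only when every item is confirmed."""
    return not missing_evidence(evidence)


# A command with no feedback is an incomplete isolation.
command_only = IsolationEvidence(True, False, False, False, False, False, False, False)
print("Verified:", isolation_verified(command_only))
print("Missing:", ", ".join(missing_evidence(command_only)))
```

Returning the list of missing items, not only a boolean, gives the reviewer the specific checks that still need evidence.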
Service Withdrawal Decision
The site operator should declare the BESS unavailable for dispatch until a defined return-to-service package is approved. A partial-power mode may be technically possible after isolating one rack, but it is not justified while thermal propagation and off-gas evidence remain active.
The correct decision sequence is:
- block charge and discharge commands for the affected container or asset section;
- electrically isolate the affected rack or string using verified devices;
- initiate the site emergency response interface;
- place ventilation and suppression systems in the validated emergency state;
- preserve BMS, inverter, fire panel, gas sensor, thermal sensor, and SCADA logs;
- notify the grid operator that the service is withdrawn for a safety-critical event;
- define return-to-service evidence before any recommissioning attempt.
The critical distinction is between capacity derating and safety withdrawal. A failed fan or degraded cell may create a derated operating mode if the safety case remains valid. Active off-gas, abnormal adjacent heating, and suspected propagation require withdrawal.
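The distinction between derating and withdrawal can be sketched as a classification in which safety indicators always take precedence. This is a simplified illustration: the `Availability` states and the input flags are assumptions, and a real site would derive them from its own validated alarm logic.

```python
from enum import Enum


class Availability(Enum):
    AVAILABLE = "available"
    DERATED = "derated"
    WITHDRAWN = "withdrawn"


def classify_availability(
    off_gas_active: bool,
    abnormal_adjacent_heating: bool,
    propagation_suspected: bool,
    equipment_degraded: bool,
) -> Availability:
    """Safety indicators are checked first and cannot be outvoted by capacity."""
    if off_gas_active or abnormal_adjacent_heating or propagation_suspected:
        return Availability.WITHDRAWN
    if equipment_degraded:
        # For example a failed fan or degraded cell, with the safety case still valid.
        return Availability.DERATED
    return Availability.AVAILABLE


# The case event: off-gas and adjacent heating are both active.
print(classify_availability(True, True, True, False).value)
# A failed fan with no safety indicators present.
print(classify_availability(False, False, False, True).value)
```

The order of the checks is the design choice. Evaluating degradation before the safety indicators would allow a derated offer during an active event.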
FMEA and RPN Screen
A simple failure-mode review can document the change in risk after controls are applied. Use $S$ for severity, $O$ for occurrence, and $D$ for detection, with higher numbers worse:
$$\text{RPN} = S \times O \times D$$
Before containment actions, the review assigns:
| Factor | Value | Rationale |
|---|---|---|
| Severity S | 10 | Potential fire propagation, gas release, equipment damage, and responder hazard. |
| Occurrence O | 3 | The initiating event is present but not yet container-wide. |
| Detection D | 6 | Detection exists, but propagation state and isolation completeness are uncertain. |
The initial risk priority number is:
$$\text{RPN}_{\text{initial}} = 10 \times 3 \times 6 = 180$$
After verified rack isolation, command blocking, emergency ventilation, responder interface, preserved evidence, and no further adjacent temperature rise, the review may assign:
| Factor | Value | Rationale |
|---|---|---|
| Severity S | 10 | The credible consequence remains severe. |
| Occurrence O | 1 | Propagation likelihood is reduced by containment and stable trends. |
| Detection D | 2 | Verified state feedback and sensor trends improve detection confidence. |
The contained-state RPN is:
$$\text{RPN}_{\text{contained}} = 10 \times 1 \times 2 = 20$$
This does not mean the asset is safe to operate. It means the immediate containment state is much lower risk than the initial abnormal state. Return to service still requires root-cause evidence and validation, not only a lower RPN.
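The two RPN values can be checked with a small helper. This is a minimal sketch: the `rpn` function is illustrative, and the 1 to 10 range for each factor is an assumption based on common FMEA practice, since the case study does not state its scale.

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk priority number as the product of severity, occurrence, and detection."""
    for name, value in (("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)):
        if not 1 <= value <= 10:   # assumed 1..10 scale
            raise ValueError(f"{name} must be between 1 and 10")
    return severity * occurrence * detection


initial = rpn(severity=10, occurrence=3, detection=6)
contained = rpn(severity=10, occurrence=1, detection=2)
print(f"Initial RPN:   {initial}")
print(f"Contained RPN: {contained}")
print(f"Severity unchanged at 10 in both states: {initial // 18 == contained // 2}")
```

Severity stays at 10 in both states, so the reduction comes entirely from occurrence and detection. That is consistent with treating the lower RPN as a containment result, not as operating approval.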
Return-to-Service Evidence
The return-to-service package should be explicit enough that operations, engineering, safety, and the grid interface make the same decision. Useful evidence includes:
| Evidence item | Why it matters |
|---|---|
| BMS event logs | Confirms initiating cell, module, rack, timing, voltage behavior, and protection actions. |
| Temperature and off-gas trends | Shows whether the event stabilized, propagated, or continued after isolation. |
| Verified isolation records | Proves contactor, breaker, inverter block, and DC bus states. |
| Insulation resistance and ground-fault checks | Detects damaged insulation, moisture, conductive residue, or compromised DC paths. |
| Thermal inspection | Identifies hot spots, adjacent damage, and cooling-system impairment. |
| Fire and gas system reset records | Confirms alarms, ventilation, suppression, and emergency panel states are restored. |
| Vendor or qualified engineer inspection | Establishes whether affected equipment can remain installed, must be replaced, or requires quarantine. |
| Root-cause statement | Distinguishes cell defect, overcharge, thermal-management fault, mechanical damage, wiring fault, or sensor fault. |
| Control interlock test | Proves a similar safety alarm blocks dispatch and remote override. |
| Revised availability declaration | States which racks, containers, and services are approved for operation. |
The evidence should also state what is not yet known. For example, a stable off-gas reading after ventilation does not prove the initiating module is undamaged. An open rack contactor does not prove there is no stored energy. A successful low-power inverter test does not prove the thermal safety envelope has been restored.
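The evidence package can be treated as a gate that stays closed until every item is approved. The sketch below is illustrative only: the item identifiers are hypothetical labels for the rows of the table above, and a real approval workflow would also carry reviewers, dates, and attached records.

```python
# Hypothetical identifiers, one per evidence item in the table.
REQUIRED_EVIDENCE = (
    "bms_event_logs",
    "temperature_and_off_gas_trends",
    "verified_isolation_records",
    "insulation_and_ground_fault_checks",
    "thermal_inspection",
    "fire_and_gas_system_reset_records",
    "qualified_inspection",
    "root_cause_statement",
    "control_interlock_test",
    "revised_availability_declaration",
)


def outstanding_evidence(approved: set) -> list:
    """Required items that have not been approved, in table order."""
    return [item for item in REQUIRED_EVIDENCE if item not in approved]


def return_to_service_allowed(approved: set) -> bool:
    """The gate opens only when nothing is outstanding; elapsed time is not an input."""
    return not outstanding_evidence(approved)


approved_so_far = {
    "bms_event_logs",
    "temperature_and_off_gas_trends",
    "verified_isolation_records",
}
print("Return to service allowed:", return_to_service_allowed(approved_so_far))
for item in outstanding_evidence(approved_so_far):
    print("Outstanding:", item)
```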
Engineering Lessons
The first lesson is that BESS availability is conditional. A storage asset is available only when the energy, power, thermal, electrical, control, safety, and operational assumptions are all valid. Energy remaining in the battery is not the same as service availability.
The second lesson is that early thermal runaway indicators should be handled as propagation evidence until disproved. Off-gas, adjacent rack heating, abnormal voltage behavior, insulation alarms, and protection actions deserve conservative interpretation.
The third lesson is that emergency response must be integrated into energy operations. The grid operator, site operator, fire panel, BMS, inverter controller, and maintenance team must not maintain separate truths about whether the asset is available.
The final lesson is that return to service is an evidence package, not a timeout. A BESS can be returned only when isolation, root cause, damage assessment, safety-system reset, control interlocks, and operating limits have been verified.