SI-13: PREDICTABLE FAILURE PREVENTION
TAILORED FOR INDUSTRIAL CONTROL SYSTEMS
ICS Control Baseline:
- High (ADDED)
The organization:
- a. Determines mean time to failure (MTTF) for [Assignment: organization-defined information system components] in specific environments of operation; and
- b. Provides substitute information system components and a means to exchange active and standby components at [Assignment: organization-defined MTTF substitution criteria].
SUPPLEMENTAL GUIDANCE
While MTTF is primarily a reliability issue, this control addresses potential failures of specific information system components that provide security capability. Failure rates reflect installation-specific considerations, not industry averages. Organizations define criteria for substitution of information system components based on MTTF values, with consideration for the potential harm resulting from component failures. Transfer of responsibilities between active and standby components does not compromise safety, operational readiness, or security capability (e.g., preservation of state variables). Standby components remain available at all times except for maintenance issues or recovery failures in progress.
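As an illustration of the guidance above, the following minimal sketch estimates MTTF from installation-specific failure observations and applies a harm-weighted substitution criterion. The function names, the sample data, and the harm-to-fraction mapping are assumptions made for this example; the control leaves the actual criteria to the organization.

```python
from statistics import mean

# Hypothetical mapping: the higher the potential harm from a failure,
# the earlier (as a fraction of MTTF) the component is substituted.
SUBSTITUTION_FRACTION = {"low": 0.9, "moderate": 0.75, "high": 0.5}


def site_mttf(hours_to_failure):
    """Estimate MTTF as the mean of times to failure observed at this site."""
    if not hours_to_failure:
        raise ValueError("at least one observed time to failure is required")
    return mean(hours_to_failure)


def substitution_due(hours_in_service, mttf_hours, harm):
    """Return True once service time reaches the harm-weighted share of MTTF."""
    return hours_in_service >= SUBSTITUTION_FRACTION[harm] * mttf_hours


# Made-up field data for one component type in one environment of operation.
observed_hours = [8200.0, 9100.0, 7900.0, 8800.0]
mttf = site_mttf(observed_hours)
```

With these sample values, a component that has run for 5000 hours would already be due for substitution under the "high" harm setting but not under the "low" one.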
ICS SUPPLEMENTAL GUIDANCE
Failures in ICS can be stochastic or deterministic. Stochastic failures can be analyzed using probability theory, while analysis of deterministic failures is based on non-random properties of the system. Known ICS failure modes and causes are considered. The calculation and use of statistical descriptors, such as Mean Time To Failure (MTTF), should incorporate additional analysis to determine how those failures manifest within the cyber and physical domains. Knowledge of these possible manifestations may be necessary to detect whether a failure has occurred within the ICS, as failures of the information systems may not be easily identifiable. Emergent properties, which may arise within both the information systems and the physical processes and can potentially cause system failures, should be incorporated into the analysis. For example, cumulative effects of resource exhaustion (e.g., memory leakage) or errors (e.g., rounding and truncation) can occur when ICS processes execute for unexpectedly long periods. Deterministic failures (e.g., integer counter overflow), once identified, are preventable.
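The cumulative rounding effect mentioned above can be demonstrated with a short sketch. It compares a running floating-point sum of a 0.1-second step against a value derived from an integer tick count. The step size, tick count, and function names are assumptions chosen for illustration only.

```python
def accumulate_naive(step, ticks):
    """Add `step` once per tick; rounding error builds up with every addition."""
    total = 0.0
    for _ in range(ticks):
        total += step
    return total


def elapsed_from_ticks(step, ticks):
    """Derive elapsed time from an integer tick count with a single multiplication."""
    return ticks * step


ticks = 1_000_000                      # hypothetical: one million 0.1 s cycles
naive = accumulate_naive(0.1, ticks)
derived = elapsed_from_ticks(0.1, ticks)
drift = naive - derived                # grows as the process keeps running
```

The drift is negligible over a short run but increases with uptime, which is why long-running processes are more exposed to this class of error than frequently restarted ones.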
Substitute components are often unavailable, or are insufficient to protect against faults that occur before the predicted failure. Non-automated mechanisms or physical safeguards should be in place to protect against these failures. Information about newly discovered vulnerabilities (i.e., latent flaws) potentially affecting the system or its applications may come from forensic studies. In addition, organizations responsible for disseminating vulnerability information (e.g., ICS-CERT) may identify new vulnerabilities by analyzing similar patterns in incidents reported to them or in vulnerabilities reported by other researchers.
Related controls:
Rationale for adding control to baseline: ICS are designed and built with certain boundary conditions, design parameters, and assumptions about their environment and mode of operation. ICS may run much longer than conventional systems, allowing latent flaws that do not manifest in other environments to take effect. For example, an integer overflow might never occur in a system that is re-initialized more often than the overflow interval. Experience and forensic studies of anomalies and incidents in ICS can lead to identification of emergent properties that were previously unknown, unexpected, or unanticipated. Preventative and restorative actions (e.g., re-starting the system or application) are prudent but may not be acceptable for operational reasons in ICS.
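The integer overflow example in the rationale can be worked through numerically. This sketch, under assumed parameters (a free-running counter of a given width incremented at a fixed rate), computes how long the counter runs before it wraps and whether a given re-initialization interval is short enough to avoid the wrap. The function names and the 32-bit, 1 kHz figures are illustrative assumptions, not values from the control.

```python
SECONDS_PER_DAY = 86_400


def seconds_until_overflow(counter_bits, ticks_per_second, signed=False):
    """Time for a counter starting at zero to exceed its representable range."""
    usable_bits = counter_bits - 1 if signed else counter_bits
    return (2 ** usable_bits) / ticks_per_second


def overflow_reachable(counter_bits, ticks_per_second, restart_interval_s,
                       signed=False):
    """True if the system stays up long enough for the counter to wrap."""
    horizon = seconds_until_overflow(counter_bits, ticks_per_second, signed)
    return restart_interval_s >= horizon


# Hypothetical case: an unsigned 32-bit counter incremented every millisecond.
horizon_days = seconds_until_overflow(32, 1000) / SECONDS_PER_DAY
```

Under these assumptions the counter wraps after roughly 49.7 days, so a system restarted weekly never reaches the overflow while one that runs for a year does.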
RELATED CONTROLS: SI-13
CONTROL ENHANCEMENTS
SI-13 (1) PREDICTABLE FAILURE PREVENTION | TRANSFERRING COMPONENT RESPONSIBILITIES
NOT SELECTED FOR THE NIST ICS CONTROL SET
The organization takes information system components out of service by transferring component responsibilities to substitute components no later than [Assignment: organization-defined fraction or percentage] of mean time to failure.
Supplemental Guidance: NONE
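The SI-13 (1) rule of transferring responsibilities no later than a fraction of MTTF reduces to a simple date calculation. The sketch below is one hedged way to express it; the function name, the in-service date, the 8000-hour MTTF, and the 0.75 fraction are all assumed values for illustration.

```python
from datetime import datetime, timedelta


def transfer_deadline(in_service_since, mttf_hours, fraction):
    """Latest time by which responsibilities move to the substitute component."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    return in_service_since + timedelta(hours=fraction * mttf_hours)


# Hypothetical component placed in service on 1 January 2024.
deadline = transfer_deadline(datetime(2024, 1, 1), mttf_hours=8000, fraction=0.75)
```

With these values the component would be taken out of service after 6000 operating hours, that is, by 7 September 2024, assuming continuous operation.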
SI-13 (2) PREDICTABLE FAILURE PREVENTION | TIME LIMIT ON PROCESS EXECUTION WITHOUT SUPERVISION
[Withdrawn: Incorporated into SI-7 (16)].
SI-13 (3) PREDICTABLE FAILURE PREVENTION | MANUAL TRANSFER BETWEEN COMPONENTS
NOT SELECTED FOR THE NIST ICS CONTROL SET
The organization manually initiates transfers between active and standby information system components [Assignment: organization-defined frequency] if the mean time to failure exceeds [Assignment: organization-defined time period].
Supplemental Guidance: NONE
SI-13 (4) PREDICTABLE FAILURE PREVENTION | STANDBY COMPONENT INSTALLATION / NOTIFICATION
NOT SELECTED FOR THE NIST ICS CONTROL SET
The organization, if information system component failures are detected:
- (a) Ensures that the standby components are successfully and transparently installed within [Assignment: organization-defined time period]; and
- (b) [Selection (one or more): activates [Assignment: organization-defined alarm]; automatically shuts down the information system].
Supplemental Guidance:
Automatic or manual transfer of components from standby to active mode can occur, for example, upon detection of component failures.
SI-13 (5) PREDICTABLE FAILURE PREVENTION | FAILOVER CAPABILITY
NOT SELECTED FOR THE NIST ICS CONTROL SET
The organization provides [Selection: real-time; near real-time] [Assignment: organization-defined failover capability] for the information system.
Supplemental Guidance:
Failover refers to the automatic switchover to an alternate information system upon the failure of the primary information system. Failover capability includes, for example, incorporating mirrored information system operations at alternate processing sites or periodic data mirroring at regular intervals defined by recovery time periods of organizations.
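The failover behavior described above can be sketched as an active/standby pair with state mirroring and a heartbeat timeout. This is a simplified illustration rather than a reference design: the `Node` and `FailoverPair` classes, the timeout value, and the alarm text are hypothetical, and time is passed in explicitly so the logic stays deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    state: dict = field(default_factory=dict)
    last_heartbeat: float = 0.0


class FailoverPair:
    """Active/standby pair that promotes the standby when heartbeats stop."""

    def __init__(self, primary, standby, timeout_s):
        self.active = primary
        self.standby = standby
        self.timeout_s = timeout_s
        self.alarms = []

    def mirror(self):
        """Copy the active node's state variables to the standby node."""
        self.standby.state = dict(self.active.state)

    def check(self, now):
        """Swap roles if the active node's heartbeat is older than the timeout."""
        if now - self.active.last_heartbeat <= self.timeout_s:
            return False
        self.alarms.append(f"failover: {self.active.name} -> {self.standby.name}")
        self.active, self.standby = self.standby, self.active
        self.active.last_heartbeat = now
        return True


pair = FailoverPair(Node("A", {"setpoint": 42}), Node("B"), timeout_s=2.0)
pair.mirror()                           # periodic data mirroring
switched_early = pair.check(now=1.0)    # heartbeat still fresh
switched_late = pair.check(now=5.0)     # heartbeat missed, standby promoted
```

Mirroring before the switchover is what lets the promoted node resume with the same state variables; how often to mirror is a trade-off against the recovery time the organization defines.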
REFERENCES:
- NIST Special Publication 800-82 | GUIDE TO INDUSTRIAL CONTROL SYSTEMS (ICS) SECURITY