Understanding Functional Resonance Analysis: A Deeper Dive
Modeling intricate systems can be challenging. In this discussion, I aim to share insights on the Functional Resonance Analysis Method (FRAM), developed by Professor Erik Hollnagel. This overview is aimed particularly at readers outside the safety research domain where Hollnagel has gained recognition. My intention is to provide an informative introduction rather than an exhaustive or overly technical explanation.
Shall we explore whether this perspective is beneficial for understanding our increasingly complex world?
This is Part 3. Links to previous parts are provided below:

- FRAM Part 1: Ignorance
- FRAM Part 2: Functions
- FRAM Part 3: Hexagons (you are here)
Learning Through Practical Examples
Today, we will analyze a common system: the on-call process for a live software application. For those unfamiliar, this refers to the personnel and tools designated to address issues when a website encounters problems. I will ensure the explanation is straightforward.
Once we dissect the system, we can consider how to approach your own analysis, including the level of detail required, the objectives of the analysis, and how to identify functions.
(Did you notice that my simplified description was framed in terms of components rather than functions? That's a common tendency. It requires intentional effort to shift our thinking towards functions instead.)
Does this look complicated? Real systems often are. Let's unravel this step by step. We’ll begin by examining the functions adjacent to the red hexagon labeled “Verify System Health (before)” — functions that are peripheral to our system and do not require comprehensive modeling. First, some key definitions:
Oncall: The individual responsible for responding to alerts and resolving issues at any time.
Red border: Represents actions taken by the oncall individual.
Yellow border: Pertains to team members responsible for the software system in question.
Blue border: Represents external systems necessary for troubleshooting.
Grey color: Denotes functions that are either not fully modeled or are part of external systems.
Alert Generation
This function is depicted in grey because it serves as an input to the system. I won’t delve into its full complexity—the alert system itself isn’t the focus here. What’s critical is recognizing that an “alert” indicates a potential issue with our software.
Suppressing Alerts
In this instance, the alert is silenced for a specified duration. Even if the underlying issue persists, no alert will sound. It’s akin to disabling your phone alarm for a week—morning still arrives, but the alarm remains quiet.
The alert acts as an input since it is the element being suppressed. A precondition is that the suppression of this specific alert has been set up.
Alert Activation
When the alert is not suppressed, it “fires.” This activation is defined by its configuration, which is also a precondition. The configuration might specify actions such as notifying specific individuals via call, text, or email, based on various rules we won’t explore in detail here.
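The relationship between suppression, configuration, and firing can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any real alerting tool; the configuration shape and field names here are assumptions made for the example.

```python
# Sketch of the "Alert Activation" function: the alert fires only if it
# is not suppressed, and the configuration (a precondition) decides who
# is notified and how. All names are illustrative.

def activate_alert(alert, suppressed_alerts, config):
    """Return the notifications to send, or an empty list if suppressed."""
    if alert["name"] in suppressed_alerts:
        return []  # suppression wins: the alert stays quiet
    # Each configured rule pairs a contact with a channel (call, text, email).
    rules = config.get(alert["name"], {})
    return [{"contact": contact, "channel": channel}
            for contact, channel in rules.items()]
```

The key FRAM point survives even in this toy version: the configuration does not trigger the function (the alert does), but without it the function cannot produce a meaningful output.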
Configuring Alerts
This function involves modifying alert configurations to ensure that pertinent issues are brought to the attention of the oncall individual. We’re not modeling the configuration system itself, hence it appears in grey. We simply need to demonstrate how it connects to our system through its output to other dependent functions.
Silencing Alerts
If suppressing an alert is like disabling your phone alarm, silencing it resembles hitting the “snooze” button. It’s a way of saying, “I’m not concerned! Please stop interrupting me!” However, this action does not prevent the alert from firing again. As this is at the periphery of the system we’re analyzing, it is not fully represented and is marked as an input, also colored grey.
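The distinction between the two quieting mechanisms — the week-long disabled alarm versus the snooze button — can be made precise in a small sketch. This is an illustration under assumed data shapes (plain numeric timestamps, simple dictionaries), not any particular tool's behavior.

```python
# Suppression: the alert never fires inside its configured window,
# even if the underlying problem persists.
# Silencing (snooze): the alert is quiet only until the snooze
# expires, after which it can fire again.

def should_fire(alert_name, now, suppressions, snoozes):
    """Decide whether an alert fires at time `now`.

    suppressions: alert name -> (start, end) suppression window
    snoozes:      alert name -> time the snooze expires
    """
    if alert_name in suppressions:
        start, end = suppressions[alert_name]
        if start <= now <= end:
            return False  # suppressed: morning arrives, alarm stays off
    if now < snoozes.get(alert_name, float("-inf")):
        return False  # snoozed: quiet for now, but it will be back
    return True
```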
Increasing Complexity
Now let’s examine a few functions that exhibit more complexity and interdependence. I’ll explain as we progress.
Verify System Health (before)
When an alert activates as an input, it signals the need to assess the extent of the issue. A human typically performs this task, although that’s not strictly necessary. This function processes the alert (which should contain information about what is presumed to be malfunctioning and when it occurred) and conducts an investigation. This may involve consulting dashboards or logging into the affected system to gather further details. The key is that a thorough examination is conducted using expertise, knowledge, or tools beyond the scope of the alert system.
One control in this process is the authorization to access dashboards and log into systems. Another control is adherence to established oncall protocols for such situations.
If this process concludes that no significant issues exist, it will output a request to silence the alert. Perhaps a user verified that the alert was merely a test, or it pertains to a system already known to be under repair. Conversely, if corrective measures are required, the output will be to initiate a mitigation plan.
You’ll observe that this function appears more intricate. Indeed, I could break it down further into smaller functions to depict the various methods of verifying system health. However, the focus of my analysis here is on the on-call team structure and workflow, leading me to consolidate it into one function.
I have marked all functions necessitating direct action from the oncall with a red border for clarity. The label (before) indicates that there are two such functions with similar processes occurring at different times.
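If it helps to see the hexagon's aspects laid out explicitly, a FRAM function can be written down as a plain data structure. The sketch below encodes “Verify System Health (before)” using the aspects described above; the class and field names are my own shorthand, not part of Hollnagel's notation.

```python
# A FRAM function as a data structure: one field per aspect of the
# hexagon. Values here paraphrase the "Verify System Health (before)"
# description in the text; the Time aspect is left empty, as it is
# unused in this model.

from dataclasses import dataclass, field

@dataclass
class FramFunction:
    name: str
    inputs: list = field(default_factory=list)        # what triggers the function
    outputs: list = field(default_factory=list)       # what it produces
    preconditions: list = field(default_factory=list)
    resources: list = field(default_factory=list)
    controls: list = field(default_factory=list)
    time: list = field(default_factory=list)

verify_health = FramFunction(
    name="Verify System Health (before)",
    inputs=["Alert fires"],
    outputs=["Silence the alert", "Create initial mitigation plan"],
    controls=["Authorization to access dashboards and systems",
              "Established oncall protocols"],
)
```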
Create Initial Mitigation Plan
Next, we need to determine the necessary steps to restore system functionality. This may involve directly addressing the issue or implementing a workaround, among other options. This function takes as input all information gleaned from the “Verify System Health (before)” function about the system's condition. Additionally, it draws upon the alert’s contents as a resource. The distinction is that the input initiates the function, while the resource does not trigger it.
- This function may identify the need for additional team members, perhaps someone with expertise in the area that has failed or who has encountered this issue previously. If so, it will output a request to “Call for Other Team Participants.”
- If other team members are already engaged for any reason, it will output to “Team Investigation.”
- If assistance from another team is required (e.g., due to a power outage), it will output to “Escalation to Partner Team” for additional help.
- If the oncall individual can begin repairs independently or prior to help’s arrival, the function outputs directly to “Team Mitigation Steps.”
Note that this function can produce multiple outputs simultaneously, depending on the circumstances.
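The branching above — including the possibility of several outputs at once — can be sketched as a simple function. The boolean situation flags are illustrative assumptions for the example; they are not part of the FRAM notation itself.

```python
# Sketch of "Create Initial Mitigation Plan": given the situation,
# return every downstream function it activates. Note that more than
# one output can be produced at the same time.

def plan_outputs(needs_expert, team_engaged, other_team_needed, can_start_alone):
    outputs = []
    if needs_expert:
        outputs.append("Call for Other Team Participants")
    if team_engaged:
        outputs.append("Team Investigation")
    if other_team_needed:
        outputs.append("Escalation to Partner Team")
    if can_start_alone:
        outputs.append("Team Mitigation Steps")
    return outputs
```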
The Remaining Functions
I have illustrated examples of input, output, resource, control, and precondition, as well as the interconnections between some functions. While some functions are granular, others are more extensive. I will now present the output (in tabular format) for the remaining parts of the system. If you feel confident in your understanding, feel free to skip ahead.
A Brief Examination
You may have noticed that I elaborated more on some functions than on others. In this example, my aim is to depict the overall workflow and dependencies of the oncall process. Thus, the actual tasks involved in assessing system health or addressing problems have been “simplified.” In practice, the assessment of system health likely warrants its own FRAM analysis at a later date, as it can be quite complex.
Additionally, I did not incorporate Time. My analysis did not focus on the sequence of events, as I felt this was captured in the input and precondition aspects. However, in time-sensitive scenarios, it is crucial to consider timing, especially if input and preconditions do not ensure that your functions occur in the correct sequence. You may find that functions risk being initiated prematurely or too late to be effective.
To identify functions within your system, start with an event—something that occurs. Ask, “What happens next?” and follow the sequence of responses. You will uncover your functions by thinking in terms of actions rather than objects—focusing on functions, not components.
Upcoming Topics
In the final part of this series, we will discuss how to evaluate your system once you have developed a model. Ultimately, the goal of creating a model is not for its own sake, but to gain insights into your system's strengths, weaknesses, connections, and dependencies that you may not have previously recognized.