Spectrum Scale has long provided mechanisms to alert on events. In the past it was common to use callbacks for that; today, Scale provides webhooks that consumers can receive. mmhealth is also a good source of information about the status of the various cluster components, and it even feeds the component status shown in the GUI. While all of that is available, certain client use cases for monitoring and operations find that information very hard to consume. The Citi on IBM Cloud Spectrum Symphony use case is one of them.
Citi runs many shared-nothing architecture clusters on IBM Cloud and uses a combination of Netcool and ServiceNow as the consumers of alert events and their resolved counterparts. The idea is to capture failure alerts that can be used to generate service events, and then capture the corresponding resolved events that can be used to close those service events.
As a simple example, a disk_down failure is generated for nsd3 on node4 and sent to Netcool (the alert message consumer), which in turn creates a service request for the operations team in ServiceNow. The operations team receives the ticket, works on the problem, and brings nsd3 on node4 back online; at that point a resolved counterpart alert is received by Netcool and correlated with the original disk_down message, so Netcool can automatically close the corresponding support ticket in ServiceNow.
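What Netcool effectively needs is a small open/close lifecycle keyed on something that ties a resolved alert back to the original failure. The sketch below illustrates that consumer-side logic only; the function names, the fake ticket numbers, and the node4/nsd3 key are assumptions for the example and are not Scale, Netcool, or ServiceNow APIs. The missing piece today is precisely a reliable correlation key, as the blockers below show.

```python
import itertools

_ticket_numbers = itertools.count(1)
open_tickets: dict[str, str] = {}   # correlation key -> ticket number

def open_ticket(description: str) -> str:
    # Placeholder for whatever ServiceNow integration Netcool drives.
    ticket = f"INC{next(_ticket_numbers):07d}"
    print(f"ServiceNow: opened {ticket} for '{description}'")
    return ticket

def close_ticket(ticket: str) -> None:
    print(f"ServiceNow: closed {ticket}")

def on_alert(correlation_key: str, failed: bool, description: str) -> None:
    if failed:
        # Failure alert: raise a service request and remember it.
        open_tickets[correlation_key] = open_ticket(description)
    elif correlation_key in open_tickets:
        # Resolved counterpart: find the original failure and close its ticket.
        close_ticket(open_tickets.pop(correlation_key))

# The disk_down example above, end to end:
on_alert("node4/nsd3", failed=True, description="nsd3 on node4 is down")
on_alert("node4/nsd3", failed=False, description="nsd3 on node4 is up")
```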
The IBM team on the Citi project and the Scale development team tried to make the flow above work by consuming the Scale webhooks, but a few blockers made it difficult, and sometimes impossible, to consume and correlate the information:
The failure events don't have a unique identifier, so correlating a disk_down message with a disk_up message is not trivial and cannot be coded as a rule in Netcool.
Multiple nodes may report the same global event (global cluster state changes such as node_down, quorum changes, and so on), making Netcool believe there have been multiple failure events rather than just one.
The resolved counterpart of a given failure is reported in an INFO message but carries a different identifier, internalComponent, or event_type, and sometimes one or more of those fields are empty (usually identifier), making it impossible to correlate the resolved (healthy) message with the earlier failure message. The sketch below illustrates the kind of guesswork this forces onto the consumer.
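To make the blockers concrete, here is roughly what a consumer is forced to attempt with today's payloads. Apart from the identifier, internalComponent, and event_type fields named above, every name in the sketch is an assumption, and the approach breaks exactly where the blockers say it does.

```python
# Illustrative only: a synthetic correlation key built by hand because no
# stable unique id exists. The "node" field, the RESOLVES pairing table, and
# the key format are assumptions made for the sake of the example.
RESOLVES = {"disk_up": "disk_down", "node_up": "node_down"}   # hand-maintained

def fallback_key(event: dict) -> str | None:
    """Try to derive a correlation key from node + component + event name."""
    name = RESOLVES.get(event.get("event_type"), event.get("event_type"))
    node = event.get("node")
    component = event.get("internalComponent")
    if not name or not node or not component:
        return None   # empty or mismatched fields leave nothing to correlate on
    # Global events reported by several nodes all produce similar keys, so the
    # consumer additionally has to deduplicate them on its side.
    return f"{node}/{component}/{name}"
```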
The ask is to provide an alerting mechanism that can be easily consumed by monitoring tools such as Netcool and ServiceNow, and that allows those tools to correlate failure events with their resolved counterparts.
The mmhealth monitor inside Scale gets very close to what is being asked for. It monitors the health of the various components of the system and can report that a given component is healthy, degraded, or failed. We need something that takes the status of those components in Scale and sends an alert out through a push mechanism (webhooks would work) in a way that:
each message has a unique identifier associated with it
each message has a target associated with it (presumably a node or set of nodes)
each message has a component associated with it (optional, such as disk, filesystem, and so on)
each message has a state associated with it (healthy, degraded, failed)
each message has a description of the problem
each message has a field that distinguishes a failed message from a resolved message (a sketch of the resulting message shape follows this list)
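Taken together, the requested message could be shaped roughly like the sketch below. This is only an illustration of the fields listed above, not a proposed wire format, and the type names are made up for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class State(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    FAILED = "failed"

@dataclass
class Alert:
    id: str                          # unique identifier, shared by a failure and its resolved counterpart
    target: str                      # node (or set of nodes) the alert refers to
    state: State                     # healthy, degraded, or failed
    description: str                 # human-readable description of the problem
    is_resolved: bool                # distinguishes a failure from its resolved counterpart
    component: Optional[str] = None  # optional component, such as disk or filesystem
```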
Consider the following scenario in a shared-nothing cluster, presented as a simplified timeline (a consumer-side correlation sketch follows the timeline):
1. disk nsd1 on node4 goes down → alert generated {id: xyz001, target: node4, component: nsd1, state: failed, description: "nsd1 is down!", is_resolved: "false"}
2. disk nsd3 on node2 goes down → alert generated {id: xyz002, target: node2, component: nsd3, state: failed, description: "nsd3 is down!", is_resolved: "false"}
3. disk nsd1 on node4 is brought up (problem fixed, correlated to #1, same id) → alert generated {id: xyz001, target: node4, component: nsd1, state: healthy, description: "nsd1 is up!", is_resolved: "true"}
4. disk nsd1 on node4 goes down again (new event, no correlation to previous failures) → alert generated {id: xyz003, target: node4, component: nsd1, state: failed, description: "nsd1 is down!", is_resolved: "false"}
5. disk nsd1 on node4 is brought up (problem fixed, correlated to #4, same id) → alert generated {id: xyz003, target: node4, component: nsd1, state: healthy, description: "nsd1 is up!", is_resolved: "true"}
6. Node7 (non-quorum, non-manager) goes down → alert generated {id: xyz004, target: node7, component: null, state: failed, description: "Node7 is down!", is_resolved: "false"}
7. disk nsd3 on node2 is brought up (problem fixed, correlated to #2, same id) → alert generated {id: xyz002, target: node2, component: nsd3, state: healthy, description: "nsd3 is up!", is_resolved: "true"}
8. Node7 comes back online (problem fixed, correlated to #6, same id) → alert generated {id: xyz004, target: node7, component: null, state: healthy, description: "Node7 is up!", is_resolved: "true"}
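With a stable id and an is_resolved flag, the consumer no longer has to guess. The sketch below simply replays the timeline above (as plain dicts, keeping only the fields that matter for correlation) and pairs every resolved alert with its original failure by id alone; it is illustrative, not any product's API.

```python
# Replay of the timeline: every resolved alert closes out the failure with the
# same id, with no guessing based on node, component, or event names.
timeline = [
    {"id": "xyz001", "target": "node4", "component": "nsd1", "is_resolved": False},
    {"id": "xyz002", "target": "node2", "component": "nsd3", "is_resolved": False},
    {"id": "xyz001", "target": "node4", "component": "nsd1", "is_resolved": True},
    {"id": "xyz003", "target": "node4", "component": "nsd1", "is_resolved": False},
    {"id": "xyz003", "target": "node4", "component": "nsd1", "is_resolved": True},
    {"id": "xyz004", "target": "node7", "component": None,   "is_resolved": False},
    {"id": "xyz002", "target": "node2", "component": "nsd3", "is_resolved": True},
    {"id": "xyz004", "target": "node7", "component": None,   "is_resolved": True},
]

open_failures: dict[str, dict] = {}
for alert in timeline:
    if not alert["is_resolved"]:
        open_failures[alert["id"]] = alert           # a ticket would be opened here
    else:
        original = open_failures.pop(alert["id"])    # and that same ticket closed here
        print(f"{alert['id']}: resolved failure on {original['target']}")

assert not open_failures   # every failure was matched with its resolved counterpart
```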
Certain types of failure might trigger multiple alerts. For example, a quorum node going down would trigger several kinds of alerts: node_down, disks_down (in a shared-nothing architecture), quorum_loss, GUI_down, and so on. When the node comes back online, it would generate counterparts to those messages, such as node_up, disks_up, quorum_reached, and GUI_up. This is a very simplistic example, and I realize it is more complex than that.
The approach above would allow operations and monitoring teams to use commercial tools such as Netcool and ServiceNow to more easily support our clients' environments. This is much needed by Citibank, but it can be reused by other clients and other monitoring tools.