IBM System Storage Ideas Portal

Status: Planned for future release
Created by: Guest
Created on: Feb 9, 2023

Enhancement request for the Spectrum Scale event alerting mechanism to allow event correlation

Spectrum Scale has long provided mechanisms to alert on events. In the past it was common to use callbacks for this; today Scale provides webhooks that consumers can receive. In addition, mmhealth is a good source of information about the status of the various cluster components; it even feeds the component status shown in the GUI. Even with all of that available, certain client use cases for monitoring and operations find it very hard to consume that information. The Citi on IBM Cloud Spectrum Symphony use case is one of them.


Citi runs many shared-nothing architecture clusters on IBM Cloud and uses a combination of Netcool and ServiceNow as the consumers of alert events and their resolved counterpart events. The goal is to capture failure alerts that can be used to generate service events, and then capture the resolved events that can be used to close those service events.

As a simple example, a disk_down failure is generated for nsd3 on node4 and sent to Netcool (the alert message consumer), which in turn creates a service request for the operations team in ServiceNow. The operations team receives the ticket, fixes the problem, and brings nsd3 on node4 back online. At that point a resolved counterpart alert message should be received by Netcool and correlated to the original disk_down message, so that Netcool can automatically close the corresponding support ticket in ServiceNow.


The Citi project IBM team and the Scale development team tried to make the flow above work by consuming the Scale webhooks, but a few blockers made it difficult, and sometimes impossible, for the information to be consumed and correlated:

  • The failure events do not have a unique identifier, so correlating a disk_down message with a disk_up message is not trivial and cannot be coded as a rule in Netcool.

  • Multiple nodes may report the same global event (global cluster state changes such as node_down, quorum change, and so on), which makes Netcool conclude that there have been multiple failure events rather than just one.

  • The resolved counterpart of a given failure is reported in an INFO message but carries a different identifier, internalComponent, or event_type, and sometimes one or more of those fields are empty (usually identifier), making it impossible to correlate the resolved (healthy) message with the earlier failure message (the sketch after this list shows the kind of workaround this forces and why it breaks).
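
For illustration only, the consumer-side workaround this forces looks roughly like the sketch below. The field names (identifier, internalComponent, event_type, node) follow the webhook fields mentioned above, but the payload shape as a whole is an assumption, not the documented schema:

# Illustrative sketch only: without a stable id, a consumer has to build a
# composite correlation key out of whatever fields the webhook event carries.
# Field names follow the ones mentioned above; the payload shape is assumed.
def correlation_key(event: dict):
    key = (
        event.get("node"),
        event.get("identifier"),
        event.get("internalComponent"),
        event.get("event_type"),
    )
    # If any field is empty in the resolved/INFO message (often identifier),
    # or is reported differently than in the failure message, the keys no
    # longer match and the original failure can never be auto-closed.
    if any(part in (None, "") for part in key):
        return None  # correlation impossible
    return key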

The ask here is an alerting mechanism that can be easily consumed by monitoring tools such as Netcool and ServiceNow and that allows those tools to correlate failure events with their resolved counterparts.

The mmhealth monitor inside Scale comes very close to what is being asked for here. It monitors the health of the various components of the system and can report that a given component is healthy, degraded, or failed. What is needed is something that takes the status of the components in Scale and pushes an alert out (webhooks would work) in such a way that:


  • each message has a unique identifier

  • each message has a target (presumably a node or set of nodes)

  • each message optionally has a component (such as disk, filesystem, and so on)

  • each message has a state (healthy, degraded, failed)

  • each message has a description of the problem

  • each message has a field that distinguishes a failure from its resolved counterpart (a payload sketch follows this list)
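
For illustration only, a minimal sketch of what such an alert payload could look like, using the field names from the list above. None of this is an existing Scale interface; the names and transport are assumptions meant to make the request concrete:

# Hypothetical alert payload for the proposed push mechanism.
# Field names mirror the list above; this is not an existing Spectrum Scale
# interface, only a sketch of what the request asks for.
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class ScaleAlert:
    id: str                          # unique, shared by a failure and its resolved counterpart
    target: str                      # node or set of nodes affected
    state: str                       # "healthy", "degraded", or "failed"
    description: str                 # human-readable problem text
    is_resolved: bool                # distinguishes a failure from its resolved counterpart
    component: Optional[str] = None  # optional: disk, filesystem, and so on


failure = ScaleAlert(id="xyz001", target="node4", component="nsd1",
                     state="failed", description="nsd1 is down!", is_resolved=False)
resolved = ScaleAlert(id="xyz001", target="node4", component="nsd1",
                      state="healthy", description="nsd1 is up!", is_resolved=True)

# The webhook body would simply be the JSON form of these records.
print(json.dumps(asdict(failure)))
print(json.dumps(asdict(resolved)))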

Consider the following simplified scenario in a shared-nothing cluster, presented as a timeline; a sketch of a consumer performing this correlation follows the timeline:


1. disk nsd1 on node4 goes down → alert generated {id:xyz001, target: node4, component: nsd1, state: failed, description: “nsd1 is down!”, is_resolved: “false”}


2. disk nsd3 on node2 goes down → alert generated {id:xyz002, target: node2, component: nsd3, state: failed, description: “nsd3 is down!”, is_resolved: “false”}


3. disk nsd1 on node4 is brought up (problem is fixed, correlated to #1, uses same id) → alert generated {id:xyz001, target: node4, component: nsd1, state: healthy, description: “nsd1 is up!”, is_resolved: “true”}


4. disk nsd1 on node4 goes down again (new event, no correlation to previous failures) → alert generated {id:xyz003, target: node4, component: nsd1, state: failed, description: “nsd1 is down!”, is_resolved: “false”}


5. disk nsd1 on node4 is brought up (problem fixed, correlated to #4, same id) → alert generated {id:xyz003, target: node4, component: nsd1, state: healthy, description: “nsd1 is up!”, is_resolved: “true”}


6. Node7 (non-quorum, non-manager) goes down → alert generated {id:xyz004, target: node7, component: null, state: failed, description: “Node7 is down!”, is_resolved: “false”}


7. disk nsd3 on node2 is brought up (problem is fixed, correlated to #2, same id) → alert generated {id:xyz002, target: node2, component: nsd3, state: healthy, description: “nsd3 is up!”, is_resolved: “true”}


8. Node7 comes back online (problem fixed, correlated to #6, same id) → alert generated {id:xyz004, target: node7, component: null, state: healthy, description: “Node7 is up!”, is_resolved: “true”}
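
With identifiers behaving as in the timeline above, consumer-side correlation becomes a simple lookup by id. A minimal sketch of what a Netcool-style consumer could do, assuming the hypothetical payload shape sketched earlier (the ticket handling is stubbed out):

# Hypothetical consumer-side correlation keyed on the proposed alert id.
# open_tickets maps an alert id to a ticket reference in the ticketing system.
open_tickets = {}


def handle_alert(alert):
    if not alert["is_resolved"]:
        # Failure alert: open a ticket (stubbed) and remember it by id.
        ticket = "INC-" + alert["id"]  # stand-in for a ServiceNow call
        open_tickets[alert["id"]] = ticket
        print("opened", ticket, "-", alert["description"])
    else:
        # Resolved counterpart: same id, so the ticket can be closed automatically.
        ticket = open_tickets.pop(alert["id"], None)
        if ticket:
            print("closed", ticket, "-", alert["description"])


# Replaying steps 1 and 3 of the timeline:
handle_alert({"id": "xyz001", "target": "node4", "component": "nsd1",
              "state": "failed", "description": "nsd1 is down!", "is_resolved": False})
handle_alert({"id": "xyz001", "target": "node4", "component": "nsd1",
              "state": "healthy", "description": "nsd1 is up!", "is_resolved": True})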


Certain types of failure might trigger multiple alerts. For example, a quorum node going down would trigger alerts such as node_down, disks_down (shared-nothing architecture), quorum_loss, GUI_down, and so on. When the node came back online, it would generate counterparts to those messages, such as node_up, disks_up, quorum_reached, GUI_up. This is a deliberately simplistic example; I realize the real behavior is more complex.


The approach above would allow operations and monitoring teams to use commercial tools such as Netcool and ServiceNow to more easily support our clients' environments. This is much needed by Citibank but can be reused by other clients and other monitoring tools.

Idea priority: Urgent