Monitoring and Alerting Local Filesystems Usage in Advance for IBM 9500 Nodes

We recently hit node error 565 ("Node internal disk is failing"), caused by a full /dumps filesystem. The affected node was reported in service state and had to be rebooted.

panel_name  cluster_id        cluster_name  node_id  node_name  relation  node_status  error_data
01-2        00000XXXXXXXXXXX  Flash840_Sec  2        node2      local     Service      565 Disk full: /dumps
01-1        00000XXXXXXXXXXX  Flash840_Sec  3        node1      partner   Active
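As a rough illustration of how a listing like the one above could be checked automatically, here is a minimal Python sketch. It assumes the listing has already been captured as plain text with one node per line and the eight whitespace-separated columns shown in the header. The function names `parse_node_rows` and `nodes_needing_attention` are hypothetical and are not part of any IBM tooling.

```python
# Column names taken from the header row of the listing above.
FIELDS = [
    "panel_name", "cluster_id", "cluster_name", "node_id",
    "node_name", "relation", "node_status", "error_data",
]


def parse_node_rows(text):
    """Turn the captured listing into a list of dicts, one per node."""
    rows = []
    for line in text.strip().splitlines():
        # error_data may contain spaces, so split at most 7 times.
        parts = line.split(None, len(FIELDS) - 1)
        if not parts or parts[0] == "panel_name":
            continue  # skip blank lines and the header row
        # A healthy node has no error_data, so pad the missing column.
        parts += [""] * (len(FIELDS) - len(parts))
        rows.append(dict(zip(FIELDS, parts)))
    return rows


def nodes_needing_attention(rows):
    """Return the rows whose node_status is anything other than Active."""
    return [row for row in rows if row["node_status"] != "Active"]
```

Run against the two rows above, this would flag node2 (status Service, error data "565 Disk full: /dumps") and leave node1 alone. By that point the node is already in service state, though, which is why the request here is for alerting earlier, on filesystem usage itself.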

There is no mechanism to monitor these filesystems and alert in advance when usage crosses a threshold (>80%). As a result, no cleanup action was taken, the filesystem reached 99%, and node services were impacted.

This issue can be addressed by providing an appropriate monitoring and alerting mechanism that notifies customers and the IBM support team in advance.

Please treat this idea as highly critical.

superuser>fs_usage
Filesystem                  Size  Used  Avail  Use%  Mounted on
/dev/mapper/SVC_Encrypted1  8.0G  637M  7.0G   9%    /
/dev/mapper/SVC_Encrypted5  14G   2.7M  13G    1%    /tmp
/dev/mapper/SVC_Encrypted4  18G   4.1G  13G    25%   /opt
/dev/md2p4                  101G  66G   30G    70%   /dumps
/dev/md2p5                  1.4G  91M   1.2G   8%    /var
/dev/md2p2                  6.7G  281M  6.1G   5%    /upgrade
/dev/mapper/SVC_Encrypted2  3.3G  468M  2.7G   15%   /compass
/dev/md2p3                  2.0G  48M   1.8G   3%    /data
/dev/mapper/SVC_Encrypted3  8.0G  4.1M  7.6G   1%    /home
/dev/sda2                   192G  70M   182G   1%    /hdata1
/dev/sdb2                   192G  28K   182G   1%    /hdata2
/dev/sdc1                   7.4G  260K  7.4G   1%    /run/do_usb_16087
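To make the request concrete, here is a minimal Python sketch of the kind of threshold check we have in mind, applied to the output above. It assumes the output has been captured as text and that each data row has six whitespace-separated columns, as shown. The 80% warning level comes from this idea; the 95% critical level and the names `parse_fs_usage` and `classify` are assumptions made purely for illustration.

```python
def parse_fs_usage(text):
    """Extract (mount_point, use_percent) pairs from df-style output."""
    results = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have exactly 6 columns; the header ("Mounted on")
        # splits into 7, and blank lines into 0, so both are skipped.
        if len(parts) != 6 or not parts[4].endswith("%"):
            continue
        try:
            use_percent = int(parts[4].rstrip("%"))
        except ValueError:
            continue  # not a numeric Use% value
        results.append((parts[5], use_percent))
    return results


def classify(use_percent, warn=80, crit=95):
    """Map a usage percentage to an alert level.

    warn=80 follows the >80% threshold in this idea;
    crit=95 is an assumed value, not an IBM default.
    """
    if use_percent >= crit:
        return "critical"
    if use_percent > warn:
        return "warning"
    return "ok"


def alerts(text):
    """Return (mount_point, use_percent, level) for anything above ok."""
    found = []
    for mount, use_percent in parse_fs_usage(text):
        level = classify(use_percent)
        if level != "ok":
            found.append((mount, use_percent, level))
    return found
```

With the output above, nothing would be flagged yet because /dumps is at 70%. The same check would raise a warning once /dumps passes 80%, well before it reaches 99% and takes the node into service state.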

The monitoring, alerting, and metrics available for the infrastructure should include:

CPU (aggregate and by core) - % used
Memory (aggregate and by partition or function) - % used
Disk (all disks the node needs to operate; array disks used for storage capacity are not included here) - % used
Network rates (aggregate for the array and by purpose) - bytes
Network errors (aggregate and by purpose) - count
Network latency (aggregate and by purpose) - time
Network utilization (aggregate and by purpose) - % used

Idea priority

Urgent

Admin

Philip Clark

Oct 25, 2024

Filesystem monitoring is addressed in another Idea.
The other metrics referenced (CPU, memory, network, etc.) are already available via call home telemetry, or via a support snap when debugging network issues.

Guest

Oct 2, 2024

Checking in on this idea. The lack of Observability on the platform is of high concern for us. When will this be reviewed?
