CES NFS - recover failed NFS node (marked with F flag) without reboot

Hi

We had an NFS Ganesha node that failed, and we were not able to recover it; only a reboot solved it.

Sysmon attempted to remove the F flag three times, but all attempts to acquire the lock failed.
As a result, the node remained in a failed state despite being healthy.
Restarting the node allowed the process to retry successfully once the lock became available.

NFS failure: the RPC null checks and the stat checks (which collect the I/O count twice and fail if the two values are the same) both failed. That is why nfs_not_active was reported. The NFS problem itself was resolved soon afterwards.
Why the node stayed in the failed state even though NFS was healthy: removing the Failed state requires a fail-over lock, and we did not get it. The monitor tries three times. By the time the lock became available, the retries had already been exhausted. This is working as designed, but ideally it should be redesigned.
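
To illustrate the behaviour described above, here is a minimal Python sketch of a bounded retry compared with one that keeps retrying. This is not the Sysmon code; the names `clear_failed_flag`, `try_acquire_lock`, `max_attempts`, and `lock_free_on_fourth_try` are hypothetical and exist only for this example. It assumes the lock becomes free just after the third attempt, which matches what we saw.

```python
import time
from itertools import count
from typing import Callable, Optional


def clear_failed_flag(try_acquire_lock: Callable[[], bool],
                      max_attempts: Optional[int] = 3,
                      delay_seconds: float = 0.0) -> bool:
    """Return True once the lock is acquired, False if the attempts run out.

    max_attempts=None means "keep retrying until the lock is free".
    """
    attempts = count(1) if max_attempts is None else range(1, max_attempts + 1)
    for _ in attempts:
        if try_acquire_lock():
            return True  # lock acquired, the F flag could be removed here
        time.sleep(delay_seconds)
    return False  # retries exhausted, the node stays marked as failed


def lock_free_on_fourth_try() -> Callable[[], bool]:
    """Hypothetical lock that is busy three times and free on the fourth try."""
    outcomes = iter([False, False, False, True])
    return lambda: next(outcomes)


if __name__ == "__main__":
    # Three attempts only: gives up just before the lock becomes free.
    print(clear_failed_flag(lock_free_on_fourth_try(), max_attempts=3))
    # No attempt limit: recovers as soon as the lock is available.
    print(clear_failed_flag(lock_free_on_fourth_try(), max_attempts=None))
```

With a fixed limit of three attempts the flag is never cleared in this scenario, while the unbounded variant recovers without a reboot. A real redesign would probably want a longer back-off or a later re-check rather than an endless loop.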

Idea priority

Medium


Guest

Jul 7, 2025

This is a bug. It was resolved with Storage Scale 5.2.3.1.


Guest

Feb 24, 2025

Hi

Can we have mmhealth monitor the following errors in the ganesha.log file (or all CRIT messages):

2025-01-26 12:20:48 : epoch 001c0612 : ess4-proto6 : gpfs.ganesha.nfsd-2163267[svc_4386] fsal_find_fd :FSAL :CRIT :Open for locking failed for access Read/Write
2025-01-26 12:20:48 : epoch 001c0612 : ess4-proto6 : gpfs.ganesha.nfsd-2163267[svc_2687] fsal_find_fd :FSAL :CRIT :Open for locking failed for access Read/Write
2025-01-26 12:20:48 : epoch 001c0612 : ess4-proto6 : gpfs.ganesha.nfsd-2163267[svc_4005] fsal_find_fd :FSAL :CRIT :Open for locking failed for access Read/Write
2025-01-26 12:20:49 : epoch 001c0612 : ess4-proto6 : gpfs.ganesha.nfsd-2163267[svc_4612] fsal_find_fd :FSAL :CRIT :Open for locking failed for access Read/Write

2025-01-26 16:07:25 : epoch 0010060f : ess4-proto2 : gpfs.ganesha.nfsd-2657517[svc_749] mdcache_lru_fds_available :INODE LRU :CRIT :FD Hard Limit (943718) Exceeded (open_fd_count = 943719), waking LRU thread.
2025-01-26 16:07:25 : epoch 0010060f : ess4-proto2 : gpfs.ganesha.nfsd-2657517[svc_989] mdcache_lru_fds_available :INODE LRU :CRIT :FD Hard Limit (943718) Exceeded (open_fd_count = 943723), waking LRU thread.
2025-01-26 16:07:25 : epoch 0010060f : ess4-proto2 : gpfs.ganesha.nfsd-2657517[svc_752] mdcache_lru_fds_available :INODE LRU :CRIT :FD Hard Limit (943718) Exceeded (open_fd_count = 943723), waking LRU thread.
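
Until something like this exists in mmhealth, a small script can pick the CRIT lines out of the log. The following is only a rough Python sketch of the kind of check I mean, not an mmhealth feature. The regular expression is inferred from the lines quoted above and is an assumption about the log format, and the names `LINE_RE`, `count_crit`, and `SAMPLE` are made up for this example.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern inferred from the sample lines above (an assumption, not an official format).
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) : epoch (?P<epoch>\S+) : "
    r"(?P<host>\S+) : (?P<proc>\S+)\[(?P<thread>[^\]]+)\] (?P<func>\S+) "
    r":(?P<component>[^:]+?) :(?P<level>\w+) :(?P<msg>.*)$"
)


def count_crit(lines: Iterable[str]) -> Counter:
    """Count CRIT messages per (host, function); unparsable lines are skipped."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "CRIT":
            counts[(match.group("host"), match.group("func"))] += 1
    return counts


# Three of the lines quoted above, used as sample input.
SAMPLE = [
    "2025-01-26 12:20:48 : epoch 001c0612 : ess4-proto6 : gpfs.ganesha.nfsd-2163267[svc_4386] fsal_find_fd :FSAL :CRIT :Open for locking failed for access Read/Write",
    "2025-01-26 12:20:48 : epoch 001c0612 : ess4-proto6 : gpfs.ganesha.nfsd-2163267[svc_2687] fsal_find_fd :FSAL :CRIT :Open for locking failed for access Read/Write",
    "2025-01-26 16:07:25 : epoch 0010060f : ess4-proto2 : gpfs.ganesha.nfsd-2657517[svc_749] mdcache_lru_fds_available :INODE LRU :CRIT :FD Hard Limit (943718) Exceeded (open_fd_count = 943719), waking LRU thread.",
]

if __name__ == "__main__":
    for (host, func), n in count_crit(SAMPLE).items():
        print(f"{host} {func}: {n} CRIT message(s)")
```

To run it against a real log, pass an open file object to `count_crit` instead of `SAMPLE`.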
