Provide Notification When PPRC link failure rate is high on all PPRC paths

We recently had a WAN issue in which one of our circuits, provided by carrier A, was experiencing very high packet loss. Even though there was a second, healthy circuit provided by carrier B, the failure rate on carrier A's circuit was so high that all PPRC paths across two DS8000s at the primary site were experiencing PPRC timeouts. PPRC send response time, which is usually around 20-35 ms, rose to over a second. During this time we were in a CGFAIL state with nearly all 5000+ volumes suspended. There were hundreds of messages such as:

IEA074I MODERATE CONTROLLER HEALTH,MC=10,TOKEN=0015,SSID=4A04, 416
DEVICE NED=2107.998.IBM.75.0000000LNM01.0400,INTF=0341,
PPRC PATH DEGRADED

IEA498I 4C00,SSEUA2,PPRC-PATH ONE OR MORE PPRC PATHS ARE DEGRADED 594
SSID=4A0C (PRI)=0175-LNM01,CCA=00,SENSE=00101000 00FFFFF5 06000400
293C2304 E5293C05 4A0C3887 00006080 0000000

Among these messages, we noticed the following one, which led us to believe there was something wrong with the secondary DS8Ks:

IEA075I PPRC SUMMARY,SSID=4A00, 515
DEVICE NED=2107.998.IBM.75.0000000LNM01.0073,
SUSPENDED=034,PPRC=060,TOTAL=060,
REASON=SUSPEND(08),SECONDARY INTERNAL CONDITIONS

We began troubleshooting on the premise that there was a problem with the secondary DS8Ks, perhaps a write inhibit or something similar. This was not the case. After investigating the FCIP layer, we confirmed with carrier A that its network was dirty and experiencing a high number of retries. We took the network interfaces offline on the carrier A switch, and the PPRC links started becoming operational again.

The reason for this request is that when there are issues on the primary or secondary DS8Ks, we usually look for messages on the DSGUI. In this case there were none, and we had to open a support call and provide ODD dumps to have IBM confirm that a network issue was causing the PPRC timeouts. We would like to request an alert on the DSGUI when all links for an LSS on the primary encounter a high failure rate, as in this example:

LNM01 -> LNM31 PPRCPATHS=0010:0010 (Status: 10-CS_ESTABLISHED Reason: 20-CSERR_FCP_CONNECTED_HIGH_FAILURE_RATE )

If possible, a similar message on the secondary indicating a PPRC write error would also help. In our case we saw the following:

75LNM31, many PPRC GCpy target Volumes reported "PPRC_WRITE_ERROR" with "PPRC_SEQ_NUM_MISMATCH"
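
To make the requested alert condition concrete, here is a minimal Python sketch of the check we have in mind: raise an alert only when every path reported for an LSS carries the high failure rate reason. The line format is assumed from the single sample path line quoted above; the regular expression, the function names, and the idea of feeding it a list of such lines are our own illustration and not an existing DS8000 or DSGUI interface.

```python
import re

# Assumed line format, inferred only from the sample path line in this post.
PATH_LINE = re.compile(
    r"(?P<primary>\S+)\s*->\s*(?P<secondary>\S+)\s+"
    r"PPRCPATHS=(?P<ports>\S+)\s+"
    r"\(Status:\s*(?P<status>\S+)\s+Reason:\s*(?P<reason>\S+)\s*\)"
)

# Reason name as it appears in the sample line (without the numeric prefix).
HIGH_FAILURE = "CSERR_FCP_CONNECTED_HIGH_FAILURE_RATE"


def parse_path_line(line):
    """Return the fields of one path status line, or None if it does not match."""
    match = PATH_LINE.search(line)
    return match.groupdict() if match else None


def all_paths_high_failure(lines):
    """True only if at least one path line was parsed and every parsed
    path reports the high failure rate reason."""
    paths = [p for p in map(parse_path_line, lines) if p is not None]
    return bool(paths) and all(p["reason"].endswith(HIGH_FAILURE) for p in paths)


sample = (
    "LNM01 -> LNM31 PPRCPATHS=0010:0010 "
    "(Status: 10-CS_ESTABLISHED Reason: 20-CSERR_FCP_CONNECTED_HIGH_FAILURE_RATE )"
)
if all_paths_high_failure([sample]):
    print("ALERT: all PPRC paths for this LSS report a high failure rate")
```

The check deliberately ignores the case where only some paths are degraded, since the request is about the situation in which no healthy path remains for an LSS.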

Idea priority

High
