Improve zOS host IP connection loss detection and failover behavior

Field incidents have shown that, with the current design, CSM detection of a lost z/OS host IP connection and failover to a redundant active host connection can be delayed by 15-20 minutes. This is most likely when the connected system used by the HyperSwap session fails during a planned or unplanned HyperSwap.

CSM might still recognize the HyperSwap (HS) trigger, but in turn tries to query the HS status over a host connection that may no longer be working. The default timeout of 15 minutes for command responses could be decreased, but this may cause problems when normal command processing from IOS takes longer than the configured timeout.

Suggestions for improved design:

- Each host connection should have its own connection pinger with a configurable timeout (e.g. 60 seconds). When the connection ping fails, the connection should be closed with an I/O exception.

- Once a connection has been closed, connection retries can start independently of any session command processing. When the connection is re-established, it can be marked as active again and possibly reused by sessions that communicate with this sysplex.

- Session communication with the system currently uses only one host connection, even if more active host connections to the sysplex are available. When a command times out after 15 minutes, it seems another attempt is made over the same connection with a shorter timeout. If the connection pinger closes the unresponsive host connection sooner, all sessions using that closed connection should react to this event and immediately switch over to another active host connection to the same sysplex, if available.
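To make the three suggestions above concrete, here is a minimal sketch in Python. It is not CSM code: `HostConnection`, `SysplexConnections`, and the `ping` callable are hypothetical names invented for illustration, and the ping is assumed to return `False` when the host does not answer within the configured timeout.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class HostConnection:
    """Hypothetical stand-in for one IP connection to a z/OS system."""
    name: str
    # Assumed ping hook: takes a timeout in seconds, returns True if the host answered.
    ping: Callable[[float], bool]
    ping_timeout_s: float = 60.0   # configurable per connection
    active: bool = True

    def check(self) -> bool:
        """Run one ping cycle.

        A failed ping marks the connection closed; a later successful
        ping (the retry) marks it active again. Neither step depends on
        any session command being in progress.
        """
        self.active = self.ping(self.ping_timeout_s)
        return self.active


class SysplexConnections:
    """All host connections that reach the same sysplex."""

    def __init__(self, connections: Iterable[HostConnection]) -> None:
        self.connections: List[HostConnection] = list(connections)

    def poll(self) -> None:
        """Ping every connection, including closed ones (retry)."""
        for conn in self.connections:
            conn.check()

    def pick(self, preferred: Optional[HostConnection] = None) -> Optional[HostConnection]:
        """Return the preferred connection if active, else any other active one."""
        if preferred is not None and preferred.active:
            return preferred
        for conn in self.connections:
            if conn.active:
                return conn
        return None
```

In this sketch a session would call `pick()` with its current connection before each command. If the pinger has closed that connection, the session gets a different active connection to the same sysplex, or `None` when no connection is left.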

Idea priority

High

Guest | Sep 15, 2023

Reopening this as an IDEA uncommitted candidate. The issue is that once the command has been issued across the IP network, there isn't an easy way for us to determine that the connection is actually dead. We need to rearchitect this so that we can either determine faster that the connection to a system on the sysplex is gone, or determine that the connection is gone after the command has been sent and then kill the sent command. Will look at this as a future feature.
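One way to picture the second option in the comment above (noticing a dead connection after the command has been sent) is the Python sketch below. It is not CSM code: `HostConnectionLost`, `wait_for_response`, and the `connection_dead` event are hypothetical. The sketch assumes some independent monitor, such as a connection pinger, sets the event. Cancelling the future only abandons the local wait; it does not recall a command that has already reached the host.

```python
import threading
import time
from concurrent.futures import Future
from concurrent.futures import TimeoutError as FutureTimeout


class HostConnectionLost(OSError):
    """Hypothetical error raised when the connection dies mid-command."""


def wait_for_response(response: Future,
                      connection_dead: threading.Event,
                      command_timeout_s: float = 900.0,
                      poll_interval_s: float = 1.0):
    """Wait for a command response, but stop early if the connection is flagged dead.

    command_timeout_s defaults to 900 s, mirroring the 15 minute command
    timeout mentioned in the idea; poll_interval_s is an assumed value.
    """
    deadline = time.monotonic() + command_timeout_s
    while True:
        if response.done():
            return response.result()
        if connection_dead.is_set():
            response.cancel()   # abandon the local wait only
            raise HostConnectionLost("connection closed while command was pending")
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            response.cancel()
            raise TimeoutError("no response within command timeout")
        try:
            # Wake up regularly so a dead connection is noticed quickly.
            return response.result(timeout=min(poll_interval_s, remaining))
        except FutureTimeout:
            continue
```

With this shape the long command timeout can stay in place for slow but healthy hosts, while a dead connection is noticed within roughly one poll interval after the monitor flags it.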

Guest | Jul 24, 2023

Not sure whether the code works exactly as described; however, this appears more like a defect, as the redundant connection should fail over faster. Opened an internal defect to track looking into this issue more closely.
