The problems to be discussed:
a) Reclaim of a non-exported, not-properly-ejected, physically damaged tape should have been easier in a multi-cluster grid, in the scenario DTNA experienced.
b) Reclaim of offsite tapes should not have been impacted by the damaged tape, in the scenario DTNA experienced.
c) Even after the damaged tape's PVOL showed zero active LVOLs, nothing DTNA could do from the customer GUI would logically eject the tape or stop the error messages it kept generating.
The fundamental design shortcoming, from DTNA's perspective, is that the TS7760T in question insisted on regaining access to the damaged tape (which was impossible), ignoring the fact that it could have obtained all of the needed LVOLs from other clusters in the grid. Every LVOL on the damaged PVOL, and every LVOL on the offsite tapes, had extra consistent copies in the cache of one or two other TS7700s in the same grid family. It does not make sense that the customer has to involve the support center to initiate a Read Only Recovery (ROR) process. The EJECT or MOVE commands from the customer GUI should have been able to remove the tape from consideration, or at least move as many LVOLs as possible (just as the ROR does).
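To illustrate the fallback behavior DTNA expected from EJECT/MOVE, here is a minimal Python sketch. The Cluster class and eject_damaged_pvol function are invented names for illustration only, not the TS7700's actual internal interfaces; they simply model "use a consistent peer copy instead of the damaged cartridge, the way ROR does."

# Hypothetical model only: Cluster and eject_damaged_pvol are invented
# names, not actual TS7700 interfaces.
from dataclasses import dataclass, field

@dataclass
class Cluster:
    name: str
    cache: set = field(default_factory=set)  # LVOLs with consistent copies in cache

    def has_consistent_copy(self, lvol: str) -> bool:
        return lvol in self.cache

def eject_damaged_pvol(lvols_on_pvol, peers):
    """Remove a damaged PVOL from consideration, sourcing each LVOL from
    any peer cluster holding a consistent copy (as the ROR process does)."""
    unrecoverable = []
    for lvol in lvols_on_pvol:
        if any(peer.has_consistent_copy(lvol) for peer in peers):
            continue  # a copy exists elsewhere in the grid family; re-home it
        unrecoverable.append(lvol)  # only these should block the eject
    return unrecoverable

# In DTNA's scenario every LVOL had one or two extra consistent copies,
# so this list would be empty and the eject should have completed.
peers = [Cluster("cluster1", {"L00001", "L00002"}),
         Cluster("cluster2", {"L00001", "L00002"})]
print(eject_damaged_pvol(["L00001", "L00002"], peers))  # -> []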
The one potential complication that could have affected these attempts to EJECT or MOVE the LVOLs using other copies in the grid is that the timing for the removal of scratched LVOLs may be inconsistent. For instance, the first attempt to EJECT the bad volume occurred when 33 scratched LVOLs were 1 or 2 hours past their "earliest deletion on" date/time, and 2 scratched LVOLs had started their 1-day grace period just 1 or 2 hours prior to the EJECT attempt. Perhaps some of the 33 'eligible for deletion' LVOLs had already been physically deleted on other members of the grid. All 33 of the LVOLs eligible for deletion were in cache on two other clusters leading up to this event.
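A short worked example of that timing, using hypothetical timestamps (not taken from DTNA's logs), shows how the two populations of scratched LVOLs would have looked at the moment of the EJECT attempt:

# Hypothetical timestamps for illustration; not taken from DTNA's logs.
from datetime import datetime, timedelta

GRACE = timedelta(days=1)               # 1-day grace period after scratch
eject_time = datetime(2019, 1, 1, 12, 0)

# 33 LVOLs whose "earliest deletion on" time passed 1-2 hours before the EJECT
past_eligible = [eject_time - timedelta(minutes=60 + i) for i in range(33)]

# 2 LVOLs scratched only 1-2 hours before the EJECT: their deletion time
# (scratch time + grace period) is still almost a full day away
in_grace = [eject_time - timedelta(hours=h) + GRACE for h in (1, 2)]

eligible = [t for t in past_eligible + in_grace if t <= eject_time]
print(f"{len(eligible)} LVOLs eligible for deletion at EJECT time")  # 33

# Some of those 33 may already be physically deleted on peer clusters while
# still pending deletion locally -- the inconsistency suspected of
# complicating the EJECT.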
One reason this potential timing problem with scratched tapes comes to mind is that when the ROR was finally initiated by the IBM support center, all but 1 LVOL moved; the remaining LVOL was a scratched LVOL that had reached its 'eligible for deletion' time 21 hours prior to the ROR attempt. We have a cut-and-paste of a display that shows it no longer in cache on the companion clusters, yet apparently not physically removed from the cluster that owned the bad tape. There were 10 other LVOLs that had been scratched at the same time, and none of those held up the ROR, but that 1 LVOL lingered for some reason.
A second flaw is that on-demand Copy Export Reclaim appeared to be 'stuck'. DTNA had at least 33 PVOLs for which they had issued COPYEXP,RECLAIM commands. These were shown to be in CE_RECLAIM status, yet no reclaims were completing. After a ROR was done for the bad PVOL, getting that bad PVOL down to only 1 remaining LVOL (which was in scratched-but-not-removed status), the reclaims resumed. Oddly, once that 1 remaining LVOL was removed, the reclaims stopped once again. It took a forced pause and resume of the TS7760T to get the reclaims going again.
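One speculative model consistent with that behavior is an event-driven reclaim scheduler that re-evaluates CE_RECLAIM candidates only when certain triggers fire, so that "last blocking LVOL removed" never wakes it up but a pause/resume does. The sketch below is guesswork about the failure mode, with invented names throughout; it is not IBM's actual design:

# Pure speculation about the failure mode; all names invented, not IBM's design.
class ReclaimScheduler:
    def __init__(self, candidates):
        self.candidates = set(candidates)  # PVOLs sitting in CE_RECLAIM status
        self.blocked = True                # e.g. the bad PVOL pinning a resource

    def on_event(self, event):
        # Only certain events trigger re-evaluation. If "last blocking LVOL
        # removed" is not one of them, the queue sits idle even though
        # nothing actually blocks it any more.
        if event in ("ror_complete", "pause_resume"):
            self.blocked = False
            self.run_reclaims()

    def run_reclaims(self):
        if not self.blocked:
            for pvol in sorted(self.candidates):
                print(f"reclaiming {pvol}")
            self.candidates.clear()

sched = ReclaimScheduler({"P00001", "P00002"})
sched.on_event("lvol_removed")  # no wake-up: reclaims stay stuck
sched.on_event("pause_resume")  # forced pause/resume finally kicks them loose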
Attachment (Description): Full write-up email from the client on the perceived design shortcomings.