My experiences with ocrmirror, voting disks and stretched clusters

If you are about to migrate to Oracle RAC, there is an important thing to know about the OCRmirror. As far as I know, this is not documented anywhere in the Oracle documentation. This post discusses the failover properties of the OCRmirror.

The following is the typical line of reasoning of higher ICT management:

We want high availability, so we need to go to RAC (first mistake: the word “need”. That is what Oracle told them, but that’s another post). So let’s provision two nodes. Nice, we happen to have two server rooms, so this is excellent for high availability: one server in each room.

In this scenario, the idea naturally comes up to protect against the failure of one of the computer rooms. That’s what everyone turns out to want in the end.

The first problem, however, is: where do we put the storage? You can probably convince management to buy a second SAN. You want redundancy after all, don’t you?

Good news: Oracle says that ASM can be used to mirror the data. They say that ASM is great and that it solves our problems. Beep! Wrong again. The previous statement is both true and false. ASM is great indeed, but it doesn’t solve the storage problem. Distributing the SANs over the two server rooms is fine, and ASM will mirror the database data, but the cluster (Oracle CRS) happens to have a registry (OCR) and a voting disk as well. These cannot be mirrored by ASM: the cluster must be up before ASM can be started. So where do we put these?
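
For reference, this kind of host-based mirroring across two SANs is done with a normal-redundancy diskgroup and one failure group per SAN. A minimal sketch, with made-up diskgroup and device names, run against the ASM instance:

# e.g. ORACLE_SID=+ASM1; OS authentication as the oracle user
sqlplus -s / as sysdba <<'EOF'
-- one failure group per SAN, so each extent gets mirrored
-- on the other storage box
CREATE DISKGROUP data NORMAL REDUNDANCY
  FAILGROUP san1 DISK '/dev/raw/raw10', '/dev/raw/raw11'
  FAILGROUP san2 DISK '/dev/raw/raw20', '/dev/raw/raw21';
EOF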

Luckily Oracle 10.2 provides the option to configure multiple voting disks. One on each SAN? Wrong again. You need an odd number of voting disks, and a node can only continue when it sees a majority of them. So putting two voting disks on SAN1 and the third voting disk on SAN2 makes SAN1 a single point of failure for the whole cluster. And you probably can’t convince management to buy a third SAN and build a third server room.
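
A minimal sketch with crsctl (10.2 syntax; the device path is a made-up example, and adding a voting disk in 10.2 is meant to be done with Clusterware down, hence the -force flag):

# list the currently configured voting disks
crsctl query css votedisk

# with CRS down on all nodes, add the third voting disk
# on the independent third location (path is hypothetical)
crsctl add css votedisk /dev/raw/raw30 -force

With three voting disks a node must see at least two to stay in the cluster, which is exactly why no single failure domain may hold two of them.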

To solve this problem, you will need a third location anyway to store the third voting disk. Fortunately it only has to store the voting disk, and the voting disk is not accessed very intensively. So you can use a simple PC with a disk that you present over e.g. Fibre Channel, iSCSI, or even NFS. This third server should be placed in a location independent of whatever you want to protect yourself against. So if you want to protect against failure of one of the computer rooms, you will have to put the PC/machine outside both rooms. Also make sure it has a network connection to each server room that does not pass through the other room.
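
If you go the NFS route, a minimal sketch could look like this (hostnames, paths and mount options are illustrative assumptions, not a tested recommendation; check Oracle’s support notes for the options supported on your platform):

# on the quorum machine: /etc/exports
/votedisk  node1(rw,sync,no_wdelay) node2(rw,sync,no_wdelay)

# on each cluster node: hard-mount, so that failures surface
# as IO errors instead of silently dropped writes
mount -t nfs -o rw,hard,nointr,tcp,noac,vers=3,timeo=600 \
      quorumhost:/votedisk /votedisk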

This solves the problem of the voting disks, but what about the OCR? Very simple: Oracle says that since 10.2 you can have an OCRmirror as a copy of the OCR. So let’s put the OCR on SAN1 and the OCRmirror on SAN2. Simple, isn’t it? Beep! Wrong again.
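
Configuring the mirror itself is indeed simple; a sketch with example device paths (10.2 syntax, run as root):

# register the second OCR location
ocrconfig -replace ocrmirror /dev/raw/raw92

# verify both locations and their integrity
ocrcheck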

I recently did the test again at a customer site with two SANs: one OCR + 2 voting disks on SAN1, the OCRmirror and voting disk 3 on SAN2. Pull out the fibre connections to SAN2, and guess what happens. If you want the detailed logfile output, I can send it to you, but this is the summary:

Jan 23 11:07:18 <hostname> kernel: lpfc 0000:0a:00.0: 3:1305 Link Down Event x2 received Data: x2 x20 x110

It turns out that all IO is blocked. This depends on OS and software (Linux x86_64 + multipathd + device mapper in my case). The software queues all IO for 10 minutes (60 retries of 10 seconds) and then gives up. During these 10 minutes, a simple dd command on the failed lun hangs. It hangs so hard that even CTRL-C cannot interrupt it. The same is true for all database access that requires physical IO to these luns.

CSS is very robust and produces messages very quickly, saying that the third voting disk hangs (almost immediately after the problem, so it doesn’t hang for 10 minutes; one bonus point for CSS!).

[ CSSD]2008-01-23 11:08:59.492 [1168148800] >WARNING: clssnmDiskPMT: voting device hang at 50% fatal, termination in 99480 ms, disk (1//dev/raw/raw22)

In my case, the node is not rebooted; CSS does not evict it from the cluster because it still sees a majority of the voting disks (the two on SAN1, which is still available).

[CSSD]2008-01-23 11:10:38.978 [1168148800] >TRACE: clssnmDiskPMT: stale disk (200010 ms) (1//dev/raw/raw22)

All other commands that access the failed lun hang as well, and CTRL-C doesn’t help: ocrcheck, srvctl add service, … Now this is an OS issue outside the scope of Oracle: if an uninterruptible IO call does not return, the process stays blocked. (Remark: the 60 retries seem to be hard-coded in the multipathd software, but the interval can be set to 1 second, resulting in a hang of 1 minute instead of 10 minutes.) If you use this multipath software, you will need to decrease the interval, because you cannot afford to have your client applications hang for 10 minutes.
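
A minimal multipath.conf sketch of that tuning (option names as in Linux device-mapper-multipath; verify the semantics against the documentation of your version):

# /etc/multipath.conf (fragment)
defaults {
    # seconds between path checker runs; with the 60 retries
    # hard-coded in my version, 1s gives a hang of about
    # 1 minute instead of 10 minutes with a 10s interval
    polling_interval 1
}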

However, 10 minutes later the multipath software gives up on queueing the IO (/var/log/messages says: Jan 23 11:17:36 <hostname> multipathd: <lunname>: Disable queueing) and returns an IO error to the process. As a consequence all commands are unblocked and react to the IO errors. It now turns out that all cluster functionality that needs the OCR no longer works (remember, the OCRmirror is unavailable, the OCR is still available). The most surprising thing happens in the CRSD logfile. After a while it logs:

2008-01-23 11:22:13.768: [ CRSD][1518532928][PANIC]0Exception caught at cppStart

2008-01-23 11:22:13.768: [ CRSD][1518532928][PANIC]0cluinfo(memberid) failed for <hostname>

(File: caa_Cluster.cpp, line: 115)

and it dumps core. It doesn’t restart. Every OCR command then results in:

[client(20388)]CRS-1011:OCR cannot determine that the OCR content contains the latest updates. Details in /opt/oracle/crs/log/<hostname>/client/css135.log.

The only way to get out is to reboot the node. So we reboot (SAN2 is still disconnected). This means that the OCR, VD1 and VD2 are available, while the OCRmirror and VD3 are not yet available. And guess what: CRS does not start anymore. We see:

Jan 23 11:45:02 <hostname> logger: Cluster Ready Services waiting on dependencies. Diagnostics in /tmp/crsctl.6599

and

Oracle Database 11g CRS Release 10.2.0.3.0 Production Copyright 1996, 2006 Oracle. All rights reserved.
2008-01-23 11:38:33.078: [ OCROSD][2897458400]utopen:6m’:failed in stat OCR file/disk /dev/raw/raw91, errno=2, os err string=No such file or directory
2008-01-23 11:38:33.081: [ OCRRAW][2897458400]proprioini: disk 0 (/dev/raw/raw101) doesn’t have enough votes (1,2)
2008-01-23 11:38:33.081: [ OCRRAW][2897458400]proprinit: Could not open raw device
2008-01-23 11:38:33.081: [ default][2897458400]a_init:7!: Backend init unsuccessful : [26]
2008-01-23 11:38:33.081: [ CSSCLNT][2897458400]clsssinit: Unable to access OCR device in OCR init.PROC-26: Error while accessing the physical storage

Only after making the lun containing the OCRmirror available again does CRS start (reinserting the cables and reissuing the raw binding commands).
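
For completeness, the raw binding step looks like this on Linux (device names are illustrative and must match your own setup):

# rebind the raw device to the reappeared block device
raw /dev/raw/raw92 /dev/mapper/ocrmirror-lun

# list all current raw bindings to verify
raw -qa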

This test proves that the OCRmirror is not a failover for the OCR. In fact, by having an OCRmirror you are less available, because if either the OCR or the OCRmirror fails, your cluster no longer behaves normally. So what is the use of the OCRmirror then? Well, it can be used to repair an OCR that gets corrupted in some way (logical corruption, e.g. writing zeroes on it with dd, or a lun that is accidentally destroyed). It is even possible to do this online.
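
A sketch of such an online repair, assuming the OCR device got corrupted while the OCRmirror is intact (paths are examples; run as root on a node where CRS is up):

# point the OCR to a fresh, cleaned device; its content is
# rewritten from the surviving OCRmirror
ocrconfig -replace ocr /dev/raw/raw93

# check that both locations pass the integrity check again
ocrcheck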

It is however very important to realize that the errors on OCR access can be very different, depending on the problem (physical problem, logical corruption, OS device configuration, …). In some cases I assume the node can survive a failure of the OCR or OCRmirror, and a repair can be done online. However, there are cases where unavailability of one of them causes problems on the local node (as in the case described above). For this reason I prefer to put OCR and OCRmirror on the same SAN, optionally on different raid sets (suppose a raid set gets corrupted or is accidentally destroyed…).

Bottom line: using RAC for protection against site failure is not straightforward. As a consequence, the whole idea of “stretched clusters” with miles between the nodes is subject to the same problems. Look at what happens in case of site failure:

  • In each mirrored ASM diskgroup, one mirror will be removed when it is physically no longer accessible. Afterwards, this requires manual intervention: first clean up all the metadata of the failed disks (sometimes a disk is not cleanly removed, and the “drop disk force” syntax might help you). You need to manually erase the contents of the failed disk before you can add it again (easy on unix with dd, less obvious on windows). Then you need to rebuild the mirror from scratch, which takes time and IO resources (see the sketch after this list).
  • If you have the OCR at one site and the OCRmirror at the other, your cluster may crash and you will probably have to reboot your nodes to get everything stable again. If you put OCR and OCRmirror on the same site, then that site is a single point of failure. If you use OS mirroring for the OCR, manual intervention will be required to resync the mirror, and hopefully you don’t copy in the wrong direction.
  • Site failures, it turns out, usually don’t happen during working hours, but on weekends or at moments when the most experienced people are on holiday.
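
A sketch of the manual ASM recovery from the first bullet (diskgroup, disk and device names are made up, and the exact steps depend on how the disk failed):

# run against the ASM instance, e.g. ORACLE_SID=+ASM1
sqlplus -s / as sysdba <<'EOF'
-- force-drop the stale mirror if ASM did not remove it cleanly
ALTER DISKGROUP data DROP DISK data_0003 FORCE;
EOF

# erase the old ASM header so the disk can be re-added
dd if=/dev/zero of=/dev/raw/raw40 bs=1M count=10

sqlplus -s / as sysdba <<'EOF'
-- re-add the disk; the rebalance that follows rebuilds the
-- mirror from scratch and costs time and IO resources
ALTER DISKGROUP data ADD DISK '/dev/raw/raw40' NAME data_0003;
ALTER DISKGROUP data REBALANCE POWER 4;
EOF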

Conclusion: for the moment I prefer not to mirror storage in a RAC environment. I prefer to invest in one highly redundant SAN per site. If you want to protect against site failure, consider the use of Data Guard. That is a solution designed for separate storage at the primary and standby sites and for long distances between the two. I hope that OCR failover capabilities will improve in the next release of Oracle Clusterware; then my conclusions may change.

I emphasize that all of the above is my own personal opinion, experience and recommendation. I can also confirm that there are cases where unavailability of the OCRmirror causes no such problems (except for e.g. an “integrity check failed” message in ocrcheck) and a repair can be done online. It all depends on the kind of failure.

14 Responses to My experiences with ocrmirror, voting disks and stretched clusters

  1. prodlife says:

    Excellent post! Thank you for sharing the experience and warning from this undocumented trap.

  2. AFAIK VxFS (Veritas cluster filesystem) is capable of host-based mirroring: you can write your data to two SANs, which can be accessed by two hosts… It’s not cheap, but a lot cheaper than setting up a second site.

  3. Airsentry says:

    Thank you for your valuable post.
    We also saw the error “doesn’t have enough votes (1,2)” for the OCR, but later we found that the error appeared because, simultaneously with the OCR mirror, one of the 3 voting disks was unavailable (they are on the same storage system). And the third voting disk on NFS was not really available, although it is listed in “crsctl query css votedisk”. When we solved the voting disk problems, the error disappeared.
    The RAC 11g documentation also says:
    “The OCR has a mechanism that prevents data loss due to accidental overwrites. If you configure a mirrored OCR and if Oracle Clusterware cannot access the two mirrored OCR locations and also cannot verify that the available OCR location contains the most recent configuration, then Oracle Clusterware prevents further modification to the available OCR location.”
    Is this close to what you found in your experiments?

  4. Shri says:

    Excellent post. A good experiment and update.

  5. Philip says:

    When you get the chance you should try this test again with 10.2.0.4

  6. Jakub Wartak says:

    From my “home” experiments, 11g extended RAC on Linux with iSCSI initiators and a 3rd voting disk on NFS is able to survive a crash of SAN2 with the OCRmirror and one voting disk without problems. However, no multipathing is used for iSCSI. The IO error stale time can be tuned in the Linux iSCSI initiator (default 120s).

  7. pier00 says:

    Recently I did more experiments with this and indeed, my insight has grown. You can build stretched clusters and rely on the ocrmirror, if only you know how it works internally and how to recover from errors. I will write another blog about this soon.

    • truff says:

      Hi Pier00,

      did you have the chance to publish your experiments? I have a setup that looks like the one explained here (but using NFS NAS instead of SAN). I have the exact same problem as the one described here and would be interested in knowing how to use OCR and OCRMIRROR on two NAS devices.

  8. Mikael says:

    Hi, excellent blog. I recently managed a test of an HA solution using Oracle RAC. Indeed, we encountered problems right away when simulating the loss of a SAN on one node. The technicians scratched their heads while reading the Oracle manuals and found nothing. Searching for the problem led me to this blog, and much-needed answers. We have a workaround in place in case the problem occurs. Now I am curious to find out what you have found out regarding OCR mirror and stretched clusters.

  9. Jatin says:

    Appreciate the testing scenario

  10. Hardik says:

    Hi Mate,

    only a word for you: Excellent!

    that was quite a test.. keep posting…

  11. […] story about OCR, OCRMIRROR and 2 storage boxes – Introduction Some time ago I wrote a blog about stretched clusters and the OCR. The final conclusion at that time was that there was no easy way to get your OCR safe on both […]

  12. Mpho says:

    Great Stuff!
