The ultimate story about OCR, OCRMIRROR and 2 storage boxes – Chapter 5

Scenario 5: Loss of ocrmirror from non-ocr-master – reloaded

This is a follow-up to chapter 4.
In this final scenario we do the same thing as in scenario 4: while crs is running on both nodes, we hide the ocrmirror from the non-ocr-master node, which is node 2 this time.
So node 1 is the master; we hide the ocrmirror from node 2 and we verify on node 2 (see the aside after the dd output for one way to identify the ocr master):

(nodeb01 /app/oracle/crs/log/nodeb01) $ dd if=/dev/oracle/ocrmirror of=/dev/null bs=64k count=1
dd: /dev/oracle/ocrmirror: open: I/O error
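
As an aside: how did we determine that node 1 is the ocr master? One way (a hedged suggestion, not part of the original transcript, and assuming the default crsd logfile location under $CRS_HOME/log/&lt;nodename&gt;/crsd) is to grep the crsd logfiles on both nodes for the master election messages:

(nodea01 /app/oracle/crs/log/nodea01) $ grep -i "OCR MASTER" crsd/crsd.log
(nodeb01 /app/oracle/crs/log/nodeb01) $ grep -i "OCR MASTER" crsd/crsd.log

The node whose crsd most recently logged that it became the new ocr master is the current master.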

What happens?

As we know from scenario 4, ocrcheck on node 2 now fails with:

(nodeb01 /app/oracle/crs/log/nodeb01) $ ocrcheck
PROT-602: Failed to retrieve data from the cluster registry

On node 1 all is ok. This is still the same as scenario 4, but in scenario 4 we then stopped crs on the ocr master, which can see both luns. In this scenario we will now stop crs on the non-master node (node 2), which can see only the ocr device.

And now it gets interesting….

-bash-3.00# crsctl stop crs
OCR initialization failed accessing OCR device: PROC-26: Error while accessing the physical storage

Did I say “really interesting”? We don’t seem to be able to stop crs anymore on the non-ocr-master node. Maybe it is worth referring to the RAC FAQ on Metalink, which says: “If the corruption happens while the Oracle Clusterware stack is up and running, then the corruption will be tolerated and the Oracle Clusterware will continue to function without interruptions”. That’s true, but it doesn’t seem to say anything about stopping crs. Anyway, the real “playing” continues:

Let’s try to tell Oracle CRS that the ocr is the correct version to continue with, and kindly ask it to increase its vote count to 2. We do this as root on node 2 and get:

ocrconfig -overwrite
PROT-19: Cannot proceed while clusterware is running. Shutdown clusterware first

Deadlock on node 2! We can’t stop crs, but in order to correct the problem, crs has to be down…
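
To confirm that the clusterware stack on node 2 is indeed still up despite the failed stop, a quick health check can be done (a hedged aside, not part of the original transcript):

-bash-3.00# crsctl check crs

On a healthy stack (10gR2 in this setup, I assume) this should report that CSS, CRS and EVM appear healthy, which sums up the awkward situation: the stack is running, yet it refuses to stop.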

Moreover, at this point it is no longer possible to modify the OCR. Both nodes now give:

(nodea01 /app/oracle/crs/log/nodea01/client) $ srvctl remove service -d ARES -s aressrv
PRKR-1007 : getting of cluster database ARES configuration failed, PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]
PRKO-2005 : Application error: Failure in getting Cluster Database Configuration for: ARES

And running the above command on either node always produces the following in the alert logfile of node 1 (the ocr master):

[  OCRAPI][29]a_check_permission_int: Other doesn't have permission

Note: “srvctl add service” doesn’t work either.

Now it seems like things are really messed up. We have never seen permission errors before. Please be aware that the steps below are the ones I took while trying to get things right again. There may be other options, but I only ran this scenario once, with the steps below:

As the original root cause of the problem was making the ocrmirror unavailable, let’s try to tell the cluster to forget about this ocrmirror, and continue only with ocr, which is still visible on both nodes.

So in order to remove ocrmirror from the configuration, we do as root on node 2:

-bash-3.00# ocrconfig -replace ocrmirror ""

Note: specifying an empty string ("") is used to remove the raw device from the configuration.
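
For completeness (a hedged note based on the documented ocrconfig syntax; this was not done in this scenario): once the storage problem has been fixed, the mirror can be added back later by passing the device path instead of the empty string:

-bash-3.00# ocrconfig -replace ocrmirror /dev/oracle/ocrmirror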

At that time in the crs logfile of node 1:

2008-07-23 11:11:18.136: [  OCRRAW][29]proprioo: for disk 0 (/dev/oracle/ocr), id match (0), my id set (1385758746,1028247821) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1866209186) my votes (1), total votes (2)
2008-07-23 11:11:18.136: [  OCRRAW][29]propriowv_bootbuf: Vote information on disk 0 [/dev/oracle/ocr] is adjusted from [1/2] to [2/2]
2008-07-23 11:11:18.195: [  OCRMAS][25]th_master: Deleted ver keys from cache (master)
2008-07-23 11:11:18.195: [  OCRMAS][25]th_master: Deleted ver keys from cache (master)

That looks ok. We will be left with one ocr device having 2 votes. This is intended behaviour.

In the alert file of node 1, we see:

2008-07-23 11:11:18.125
[crsd(26268)]CRS-1010:The OCR mirror location /dev/oracle/ocrmirror was removed.

and in the crs logfile of node 2:

2008-07-23 11:11:18.155: [  OCRRAW][34]proprioo: for disk 0 (/dev/oracle/ocr), id match (1), my id set (1385758746,1028247821) total id sets (2), 1st set (1385758746,1866209186), 2nd set (1385758746,1028247821) my votes (2), total votes (2)
2008-07-23 11:11:18.223: [  OCRMAS][25]th_master: Deleted ver keys from cache (non master)
2008-07-23 11:11:18.223: [  OCRMAS][25]th_master: Deleted ver keys from cache (non master)

(node 2 updates its local cache) and in the alert file of node 2:

2008-07-23 11:11:18.150
[crsd(10831)]CRS-1010:The OCR mirror location /dev/oracle/ocrmirror was removed.

Now we do an ocrcheck on node 2:

(nodeb01 /app/oracle/crs/log/nodeb01) $ ocrcheck
Status of Oracle Cluster Registry is as follows :
         Version                  :          2
         Total space (kbytes)     :     295452
         Used space (kbytes)      :       5600
         Available space (kbytes) :     289852
         ID                       : 1930338735
         Device/File Name         : /dev/oracle/ocr
                                    Device/File integrity check succeeded
                                    Device/File not configured
         Cluster registry integrity check succeeded

Now the configuration looks ok again, but the error remains on node 2 (we do this as user oracle):

(nodeb01 /app/oracle/crs/log/nodeb01) $ srvctl remove service -d ARES -s aressrv
PRKR-1007 : getting of cluster database ARES configuration failed, PROC-5: User does not have permission to perform a cluster registry operation on this key. Authentication error [User does not have permission to perform this operation] [0]
PRKO-2005 : Application error: Failure in getting Cluster Database Configuration for: ARES

However doing the same command as root on node 2 succeeds:

-bash-3.00# srvctl remove service -d ARES -s aressrv
aressrv PREF: ARES1 AVAIL: ARES2
Service aressrv is disabled.
Remove service aressrv from the database ARES? (y/[n]) y

After this, managing the resources as user oracle succeeds again:

(nodeb01 /app/oracle/crs/log/nodeb01) $ srvctl add service -d ARES -s aressrv2 -r ARES1
(nodeb01 /app/oracle/crs/log/nodeb01) $ srvctl remove service -d ARES -s aressrv2
aressrv2 PREF: ARES1 AVAIL:
Remove service aressrv2 from the database ARES? (y/[n]) y

Unfortunately, this is where the internals end. At the time of my testing I had no time to investigate this further, and since then I have not had time to build and test a similar setup (that’s why this blog posting took so long; I would have loved to do more research on this). However, I remember doing some more testing at a customer site (I have no transcript of that, so no details to write here), and I can still tell the following:

For some reason, the ownership of the ARES resource in the OCR seems to have changed from oracle to root. A way to get out of this is to use the following commands:

 crs_getperm <resource_name>
 crs_setperm <resource_name> -o oracle | -g dba

This allows you to change the ownership back to oracle, after which everything becomes ok again (a minimal sketch follows below).
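
A minimal sketch of what this looks like for the ARES database resource, run as root on one of the nodes (the resource name ora.ARES.db is an assumption based on the usual ora.<dbname>.db naming; I have no transcript of the exact commands used):

-bash-3.00# crs_getperm ora.ARES.db               # show current owner, group and ACL of the resource
-bash-3.00# crs_setperm ora.ARES.db -o oracle     # set the owner back to oracle
-bash-3.00# crs_setperm ora.ARES.db -g dba        # and, if needed, the group back to dba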

I can’t say where it went wrong. Maybe I did something as root instead of oracle without knowing it (although I double-checked my transcripts). I think it went wrong at the moment when I first tried to stop crs as root on node 2 and then ran an “ocrconfig -overwrite” as root on node 2. I wonder if something was then sent to node 1 (the ocr master) as root that may have changed some permissions in the ocr…? If anyone has the time and resources to investigate this further, please don’t hesitate to do so and let me know the results. In this way, you may gain perpetual honour in my personal in-memory list of great Oracle guys.

Conclusion

Although crs is very robust and using 2 storage boxes works fine, there may be situations where you get unexpected error messages. Hopefully this chapter will help you get out of such a situation without problems, and strengthen your confidence in Oracle RAC.

Let’s make a final conclusion in the next chapter…
