Voting Disk and Split Brain Syndrome in Oracle RAC
The Oracle Clusterware includes two important components: the voting disk and the Oracle Cluster Registry (OCR). The voting disk is a file that manages information about node membership and the OCR is a file that manages cluster and RAC database configuration information.
Oracle recommends that you select the option to configure multiple voting disks during Oracle Clusterware installation to improve availability.
Backing up Voting Disks
Run the following command to back up a voting disk. Perform this operation on every voting disk as needed where voting_disk_name is the name of the active voting disk and backup_file_name is the name of the file to which you want to back up the voting disk contents:
dd if=voting_disk_name of=backup_file_name
Recovering Voting Disks
Run the following command to recover a voting disk where backup_file_name is the name of the voting disk backup file and voting_disk_name is the name of the active voting disk:
dd if=backup_file_name of=voting_disk_name
Note: If you have multiple voting disks, then you can remove the voting disks and add them back into your environment using the crsctl delete css votedisk path and crsctl add css votedisk path commands respectively, where path is the complete path of the location on which the voting disk resides.
Changing the Voting Disk Configuration after Installing Real Application Clusters
You can dynamically add and remove voting disks after installing Real Application Clusters. Do this using the following commands where path is the fully qualified path for the additional voting disk. Run the following command as the root user to add a voting disk:
crsctl add css votedisk path
Run the following command as the root user to remove a voting disk:
crsctl delete css votedisk path
Split Brain Syndrome, In a Oracle RAC environment all the instances/servers communicate with each other using high-speed interconnects on the private network. This private network interface or interconnect are redundant and are only used for inter-instance oracle data block transfers. Now talking about split-brain concept with respect to oracle RAC systems, it occurs when the instance members in a RAC fail to ping/connect to each other via this private interconnect, but the servers are all physically up and running and the database instance on each of these servers is also running. These individual nodes are running fine and can conceptually accept user connections and work independently. So basically due to lack of communication the instance thinks that the other instance that it is not able to connect is down and it needs to do something about the situation. The problem is if we leave these instances running, the same block might get read, updated in these individual instances and there would be data integrity issue, as the blocks changed in one instance, will not be locked and could be over-written by another instance. This situation is termed as Split Brain Syndrome.
I/O Fencing, there will be some situation where the leftover write operations from failed database instances (The cluster function failed on the nodes, but the nodes are still running at OS level) reach the storage system after the recovery process starts. Since these write operations are no longer in the proper serial order, they can damage the consistency of the data stored data. Therefore when a cluster node fails, the failed node needs to be fenced off from all the shared disk devices or disk groups. This methodology is called I/O fencing or failure fencing
Simple Majority Rule, According to Oracle – “An absolute majority of voting disks configured (more than half) must be available and responsive at all times for Oracle Clusterware to operate.” Which means to survive from loss of ‘N’ voting disks, you must configure atleast ‘2N+1′ voting disks.
Now we are in a state to understand the use of voting disk in case of heartbeat failure.
Example 1:- Suppose in a 3 node cluster with 3 voting disks, a network heartbeat fails between Node 1 and Node 3 & Node 2 and Node 3 whereas Node 1 and Node 2 are able to communicate via interconnect, and from the Voting Disk CSSD notices that all the nodes are able to write to Voting Disks thus spli-brain, so the healthy nodes Node 1 & Node 2 would would update the kill block in the voting disk for Node 3. Then when during pread() system call of CSSD of Node 3, it sees a self kill flag set and thus the CSSD of Node 3 evicts itself. And then the I/O fencing and finally the OHASD will finally attempt to restart the stack after graceful shutdown.
Example 2:- Suppose in a 2 node cluster with 3 voting disk, a disk heartbeat fails such that Node 1 can see 2 Voting Disks and Node 2 can see 1 Voting Disk, ( If here the Voting Disk wouldn’t have been odd then both the Nodes would have thought the other node should be killed hence would have been difficult to avoid split-brain), thus based on Simple Majority Rule, CSSD process of Node 1 (2 Voting Disks) sends a kill request to the CSSD process of Node 2 (1 Voting Disk) and thus the Node 2 evicts itself and then the I/O fencing and finally the OHASD will finally attempt to restart the stack after graceful shutdown.
managing Voting Disk,
From 11g release 2, voting disks can be stored on either on ASM diskgroups or on cluster file systems, following are the steps to add/delete/move a voting disk:-
How to add a Voting Disk
When voting disk is placed on Cluster file system.
==>crsctl add css votedisk
==>alter diskgroup
' force;How to delete a Voting Disk
When voting disk is placed on Cluster file system.
==>crsctl delete css votedisk
How to Replace a Voting Disk
To replace a voting disk first add a new Voting Disk and then Delete the one you wanted to replace, using the command explained above.
How to move a Voting Disk between diskgroup & CFS
First check the status of voting disk, with following command.
==>crsctl query css votedisk
This would display the current status of voting disk, then follow the below command to move it to ASM diskgroup from Cluster file system or vice versa:-
==>crsctl replace votedisk
And then to verify the changes, issue :-
==>crsctl query css votedisk
How to Backup Voting Disk
Since 11gr2, the voting disk are backed up automatically in the OCR, as part of any configuration change and is restored automatically to any added disk.
Comments
Post a Comment