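Oracle RAC can check the health of the cluster at several levels and through several different mechanisms: it uses heartbeats together with a voting algorithm to isolate faults. When a node is detected as failed, that node is evicted from the cluster so that it cannot corrupt data. This article describes the heartbeat mechanisms used in Oracle RAC and how to adjust the heartbeat parameters.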
1. OCSSD and CSS
OCSSD is a Linux or Unix process that manages and provides Cluster Synchronization Services (CSS). It runs as the oracle user and provides node membership management; if the process fails, the node reboots. CSS provides two heartbeat mechanisms: a network heartbeat and a disk heartbeat. Each has a maximum tolerated delay. For the network heartbeat this delay is called MC (Misscount); for the disk heartbeat it is called IOT (I/O Timeout).
Both parameters are measured in seconds. By default, Misscount < Disktimeout.
The two heartbeat mechanisms are described in turn below.
2. Network Heartbeat
As the name suggests, the network heartbeat checks node status over the private network. If a hardware or software problem on the private network keeps the cluster nodes from communicating normally for a certain period, the result is a split brain. Because storage in a cluster is shared, the faulty node must then be isolated from the cluster to avoid data corruption. The network heartbeat works as follows:
Every second, a sending thread in the cssd sends a network TCP heartbeat to itself and all nodes. The receiving thread of ocssd.bin receives the heartbeat.
If a network packet is dropped or has an error, the error correction mechanism in TCP retransmits the packet.
Oracle does not retransmit. In ocssd.log you will see a WARNING message about a missing heartbeat if a node does not receive a heartbeat from another node for 15 seconds (50% of misscount). Another warning is reported in ocssd.log if the same node is missing for 22 seconds (75% of misscount), and another if it is still missing after 27 seconds (90% of misscount). When the heartbeat has been missing for 100% of misscount, 30 seconds, the node is evicted.
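The warning and eviction stages above can be written out as simple arithmetic. The following is a minimal sketch rather than Oracle's implementation: the function names and stage labels are hypothetical, and rounding down to whole seconds is an assumption chosen because it reproduces the 22-second figure quoted above for a misscount of 30.

```python
def heartbeat_thresholds(misscount=30):
    """Seconds of missed heartbeats at which each stage is reached.

    The 50% / 75% / 90% / 100% steps come from the description above.
    Rounding down to whole seconds is an assumption of this sketch.
    """
    return {
        "warning_50": misscount * 50 // 100,
        "warning_75": misscount * 75 // 100,
        "warning_90": misscount * 90 // 100,
        "eviction": misscount,
    }


def stage_for(missed_seconds, misscount=30):
    """Return the last stage reached after missed_seconds without a heartbeat."""
    stage = "ok"
    # Thresholds are stored in ascending order, so the last match wins.
    for name, limit in heartbeat_thresholds(misscount).items():
        if missed_seconds >= limit:
            stage = name
    return stage
```

With the default misscount of 30, `heartbeat_thresholds()` gives 15, 22, 27 and 30 seconds, so `stage_for(22)` falls in the 75% warning stage and `stage_for(30)` corresponds to eviction.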
The maximum delay for this network heartbeat is called misscount. It can be queried and modified with the crsctl tool.
[grid@Linux-01 ~]$ crsctl get css misscount
CRS-4678: Successful get misscount 30 for Cluster Synchronization Services.
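For monitoring scripts it can be handy to pull the numeric value out of this message. The sketch below assumes the output has exactly the CRS-4678 form shown above; `parse_misscount` is a hypothetical helper, not part of any Oracle tooling, and it only parses text that has already been captured from the command.

```python
import re


def parse_misscount(output):
    """Extract the misscount value (in seconds) from crsctl output text.

    Assumes the message format shown above:
    'CRS-4678: Successful get misscount <N> for Cluster Synchronization Services.'
    """
    match = re.search(r"CRS-4678: Successful get misscount (\d+)", output)
    if match is None:
        raise ValueError("no CRS-4678 misscount message found in output")
    return int(match.group(1))


sample = "CRS-4678: Successful get misscount 30 for Cluster Synchronization Services."
misscount_seconds = parse_misscount(sample)
```

Raising an error on unexpected text, instead of returning a default, keeps a failed query from being mistaken for a valid setting.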