RAC Node Eviction Troubleshooting
STEPS:
1. Look at the cssd.log files on both nodes; if the first node was evicted, the second node
usually has more information. Also take a look at the crsd.log file.
All Clusterware log files are stored under the $ORA_CRS_HOME/log/hostname directory.
2. The evicted node will have a core dump file generated and system reboot information.
3. Find out whether the node rebooted and whether CRS or something else caused it; check the
system reboot time.
4. If the cssd.log file shows “Polling” keywords with reducing percentage values, the
eviction is probably due to the network. If you see “Diskpingout” or something related to -DISK-,
then the eviction is because of a disk timeout.
5. Once you know whether it is a network or a disk issue, start going into depth.
6. Now it’s time to collect NMON/OSW/RDA reports to confirm and justify whether it was a disk
issue or a network issue.
7. If the reports show heavy memory contention or paging, collect an AWR report to see what
load and SQL were running during that period.
8. If the network was the issue, check whether any NICs were down or whether link switching has
happened. Also check that the private interconnect is working between both nodes.
9. Sometimes eviction could also be due to an OS error where the system is in a halt state for a
while, to memory overcommitment, or to 100% CPU usage.
11. What changed recently? Ask your coworker to open a ticket with Oracle and upload the
logs.
12. Check the health of Clusterware, DB instances, ASM instances, the uptime of all hosts, and all
the logs for that particular timestamp: ASM logs, Grid logs, CRS and ocssd.log, HAS logs, EVM logs,
DB instance logs, OS logs, and SAN logs.
13. Check the health of the interconnect if the error logs guide you in that direction.
14. Check OS memory and CPU usage if the error logs guide you in that direction.
16. Run TFA and OSWatcher, and check netstat output, ifconfig settings, etc., based on the error
messages you find during your RCA.
17. We once had node evictions because iptables had been enabled. After iptables was turned off,
everything went back to normal.
The usual advice is to avoid enabling firewalls between the nodes, and that appears to be true.
An ACL can open the ports on the interconnect, as we did, but we still experienced all kinds of
issues (unable to start CRS, unable to stop CRS, and node eviction).
We also had a problem with the voting disk, caused by presenting LDEVs using business copies /
ShadowImage, which made RAC less than happy.
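
The keyword check from step 4 can be roughed out as a small script. This is only a sketch: the two patterns come straight from the keywords mentioned in step 4, the exact message text differs between Clusterware releases, and the function names (classify_line, summarize, likely_cause) are made up for illustration. Treat the result as a hint about where to dig next, not as a diagnosis.

```python
import re
from collections import Counter

# Keywords from step 4. The real message text varies by release,
# so these patterns are assumptions to adjust for your environment.
NETWORK_PATTERN = re.compile(r"polling", re.IGNORECASE)
DISK_PATTERN = re.compile(r"diskpingout|disk", re.IGNORECASE)


def classify_line(line):
    """Return 'network', 'disk', or None for a single log line."""
    if NETWORK_PATTERN.search(line):
        return "network"
    if DISK_PATTERN.search(line):
        return "disk"
    return None


def summarize(lines):
    """Count network and disk hints across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        hint = classify_line(line)
        if hint:
            counts[hint] += 1
    return counts


def likely_cause(lines):
    """Return 'network', 'disk', 'mixed', or 'unknown' based on hint counts."""
    counts = summarize(lines)
    if not counts:
        return "unknown"
    if counts["network"] == counts["disk"]:
        return "mixed"
    return "network" if counts["network"] > counts["disk"] else "disk"
```

To use it, open the cssd log from each node and pass the file object to likely_cause, since a file object iterates line by line. Running it against the surviving node's log as well as the evicted node's log follows the advice in step 1.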