Exchange Server Forums
CCR Failover Problem
CCR Failover Problem - 1.Aug.2007 9:32:07 AM
rparsons1000
Posts: 143
Joined: 29.Aug.2006
Status: offline
|
I recently tried to simulate a failure on the active node of a CCR cluster and things did not go right. I pulled the network cables on the active node and the resources moved to the passive node. The Cluster Group came online, but all of the node resources went to "online pending" and then, after 4 minutes, they all said "offline". I plugged the network cables back into the active node and within seconds all resources came online on the passive node. I moved everything back and had to re-seed one storage group. If I do a graceful shutdown of the active node, everything moves over fine. Any ideas why I am not getting a good failover on a hard crash?
|
RE: CCR Failover Problem - 1.Aug.2007 11:07:16 AM
John Weber
Posts: 584
Joined: 20.Apr.2005
From: Portland, Oregon
Status: offline
|
It sounds like the Majority Node Set (MNS) quorum is not functioning. Verify that both cluster nodes have Microsoft KB 921181 installed. -jmw
|
RE: CCR Failover Problem - 3.Aug.2007 6:34:26 AM
rparsons1000
Posts: 143
Joined: 29.Aug.2006
Status: offline
|
I am fairly certain I installed it. MNS is running, and I remember it being a bugger to install. It is one of the resources in the cluster that moved and did start back up. I don't see the hotfix in Add/Remove Programs. clussvc.exe is version 5.2.3790.3959. How can I be sure it is installed, and installed correctly?
< Message edited by rparsons1000 -- 3.Aug.2007 6:44:56 AM >
|
RE: CCR Failover Problem - 3.Aug.2007 8:10:08 AM
Henrik Walther
Posts: 6827
Joined: 21.Nov.2002
From: Copenhagen, Denmark
Status: offline
|
I've forwarded this on to a trusted source of mine...
_____________________________
HTH Henrik Walther Exchange MVP & MCM: Exchange 2007 MCITP: Exchange 2007, MCITP: Windows Server 2008, MCSE: M+S Order my Exchange Server 2007 Book!
|
RE: CCR Failover Problem - 3.Aug.2007 9:04:38 AM
rparsons1000
Posts: 143
Joined: 29.Aug.2006
Status: offline
|
Thanks Henrik. BTW, good job on getting the article on NLB for CAS going. Lots of people looking forward to that one.
|
RE: CCR Failover Problem - 26.Aug.2007 6:38:46 PM
Elan Shudnow
Posts: 544
Joined: 4.Jan.2007
From: Chicago, IL
Status: online
|
I just visited this high availability section to see if someone was having this problem, and voila! I'm noticing the same issue. I'm 100% sure that I installed the KBs and my MNS is working just fine. My CCR is set to best availability but it's behaving as if it were lossless. When I pull the connectivity on the active node, everything moves over to the passive node just fine, but the storage groups sit at online pending and then just go offline. When I bring the old active node back up, the storage groups are brought online. It's as if this is working as lossless instead of best availability. It seems like a bug, as best availability should automatically bring the storage groups online and then mount the databases. I tested this out with RTM. I just installed Rollup 4 and tested this again, and everything with best availability started working as it should.

Edit: I just ran some more tests and it's still working. So to the original poster, I would try installing Rollup 4 and see if it fixes your problem, since it did for me.
< Message edited by eshudnow -- 26.Aug.2007 6:45:22 PM >
|
RE: CCR Failover Problem - 27.Aug.2007 2:56:14 AM
Henrik Walther
Posts: 6827
Joined: 21.Nov.2002
From: Copenhagen, Denmark
Status: offline
|
If you pull the network cable, the resources will move; that is expected. There are two things you need to know here: what the availability setting is and what the copy queue length is. Remember that we have protections built into CCR known as the availability settings: good availability, best availability, and lossless. These translate into the number of log files that can be lost on a failover. Good = 3, best = 5 (or 7, I forget), and lossless = 1, I believe. I haven't looked at those numbers in quite a while.

When a failover occurs, we interrogate the availability settings. If the copy queue length (the number of logs pending copy) is greater than the availability setting allows, we wait. This is where the replication service waits for the other node to come back so that it can copy the needed log files over. Hence you plug the network cable in, the machine is available, and the replication service copies the logs and mounts the database.

I believe you can check this by using get-clusteredmailboxserver -server <name> | fl to see the availability settings. Get-storagegroupcopystatus -identity <SGName> | fl will give you the copy queue length. Check the app log first, though, as the events should be right there, I believe.
_____________________________
HTH Henrik Walther Exchange MVP & MCM: Exchange 2007 MCITP: Exchange 2007, MCITP: Windows Server 2008, MCSE: M+S Order my Exchange Server 2007 Book!
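The wait-or-mount rule described in the post above can be sketched in a few lines of Python. This is only an illustration of the comparison as it is explained in the thread, not Exchange's actual implementation. The names `StorageGroup`, `can_auto_mount`, and `plan_failover` are hypothetical, and the threshold of 5 is a placeholder rather than a documented value.

```python
from dataclasses import dataclass


@dataclass
class StorageGroup:
    name: str
    copy_queue_length: int  # logs not yet copied to the passive node


def can_auto_mount(copy_queue_length: int, max_lost_logs: int) -> bool:
    """Return True when the pending logs fit within the allowed loss."""
    return copy_queue_length <= max_lost_logs


def plan_failover(groups, max_lost_logs):
    """Split storage groups into those that mount and those that wait."""
    mount, wait = [], []
    for group in groups:
        if can_auto_mount(group.copy_queue_length, max_lost_logs):
            mount.append(group.name)
        else:
            wait.append(group.name)
    return mount, wait


# Hypothetical data: SG1 is fully copied, SG2 has a large backlog.
groups = [StorageGroup("SG1", 0), StorageGroup("SG2", 40)]
mount, wait = plan_failover(groups, max_lost_logs=5)  # placeholder threshold
print("mount:", mount)
print("wait:", wait)
```

Under this reading, a storage group with a large backlog of uncopied logs stays offline until the failed node returns and the remaining logs can be copied, which matches the "online pending" then "offline" behaviour reported earlier in the thread.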
|
RE: CCR Failover Problem - 27.Aug.2007 10:48:13 AM
a.grogan
Posts: 1887
Joined: 12.Apr.2005
From: London
Status: offline
|
Hiya chap, just a thought, but is the File Share Witness share on one of the nodes in the cluster? I have seen this cause problems. Cheers A
_____________________________
Andy Grogan MSExchange.org Forums Moderator For my general ramblings about Exchange please visit my blog: W: http://telnetport25.wordpress.com/ M: manifoldmaster@gmail.com
|
RE: CCR Failover Problem - 27.Aug.2007 11:11:56 AM
rparsons1000
Posts: 143
Joined: 29.Aug.2006
Status: offline
|
This certainly makes sense. I did not have a backup for a while, so I had many, many log files and even had to re-seed due to an issue after failing back. Let's assume, though, that this were a real failure and you exceeded the AutoDatabaseMountDial parameters. How would you get the DBs mounted? I am wondering if this is not somehow addressed in Rollup 4. For further reading: http://technet.microsoft.com/en-us/library/bb288910.aspx
|
RE: CCR Failover Problem - 27.Aug.2007 6:35:33 PM
rparsons1000
Posts: 143
Joined: 29.Aug.2006
Status: offline
|
I put the FSW on a Hub Transport role server. I think Henrik is on the right track with the log files; it makes sense. I will test again tomorrow after Rollup 4 is applied. Thanks.
|
RE: CCR Failover Problem - 28.Aug.2007 12:18:30 AM
rparsons1000
Posts: 143
Joined: 29.Aug.2006
Status: offline
|
I just spent the last 2 hours doing more testing and here is what I found:

The AutoDatabaseMountDial is set to best, meaning 3 lost log files before "failure". The documentation really leads me to believe it is 3 or more.

Tested again by shutting down the active node: worked perfectly.

Tested again by removing the network cables: still, all DBs fail to mount or come online. All other resources come online. If I try to mount in the console, it says I have to run repair-storagegroup because replication is still occurring. I have 15 DBs/storage groups, so I did this to one arbitrarily and all of a sudden they all came online except Public Folder, since it is a replicated folder (from 2003).

I will check the copy queue length next, but I was unable to find where to look for it or set it. Does it need to be lower than the AutoDatabaseMountDial?
|
RE: CCR Failover Problem - 28.Aug.2007 12:53:37 PM
rparsons1000
Posts: 143
Joined: 29.Aug.2006
Status: offline
|
I ran get-storagegroupcopystatus and the copy queue length is 0. Bottom line: failover is good when shutting down, and how often does a server really just crash and burn? If that happens, I will just run the restore cmdlet, keeping our downtime minimal. I'll be keeping an eye out for further info, though.