Exchange Server Forums


CCR Failover Problem

All Forums >> [Microsoft Exchange 2007] >> High Availability >> CCR Failover Problem Page: [1]
CCR Failover Problem - 1.Aug.2007 9:32:07 AM   
rparsons1000

 

Posts: 143
Joined: 29.Aug.2006
Status: offline
I recently tried to simulate a failure on the active node of a CCR cluster and things did not go right. I pulled the network cables on the active node and the resources moved to the passive node. The Cluster Group came online, but all of the node resources went to "Online Pending" and then, after 4 minutes, they all went "Offline". I plugged the network cables back into the active node and within seconds all resources came online on the passive node. I moved everything back and had to re-seed one storage group.

If I do a graceful shutdown of the active node, everything moves over fine.

Any ideas why I am not getting a good failover on a hard crash?
Post #: 1
RE: CCR Failover Problem - 1.Aug.2007 11:07:16 AM   
John Weber

 

Posts: 584
Joined: 20.Apr.2005
From: Portland, Oregon
Status: offline
Sounds like the Majority Node Set (MNS) quorum is not functioning.
Verify that both cluster nodes have Microsoft KB 921181 installed.

-jmw

(in reply to rparsons1000)
Post #: 2
RE: CCR Failover Problem - 3.Aug.2007 6:34:26 AM   
rparsons1000

 

I am fairly certain I installed it, and I remember it being a bugger to install. MNS is running: it is one of the resources in the cluster that moved and did start back up.

I don't see the hotfix in add/remove programs.

clussvc.exe is version 5.2.3790.3959

How can I be sure it is installed and installed correctly?

< Message edited by rparsons1000 -- 3.Aug.2007 6:44:56 AM >

(in reply to rparsons1000)
Post #: 3
RE: CCR Failover Problem - 3.Aug.2007 8:10:08 AM   
Henrik Walther

 

Posts: 6827
Joined: 21.Nov.2002
From: Copenhagen, Denmark
Status: offline
I've forwarded this on to a trusted source of mine...


_____________________________

HTH
Henrik Walther
Exchange MVP & MCM: Exchange 2007
MCITP: Exchange 2007, MCITP: Windows Server 2008, MCSE: M+S


(in reply to rparsons1000)
Post #: 4
RE: CCR Failover Problem - 3.Aug.2007 9:04:38 AM   
rparsons1000

 

Thanks Henrik.

BTW, good job on getting the article on NLB for CAS going. Lots of people looking forward to that one.

(in reply to rparsons1000)
Post #: 5
RE: CCR Failover Problem - 26.Aug.2007 6:38:46 PM   
Elan Shudnow

 

Posts: 544
Joined: 4.Jan.2007
From: Chicago, IL
Status: online
I just visited this High Availability section to see if someone was having this problem, and voila!

I'm noticing the same issue. I am 100% sure that I installed the KBs, and my MNS is working just fine. My CCR is set to best availability, but it's behaving as if it were set to lossless. When I pull the connectivity on the active node, everything moves over to the passive node just fine, but the storage groups sit at Online Pending and then just go to Offline. When I bring the old active node back up, the storage groups are brought online. It's as if this is working as lossless instead of best availability. It seems like a bug, as best availability should automatically bring the storage groups online and then mount the databases.

I tested this out with RTM.

I then installed Rollup 4 and tested again, and everything with best availability started working as it should.

Edit:  I just ran some more tests and it's still working.  So to the original poster, I would try installing Rollup 4 and see if it fixes your problem since it did for me.

< Message edited by eshudnow -- 26.Aug.2007 6:45:22 PM >

(in reply to rparsons1000)
Post #: 6
RE: CCR Failover Problem - 27.Aug.2007 2:56:14 AM   
Henrik Walther

 

If you pull the network cable, the resources will move; that is expected. There are two things you need to know here: what the availability setting is and what the copy queue length is.
 
Remember that we have protections built into CCR known as the availability settings.  These availability settings are good availability, best availability, and lossless.  These translate into the number of log files that can be lost on a failover.  Good = 3, best = 5 (or 7 I forget), and lossless = 1 I believe.  I haven't looked at those numbers in quite a while.
 
When a failover occurs, we interrogate the availability setting. If the copy queue length (the number of logs pending copy) is greater than the availability setting allows, we wait. This is where the replication service is attempting to wait for the other node to come back so that it can copy the needed log files over. Hence, when you plug the network cable back in, the machine becomes available and the replication service copies the logs and mounts the database.
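
To make the check described above concrete, here is a minimal Python sketch of the comparison. It is only an illustration, not Exchange code: the names MountDial and can_auto_mount are hypothetical, and the thresholds (lossless = 0, good = 3, best = 6) are assumed from the Exchange 2007 documentation rather than confirmed anywhere in this thread, so verify them before relying on them.

```python
from enum import Enum


class MountDial(Enum):
    """Assumed number of log files that may be lost for each setting."""
    LOSSLESS = 0
    GOOD_AVAILABILITY = 3
    BEST_AVAILABILITY = 6


def can_auto_mount(copy_queue_length: int, dial: MountDial) -> bool:
    """Return True when the logs still pending copy fit within the setting.

    A copy queue length above the threshold means the databases stay
    dismounted while replication waits for the other node to return.
    """
    if copy_queue_length < 0:
        raise ValueError("copy queue length cannot be negative")
    return copy_queue_length <= dial.value


if __name__ == "__main__":
    # Example: 15 logs were never copied before the cable was pulled.
    print(can_auto_mount(15, MountDial.BEST_AVAILABILITY))
    # Example: nothing pending, so even lossless would allow a mount.
    print(can_auto_mount(0, MountDial.LOSSLESS))
```

Under these assumptions, a long backlog of uncopied logs would explain databases that stay offline until the old active node is reachable again.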
 
I believe you can check this by using Get-ClusteredMailboxServer -Server <name> | fl to see the availability settings. Get-StorageGroupCopyStatus -Identity <SGName> | fl will give you the copy queue length.
 
Check the application log first, though, as the events should be right there, I believe.
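
If the "| fl" output is saved to a text file, the copy queue length can be pulled out with a few lines of Python. This is a rough sketch under stated assumptions: the sample text below is made up for illustration (it is not output captured from a real server), the helper name parse_format_list is hypothetical, and the parser only handles simple one-line "Name : Value" pairs.

```python
def parse_format_list(text):
    """Parse simple 'Name : Value' lines into a dictionary.

    Lines without a colon (blank lines, wrapped values) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


# Hypothetical sample in the shape of format-list output.
SAMPLE = """
Name              : SG1
SummaryCopyStatus : Healthy
CopyQueueLength   : 0
ReplayQueueLength : 2
"""

copy_queue_length = int(parse_format_list(SAMPLE)["CopyQueueLength"])
print("Copy queue length:", copy_queue_length)
```

The resulting number is the value to compare against the availability setting.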


(in reply to Elan Shudnow)
Post #: 7
RE: CCR Failover Problem - 27.Aug.2007 10:48:13 AM   
a.grogan

 

Posts: 1887
Joined: 12.Apr.2005
From: London
Status: offline
Hiya chap, just a thought, but is the File Share Witness share on one of the nodes in the cluster? I have seen this cause problems.

Cheers

A

_____________________________

Andy Grogan
MSExchange.org Forums Moderator
For my general ramblings about Exchange please visit my blog:
W: http://telnetport25.wordpress.com/
M: manifoldmaster@gmail.com

(in reply to rparsons1000)
Post #: 8
RE: CCR Failover Problem - 27.Aug.2007 11:11:56 AM   
rparsons1000

 

This certainly makes sense. I did not have a backup for a while, so I had a great many log files and even had to re-seed due to an issue after failing back. Let's assume, though, that this were a true failure and you exceeded the AutoDatabaseMountDial setting. How would you get the DBs mounted? I am wondering if this is somehow addressed in Rollup 4.

For further reading
http://technet.microsoft.com/en-us/library/bb288910.aspx

(in reply to rparsons1000)
Post #: 9
RE: CCR Failover Problem - 27.Aug.2007 6:35:33 PM   
rparsons1000

 

I put the FSW on a Hub Transport role server. I think Henrik is on the right track with the log files; it makes sense. I will test again tomorrow after Rollup 4 is applied. Thanks.

(in reply to a.grogan)
Post #: 10
RE: CCR Failover Problem - 28.Aug.2007 12:18:30 AM   
rparsons1000

 

I just spent the last 2 hours doing more testing and here is what I found:

The AutoDatabaseMountDial is set to best, meaning 3 lost log files before "failure". The documentation really leads me to believe it is 3 or more.

I tested again by shutting down the active node, and it worked perfectly.

I tested again by removing the network cables and, still, all DBs fail to mount or come online. All other resources come online. If I try to mount in the console, it says I have to run repair-storagegroup because replication is still occurring. I have 15 DBs/storage groups, so I did this to one arbitrarily and all of a sudden they all came online, except the Public Folder database, since it is a replicated folder (from 2003).

I will check the copy queue length next, but I was unable to find where to look for it or set it. Does it need to be lower than the AutoDatabaseMountDial?

(in reply to rparsons1000)
Post #: 11
RE: CCR Failover Problem - 28.Aug.2007 12:53:37 PM   
rparsons1000

 

I ran Get-StorageGroupCopyStatus and the copy queue length is 0.

Bottom line: failover is good when shutting down, and how often does a server really just crash and burn? If that happens, I will just run the restore cmdlet, which still keeps our downtime minimal.

I'll be keeping an eye out for further info though.

(in reply to rparsons1000)
Post #: 12
