
Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering – Part 1

I’ve talked about this topic in some of my other articles, but I wanted to create an article that focuses specifically on this model and shows several different examples of a Database Availability Group (DAG)’s tolerance for node and File Share Witness (FSW) failure.  Many people don’t properly understand how the Majority Node Set Clustering Model works.  In my article here, I talk about Database Activation Coordination Mode and have a section on Majority Node Set.  In this article, I want to visually show some real world examples of how the Majority Node Set Clustering Model works.  This will be a multi-part article and each Part will have its own example.

Part 1

Part 2

Part 3

Majority Node Set

Majority Node Set is a Windows Clustering model, like the Shared Quorum model, but it works differently.  Both Exchange 2007 and Exchange 2010 Clusters use Majority Node Set Clustering (MNS).  This means that a majority of your votes (server votes and/or 1 File Share Witness) must be up and running.  The formula for the required majority is (n / 2) + 1, rounded down, where n is the number of DAG nodes within the DAG.  With DAGs, if you have an odd number of DAG nodes in the same DAG (Cluster), you already have an odd number of votes, so you don’t get a witness.  If you have an even number of DAG nodes, you will have a File Share Witness so that if half of your nodes go down, the witness acts as that extra +1 vote.

So let’s go through an example.  Let’s say we have 3 servers.  This means we need (3 / 2) + 1, which equals 2 once you round down, since you can’t have half a server/witness.  This means that at any given time we need 2 of our nodes to be online, which means we can sustain only 1 node failure in our DAG.  Now let’s say we have 4 servers.  This means we need (4 / 2) + 1, which equals 3.  This means at any given time we need 3 of our servers/witness to be online, which means we can sustain 2 server failures, or 1 server failure and 1 witness failure.
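If it helps to see the arithmetic, here is a minimal sketch (not from the article) of the same majority calculation, assuming the simple rule above that a File Share Witness adds one vote whenever the member count is even:

```powershell
# Hypothetical helper: computes quorum for a DAG with $members Mailbox servers.
$members = 2                                                            # DAG members (try 2, 3 or 4)
$votes   = if ($members % 2 -eq 0) { $members + 1 } else { $members }   # even member count -> FSW adds a vote
$quorum  = [math]::Floor($votes / 2) + 1                                # cluster objects that must stay online
$canLose = $votes - $quorum                                             # failures the cluster can survive
"{0} members = {1} votes; quorum is {2}; tolerates {3} failure(s)" -f $members, $votes, $quorum, $canLose
```

With 2 members this prints a quorum of 2 and a tolerance of 1 failure, which matches the 2-node example below.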

Real World Examples

Each of these examples will show DAG Models with a Primary Site and a Failover Site.

2 Node DAG  (One in Primary and One in Failover)

In the following screenshot, we have 3 servers.  Two are Exchange 2010 Multi-Role Servers; one in the Primary Site and one in the Failover Site.  The Cluster Service runs only on the two Exchange Multi-Role Servers; more specifically, it runs on the Exchange 2010 Servers that hold the Mailbox Server Role.  When a DAG has an even number of nodes, Exchange 2010 uses Node Majority with File Share Witness.  If you have dedicated HUB and/or HUB/CAS Servers, you can place the File Share Witness on those servers.  However, the File Share Witness cannot be placed on a server that is a member of the DAG (i.e., on the Mailbox Server Role).
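As a rough illustration of that placement (names like DAG1, HUBCAS1, EX-PRI and EX-DR are made up for this sketch, not from the article), the DAG and its witness on a HUB/CAS server would be created along these lines in the Exchange Management Shell:

```powershell
# Create the DAG and point its File Share Witness at a HUB/CAS server (not a DAG member).
New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer HUBCAS1 -WitnessDirectory C:\FSW\DAG1

# Add the two multi-role Mailbox servers, one per site, as the DAG members (the voters).
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX-PRI
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer EX-DR
```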

So now we have our three servers, two of them being Exchange.  This means we have two voters and a File Share Witness.  The two Mailbox Servers running the Cluster Service are voters, and the File Share Witness is just a witness that the voters use to determine cluster majority.  So the question is, how many voters/servers can I lose?  Well, if you read the section on Majority Node Set (which you need to understand), you know the formula is (number of nodes / 2) + 1.  This means we have (2 Exchange Servers / 2) + 1 = 2.  This means that 2 cluster objects must always be online for your Exchange Cluster to remain operational.
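A quick way to confirm which witness the cluster is actually using is to query the DAG with the -Status switch (the DAG name here is illustrative):

```powershell
# Shows the members, the configured witness, and whether the witness share is in use.
Get-DatabaseAvailabilityGroup -Identity DAG1 -Status |
    Format-List Name, Servers, WitnessServer, WitnessDirectory, WitnessShareInUse
```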

But now let’s say one of your Exchange Servers goes offline.  Well, you still have at least two cluster objects online, which means your cluster will still be operational.  If all users/services were utilizing the Primary Site, then everything continues to remain completely operational.  If you were sending SMTP to the Failover Site, or users were for some reason connecting to the Failover Site, they will need to be pointed to the Exchange Server in the Primary Site.

But what happens if you lose a second cluster object, for example the File Share Witness?  Well, based on the formula above, we need to ensure we have 2 cluster objects operational at all times, so at this point the entire cluster goes offline.  You need to go through the steps provided in the site switchover process, but in this case you would be activating the Primary Site and specifying a new Alternate File Share Witness Server that exists in the Primary Site so you can activate the Exchange 2010 Server in the Primary Site.  The DAG won’t actively use the File Share Witness, but you should specify it anyway, because part of the Failback process is re-adding the failed servers back to the DAG once they become operational.

But what happens if you lose both nodes in the Primary Site (the Exchange Server and the FSW)?  Again, based on the formula above, we need to ensure we have 2 cluster objects operational at all times, so the entire cluster goes offline.  You need to go through the steps provided in the site switchover process, but in this case you would be activating the Failover Site and specifying a new Alternate File Share Witness Server that exists (or will exist) in the Failover Site so you can activate the Exchange 2010 Server in the Failover Site.  The DAG won’t actively use the File Share Witness, but you should specify it anyway, because part of the Failback process is re-adding the Primary Site Servers back to the DAG once they become operational.
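The switchover itself is driven from the Exchange Management Shell.  Here is a hedged outline of the commands for the scenario just described (activating the Failover Site), with illustrative names for the DAG, the Active Directory sites and the alternate witness server:

```powershell
# 1. Mark the failed Primary Site DAG member(s) as stopped (-ConfigurationOnly since the servers are unreachable).
Stop-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Primary-Site -ConfigurationOnly

# 2. On the surviving Failover Site member, stop the Cluster service so quorum can be re-formed.
Stop-Service ClusSvc

# 3. Activate the Failover Site members, pointing the DAG at the Alternate File Share Witness.
Restore-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Failover-Site `
    -AlternateWitnessServer HUBCAS-DR -AlternateWitnessDirectory C:\FSW\DAG1
```

The same pattern applies to the earlier scenario where the Primary Site is the surviving side; only the site name and the alternate witness location change.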

Once the Datacenter Switchover has occurred, you will be in a state that looks as such.  An Alternate File Share Witness is not redundancy for the FSW that was in your Primary Site.  It is used only during a Datacenter Switchover, which is a manual process.
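Because the switchover is manual, the Alternate FSW is usually just a property pre-staged on the DAG object so it is ready when needed.  A minimal sketch, assuming the same illustrative names as above:

```powershell
# Pre-configure the alternate witness; it is only consumed during a datacenter switchover.
Set-DatabaseAvailabilityGroup -Identity DAG1 `
    -AlternateWitnessServer HUBCAS-DR -AlternateWitnessDirectory C:\FSW\DAG1
```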

Once your Primary Site becomes operational, you will re-add the Primary Site DAG Server to the existing DAG, which will still be using the Alternate FSW Server in the Failover Site, and the cluster will now switch into Node Majority with File Share Witness instead of just Node Majority.  Remember, I said that with an odd number of DAG Servers you will be in Node Majority, and with an even number the Cluster will automatically switch itself to Node Majority with File Share Witness.  You will now be in a state that looks as such.

Part of the Failback Process would be to switch back to the old FSW Server in the Primary Site.  Once done, you will be back into your original operational state.
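A hedged sketch of those failback steps, again with illustrative names:

```powershell
# 1. Once the Primary Site is reachable again, re-incorporate its stopped DAG member(s).
Start-DatabaseAvailabilityGroup -Identity DAG1 -ActiveDirectorySite Primary-Site

# 2. Switch the DAG back to the original witness server in the Primary Site.
Set-DatabaseAvailabilityGroup -Identity DAG1 -WitnessServer HUBCAS1 -WitnessDirectory C:\FSW\DAG1

# 3. Move the active database copies back to the Primary Site member.
Move-ActiveMailboxDatabase DB1 -ActivateOnServer EX-PRI
```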

As you can see from how this works, the question that may arise is where to put your FSW.  Well, it should be in the Primary Site, the site with the most users or the site that has the most important users.  With that in mind, another question arises: why the site with the most users or the most important users?  Because some environments may want to use the above design as an Active/Active model instead of Active/Passive, where some databases are activated in each site.  But in that case, if the WAN link goes down, the Exchange 2010 Server in the Failover Site loses quorum since it can’t contact at least 1 other voter.  Again, you must have two voters online, and each voter must be able to see at least one other voter.  Because of that, the Exchange 2010 Server in the Failover Site will go completely offline.

To survive this, you really must use 2 different DAGs: one DAG with its FSW in the First Site and a second DAG with its FSW in the Second Site.  Users who live in the First Active Site would primarily use the Exchange 2010 DAG Members in the First Active Site.  Users who live in the Second Active Site would primarily use the Exchange 2010 DAG Members in the Second Active Site.  This way, if anything happens to the WAN link, users in the First Active Site remain operational because the FSW for their DAG is in the First Active Site and DAG 1 maintains Quorum, and users in the Second Active Site remain operational because the FSW for their DAG is in the Second Active Site and DAG 2 maintains Quorum.

Note: This would require twice the number of servers, since a DAG Member cannot be part of more than one DAG.  As shown below, each visual representation of a 2010 HUB/CAS/MBX is a separate server.
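A sketch of how the two DAGs might be laid out (all names are made up for illustration): each DAG keeps its FSW in the site where its primary users live, so a WAN outage leaves both sides able to maintain quorum.

```powershell
# DAG1: witness in Site 1, for users whose active copies live in Site 1.
New-DatabaseAvailabilityGroup -Name DAG1 -WitnessServer HUBCAS-SITE1 -WitnessDirectory C:\FSW\DAG1
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX-SITE1A
Add-DatabaseAvailabilityGroupServer -Identity DAG1 -MailboxServer MBX-SITE2A

# DAG2: witness in Site 2, for users whose active copies live in Site 2.
New-DatabaseAvailabilityGroup -Name DAG2 -WitnessServer HUBCAS-SITE2 -WitnessDirectory C:\FSW\DAG2
Add-DatabaseAvailabilityGroupServer -Identity DAG2 -MailboxServer MBX-SITE2B
Add-DatabaseAvailabilityGroupServer -Identity DAG2 -MailboxServer MBX-SITE1B
```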

The Multi-DAG Model would look like this.

 


31 Responses to “Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering – Part 1”

  1. […] Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering – Part 1 | Elan Shudnow’s Blog Posted on August 5, 2011 by johnacook http://www.shudnow.net/2011/08/05/exchange-2010-site-resilient-dags-and-major… […]

  2. on 31 Aug 2011 at 11:32 am Vincent

    I thought that each mailbox server could only be a member of 1 DAG.
    Considering this, how can you have 2 DAGs?

  3. on 30 Nov 2011 at 10:21 am Jim

    I have the same setup here that I'm implementing. One mbx/cas/hub onsite with the FSW. The other mbx/cas/hub located at our datacenter (DR location). Should I also turn on DAC for this now? Exchange 2010 SP1

    thanks

    Jim

  4. on 01 Dec 2011 at 10:21 am Jim

    Thanks for the reply Elan, one more issue I'm having. Basically, whenever I switch over to the other server and then shut down the primary server, all Outlook clients get prompted for credentials, and even if they type in the credentials it doesn't work. I found a fix here disabling Outlook Anywhere for that: http://port25.wordpress.com/2011/01/26/users-rece… but now it just sits at trying to connect and can't establish a connection.

    If I fire up the primary node again it works fine. I can then switch everything over to the primary node and shut off the secondary node and everything is happy. It's only when the primary node is down that it doesn't work. Switchover works fine, but once the primary node is shut off, clients can't connect. Any ideas? I'm stuck.

    thanks!

    Jim

  5. on 01 Dec 2011 at 11:03 am Jim

    Great, thanks for the fast response. Any idea on the actual issue, though, of why, when I switch over and shut down the primary, they won't connect to the secondary? It prompts for credentials, and after authenticating it doesn't bring up Outlook either. It seems it is something between the CAS/MBX.

    I've tried some things I found on Google but nothing has solved it.

  6. on 01 Dec 2011 at 1:45 pm Craig

    Elan

    What if just the primary hub/cas/mbx failed and the fsw and the server in the other datacenter are still active? Does everything switch over fine, or is there manual intervention to get the clients working on the datacenter hub/cas/mbx?

  7. on 19 Jan 2012 at 9:37 am John Panicci

    Elan, great website, awesome articles. I have a 2 site (Prod/DR) active/passive DAG: 1 MBX in each site, and 2 HUB/CAS servers in an array in each site. I have the FSW on one of the HUB/CAS servers in the primary site. I have an alternate FSW set up on one of the HUB/CAS servers in the DR site. You mention the following: "Part of the Failback Process would be to switch back to the old FSW Server in the Primary Site. Once done, you will be back into your original operational state." I agree that this is what you need to do, but how do you actually do the switch back to the FSW in the primary site? I'm basically worried about the site link being down between Prod and DR.

  10. on 21 Mar 2012 at 1:17 pm Nowin

    Hi Elan,

    Please can you help me with my problem. I have 2 nodes, active and passive, with Exchange 2007 roles installed. Whenever I switch off the node that has the quorum, the cluster fails. I understand this is because of the formula and that quorum is not maintained. So how do I maintain quorum? When I switch off the server that has the quorum, is the quorum supposed to fail over to the passive node so that quorum can be maintained? I notice that my quorum disk is not moving over to the passive node once the server it's connected to is shut down. My SCC disks are connected to a SAN.

  13. on 01 Aug 2012 at 8:57 am Sunita

    Great article. I had a question.

    I have 2 sites. Site 1 is my primary, and Site 2 I would like to set up as my DR. I am planning on moving to a new building for Site 1, so I will need to power off all of the servers in Site 1. My question is, with this design will Site 2 be able to work?

    My configuration at Site 1 is:
    CAS1FSW
    MBX1
    MBX2

    Site 2:
    CAS2FSW
    MBX3

    Thanks for your help
    S

  14. on 08 Feb 2013 at 4:39 pm Antonio

    What about the bandwidth consumption between site1 and site2?
    Should I not worry about that?

  15. on 20 Mar 2013 at 4:22 pm Shea Werner

    Can I use load balanced DNS for this setup?
    So if the primary hub/cas/mbx failed but the fsw did not, the DNS LB would see site one hub/cas/mbx is not working and would redirect clients to site 2, allowing for automatic failover at least in this case?

  16. on 20 Mar 2013 at 6:15 pm Shea Werner

    Correction to my question. I meant "DNS failover".

  17. on 23 Mar 2013 at 10:25 pm Shea Werner

    Rephrase of the above question: Can I use DNS failover (e.g., as offered by dnsmadeeasy.com)?
    So if the primary hub/cas/mbx failed but the fsw did not, the DNS failover would see site one hub/cas/mbx is not working and would redirect clients to site 2, allowing for automatic failover at least in this case.

  18. on 27 Apr 2013 at 3:41 am said selfani

    hi
    I have implemented an Exchange 2010 cluster.
    PRIMARY
    MBX1
    HUB/CAS/FSW
    and I have DR
    MBX2
    HUB2/CAS2
    I have 1 FSW, but when the primary site goes down I think the DR site will not work because the FSW is down,
    so please help me how to make the DR site work in case the primary site is down.
    If I need to create an alternate FSW, how can I do that?

  19. on 24 Oct 2013 at 5:58 pm Sam

    Hi Elan,

    This has been very informative. A scenario I'm currently facing with a client is that they have two datacentres. DC1 contains MBX 1 & 2 and a FSW. DC2 contains MBX 3 only. All databases are active only on the 2 MBX servers in DC1 whilst MBX 3 holds passive copies. My intention is to recommend putting in a MBX 4 at DC2 so in case DC1 goes down, MBX 3 doesn't have to deal with the load by itself.

    Currently, there is just one DAG however. I know you recommended above to set up a MBX 4 in a separate DAG and set up a FSW in DC2. However, in the event the client does not want to invest in additional servers, what options do I have if DC1 goes down? We are also looking to implement DAC mode for the client.

  20. on 24 Oct 2013 at 6:22 pm Sam

    Hi Elan,

    Part 2 actually answered my question. :-)

    Thanks

  21. on 22 Jan 2014 at 12:30 am Ahmed Al-Haffar

    Thanks Elan for this informative, very well structured article.
    I have a doubt: currently we have 2 DAG members and 1 FSW. What will happen if I restart the FSW? As per the formula nothing should happen, but I want to double-check with you.

    regards.

