
Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering – Part 1

I’ve talked about this topic in some of my other articles, but I wanted to write an article that covers this model specifically and shows several examples of a Database Availability Group (DAG)’s tolerance for node and File Share Witness (FSW) failure.  Many people don’t properly understand how the Majority Node Set Clustering Model works.  In my article here, I talk about Database Activation Coordination Mode and have a section on Majority Node Set.  In this article, I want to show some real world examples of how the Majority Node Set Clustering Model works.  This will be a multi-part article and each Part will have its own example.

Part 1

Part 2

Part 3

Majority Node Set

Majority Node Set is a Windows Clustering Model, like the Shared Quorum Model, but it works differently.  Both Exchange 2007 and Exchange 2010 Clusters use Majority Node Set Clustering (MNS).  This means that more than 50% of your votes (server votes and/or 1 file share witness) need to be up and running.  The proper formula for this is (n / 2) + 1, rounded down, where n is the number of DAG nodes within the DAG.  With DAGs, if you have an odd number of DAG nodes in the same DAG (Cluster), you have an odd number of votes, so you don’t need a witness.  If you have an even number of DAG nodes, you will have a file share witness so that, if half of your nodes go down, the witness acts as that extra +1 vote.

So let’s go through an example.  Let’s say we have 3 servers.  This means that we need (3 nodes / 2) + 1, which equals 2 as you round down since you can’t have half a server/witness.  This means that at any given time, we need 2 of our nodes to be online, which means we can sustain only 1 server failure in our DAG (with an odd number of nodes, there is no witness vote).  Now let’s say we have 4 servers, which adds a file share witness for a total of 5 votes.  This means that we need (4 nodes / 2) + 1, which equals 3.  This means at any given time, we need 3 of our servers/witness to be online, which means we can sustain 2 server failures or 1 server failure and 1 witness failure.
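The arithmetic above can be checked with a few lines of Python.  This is only a sketch: the function names are made up for illustration, and it assumes, as described above, that an even number of DAG nodes adds exactly one file share witness vote while an odd number adds none.

```python
def votes_in_dag(dag_nodes: int) -> int:
    """Total votes: one per node, plus a witness vote when the node count is even."""
    return dag_nodes + 1 if dag_nodes % 2 == 0 else dag_nodes


def votes_required(dag_nodes: int) -> int:
    """The (n / 2) + 1 formula, rounded down."""
    return dag_nodes // 2 + 1


def failures_tolerated(dag_nodes: int) -> int:
    """How many votes (servers or witness) can be lost before quorum is lost."""
    return votes_in_dag(dag_nodes) - votes_required(dag_nodes)


for n in (2, 3, 4):
    print(f"{n} nodes: {votes_in_dag(n)} votes, "
          f"{votes_required(n)} required, "
          f"{failures_tolerated(n)} failure(s) tolerated")
```

For 2, 3, and 4 nodes this gives 1, 1, and 2 tolerated failures, which matches the examples above.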

Real World Examples

Each of these examples will show DAG Models with a Primary Site and a Failover Site.

2 Node DAG  (One in Primary and One in Failover)

In the following screenshot, we have 3 Servers.  Two are Exchange 2010 Multi-Role Servers; one in the Primary Site and one in the Failover Site.  The Cluster Service is running only on the two Exchange Multi-Role Servers.  More specifically, it runs on the Exchange 2010 Servers that have the Mailbox Server Role.  When Exchange 2010 utilizes an even number of Nodes, it utilizes Node Majority with File Share Witness.  If you have dedicated HUB and/or HUB/CAS Servers, you can place the File Share Witness on those Servers.  However, the File Share Witness cannot be placed on a Server with the Mailbox Server Role.

So now we have our three Servers; two of them being Exchange.  This means we have two voters and a File Share Witness.  The two Mailbox Servers that are running the cluster service are voters and the File Share Witness is just a witness that the voters use to determine cluster majority.  So the question is, how many voters/servers can I lose?  Well, if you read the section on Majority Node Set (which you have to understand), you know the formula is (number of nodes / 2) + 1.  This means we have (2 Exchange Servers / 2) + 1 = 1 + 1 = 2.  This means that 2 cluster objects must always be online for your Exchange Cluster to remain operational.

But now let’s say one of your Exchange Servers goes offline.  Well, you still have at least two cluster objects online.  This means your cluster will still be operational.  If all users/services were utilizing the Primary Site, then everything continues to remain completely operational.  If you were sending SMTP to the Failover Site or users were for some reason connecting to the Failover Site, they will need to be pointed to the Exchange Server in the Primary Site.

But what happens if you lose a second node, leaving only the Exchange Server in the Primary Site?  Well, based on the formula above, we need to ensure we have 2 cluster objects operational at all times.  At this time, the entire cluster goes offline.  You need to go through the steps provided in the site switchover process, but in this case, you would be activating the Primary Site and specifying a new Alternate File Share Witness Server that exists in the Primary Site so you can activate the Exchange 2010 Server in the Primary Site.  The DAG won’t actively use the File Share Witness, but you should specify it anyway because part of the Failback process is re-adding the failed Servers back to the DAG once they become operational.

But what happens if you lose two nodes in the Primary Site?  Well, based on the formula above, we need to ensure we have 2 cluster objects operational at all times.  At this time, the entire cluster goes offline.  You need to go through the steps provided in the site switchover process, but in this case, you would be activating the Failover Site and specifying a new Alternate File Share Witness Server that exists (or will exist) in the Failover Site so you can activate the Exchange 2010 Server in the Failover Site.  The DAG won’t actively use the File Share Witness, but you should specify it anyway because part of the Failback process is re-adding the Primary Site Servers back to the DAG once they become operational.
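To see every failure combination for this 2 Node DAG at once, here is a small Python sketch.  The member labels are hypothetical, and the check simply counts online cluster objects against the required 2, which is how the examples above reason about quorum.

```python
from itertools import combinations

# Hypothetical labels for the three cluster objects in the 2 Node DAG.
MEMBERS = ("primary-mbx", "failover-mbx", "fsw")
REQUIRED = 2  # (2 nodes / 2) + 1


def has_quorum(failed) -> bool:
    """True when enough cluster objects remain online after the given failures."""
    online = [m for m in MEMBERS if m not in failed]
    return len(online) >= REQUIRED


def scenarios(failure_count: int) -> dict:
    """Map every combination of `failure_count` failed members to its quorum result."""
    return {combo: has_quorum(combo) for combo in combinations(MEMBERS, failure_count)}


for count in (1, 2):
    for failed, ok in scenarios(count).items():
        state = "cluster stays online" if ok else "cluster goes offline"
        print(f"failed: {', '.join(failed)} -> {state}")
```

Any single failure leaves the cluster online, and any double failure takes it offline, at which point the manual site switchover process applies.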

Once the Datacenter Switchover has occurred, you will be in a state that looks as such.  An Alternate File Share Witness does not provide redundancy for the 2010 FSW that was in your Primary Site.  It’s used only during a Datacenter Switchover, which is a manual process.

Once your Primary Site becomes operational, you will re-add the Primary DAG Server to the existing DAG, which will still be using the 2010 Alternate FSW Server in the Failover Site.  The Cluster will now switch to Node Majority with File Share Witness instead of just Node Majority.  Remember I said that with an odd number of DAG Servers you will be in Node Majority, and with an even number the Cluster will automatically switch itself to Node Majority with File Share Witness?  You will now be in a state that looks as such.
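The odd/even rule can be written down as a tiny Python sketch.  The function name is hypothetical; the returned strings are the two model names used in this article.

```python
def quorum_model(dag_nodes: int) -> str:
    """Pick the quorum model from the number of DAG members (odd vs. even)."""
    if dag_nodes < 1:
        raise ValueError("a DAG needs at least one member")
    if dag_nodes % 2 == 1:
        return "Node Majority"
    return "Node Majority with File Share Witness"


# Member counts during the switchover and failback described above:
# normal operation (2), after the switchover (1), after re-adding the server (2).
for members in (2, 1, 2):
    print(f"{members} DAG member(s): {quorum_model(members)}")
```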

Part of the Failback Process would be to switch back to the old FSW Server in the Primary Site.  Once done, you will be back into your original operational state.

As you can see from how this works, a question that may arise is where to put your FSW.  Well, it should be in the Primary Site: the site with the most users or the site that has the most important users.  I bet another question arises: why the site with the most users or the most important users?  Because some environments may want to use the above with an Active/Active Model instead of an Active/Passive Model.  Some databases may be activated in both sites.  With that, if the WAN link goes down, the Exchange 2010 Server in the Failover Site loses quorum since it can’t contact at least 1 other voter.  Again, you must have two voters online.  This also means that each voter must be able to see one other voter.  Because of that, the Exchange 2010 Server in the Failover Site will go completely offline.
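The WAN link failure can be sketched in Python as well.  This assumes each site can only count the votes it holds locally once the link is down; the site names and the function name are made up for illustration.

```python
def sites_with_quorum(votes_by_site: dict) -> list:
    """Return the sites that still hold a majority of all votes when isolated."""
    total = sum(votes_by_site.values())
    required = total // 2 + 1  # majority of all votes in the DAG
    return [site for site, votes in votes_by_site.items() if votes >= required]


# 2 Node DAG with the FSW in the Primary Site:
# Primary holds the Exchange Server plus the FSW, Failover holds one Exchange Server.
fsw_in_primary = {"primary": 2, "failover": 1}
print(sites_with_quorum(fsw_in_primary))

# Same DAG with the FSW moved to the Failover Site.
fsw_in_failover = {"primary": 1, "failover": 2}
print(sites_with_quorum(fsw_in_failover))
```

Whichever site holds the FSW is the only one that keeps quorum during the outage, which is why the FSW placement matters.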

To survive this, you really must use 2 different DAGs: one DAG where the FSW is in the First Site and a second DAG where the FSW is in the Second Site.  Users that live in the First Active Site would primarily be using the Exchange 2010 DAG Members in the First Active Site.  Users that live in the Second Active Site would primarily be using the Exchange 2010 DAG Members in the Second Active Site.  This way, if anything happens with the WAN link, users in the First Active Site would still be operational as the FSW for their DAG is in the First Active Site and DAG 1 would maintain Quorum.  Users in the Second Active Site would still be operational as the FSW for their DAG is in the Second Active Site and DAG 2 would maintain Quorum.
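Here is a rough Python sketch of the two DAG layout during a WAN outage.  The DAG and site names are hypothetical, and it assumes each DAG has one member per site, with the FSW counted as one extra vote in the site that hosts it.

```python
def required_votes(node_count: int) -> int:
    """The (n / 2) + 1 formula, rounded down."""
    return node_count // 2 + 1


# Hypothetical layout: each DAG has one member per site and its own FSW site.
DAGS = {
    "DAG1": {"nodes": {"site1": 1, "site2": 1}, "fsw_site": "site1"},
    "DAG2": {"nodes": {"site1": 1, "site2": 1}, "fsw_site": "site2"},
}


def site_with_quorum(dag: dict):
    """Return the site that keeps quorum for this DAG when the WAN link is down."""
    required = required_votes(sum(dag["nodes"].values()))
    for site, nodes in dag["nodes"].items():
        votes = nodes + (1 if site == dag["fsw_site"] else 0)
        if votes >= required:
            return site
    return None


for name, dag in DAGS.items():
    print(f"{name} stays online in {site_with_quorum(dag)}")
```

Each DAG stays online in the site that hosts its FSW, so both sites keep a working DAG for their local users.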

Note: This would require twice the number of servers since a DAG Member cannot be a part of more than one DAG.  Each visual representation below of a 2010 HUB/CAS/MBX is a separate server.

The Multi-DAG Model would look like this.

 


28 Responses to “Exchange 2010 Site Resilient DAGs and Majority Node Set Clustering – Part 1”

  2. on 31 Aug 2011 at 11:32 am, Vincent

    I thought that each mailbox server could only be a member of 1 DAG.
    Considering this, how can you have 2 DAGs?

  3. on 31 Aug 2011 at 12:48 pm, Elan Shudnow

    That's correct. You have 2 DAGs with twice the number of servers that you would have with 1 DAG.

  4. on 31 Aug 2011 at 6:09 pm, Vincent

    Ok. That's what I was thinking… The diagrams were not that clear so I was wondering if there was a trick to do it anyway…
    Thanks for the reply.
    Vincent

  5. on 30 Nov 2011 at 10:21 am, Jim

    I have the same setup here I'm implementing. One mbx/cas/hub onsite with the FSW. The other mbx/cas/hub located at our datacenter(DR location). Should I also turn on DAC for this now? Exchange 2010 sp1

    thanks

    Jim

  6. on 30 Nov 2011 at 1:48 pm, Elan Shudnow

    Yes, you should turn DAC on. It's designed for every environment where you have 1 DAG in more than one AD Site or Datacenter with stretched AD Sites. Read more on DAC here: http://www.shudnow.net/2010/06/30/exchange-2010-d

  7. on 01 Dec 2011 at 10:21 am, Jim

    Thanks for the reply Elan, one more issue I'm having. Basically whenever I switch over to the other server and then shut down the primary server, all Outlook clients get prompted for credentials, and even if they type in the credentials it doesn't work. I found a fix here disabling Outlook Anywhere for that. http://port25.wordpress.com/2011/01/26/users-rece… but now it just sits at trying to connect and can't establish a connection.

    If i fire up the primary node again it works fine. I can then switchover everything to the primary node and shutoff the secondary node and everything is happy. It's only when the primary node is down does it not work. Switchover works fine, but once the primary node is shutoff clients can't connect. Any ideas? I'm stuck.

    thanks!

    Jim

  8. on 01 Dec 2011 at 10:41 am, Elan Shudnow

    I wouldn't do what that article said. The basic idea during a site failover is that you have a lower TTL value for your DNS records. For example, 5 minutes. When your primary server goes down, you cut over all your DNS records to point to the second server.

    Because your second datacenter is strictly DR only and you're not in an active/active scenario, you can set the Outlook Anywhere FQDN on your DR Servers to have the same FQDN as Outlook Anywhere in the Primary Site. Then when you switch over DNS to the secondary Datacenter, your Outlook Anywhere FQDN will be the same. Obviously you'll want to make sure that the certificate in the secondary site has the Outlook Anywhere FQDN and that the Common Name on the certificate is the same. This is because clients older than Vista SP1 don't support having the Certificate Principal Name (MSSTD value in Outlook Anywhere) be a SAN name on the certificate.

  9. on 01 Dec 2011 at 11:03 am, Jim

    Great. Thanks for the fast response. Any idea on the actual issue though, where I switch over and shut down the primary and clients won't connect to the secondary? It prompts for credentials, and after authenticating it doesn't bring up Outlook either. It seems it is something between the CAS/MBX.

    I've tried some things I found on Google but nothing has solved it.

  10. on 01 Dec 2011 at 1:45 pm, Craig

    Elan

    What if just the primary hub/cas/mbx failed and the fsw and the server in the other datacenter is still active? Does everything switchover fine or is there manual intervention to get the clients working on the datacenter hub/cas/mbx

  11. on 01 Dec 2011 at 2:07 pm, Elan Shudnow

    Well it's most likely because they can't authenticate because the Outlook Anywhere FQDN is pointed to the primary site. That server is no longer up. This is why part of the DR plan includes downtime and switching over the FQDNs to point to the DR Site. That way clients can contact the DR Server(s) and authenticate since the FQDNs are now pointed there. This includes moving over all CAS Namespaces.

  12. on 01 Dec 2011 at 2:08 pm, Elan Shudnow

    If Quorum is maintained then the Mailbox role stays operational. But, clients may not be able to access it since the FQDNs will most likely be pointing to the servers in the primary site. So while Mailbox Role may be operational, the CAS Role may not.

  13. on 19 Jan 2012 at 9:37 am, John Panicci

    Elan, Great website, awesome articles. I have 2 site (Prod/DR) active passive dag. 1 mbx in each site. and 2 hub/cas servers in each array in each site. I have FSW on one of the hub/cas servers in primary site. I have alternate fsw setup on one of the hub/cas servers in DR site. you mention the following: "Part of the Failback Process would be to switch back to the old FSW Server in the Primary Site. Once done, you will be back into your original operational state." I agree that this is what you need to do, but how do you actually do the switch back to fsw in primary site. Im basically worrying about site link being down between prod and DR..

  14. on 25 Jan 2012 at 4:05 pm, Elan Shudnow

    The official documentation is here: http://technet.microsoft.com/en-us/library/dd3510

    There is a section entitled "Restoring Service to the Primary Datacenter".

    It discusses how to go back to your original FSW.

  17. on 21 Mar 2012 at 1:17 pm, Nowin

    Hi Elan,

    Please can you help me with my problem. I have 2 nodes, active and passive, with Exchange 2007 roles installed. Whenever I switch off the node that has the quorum, the cluster fails. I understand this is because of the formula and that quorum is not maintained. So how do I maintain quorum? When I switch off the server that has the quorum, is the quorum supposed to fail over to the passive node so that quorum can be maintained? I notice that my quorum disk is not moving over to the passive node once the server it's connected to is shut down. My SCC disks are connected on a SAN.

  20. on 01 Aug 2012 at 8:57 am, Sunita

    Great Article. I had a question,

    I have 2 sites. Site 1 is my primary and Site 2 I would like to set up as my DR. I am planning to move to a new building for Site 1, so I will need to power off all of the servers in Site 1. My question is, with this design, will Site 2 be able to work?

    My configuration at Site 1 is:
    CAS1FSW
    MBX1
    MBX2

    Site 2:
    CAS2FSW
    MBX3

    Thanks for your help
    S

  21. on 01 Aug 2012 at 5:48 pm, Elan Shudnow

    You will not have quorum in Site 2. You would need to go through the manual DR site procedures in Site 2 in order to have quorum in Site 2 and using CAS2FSW as the alternate FSW. Then when Site 1 is back up and running, you would re-add Site 1 MBX Servers back into the DAG.

    What you could do here is add MBX4 to Site 2 and before you take down Site 1, move the active FSW to CAS2FSW in Site 2 so you have quorum there and wouldn't have to run through manual DR procedures. Obviously you'd still need to ensure that HTTP traffic will be pointed to Site 2 and SMTP mail is delivered to Site 2.

  22. on 25 Oct 2012 at 7:52 pm, ponzekap2

    Hey Elan. Huge fan of your blog, and Im actually the author of the article Jim references. I was wondering your thoughts on what I had wrote. In a situation where a DB fails over and the client is using MAPI to connect to the CAS array, and OA is in basic authentication mode, you'll have a situation where the Outlook client tries to connect using HTTPS, particularly the public folder connection point. Was just wondering besides having NTLM enabled (saying that isnt an option), I'm interested in what you would recommend in that situation? Huge fan of the material you put up, and would be interested to hear what you think. You can email me at my posting name AT gmail.com if you want. Would love to hear your thoughts.

  23. on 08 Feb 2013 at 4:39 pm, Antonio

    What about the bandwidth consumption between site1 and site2
    I should not worry about that?

  24. on 10 Feb 2013 at 8:39 am, Elan Shudnow

    Depends on several factors. Are users active in both sites? If so, are databases active in both sites? If so, does mail for users in the second site have MX records pointing to the other site? And don't forget about replication traffic. For users active in the second site, are they using a centralized webmail.domain.com which goes to the primary site and proxies traffic to the site they are in?

    As you can see, there are definitely bandwidth consumption considerations that need to be accounted for. There are two calculators which can help here:
    1. Client Network Bandwidth Calculator: http://blogs.technet.com/b/exchange/archive/2012/
    2. Exchange 2010 Mailbox Server Role Requirements Calculator: http://blogs.technet.com/b/exchange/archive/2009/

  25. on 20 Mar 2013 at 4:22 pm, Shea Werner

    can I use load balanced dns for this setup.
    So if the primary hub/cas/mbx but the fsw did not, the dns lb would see site one hub/cas/mbx is not working and would redirect clients to site 2 allowing for automatic failover at least in this case?

  26. on 20 Mar 2013 at 6:15 pm, Shea Werner

    correction to my question. I meant "DNS failover"

  27. on 23 Mar 2013 at 10:25 pm, Shea Werner

    Rephrase of above question; Can I use DNS Failover (i.e. offered by dnsmadeeasy.com)
    So if the primary hub/cas/mbx failed but the fsw did not, the dns failover would see site one hub/cas/mbx is not working and would redirect clients to site 2 allowing for automatic failover at least in this case.

  28. on 27 Apr 2013 at 3:41 am, said selfani

    Hi,
    I have implemented an Exchange 2010 cluster.
    PRIMARY
    MBX1
    HUB /CAS/FSW
    and I have DR
    MBX2
    HUB2/CAS2
    I have 1 FSW, but when the primary site goes down I think the DR site will not work because the FSW is down.
    So please help me with how to make the DR site work in case the primary site is down.
    If I need to create an alternate FSW, how can I do that?
