
Archive for May, 2012

Lync 2010 Central Site Resilience w/ Backup Registrars, Failovers, and Failbacks – Part 3

Introduction

Welcome to Part 3 of this article series. In Part 1, we started off by discussing the goal of this lab. That goal is to wrap together all the information out there on how to utilize Central Site Resilience with regard to failovers, failbacks, how redirects function, how SRV records fit in, etc. We first discussed what the lab setup is going to be using Hyper-V, and then proceeded to take a look at the base topology and configuration.  In Part 2 of this article series, I went through the sign-in process for each user.  Because the SRV record for sign-in was pointed to A-L14FE1.shudlab.net, the sign-in process was different for each user logging in.

In this Part, we’ll do a failover test and a failback test with only a single SRV record in place.  We’ll take a look at what happens to ClientUser1 using the A-L14FE1 Pool and what happens to ClientUser2 using the A-L14FE2 Pool when we take down one of the Pools.  We will then take a look at what happens when the Pool that went down comes back online. And finally, we will end our tests by seeing what happens when a second SRV record is in place.

Part 1

Part 2

Part 3

The Failover with only one SRV

As shown in Part 1, our SRV record is pointing to A-L14FE1.shudlab.net. If you recall from Part 2, when ClientUser1 signed in and connected to A-L14FE1.shudlab.net, he received no 301 redirect and therefore was not informed of his Primary and Backup Registrar.  We also saw ClientUser2 connect to A-L14FE1.shudlab.net and receive a 301 redirect with his Primary and Backup Registrar, after which ClientUser2 connected and registered to A-L14FE2.shudlab.net.

ClientUser1

Let’s start with disabling the NIC on A-L14FE1.  I wanted to see the behavior of ClientUser1.  Don’t forget, ClientUser1 initially connected to A-L14FE1.shudlab.net with no 301 redirection.  Because of this, it has no idea what its Backup Registrar is, and there are no additional SRV records other than the one that points to A-L14FE1.

After approximately 30 seconds, ClientUser1 gets disconnected.

Remember in the Topology, we had a failure detection interval of 30 seconds.  I let this sit for about 5-8 minutes and it stayed disconnected.

ClientUser2

The first thing I did was re-enable the NIC on A-L14FE1 and let ClientUser1 sign back in.  I want to be in a normal operational state.  Now let’s disable the NIC on A-L14FE2.shudlab.net and see what happens with ClientUser2.  What we should see happen is that it fails over to A-L14FE1.shudlab.net.  The reason is that it signed in using the SRV record, received the 301 redirect, and was informed of both its Primary Registrar and its Backup Registrar.  While ClientUser2 should be able to fail over, don’t forget about the Endpointconfiguration.cache file.  If this client were to sign out and sign back in, it would not use the SRV record and would connect directly to A-L14FE2.shudlab.net.  Because of that, it would no longer know about its Backup Registrar and would have no idea where to reconnect.

But let’s take a look at both scenarios.  Let’s first see whether it fails over properly, since it received a 301 redirect during its last sign-in.

We’ll go ahead and disable the NIC on A-L14FE2.shudlab.net.

After around 30 or so seconds, ClientUser2 signs out.  What I would expect now is ClientUser2 connects to A-L14FE1.shudlab.net since again, when ClientUser2 initially signed in, it received a 301 redirect which informed ClientUser2 of both the Primary and Backup Registrar.

And just as I thought, ClientUser2 connects to A-L14FE1.shudlab.net.

After re-enabling the NIC on A-L14FE2.shudlab.net, within 40 seconds (which is the failback interval), ClientUser2 reconnects to its Primary Registrar.

The Failover with a second SRV pointing to the secondary pool

So we’ve seen ClientUser1 fail to connect when A-L14FE1.shudlab.net goes down because ClientUser1 never received a 301 redirect message and because there is no 2nd SRV record in the environment.  Let’s go ahead and add our second SRV record with a priority of 10.

And just to verify A-Client1 sees the change, let’s do a new nslookup.

Ok, now let’s run the same test we initially did.  I’m shutting down A-L14FE1.shudlab.net server’s NIC.  What we saw earlier on in our tests is that ClientUser1 would just sit signed out with nowhere to go.  What should happen now is the Lync client signs out, ends up finding the second SRV record, and now is able to connect to the second pool, A-L14FE2.shudlab.net.

After around 30 or so seconds, ClientUser1 signs out.  Let’s see if it picks up the 2nd SRV record and then signs into A-L14FE2.shudlab.net.

After a little bit of waiting, sure enough, ClientUser1 can now successfully sign into A-L14FE2.shudlab.net.

Now let’s take a look at a Netmon Trace and see what exactly ClientUser1 did for DNS lookups.

When the server is down, we see the client query for _sipinternaltls._tcp.shudlab.net.  We can see in the red highlights at the bottom that a-l14fe1.shudlab.net and a-l14fe2.shudlab.net are both returned.  Part of the data returned is the priority information.  What we end up seeing below is that ClientUser1 tries to connect to a-l14fe2.shudlab.net because it knows it is having problems connecting to a-l14fe1.shudlab.net.  Because that 2nd SRV record is in place, ClientUser1 finds it, does another query for a-l14fe2.shudlab.net to find its IP address, and then makes a connection to this server.  Voila, we now have a failed over client.
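To make the priority behavior a bit more concrete, here is a minimal Python sketch of how a client could walk a set of SRV records, trying lower priority values first and skipping targets it cannot reach. This is not the Lync client's actual code. The `SrvRecord` type, the `pick_target` helper, and the priority of 0 on the first record are assumptions for illustration (the lab only states that the second record has a priority of 10), and weights are ignored for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    """Hypothetical container for the fields of one SRV answer."""
    priority: int
    weight: int
    port: int
    target: str


def pick_target(records, is_reachable):
    """Return the first reachable target, trying lower priority values first."""
    for record in sorted(records, key=lambda r: r.priority):
        if is_reachable(record.target):
            return record.target
    return None  # no record points at a reachable server


# Assumed values: only the priority of 10 on the second record comes from the lab.
records = [
    SrvRecord(priority=10, weight=0, port=5061, target="a-l14fe2.shudlab.net"),
    SrvRecord(priority=0, weight=0, port=5061, target="a-l14fe1.shudlab.net"),
]

down = {"a-l14fe1.shudlab.net"}  # simulate the disabled NIC on A-L14FE1
print(pick_target(records, lambda host: host not in down))
```

With A-L14FE1 marked as down, the sketch falls through to the priority 10 record, which matches what the Netmon trace shows ClientUser1 doing.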

 

Reviewing some key points

  • If a client gets redirected to a server, it is a 301 redirect that informs the client of its Primary and Backup Registrar.  If the Primary happens to be down (for example, if you connected to a Director), the client will automatically be able to connect to its Backup Registrar.  If the Primary is operational, the user connects, and the Primary then goes down, that user will fail over to the Backup Registrar.
  • If a client has signed in at least once, its Primary Server has been cached into a file called Endpointconfiguration.cache.  That client will always connect directly to that server instead of potentially getting a 301 redirect.  It is because of this that it is very important to have multiple SRV records in the environment, so that regardless of whether a server is cached in the Endpointconfiguration.cache file, the client has another means of finding another registrar in the environment.  If that registrar happens to be another pool that is not their primary, the user will get a 301 redirect to their Primary and Backup Registrar Pool.
  • A Director does help, as it will redirect clients to their correct pool with a 301 redirect, thus letting the client know what its Primary and Backup Registrar are.  But as you have seen, do not completely rely on this due to the client caching server information in the Endpointconfiguration.cache. You absolutely should have at least 2 SRV records with two different priorities to ensure a client will fail over to another registrar regardless of whether you have a Director in your environment or not.
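The key points above can be sketched as a small Python model of the three test results from this Part. It is a simplification based only on the behavior observed in this lab, not the real Lync client logic. The function names `candidates` and `sign_in` and the exact ordering (cached server, then a Backup Registrar learned from a 301 redirect, then SRV targets) are assumptions made for illustration.

```python
def candidates(cached_server, known_backup, srv_targets):
    """Return the servers this simplified client would try, in order."""
    ordered = []
    for host in [cached_server, known_backup, *srv_targets]:
        if host and host not in ordered:
            ordered.append(host)
    return ordered


def sign_in(cached_server, known_backup, srv_targets, online):
    """Return the first candidate that is online, or None if the client stays signed out."""
    for host in candidates(cached_server, known_backup, srv_targets):
        if host in online:
            return host
    return None


FE1 = "A-L14FE1.shudlab.net"
FE2 = "A-L14FE2.shudlab.net"

# ClientUser1, single SRV record, FE1 down: no backup known, nowhere to go.
print(sign_in(FE1, None, [FE1], online={FE2}))
# ClientUser2, single SRV record, FE2 down: backup was learned from the 301 redirect.
print(sign_in(FE2, FE1, [FE1], online={FE1}))
# ClientUser1 again, now with the second SRV record in place.
print(sign_in(FE1, None, [FE1, FE2], online={FE2}))
```

The first call returns `None` (the client that sat disconnected), while the other two find a registrar, either through the backup from the 301 redirect or through the second SRV record.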

 

Conclusion

Well folks, that is all for not just Part 3, but the entire article series. In this part, we performed a failover test and a failback test with only a single SRV record in place.  We took a look at what happens to ClientUser1 using the A-L14FE1 Pool and what happens to ClientUser2 using the A-L14FE2 Pool when we take down one of the Pools.  We then took a look at what happens when the Pool that went down comes back online.  And we finally ended our tests by seeing what happens when a second SRV record is in place.

Hopefully these articles have helped you understand more on how the deployment of Lync 2010 Central Site Resilience works.  Feel free to ask questions in the comments below and I will do my best to answer questions.

 


Lync 2010 Central Site Resilience w/ Backup Registrars, Failovers, and Failbacks – Part 2

Introduction

Welcome to Part 2 of this article series. In Part 1, we started off by discussing the goal of this lab. That goal is to wrap together all the information out there on how to utilize Central Site Resilience with regard to failovers, failbacks, how redirects function, how SRV records fit in, etc. We first discussed what the lab setup is going to be using Hyper-V, and then proceeded to take a look at the base topology and configuration.

In this Part, I will go through the sign-in process for each user.  Because the SRV record for sign-in is pointed to A-L14FE1.shudlab.net, the sign-in process will be different for each user logging in.

Part 1

Part 2

Part 3

The Sign-In Process

As shown in Part 1, our SRV record is pointing to A-L14FE1.shudlab.net.  This means when ClientUser1 connects, he will connect directly to his home server.  When ClientUser2 connects, he will connect to A-L14FE1.shudlab.net, will get authenticated, and will receive a 301 redirect to A-L14FE2.shudlab.net.

ClientUser1

This is a completely fresh client.  No Lync 2010 client has signed in and therefore, there is no cached folder with any endpointconfiguration.cache file with a cached server.  The Lync 2010 client will sign in for its first time and do the SRV lookup.

Let’s enable logging on A-L14FE1 since that is the server that will be authenticating all Lync 2010 logins.  Essentially, if we had a Director, it would be doing the exact same thing in this situation.  We’ll start logging by going to: Start > All Programs > Microsoft Lync Server 2010 > Lync Server Logging Tool.  Enable the SIPStack option, choose Information, and then choose All Flags.  Then click Start.

Now that we’re logging, we’ll hop back onto A-Client1, and sign into the Lync 2010 client using the ClientUser1 user account.  We can see that the Lync 2010 Client on A-Client1 signed in successfully and we can see in the Configuration Information (Control + Right-Click on Lync Icon in Notification Area) that we’re connected to A-L14FE1.

Heading back over to A-L14FE1.shudlab.net, let’s take a look at the Lync Logs.  We click Stop and then Analyze to view the logs in Snooper.  Make sure the Lync 2010 Resource Kit tools are installed, otherwise Snooper will not launch.  Taking a look at the log, we can see a bunch of incoming Subscribes and a bunch of incoming Service messages.  This Pool has authenticated this user and is now servicing this user.  We see no SIP Redirects and therefore, ClientUser1 has no idea what its Backup Registrar is.

Taking a look at the endpointconfiguration.cache file, we can see that this client now has A-L14FE1.shudlab.net cached.  It will no longer try to do an SRV lookup unless it cannot connect to the server specified in this endpointconfiguration.cache file.

ClientUser2

Just like ClientUser1, this is a completely fresh client.  No Lync 2010 client has signed in and therefore, there is no cached folder with any endpointconfiguration.cache file with a cached server.  The Lync 2010 client will sign in for its first time and do the SRV lookup.

Let’s go ahead and start logging again on A-L14FE1.  Refer to the ClientUser1 section on how to log.  We’re logging on A-L14FE1 instead of A-L14FE2 to see the difference in how A-L14FE1 responds when a user homed on a different pool logs in.

Now that we’re logging, we’ll hop back onto A-Client2, and sign into the Lync 2010 client using the ClientUser2 user account.  We can see that the Lync 2010 Client on A-Client2 signed in successfully and we can see in the Configuration Information (Control + Right-Click on Lync Icon in Notification Area) that we’re connected to A-L14FE2.

Heading back over to A-L14FE1.shudlab.net, let’s take a look at the Lync Logs.  We click Stop and then Analyze to view the logs in Snooper.  Make sure the Lync 2010 Resource Kit tools are installed otherwise Snooper will not launch.  Taking a look at the log, we see a ton less data than we did when ClientUser1 logged in.  Again, this is because ClientUser1 was homed on A-L14FE1 whereas ClientUser2 is not homed on A-L14FE1 but is rather homed on A-L14FE2.

Because ClientUser2 is homed on A-L14FE2, when ClientUser2 initially connected to A-L14FE1, we can see the authentication occurring, and then A-L14FE1 issues a 301 redirect to ClientUser2.  In the data on the right of the log, we see that in this 301 redirect message, the user is notified of their Primary Registrar (A-L14FE2.shudlab.net:5061) and their Backup Registrar (A-L14FE1.shudlab.net:5061).  This is why Doug, in his blog article, talked about the benefits of Directors.  A Director will issue a 301 redirect for all authenticating users.  This way, clients will know about their Primary and Backup Registrar.
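As a rough illustration of what a client has to do with such a redirect, here is a hedged Python sketch that pulls an ordered list of registrars out of a SIP 301 response. The sample message is hypothetical and is not copied from the Snooper log. It assumes the registrars arrive as standard SIP `Contact` headers ranked by a `q` value, and the status line, `q` values, and `transport=tls` parameter are all assumptions for illustration.

```python
import re

# Hypothetical 301 response; the header layout and q-values are assumed, not taken from the log.
SAMPLE_REDIRECT = """SIP/2.0 301 Moved Permanently
Contact: <sip:A-L14FE2.shudlab.net:5061;transport=tls>;q=0.7
Contact: <sip:A-L14FE1.shudlab.net:5061;transport=tls>;q=0.3
"""

# Captures host:port from the SIP URI and the optional q parameter after it.
CONTACT = re.compile(
    r"^Contact:\s*<sip:([^;>]+)[^>]*>(?:;q=([0-9.]+))?",
    re.IGNORECASE | re.MULTILINE,
)


def registrars_from_redirect(message):
    """Return host:port entries from Contact headers, highest q-value first."""
    # A Contact without a q parameter is treated as 1.0 in this sketch.
    contacts = [(float(q) if q else 1.0, host) for host, q in CONTACT.findall(message)]
    contacts.sort(key=lambda item: item[0], reverse=True)
    return [host for _, host in contacts]


primary, backup = registrars_from_redirect(SAMPLE_REDIRECT)
print("Primary:", primary)
print("Backup:", backup)
```

Under those assumptions, the higher-ranked contact is read as the Primary Registrar and the lower-ranked one as the Backup Registrar, which lines up with the two entries seen in the log.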

But Chris’ article talks about the endpointconfiguration.cache file.  When ClientUser2 successfully connected, he put ONLY his Primary Registrar into this file.  Because of this, on subsequent attempts, ClientUser2 will connect directly to A-L14FE2.shudlab.net instead of doing an SRV lookup, connecting to A-L14FE1.shudlab.net, and then getting a 301 redirect.  It’s because of this that Chris’ article mentions that you should still have multiple SRV records.  They are needed to handle this situation.

Taking a look at the endpointconfiguration.cache file on ClientUser2, we can see that the Backup Registrar is not cached.

Conclusion

Thanks for reading Part 2.  In this Part, I went through the sign-in process for each user.  Because the SRV record for sign-in was pointed to A-L14FE1.shudlab.net, the sign-in process was different for each user logging in.

In Part 3, we’ll do a failover test and a failback test with only a single SRV record in place.  We’ll take a look at what happens to ClientUser1 using the A-L14FE1 Pool and what happens to ClientUser2 using the A-L14FE2 Pool when we take down one of the Pools.  Finally, we’ll take a look at what happens when the Pool that went down comes back online.

To read Part 3 of this article series, click here.

 


Lync 2010 Central Site Resilience w/ Backup Registrars, Failovers, and Failbacks – Part 1

Introduction

I’ve seen quite a bit of discussion on Lync 2010 Backup Registrars.  There have been some PowerPoints which show how a Backup Registrar works, there have been some blogs that discuss how clients are handed back a primary registrar and backup registrar, etc.  What I have not seen is one article that wraps everything together and shows it all in action.  That is what this multi-part article is going to do.

In this first part, we’ll go over what the lab setup is going to look like and take a look at the base topology and configuration before we really start diving into the actual scenario testing.

Part 1

Part 2

Part 3

General Information

Now the two best blog articles (both from Microsoft employees) I’ve found in regards to Backup Registrars (other than the official Planning for Central Site Resilience which you can find here) are:

To summarize the articles above and set the stage for my article: you will see below that I am not including a Director.  The key point Doug makes in his article is that when a Lync 2010 Registrar (in Doug’s article it is a Director) redirects a user to their home pool, that 301 redirect contains the user’s Primary Registrar and their Secondary (Backup) Registrar.  Now any Registrar in the deployment can accept a user login, and if that registrar is not the home pool for the given user, it issues a 301 redirect for that user.  One of the jobs of a Director is to always issue 301 redirects, as one of its purposes is to handle client logins.  In my article, I have two pools and one pool will always be authenticating all users.  So in my case, I can simulate the 301 redirect piece by having Client2, which is homed on Pool2, sign into Pool1.  That client will then get the 301 redirect.  We’ll take a look at the SIP traces later.

Chris Norman then makes a point that there is a file called Endpointconfiguration.cache which caches your last connected server.  In my example above, once Client2 gets a 301 redirect to his pool, he knows about his Primary and Backup Registrar.  On subsequent connects, he will use the information in his Endpointconfiguration.cache and will no longer get a 301 redirect and he won’t know about his Backup Registrar anymore.  Because of this, we need another mechanism for clients to be able to find another registrar in the environment.  And Chris Norman mentioned having additional SRV records with a different priority.  This way, if Client2 looked in his Endpointconfiguration.cache file, saw that he should connect to Pool2, and Pool2 happened to be down or it went down while he was connected, the client would find the other SRV record, know how to connect to Pool1, and voila, he is connected.

That is a very quick summary.  I would recommend giving the above two blog articles a read.  Or just read on as I will explain everything and show multiple scenarios.  Throughout the rest of the article series, I will be showing the following scenarios:

  • Showing Failover without additional SRV records.  What happens to ClientUser1 on A-L14FE1 (Pool1) when we take down one of the Pools?  What happens to ClientUser2 on A-L14FE2 (Pool2) when that pool goes offline?
  • How does failback work?
  • What changes after we’ve cached our Primary Registrar for both users in the Endpointconfiguration.cache file?
  • What changes if we add a secondary SRV record?

Lab Setup

Guest Virtual Machines

One Server 2008 R2 Enterprise (Standard can be used) SP1 x64 Domain Controller with Certificate Services installed as the Enterprise Root Certificate Authority.

Two Server 2008 R2 Standard (Enterprise can be used) x64 (x64 required) Member Servers where Lync Server 2010 is installed. Both of these Lync Server 2010 Servers are installed as two separate Lync Standard Edition Pools.

Two Windows 7 X64 Enterprise (Professional and Ultimate can be used instead) Client Machines where the Lync 2010 Client will be installed.  One client machine is using a user account to connect to one Standard Edition Pool.  The other client machine is using a separate user account to connect to the other Standard Edition Pool.

Assumptions

  • You have a domain that contains at least one Server 2003 Domain Controller (DC/GC).
  • The client machines have network connectivity and can talk to either Lync 2010 Front End Server.
  • You are using the latest updates on this Lync 2010 infrastructure.  At the time of writing this lab, we are utilizing Cumulative Update (CU) 5.

Computer Names

Lync 2010 Standard Edition Front End Server – A-L14FE1

Lync 2010 Standard Edition Front End Server – A-L14FE2

Domain Controller / Global Catalog / Root Enterprise CA – A-DC1

Domain Controller / Global Catalog – A-DC2

Client – A-Client1 (Lync User Account Homed on A-L14FE1)

Client – A-Client2 (Lync User Account Homed on A-L14FE2)

Configuration of Domain Controllers

Operating System: Windows Server 2008 R2 SP1

Processor: 1

Memory: 1024 MB static

Virtual Network Type: External NIC

Virtual Disk Type – System Volume (C:\): 60GB Dynamic

Note: In a real-world environment, depending on the needs of the business and environment, it is best practice to install your database and logs on separate disks/spindles. I installed Active Directory and Certificate Services on the same disks/spindles for simplicity’s sake in this lab.

Configuration of Lync 2010 Standard Edition Front End Servers

Operating System: Windows Server 2008 R2 SP1

Processor: 2

Memory: 2048 MB static

Virtual Network Type: External NIC

Virtual Disk Type – System Volume (C:\): 60 GB Dynamic

Configuration of Client Machines

Operating System: Windows 7 Enterprise X64

Processor: 1

Memory: 1024 MB dynamic (512 startup)

Virtual Network Type: External NIC

Virtual Disk Type – System Volume (C:\): 60 GB Dynamic

IP Addressing Scheme (Corporate Subnet)

IP Address – 192.168.1.x

Subnet Mask – 255.255.255.0

Default Gateway – 192.168.1.1

DNS Server – 192.168.1.51 (Primary, A-DC1) / 192.168.1.55 (Secondary)

Base Topology and Configuration

Lync 2010 Topology

What I did was create two Central Sites.  The A-L14FE1.shudlab.net Standard Edition Pool Server will go in Chicago.  The A-L14FE2.shudlab.net Standard Edition Pool Server will go in Detroit.  There wasn’t a particular reason I decided to go with two Central Sites.  I could have put both Standard Pools in the same site.  The Lync 2010 Topology does let you have two Pools in the same Central Site and set the other Pool as a Backup Registrar just like if both Pools were in separate Central Sites.   Now I’m not going to go into detail on why we would want to use multiple Central Sites, but here’s a breakdown of some of the reasons I can think of off the top of my head:

  • Want to have separate Edge Pools for Media Traffic.  That way, we can put one set of users in Chicago and have it use Chicago pipe for Chicago users and we can put the other set of users in the other pool (for example, Detroit which would most likely be an entirely different region much farther away) so those users use an entirely different pipe for Edge media traffic.
  • We want to associate a specific Survivable Branch Appliance (SBA) or Survivable Branch Server (SBS) with a specific Pool.  This is done by creating a Branch Site within the Central Site.  The Branch Site would contain the SBA or SBS and because the SBA or SBS is within the Branch Site which is in a Central Site, that SBA is associated with the Pools in that Central Site.
  • We want more granular control over Call Admission Control.  I’m not going to get into detail on CAC, but CAC uses something called Regions, each of which is associated with a Central Site.  If we have our Pools in different Central Sites, it allows us to create more Regions, which gives us more control over how we link Regions together and therefore, how we can route Audio/Video traffic across our sites.
  • A thing to note is that when you have a PSTN Gateway associated to a Central Site, that PSTN Gateway cannot be associated to a Mediation Server in another Central Site.

With that out of the way, let’s take a look at our topology from a Site level.  As stated, we have two Sites: Chicago and Detroit.  We only have one SIP Domain and that SIP Domain is shudlab.net.

As you can see, we have two Standard Edition Front End Pools: A-L14FE1.shudlab.net and A-L14FE2.shudlab.net.

If we take a look at the Properties of our A-L14FE1.shudlab.net Pool, we can see A-L14FE2.shudlab.net is set as our Backup Registrar.  I set the Failure detection interval (sec) to 30 seconds and the Failback interval (sec) to 40 seconds.  After around 30 seconds, a Lync 2010 client that is connected or connecting to a Primary Registrar that is no longer available will fail over to this Backup Registrar.  Once their Primary Registrar is back online and the client detects that it is operational, shortly after 40 seconds the Lync 2010 client will automatically sign out and sign back into their Primary Registrar.

The defaults for these values are 300 seconds for Failure detection interval and 600 seconds for Failback interval.  But because this is a lab, and I will be taking and embedding video in this article series for you to witness the failover and failback, I didn’t want you guys to wait for long hence the lower values I have set.
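To picture how the two intervals shape a test run, here is a small Python sketch of the expected timeline. It is a deliberately simplified model, not how the Lync client or server actually tracks registrar health. The function name `expected_registrar` is made up for illustration, and it assumes the outage lasts longer than the failure detection interval and that the client reacts exactly when each interval expires.

```python
def expected_registrar(t, down_at, up_at, failure_detection=30, failback=40):
    """Return which registrar a client is expected to use t seconds into a test.

    down_at / up_at are the times (in seconds) when the Primary Registrar
    goes offline and comes back online. Assumes the outage outlasts the
    failure detection interval.
    """
    if up_at < down_at:
        raise ValueError("up_at must not be earlier than down_at")
    if t < down_at + failure_detection:
        return "primary"  # outage not yet detected
    if t < up_at + failback:
        return "backup"   # failed over, waiting out the failback interval
    return "primary"      # failed back


# Lab values: NIC disabled at t=0 and re-enabled at t=300.
for t in (10, 45, 330, 341):
    print(t, expected_registrar(t, down_at=0, up_at=300))
```

Passing `failure_detection=300, failback=600` instead models the default values mentioned above, which is exactly why the lab uses shorter intervals.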

If we take a look at the A-L14FE2.shudlab.net Pool’s Resiliency settings, we will see a similar setup where A-L14FE1.shudlab.net is the Backup Registrar.

Note: Keep in mind that an SBA and/or SBS cannot be the Backup Registrar for a Pool.  You can, however, have a Pool be the backup Registrar for a SBA and/or SBS.  This is for registration only.  You can still have redundant voice routes that use the voice gateway and/or SIP Trunks in any location you would like.

Domain Name System (DNS)

If we take a look at the single SRV record currently in our environment, we see it is pointing to the Chicago Pool, A-L14FE1.shudlab.net.  We can point it directly to the pool name instead of something like sip.shudlab.net because AD and our single SIP namespace are both using shudlab.net.  If your SIP namespace and your AD namespace (which the Pool uses) are different, you would have a SAN name on your Pool containing sip.<sip domain> for each SIP domain in your environment.  The SRV record for each namespace would point to its corresponding sip.<sip domain>.  While we could still use sip.shudlab.net, I would rather use the Pool FQDN instead so when a user connects, we can see in the logs that it is connected to the Pool FQDN rather than sip.shudlab.net.  You’ll see what I’m talking about when we start looking at logs.  I’ll make a mention of it.

Client Connectivity

We have 2 Client Computers both running Windows 7 x64 with Lync 2010 CU5 client installed:

  • A-Client1.shudlab.net
  • A-Client2.shudlab.net

We have 2 Users:

  • ClientUser1 (ClientUser1@shudlab.net)
  • ClientUser2 (ClientUser2@shudlab.net)

We have 2 Pools:

  • A-L14FE1.shudlab.net
  • A-L14FE2.shudlab.net

Everything with 1 is aligned together and everything with 2 is aligned together.  Therefore:

  • ClientUser1 is using A-Client1 computer which is associated to the A-L14FE1 Pool
  • ClientUser2 is using A-Client2 computer which is associated to the A-L14FE2 Pool

Conclusion

Thanks for reading Part 1.  In this Part, I went over what the lab setup looks like and took a look at the base topology and configuration before we really start diving into the actual scenario testing.

In Part 2, I will go through the sign-in process for each user.  Because the SRV record for sign-in is pointed to A-L14FE1.shudlab.net, the sign-in process will be different for each user logging in.  We’ll take a look at this in detail.

To read Part 2 of this article series, click here.
