Understanding Device Roaming Behavior Crucial to Effective Troubleshooting

I cannot tell you how many times I have been at a client site performing Wi-Fi troubleshooting where the system works fine for most devices but not for the one device type the client needs for their productivity.  The common reaction from the user’s point of view is, “The Wi-Fi here sucks.”  Without an understanding of how 802.11 works, with all the intricate pieces and parts that must live harmoniously to deliver a satisfying experience, users often blame the wireless as a whole instead of the one part that might be failing to perform appropriately.

Not that long ago I was at a client site in Reno, Nevada where they had engaged a couple of other vendors to troubleshoot why their handheld scanners kept losing connectivity.  My organization was brought in because the other organizations had recommended solutions but had not been able to truly correct the issues.  Please understand that I am not knocking other vendors and their capabilities; rather, I am providing the background as it was presented to me.

After confirming that signal coverage was sufficient in all required areas and that there were no obvious sources of interference, which eliminated two common causes of connectivity problems, I proceeded to observe the device’s roaming behavior to determine whether the device itself was having issues.  I discovered that the device would not initiate its roaming algorithm until the signal strength had dropped far below what the system was designed for.  As a result it exhibited sticky behavior, staying associated to a distant access point when closer access points would have been better choices.  Fortunately, the driver for this device had an option to modify its roaming aggressiveness, and changing that setting solved the issue.  While explaining my findings to a group of client representatives, I was asked, “We have had multiple people out to look at this issue and you are the only one to recommend modifying the device settings.  Why is it that no one else even looked at them as a possible solution?”  My truthful answer was, “I don’t know.”

Even today, I do not understand why there is such an emphasis on the infrastructure side and on tuning the system to a device when perhaps we should be looking at it from the other direction.  Okay, bad infrastructure design accounts for a large portion of Wi-Fi issues, and that is a larger discussion for another day; however, if the system is properly designed and the majority of devices using it are functioning properly, why do we blame the system when one type, model, or even driver revision is not playing nice?

Let me offer up some points to consider when looking at troubleshooting a challenging device:

Association and re-association requests are initiated by the client

These transmissions are directed at a specific AP via its BSSID.  This means the device has already decided where to send these frames, and that decision may not have been influenced by the infrastructure.  There are infrastructure methods to help with device decision making, such as 802.11k, 802.11v, and 802.11r, but not all devices and drivers support them at this time.  Additionally, the manufacturer’s internal algorithm may or may not use the provided information in the way we would expect.
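To make that last point concrete, here is a minimal sketch of a client picking the BSSID for its next (re)association request.  Everything in it is hypothetical: the function name, the `honors_hints` flag, and the BSSIDs are made up for illustration, and real drivers use proprietary logic that is usually far more involved.  It only shows that a neighbor list from the infrastructure is an input the client may or may not act on.

```python
def pick_target_bssid(scan_results, neighbor_hint=None, honors_hints=False):
    """Return the BSSID the client will send its (re)association request to.

    scan_results:  dict of BSSID -> RSSI in dBm as heard by the client.
    neighbor_hint: optional set of BSSIDs suggested by the infrastructure
                   (a stand-in for an 802.11k-style neighbor list).
    honors_hints:  hypothetical flag; whether this driver uses the hint at all.
    """
    candidates = scan_results
    if honors_hints and neighbor_hint:
        hinted = {b: r for b, r in scan_results.items() if b in neighbor_hint}
        if hinted:  # fall back to the full scan if no hinted AP was heard
            candidates = hinted
    # The client, not the infrastructure, makes the final choice.
    return max(candidates, key=candidates.get)


scan = {
    "aa:bb:cc:00:00:01": -60,
    "aa:bb:cc:00:00:02": -58,
    "aa:bb:cc:00:00:03": -65,
}
hint = {"aa:bb:cc:00:00:01", "aa:bb:cc:00:00:03"}

print(pick_target_bssid(scan, hint, honors_hints=True))   # aa:bb:cc:00:00:01
print(pick_target_bssid(scan, hint, honors_hints=False))  # aa:bb:cc:00:00:02
```

Same scan, same hint, two different targets: the only thing that changed is how the client chose to treat the information it was given.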

Accounting for the above point, all roaming decisions are made by the client

I know I am likely to get some people arguing this point; however, if you read any of the documentation on any special “roaming assistance” feature you will likely see phrases such as, “assists with”, “helps with”, etc.  This is because there is no sure-fire, guaranteed method to make a given client associate with a specific AP – unless it is the only AP in range.

All client device drivers use some derivative of signal strength to trigger their roaming algorithm

I have seen devices that will not connect to the physically closest AP because the one they are associated to has not yet fallen below whatever criteria the driver uses to start the roaming process.  Whether it is actual dBm, an RSSI value, Signal Quality, or some other named metric, signal strength is a primary factor in this calculation.  As such, if a system has been designed for a single device type or model and additional devices do not adhere to similar requirements, then changing the misbehaving device’s settings is more logical than retuning the system, provided those settings can be modified appropriately.  In cases where this cannot be accomplished, finding a “middle of the road” signal strength may be your only solution.  It is not ideal, but it will likely be workable.
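The sticky behavior described above can be sketched in a few lines.  This is a simplified model under my own assumptions, not any vendor’s actual algorithm: the `roam_trigger_dbm` and `min_improvement_db` parameters, the AP names, and the dBm values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class AccessPoint:
    name: str
    rssi_dbm: int  # signal strength as heard by the client


def choose_ap(current, candidates, roam_trigger_dbm, min_improvement_db=5):
    """Return the AP the client is on after one roaming evaluation."""
    # Above the trigger the client does not even look for alternatives.
    if current.rssi_dbm > roam_trigger_dbm:
        return current
    best = max(candidates, key=lambda ap: ap.rssi_dbm)
    # Only move if the best candidate is meaningfully stronger.
    if best.rssi_dbm - current.rssi_dbm >= min_improvement_db:
        return best
    return current


far = AccessPoint("AP-far", -72)
near = AccessPoint("AP-near", -55)

# Trigger set well below the design target: the client stays put.
print(choose_ap(far, [far, near], roam_trigger_dbm=-80).name)  # AP-far

# Trigger raised (more aggressive roaming): the client moves.
print(choose_ap(far, [far, near], roam_trigger_dbm=-70).name)  # AP-near
```

Nothing about the infrastructure changed between the two calls.  The much stronger AP was there the whole time; the client simply was not looking for it.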

Don’t forget drivers are software and can have “bugs” too

Keeping drivers updated is just as important as Windows updates.  However, in the case of Windows, not all drivers are created equal; occasionally you may have to recommend using a manufacturer-specific driver over the native one Windows provides.  I have found that Windows-supplied drivers are not always the most up to date, so I have had to encourage clients to test drivers direct from the manufacturer to find the solution.  Even then, those aren’t always perfect.  Sometimes it takes a bit of trial and error to find the right one.

Not all drivers and controller / AP firmware are compatible with one another

I have seen things go haywire with devices that worked just fine one day: the client performs a firmware upgrade and the devices start having issues, sometimes not revealing them until hours or even days later.  As much as we have standards and testing bodies attempting to ensure compatibility, it does not always work out that way.  Also remember the point above about software bugs.

Some devices, especially older ones, can have specific data rate requirements for association

I encounter this point less and less, but it is one to consider, especially when dealing with a legacy device.  Some older drivers require lower data rates to be enabled, or even set as “basic” or “mandatory”, before they will transmit an association request.  Even though these requirements may not be common knowledge, a little Google action might reveal something crucial.  Also, you may have no other choice but to support lower data rates, at least until your client can get rid of those devices.
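Once you do know what a legacy device needs, comparing that against the WLAN’s rate configuration is a simple set comparison.  The sketch below assumes you have both lists in hand; the rate values and the device’s requirements are made up for illustration and should be replaced with what the manufacturer’s documentation and your own configuration actually say.

```python
def missing_rates(required_mbps, configured_mbps):
    """Return the required rates (in Mbps) that are absent from the configuration."""
    return sorted(set(required_mbps) - set(configured_mbps))


# Hypothetical WLAN with the low rates turned off.
ap_enabled = {12, 18, 24, 36, 48, 54}
ap_basic = {12, 24}

# Hypothetical legacy scanner requirements.
legacy_needs_enabled = {1, 2, 5.5, 11}
legacy_needs_basic = {1, 2}

print(missing_rates(legacy_needs_enabled, ap_enabled))  # [1, 2, 5.5, 11]
print(missing_rates(legacy_needs_basic, ap_basic))      # [1, 2]
```

An empty list from both checks means the rate configuration is not what is keeping the device from sending its association request, and you can move on to the next suspect.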

In summary, when working on a troubleshooting issue I often see clients attempting to “tweak” the infrastructure to resolve the problem.  Many times, this approach introduces new issues instead of fixing the original one.  If we look at any given challenge with an open mind and do not get locked into viewing it from only one angle, we open the door to other possible solutions and can find the best course of action.  As WLAN experts, our knowledge base should not stop with the infrastructure; rather, it should cover the system as a whole ecosystem that depends on all of its parts working in harmony to deliver the desired user experience.
