I'm trying to get my Linux laptop to roam reliably from AP to AP. Right now I'm using the iwlwifi + iwd + NetworkManager combo to manage everything, and apart from roaming everything works reasonably well and I'm happy with it.
But I'm unable to get a reliable roaming experience.
My laptop clings to an AP that is far away (below -80 dBm on a 40 MHz channel) and completely ignores an AP that is literally next to it (-60 dBm, 160 MHz channel). The connection to the "weak AP" has a lot of transmission errors, traffic stalls for a few seconds at a time, and so on. The experience is horrible; Wi-Fi basically does not work.
I should also mention that this is a 5/6 GHz Wi-Fi network. I know that 2.4 GHz networks suffer from this problem on a much larger scale, because 2.4 GHz reaches much further than the higher bands.
(Not the) Sticky client problem
This problem happens at multiple locations with different network vendors (Ubiquiti vs Aruba), so my best guess is that the Wi-Fi network itself is not the problem. As @Peregrino69 mentioned (thank you!), the correct name for this is the sticky client problem, and the common solution is to force the client to roam, usually by setting an RSSI threshold below which the AP sends a DEAUTH packet to the client, thus forcing it to reassociate/roam.
But the sticky client problem means that the client is unwilling to switch to another AP. From the IWD logs I can clearly see that it is trying to roam but failing, meaning I don't have this problem.
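For reference, the AP-side RSSI threshold idea mentioned above boils down to something like the following. This is only a minimal sketch: the function name, the -75 dBm threshold and the sample readings are all made up for illustration and do not reflect how Ubiquiti or Aruba actually implement it.

```python
# Hypothetical AP-side "minimum RSSI" policy. The threshold value and the
# function name are illustrative assumptions, not any vendor's real API.
MIN_RSSI_DBM = -75


def should_deauth(client_rssi_dbm: float, min_rssi_dbm: float = MIN_RSSI_DBM) -> bool:
    """Return True when the AP would kick the client to force it to roam."""
    return client_rssi_dbm < min_rssi_dbm


# A client next to the AP versus one that is far away.
for rssi in (-60, -80):
    print(f"{rssi} dBm -> deauth: {should_deauth(rssi)}")
```

This only helps when the client refuses to leave on its own, which is not what the logs show in my case.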
Debugging IWD?
I downloaded the sources and added a lot of debug output so I could figure out what is happening and where. I know that IWD is the piece of software responsible for roaming, and I know it tries to roam but fails.
01| station_roam_scan_notify() Will do scan & prepare for roaming [BSS=AP-FAR-AWAY SSID='xxx'
02| station_roam_scan_notify() |- Candidate [BSS=AP-CLOSE SSID='xxx' FREQ=5180 RANK=589 DBM=-7500 DATARATE=72.0]
03| station_roam_scan_notify() | |- This IS a roaming candidate!
04| station_roam_scan_notify() |- Candidate [BSS=AP-FAR-AWAY SSID='xxx' FREQ=5180 RANK=140 DBM=-8100 DATARATE=17.2]
05| station_roam_scan_notify() | |- SKIPPING: Already connected
06| station_roam_scan_notify() Calling station_transition_start()
07| station_transition_start() Starting roaming process - will try BSS one by one
08| station_transition_start() AP-CLOSE| Trying to roam
09| station_try_next_transition() Trying to roam [IF=36 TGT=AP-CLOSE]
10| station_fast_transition() ft_authenticate() branch taken
11| wiphy_radio_work_insert() Inserting work item 6
12| station_fast_transition() wiphy_radio_work_insert() -> FT / PRIO_CONNECT
13| wiphy_radio_work_insert() Inserting work item 7
14| station_try_next_transition() Trying Fast Transition [SUCCESS]
15| station_transition_start() AP-CLOSE| Successfully started roaming!
16| station_transition_start() Roaming process SUCCESFULLY started!
17| wiphy_radio_work_next() Starting work item 6
18| netdev_mlme_notify() MLME notification Frame TX Status(60)
19| netdev_unicast_notify() Unicast notification Frame(59)
20| netdev_mlme_notify() MLME notification Remain on Channel(55)
21| netdev_mlme_notify() MLME notification Cancel Remain on Channel(56)
22| wiphy_radio_work_done() Work item 6 done
23| wiphy_radio_work_next() Starting work item 0
24| ft_associate() We got new BS but did not auth YET!!
25| station_ft_work_ready() ft_associate(AP-CLOSE) returned -2
26| station_ft_work_ready() Calling station_transition_start() as -ENOENT
27| station_transition_start() Starting roaming process - will try BSS one by one
28| station_transition_start() Roaming process FAILED for all candidate BSS
29| station_roam_failed() 36
30| station_roam_failed() [SIGNAL_LOW=1 AP_DIRECTED=0]
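To compare the roam candidates without reading the raw output, the two "Candidate" lines can be pulled apart with a small script. This is a rough sketch that only understands the custom debug format shown above (it is my own output, not stock IWD logging), and it assumes the DBM field is in hundredths of a dBm, so -7500 would mean -75 dBm.

```python
import re

# Two candidate lines copied verbatim from the log above.
LOG = """\
02| station_roam_scan_notify() |- Candidate [BSS=AP-CLOSE SSID='xxx' FREQ=5180 RANK=589 DBM=-7500 DATARATE=72.0]
04| station_roam_scan_notify() |- Candidate [BSS=AP-FAR-AWAY SSID='xxx' FREQ=5180 RANK=140 DBM=-8100 DATARATE=17.2]
"""

# Matches only the custom debug format used in this post.
CANDIDATE = re.compile(
    r"Candidate \[BSS=(?P<bss>\S+) SSID='[^']*' FREQ=(?P<freq>\d+) "
    r"RANK=(?P<rank>\d+) DBM=(?P<mbm>-?\d+) DATARATE=(?P<rate>[\d.]+)\]"
)


def parse_candidates(text):
    """Return candidate BSS entries sorted by rank, best first."""
    found = []
    for line in text.splitlines():
        m = CANDIDATE.search(line)
        if m:
            found.append({
                "bss": m["bss"],
                "freq_mhz": int(m["freq"]),
                "rank": int(m["rank"]),
                # Assumption: the DBM field is in 1/100 dBm.
                "dbm": int(m["mbm"]) / 100,
                "rate": float(m["rate"]),
            })
    return sorted(found, key=lambda c: c["rank"], reverse=True)


candidates = parse_candidates(LOG)
for c in candidates:
    print(f'{c["bss"]:12} rank={c["rank"]:4} signal={c["dbm"]:.1f} dBm')
```

Sorted by rank, AP-CLOSE comes out first, which matches what IWD picked as the roam target.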
802.11r be damned
From this log I assume the following:
IWD would like to roam as expected (from AP-FAR-AWAY to AP-CLOSE, which makes complete sense)
We are trying to roam using 802.11r (FT)
We are not yet authenticated to AP-CLOSE, as we have not (yet) received the necessary information from the driver
We cannot associate to the new AP -> FT roaming fails
I did some more testing: I disabled 802.11r (FT) and, lo and behold, the computer now roams as I would expect it to. Yes, there is a (short) outage when roaming, as we have to do a full authentication to the new AP, but it works.
Nothing is ever easy
I reached out to Denis Kenzior (one of the authors of IWD) on IRC, and he told me that FT should indeed work just fine. After some digging we (well, Denis) found that the main culprit is timing.
Technical details: Without going into too much technical detail, the issue seems to be a race condition involving (probably) the Intel AX210 firmware and the Ubiquiti AP. Ubiquiti sends the FT association frame, but for some unknown reason it sends it much later than IWD/the Intel AX210 firmware is willing to wait. The log above paints (almost) the full picture, and that's why I'm keeping it there: IWD waits for the FT association frame between MLME notification Remain on Channel(55) (line 20) and MLME notification Cancel Remain on Channel(56) (line 21). That frame contains the information ft_associate (line 24) needs to work properly. The frame is received after this log snippet ends, which is of course too late.
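The timing problem described above can be modelled roughly like this. It is a toy sketch, not IWD code: the class and function names are hypothetical, and the 200 ms window and the arrival times are invented numbers. The only values taken from the log are the -ENOENT (-2) result and the idea that the frame must arrive between the "Remain on Channel" and "Cancel Remain on Channel" notifications.

```python
import errno
from dataclasses import dataclass
from typing import Optional


@dataclass
class OffChannelWindow:
    """Period during which the client listens for the AP's FT frame."""
    start_ms: float  # "Remain on Channel" notification
    end_ms: float    # "Cancel Remain on Channel" notification


def ft_associate_result(window: OffChannelWindow,
                        frame_arrival_ms: Optional[float]) -> int:
    """Toy model: 0 if the frame arrived inside the window, else -ENOENT."""
    in_window = (frame_arrival_ms is not None
                 and window.start_ms <= frame_arrival_ms <= window.end_ms)
    return 0 if in_window else -errno.ENOENT


# Invented numbers: a 200 ms window, a fast AP and a slow AP.
window = OffChannelWindow(start_ms=0.0, end_ms=200.0)
fast_ap = ft_associate_result(window, frame_arrival_ms=50.0)
slow_ap = ft_associate_result(window, frame_arrival_ms=350.0)
print("fast AP:", fast_ap)
print("slow AP:", slow_ap)
```

With the slow AP the frame lands after the window has closed, so the association step has nothing to work with and fails with the same -ENOENT seen on line 25 of the log.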
We are getting quite deep into the kernel/iwlwifi here. Hopefully we can either find a workaround for this issue or fix it.
I will keep this post as a kind of notepad for myself and for anyone who might have a similar problem.
For now it seems that this issue may happen if you have an AX210 card and a slow-ish AP (like my Ubiquiti at home).