mediatek-filogic: weird and recurring wifi instability after few minutes #3305
Comments
Can you check if the tx retries / tx failed counters from |
The counters are slightly increasing, but most of the time they are constant.
[Graph: tx failed]
[Graph: tx retries]
batctl p towards some mesh partner often does not work either, with packet loss above 90%. |
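To put a number on the loss mentioned above, the summary line of a ping run can be reduced to its percentage. This is only a sketch: `loss_percent` is a hypothetical helper name, and it assumes that `batctl p` ends with a ping-style summary line such as `10 packets transmitted, 1 received, 90% packet loss`.

```shell
# Print the packet-loss percentage found in a ping-style summary on stdin.
# Assumption: the summary contains "<N>% packet loss", as regular ping prints it.
loss_percent() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Possible use on a node (replace <neighbor-mac> with a mesh partner):
#   batctl p -c 10 <neighbor-mac> | loss_percent
```

Logging this value once per minute next to the TQ graph could show whether the loss and the TQ drops line up.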
I just tested the MTK patch. It looked good until I reloaded the driver at about 7:20. The logread output still does not hint at anything useful. |
A driver hiccup is probably the most likely cause. But I still wanted to ask, as it is not clear to me from these graphs alone: Has external traffic been ruled out as the cause of these losses? Is there an airtime graph available? Does it correlate with route changes in batman-adv?
(Years ago, in a test with a specific setup that still used 802.11g, I saw funny route flapping and TQ changes/breakdowns caused by unicast traffic, due to a hidden node problem: https://www.open-mesh.org/projects/batman-adv/wiki/Bcast-hidden-node. There, the route would oscillate between the good two-hop route and a bad, direct one-hop route, even with CTS/RTS enabled for unicast traffic. Traffic over the two-hop route would interfere with the batman-adv OGM broadcasts, causing a breakdown in TQ and a switch to the one-hop route. Then the TQ would improve and things would switch back to two-hop. Rinse and repeat. The hidden node problem should usually be quite rare, though. Maybe it is even less likely with newer 802.11 revisions and the improvements therein?) |
Thanks for looking into this, @T-X.
In our test setup we experienced the same problems with a WAX220:
There seems to be another test installation with WR3000 and WAX220 which does not have such issues according to its Grafana:
Remotely, I can only debug by checking the TQ to mesh partners (if there are any). The actual symptom is that the wifi is unusable as a client when connected to the node during the times in which the device has 3-4% TQ to its neighbors (even though it might have 100% TQ to mesh-vpn, so it surely is the wifi driver). I don't think it is a flapping route in batman, though I cannot rule this out completely. |
First of all: the issue is still present in the latest firmware with updated openwrt, as well as on openwrt master. I just noticed that, a few firmware iterations ago, the max TQ and min TQ were both fluctuating:
The latest v2023.2.x firmware has a solid max TQ but a still varying min TQ, which seems broken. The solid max TQ can be seen here (WAX220):
The same bad mesh symptoms were found on the NWA55AXE as well.
openwrt master
I built firmware from openwrt master to test, though it sometimes did not load the wifi driver at all (on the WR3000) and showed the above behavior on the WAX220 as well. So the problem is still not solved in openwrt master as of August 2024 (commit 5d2a008670122f3f69eb3ab4f776d9fe9b6d76dd). |
The joy did not last long :) We are back to the old behavior here. Possibly this is a problem for the NWA50AX Pro only (WR3000 and WAX220 are fine for now). Just a short update so that people do not get too enthusiastic 😆 |
From a debugging session with @blocktrron:
First, activate the mt76 debug logs:
cd /sys/kernel/debug/ieee80211/phy0/mt76
echo 1 > fw_debug_wm
Now one can see recovery attempts of the firmware with
This can be triggered using sys_recovery:
Sometimes, it looks like writing a higher value is needed to recover correctly.
The only affected device for now is the Zyxel NWA50AX Pro.
Just a log of which sys_recovery value triggers which debug message:
echo 0 > sys_recovery
[98632.738283] ieee80211 phy0: WM: (4122.883124:73:SER-W)wsysSerExtCmd: action(0), ser_set(0)
echo 1 > sys_recovery
Thu Jan 9 22:09:47 2025 kern.info kernel: [98636.763965] ieee80211 phy0: WM: (4126.908972:50:SER-W)wsysSerExtCmd: action(2), ser_set(2)
Thu Jan 9 22:09:47 2025 kern.info kernel: [98636.772250] ieee80211 phy0: WM: (4126.909064:51:SER-W)wsysSerExtCmd: action(3), ser_set(1)
echo 2 > sys_recovery
Thu Jan 9 22:09:51 2025 kern.info kernel: [98641.198233] ieee80211 phy0: WM: (4131.343085:92:SER-W)wsysSerExtCmd: action(2), ser_set(4)
Thu Jan 9 22:09:51 2025 kern.info kernel: [98641.206519] ieee80211 phy0: WM: (4131.343329:93:SER-W)wsysSerExtCmd: action(3), ser_set(2)
echo 3 > sys_recovery
Thu Jan 9 22:09:58 2025 kern.info kernel: [98648.367482] ieee80211 phy0: WM: (4138.512488:16:SER-W)wsysSerExtCmd: action(2), ser_set(8)
Thu Jan 9 22:09:58 2025 kern.info kernel: [98648.375777] ieee80211 phy0: WM: (4138.512579:17:SER-W)wsysSerExtCmd: action(3), ser_set(3)
echo 4 > sys_recovery
Thu Jan 9 22:10:04 2025 kern.info kernel: [98654.032242] ieee80211 phy0: WM: (4144.177252:38:SER-W)wsysSerExtCmd: action(2), ser_set(16)
Thu Jan 9 22:10:04 2025 kern.info kernel: [98654.040621] ieee80211 phy0: WM: (4144.177344:39:SER-W)wsysSerExtCmd: action(3), ser_set(4)
echo 5 > sys_recovery
Thu Jan 9 22:10:08 2025 kern.info kernel: [98658.404888] ieee80211 phy0: WM: (4148.549689:45:SER-W)wsysSerExtCmd: action(2), ser_set(32)
Thu Jan 9 22:10:08 2025 kern.info kernel: [98658.413271] ieee80211 phy0: WM: (4148.549963:46:SER-W)wsysSerExtCmd: action(3), ser_set(5)
echo 6 > sys_recovery
Thu Jan 9 22:10:17 2025 kern.info kernel: [98666.681804] ieee80211 phy0: WM: (4156.826758:66:SER-W)wsysSerExtCmd: action(2), ser_set(64)
Thu Jan 9 22:10:17 2025 kern.info kernel: [98666.690181] ieee80211 phy0: WM: (4156.826880:67:SER-W)wsysSerExtCmd: action(3), ser_set(6) |
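To compare such runs more easily, the `action(...)` and `ser_set(...)` values can be pulled out of the log lines. The following is a minimal sketch that assumes the `wsysSerExtCmd` line format shown in the log above; `parse_ser_lines` is a hypothetical helper name, not part of mt76 or Gluon.

```shell
# Print "action ser_set" pairs for every wsysSerExtCmd line read on stdin.
# Assumption: lines look like "...wsysSerExtCmd: action(2), ser_set(4)".
# Lines that do not match are silently skipped.
parse_ser_lines() {
    sed -n 's/.*wsysSerExtCmd: action(\([0-9][0-9]*\)), ser_set(\([0-9][0-9]*\)).*/\1 \2/p'
}

# Possible use on a node, after enabling fw_debug_wm:
#   logread | parse_ser_lines
```

In the log above, the first line after each write (action 2) carries a ser_set value that doubles with every step (2, 4, 8, 16, 32, 64), while the second line (action 3) repeats the value that was written.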
The NWA50AX-Pro had the issue again this morning. Echoing any value lower than 7 to sys_recovery had no effect on recovery.
The issue occurred at 07:05.
While broken, a lot of
Echoing 7 did a full recovery and worked:
Other filogic devices like the WR3000 and WAX220 are also affected, but recover by themselves in under 2 hours. Note that |
General instability has been seen on mediatek filogic devices with mt7915e, especially on the WR3000, WAX220 and others.
It has to be noted that some devices work better than others. Heavy wifi meshing seems to make the situation worse.
What is the problem?
An example of this behavior is this device:
https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=80afca06d558&from=1718344052951&to=1718403869219&viewPanel=13
which shows a strongly varying TQ.
The latest finding is this:
https://grafana.ffac.rocks/d/000000002/node?orgId=1&var-node=80afca06d558&from=1720175532350&to=1720193698710&var-select_hostname=ffac-seilpforte-wr3000&var-hostname=ffac-seilpforte-wr3000&var-saveinterval=1m&var-nodetolink=0c0e76cf5d5e&viewPanel=13
At 1. I restarted the wifi driver using
rmmod mt7915e && modprobe mt7915e
At 2. I added another mesh device with which this device could mesh on mesh1, which created the timeout issue, and the device was not able to reload the firmware
At 3. I restarted the device, as nothing else helped.
Afterward, the oddly changing TQ can be seen, rising and falling in waves.
The current workaround includes reloading the mt7915e driver and rebooting the device once the mt7915e bug from #3154 occurs.
A package for this can be found here: https://github.com/ffac/gluon-packages/tree/main/ffac-mt7915-hotfix/files/lib/gluon/mt7915
As @nrbffs also noted on IRC, some other people have reported instability with these devices as well. Currently, reloading the wifi driver twice a day seems to help in this situation.
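A periodic reload along these lines could look roughly like the sketch below. It is not the code of the linked hotfix package: `reload_mt7915e` and the `DRY_RUN` switch are hypothetical names introduced here, and only the `rmmod`/`modprobe` pair is taken from this issue.

```shell
# Reload the mt7915e driver, as done manually above.
# With DRY_RUN=1 the commands are only printed, which is useful for testing.
# reload_mt7915e and DRY_RUN are hypothetical names for this sketch.
reload_mt7915e() {
    for cmd in "rmmod mt7915e" "modprobe mt7915e"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$cmd"
        else
            # Stop at the first failing step so a failed unload
            # is not followed by a pointless load attempt.
            $cmd || return 1
        fi
    done
}

# A cron entry for "twice a day" might look like this
# (the script path is a placeholder, not a real file):
#   0 */12 * * * /path/to/reload-script
```

Running it with `DRY_RUN=1` first shows what would be executed without touching the driver.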
This issue is not about #3154 but about the weird changing TQ leading to bad mesh quality and wifi quality.
What is the expected behaviour?
Mesh and wifi quality should be stable on mediatek filogic devices such as the WR3000.
Further steps
ls /sys/kernel/debug/ieee80211/phy*/mt76
to find something
TX_Stats
I found that on other devices
cat /sys/kernel/debug/ieee80211/phy1/mt76/tx_stats
only shows values for 1 to 4, while the affected WR3000 has values for 1 to 8.
I do not really know whether this is related or not; it is just a finding.
Gluon Version:
v2023.2.3
Site Configuration:
ffac @ v2023.2.3-2
Custom patches:
see site