OpenNSL 3.5.0.1 report #76
Comments
Thank you for reporting this. Could you please share more details about the crash dump? For example, the stack trace etc.? |
Hi, I'm reporting the absence of a crash and recommending that you reconsider, or provide more details of, the crash referred to here: Line 118 in 346e30c
Essentially "It works for us". |
Hi @bluecmd, are you sure you didn't have to do anything else to get OpenNSL 3.5.0.1 working? I can definitely believe that the crash in opennsl_pkt_alloc() has been fixed, but a number of other changes were required to get OpenNSL 3.5.0.1 working -- trivially, opennsl_driver_init()'s prototype changed -- see my changes here to at least get it to compile: #65. And even after it compiled, it was my experience that all of the packet forwarding was broken because the initialization process was quite different. If you have it working, we'd definitely like to understand how, because if we could update to OpenNSL 3.5.0.1, we could unlock a bunch of previously unreleased changes (e.g., ACLs) that depend on newer versions. Please confirm and let us know - thanks as always for the interest! |
Hi @capveg. I admit it's a bit sneaky, but if you click on "OpenNSL 3.5.0.1" in my report you get the diff of the patch, and you'll see the actual code changes that we did. Since we have Wedges graciously donated from FB running with ONL + FBOSS we're more than happy to help you collect any data that you need to debug any issues, but as far as we've seen It Just Works(TM) with the somewhat trivial patch of essentially only changing the EDIT: Direct link to what I'm talking about here: https://github.com/dhtech/fboss/pull/4/files#diff-941e4fb204c29b957373093d97373880 |
Hmm... so your patch looks effectively identical to my patch... so I'm wondering why yours works. I saw in one of the comments there a "Status: not working" - can you clarify? Just because the FBOSS agent logs "sending lldp to X" doesn't necessarily mean it's happening. Are you seeing that packet received on the other side? Sorry if this seems pedantic - but we've been (admittedly, slowly) debugging this for a while... |
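For anyone wanting to check the receiving side rather than trusting the agent log, here is a minimal sketch of classifying raw Ethernet frames by EtherType. It is not part of FBOSS or OpenNSL; the function name `classify_frame` and the set of recognised types are my own assumptions, and it only inspects bytes you have already captured on the peer (e.g. exported from a packet capture). It skips 802.1Q/802.1ad VLAN tags before reading the EtherType.

```python
import struct

# Well-known EtherType values (IEEE-assigned).
ETHERTYPE_NAMES = {
    0x88CC: "LLDP",
    0x0806: "ARP",
    0x0800: "IPv4",
    0x86DD: "IPv6",
}

# Tag protocol identifiers for 802.1Q and 802.1ad VLAN tags.
VLAN_TPIDS = {0x8100, 0x88A8}


def classify_frame(frame: bytes) -> str:
    """Return a short label for a raw Ethernet frame (hypothetical helper)."""
    # Destination MAC (6) + source MAC (6) + EtherType (2) = 14 bytes minimum.
    if len(frame) < 14:
        return "truncated"

    offset = 12
    (ethertype,) = struct.unpack_from("!H", frame, offset)

    # Each VLAN tag is 4 bytes: TPID (2) + TCI (2); the real EtherType follows.
    while ethertype in VLAN_TPIDS:
        offset += 4
        if len(frame) < offset + 2:
            return "truncated"
        (ethertype,) = struct.unpack_from("!H", frame, offset)

    return ETHERTYPE_NAMES.get(ethertype, f"other(0x{ethertype:04x})")


if __name__ == "__main__":
    # Synthetic LLDP frame: nearest-bridge multicast destination, dummy source.
    lldp = bytes.fromhex("0180c200000e") + bytes.fromhex("020000000001") + b"\x88\xcc" + b"\x00" * 4
    print(classify_frame(lldp))
```

If the peer only ever reports "LLDP" and never "ARP" or "IPv4", that would be consistent with LLDP being sent while other traffic is not getting through.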
The "Status: not working" refers to my quest to downrate the SerDes to support 1G line rate (https://github.com/Broadcom-Switch/OpenNSL/issues/37). No worries, I also wouldn't trust strangers on the internet.
|
Just a random question, did you upgrade the kernel modules associated with OpenNSL when running a newer OpenNSL? We're running pretty brand new kernel modules (I think we're even using 3.5.0.1 kernel modules) even for 6.4, as those are the only ones available for us. It required some hacks to get to compile, but it worked well enough. Maybe the fact that we're running newer kernel drivers is the missing puzzle piece? EDIT:
|
Thanks for all the info. The kernel API is fairly stable so I'm not surprised that the 3.5.0.1 kernel modules work for older versions of OpenNSL. I wouldn't run that way long term (it's definitely not a tested setup :-), but not surprised it works. We run a fairly new kernel internally... let me confirm some details with some other folks and see if we can come up with a theory. In any case, glad to hear this is working for you. |
Just to add more data to keep myself honest:
That matches with the Dec-27 release that's current in https://github.com/Broadcom-Switch/OpenNSL/tree/master/bin/wedge. So I'm pretty sure I'm not messing up the versioning on my end. |
Can you provide any more information about the hacks? Compiling OpenNSL 3.5.0.1 for the 4.14 kernel, I have fixed pci_enable_msix, copy_to/from_user and dev->trans_start = jiffies; but FBOSS is still having issues:
I1010 20:50:13.945568 4058 BcmSwitch.cpp:560] Initializing BcmSwitch for unit 0 |
It has been a while since I hacked together the kernel modules, but dhtech/OpenNSL@3e5a8af + ONL 9 should be what we're running. A notable thing is that we do not load the knet driver. I seem to recall that there was a crash inside OpenNSL 6.4 when running FBOSS with the knet driver loaded. I have not tried that driver with 3.5.0.1. |
Ah @sonoble, looking at the last line of your report you're probably hitting #74. Not sure without the full stack trace however. You can try using our fork that is using FBOSS from May with some patches applied: https://github.com/dhtech/fboss if you need it up and running right now. |
No one runs knet that I know of. Looks like your changes are the same as mine. I build the entire OpenNSL from source, so I just set the KERNEL_SRC and LINUX_UAPI_SPLIT="1". I don't need FBOSS running right now, I was just trying to confirm that your patch worked for me on the 40's. I have been working on getting everything working on the 100S but in a totally different way, by removing the init from OpenNSL and having FBOSS handle it. I will build your fboss and see if I can get it working. Thank you! |
I built your fboss + the modified OpenNSL and while everything is running, there are no interfaces at all using your config or mine. I will dig more into it later. |
@bluecmd I don't see it in this thread, have you been able to confirm packets other than LLDP are passing? We have seen LLDP packets before but were unable to ping or send any different traffic between boxes. |
Only LLDP so far as well as normal L2 switching. |
Hi @bluecmd, I am able to confirm L2 and LLDP on the Wedge 100S but no L3 (packets are not making it to the CPU), so no routing protocols can be run. Can you assign an IP to a port and check whether you can ping it? |
@sonoble Sure. Do you have any configuration to share to make the time commitment shorter on my part? Also, did this work on 6.4? We only use L2 stuff so I'm not very aware of the state of L3 in FBOSS. |
Here is a generic one from Facebook
https://github.com/facebook/fboss/blob/master/fboss/agent/configs/sample3.json
you just need to add the ip in the correct area.
|
So these are my observations. This is with 3.5.0.1 and our FBOSS fork from May/June. We have never tried running this with the old FBOSS, so I have no idea if this is a regression, but as requested by @sonoble, I added an L3 interface like this:
This configured an
ICMP replies are also sent (looking at tcpdump fboss10) but they never arrive at the pinger. Using another IP address that is on its own subnet makes things break earlier. The fboss10 interface still shows some random IPv6 traffic that it captures, so packet capture works - however not much more than that. FBOSS output:
Notice that it is sending out an ARP broadcast but never logs a "sendPacketOutOfPort" message. Following the code, this is because this path calls "sendPacketSwitched". See here. Maybe sendPacketSwitched is broken while sendPacketOutOfPort works? Next steps to confirm that could be:
EDIT: I have a hypothesis that this might also be related to L1 errors; I'll debug a bit and update. |
Update: Yes, it was an L1 error. Having fixed the cabling, I can now see packets egressing as well. Ping doesn't work, but that is most likely FBOSS related.
tcpdump on computer:
|
Hi,
We're currently running FBOSS with a naively updated OpenNSL 3.5.0.1.
Since the reported crash in getdeps.sh should occur in opennsl_pkt_alloc, we verified the upgrade by using LLDP.
No crash was observed.
Using OpenNSL 3.5.0.1 allows using modern kernel drivers and configuring the OpenNSL BCM configuration, so upgrading to it would probably be interesting for a lot of folks.