IOMMU related issues #718
Comments
The Intel I350 NIC has four ports, doesn't it? Is it built into the mainboard or a discrete NIC? If it's discrete, does your server have an onboard NIC? Have you configured any of the ports to be used by the kernel?
The I350 on GK has 2 ports, while the one on GT has 4.
They're discrete, the onboard NICs on both machines are Broadcom NetXtreme which don't support DPDK.
I'm not quite sure about this, but here's all I have done to the ports :
I'm not sure whether it is necessary to bind all the ports on the same NIC to vfio-pci. However, this is already the case for the GK server, which has only 2 ports on the NIC.
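One way to check whether every port of a NIC is bound to the same driver is to read sysfs directly. The sketch below assumes the ports show up as PCI functions sharing one domain:bus:device prefix. The function name nic_port_drivers, the optional sysfs root argument, and the address 0000:04:00 in the usage comment are illustrative assumptions, not values from this thread.

```shell
# List the kernel driver bound to each PCI function of one device.
# Usage: nic_port_drivers <domain:bus:device> [sysfs_root]
# The helper name and the sysfs_root override are hypothetical.
nic_port_drivers() {
    prefix=$1
    root=${2:-/sys}
    for dev in "$root"/bus/pci/devices/"$prefix".*; do
        # Skip the unexpanded glob when nothing matches
        [ -e "$dev" ] || continue
        if [ -L "$dev/driver" ]; then
            # "driver" is a symlink to the bound driver's directory
            drv=$(basename "$(readlink "$dev/driver")")
        else
            drv="(none)"
        fi
        printf '%s %s\n' "$(basename "$dev")" "$drv"
    done
}

# Example (hypothetical address):
#   nic_port_drivers 0000:04:00
```

If some functions report a kernel network driver while others report vfio-pci, the ports of that NIC are split between the kernel and DPDK.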
By the way, after confirming network connectivity to check that the ports work, I removed their netplan configuration and set their state to DOWN before modifying them for use by DPDK.
I encountered a similar error while testing a NIC some time ago. I solved the problem then by binding all ports of the NIC with [...].
Hardware issues are time-consuming to address. I recommend getting another NIC model and moving forward. The NIC section of our wiki page "Hardware Requirements" lists NICs that Gatekeeper deployers have been using in production.
Thanks for the recommendation, I will see if I can find the same model of NIC. Meanwhile, can you share the brand, model and BIOS version of some of the bare-metal servers that Gatekeeper has been successfully deployed on? While this error is not necessarily caused by the BIOS, some proven working examples would be useful while trying to resolve it.
All shared notes on hardware are centralized on the wiki page Hardware Requirements. That said, my personal experience is with Dell servers.
Update : After a good deal of study and trial and error, I have managed to bring the GT server up. It is related to the kernel. The problem is fixed by using the System Configuration utility on Gen9+ servers to disable "HP Shared Memory features"; more details at https://github.com/kiler129/relax-intel-rmrr/blob/master/deep-dive.md
On the other hand, since the GK server is a Gen8, the same solution doesn't apply to this machine. It seems I would have to patch the kernel manually using https://github.com/kiler129/relax-intel-rmrr . Alternatively, this looks promising as well. However, both solutions seem time-consuming and troublesome.
The good news is that the I350 is not the culprit.
Have you ever had to deal with this kind of workaround on Dell servers? If Dell servers don't cause this kind of problem, they will be my primary option, and I would advise anyone deploying Gatekeeper afterwards to avoid old HP servers.
Another link with good reference value :
Based on my experience, I recommend Dell. Nevertheless, I don't want to suggest Dell servers are trouble-free; see issue #703 for an example. My recommendation is based on the fact that we've been able to overcome the problems with a clean solution.
I have built v1.2.0 RC2 from source. After running
sudo build/gatekeeper
an error is shown :
cannot add vfio group to container, error 22 (invalid argument)
and so I'm unable to start Gatekeeper. While troubleshooting, I found the error message below :
sudo dmesg | grep -i vfio
Firmware has requested this device have a 1:1 IOMMU mapping, rejecting configuring the device without a 1:1 mapping. Contact your platform vendor.
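Filtering only for vfio can hide the lines that explain where the 1:1 requirement comes from. As a rough sketch, the helper below widens the filter to other IOMMU-related keywords. The helper name iommu_log_lines and the keyword list are assumptions, and the exact wording of kernel messages differs between kernel versions.

```shell
# Filter a kernel log read from stdin for IOMMU-related lines.
# The helper name is hypothetical; the keywords are a guess at what is
# useful (DMAR and RMRR are the ACPI structures discussed in the
# relax-intel-rmrr deep dive).
iommu_log_lines() {
    grep -i -E 'dmar|rmrr|iommu|vfio'
}

# Typical use on the affected host:
#   sudo dmesg | iommu_log_lines
```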
From my preliminary study, I think the issue is related to hardware / BIOS. However, I can't find an exact solution. I'm trying my luck here to see if anyone with a deeper understanding of Intel VT-d, IOMMU and vfio-pci can offer any ideas.
The same error occurred on both GT and GK. Below is the specification of the testbed :
Bare-metal deployment, isolated lab environment.
GK :
OS : Ubuntu 24.04 LTS
Server : HPE ProLiant
RAM : 256GB
CPU : Intel Xeon E5-2665 2.4GHz, 32 cores
NUMA : 2 NUMA nodes
NIC : Intel I350 1G, both front and back (confirmed that DPDK is supported). The server also has an Intel 82599ES 10G interface that supports DPDK, but we don't have a 10G uplink router available at the moment, so we didn't use it for the testbed.
GT :
OS : Ubuntu 24.04 LTS
Server : HPE ProLiant
RAM : 256GB
CPU : Intel Xeon E5-2640 2.6GHz, 32 cores
NUMA : 2 NUMA nodes
NIC : Intel I350 1G, front
Solutions tried on GT (which didn't work):
Adding
vfio_iommu_type1.allow_unsafe_interrupts=1
to GRUB_CMDLINE_LINUX_DEFAULT
Current thoughts :
Is there a way to verify the "dma-ranges" property? If so, at least I could find out what is causing the non-1:1 mapping, and probably trace the root cause from there.
Some links that are probably relevant, but whose content I can't fully understand due to a lack of relevant experience :
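On x86 the 1:1 requirement most likely comes from ACPI (RMRR entries in the DMAR table) rather than from a device-tree "dma-ranges" property. On recent kernels, each IOMMU group exposes a reserved_regions file in sysfs with one "start end type" line per region. The sketch below prints that file for a device and reports whether any region has type "direct", which, as far as I understand, is the type tied to a mandatory 1:1 mapping. The function names, the optional sysfs root argument, and the PCI address in the usage comment are hypothetical.

```shell
# Print the reserved regions of the IOMMU group a PCI device belongs to.
# Usage: show_reserved_regions <pci_address> [sysfs_root]
# Helper names and the sysfs_root override are hypothetical.
show_reserved_regions() {
    addr=$1
    root=${2:-/sys}
    group="$root/bus/pci/devices/$addr/iommu_group"
    if [ ! -e "$group/reserved_regions" ]; then
        echo "no reserved_regions file for $addr" >&2
        return 1
    fi
    # Each line has the form "<start> <end> <type>"
    cat "$group/reserved_regions"
}

# Succeeds (exit status 0) if the device's group has a "direct" region.
has_direct_region() {
    show_reserved_regions "$@" |
        awk '$3 == "direct" { found = 1 } END { exit !found }'
}

# Example (hypothetical address):
#   show_reserved_regions 0000:04:00.0
#   has_direct_region 0000:04:00.0 && echo "direct (1:1) region present"
```

Comparing the output for the I350 ports with the RMRR ranges in the kernel log may show which firmware-declared region is attached to the NIC.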
https://github.com/kiler129/relax-intel-rmrr/blob/master/deep-dive.md#what-vendors-did-wrong
https://lore.kernel.org/linux-iommu/BN9PR11MB5276E84229B5BD952D78E9598C639@BN9PR11MB5276.namprd11.prod.outlook.com/
https://lore.kernel.org/linux-iommu/BN9PR11MB52768ACA721898D5C43CBE9B8C27A@BN9PR11MB5276.namprd11.prod.outlook.com/t/
https://lore.kernel.org/linux-iommu/[email protected]/
https://community.hpe.com/t5/proliant-servers-ml-dl-sl/proliant-dl360-gen9-getting-error-quot-rejecting-configuring-the/td-p/7220298
https://www.reddit.com/r/VFIO/comments/1gi95zf/rejecting_configuring_the_device_without_a_11/?rdt=37975
https://forum.proxmox.com/threads/qemu-exited-with-code-1-pcie-passthrough-not-working.146297/