Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lspci] Fix handling of more than one PCIe domain #1822

Merged
merged 1 commit into from
May 28, 2024

Conversation

Babar
Copy link
Contributor

@Babar Babar commented Apr 21, 2024

Description

(sorry, I'm using domain and root port interchangeably. Maybe domain is more correct, if so I can update the code to use that instead of root_port)

I'm guessing most machines only have one PCIe root port, but things like ARM hosts tend to have more, and that breaks as it creates collisions.

As we want to keep backwards compatibility, and lspci is run with the defaults that doesn't show the root port if it's 0000, only add it when it's non-zero, and also add it as a new extra field called root_port to denote the fact that there is one (useful when using the BDF to find files on disk, as when there is no root_port, one needs to add the extra 0000:.

More detailed explanation I wrote on the slack channel: On Linux, we run lspci -vnnmk, which will show the BDF in the first Device: field. So the first bug is that the regular expression on https://github.com/chef/ohai/blob/main/lib/ohai/plugins/linux/lspci.rb#L54 is missing a \ before the ., therefore will will match 0001:02:03.4 as 01:02:03 which is plain broken. But fixing that we end up with 02:03.4 which is also plain wrong as it loses the root complex. I see a few ways we could fix this:

  1. Highly disruptive: we add -D to lspci, which means we'll always get the root complex even when it's 0000 . This is terrible 'cause it will basically break everything
  2. We do this but only with an option in ohai. Less terrible, but means people have to know about that option
  3. We do this but only when we notice more than one root complex. That's not really doable
  4. We add a new field, root_complex in the PCI structure, which is 0000 by default and the actual root complex for everything else

I like 4 the most, as it mirrors closely what lspci does (https://github.com/pciutils/pciutils/blob/master/lspci.c#L285) or maybe
5. We do it the same way lspci does it: if the root complex is 0, we just store the BDF as the ID, if it's not, we store the root complex + BDF we can't really do 4 'cause we might have collisions... Like if you have 0000:01:02.3 and 0001:01:02.3 then they have to be stored differently. So we can either do: Option A: 01:02.3 and 0001:01:02.3 (this can co-exist with everything today) Option B: 00000:1:02.3 and 0001:01:02.3 (this most likely needs to become pci2)

Test Plan:
Before:

$ ohai | jq .pci\|keys
[
  "00:00.0",
  "01:00.0",
  "01:00.1",
  "02:00:0",
  "04:00:0",
  "08:00:0",
  "08:01:0",
  "08:02:0",
  "08:03:0",
  "08:04:0",
  "08:05:0",
  "09:00:0",
  "09:01:0"
]

After:

$ ohai | jq .pci\|keys
[
  "0002:00:00.0",
  "0004:00:00.0",
  "0008:00:00.0",
  "0008:01:00.0",
  "0008:02:01.0",
  "0008:02:02.0",
  "0008:03:00.0",
  "0008:04:00.0",
  "0008:05:00.0",
  "0009:00:00.0",
  "0009:01:00.0",
  "00:00.0",
  "01:00.0",
  "01:00.1"
]

Related Issue

pci collection should always include the domain #1693

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Chore (non-breaking change that does not add functionality or fix an issue)

Checklist:

  • I have read the CONTRIBUTING document.
  • I have run the pre-merge tests locally and they pass.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • If Gemfile.lock has changed, I have used --conservative to do it and included the full output in the Description above.
  • All new and existing tests passed.
  • All commits have been signed-off for the Developer Certificate of Origin.

@Babar Babar requested review from a team as code owners April 21, 2024 07:52
@Babar Babar force-pushed the lspci-fix-root-complex branch 3 times, most recently from be979bc to 656eb32 Compare April 22, 2024 05:25
@tpowell-progress
Copy link
Contributor

@Babar can you add a couple cases in spec/unit/plugins/linux/lspci_spec.rb to match the additional regex component specifically?

Copy link
Contributor

@tpowell-progress tpowell-progress left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please expand the test cases to cover additional regex component.

@tpowell-progress tpowell-progress self-assigned this Apr 23, 2024
@Babar Babar force-pushed the lspci-fix-root-complex branch from 656eb32 to 8d49639 Compare April 26, 2024 06:12
Summary:
I'm guessing most machines only have one PCIe root port, but things like ARM hosts tend to have more, and that breaks as it creates collisions.

As we want to keep backwards compatibility, and `lspci` is run with the defaults that doesn't show the root port if it's `0000`, only add it when it's non-zero, and also add it as a new extra field called `root_port` to denote the fact that there is one (useful when using the BDF to find files on disk, as when there is no `root_port`, one needs to add the extra `0000:`.

More detailed explanation I wrote on the slack channel:
On Linux, we run `lspci -vnnmk`, which will show the BDF in the first `Device:` field. So the first bug is that the regular expression on https://github.com/chef/ohai/blob/main/lib/ohai/plugins/linux/lspci.rb#L54 is missing a `\` before the `.`, therefore will will match `0001:02:03.4` as `01:02:03` which is plain broken. But fixing that we end up with `02:03.4` which is also plain wrong as it loses the root complex.
I see a few ways we could fix this:

1. Highly disruptive: we add -D to lspci, which means we'll always get the root complex even when it's 0000 . This is terrible 'cause it will basically break everything
2. We do this but only with an option in ohai. Less terrible, but means people have to know about that option
3. We do this but only when we notice more than one root complex. That's not really doable
4. We add a new field, root_complex in the PCI structure, which is 0000 by default and the actual root complex for everything else

I like 4 the most, as it mirrors closely what lspci does (https://github.com/pciutils/pciutils/blob/master/lspci.c#L285)
or maybe 5. We do it the same way lspci does it: if the root complex is 0, we just store the BDF as the ID, if it's not, we store the root complex + BDF
we can't really do 4 'cause we might have collisions... Like if you have 0000:01:02.3 and 0001:01:02.3 then they have to be stored differently. So we can either do:
Option A: 01:02.3 and 0001:01:02.3 (this can co-exist with everything today)
Option B: 00000:1:02.3 and 0001:01:02.3 (this most likely needs to become pci2)

Test Plan:
Before:
```
$ ohai | jq .pci\|keys
[
  "00:00.0",
  "01:00.0",
  "01:00.1",
  "02:00:0",
  "04:00:0",
  "08:00:0",
  "08:01:0",
  "08:02:0",
  "08:03:0",
  "08:04:0",
  "08:05:0",
  "09:00:0",
  "09:01:0"
]
```

After:
```
$ ohai | jq .pci\|keys
[
  "0002:00:00.0",
  "0004:00:00.0",
  "0008:00:00.0",
  "0008:01:00.0",
  "0008:02:01.0",
  "0008:02:02.0",
  "0008:03:00.0",
  "0008:04:00.0",
  "0008:05:00.0",
  "0009:00:00.0",
  "0009:01:00.0",
  "00:00.0",
  "01:00.0",
  "01:00.1"
]
```

Reviewers: jaymzh

Closes: chef#1693
Signed-off-by: Olivier Raginel <[email protected]>
@Babar Babar force-pushed the lspci-fix-root-complex branch from 8d49639 to e6a4242 Compare April 26, 2024 06:15
Copy link

Quality Gate Passed Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
No data about Duplication

See analysis details on SonarCloud

@Babar
Copy link
Contributor Author

Babar commented Apr 27, 2024

@tpowell-progress : added one. Can add more but wasn't sure what I should be testing.

@Babar Babar requested a review from tpowell-progress May 24, 2024 16:36
@tpowell-progress tpowell-progress merged commit 6337ab5 into chef:main May 28, 2024
28 of 29 checks passed
@Babar Babar deleted the lspci-fix-root-complex branch October 8, 2024 05:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants