WeeklyTelcon_20221101
- Dialup Info: (Do not post to public mailing list or public wiki)
- Austen Lauria (IBM)
- Brendan Cunningham (Cornelis Networks)
- Brian Barrett (AWS)
- David Bernhold (ORNL)
- Edgar Gabriel (UoH)
- Geoffrey Paulsen (IBM)
- Harumi Kuno (HPE)
- Howard Pritchard (LANL)
- Joseph Schuchart
- Josh Fisher (Cornelis Networks)
- Josh Hursey (IBM)
- Thomas Naughton (ORNL)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia)
- Aurelien Bouteiller (UTK)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Christoph Niethammer (HLRS)
- Erik Zeiske
- George Bosilca (UTK)
- Hessam Mirsadeghi (UCX/nVidia)
- Jan (Sandia)
- Jeff Squyres (Cisco)
- Jingyin Tang
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja (AWS)
- Ralph Castain (Intel)
- Sam Gutierrez (LLNL)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Tommy Janjusic (nVidia)
- Xin Zhao (nVidia)
- Default placement: if np <= 2, map by core; otherwise map by NUMA (if defined), else map by Package.
- The issue is that a customer has a Package inside of a NUMA domain.
- OMPI recently had a user who DID hit this. They were mapping by NUMA inside the package and did not get what they were expecting.
- Specifying map-by package solved it.
- Hard to debug by looking at lstopo. If someone gets something weird when trying to map by NUMA, try map by Package.
- When the default mapping policy (or ANY mapping policy) is in effect but the user doesn't say what the ranking policy is, what should the ranking policy be?
- Historically, ranking mirrored the mapping policy.
- but it's pointed out this isn't the optimal placement (since most apps communicate with neighbors).
- So then it was proposed to map by SLOT.
- But then the user looks, and gets confused because that's not what they thought they were getting.
- Please think about this, and decide and lock this down.
- Brian thinks that, in the absence of any other information, the default has to be rank by SLOT (he has less strong thoughts about NUMA or Package).
- Initial thought was that if user specifies non-default mapping, they then NEED to specify a ranking and vice versa.
- Can print a useful error message.
- We can't make everyone happy in this case, so this might be best option.
- If users don't want to specify this every time, they can set an env var or make an entry in a conf file.
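The default placement rule and the proposed "specify both or neither" check from the discussion above can be sketched as follows. This is a minimal illustration only, not the actual mpirun/PRRTE logic: the function names `default_map_by` and `check_policies`, the `has_numa` flag, and the error text are all hypothetical.

```python
def default_map_by(np, has_numa):
    """Pick the default mapping object as described in the notes.

    Hypothetical helper: np is the number of processes, has_numa says
    whether the topology defines NUMA domains.
    """
    if np <= 2:
        return "core"
    return "numa" if has_numa else "package"


def check_policies(map_by=None, rank_by=None):
    """Proposed rule: give both a mapping and a ranking policy, or neither.

    Returns the (map_by, rank_by) pair unchanged when the rule holds and
    raises ValueError with a descriptive message otherwise.
    """
    if (map_by is None) != (rank_by is None):
        missing = "rank-by" if rank_by is None else "map-by"
        raise ValueError(
            f"a non-default policy was given without {missing}; "
            "specify both a mapping and a ranking policy"
        )
    return map_by, rank_by


if __name__ == "__main__":
    print(default_map_by(2, has_numa=True))    # core
    print(default_map_by(16, has_numa=True))   # numa
    print(default_map_by(16, has_numa=False))  # package
    try:
        check_policies(map_by="package")
    except ValueError as err:
        print("error:", err)
```

Under this sketch, a user who sets only one of the two policies gets an explicit error instead of a placement they did not expect, which is the trade-off the notes describe.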
- v4.1.5
- Posted an RC1 last week. Brian forgot to send email to devel.
- Schedule is still end-of-month.
- May be the last v4.1.5 unless lots of bugs.
- One patch still needs some work (it didn't compile). We'd take it if it passes.
- RC went out a couple of weeks ago.
- We'll need at least one more RC before we release.
- HAN/Adapt is remaining blocker.
- Finally figured out why timings were so variable.
- Because we select Bruck for Barrier for no reason...
- since OSC times barrier as well, that was the cause for the variations he was seeing.
- There's a patch that proposes to only use HAN if the rank-distribution if we
- Don't think we should block v5.0 longer
- Don't think we'll figure out how to make HAN faster than tuned if
- Don't have a good reason yet why HAN's Barrier is slower.
- We promised better collective performance for v5, but we have not delivered.
- What do we do?
- Two choices:
- Ship now and say that we're sorry our collective performance
- We'd need some messaging about how we're handling this.
- How do we talk to the community about this.
- Are there any cases where this work actually improves things?
- Something a bit positive where this work
- Goes back to where ranks aren't ordered by SLOT.
- Don't understand why only those are better.
- Do we make it better in the common case? - No.
- Super Computing 2021
- ULFM, Threading MCA framework, MTL OFI, UCC
- Pretty sure we DID messaging around this.
- Have had a number of new PRs.
- Did make changes to Tuned and had a PR where priorities were adjusted.
- Seeing better performance for OMPI than Intel MPI.
- Whatever the "out-of-box" performance is, that is what they are getting.
- If you only have a few ranks per node, then HAN doesn't help that much.
- Preparing for release.
- Nov 14th release date.
- Remaining known blocking issues:
- OSHMEM blocker issue #10978
- OPAL LIFO tests fail on s390x. A bad gcc is suspected; reportedly it works with v4.1 but fails with v5.0.
- Doesn't seem to have support for 128-bit architectures. Can't use C11.
- Jenkins Pipeline fix (No issue)
- Jenkins - make tarball issue.
- RPM builds don't work in Jenkins on v5.0.x
- Doesn't block RC, but DOES block release.
- HAN/Adapt - #10963
- Still some concerns that need to be addressed.
- Docs - Remaining blocking issue (besides above) for v5.0.0
- mpirun --help is OUT OF DATE.
- A number of doc issues open.
- See https://github.com/open-mpi/ompi/projects/3 for more info.
- The open-mpi FAQ - refers to things like v1.7
- Should the open-mpi.org say for v5.0
- Like the see all of them feature.
-
- Merged to main, and to v5.0.x
- Try it in v5.0.0rc9
- Still delayed.
- We're probably not getting together in person anytime soon.
- So we'll send around a doodle to have time to talk about our rules.
- Reflect the way we worked several years ago, but not really right now.
- We're to review the admin steering committee in July (per our rules):
- We're to review the technical steering committee in July (per our rules):
- We should also review all the OMPI github, slack, and coverity members during the month of July.
- Jeff will kick that off sometime this week or next week.
- In the call we mentioned this, but no real discussion.
- Wiki for face to face: https://github.com/open-mpi/ompi/wiki/Meeting-2022
- Might be better to do a half-day/day-long virtual working session.
- Due to company's travel policies, and convenience.
- Could do administrative tasks here too.
- Open MPI missed submitting request for BoF this year.
- MPI Forum will be presenting.