Meeting 2021 07_Minutes
(due to COVID-19, this will be virtual instead of a face-to-face meeting)
The meeting dates will be determined by greatest availability. See https://doodle.com/poll/rd7szze3agmyq4m5?utm_source=poll&utm_medium=link
Meeting dates:
- Thursday, July 22, 2021
- 12-2pm US Pacific time.
- 3-5pm US Eastern time.
- 8-10pm GMT
- Thursday, July 29, 2021
- 8-10am US Pacific
- 11am-1pm US Eastern
- 4-6pm GMT
This is a link to a non-public repo for the Webex info (posting Webex links publicly just invites spam; sorry folks).
If you do not have access to the non-public repo, please email Jeff Squyres to get the Webex info.
Please put your name down here if you plan to attend. All rumors of snacks were greatly exaggerated.
- Geoff Paulsen (IBM)
- Josh Hursey (IBM)
- Jeff Squyres (Cisco)
- Michael Heinz, Brendan Cunningham (Cornelis)
- Raghu Raja (Enfabrica)
- Howard Pritchard (LANL)
- Ralph Castain (Nanook) (partial attendance)
- Austen Lauria (IBM)
- William Zhang (AWS)
- Brian Barrett (AWS, July 22 only)
- Nathan Hjelm (Google)
- Todd Kordenbrock (HPE/SNL)
- Thomas Naughton (ORNL)
Please add Agenda items we need to discuss here.
-
[Owner?] MPI 4.0 Compliance.
-
MPI-4 stuff we already have (either partially or completely - from the MPI-4.0 doc changelog):
- 7: Persistent collectives
- Fujitsu did this in MPIX_ and has been moved to MPI_ on master
- [ACTION: Geoff go ensure on v5.0.x]
- 24: Sessions https://github.com/open-mpi/ompi/pull/9097
- Status: Howard is planning to squash many of the commits down (especially newest ones)
- Has been making progress in hpc/ompi fork
- Main thing that needs to be addressed before merging...
- Vader (at least its fast boxes) was not written so that, when it is closed, it pushes all pending traffic to the wire.
- When Session finalize happens with Vader right after a reduce, we get segvs because the fastboxes aren't writing all buffers before closing / freeing.
- About one month off, assuming the Vader issue isn't too much work.
- Got rid of some topo stuff that was not included in MPI 4 standard.
- Only MPI4 standard stuff, not other pieces.
- Nathan rewired the way that Finalize works, so lots of changes in the way that OMPI works were necessary.
- Jeff asked a question around Session Init / Finalize - how this interacts with PMIx Init/ Finalize.
- Howard has been testing. Found some issues and fixed.
-
MPI-4 stuff no one is working on yet (from the MPI-4.0 doc changelog):
- 3: Embiggened bindings
- 4: error handling (is this all done by the recent UT/FT work?)
- 6: MPI_ISENDRECV and MPI_ISENDRECV_REPLACE
- 9: Partitioned communication
- 10+11: MPI_COMM_TYPE_HW_[UN]GUIDED
- 12: Update COMM_[I]DUP w.r.t. info propagation
- 13: MPI_COMM_IDUP_WITH_INFO
- 15: new info hints
- 16: updated semantics of MPI_*_[GET|SET]_INFO
- 17: update MPI_DIMS_CREATE (this might be done?)
- 18: alignment requirements for windows
- 21: MPI_ERR_PROC_ABORTED error class (was this added by UT/FT work?)
- 22: Add MPI_INFO_GET_STRING
- 23: Deprecate MPI_INFO_GET[_VALUELEN]
- 25: Add MPI_INFO_CREATE_ENV
- 26: Error reporting before INIT / after FINALIZE (was this added by UT/FT work?)
- 27: Updated error handling (was this done by UT/FT work?)
- 28: Updated semantics in MPI_WIN_ALLOCATE_SHARED
- 29: Audit F08 binding for MPI_STATUS_SET_CANCELED
- 30: Add MPI_T_* callbacks
- 32: Audit: have MPI_T functions return MPI_ERR_INVALID_INDEX instead of MPI_ERR_INVALID_ITEM
- 33: Deprecate MPI_SIZEOF
-
-
[Howard] PMIx Event handling - which events do we want to handle?
- Don't have a default error handler (blocker) - Could have hung procs. Actually in the Sessions PR https://github.com/open-mpi/ompi/pull/9097 we do have a default error handler.
- In Sessions, left in ULFM proc_error_abort code, the test for ___ didn't work. He added a single process callback, and a second callback for o
- Might be something about the error handler that PMIx is calling, goes into an opal_event_list or something else.
- Places we could do better (instead of tearing down the job) - ASPIRE
- 3 types of event handlers (in priority order)
- single event handlers (look first)
- multi-code handlers, and aim it at a single event in a single call
- Default handler (pass in NULL), meaning all events will use this handler.
- As soon as a single event handler handles an event, PMIX doesn't call the more generic handlers.
- Can specify which processes use which handler.
- Can also specify to NOT use default handler.
- ULFM code specified NOT to use the default handler. So ULFM needs its own event handling.
- LOST_CONNECTION event handler is missing on master.
- This is also blocked from going to default handler.
- This is what allows processes to kill themselves when job is terminated by scheduler.
- Will need a PR before v5.0.0 [ACTION create a blocking issue for v5.0 to TRACK]
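The handler ordering described above can be sketched as a toy model. This is not the PMIx API: the `Handler` class, the `dispatch` function, the `allow_default` flag, and the plain-string event codes are all hypothetical names chosen for illustration. It only models what the minutes state, namely that single-event handlers are consulted first, then multi-code handlers, then the default handler, that the search stops once a handler reports it handled the event, and that a registrant can opt out of the default handler.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Handler:
    name: str
    # None models the default handler (registered with NULL codes): matches every event.
    codes: Optional[Sequence[str]]
    # Returns True if this handler considers the event handled.
    callback: Callable[[str], bool]

    def tier(self) -> int:
        """0 = single-event handler, 1 = multi-code handler, 2 = default handler."""
        if self.codes is None:
            return 2
        return 0 if len(self.codes) == 1 else 1

    def matches(self, code: str) -> bool:
        return self.codes is None or code in self.codes


def dispatch(handlers: List[Handler], code: str, allow_default: bool = True) -> List[str]:
    """Return the names of the handlers that were called, most specific first."""
    called = []
    for handler in sorted(handlers, key=lambda h: h.tier()):
        if not handler.matches(code):
            continue
        if handler.codes is None and not allow_default:
            continue  # models "can also specify to NOT use default handler"
        called.append(handler.name)
        if handler.callback(code):
            break  # handled: more generic handlers are not called
    return called


# Hypothetical registrations; the event names are labels, not PMIx constants.
handlers = [
    Handler("default", None, lambda code: True),
    Handler("multi", ("LOST_CONNECTION", "PROC_ABORTED"), lambda code: True),
    Handler("single", ("PROC_ABORTED",), lambda code: True),
]
```

In this model, an event with `allow_default=False` and no specific handler is simply dropped, which is the situation described for LOST_CONNECTION on master.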
-
[Jeff] Uniform application of OFI and UCX component selection mechanisms
-
What is the strategy that should be used for OFI and UCX components?
-
E.g., https://github.com/open-mpi/ompi/issues/9123
- Summary: User builds a "universal" Open MPI to use across several different clusters, including support for both OFI and UCX.
- Somehow the Wrong thing is happening by default in a UCX-based cluster
-
Action: Let's review what the current OFI / UCX selection mechanisms are
-
@rhc54's proposal for fabric selection: https://github.com/open-mpi/ompi/issues/9123#issuecomment-877824063
-
@jjhursey + @jsquyres proposal from 1+ year ago: "mpirun --net TYPE", where TYPE is defined in a text config file somewhere (i.e., customizable by the sysadmin), and basically gets transmogrified into a set of MCA params+values. E.g., "mpirun --net TCP" or "mpirun --net UCX-TCP" pulls the definition of those TYPEs from a config file containing:
# Definitions for mpirun --net TYPE
UCX-TCP = -mca pml ucx -x UCX_.._TLS tcp
TCP = -mca pml ob1 -mca btl tcp,vader,self
- Could easily amend: use the --net CLI option as the highest priority, and then take info from PMIx as fallback if the user CLI option is not specified.
- Issue: Do need to keep this simple enough to implement in a reasonable amount of time.
- What if we add this to mpirun (that turns around and calls prun/prterun)?
- That approach wouldn't work for non mpirun schedulers.
- WHY would we do this when the lower level already has selection mechanisms?
- We don't like telling customers to configure OMPI and also configure the lower levels.
- Intel MPI today has something like this, but it doesn't always do it "correctly", which is very confusing.
- But this is duplicating behavior, and if we don't duplicate perfectly, we're causing confusion.
- We should go by the "90%" rule. Most users will just want something simple (ex: "Use OFI"), some of these other items are for "advanced users"
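A minimal sketch of how the "--net TYPE" proposal could expand a name into arguments, assuming a "NAME = arguments" file format with "#" comments. The function names `parse_net_definitions` and `resolve_net`, the `pmix_type` fallback argument, and the sample definitions are hypothetical; no such option exists in mpirun today, and the elided "UCX_.._TLS" name is kept exactly as written in the notes.

```python
import shlex
from typing import Dict, List, Optional

# Illustrative definitions only, modeled on the proposal in these minutes.
SAMPLE_DEFINITIONS = """
# Definitions for mpirun --net TYPE
UCX-TCP = -mca pml ucx -x UCX_.._TLS tcp
TCP = -mca pml ob1 -mca btl tcp,vader,self
"""


def parse_net_definitions(text: str) -> Dict[str, List[str]]:
    """Parse 'NAME = args' lines into a mapping from TYPE name to argument list."""
    definitions = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected 'NAME = arguments', got: {line!r}")
        definitions[name.strip()] = shlex.split(value)
    return definitions


def resolve_net(definitions: Dict[str, List[str]],
                cli_type: Optional[str] = None,
                pmix_type: Optional[str] = None) -> List[str]:
    """The CLI choice has the highest priority; PMIx-provided info is the fallback."""
    chosen = cli_type if cli_type is not None else pmix_type
    if chosen is None:
        return []  # nothing requested: leave the defaults alone
    if chosen not in definitions:
        raise KeyError(f"unknown --net type: {chosen}")
    return definitions[chosen]
```

Keeping the expansion as a plain argument list means the result can be appended to whatever command line the launcher was already going to build, which fits the "keep this simple" concern raised above.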
-
@bosilca's thought that we should be using */mca/common/* more (e.g., have multiple components of the same network type share common functionality for selection)
-
@rhc54: Word of caution. This agenda item conflates two issues - the default selection of what should happen and the user-facing method of informing OMPI on what to do. We should resolve the default selection problem first and separately as this is what is causing the current problems. Default selection must work correctly whether you use "mpirun" or "srun" or "prun" (with a PRRTE DVM), and it should work correctly out-of-the-box without anyone (user or sysadmin) providing parameters.
-
A customer is trying to build a "full featured" Open MPI with both libfabric and UCX support.
- Trying to use UCX, but something in libfabric segved.
- Defaults, user settings, and priorities were all designed around the concept of "one and only one right answer". There wasn't any ambiguity, but that environment has changed.
- libfabric and UCX are both marching towards having a superset of networks they support.
- There are "better" choices based on performance.
-
Defaults are hard to define and hard to get correct. But even if we had a magical oracle to figure that out, we would still have some issues between PMLs, MTLs, and BTLs around initialization.
-
Will need to solve this Centrally rather than spread around the code-base.
-
What does it mean to say "OFI" (which component should be selected?)
-
"Use TCP" - goes down a rat's nest of which component to use.
-
Because of how we've implemented One-Sided, the BTLs are integral to how we do this.
- Deprecating the One-Sided component makes the BTLs integral (Still correct decision tho)
-
Don't know if we want to do device IDs...
- If we want to do something "Linux-Only" we could iterate over devices.
- Don't want to have to update some table every-time a new device is released.
- Vendor-IDs might be sufficient
-
Don't want separate components having separate device trees. Would rather have a centralized location in a single-order.
-
Jeff advocating for Framework with multiple device-components.
- Brian arguing against since those then need to be prioritized, and many very small components.
-
Can't call lower level init calls (very heavy and other problems)
-
C or Text file for mapping... not sure yet.
- XML is a pain to parse
-
Might want some wildcarding (everything from Vendor X do Y)
- What is the output of this? List of components to use?
- Maybe something like a global mca var to say "This is what should be used"
- If output ends up with more than one thing, we're in an ambiguous state, so punt to user.
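One way to picture the text-file mapping with wildcarding is the sketch below. The file format, the `parse_mapping` and `components_for` names, and the vendor/device IDs are all made up for illustration (the IDs are placeholders, not real PCI IDs). It uses shell-style globbing so that a rule like "everything from Vendor X" is a single line, and the output is the list of component names mentioned in the discussion.

```python
from fnmatch import fnmatchcase
from typing import List, Tuple

# Hypothetical mapping file: "<vendor>:<device> pattern" followed by a component name.
MAPPING_TEXT = """
# pattern (vendor:device)   component
0xaaaa:*                    ucx
0xbbbb:0x10*                ofi
"""


def parse_mapping(text: str) -> List[Tuple[str, str]]:
    """Read (pattern, component) rules, ignoring comments and blank lines."""
    rules = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        pattern, component = line.split()
        rules.append((pattern, component))
    return rules


def components_for(devices: List[Tuple[str, str]],
                   rules: List[Tuple[str, str]]) -> List[str]:
    """Return the distinct components matched by the devices, in first-seen order."""
    found = []
    for vendor, device in devices:
        key = f"{vendor}:{device}"
        for pattern, component in rules:
            if fnmatchcase(key, pattern) and component not in found:
                found.append(component)
    return found
```

A result with more than one entry corresponds to the ambiguous state that gets punted to the user, and an empty result corresponds to a device nobody has written a rule for yet.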
-
This has always been a NICE to have, but now we REALLY NEED IT.
-
Easy to maintain over time, vendors will want to update this over time.
-
Selection of all components will influence PMLs, BTLs, MTLs
-
Need to keep this scalable. Don't want all nodes/procs reading from hwloc tree during init
- PMIx does this once and puts it in shared memory.
- CH4 - look first from PMIx, if can't get it then fallback.
- PMIx already knows what fabrics are present, so could pass in Vendor IDs of NICs found.
- either look at hwloc ourselves or get from PMIx (both?)
- If we get in an ambiguous situation abort and ask user??
- Probably the right thing to do, please tell us X
-
Really trying to prevent initialization.
- There are cases you'll see a device, but it's in an uninitialized state.
-
When we have a new NIC from a new vendor and don't know what to do.
- Really bad for customers, particularly because customers are stuck on older Open MPIs because their ISVs haven't done a new build.
- We'd have to backport new vendor IDs to very old OMPIs because OMPI changes its ABI every few years.
- This argues for Text file, so vendors can update and so could admins.
-
What if we put it in mca param file? mpirun (and pmix/srun) sends that everywhere.
- Two different formats in same text file? That's gross, but we do that now???
-
CAN we express this information in a NOT HORRIBLE to read text file?
- Could be a second file that PMIx forwards everywhere.
- mca params gets expressed as env vars to application.
- Other items get expressed as PMIx key/value pair.
-
Any prioritization info here too?
- No, just identification. If there's multiple matches, we go to ambiguous state.
-
How are we going to get this DONE?
-
What is going in this file?
- Vendor and Part IDs and map this to a string?
- Does new part force a new release for us?
-
If PMIx is already giving this info, why would we do this ourselves?
- Jeff and Ralph will put together a proposal
-
Linux ONLY?
- MacOS wants TCP
- BSD as well.
-
General scheme:
- OMPI will get this "identification" information (PMIx/hwloc)
- Will get a list of strings back, if we get 2 or more that's ambiguous, and ask user.
- Get 1, use it.
- Get 0: use TCP?
- NO! That means you have a new Part.
- Maybe just roll the dice on OFI / UCX?
- Probably can't enumerate all ethernet cards in a sane way.
- Don't forget about RoCE. :)
- EFA is same VendorID but different DeviceID (for eth vs RDMA 'modes')
- Linux-Only solution TCP/IP device "netdev". Mechanisms are different, but can figure out on MACOS as well.
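The general scheme can be sketched as two small steps: get the identification strings (PMIx first, hwloc as fallback), then decide based on how many distinct names came back. The `identify` and `decide` functions, the `Outcome` enum, and the two callables standing in for PMIx and hwloc queries are hypothetical. The zero-match case is deliberately reported as its own outcome rather than resolved, since the discussion did not settle on a single answer for it.

```python
from enum import Enum
from typing import Callable, List, Optional, Tuple


class Outcome(Enum):
    USE = "use"              # exactly one match: use it
    AMBIGUOUS = "ambiguous"  # two or more matches: ask the user
    UNKNOWN = "unknown"      # zero matches: possibly a new part


def identify(from_pmix: Callable[[], Optional[List[str]]],
             from_hwloc: Callable[[], List[str]]) -> List[str]:
    """Prefer the answer PMIx already has; scan locally only if it has none."""
    names = from_pmix()
    if names is None:
        names = from_hwloc()
    return sorted(set(names))  # de-duplicate, e.g. two NICs of the same kind


def decide(names: List[str]) -> Tuple[Outcome, str]:
    if len(names) == 1:
        return Outcome.USE, names[0]
    if len(names) >= 2:
        return Outcome.AMBIGUOUS, "found " + ", ".join(names) + "; please select one"
    return Outcome.UNKNOWN, "no known fabric identified"
```

For example, `decide(identify(lambda: None, lambda: ["ofi", "ucx"]))` falls back to the local scan and reports the ambiguous state instead of guessing.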
-
Stepping back, Vendors who have a preference between PMLs - REALLY the first level problem
- EFA IDs has a big range, so can glob these.
- Interesting if Mellanox has something similar.
- After this, does it really matter? Since we've made a UCX/libfabric decision, lower level decisions are handled by the component.
-
When there's 0 of UCX or EFI Devices, then fall down to BTLs?
- Yes.
-
Some discussion about a "network" framework to order items.
- Could be the single static mca component framework
- Bad abstraction break everywhere.
- How does OPAL decide provider selection, and how does it know whether it needs tag matching or not?
- In many ways this is worse with OPAL/MPI split.
-
VendorID (is the pci device id) is the same for Mellanox RoCE vs RDMA cards.
-
-
July 29 [no particular owner] Plans for better support for GPU offload - do we have any? How important is this to our users?
-
[Josh] MPIR-Shim CI testing
-
[Ralph] Overhaul "mpirun --help"
- Follow the Hydra model of a high-level help and then help per option (i.e., "mpirun --map-by --help")
-
[Brian] Open MPI ABI stability
-
Our lack of a stable ABI across versions is very problematic for customers.
-
Customers who are still stuck on Open MPI 1.10.
-
If we go this route, the myriad of sublibraries
- But users don't link against these sub-libraries.
- One version of linker on Linux.
- We bumped the non-backward compatible library (Github Issue, Debian brought it up) We broke ABI for OPAL, but not MPI. We versioned correctly, but this still broke the application. They had to relink the application.
-
Might want to squash all libraries down.
-
Should add tests for this.
-
Doing this will break ABI again. So do we want this for v5.0.0?
-
ldd (does recursion) of app, you'll see libopal, and libmpi
- Looking at elf, still just libmpi.so (which links against libopal)
- So today, just need to keep libmpi.so
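The difference between what ldd prints and what the application's ELF file actually records can be shown with a toy dependency graph. The `NEEDED` table below is a simplified assumption (unversioned names, system libraries omitted), and `direct_deps` / `transitive_deps` are hypothetical helpers: the first models the entries recorded in the object itself, the second models the recursive walk that ldd performs.

```python
from typing import Dict, List

# Simplified, assumed dependency records: each object lists only what it links against directly.
NEEDED: Dict[str, List[str]] = {
    "app": ["libmpi.so"],
    "libmpi.so": ["libopen-pal.so"],
    "libopen-pal.so": [],
}


def direct_deps(obj: str) -> List[str]:
    """What the object itself records as needed."""
    return list(NEEDED[obj])


def transitive_deps(obj: str) -> List[str]:
    """What a recursive walk reports: direct dependencies plus everything they pull in."""
    seen: List[str] = []
    queue = list(NEEDED[obj])
    while queue:
        lib = queue.pop(0)
        if lib in seen:
            continue
        seen.append(lib)
        queue.extend(NEEDED.get(lib, []))
    return seen
```

Only the direct list matters for whether the application itself has to be relinked, which is why keeping libmpi.so stable is the key point here.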
-
But still need to go back and study this Debian issue, but probably don't have as big of a problem as we did.
-
Where are we on "ABI only applies to MPI, and not internals"? Is this still true?
- We could make this true.... Two parts of what we mean by ABI.
- ABI of the library is whatever's "public". Other libraries only bump version when subset of all "public"
- Fortran links against both, so it's okay.
- Fortran calls some non-public APIs, so we'd need to version these too.
- Do you support using a libFortran from build A, and libmpi from build B where they might be different.
- If the answer is NO, then we're okay
- Could implement a runtime check fairly trivially.
- But would have to be a library constructor since can't verify which is called first.
- Possibly easier ways than this.
- Fortran Precompiled Mod files.
- So ABI from our point of view, is if they use the same Fortran compiler, we guarantee.
-
Probably not a rush to fix for v5.0.0 since this problem has been and will continue to be an issue.
-
Probably don't need to fold libopen-pal into libmpi, since apps just link against libmpi, and it links against libopen-pal
- [Action someone needs to go re-review Debian issue]
-
Does ANYONE still use OPAL directly?
-
If MPI 5.0 standard is going to standardize ABI on MPICH ABI, we might not need to do much.
- https://github.com/cea-hpc/wi4mpi
- BUILD project also referenced.
- Also an issue because some container architectures are trying to pull MPI from outside of the container.
-
NOTES FROM AFTER THE MEETING:
-
Brian + Jeff met to discuss ABI issues.
-
We verified that since Open MPI v4.0.x, mpicc and friends just do -lmpi (and Fortran libraries). Meaning: they do not specifically -lopen-pal.
-
ldd of an MPI app executable will somewhat confusingly show you libopen-pal, but that's because ldd chases down the recursion.
-
-
Hence, we think that ABI actually isn't an issue for v5.0 and beyond.
-
Specifically, Brian did some testing. We think we're ok with OPAL ABI versioning issues, at least on modern linux. Brian tried Amazon Linux 2, RHEL 8, SLES 15, and Ubuntu 20.04, and all passed this test:
- Build Open MPI master (c:r:a == 0:0:0 for all libraries) and install
- Build MPI application against installation
- Bump Open PAL's c:r:a to 1:0:0 and rebuild / install
- Run mpi application built against original libmpi against new libmpi, verify it runs
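As a worked example of what the c:r:a bump in these steps changes: on Linux, libtool derives the shared library's major number as current minus age, and names the file with major.age.revision. The helper below (`linux_library_names` is a hypothetical name) just applies that arithmetic to the two versions used in the test.

```python
from typing import Tuple


def linux_library_names(base: str, current: int, revision: int, age: int) -> Tuple[str, str]:
    """Map a libtool current:revision:age triple to (soname, file name) as done on Linux."""
    if age > current:
        raise ValueError("age cannot exceed current")
    major = current - age
    soname = f"{base}.so.{major}"
    filename = f"{soname}.{age}.{revision}"
    return soname, filename


before = linux_library_names("libopen-pal", 0, 0, 0)
after = linux_library_names("libopen-pal", 1, 0, 0)
```

Here `before` is the pair for 0:0:0 and `after` is the pair for 1:0:0. The soname changes between them, so a libmpi rebuilt in step 3 refers to the new Open PAL soname, while the application only records its dependency on libmpi.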
-
As expected, ldd caught the change in library version number after the c:r:a change, so the right things happened.
-
Bottom line: I think we're ok with our current situation; no need to try and do the one big library thing (e.g., for v5.0.0), at least not right now.
-
-