WeeklyTelcon_20190813
Geoffrey Paulsen edited this page Jul 25, 2023 · 2 revisions
- Dialup Info: (Do not post to public mailing list or public wiki)
- Artem Polyakov (Mellanox)
- Brendan Cunningham (Intel)
- Brian Barrett (Amazon)
- Geoff Paulsen (IBM)
- Harumi Kuno (HPE)
- Josh Hursey (IBM)
- Michael Heinz (Intel)
- Noah Evans (Sandia)
- Ralph Castain (Intel)
- Todd Kordenbrock
- Akshay Venkatesh (nVidia)
- Aravind Gopalakrishnan (Intel)
- Arm (UTK)
- Brandon Yates (Intel)
- Dan Topa (LANL)
- David Bernhold
- Edgar Gabriel (UH)
- Geoffroy Vallee
- George Bosilca (UTK)
- Howard Pritchard (LANL)
- Jake Hemstad
- Jeff Squyres (Cisco)
- Joshua Ladd (Mellanox)
- Mark Allen (IBM)
- Matias Cabral
- Matthew Dosanjh (Sandia)
- Nathan Hjelm
- Peter Gottesman (Cisco)
- Thomas Naughton
- Xin Zhao (Mellanox)
- mohan
-
Git submodules
- This PR is in progress. Requires CI owners to add --recursive to their Jenkins git clone commands.
- As a first step, Jeff created:
- PR 6821 "hwloc201 use a submodule"
- Brian will not have cycles for a few weeks.
- Jenkins has an issue that Brian.
-
What to do with OFI BTL and OFI MTL
- Harumi Kuno (HPE) - Discussion about OMPI's component philosophy
- mail archive: https://www.mail-archive.com/[email protected]/msg20736.html
- ofi/BTL and MTL components can step on each other.
- PSM2 - when a user of PSM2 calls PSM2_Finalize while there is also a PSM2 provider, PSM2's refcounting is only observed when initializing, not when finalizing, so the first finalize was finalizing the entire job.
- No progress. Brendan is looking at this on the PSM2 side.
- What are the plans for PSM2 and the MTL, etc.?
- Still fully supporting PSM2. The PSM1 adapters reach end-of-life in March 2020. Will probably remove PSM1 code from v5.0 and master. (Michael Heinz)
- Update Harumi Kuno - Jeff raised some issues with OFI common PR to return to master (older issue 2519), build issue. Think we
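The PSM2 finalize problem noted above comes down to reference counting being honored on init but not on finalize. A minimal Python sketch of the intended behavior follows. It assumes a hypothetical `SharedLibrary` wrapper; none of these names come from PSM2, OFI, or Open MPI. The point is only that the real teardown should run when the last user finalizes, not the first.

```python
class SharedLibrary:
    """Hypothetical stand-in for a library shared by two components (e.g. a BTL and an MTL)."""

    def __init__(self):
        self.refcount = 0
        self.active = False
        self.teardowns = 0  # how many times the real teardown ran

    def init(self):
        # The first caller performs the real initialization; later callers only bump the count.
        if self.refcount == 0:
            self.active = True
        self.refcount += 1

    def finalize(self):
        if self.refcount == 0:
            raise RuntimeError("finalize called without a matching init")
        self.refcount -= 1
        # Only the last finalize tears the library down.
        if self.refcount == 0:
            self.active = False
            self.teardowns += 1


lib = SharedLibrary()
lib.init()      # first component
lib.init()      # second component
lib.finalize()  # first component is done; the library must stay usable
still_active_after_first_finalize = lib.active
lib.finalize()  # last user is done; now the teardown runs
```

With the refcount checked on both paths, the first `finalize()` leaves the library active for the remaining user.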
-
Status of Scale testing
- No update. Blocking on Amazon time, lower priority.
- Issue 6786 "OMPI 4.0.1 TCP connection errors beyond 86 nodes"
- Issue 6198 "SSH launch fails when host file has more than 64 hosts"
- IBM is also working on something like this (for ssh launch)
- Prefer this every night, instead of each PR.
-
Issue 6799 "UFM buffers failing in cuIpcGetMemHandle ?"
- No update
-
- https://engineering.mongodb.com/post/succeeding-with-clangformat-part-1-pitfalls-and-planning
- Should get this cleaned up. Need one big PR fix.
- Whitespace vs Tab cleanup.
- Good conversation on PR.
- Should we have CI for this?
- MongoDB did something similar; their write-up covers branches, issues, and why they went with clang.
- After folks write the scripts, then adding to CI is no problem.
- Want it to be EASY to add local githooks so CI isn't first line for these.
- Giant clean up commits should be done on each
- Implementation details:
- It might be easy to use clang for the CI / formatting.
- clang enforces a set of things, but it may require more than
- We have a requirement in Open MPI that says you write 'if (NULL == var)'
- very hard to enforce this in perl, and gcc can't give us AST to do at that level.
- run clang far enough to get AST, to do formatting.
- you can now run clang_format.py reformat-branch T R (using T and R from the algorithm above) to easily bring a stranded topic branch forward after a reformat commit.
- Adding yet another dependency (like clang) would be painful, since most of us don't use clang.
- Whitespace is how this started, so perhaps just fix the whitespace issues, with both githooks and CI to enforce them.
- Scripts are mentioned in the PR.
- Most of these scripts UPDATE the git commit, and so for CI we want them just to check.
- Command line example on how to add to git hooks.
- Brian thinks he owns next steps - basic style checking in CI.
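The notes above ask for whitespace scripts that only check rather than update the commit, so the same script can serve both a local git hook and CI. The following is a minimal Python sketch of such a check-only pass, not the scripts referenced in the PR. The function names are hypothetical, and the rules (no tabs, no trailing whitespace) are an assumption about what the cleanup would enforce.

```python
def find_whitespace_problems(text):
    """Return (line_number, message) pairs; never modifies the input."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if "\t" in line:
            problems.append((number, "tab character"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    return problems


def check_files(paths):
    """Check-only mode: report problems and return a shell-style exit status."""
    status = 0
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, message in find_whitespace_problems(handle.read()):
                print(f"{path}:{number}: {message}")
                status = 1
    return status


# Example on in-memory text: line 2 starts with a tab, line 3 ends with a space.
sample = "int x;\n\tint y;\nint z; \n"
sample_problems = find_whitespace_problems(sample)
```

A hook or CI job could pass the changed files to `check_files` and exit with its return value, so a non-zero status fails the commit or the build without rewriting anything.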
- Complete
- No update
- Suggest just doing hwloc (stable and not too much development) first
- No update
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.4
Review v3.1.x Milestones v3.1.4
- Nothing to report.
- v3.0.x MPIR_Breakpoint issue need a bit more data why -O3
- Tested new PMIx
- Exposed a few new test suite issues in "ibm", but fixed
Review v4.0.x Milestones v4.0.2
- Howard is out this week. Once Datatype PR is merged, will spin RC1 to begin testing.
- Akshay will test new datatypes with CUDA.
- Will test on master maybe v4.0.x too.
- No update 8/13
- PR against v4.0.x to pull in latest PMIx release merged.
- Many bugfixes waiting for 4.0.1, we should try to get 4.0.2 out the door.
- OB1 get protocol problem Issues 6568 - Nice, but not a blocker since everything but MCA has CMA
- George is back from vacation, want two things before rc1
- Datatype work, master PR for datatypes
- Also ob1 get/put path problem
- Edgar just reported a bug
- Howard is verifying 6613 MPIR Disappearing queue on re-attach.
- PR6806 - Want to wait until CI is back. Do we have any tests to test this?
- Howard will reproduce and add to ibm suite
- 2nd Put issue PR 6568 (Vader deadlocking with 4MB transfers)
- waiting on George to return (end of the month)
- New Datatype work https://github.com/open-mpi/ompi/pull/6695 (master)
- Want for v4.0.2
- Now approved for master.
- waiting on George to return (end of the month). We could merge to master, but if any issues, we'd need George to fix.
-
https://github.com/open-mpi/ompi/issues/6568 - put protocol has lost its pipelining.
- Combination of both ob1 and vader.
- Right now only shows in vader, because all others prefer get protocol.
- Vader generates a bunch of 32K frags, so a 4MB transfer overwhelms vader.
- Does NOT occur with single copy like CMA or KNEM.
- Marked as a blocker, but won't block RCs, just
- Is this a regression? Not sure if it was ever implemented.
- Used to be some pipelining, used to work. Not sure why it's showing up.
- Everything George knows is in the ticket.
- Need a throttle for large messages.
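To put numbers on the discussion above: a 4MB put split into 32K fragments produces 128 fragments, all of which can be outstanding at once without a throttle. The Python sketch below simulates capping the number of in-flight fragments. It is an illustration only, not how ob1 or vader work internally, and `window` is a hypothetical tuning knob rather than an existing Open MPI parameter.

```python
from collections import deque

FRAGMENT_SIZE = 32 * 1024        # 32K fragments, as in the notes
MESSAGE_SIZE = 4 * 1024 * 1024   # 4MB transfer, as in the notes


def fragment_count(message_size, fragment_size):
    """Number of fragments needed to cover the message (ceiling division)."""
    return -(-message_size // fragment_size)


def throttled_send(message_size, fragment_size, window):
    """Simulate a put that keeps at most `window` fragments in flight.

    Returns the peak number of outstanding fragments observed.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    in_flight = deque()
    peak = 0
    for fragment in range(fragment_count(message_size, fragment_size)):
        if len(in_flight) == window:
            in_flight.popleft()  # wait for the oldest fragment to complete
        in_flight.append(fragment)
        peak = max(peak, len(in_flight))
    return peak


total = fragment_count(MESSAGE_SIZE, FRAGMENT_SIZE)
unthrottled_peak = throttled_send(MESSAGE_SIZE, FRAGMENT_SIZE, window=total)
throttled_peak = throttled_send(MESSAGE_SIZE, FRAGMENT_SIZE, window=8)
```

With a window as large as the fragment count the peak equals the total; with a small window the peak is bounded by the window, which is the effect a throttle for large messages would aim for.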
- Issue 6789 - OMPI crashes when configured with ucx version
- Issue with PML UCX conflicting with btl_uct - memory hooks
- New this week: Howard not convinced it's memory hooks.
- Howard can't reproduce. Asking user to
Review Master Master Pull Requests
- PR6556 and 6621 should go to the release branches.
- no update
- Good reminder that we now need to be careful about OPAL's ABI.
- Not a great way to test CI before
- When do we get rid of 32bit?
- Still don't have any release manager.
- Need to identify someone in next few months.
- 3.1.4 is out
- 2.2.3 is in RC.
- 4.0 just rough schedule now. Trying to get standard RFCs out this month.
- Branching for PMIx v4.0 might be September.
- a bunch of stuff going on, but nothing necessarily impacting OMPI.
- Made a change for Nathan - allow you to get locality of other processes on node.
- Allows you to hook up with shared memory
- The master version of PMIx can support network coordinates of any NIC and, depending on the type of network, can map them for each process.
- "network coordinates" - map to MPI network topology definition.
- Fujitsu and Cray are implementing.
- In PMIx, when doing instant-on, the scheduler queries the ___ plugin to get a payload of the info you want. If the process is bound to a certain socket, this is the NIC it should use, and these others are available. Then you assign the endpoint to that NIC.
- Requires Instant-On? - simple to do without instant-on if you want to.
- Howard has someone coming onboard in LANL next month.
- Tom filed a PRTE PR recently, so making some progress.
- Open MPI would prefer the mpirun launch over the lamboot-style two-command approach
- Aug 7th - web-ex meeting.
- Talked about what needed to happen, and confirmed want to go down this path
- laid out a few steps of what needs to happen.
- Some hinges on submodule automation.
- OLD - Gilles' PRRTE work was done differently than what we're now proposing. New proposal uses submodules, etc.
- PR6339 - he's closed, and re-opened a new branch to look at.
- Howard reviewed PR6339, and likes everything that Gilles did.
- IBM has to triage some failures on master and v4.0.x