WeeklyTelcon_20200121
- Dialup Info: (Do not post to public mailing list or public wiki)
- Geoffrey Paulsen (IBM)
- Todd Kordenbrock (Sandia)
- Edgar Gabriel (UH)
- Joseph Schuchart
- Ralph Castain (Intel)
- William Zhang (AWS)
- Jeff Squyres (Cisco)
- Artem Polyakov (Mellanox)
- Howard Pritchard (LANL)
- Joshua Ladd (Mellanox)
- Brian Barrett (AWS)
- Brendan Cunningham (Intel)
- Harumi Kuno (HPE)
- Michael Heinz (Intel)
- Nathan Hjelm (Google)
- Thomas Naughton (ORNL)
- Austen Lauria (IBM)
- Charles Shereda (LLNL)
- David Bernhold (ORNL)
- Noah Evans (Sandia)
- George Bosilca (UTK)
- Matthew Dosanjh (Sandia)
- Brandon Yates (Intel)
- Erik Zeiske
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Xin Zhao (Mellanox)
- mohan (AWS)
- Akshay Venkatesh (NVIDIA)
- Josh Hursey (IBM)
-
Coverity coverage for PRRTE
- Ralph will send note to Coverity to add the project.
- Brian should be able to work on nightly build piece later this week.
-
Anything to do to make Cray CI more stable?
- The Jenkins timeout is pretty low, so Jenkins thinks the job failed before Cray is done with autogen.
- There is a timeout; we could make it larger...
- Howard changed directories to /scratch, but that didn't help.
- Could make the timeout bigger, but it's already 10 minutes. Why can't we get through autogen in 10 minutes?
- The issue is that it gets stuck doing fetch/pull on submodules, which is a huge number of small file writes.
- Two timers in Jenkins:
- One is the total job timer (1-2 hours?).
- The other kills the job if it hasn't seen output in 10 minutes.
- When the filesystem is really slow, even make will be really slow.
- Cray is a highly shared resource with lots of others.
- Don't need to do a deep clone; a shallow clone (only the latest version, not all history) is enough.
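
The two Jenkins timers discussed above (a total job limit and a kill-after-silence limit) can be illustrated with a small watchdog. This is only a sketch of the idea in Python, not how Jenkins implements it; the function name `run_with_timers` and its parameters are hypothetical and chosen for illustration.

```python
import subprocess
import sys
import threading
import time


def run_with_timers(cmd, total_timeout, idle_timeout, poll=0.05):
    """Run cmd and kill it if it runs longer than total_timeout seconds
    or prints nothing for idle_timeout seconds (hypothetical helper).

    Returns (status, lines), where status is one of
    "ok", "failed", "total-timeout", or "idle-timeout".
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    lines = []
    last_output = [time.monotonic()]  # list so the reader thread can update it

    def reader():
        # Every line of output resets the inactivity timer.
        for line in proc.stdout:
            lines.append(line.rstrip("\n"))
            last_output[0] = time.monotonic()

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    start = time.monotonic()
    status = "ok"
    while proc.poll() is None:
        now = time.monotonic()
        if now - start > total_timeout:
            status = "total-timeout"   # timer 1: whole job took too long
            proc.kill()
            break
        if now - last_output[0] > idle_timeout:
            status = "idle-timeout"    # timer 2: no output for too long
            proc.kill()
            break
        time.sleep(poll)

    proc.wait()
    thread.join()
    proc.stdout.close()
    if status == "ok" and proc.returncode != 0:
        status = "failed"
    return status, lines


if __name__ == "__main__":
    # A child that prints once and then goes quiet, like a stalled autogen.
    quiet = [sys.executable, "-c",
             "import time; print('autogen', flush=True); time.sleep(5)"]
    print(run_with_timers(quiet, total_timeout=60, idle_timeout=1.0))
```

With a slow shared filesystem, the inactivity timer is the one that fires first, which matches the behavior described for the Cray job. For the clone itself, a command such as `git clone --depth 1 --recurse-submodules --shallow-submodules <url>` is one way to request a shallow clone that includes submodules; whether that fits the Cray CI setup is an assumption.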
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.6
Review v3.1.x Milestones v3.1.6
- Pushed out RC last week.
- Still need a fix for the ompio/api abstraction break (7318).
- RHEL 8 linker seems to be finding this.
Review v4.0.x Milestones v4.0.3
-
v4.0.3 in the works.
- Put out v4.0.3rc1 over the weekend.
- Schedule: End of January.
- Try to get rc1 built this Friday
-
Howard PRed #7321 to v4.0.x
- xpmem worked on v3.x, so don't think it needs cherry-picking back.
- Nathan to see if these fixes are relevant on 3.0.x and 3.1.x
-
Issue 7220 - vader not cleaning up properly (vader backing files).
- In the v3.x series, which uses PMIx 2.x, cleanup files can't be registered.
- Nathan: old workaround was that all processes unlink after add-procs?
- No longer doing this because the files moved from /tmp to /dev/shm (v3.0?).
- This would bring up more bugs for users with very small /tmp.
- In v4.0.x, which uses PMIx 3.x, files CAN be registered for cleanup.
- The SIGTERM path forgets to call the PMIx interface to clean up registered files.
- Files in the session directory are always cleaned up, but those in /dev/shm are not.
-
Issue 6960 (closed) had something cherry-picked to release branch, but it's still not fixed.
- Configuring --enable-ipv6 shouldn't preclude IPv4.
- Do we need to cherry-pick 6964 back into v4.0.x?
- Fix this in PRRTE.
- Schedule: April 2020?
- Geoff will update the milestone.
- Portland, Oregon, Feb 17, 2020.
- Please register on Wiki page, since Jeff has to register you.
- Date looks good. Feb 17th right before MPI Forum
- 2pm Monday, and maybe most of Tuesday.
- Cisco has a Portland facility and is happy to host.
- about 20-30 min drive from MPI Forum, will probably need a car.
Review Master Master Pull Requests
- PMIx v3.1.5 is probably NOT in January.
-
Been working on PRRTE
-
Strange issue: we suck up libevent and hwloc into OPAL statically, but PMIx links against libopal to get access to these components. Even with name shifting (under opal names), it can call down into OPAL. pmix_error_log found itself in opal_output with an uninitialized hostname, which segfaults.
- Need to find a way to link directly to pmix, hwloc,
- even have disable-dlopen set.
- Problem: we want one process (separate from the MPI process), i.e. prrte, that calls prrte_init, and it ends up linking in OPAL because that's the embedded code.
- How should we split these out?
- Make libtool convenience libraries of them.
- prrte, rather than linking against libtool, links against the convenience libraries.
- The convenience libraries then just get sucked into the code.
- Where this fails is that you can't link against both these convenience libraries and libopal?
- configury? doesn't prrte need to know if we're linking embedded or external?
- Brian will write up some thoughts on this on Friday.
-
ORTE-removal/PRRTE PR is ready to be committed.
- Mellanox CI is still failing on OSHMEM.
- Yes, this got resolved. The segfault they were seeing is exactly the strange issue above.
- Hand testing is looking fine.
- It uses an ORTE parameter, and OSHMEM then fails because the directory doesn't exist or has the wrong permissions.
-
Still a bunch of things to do after this PR goes in.
- Still 1+ month of effort before Open MPI v5.0 could be ready with this.
- see: https://github.com/openpmix/prrte/issues/298
-
Singleton comm-spawn... how do we make this work?
- PMIx understands it.
- Do we need to support singleton comm-spawn starting the PRRTEs?
- Now that we will support a persistent infrastructure, maybe we just require users to start it first.
-
Address comm-spawn issues that have been raised.