-
-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
train rf model in background thread #1178
base: main
Are you sure you want to change the base?
Conversation
…ersion - minor fix: type hint
Heyo, just a question why this is done as a thread and not a subprocess. Thread is easier by many stretches but has quite a few downsides compared to a process, at least in Python: Thread:
Process:
|
Yes, that's what I was on about in #1170 (comment):
So, actually, I think with running it in a separate thread, instead of having e.g., 90% of the time of the main process doing training, we now might end up giving it only 50%, and leaving the other 50% for other stuff like reporting and inference etc. if there's a demand for that. Also, having threads instead of processes (either through multiprocessing or dask), doesn't take away one core from the workers doing cost function evaluation. Perhaps that's not much of an advantage, though.
I think this should already be fixed by daemonizing the thread (docs, further explanation).
Not sure how random forest serialization is handled. I know that for LE: Oh, wait... I just noticed that it also implements Also, for the training data ( |
Ah, damn, I wasn't aware of python/cpython#80116. (also this) |
Heyo, some extra feedback:
Tldr; concurrent programming in python is a mystical art with so many-caveats, and really experimentation is the only way to identify what works consistently. Try out |
…ry (backed by mmap on posix-compliant OSs) for data, and queues for synchronization
…f re-opening every time
- work in progress: switch from threads to processes
…ed within background thread - finish first version of switch from threads to processes
Hi @eddiebergman, I couldn't figure out a way to use Now, the code I pushed isn't yet tested, neither manually/visually, nor by unit tests. Also, I'd like to add a flag that would allow the user to switch to the old, synchronous behavior, which I think would help test things, and also allow users to fall back on the old code until the new code is reliable enough. However, all the logic should be there (unless I made some gross mistakes), so that should give us at least a basis for discussion. In this stage, I wanted to ask you about how the Also, given the non-deterministic nature of OS scheduling, running the same optimization session twice, even with the same rng seed might result in different results, especially if running with a different number of workers, but I understand that's already the case , as per the
Other than that, if you want to throw an eye on what I've implemented so far and, if you, perhaps, find any flagrant mistakes, any feedback and support would be greatly appreciated. Thanks, Bogdan |
… wait until training is done before being able to query the model to suggest a new config to try
- better synchronization / signalling between optimization loop and training loop - refactor: - improve shared array semantics - encapsulation: reuse more allocation / cleanup code - defensive: extra checks - other minor fixes / improvements
… multiple sessions in parallel, e.g., when running tests
…ests to terminate the training loop process gracefully (to as high as an extent as possible) - add (and then disable) some code that prints to console to help debug inter-process synchronization - refactor: renames for improved legibility - refactor: encapsulate and reuse - add option to run testing code without pytest for debug - modify some testing code to avoid deprecation warnings
- refactor: - renames: improved legibility - easier debug printing for sync between opt and train loop processes
Hi @eddiebergman, I managed to bring the code to a point where it passes all the tests without hanging. Indeed, it wasn't a pretty job :). Could you now, please, advise on what's needed for this PR to be merged? Here are some potential improvements I could think of: 1. UPDATE: DONE: Option to Disable ConcurrencyAdd a flag to revert to old implementation, and not just behavior. Note, I've already added the option to wait for training to be completed after every submission of fresh training data, but I was thinking of disabling multiprocessing altogether) 2. Unit TestsWould you like me to add some unit tests? How do you think they should look like? 3. Training Data IPCIn order to decrease risks, I could eliminate my own homebrew You suggested I know you would prefer not to add further dependencies to the project, but I thought it doesn't hurt to ask :). I found the This would also simplify the inter-process signalling required to release and unlink the shared memory, and would avoid potential leaks (see 5., below). 4. RNGStill haven't heard back from you about the 5. Leaks, hangsAs per the code here, in some complex dependency graphs, the garbage collector doesn't get to call the This shouldn't normally be an issue neither in the common use cases, nor when running tests with |
Hi again, I just added an option to disable multiprocessing altogether and switching back to the old behavior. Would this, perhaps, help allow the merging of the pull request, and somehow expose this to the user as an experimental feature, at his own risk? Thanks! |
I ran a session similar to the one the data from which the cpu load plot in #1170 (comment) was generated, and also the performance reported in the motivation section of #1169 (comment):
Currently, it doesn't seem that the code I wrote has managed to fix it. So here are some plots: 1. The original 46h session2. Another 48h session (smooth)LE: 20.4k trials on 17k distinct configs in 48h 3.a. UPDATED: Latest 48h Session3.b. UPDATED: Latest 48h Session (smooth)ObservationAfter 24h the session got through 14450 trials on 64 cores, and cpu load is about 42%. LE: 22k trials on 18.4k distinct configs in 48h ConclusionEither the code is broken, or the bottleneck happens for a different reason, e.g., because of Hyperband. LE: There's a non-insignificant gain of about 10% in the number of trials/configs when enabling the code in this PR (i.e., between 2 and 3 above). So, in light of these new observations, I think it's that a new hypothesis, that this code fixes a bottleneck but there might be another one somewhere else, seems to become more likely. Next steps:
|
…more easily identifiable in monitoring tools
Additional CPU load graphsHi guys. I ran another bunch of optimization sessions turning off various components of SMAC to try and identify where the bottleneck lies. Here are the results: 1. Hyperband disabled1.1. With Background Training6.3k trials in 48 hours 1.2. Without Background Training (my smac3 clone)6.16k trials in 48 hours 1.3. Without Background Training (original smac3 repo)This was just to test if my code introduces any bottlenecks. 6.24k trials in 48 hours 2. Random Sampling
2.1. Hyperband Disabled
2.2. With Hyperband Enabled27.6 trials on 23k unique configs in 48 hours ConclusionsIt seems now that the problem lies neither with training the RF model (when doing random sampling, there are fewer trials done than when using RF; mind, however, that trial time, although deterministic, varies with the chosen params), neither with Hyperband. I guess that some profiling is highly necessary to identifiy the performance bottlenecks in a parallel setting. |
work in progress on #1170