Different approach with a few benefits #230
It would help to see the |
There you go

Original

New cumulative
The prototype doesn't yet count the number of hits (not difficult to add) |
That looks promising, but the tracing seems to have a 4x overhead based on the return function in the base case (assuming it was hit 3 times in your cumulative version). While that's not too significant in this case, it can be very significant in tighter loops, and it would be beneficial to create a tight loop to test this in. |
How do you infer a 4x overhead from that output? I just tested

```python
@profile
def f():
    for _ in range(int(1e7)):
        g()

@profile
def g():
    pass

f()
```

and it takes 22.37s with v4.0.3 and 18.53s with the new approach. The prototype runs the new approach in addition to the original one, so I temporarily disabled the original part for this crude benchmark. |
Ah, I looked at the return statement in the base case and it looked like it took 1.2us per hit in your version whereas the original took 0.3us per hit. Usually, very simple lines like a plain return statement are virtually instant in Python, and almost all the time taken is in the profiler. |
Here's a benchmark that shows around an 11x slowdown with the current line_profiler vs unprofiled code (total time 500ms unprofiled vs 5.4s profiled):

Time Elapsed tight: 494.96ms
Time Elapsed tight: 5390.15ms
Slowdown from profiling: 4895.19ms, 10.89x
Total time: 5.39021 s
File: tight.py
Function: main at line 3
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 def main():
4 1 3.1 3.1 0.0 start = time.perf_counter()
5 20000001 2628245.8 0.1 48.8 for x in range(20000000):
6 20000000 2761886.3 0.1 51.2 y = x
7 1 31.5 31.5 0.0 elapsed_ms = round((time.perf_counter()-start)*1000, 2)
8 1 42.2 42.2 0.0 print(f"Time Elapsed tight: {elapsed_ms}ms")
9 1 1.3 1.3 0.0 return elapsed_ms

Your code gives these results:

Time Elapsed tight: 498.77ms
Time Elapsed tight: 18367.76ms
Slowdown from profiling: 17868.989999999998ms, 36.83x
Wrote profile results to tight.py.lprof
Timer unit: 1e-06 s
Total time: 5.91982 s
File: /home/theel/Coding/line_profiler/line_profiler_cumulative/tight.py
Function: main at line 3
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 def main():
4 1 1.7 1.7 0.0 start = time.perf_counter()
5 20000001 2961819.0 0.1 50.0 for x in range(20000000):
6 20000000 2957938.7 0.1 50.0 y = x
7 1 18.6 18.6 0.0 elapsed_ms = round((time.perf_counter()-start)*1000, 2)
8 1 39.2 39.2 0.0 print(f"Time Elapsed tight: {elapsed_ms}ms")
9 1 0.6 0.6 0.0 return elapsed_ms

I believe the total time elapsed shown by line_profiler is incorrect, as I definitely waited at least 14 seconds for yours to finish (based on watching the clock). This indicates that your code has a ~37x slowdown, which is ~3x more overhead than the current line_profiler.
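For reference, the profiled function in tight.py can be reconstructed from the Line Contents column in the outputs above. The import line, the @profile placement, and the surrounding harness (which runs an unprofiled baseline and prints the slowdown) are assumptions, as they aren't shown in the thread:

```python
import time

@profile  # injected by kernprof; exact placement assumed
def main():
    start = time.perf_counter()
    for x in range(20000000):
        y = x
    elapsed_ms = round((time.perf_counter()-start)*1000, 2)
    print(f"Time Elapsed tight: {elapsed_ms}ms")
    return elapsed_ms
```
|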
The thing to remember (which I forgot in my first comment here!) is that the times in line_profiler's output don't actually include the profiling overhead, as line_profiler calls hpTimer() to try to measure its own overhead and eliminate it from the output.
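In pure-Python terms, that compensation works roughly like this (a conceptual sketch only; the real code is C and timestamps with hpTimer() inside its trace callback, and all names here are illustrative):

```python
import sys
import time

class CompensatingTracer:
    def __init__(self):
        self.times = {}            # lineno -> accumulated user time
        self.prev_line = None
        self.callback_exit = None  # when the callback last returned

    def trace(self, frame, event, arg):
        entry = time.perf_counter()
        if event == "line":
            if self.prev_line is not None:
                # Only the span between the previous callback's exit and
                # this callback's entry counts as user code, so time spent
                # inside the callback itself is excluded from the report.
                self.times[self.prev_line] = (
                    self.times.get(self.prev_line, 0.0)
                    + (entry - self.callback_exit)
                )
            self.prev_line = frame.f_lineno
        self.callback_exit = time.perf_counter()
        return self.trace

# Assumed usage: sys.settrace(CompensatingTracer().trace)
```
|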
I fail to reproduce your results. Could you try running them again with the new commit? I changed a few things so you can get a fair comparison with your benchmark. |
This version is significantly better, but still about 1.5x slower:

Time Elapsed tight: 513.24ms
Time Elapsed tight: 9631.74ms
Slowdown from profiling: 9118.5ms, 18.77x
Wrote profile results to tight.py.lprof
Timer unit: 1e-06 s
Total time: 4.17773 s
File: /home/theel/Coding/line_profiler/line_profiler_cumulative/tight.py
Function: main at line 3
Line # Hits Time Per Hit % Time Line Contents
==============================================================
3 def main():
4 1 1.7 1.7 0.0 start = time.perf_counter()
5 1 2005221.9 2005221.9 48.0 for x in range(20000000):
6 1 2172444.6 2172444.6 52.0 y = x
7 1 21.0 21.0 0.0 elapsed_ms = round((time.perf_counter()-start)*1000, 2)
8 1 38.0 38.0 0.0 print(f"Time Elapsed tight: {elapsed_ms}ms")
9 1 0.7 0.7 0.0 return elapsed_ms |
The weird thing is, my results are the complete opposite: v4.0.3 gives me "Time Elapsed tight: 12030.22ms" while the new commit gives "Time Elapsed tight: 6169.69ms" (triple-checked). My process
|
Great to hear :) I think most improvements can be carried over. I'll give it a try sometime next week |
See notEvil#1 |
Nice, thanks! So if I understand correctly, performance is fine. Regarding the differences from the current approach, the implementation at the bottom of |
The latest commit now counts the number of line hits and provides the times in a backwards compatible way.
"Total time" reported for |
An example of what call stats would allow: graph.dot.pdf |
I'm happy to announce the next iteration :) Getting useful numbers in case of recursive code was a lot more difficult than anticipated, but here they are nonetheless. The last commit
It is impossible to derive

Now the output for the initial example is
Tested with

Code

```python
import time


def wait(n):
    time.sleep(n * 0.01)


@profile
def f(n):
    f1(n)
    f2(n)
    f3(n)
    f4(n)
    f5(n)
    f6(n)


@profile
def f1(n):
    if n == 0:
        return
    wait(1)
    f1(n - 1)


@profile
def f2(n):
    if n == 0:
        return
    wait(1)
    (f2(n - 1), f2(n // 2))


@profile
def f3(n):
    if n == 0:
        return
    wait(1)
    f3_2(n - 1)


@profile
def f3_2(n):
    if n == 0:
        return
    wait(1)
    f3(n - 1)


@profile
def f4(n):
    if n == 0:
        return
    wait(1)
    (f4_2(n - 1), f4_2(n // 2))


@profile
def f4_2(n):
    if n == 0:
        return
    wait(1)
    (f4(n - 1), f4(n // 2))


@profile
def f5(n):
    if n == 0:
        return
    wait(1)
    f5_2(n - 1)


@profile
def f5_2(n):
    if n == 0:
        return
    wait(1)
    f5_3(n - 1)


@profile
def f5_3(n):
    if n == 0:
        return
    wait(1)
    f5(n - 1)


@profile
def f6(n):
    if n == 0:
        return
    wait(1)
    (f6_2(n - 1), f6_2(n // 2))


@profile
def f6_2(n):
    if n == 0:
        return
    wait(1)
    (f6_3(n - 1), f6_3(n // 2))


@profile
def f6_3(n):
    if n == 0:
        return
    wait(1)
    (f6(n - 1), f6(n // 2))


f(6)
```

Output
|
Thanks for this great work! I'm going to rebase my PR to your fork on this latest work; could you take a look at it once I do that? After this work of yours is merged, I can start working on adding CPython 3.12 support to line_profiler. |
Thanks :) |
I just tested the performance, and it is indeed significantly worse than before. Previously, notEvil#1 was able to pull it down to 13x overhead over no profiling in the tight loop case from above, whereas now, with the same optimizations applied, it's 16x overhead over no profiling. That being said, I believe I can optimize it further, as some of your data structure choices may be suboptimal performance-wise. |
@notEvil thanks for sharing this! Many years ago I wrote a line profiler for another dynamic language runtime, and I think I used a similar call/return stacking approach. If you get a chance, would you mind taking a look at the merge conflicts pulling in the latest |
@tmm1 sorry for not responding! I will check the conflicts soon, I promise ^^

I've pushed a new commit* which should improve confidence in the solution. The previous version was manually tested on a few examples with manual inspection of the output. The testing is not perfect:
What I like:
So now that it "works" and won't get any more complex, I'd say it's a good time to review and benchmark. https://github.com/notEvil/line_profiler/tree/sub

* changes
** If you don't know Hypothesis yet, I'd definitely recommend checking it out!
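For context, a minimal Hypothesis property test looks like this (a generic illustration of the library, not the actual test suite from the branch):

```python
from hypothesis import given, strategies as st

@given(st.lists(st.integers()))
def test_sort_is_idempotent(xs):
    # Hypothesis generates many random inputs and, on failure,
    # shrinks them to a minimal counterexample.
    assert sorted(sorted(xs)) == sorted(xs)
```
|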
@tmm1 just merged main and except for trivial conflicts it went smoothly. The randomized tests still pass :) @Theelx you created a PR back then for better performance and comparisons. I didn't merge it because I didn't want to mix the two proposals, and it looked like it was based off #203 but without a git reference (merge). Would you consider the PR mature enough to merge with sub on a test branch? |
Just a quick update: I tried to measure and improve performance and got some interesting results. The benchmark is https://gitlab.com/notEvil/code_collapse applied to itself, which spends most of the time in Black and the remaining time in nested generators (tree traversal) and LibCST.
As you can see, I wasn't able to significantly improve the performance of sub. Maybe time opt is worth the tradeoff (it creates some bias), what do you think? I will try to figure out why Python 3.12 is performing so badly compared to 3.11 on my project.

Regarding the bad performance when line profiling on 3.12, can you reproduce this? e.g. jsonpickle

Update 2023-11-04: Python 3.12 is a bit slower than 3.11 in general, but the large difference is due to Black not compiling for 3.12 (due to a bug in Mypyc). So the benchmark is flawed insofar as it presents a measure of slowdown that is not representative (most time is spent in compiled code, which doesn't add any overhead). Maybe we could run pyperformance somehow.

Also regarding key opt, it turns out it is not uncommon for short functions to have the exact same bytecode (

* either directly from #203 or inspired by |
Good news I guess:
Average factor (geometric mean, complete case only)
I managed to adjust pyperformance so it uses line_profiler (with autoprofile) and ran it*. See https://github.com/notEvil/line_profiler/tree/sub/pyperformance and https://github.com/notEvil/pyperformance/tree/line_profiler if you want to do the same. Set up the Pipenv environment and run

The column

Update 2023-11-06: Added the opt variants and sorted by average factor. Interestingly, the success of time opt indicates benchmarks where the entry check fails a lot. However, map opt is on par and doesn't add bias. So overall my take is: preshed is not worth it**, time opt might be, and map opt, tss opt and key opt are.

* on Intel 4570S with 3 of them in parallel (1 idle core). Most nan are due to conflicts (e.g. bind to port). |
Hi,
I've been tinkering with line_profiler and implemented a different approach to line profiling:
https://github.com/notEvil/line_profiler/tree/sub
Essentially, it keeps track of "calls" (PyTrace_CALL and PyTrace_RETURN are used for generators and async as well) and by doing so records "total time": time spent within a function call, generator or async step, excluding time spent in line profiled functions that were called. From there it calculates "cumulative time".

This approach has multiple benefits:
Potential downside is increased overhead, but in my testing it wasn't much slower.
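A rough pure-Python sketch of the call/return bookkeeping described above (illustrative only; the actual implementation is in C via the tracing C-API, and it only excludes time spent in other line-profiled functions rather than in every callee):

```python
import sys
import time

class TotalTimeTracker:
    """Accumulates per-function "total time": time inside a call,
    excluding time spent in tracked callees."""

    def __init__(self):
        self.stack = []   # [code, banked_time, resumed_at] per active call
        self.total = {}   # code object -> total time

    def trace(self, frame, event, arg):
        now = time.perf_counter()
        if event == "call":
            if self.stack:
                # Pause the caller: bank what it has run since last resume.
                entry = self.stack[-1]
                entry[1] += now - entry[2]
            self.stack.append([frame.f_code, 0.0, now])
        elif event == "return" and self.stack:
            code, banked, resumed_at = self.stack.pop()
            self.total[code] = self.total.get(code, 0.0) + banked + (now - resumed_at)
            if self.stack:
                self.stack[-1][2] = now   # resume the caller's clock
        return self.trace

# Assumed usage: sys.settrace(TotalTimeTracker().trace)
```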
I've also created a Neovim plugin (not published yet) to see the differences:
Original
New cumulative
New total
Tested on a real application (40s runtime, no async)