Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] fix zero based range performance #3932

Open
wants to merge 2 commits into
base: nightly
Choose a base branch
from

Conversation

sstadick
Copy link

@sstadick sstadick commented Jan 8, 2025

See #3931 all details.

This is the minimal possible change to get range(end) performance matching range(0, end). The cleaner approach would be to have range(end) just call range(0, end), most likely.

The potential optimization in the original version of the _ZeroStartingRange that could skip a call to max seems to make it harder for the compiler to generate efficient assembly for loops (see linked issue for example assembly). This is possibly a compiler bug, and _ZeroStartingRange could be re-instated to its original implementation.

@sstadick sstadick requested a review from a team as a code owner January 8, 2025 02:52
@sstadick sstadick force-pushed the fix-zero-based-range branch from 8b2b7ed to 20668b0 Compare January 8, 2025 02:55
@sstadick sstadick changed the title fix: zero based range [stdlib] fix zero based range performance Jan 8, 2025
@sstadick sstadick force-pushed the fix-zero-based-range branch from 20668b0 to 6a60e10 Compare January 8, 2025 02:59
@sstadick
Copy link
Author

sstadick commented Jan 8, 2025

The CI failure should be resolved by #3922. It is unrelated to the change I'm making and seems due to this change: 4903cf9#diff-f8da59dec02a848d700736433c150c14adfea4c79cb453ded5c247c1c597b654.

Copy link
Contributor

@martinvuyk martinvuyk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sstadick thanks for contributing :)

return self.end - curr
var start = self.start
self.start += 1
return start

@always_inline
fn __has_next__(self) -> Bool:
return self.__len__() > 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another place where we can add a micro-optimization :D

Suggested change
return self.__len__() > 0
return self.end > self.start


@always_inline
fn __has_next__(self) -> Bool:
return self.__len__() > 0

@always_inline
fn __len__(self) -> Int:
return self.curr
return max(0, self.end - self.start)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could make this branchless. No idea how often this gets called or if the optimization would have a positive effect. Would you be willing to measure the diff using something like

return (self.end - self.start) & -int(self.end > self.start)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally! I'll post some numbers when I have them

@sstadick
Copy link
Author

sstadick commented Jan 8, 2025

@martinvuyk, I tried both suggestions, but did not get an improvement. The bench numbers look near identical, and more importantly, the output assembly vs range(0, end) looks identical.

What I tried:

import benchmark
from benchmark import keep, Unit
from memory import Span
from ir_utils.dump import dump_ir

from builtin._stubs import _IntIterable
from builtin.range import _StridedRange


@register_passable("trivial")
struct _ZeroStartingRange(Sized, ReversibleRange, _IntIterable):
    var start: Int
    var end: Int

    @always_inline
    @implicit
    fn __init__(out self, end: Int):
        self.start = 0
        self.end = end

    @always_inline
    fn __iter__(self) -> Self:
        return self

    @always_inline
    fn __next__(mut self) -> Int:
        var start = self.start
        self.start += 1
        return start

    @always_inline
    fn __has_next__(self) -> Bool:
        return self.end > self.start

    @always_inline
    fn __len__(self) -> Int:
        return (self.end - self.start) & -int(self.end > self.start)

    @always_inline
    fn __getitem__(self, idx: Int) -> Int:
        debug_assert(idx < self.__len__(), "index out of range")
        return self.start + index(idx)

    @always_inline
    fn __reversed__(self) -> _StridedRange:
        return range(self.end - 1, self.start - 1, -1)


fn main() raises:
    var input = List[UInt8]()
    for i in range(1000000):
        input.append(i)
    var data = Span(input)
    var s1 = 0
    for i in _ZeroStartingRange(len(data)):
        s1 += int(data[i])
    var s2 = 0
    for i in range(0, len(data)):
        s2 += int(data[i])
    if s1 != s2:
        raise "Invalid custom iterator"

    @parameter
    fn range_custom_iter_implicit_zero() raises:
        var sum = 0
        for i in _ZeroStartingRange(len(data)):
            sum += int(data[i])
        keep(sum)

    @parameter
    fn range_without_start() raises:
        var sum = 0
        for i in range(len(data)):
            sum += int(data[i])
        keep(sum)

    @parameter
    fn range_with_start() raises:
        var sum = 0
        for i in range(0, len(data)):
            sum += int(data[i])
        keep(sum)

    @parameter
    fn range_iter() raises:
        var sum = 0
        for value in data:
            sum += int(value[])
        keep(sum)

    # dump_ir[range_with_start, "range_with_start"]()
    # dump_ir[range_without_start, "range_without_start"]()
    # dump_ir[
    #     range_custom_iter_implicit_zero, "range_custom_iter_implicit_zero"
    # ]()

    print("Custom Iter")
    var report = benchmark.run[range_custom_iter_implicit_zero]()
    report.print(Unit.ms)
    print("Without start")
    report = benchmark.run[range_without_start]()
    report.print(Unit.ms)
    print("With start")
    report = benchmark.run[range_with_start]()
    report.print(Unit.ms)
    print("With direct iter")
    report = benchmark.run[range_iter]()
    report.print(Unit.ms)

Output:

Custom Iter
--------------------------------------------------------------------------------
Benchmark Report (ms)
--------------------------------------------------------------------------------
Mean: 0.324559812458138
Total: 2422.839
Iters: 7465
Warmup Total: 0.321
Fastest Mean: 0.324559812458138
Slowest Mean: 0.324559812458138

Without start
--------------------------------------------------------------------------------
Benchmark Report (ms)
--------------------------------------------------------------------------------
Mean: 0.43432440260680666
Total: 2399.208
Iters: 5524
Warmup Total: 0.428
Fastest Mean: 0.43432440260680666
Slowest Mean: 0.43432440260680666

With start
--------------------------------------------------------------------------------
Benchmark Report (ms)
--------------------------------------------------------------------------------
Mean: 0.32141025290498976
Total: 2351.116
Iters: 7315
Warmup Total: 0.31
Fastest Mean: 0.32141025290498976
Slowest Mean: 0.32141025290498976

With direct iter
--------------------------------------------------------------------------------
Benchmark Report (ms)
--------------------------------------------------------------------------------
Mean: 0.3155641093357982
Total: 2389.767
Iters: 7573
Warmup Total: 0.31
Fastest Mean: 0.31556410933579826

Assembly:
range_with_start.txt
range_custom_iter_implicit_zero.txt

@martinvuyk
Copy link
Contributor

I'm guessing the difference will only be evident for functional style code where the iterator length or __has_more__ methods get called more often. But nice, it means you really found the main issue 💪. FYI this area is something where future goals are to unify the range types as much as possible, but since it is so performance sensitive there have been previous attempts that were canceled :( (see PR #2949)

@sstadick
Copy link
Author

sstadick commented Jan 8, 2025

Totally makes sense! Which version of this would you prefer:

  1. Leave it as is in my PR, which leaves _ZeroStartingRange but makes it nearly identical to _SequentialRange in all but constructor.
  2. Keep _ZeroStartingRange but apply your suggestions to optimize for future functional style calls that may use those methods more, maybe also apply them to _SequentialRange.
  3. Have fn range(end) (and variants of that) just call fn range(0, end) instead.

@martinvuyk
Copy link
Contributor

I think someone from the stdlib team would have to chime in. But if you want a small change merged, then I'd leave the code as is (plus the micro-optimizations if you want to).

But, if you don't have much problem waiting then I'd say you could completely drop _ZeroStartingRange and call _SequentialRange (adding the micro-optimizations if you want to) in the range() builtin. Then the stdlib team will have to do benchmarks against Modular's codebase, which takes time.

@sstadick
Copy link
Author

sstadick commented Jan 8, 2025

Ha! I thought you were one of the stdlib devs! I'll leave it as is for now till I get some more feedback, and then maybe a separate PR to remove _ZeroStartingRange entirely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants