[stdlib] fix zero based range performance #3932

sstadick · 2025-01-08T02:52:44Z

See #3931 all details.

This is the minimal possible change to get range(end) performance matching range(0, end). The cleaner approach would be to have range(end) just call range(0, end), most likely.

The potential optimization in the original version of the _ZeroStartingRange that could skip a call to max seems to make it harder for the compiler to generate efficient assembly for loops (see linked issue for example assembly). This is possibly a compiler bug, and _ZeroStartingRange could be re-instated to its original implementation.

Signed-off-by: Seth Stadick <[email protected]>

sstadick · 2025-01-08T03:19:09Z

The CI failure should be resolved by #3922. It is unrelated to the change I'm making and seems due to this change: 4903cf9#diff-f8da59dec02a848d700736433c150c14adfea4c79cb453ded5c247c1c597b654.

martinvuyk

Hi @sstadick thanks for contributing :)

martinvuyk · 2025-01-08T17:29:17Z

stdlib/src/builtin/range.mojo

-        return self.end - curr
+        var start = self.start
+        self.start += 1
+        return start

    @always_inline
    fn __has_next__(self) -> Bool:
        return self.__len__() > 0


Another place where we can add a micro-optimization :D

Suggested change

return self.__len__() > 0

return self.end > self.start

martinvuyk · 2025-01-08T17:33:20Z

stdlib/src/builtin/range.mojo


    @always_inline
    fn __has_next__(self) -> Bool:
        return self.__len__() > 0

    @always_inline
    fn __len__(self) -> Int:
-        return self.curr
+        return max(0, self.end - self.start)


I think we could make this branchless. No idea how often this gets called or if the optimization would have a positive effect. Would you be willing to measure the diff using something like

return (self.end - self.start) & -int(self.end > self.start)

Totally! I'll post some numbers when I have them

sstadick · 2025-01-08T18:20:03Z

@martinvuyk, I tried both suggestions, but did not get an improvement. The bench numbers look near identical, and more importantly, the output assembly vs range(0, end) looks identical.

What I tried:

import benchmark
from benchmark import keep, Unit
from memory import Span
from ir_utils.dump import dump_ir

from builtin._stubs import _IntIterable
from builtin.range import _StridedRange


@register_passable("trivial")
struct _ZeroStartingRange(Sized, ReversibleRange, _IntIterable):
    var start: Int
    var end: Int

    @always_inline
    @implicit
    fn __init__(out self, end: Int):
        self.start = 0
        self.end = end

    @always_inline
    fn __iter__(self) -> Self:
        return self

    @always_inline
    fn __next__(mut self) -> Int:
        var start = self.start
        self.start += 1
        return start

    @always_inline
    fn __has_next__(self) -> Bool:
        return self.end > self.start

    @always_inline
    fn __len__(self) -> Int:
        return (self.end - self.start) & -int(self.end > self.start)

    @always_inline
    fn __getitem__(self, idx: Int) -> Int:
        debug_assert(idx < self.__len__(), "index out of range")
        return self.start + index(idx)

    @always_inline
    fn __reversed__(self) -> _StridedRange:
        return range(self.end - 1, self.start - 1, -1)


fn main() raises:
    var input = List[UInt8]()
    for i in range(1000000):
        input.append(i)
    var data = Span(input)
    var s1 = 0
    for i in _ZeroStartingRange(len(data)):
        s1 += int(data[i])
    var s2 = 0
    for i in range(0, len(data)):
        s2 += int(data[i])
    if s1 != s2:
        raise "Invalid custom iterator"

    @parameter
    fn range_custom_iter_implicit_zero() raises:
        var sum = 0
        for i in _ZeroStartingRange(len(data)):
            sum += int(data[i])
        keep(sum)

    @parameter
    fn range_without_start() raises:
        var sum = 0
        for i in range(len(data)):
            sum += int(data[i])
        keep(sum)

    @parameter
    fn range_with_start() raises:
        var sum = 0
        for i in range(0, len(data)):
            sum += int(data[i])
        keep(sum)

    @parameter
    fn range_iter() raises:
        var sum = 0
        for value in data:
            sum += int(value[])
        keep(sum)

    # dump_ir[range_with_start, "range_with_start"]()
    # dump_ir[range_without_start, "range_without_start"]()
    # dump_ir[
    #     range_custom_iter_implicit_zero, "range_custom_iter_implicit_zero"
    # ]()

    print("Custom Iter")
    var report = benchmark.run[range_custom_iter_implicit_zero]()
    report.print(Unit.ms)
    print("Without start")
    report = benchmark.run[range_without_start]()
    report.print(Unit.ms)
    print("With start")
    report = benchmark.run[range_with_start]()
    report.print(Unit.ms)
    print("With direct iter")
    report = benchmark.run[range_iter]()
    report.print(Unit.ms)

Output:

Custom Iter
--------------------------------------------------------------------------------
Benchmark Report (ms)
--------------------------------------------------------------------------------
Mean: 0.324559812458138
Total: 2422.839
Iters: 7465
Warmup Total: 0.321
Fastest Mean: 0.324559812458138
Slowest Mean: 0.324559812458138

Without start
--------------------------------------------------------------------------------
Benchmark Report (ms)
--------------------------------------------------------------------------------
Mean: 0.43432440260680666
Total: 2399.208
Iters: 5524
Warmup Total: 0.428
Fastest Mean: 0.43432440260680666
Slowest Mean: 0.43432440260680666

With start
--------------------------------------------------------------------------------
Benchmark Report (ms)
--------------------------------------------------------------------------------
Mean: 0.32141025290498976
Total: 2351.116
Iters: 7315
Warmup Total: 0.31
Fastest Mean: 0.32141025290498976
Slowest Mean: 0.32141025290498976

With direct iter
--------------------------------------------------------------------------------
Benchmark Report (ms)
--------------------------------------------------------------------------------
Mean: 0.3155641093357982
Total: 2389.767
Iters: 7573
Warmup Total: 0.31
Fastest Mean: 0.31556410933579826

Assembly:
range_with_start.txt
range_custom_iter_implicit_zero.txt

martinvuyk · 2025-01-08T18:49:33Z

I'm guessing the difference will only be evident for functional style code where the iterator length or __has_more__ methods get called more often. But nice, it means you really found the main issue 💪. FYI this area is something where future goals are to unify the range types as much as possible, but since it is so performance sensitive there have been previous attempts that were canceled :( (see PR #2949)

sstadick · 2025-01-08T19:19:53Z

Totally makes sense! Which version of this would you prefer:

Leave it as is in my PR, which leaves _ZeroStartingRange but makes it nearly identical to _SequentialRange in all but constructor.
Keep _ZeroStartingRange but apply your suggestions to optimize for future functional style calls that may use those methods more, maybe also apply them to _SequentialRange.
Have fn range(end) (and variants of that) just call fn range(0, end) instead.

martinvuyk · 2025-01-08T19:30:59Z

I think someone from the stdlib team would have to chime in. But if you want a small change merged, then I'd leave the code as is (plus the micro-optimizations if you want to).

But, if you don't have much problem waiting then I'd say you could completely drop _ZeroStartingRange and call _SequentialRange (adding the micro-optimizations if you want to) in the range() builtin. Then the stdlib team will have to do benchmarks against Modular's codebase, which takes time.

sstadick · 2025-01-08T19:49:31Z

Ha! I thought you were one of the stdlib devs! I'll leave it as is for now till I get some more feedback, and then maybe a separate PR to remove _ZeroStartingRange entirely.

sstadick requested a review from a team as a code owner January 8, 2025 02:52

sstadick force-pushed the fix-zero-based-range branch from 8b2b7ed to 20668b0 Compare January 8, 2025 02:55

sstadick changed the title ~~fix: zero based range~~ [stdlib] fix zero based range performance Jan 8, 2025

fix: improve perf of zero based range iterator

6a60e10

Signed-off-by: Seth Stadick <[email protected]>

sstadick force-pushed the fix-zero-based-range branch from 20668b0 to 6a60e10 Compare January 8, 2025 02:59

martinvuyk reviewed Jan 8, 2025

View reviewed changes

Merge branch 'nightly' into fix-zero-based-range

92bb738

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[stdlib] fix zero based range performance #3932

[stdlib] fix zero based range performance #3932

sstadick commented Jan 8, 2025

sstadick commented Jan 8, 2025 •

edited

Loading

martinvuyk left a comment

martinvuyk Jan 8, 2025

martinvuyk Jan 8, 2025

sstadick Jan 8, 2025

sstadick commented Jan 8, 2025

martinvuyk commented Jan 8, 2025

sstadick commented Jan 8, 2025 •

edited

Loading

martinvuyk commented Jan 8, 2025

sstadick commented Jan 8, 2025

[stdlib] fix zero based range performance #3932

Are you sure you want to change the base?

[stdlib] fix zero based range performance #3932

Conversation

sstadick commented Jan 8, 2025

sstadick commented Jan 8, 2025 • edited Loading

martinvuyk left a comment

Choose a reason for hiding this comment

martinvuyk Jan 8, 2025

Choose a reason for hiding this comment

martinvuyk Jan 8, 2025

Choose a reason for hiding this comment

sstadick Jan 8, 2025

Choose a reason for hiding this comment

sstadick commented Jan 8, 2025

martinvuyk commented Jan 8, 2025

sstadick commented Jan 8, 2025 • edited Loading

martinvuyk commented Jan 8, 2025

sstadick commented Jan 8, 2025

sstadick commented Jan 8, 2025 •

edited

Loading

sstadick commented Jan 8, 2025 •

edited

Loading