
@floop slows down function #10

Open · non-Jedi opened this issue May 14, 2020 · 6 comments

@non-Jedi commented May 14, 2020
function prime_factors1(n::Integer)
    factors = Int[]
    # Sadly, using an iterator on a hot loop like this is a bit too slow.
    for i in Iterators.flatten((2, 3:2:n))
        while n % i == 0
            push!(factors, i)
            n = div(n, i)
        end
        n == 1 && break
    end
    factors
end

function prime_factors3(n)
    @floop begin
        n
        factors = Int[]
        for i in Iterators.flatten((2, 3:2:n))
            while n % i == 0
                push!(factors, i)
                n = div(n, i)
            end
            n == 1 && break
        end
    end
    factors
end

As far as I can tell, prime_factors3 is the @floop equivalent of prime_factors1, but it runs significantly slower and makes far more allocations:

julia> @btime prime_factors1($98623589);
  1.786 ms (4 allocations: 208 bytes)

julia> @btime prime_factors3($98623589);
  2.996 ms (206331 allocations: 6.30 MiB)
tkf added a commit that referenced this issue May 15, 2020
This improves the type inference result for #10.  However, it does not
improve the performance of it.
@tkf (Member) commented May 15, 2020

I think this is a really nice example. I got inference working with #11, but the performance is not great yet (it looks like the inliner just gives up). There are a few other ideas to try.

tkf added a commit to JuliaFolds/Transducers.jl that referenced this issue May 15, 2020
@tkf (Member) commented May 15, 2020

OK, with JuliaFolds/Transducers.jl#282, I now have

julia> @btime prime_factors1($98623589);
  1.500 ms (4 allocations: 208 bytes)

julia> @btime prime_factors3($98623589);
  1.850 ms (206329 allocations: 6.30 MiB)

With a bit of cheating (you don't need to put factors in the state, since you never re-assign it):

function prime_factors3_cheating(n)
    factors = Int[]
    @floop begin
        n
        for i in Iterators.flatten((2, 3:2:n))
            while n % i == 0
                push!(factors, i)
                n = div(n, i)
            end
            n == 1 && break
        end
    end
    factors
end

I get

julia> @btime prime_factors3_cheating($98623589);
  724.095 μs (2 allocations: 128 bytes)

I think it's possible for the macro to automatically exclude factors from the state.
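One way that exclusion could work (a hypothetical sketch, not FLoops internals): walk the loop body and collect only the variables that are actually re-assigned; anything that never appears on the left-hand side of an `=` (like factors, which is only mutated via push!) would stay out of the fold state.

```julia
# Hypothetical sketch (not FLoops internals): collect the variables that
# are re-assigned anywhere in an expression.  A macro could keep only
# these in the fold state; `factors` is mutated via push! but never
# re-assigned, so it would be excluded.
assigned_vars(ex) = assigned_vars!(Set{Symbol}(), ex)
assigned_vars!(acc, ex) = acc  # leaves: literals, symbols, line nodes
function assigned_vars!(acc::Set{Symbol}, ex::Expr)
    if ex.head === :(=) && ex.args[1] isa Symbol
        push!(acc, ex.args[1])
    end
    foreach(a -> assigned_vars!(acc, a), ex.args)
    return acc
end
```

On the loop body from this issue, only `n` is collected; `factors` is not, since `push!(factors, i)` is a call, not an assignment.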

@tkf (Member) commented May 15, 2020

Just as a reference, the unrolled version from https://julialang.zulipchat.com/#narrow/stream/225542-helpdesk/topic/Fast.20chained.20iterator.3F/near/197555379

function prime_factors2(n::Integer)
    factors = Int[]
    while n % 2 == 0
        push!(factors, 2)
        n = div(n, 2)
    end
    n == 1 && return factors
    for i in 3:2:n
        while n % i == 0
            push!(factors, i)
            n = div(n, i)
        end
        n == 1 && return factors
    end
    factors
end

is still faster:

julia> @btime prime_factors2($98623589);
  653.200 μs (2 allocations: 128 bytes)

@non-Jedi (Author) commented:
I assume the macro expands to basically the equivalent of my attempt at a hand-written transducers solution when moving factors outside the @floop? Or does it use foreach?

function prime_factors4(n)
    factors = Int[]
    foldl(Cat(), (2, 3:2:n); init=n) do n, i
        while n % i == 0
            push!(factors, i)
            n ÷= i
        end
        n == 1 ? reduced(n) : n
    end
    factors
end

There was a large performance gap between the foldl and foreach transducer implementations I tried, which looks to be due to JuliaLang/julia#15276. Just curious if your work somehow makes the prime_factors5 version below fast.

julia> function prime_factors5(n)
           factors = Int[]
           foreach(Cat(), (2, 3:2:n)) do i
               while n % i == 0
                   push!(factors, i)
                   n ÷= i
               end
               n == 1 && reduced()
           end
           factors
       end
prime_factors5 (generic function with 1 method)

julia> @btime prime_factors1($98623589);
  1.786 ms (4 allocations: 208 bytes)

julia> @btime prime_factors4($98623589);
  1.643 ms (5 allocations: 176 bytes)

julia> @btime prime_factors5($98623589);
  16.747 ms (410286 allocations: 6.26 MiB)
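The slowdown of prime_factors5 traces back to that boxing issue: the do-block closure re-assigns the captured n, so it gets boxed. A minimal standalone illustration (not part of the thread) of JuliaLang/julia#15276:

```julia
# Minimal illustration of JuliaLang/julia#15276 (not from the thread):
# when a closure re-assigns a variable it captures, the variable is
# stored in a `Core.Box`, so every access is type-unstable and the
# compiler can no longer keep it in a register.
function make_counter()
    n = 10
    step() = (n -= 1; n)  # re-assigns the captured `n`
    return step
end

counter = make_counter()
# The captured variable is held as a Box rather than an Int:
fieldtype(typeof(counter), :n) === Core.Box  # true
```

The same thing happens to n in prime_factors5: the do-block both reads and re-assigns it, so every `n % i` goes through a Box and allocates.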

@tkf (Member) commented May 15, 2020

> I assume the macro expands to basically the equivalent of my attempt at a hand-written transducers solution when moving factors outside the @floop?

Yeah, that's right.

> Just curious if your work somehow makes the prime_factors5 version below fast.

As you said, there is the boxing problem, so I don't think it's possible to make the foreach version fast (at least with the current Julia compiler). A simpler solution is to just put the variables to be re-assigned in the accumulator of foldl.
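A sketch of that suggestion (hypothetical prime_factors6, not from the thread; assumes `using Transducers`): thread every re-assigned variable through the foldl accumulator as a tuple, so the do-block closure re-assigns nothing it captures and nothing gets boxed.

```julia
using Transducers: Cat, reduced

# Hypothetical variant (not from the thread): carry the re-assigned `n`
# (and, for illustration, `factors` too) in the accumulator tuple, so
# the do-block closure captures no variable that it re-assigns.
function prime_factors6(n)
    state = foldl(Cat(), (2, 3:2:n); init = (n, Int[])) do (n, factors), i
        while n % i == 0
            push!(factors, i)
            n = div(n, i)
        end
        n == 1 ? reduced((n, factors)) : (n, factors)
    end
    return state[2]
end
```

Only n strictly needs to live in the accumulator; factors is never re-assigned, so it could equally stay a plain closed-over variable as in prime_factors4.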

tkf added a commit to JuliaFolds/Transducers.jl that referenced this issue May 15, 2020
* Inline more transducer methods

Required for JuliaFolds/FLoops.jl#10

* Remove "broken annotation" from the now-passing test