Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPU Parallel Ensemble Simulations with generic linear algebra #110

Open
albertomercurio opened this issue Jul 13, 2021 · 8 comments
Open
Labels

Comments

@albertomercurio
Copy link

Hello,

I'm trying to solve and ODE with my GPU. I tested the example code present in the DiffEqGPU.js repository, and it worked fine.

I have the following function to implement in the ODE Problem

function drho_dt(rho, p, t)
    A, w_l= p
    return ( L + A * sin(w_l * t) * L_t ) * rho
end

or equivalently

function drho_dt(drho, rho, p, t)
    A, w_l = p
    mul!(drho, L + A * sin(w_l * t) * L_t, rho)
    return nothing
end

where A and w_l are two numbers, while L and L_t are two global matrices. Following the documentation, I used the following code to run a Parallel Ensemble:

A = 0.05
w_l = 1.0

p = [A, w_l]

tspan = (0.0, 10.0 / gam_q)

w_l_l = LinRange{Float64}(0.1, 1.9, 10)

function prob_func(prob,i,repeat)
  remake(prob,p=[prob.p[1], w_l_l[i]])
end

prob = ODEProblem(drho_dt, rho0_vec, tspan, p)
ensemble_prob = EnsembleProblem(prob, prob_func=prob_func, safetycopy=false)
@time sim = solve(ensemble_prob, BS3(), METHOD!!, trajectories=length(w_l_l))

where METHOD!! can be one of the methods, e.g. EnsembleSerial(), EnsembleThreads(), EnsembleCPUArray() or EnsembleGPUArray().

The problem is divided in two cases:

  1. When L and L_t are sparse matrices. In this case all the methods work, except for EnsembleGPUArray(). In this case the error is
InvalidIRError: compiling kernel gpu_gpu_kernel_oop(Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel_oop), typeof(drho_dt), CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{Float64, 1}, Float64) resulted in invalid LLVM IR
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] overdub
   @ ./In[8]:62
 [2] macro expansion
   @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:25
 [3] overdub
   @ ~/.julia/packages/KernelAbstractions/Yy47c/src/macros.jl:80
 [4] overdub
   @ ~/.julia/packages/Cassette/N5kbV/src/overdub.jl:0
Reason: unsupported dynamic function invocation (call to overdub)
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:27
 [2] overdub
   @ ~/.julia/packages/KernelAbstractions/Yy47c/src/macros.jl:80
 [3] overdub
   @ ~/.julia/packages/Cassette/N5kbV/src/overdub.jl:0

Stacktrace:
  [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{typeof(Cassette.overdub), Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel_oop), typeof(drho_dt), CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{Float64, 1}, Float64}}}, args::LLVM.Module)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/wHjqZ/src/validation.jl:111
  [2] macro expansion
    @ ~/.julia/packages/GPUCompiler/wHjqZ/src/driver.jl:319 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/TimerOutputs/ZQ0rt/src/TimerOutput.jl:236 [inlined]
  [4] macro expansion
    @ ~/.julia/packages/GPUCompiler/wHjqZ/src/driver.jl:317 [inlined]
  [5] emit_asm(job::GPUCompiler.CompilerJob, ir::LLVM.Module; strip::Bool, validate::Bool, format::LLVM.API.LLVMCodeGenFileType)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/wHjqZ/src/utils.jl:62
  [6] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/.julia/packages/CUDA/02Kjq/src/compiler/execution.jl:317
  [7] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/wHjqZ/src/cache.jl:89
  [8] cufunction(f::typeof(Cassette.overdub), tt::Type{Tuple{Cassette.Context{nametype(CUDACtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicCheck, Nothing, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, typeof(DiffEqGPU.gpu_gpu_kernel_oop), typeof(drho_dt), CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{ComplexF64, 1}, CuDeviceMatrix{Float64, 1}, Float64}}; name::String, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/.julia/packages/CUDA/02Kjq/src/compiler/execution.jl:288
  [9] macro expansion
    @ ~/.julia/packages/CUDA/02Kjq/src/compiler/execution.jl:102 [inlined]
 [10] (::KernelAbstractions.Kernel{CUDAKernels.CUDADevice, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.gpu_gpu_kernel_oop)})(::Function, ::Vararg{Any, N} where N; ndrange::Int64, dependencies::CUDAKernels.CudaEvent, workgroupsize::Int64, progress::Function)
    @ CUDAKernels ~/.julia/packages/CUDAKernels/8wtKq/src/CUDAKernels.jl:194
 [11] SciML/DifferentialEquations.jl#55
    @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:414 [inlined]
 [12] ODEFunction
    @ ~/.julia/packages/SciMLBase/oTP8b/src/scimlfunctions.jl:334 [inlined]
 [13] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5, true, CuArray{ComplexF64, 2}, Nothing, Float64, CuArray{Float64, 2}, Float64, Float64, Float64, Float64, Vector{CuArray{ComplexF64, 2}}, ODESolution{ComplexF64, 3, Vector{CuArray{ComplexF64, 2}}, Nothing, Nothing, Vector{Float64}, Vector{Vector{CuArray{ComplexF64, 2}}}, ODEProblem{CuArray{ComplexF64, 2}, Tuple{Float64, Float64}, true, CuArray{Float64, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Tsit5, OrdinaryDiffEq.InterpolationData{ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{CuArray{ComplexF64, 2}}, Vector{Float64}, Vector{Vector{CuArray{ComplexF64, 2}}}, OrdinaryDiffEq.Tsit5Cache{CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}}, DiffEqBase.DEStats}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Tsit5Cache{CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}, OrdinaryDiffEq.DEOptions{Float64, Float64, Float64, Float64, PIController{Rational{Int64}}, typeof(DiffEqGPU.diffeqgpunorm), typeof(opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), DiffEqGPU.var"#12#18", DataStructures.BinaryMinHeap{Float64}, DataStructures.BinaryMinHeap{Float64}, Nothing, Nothing, Int64, Tuple{}, Tuple{}, Tuple{}}, CuArray{ComplexF64, 2}, ComplexF64, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Tsit5Cache{CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, CuArray{ComplexF64, 2}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/perform_step/low_order_rk_perform_step.jl:623
 [14] __init(prob::ODEProblem{CuArray{ComplexF64, 2}, Tuple{Float64, Float64}, true, CuArray{Float64, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::Tsit5, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, dtmin::Nothing, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, beta1::Nothing, beta2::Nothing, qoldinit::Rational{Int64}, controller::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#12#18", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:456
 [15] #__solve#466
    @ ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:4 [inlined]
 [16] #solve_call#58
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:61 [inlined]
 [17] solve_up(prob::ODEProblem{CuArray{ComplexF64, 2}, Tuple{Float64, Float64}, true, CuArray{Float64, 2}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::CuArray{ComplexF64, 2}, p::CuArray{Float64, 2}, args::Tsit5; kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:unstable_check, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#12#18", Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:82
 [18] #solve#59
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:70 [inlined]
 [19] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}, u0::Matrix{ComplexF64}, p::Matrix{Float64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:319
 [20] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray, I::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:284
 [21] macro expansion
    @ ./timing.jl:287 [inlined]
 [22] __solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleGPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:201
 [23] #solve#61
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:96 [inlined]
 [24] top-level scope
    @ ./timing.jl:210 [inlined]
 [25] top-level scope
    @ ./In[9]:0
 [26] eval
    @ ./boot.jl:360 [inlined]
 [27] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094
  1. When L and L_t are dense matrices. In this case not even EnsembleCPUArray() works, but the others yes. The error for the GPU method should be the same. While the error for the CPU one is
TaskFailedException

    nested task error: conversion to pointer not defined for Base.Experimental.Const{ComplexF64, 2}
    Stacktrace:
      [1] error(::String)
        @ ./error.jl:33 [inlined]
      [2] overdub
        @ ./error.jl:33 [inlined]
      [3] unsafe_convert(::Type{Ptr{ComplexF64}}, ::Base.Experimental.Const{ComplexF64, 2})
        @ ./pointer.jl:67 [inlined]
      [4] overdub
        @ ./pointer.jl:67 [inlined]
      [5] unsafe_convert(::Type{Ptr{ComplexF64}}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
        @ ./subarray.jl:427 [inlined]
      [6] overdub
        @ ./subarray.jl:427 [inlined]
      [7] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/blas.jl:704 [inlined]
      [8] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:544 [inlined]
      [9] mul!(::Vector{ComplexF64}, ::Matrix{ComplexF64}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true}, ::Bool, ::Bool)
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:66 [inlined]
     [10] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:66 [inlined]
     [11] mul!(::Vector{ComplexF64}, ::Matrix{ComplexF64}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
     [12] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:275 [inlined]
     [13] overdub
        @ /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.6/LinearAlgebra/src/matmul.jl:51 [inlined]
     [14] overdub(::Cassette.Context{nametype(CPUCtx), KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.NoDynamicCheck, CartesianIndex{1}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}}, Nothing, KernelAbstractions.var"##PassType#257", Nothing, Cassette.DisableHooks}, ::typeof(*), ::Matrix{ComplexF64}, ::SubArray{ComplexF64, 1, Base.Experimental.Const{ComplexF64, 2}, Tuple{Base.Slice{Base.OneTo{Int64}}, Int64}, true})
        @ Cassette ~/.julia/packages/Cassette/N5kbV/src/overdub.jl:0
     [15] overdub
        @ ./In[11]:62 [inlined]
     [16] macro expansion
        @ ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:25 [inlined]
     [17] overdub
        @ ~/.julia/packages/KernelAbstractions/Yy47c/src/macros.jl:265 [inlined]
     [18] __thread_run(tid::Int64, len::Int64, rem::Int64, obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.cpu_gpu_kernel_oop)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}, args::Tuple{typeof(drho_dt), Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{Float64}, Float64}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
        @ KernelAbstractions ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:157
     [19] __run(obj::KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.cpu_gpu_kernel_oop)}, ndrange::Tuple{Int64}, iterspace::KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}, args::Tuple{typeof(drho_dt), Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{Float64}, Float64}, dynamic::KernelAbstractions.NDIteration.NoDynamicCheck)
        @ KernelAbstractions ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:130
     [20] (::KernelAbstractions.var"#33#34"{Tuple{KernelAbstractions.NoneEvent}, Nothing, typeof(KernelAbstractions.__run), Tuple{KernelAbstractions.Kernel{KernelAbstractions.CPU, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, typeof(DiffEqGPU.cpu_gpu_kernel_oop)}, Tuple{Int64}, KernelAbstractions.NDIteration.NDRange{1, KernelAbstractions.NDIteration.DynamicSize, KernelAbstractions.NDIteration.DynamicSize, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}, CartesianIndices{1, Tuple{Base.OneTo{Int64}}}}, Tuple{typeof(drho_dt), Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{Float64}, Float64}, KernelAbstractions.NDIteration.NoDynamicCheck}})()
        @ KernelAbstractions ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:22

Stacktrace:
  [1] wait
    @ ./task.jl:322 [inlined]
  [2] wait
    @ ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:65 [inlined]
  [3] wait
    @ ~/.julia/packages/KernelAbstractions/Yy47c/src/cpu.jl:64 [inlined]
  [4] (::DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)})(du::Matrix{ComplexF64}, u::Matrix{ComplexF64}, p::Matrix{Float64}, t::Float64)
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:414
  [5] ODEFunction
    @ ~/.julia/packages/SciMLBase/oTP8b/src/scimlfunctions.jl:334 [inlined]
  [6] initialize!(integrator::OrdinaryDiffEq.ODEIntegrator{Tsit5, true, Matrix{ComplexF64}, Nothing, Float64, Matrix{Float64}, Float64, Float64, Float64, Float64, Vector{Matrix{ComplexF64}}, ODESolution{ComplexF64, 3, Vector{Matrix{ComplexF64}}, Nothing, Nothing, Vector{Float64}, Vector{Vector{Matrix{ComplexF64}}}, ODEProblem{Matrix{ComplexF64}, Tuple{Float64, Float64}, true, Matrix{Float64}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, Tsit5, OrdinaryDiffEq.InterpolationData{ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Vector{Matrix{ComplexF64}}, Vector{Float64}, Vector{Vector{Matrix{ComplexF64}}}, OrdinaryDiffEq.Tsit5Cache{Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{ComplexF64}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}}, DiffEqBase.DEStats}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, OrdinaryDiffEq.Tsit5Cache{Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{ComplexF64}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}}, OrdinaryDiffEq.DEOptions{Float64, Float64, Float64, Float64, PIController{Rational{Int64}}, typeof(DiffEqGPU.diffeqgpunorm), typeof(opnorm), Nothing, CallbackSet{Tuple{}, Tuple{}}, typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), DiffEqGPU.var"#12#18", DataStructures.BinaryMinHeap{Float64}, DataStructures.BinaryMinHeap{Float64}, Nothing, Nothing, Int64, Tuple{}, Tuple{}, Tuple{}}, Matrix{ComplexF64}, ComplexF64, Nothing, OrdinaryDiffEq.DefaultInit}, cache::OrdinaryDiffEq.Tsit5Cache{Matrix{ComplexF64}, Matrix{ComplexF64}, Matrix{ComplexF64}, OrdinaryDiffEq.Tsit5ConstantCache{Float64, Float64}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/perform_step/low_order_rk_perform_step.jl:623
  [7] __init(prob::ODEProblem{Matrix{ComplexF64}, Tuple{Float64, Float64}, true, Matrix{Float64}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, alg::Tsit5, timeseries_init::Tuple{}, ts_init::Tuple{}, ks_init::Tuple{}, recompile::Type{Val{true}}; saveat::Tuple{}, tstops::Tuple{}, d_discontinuities::Tuple{}, save_idxs::Nothing, save_everystep::Bool, save_on::Bool, save_start::Bool, save_end::Nothing, callback::Nothing, dense::Bool, calck::Bool, dt::Float64, dtmin::Nothing, dtmax::Float64, force_dtmin::Bool, adaptive::Bool, gamma::Rational{Int64}, abstol::Nothing, reltol::Nothing, qmin::Rational{Int64}, qmax::Int64, qsteady_min::Int64, qsteady_max::Int64, beta1::Nothing, beta2::Nothing, qoldinit::Rational{Int64}, controller::Nothing, fullnormalize::Bool, failfactor::Int64, maxiters::Int64, internalnorm::typeof(DiffEqGPU.diffeqgpunorm), internalopnorm::typeof(opnorm), isoutofdomain::typeof(DiffEqBase.ODE_DEFAULT_ISOUTOFDOMAIN), unstable_check::DiffEqGPU.var"#12#18", verbose::Bool, timeseries_errors::Bool, dense_errors::Bool, advance_to_tstop::Bool, stop_at_next_tstop::Bool, initialize_save::Bool, progress::Bool, progress_steps::Int64, progress_name::String, progress_message::typeof(DiffEqBase.ODE_DEFAULT_PROG_MESSAGE), userdata::Nothing, allow_extrapolation::Bool, initialize_integrator::Bool, alias_u0::Bool, alias_du0::Bool, initializealg::OrdinaryDiffEq.DefaultInit, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ OrdinaryDiffEq ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:456
  [8] #__solve#466
    @ ~/.julia/packages/OrdinaryDiffEq/a5qvy/src/solve.jl:4 [inlined]
  [9] #solve_call#58
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:61 [inlined]
 [10] solve_up(prob::ODEProblem{Matrix{ComplexF64}, Tuple{Float64, Float64}, true, Matrix{Float64}, ODEFunction{true, DiffEqGPU.var"#55#59"{typeof(drho_dt), typeof(DiffEqGPU.gpu_kernel_oop)}, UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, sensealg::Nothing, u0::Matrix{ComplexF64}, p::Matrix{Float64}, args::Tsit5; kwargs::Base.Iterators.Pairs{Symbol, Any, NTuple{4, Symbol}, NamedTuple{(:unstable_check, :callback, :merge_callbacks, :internalnorm), Tuple{DiffEqGPU.var"#12#18", Nothing, Bool, typeof(DiffEqGPU.diffeqgpunorm)}}})
    @ DiffEqBase ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:82
 [11] #solve#59
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:70 [inlined]
 [12] batch_solve_up(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, probs::Vector{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}}, alg::Tsit5, ensemblealg::EnsembleCPUArray, I::UnitRange{Int64}, u0::Matrix{ComplexF64}, p::Matrix{Float64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:319
 [13] batch_solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleCPUArray, I::UnitRange{Int64}; kwargs::Base.Iterators.Pairs{Symbol, DiffEqGPU.var"#12#18", Tuple{Symbol}, NamedTuple{(:unstable_check,), Tuple{DiffEqGPU.var"#12#18"}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:284
 [14] macro expansion
    @ ./timing.jl:287 [inlined]
 [15] __solve(ensembleprob::EnsembleProblem{ODEProblem{Vector{ComplexF64}, Tuple{Float64, Float64}, false, Vector{Float64}, ODEFunction{false, typeof(drho_dt), UniformScaling{Bool}, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, typeof(SciMLBase.DEFAULT_OBSERVED), Nothing}, Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, SciMLBase.StandardODEProblem}, typeof(prob_func), typeof(SciMLBase.DEFAULT_OUTPUT_FUNC), typeof(SciMLBase.DEFAULT_REDUCTION), Nothing}, alg::Tsit5, ensemblealg::EnsembleCPUArray; trajectories::Int64, batch_size::Int64, unstable_check::Function, kwargs::Base.Iterators.Pairs{Union{}, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ DiffEqGPU ~/.julia/packages/DiffEqGPU/gGgMY/src/DiffEqGPU.jl:201
 [16] #solve#61
    @ ~/.julia/packages/DiffEqBase/oe7VF/src/solve.jl:96 [inlined]
 [17] top-level scope
    @ ./timing.jl:210 [inlined]
 [18] top-level scope
    @ ./In[11]:0
 [19] eval
    @ ./boot.jl:360 [inlined]
 [20] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094
@ChrisRackauckas
Copy link
Member

Yes, I don't think it can generate generic linear algebra calls.

@ChrisRackauckas ChrisRackauckas changed the title GPU Parallel Ensemble Simulations doesn't work Jul 13, 2021
@ChrisRackauckas ChrisRackauckas transferred this issue from SciML/DifferentialEquations.jl Jul 13, 2021
@ChrisRackauckas
Copy link
Member

Did you try using StaticArrays?

@albertomercurio
Copy link
Author

Did you try using StaticArrays?

I'm using very large matrices (usually 1500x1500), this is the reason i use sparse matrices. However, also with a 100x100 matrix it tooks a lot of time to convert L to a Static Array

SMatrix{N, N}(Matrix(L))
@ChrisRackauckas
Copy link
Member

That makes no sense to solve with EnsembleGPUArray. Read the DiffEqGPU documentation on the two forms of GPU problems. You're using the form for 100 ODEs or less, not the one for 1000 ODEs or more.

https://github.com/SciML/DiffEqGPU.jl#within-method-gpu-parallelism-with-direct-cuarray-usage

That's more applicable here.

@albertomercurio
Copy link
Author

Do you mean that I need to convert L and L_t into CuArray and than evolve each ODE serially with EnsembleSerial()? If it is the case, can i use EnsembleThreads() to parallelize that CUDA ODE again?

@ChrisRackauckas
Copy link
Member

That would evolve the ODE in parallel on the GPU. You want to use EnsembleSerial unless you have multiple GPUs which you would want to use at the same time.

@albertomercurio
Copy link
Author

Ok perfect. Thank you for your help.

However I have the following question. I'm yet able to perform this ODE directly on the GPU, using the differential function

L = CUDA.CUSPARSE.CuSparseMatrixCSR( L )
L_t = CUDA.CUSPARSE.CuSparseMatrixCSR( L_t )

function f(rho, p, t)
    A, w_l = p
    rho_tmp = similar(rho0_vec)
    return L * rho + CUDA.CUSPARSE.mv!('N', A * sin(w_l * t), L_t, rho, 0.0, rho_tmp, '0')
end

where this time L and L_t are CuSparse matrices, and they don't take up a lot of memory as the dense matrix form.
This method is a little bit faster than the CPU sparse one, but of course slower if I parallelize the CPU method in the Ensemble Simulation.
Whit this CUDA sparse method the GPU is loaded only up to the 10%, and so the memory. If I could be able to parallelize all the ODEs in the ensemble I think it would get a lot faster than the EnsembleThreads() method.

Why there is no way to implement this method for an EnsembleGPUArray() simulation?

@ChrisRackauckas
Copy link
Member

To GPU it like this, you'd want to expand it out to its scalar form with something like ModelingToolkit, in order to then do the GPU codegen. EnsembleGPUArray is all about kernel code generation, so it's not using CUDA kernels like those from CUSPARSE since it's really made to generate nonlinear kernels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2 participants