Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Construct opt #361

Merged
merged 3 commits into from
Mar 26, 2018
Merged

Construct opt #361

merged 3 commits into from
Mar 26, 2018

Conversation

ancapdev
Copy link
Contributor

I've been trying to use TimeSeries.jl on some high frequency datasets and ran into some performance bottlenecks in TimeArray construction. In particular the uniqueness test is heavy.

I've made a couple of changes here that help my use cases:

  • Check sorted and uniqueness in one pass, allocation free.
  • Add option to disable construction checks entirely.

Rough performance improvements for something on the scale of my use case, 10 million entries, 1 column:

  • Previous: ~4 seconds
  • New: ~8ms
  • New unchecked: ~20us

If these changes look reasonable I have a few more I'd like to add. To extend this I'd change some of the algorithms to construct TimeArray objects unchecked. I also noticed a few places that don't handle Float32 properly, and I'd like to rectify that.

@codecov-io
Copy link

codecov-io commented Mar 22, 2018

Codecov Report

Merging #361 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #361      +/-   ##
==========================================
- Coverage   86.22%   86.19%   -0.03%     
==========================================
  Files          10       10              
  Lines         479      478       -1     
==========================================
- Hits          413      412       -1     
  Misses         66       66
Impacted Files Coverage Δ
src/timearray.jl 98.48% <100%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d462cd5...0e2383b. Read the comment docs.

@femtotrader
Copy link
Contributor

@iblis17 could you review this PR ?
Thanks

@iblislin iblislin self-assigned this Mar 23, 2018
@iblislin
Copy link
Collaborator

Could you add some test cases?

@iblislin
Copy link
Collaborator

Seem it can be faster, could you try this patch out?

--- a/src/timearray.jl
+++ b/src/timearray.jl
@@ -23,19 +23,23 @@ struct TimeArray{T, N, D <: TimeType, A <: AbstractArray{T, N}} <: AbstractTimeS
             timestamp::AbstractVector{D},
             values::A,
             colnames::Vector{String},
-            meta::Any;
-            unchecked = false) where {T, N, D <: TimeType, A <: AbstractArray{T, N}}
+            meta::Any,
+            unchecked::Bool) where {T, N, D <: TimeType, A <: AbstractArray{T, N}}
         nrow, ncol = size(values, 1, 2)
@@ -46,12 +50,12 @@ end
 TimeArray(d::AbstractVector{D}, v::AbstractArray{T, N},
           c::Vector{S}=fill("", size(v,2)),
           m::Any=nothing;
-          args...) where {T, N, D <: TimeType, S <: AbstractString} =
-    TimeArray{T, N, D, typeof(v)}(d, v, map(String, c), m; args...)
+          unchecked=false) where {T, N, D <: TimeType, S <: AbstractString} =
+    TimeArray{T, N, D, typeof(v)}(d, v, map(String, c), m, unchecked)
 TimeArray(d::D, v::AbstractArray{T, N}, c::Vector{S}=fill("", size(v, 2)),
           m::Any=nothing;
-          args...) where {T, N, D <: TimeType, S <: AbstractString} =
-    TimeArray{T, N, D, typeof(v)}([d], v, map(String, c), m; args...)
+          unchecked=false) where {T, N, D <: TimeType, S <: AbstractString} =
+    TimeArray{T, N, D, typeof(v)}([d], v, map(String, c), m, unchecked)

@ancapdev
Copy link
Contributor Author

Tests, sure, you mean for the new unchecked path? Other than that there's no new functionality.

Regards the defaulted regular argument vs keyword argument, I can benchmark, but I expect it to make very marginal performance difference given the context of everything else going on (e.g., replace_dupes()), and I to my mind the API is much worse. Unnamed boolean arguments don't tell you much at the call site, and if you were to later extend the API it could get messy. Happy to do whatever is desired, but this doesn't seem the best trade off.

@iblislin
Copy link
Collaborator

Unnamed boolean arguments don't tell you much at the call site, and if you were to later extend the API it could get messy. Happy to do whatever is desired, but this doesn't seem the best trade off.

I expect that user usually call the outer constructor instead of inner constructor, so the positional argument of inner constructor is fine for me. User still has TimeArray(.., unchecked = true).

But, yeah, as you said, it's only very marginal performance difference.
If you do not want to change it, I'm ok.

@ancapdev
Copy link
Contributor Author

Cool, if we can get this change in, I'll be rebasing and preparing a PR for optimizing the merge operation next, a more interesting change ;)

@iblislin iblislin added this to the 0.12.0 milestone Mar 26, 2018
@iblislin
Copy link
Collaborator

Cool, if we can get this change in

sorry, you mean this PR or my patch?

@ancapdev
Copy link
Contributor Author

This PR

@iblislin iblislin merged commit e503b61 into JuliaStats:master Mar 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants