R - Fast linear regression by group
I have 500K users and need to compute a linear regression (with intercept) for each of them.
Each user has around 30 records.
I tried dplyr with lm, and it is way too slow: around 2 seconds per user.
    library(dplyr)

    df %>%
      group_by(user_id, add = FALSE) %>%
      do(lm = lm(y ~ x, data = .)) %>%
      mutate(lm_b0 = summary(lm)$coeff[1],
             lm_b1 = summary(lm)$coeff[2]) %>%
      select(user_id, lm_b0, lm_b1) %>%
      ungroup()
I tried to use lm.fit, which is known to be faster, but it doesn't seem to be compatible with dplyr.
Is there a fast way to do a linear regression by group?
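You can just use the basic formulas for calculating the slope and intercept of a regression. lm does a lot of unnecessary things if all you care about are those two numbers. Here I use data.table for the aggregation, but you could do it in base R as well (or dplyr):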
    system.time(
      res <- dt[,
        {
          ux <- mean(x)
          uy <- mean(y)
          slope <- sum((x - ux) * (y - uy)) / sum((x - ux) ^ 2)
          list(slope=slope, intercept=uy - slope * ux)
        }, by=user.id
      ]
    )
This produces, for 500K users with ~30 observations each (in seconds):

       user  system elapsed
       7.35    0.00    7.36

Or about 15 microseconds per user. And to confirm this is working as expected:

    > summary(dt[user.id==89663, lm(y ~ x)])$coefficients
                 Estimate Std. Error   t value  Pr(>|t|)
    (Intercept) 0.1965844  0.2927617 0.6714826 0.5065868
    x           0.2021210  0.5429594 0.3722580 0.7120808
    > res[user.id == 89663]
       user.id    slope intercept
    1:   89663 0.202121 0.1965844
Data:

    library(data.table)

    set.seed(1)
    users <- 5e5
    records <- 30
    x <- runif(users * records)
    dt <- data.table(
      x=x, y=x + runif(users * records) * 4 - 2,
      user.id=sample(users, users * records, replace=TRUE)
    )
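Since the answer says the same aggregation can be done in base R, here is a minimal sketch of one way that might look. It is not the answerer's code. The toy data frame `df`, its sizes, and the result name `res_base` are assumptions made for illustration. It uses the same centered formulas, with `ave()` for the per-row group means and `rowsum()` for the per-group sums.

```r
# Toy data: 100 hypothetical users with 30 records each (sizes are assumptions).
set.seed(1)
n_users <- 100
n_rec <- 30
df <- data.frame(
  user.id = rep(seq_len(n_users), each = n_rec),
  x = runif(n_users * n_rec)
)
df$y <- df$x + runif(nrow(df)) * 4 - 2

# ave() returns the group mean repeated on every row of that group.
ux <- ave(df$x, df$user.id)
uy <- ave(df$y, df$user.id)

# rowsum() adds up values within each group, ordered by sorted group id.
num <- rowsum((df$x - ux) * (df$y - uy), df$user.id)
den <- rowsum((df$x - ux)^2, df$user.id)
slope <- as.vector(num / den)

# One mean per group, in the same sorted order that rowsum() uses.
mean_x <- as.vector(tapply(df$x, df$user.id, mean))
mean_y <- as.vector(tapply(df$y, df$user.id, mean))

res_base <- data.frame(
  user.id = as.integer(rownames(num)),
  slope = slope,
  intercept = mean_y - slope * mean_x
)

# Spot check one user against lm(); coef() gives intercept first, then slope.
cf <- coef(lm(y ~ x, data = df[df$user.id == 7, ]))
all.equal(unname(cf), c(res_base$intercept[7], res_base$slope[7]))
```

Everything here is vectorised over the whole data frame, so no model object is built per user. I have not timed it against the data.table version on the 500K-user data set.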