R - Fast linear regression by group


I have 500K users, and I need to compute a linear regression (with intercept) for each of them.

Each user has around 30 records.

I tried dplyr with lm, but this is way too slow: around 2 seconds per user.

    df %>%
      group_by(user_id, add = FALSE) %>%
      do(lm = lm(y ~ x, data = .)) %>%
      mutate(lm_b0 = summary(lm)$coeff[1],
             lm_b1 = summary(lm)$coeff[2]) %>%
      select(user_id, lm_b0, lm_b1) %>%
      ungroup()

I tried to use lm.fit, which is known to be faster, but it doesn't seem to be compatible with dplyr.

Is there a fast way to do linear regression by group?

You can use the basic formulas for calculating the slope and intercept of the regression. lm does a lot of unnecessary things if all you care about are those 2 numbers. Here I use data.table for the aggregation, but you could do the same in base R (or dplyr):

    system.time(
      res <- dt[,
        {
          ux <- mean(x)
          uy <- mean(y)
          slope <- sum((x - ux) * (y - uy)) / sum((x - ux) ^ 2)
          list(slope=slope, intercept=uy - slope * ux)
        }, by=user.id
      ]
    )

For 500K users with ~30 observations each, this produces (in seconds):

       user  system elapsed
       7.35    0.00    7.36

That is about 15 microseconds per user. To confirm this is working as expected:

    > summary(dt[user.id==89663, lm(y ~ x)])$coefficients
                 Estimate Std. Error   t value  Pr(>|t|)
    (Intercept) 0.1965844  0.2927617 0.6714826 0.5065868
    x           0.2021210  0.5429594 0.3722580 0.7120808
    > res[user.id == 89663]
       user.id    slope intercept
    1:   89663 0.202121 0.1965844
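For reference, the slope and intercept computed in the aggregation above are the standard closed-form least-squares estimates for a regression with one predictor. Written out, with \bar{x} and \bar{y} denoting the per-user means (ux and uy in the code), and \hat{\beta}_0, \hat{\beta}_1 being notation introduced here rather than names from the code:

```latex
\hat{\beta}_1 = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \, \bar{x}
```

Here \hat{\beta}_1 corresponds to slope and \hat{\beta}_0 to intercept, which is why the two numbers match the Estimate column from lm.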

Data:

    library(data.table)
    set.seed(1)
    users <- 5e5
    records <- 30
    x <- runif(users * records)
    dt <- data.table(
      x=x, y=x + runif(users * records) * 4 - 2,
      user.id=sample(users, users * records, replace=TRUE)
    )
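
The same closed-form calculation can also be sketched in base R, without data.table. This is only a minimal illustration, not a benchmarked alternative: the data set is a small made-up one (100 users instead of 500K), and the names n_users, n_rec and res_base are chosen here for the example. It uses ave() to get per-row group means and rowsum() to add things up within each group.

```r
# Small made-up data set (hypothetical sizes, not the 500K-user benchmark)
set.seed(1)
n_users <- 100
n_rec <- 30
user_id <- rep(seq_len(n_users), each = n_rec)
x <- runif(n_users * n_rec)
y <- x + runif(n_users * n_rec) * 4 - 2

# Group means repeated on every row of the group
ux <- ave(x, user_id)
uy <- ave(y, user_id)

# Closed-form slope and intercept per group
slope <- rowsum((x - ux) * (y - uy), user_id)[, 1] /
  rowsum((x - ux)^2, user_id)[, 1]
intercept <- tapply(y, user_id, mean) - slope * tapply(x, user_id, mean)

res_base <- data.frame(
  user_id = as.integer(names(slope)),
  slope = unname(slope),
  intercept = as.vector(intercept)
)

# Cross-check one group against lm()
fit <- lm(y ~ x, subset = user_id == 7)
unname(coef(fit))
unlist(res_base[res_base$user_id == 7, c("intercept", "slope")])
```

The last two lines print the intercept and slope from lm and from the closed-form version for one user, so they can be compared by eye. Whether this is faster or slower than the data.table version on the full data is not something measured here.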
