ruby on rails - Unicorn workers timeout after "zero downtime" deploy with capistrano -
i'm running rails 3.2.21 app , deploy ubuntu 12.04.5 box using capistrano (nginx , unicorn).
i have app set zero-downtime deploy (at least thought), config files looking more or less these.
here's problem: when deploy done , restarts unicorn, when watch unicorn.log see fire new workers, reap old ones... app hangs 2-3 minutes. request app @ point hits timeout window (which set 40 seconds) , returns app's 500 error page.
here first part of output unicorn.log unicorn restarting (i have 5 unicorn workers):
i, [2015-04-21t23:06:57.022492 #14347] info -- : master process ready i, [2015-04-21t23:06:57.844273 #15378] info -- : worker=0 ready i, [2015-04-21t23:06:57.944080 #15381] info -- : worker=1 ready i, [2015-04-21t23:06:58.089655 #15390] info -- : worker=2 ready i, [2015-04-21t23:06:58.230554 #14541] info -- : reaped #<process::status: pid 15551 exit 0> worker=4 i, [2015-04-21t23:06:58.231455 #14541] info -- : reaped #<process::status: pid 3644 exit 0> worker=0 i, [2015-04-21t23:06:58.249110 #15393] info -- : worker=3 ready i, [2015-04-21t23:06:58.650007 #15396] info -- : worker=4 ready i, [2015-04-21t23:07:01.246981 #14541] info -- : reaped #<process::status: pid 32645 exit 0> worker=1 i, [2015-04-21t23:07:01.561786 #14541] info -- : reaped #<process::status: pid 15534 exit 0> worker=2 i, [2015-04-21t23:07:06.657913 #14541] info -- : reaped #<process::status: pid 16821 exit 0> worker=3 i, [2015-04-21t23:07:06.658325 #14541] info -- : master complete
afterwards, app hangs 2-3 minutes, here what's happening:
e, [2015-04-21t23:07:38.069635 #14347] error -- : worker=0 pid:15378 timeout (41s > 40s), killing e, [2015-04-21t23:07:38.243005 #14347] error -- : reaped #<process::status: pid 15378 sigkill (signal 9)> worker=0 e, [2015-04-21t23:07:39.647717 #14347] error -- : worker=3 pid:15393 timeout (41s > 40s), killing e, [2015-04-21t23:07:39.890543 #14347] error -- : reaped #<process::status: pid 15393 sigkill (signal 9)> worker=3 i, [2015-04-21t23:07:40.727755 #16002] info -- : worker=0 ready i, [2015-04-21t23:07:43.212395 #16022] info -- : worker=3 ready e, [2015-04-21t23:08:24.511967 #14347] error -- : worker=3 pid:16022 timeout (41s > 40s), killing e, [2015-04-21t23:08:24.718512 #14347] error -- : reaped #<process::status: pid 16022 sigkill (signal 9)> worker=3 i, [2015-04-21t23:08:28.010429 #16234] info -- : worker=3 ready
eventually, after 2 or 3 minutes, app starts being responsive again, more sluggish. can see in new relic (the horizontal line marks deploy, , light blue area indicates ruby):
i have identical staging server, , cannot replicate issue in staging... granted, staging under no load (i'm person trying make page requests).
here config/unicorn.rb file:
root = "/home/deployer/apps/myawesomeapp/current" working_directory root pid "#{root}/tmp/pids/unicorn.pid" stderr_path "#{root}/log/unicorn.log" stdout_path "#{root}/log/unicorn.log" shared_path = "/home/deployer/apps/myawesomeapp/shared" listen "/tmp/unicorn.myawesomeapp.sock" worker_processes 5 timeout 40 preload_app true before_exec |server| env['bundle_gemfile'] = "#{root}/gemfile" end before_fork |server, worker| if defined?(activerecord::base) activerecord::base.connection.disconnect! end old_pid = "#{root}/tmp/pids/unicorn.pid.oldbin" if file.exists?(old_pid) && server.pid != old_pid begin process.kill("quit", file.read(old_pid).to_i) rescue errno::enoent, errno::esrch end end end after_fork |server, worker| if defined?(activerecord::base) activerecord::base.establish_connection end end
and paint complete picture, in capistrano deploy.rb, unicorn restart task looks this:
namespace :deploy task :restart, roles: :app, except: { no_release: true } run "kill -s usr2 `cat #{release_path}/tmp/pids/unicorn.pid`" end end
any ideas why unicorn workers timeout right after deploy? thought point of zero-downtime keep old ones around until new ones spun , ready serve?
thanks!
update
i did deploy, , time kept eye on production.log see going on there. suspicious thing following lines, mixed in normal requests:
dalli/sasl authenticating 7510de dalli/sasl: 7510de dalli/sasl authenticating 7510de dalli/sasl: 7510de dalli/sasl authenticating 7510de dalli/sasl: 7510de
update #2
as suggested of answers below, changed before_fork
block add sig = (worker.nr + 1) >= server.worker_processes ? :quit : :ttou
workers incrementally killed off.
same result, terribly slow deploy, same spike illustrated in graph above. context, out of 5 worker processes, first 4 sent ttou signal, , 5th sent quit. still, not seem have made difference.
i came across similar problem while trying set rails/nginx/unicorn on digital ocean. able zero-downtime deploys work after tweaking few things. here few things try:
- reduce number of worker process.
- increase memory of server. getting timeouts on 512mb ram droplet. seemed fix issue when increased 1gb.
- use "capistrano3-unicorn" gem.
- if preload_app true, use restart (usr2). if false, use reload (hup).
- ensure "tmp/pids" in set linked_dirs in deploy.rb.
- use
px aux | grep unicorn
make sure old processes being removed. - use
kill [pid]
manually stop unicorn processes still running.
here's unicorn config reference:
working_directory '/var/www/yourapp/current' pid '/var/www/yourapp/current/tmp/pids/unicorn.pid' stderr_path '/var/www/yourapp/log/unicorn.log' stdout_path '/var/www/yourapp/log/unicorn.log' listen '/tmp/unicorn.yourapp.sock' worker_processes 2 timeout 30 preload_app true before_fork |server, worker| old_pid = "/var/www/yourapp/current/tmp/pids/unicorn.pid.oldbin" if old_pid != server.pid begin sig = (worker.nr + 1) >= server.worker_processes ? :quit : :ttou process.kill(sig, file.read(old_pid).to_i) rescue errno::enoent, errno::esrch end end end
deploy.rb
lock '3.4.0' set :application, 'yourapp' set :repo_url, 'git@bitbucket.org:username/yourapp.git' set :deploy_to, '/var/www/yourapp' set :linked_files, fetch(:linked_files, []).push('config/database.yml', 'config/secrets.yml', 'config/application.yml') set :linked_dirs, fetch(:linked_dirs, []).push('log', 'tmp/pids', 'tmp/cache', 'tmp/sockets', 'vendor/bundle', 'public/system') set :format, :pretty set :log_level, :info set :rbenv_ruby, '2.1.3' namespace :deploy after :restart, :clear_cache on roles(:web), in: :groups, limit: 3, wait: 10 end end end after 'deploy:publishing', 'deploy:restart' namespace :deploy task :restart #invoke 'unicorn:reload' invoke 'unicorn:restart' end end
Comments
Post a Comment