sql - PostgreSQL concurrent transaction issues
I'm building a crawler in which multiple crawling workers access the same PostgreSQL database. Sadly, I'm encountering issues with the main transaction presented here:
    BEGIN ISOLATION LEVEL SERIALIZABLE;

    UPDATE webpages
    SET    locked = TRUE
    WHERE  url IN (
       SELECT DISTINCT ON (source) url
       FROM   webpages
       WHERE  (last IS NULL OR last < refreshFrequency)
       AND    locked = FALSE
       LIMIT  limit
       )
    RETURNING *;

    COMMIT;

The columns of the webpages table:
- url: the URL of the page (string)
- source: the domain name (string)
- last: the last time the page was crawled (date)
- locked: boolean set to indicate that the webpage is being crawled (boolean)
I tried 2 different transaction isolation levels:
- ISOLATION LEVEL SERIALIZABLE: errors out with "could not serialize access due to concurrent update".
- ISOLATION LEVEL READ COMMITTED: concurrent transactions pick duplicate URLs, due to the data being "frozen" from the time the transaction was first committed (I think).
I'm new to PostgreSQL and SQL in general, so I'm not sure how to fix this issue.
Update:
The PostgreSQL version is 9.2.x.
The webpages table definition:
    CREATE TABLE webpages (
       last   timestamp with time zone,
       locked boolean DEFAULT false,
       url    text NOT NULL,
       source character varying(255) PRIMARY KEY
    );
Clarification
The question leaves room for interpretation. This is how I understand the task:
Lock a maximum of limit URLs that fulfill the criteria and are not locked, yet. To spread out the load on sources, every URL should come from a different source.
DB design
Assuming a separate table source: that makes the job faster and easier. If you don't have such a table, create it, it's the proper design anyway:
    CREATE TABLE source (
       source_id serial NOT NULL PRIMARY KEY
     , source    text NOT NULL
    );

    CREATE TABLE webpage (
       source_id int  NOT NULL REFERENCES source
     , url       text NOT NULL PRIMARY KEY
     , locked    boolean NOT NULL DEFAULT false          -- may not be needed
     , last      timestamp NOT NULL DEFAULT '-infinity'  -- makes the query simpler
    );

Alternatively, you can use a recursive CTE efficiently; see the recursive alternative further down.
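If you already have data in the original webpages table, a migration along these lines could populate the new design. This is only a minimal sketch assuming the table and column names defined above; COALESCE covers rows where locked or last is NULL:

    -- hypothetical migration from the old "webpages" table into the new design
    INSERT INTO source (source)
    SELECT DISTINCT source
    FROM   webpages;

    INSERT INTO webpage (source_id, url, locked, last)
    SELECT s.source_id
         , w.url
         , COALESCE(w.locked, false)
         , COALESCE(w.last, '-infinity')
    FROM   webpages w
    JOIN   source   s USING (source);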
Basic solution with advisory locks
I am using advisory locks to make this safe and cheap in the default READ COMMITTED isolation level:
    UPDATE webpage w
    SET    locked = TRUE
    FROM  (
       SELECT (SELECT url
               FROM   webpage
               WHERE  source_id = s.source_id
               AND    (last >= refreshFrequency) IS NOT TRUE
               AND    locked = FALSE
               AND    pg_try_advisory_xact_lock(hashtext(url))  -- only true if free
               LIMIT  1                       -- 1 url per source
              ) AS url
       FROM  (
          SELECT source_id                    -- the FK column in webpage
          FROM   source
          ORDER  BY random()
          LIMIT  limit                        -- random selection of "limit" sources
          ) s
       FOR    UPDATE
       ) l
    WHERE  w.url = l.url
    RETURNING *;

(Advisory lock functions take a bigint key, so the text url is hashed with hashtext() here.)

Alternatively, you could work with advisory locks only and not use the table column locked at all: just run the SELECT statement on its own. The locks are kept until the end of the transaction. You can use pg_try_advisory_lock() instead to keep the locks until the end of the session. Then UPDATE once to set last when you are done (and possibly release the advisory lock).
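A minimal sketch of that locks-only alternative, assuming the schema above; pg_try_advisory_lock() and pg_advisory_unlock() are the standard session-level advisory lock functions, while refreshFrequency, limit and the example URL are placeholders:

    -- claim URLs with session-level advisory locks only (no "locked" column needed)
    SELECT (SELECT url
            FROM   webpage
            WHERE  source_id = s.source_id
            AND    (last >= refreshFrequency) IS NOT TRUE
            AND    pg_try_advisory_lock(hashtext(url))  -- held until released or session ends
            LIMIT  1
           ) AS url
    FROM  (
       SELECT source_id
       FROM   source
       ORDER  BY random()
       LIMIT  limit
       ) s;

    -- when a page has been crawled, record it and release its lock
    UPDATE webpage
    SET    last = now()
    WHERE  url  = 'http://example.com/page';

    SELECT pg_advisory_unlock(hashtext('http://example.com/page'));

Session-level locks also disappear when a worker's session ends, so a crashed worker does not leave URLs locked forever.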
Other major points
- In Postgres 9.3 or later you would use a LATERAL join instead of the correlated subquery; see the sketch after this list.
- I chose pg_try_advisory_xact_lock() because the lock can (and should) be released at the end of the transaction. See the links under "Related" below for a detailed explanation of advisory locks.
- You get fewer than limit rows if some sources have no more URLs to crawl.
- The random selection of sources is a wild but educated guess, since that information is not available. If your source table is big, there are faster ways to pick a random selection.
- refreshFrequency should really be called latest_last, since it is not a "frequency", but a timestamp or date.
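Here is the LATERAL sketch referenced in the first point, for Postgres 9.3 or later, under the same assumptions as the query above (refreshFrequency and limit are placeholders, and the text url is hashed for the advisory lock):

    UPDATE webpage w
    SET    locked = TRUE
    FROM  (
       SELECT p.url
       FROM  (
          SELECT source_id
          FROM   source
          ORDER  BY random()
          LIMIT  limit                         -- random selection of "limit" sources
          ) s
       CROSS  JOIN LATERAL (
          SELECT url
          FROM   webpage
          WHERE  source_id = s.source_id
          AND    (last >= refreshFrequency) IS NOT TRUE
          AND    locked = FALSE
          AND    pg_try_advisory_xact_lock(hashtext(url))  -- only true if free
          LIMIT  1                             -- 1 url per source
          ) p
       ) l
    WHERE  w.url = l.url
    RETURNING *;

CROSS JOIN LATERAL simply drops sources for which the inner SELECT finds nothing, which is fine here because such rows could not match any url anyway.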
Recursive alternative
To get the full limit number of rows if available, use a recursive CTE and iterate over sources until you have found enough or no more can be found.
As mentioned above, you may not need the column locked at all and could operate with advisory locks only (cheaper). Just set last at the end of the transaction, before you start the next round.
    WITH RECURSIVE s AS (
       SELECT source_id, row_number() OVER (ORDER BY random())::int AS rn  -- cast to match init row
       FROM   source              -- you might exclude "empty" sources ...
       )
    , page(source_id, rn, ct, url) AS (
       SELECT 0, 0, 0, ''::text   -- dummy init row
       UNION
       SELECT s.source_id, s.rn
            , CASE WHEN p.url <> '' THEN p.ct + 1 ELSE p.ct END  -- inc. if url found last round
            , (SELECT url
               FROM   webpage
               WHERE  source_id = s.source_id
               AND    (last >= refreshFrequency) IS NOT TRUE
               AND    locked = FALSE                              -- may not be needed
               AND    pg_try_advisory_xact_lock(hashtext(url))    -- only true if free
               LIMIT  1                                           -- 1 url per source
              ) AS url             -- try, may come up empty
       FROM   page p
       JOIN   s ON s.rn = p.rn + 1
       WHERE  CASE WHEN p.url <> '' THEN p.ct + 1 ELSE p.ct END < limit  -- limit here
       )
    SELECT url
    FROM   page
    WHERE  url <> '';              -- excludes '' and NULL

Alternatively, if you need to manage locked, too, use this query with the UPDATE from above.
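A hedged sketch of that combination, attaching the recursive CTE to the UPDATE (same placeholders and assumptions as above; the CTE body is repeated unchanged):

    WITH RECURSIVE s AS (
       SELECT source_id, row_number() OVER (ORDER BY random())::int AS rn
       FROM   source
       )
    , page(source_id, rn, ct, url) AS (
       SELECT 0, 0, 0, ''::text
       UNION
       SELECT s.source_id, s.rn
            , CASE WHEN p.url <> '' THEN p.ct + 1 ELSE p.ct END
            , (SELECT url
               FROM   webpage
               WHERE  source_id = s.source_id
               AND    (last >= refreshFrequency) IS NOT TRUE
               AND    locked = FALSE
               AND    pg_try_advisory_xact_lock(hashtext(url))
               LIMIT  1
              ) AS url
       FROM   page p
       JOIN   s ON s.rn = p.rn + 1
       WHERE  CASE WHEN p.url <> '' THEN p.ct + 1 ELSE p.ct END < limit
       )
    UPDATE webpage w
    SET    locked = TRUE
    FROM   page l
    WHERE  w.url = l.url
    AND    l.url <> ''
    RETURNING w.*;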
Further reading
You are going to love SKIP LOCKED in the upcoming Postgres 9.5:
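With SKIP LOCKED, a worker query could be as simple as this sketch (limit and refreshFrequency remain placeholders; note that it does not spread URLs across sources, so the per-source logic from above would still be needed for that):

    UPDATE webpage w
    SET    locked = TRUE
    FROM  (
       SELECT url
       FROM   webpage
       WHERE  (last >= refreshFrequency) IS NOT TRUE
       AND    locked = FALSE
       LIMIT  limit
       FOR    UPDATE SKIP LOCKED   -- skip rows other workers have already locked
       ) l
    WHERE  w.url = l.url
    RETURNING w.*;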
related: