PostgreSQL concurrent transaction issues
I'm building a crawler. Multiple crawling workers access the same PostgreSQL database. Sadly, I'm encountering issues with the main transaction presented here:
BEGIN ISOLATION LEVEL SERIALIZABLE;

UPDATE webpages
SET    locked = TRUE
WHERE  url IN (
   SELECT DISTINCT ON (source) url
   FROM   webpages
   WHERE  (last IS NULL OR last < refreshFrequency)
   AND    locked = FALSE
   LIMIT  limit
   )
RETURNING *;

COMMIT;
The relevant columns:

url: the url (string)
source: the domain name (string)
last: the last time a page was crawled (date)
locked: a boolean set to indicate that a webpage is being crawled (boolean)
I tried two different transaction isolation levels:
- ISOLATION LEVEL SERIALIZABLE, which fails with errors: could not serialize access due to concurrent update
- ISOLATION LEVEL READ COMMITTED, which produces duplicate urls in concurrent transactions, due to the data being "frozen" at the time the transaction first committed (I think); the sketch below illustrates this
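To make the READ COMMITTED failure mode concrete, here is a sketch of the interleaving (my illustration, not part of the original post):

-- Both workers run the UPDATE from above concurrently under READ COMMITTED:
-- 1. Worker A's inner SELECT sees locked = false for url 'x' and picks it;
--    A's UPDATE row-locks 'x' and sets locked = true (not yet committed).
-- 2. Worker B's inner SELECT runs before A commits, so it also sees
--    locked = false for 'x' and picks it; B's UPDATE then blocks on A's row lock.
-- 3. A commits. B's UPDATE re-checks the updated row, but its IN (...) list was
--    built from B's earlier snapshot and still contains 'x', so B updates it
--    again and also returns 'x': both workers crawl the same url.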
I'm new to PostgreSQL and SQL in general, so I'm not sure how to fix this issue.
UPDATE:

PostgreSQL version is 9.2.x.
webpages table definition:

CREATE TABLE webpages (
   last   timestamp with time zone,
   locked boolean DEFAULT false,
   url    text NOT NULL,
   source character varying(255) PRIMARY KEY
);
Clarification

The question leaves room for interpretation. This is how I understand the task: lock a maximum of limit urls that fulfill the criteria and are not locked, yet. To spread out the load on sources, every url should come from a different source.
DB design

Assuming a separate table source: it makes the job faster and easier. If you don't have such a table, create it; it's proper design anyway:
CREATE TABLE source (
   source_id serial NOT NULL PRIMARY KEY,
   source    text NOT NULL
);

CREATE TABLE webpage (
   source_id int NOT NULL REFERENCES source,
   url       text NOT NULL PRIMARY KEY,
   locked    boolean NOT NULL DEFAULT false,          -- may not be needed
   last      timestamp NOT NULL DEFAULT '-infinity'   -- makes the query simpler
);
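If you start from the existing webpages table, a one-off migration might look like this (my sketch, assuming the old table from the question):

INSERT INTO source (source)
SELECT DISTINCT source
FROM   webpages;

INSERT INTO webpage (source_id, url, locked, last)
SELECT s.source_id, w.url
     , COALESCE(w.locked, false)          -- old column was nullable
     , COALESCE(w.last, '-infinity')      -- new column is NOT NULL
FROM   webpages w
JOIN   source s USING (source);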
Alternatively, you can use a recursive CTE efficiently (see the recursive alternative below).
Basic solution with advisory locks

I am using advisory locks to make this safe and cheap in the default READ COMMITTED isolation level:
UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT (SELECT url
           FROM   webpage
           WHERE  source_id = s.source_id
           AND    (last >= refreshFrequency) IS NOT TRUE
           AND    locked = FALSE
           AND    pg_try_advisory_xact_lock(hashtext(url))  -- only TRUE is free; url is text, so hash it
           LIMIT  1  -- 1 url per source
          ) AS url
   FROM  (
      SELECT source_id  -- the FK column in webpage
      FROM   source
      ORDER  BY random()
      LIMIT  limit      -- random selection of "limit" sources
      ) s
   FOR    UPDATE        -- lock the selected source rows
   ) l
WHERE  w.url = l.url
RETURNING *;
Alternatively, you could work with advisory locks only and not use the table column locked at all: just run the SELECT statement. The locks are kept until the end of the transaction. You can use pg_try_advisory_lock() instead to keep locks until the end of the session. Then UPDATE only once to set last when done (and possibly release the advisory lock).
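Here is a minimal sketch of that lock-only variant (my reading of the workflow; the hashtext() call and the parameterized final UPDATE are my assumptions):

-- 1. Claim urls with session-level advisory locks, without touching "locked":
SELECT (SELECT url
        FROM   webpage
        WHERE  source_id = s.source_id
        AND    (last >= refreshFrequency) IS NOT TRUE
        AND    pg_try_advisory_lock(hashtext(url))  -- kept until end of session
        LIMIT  1
       ) AS url
FROM  (
   SELECT source_id
   FROM   source
   ORDER  BY random()
   LIMIT  limit
   ) s;

-- 2. After crawling a page, record it and release its lock:
UPDATE webpage
SET    last = now()
WHERE  url = $1;

SELECT pg_advisory_unlock(hashtext($1));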
Other major points

- In Postgres 9.3 or later you would use a LATERAL join instead of the correlated subquery (see the sketch after this list).
- I chose pg_try_advisory_xact_lock() because the lock can (and should) be released at the end of the transaction. Detailed explanation for advisory locks:
- You get fewer than limit rows if some sources have no more urls to crawl.
- The random selection of sources is a wild but educated guess, since that information is not available. If your source table is big, there are faster ways:
- refreshFrequency should really be called latest_last, since it's not a "frequency", but a timestamp or date.
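For illustration, a sketch of the same query rewritten with a LATERAL join (my adaptation for Postgres 9.3+, not from the original answer):

UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT x.url
   FROM  (
      SELECT source_id
      FROM   source
      ORDER  BY random()
      LIMIT  limit
      ) s
   CROSS  JOIN LATERAL (
      SELECT url
      FROM   webpage
      WHERE  source_id = s.source_id
      AND    (last >= refreshFrequency) IS NOT TRUE
      AND    locked = FALSE
      AND    pg_try_advisory_xact_lock(hashtext(url))  -- only TRUE is free
      LIMIT  1  -- 1 url per source
      ) x
   ) l
WHERE  w.url = l.url
RETURNING *;

Note that CROSS JOIN LATERAL simply drops sources with no url left, where the correlated subquery would produce a NULL.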
Recursive alternative

To get the full limit number of rows if available, use a recursive CTE and iterate over sources until you have found enough urls or no more can be found.

As mentioned above, you may not need the column locked at all and could operate with advisory locks only (cheaper). Just set last at the end of the transaction, before you start the next round.
WITH RECURSIVE s AS (
   SELECT source_id, row_number() OVER (ORDER BY random()) AS rn
   FROM   source  -- you might exclude "empty" sources early ...
   )
, page(source_id, rn, ct, url) AS (
   SELECT 0, 0::bigint, 0, ''::text  -- dummy init row
   UNION ALL
   SELECT s.source_id, s.rn
        , CASE WHEN p.url <> '' THEN p.ct + 1 ELSE p.ct END  -- inc. if a url was found last round
        , (SELECT url
           FROM   webpage
           WHERE  source_id = s.source_id
           AND    (last >= refreshFrequency) IS NOT TRUE
           AND    locked = FALSE  -- may not be needed
           AND    pg_try_advisory_xact_lock(hashtext(url))  -- only TRUE is free
           LIMIT  1  -- 1 url per source
          ) AS url  -- try, may come up empty
   FROM   page p
   JOIN   s ON s.rn = p.rn + 1
   WHERE  CASE WHEN p.url <> '' THEN p.ct + 1 ELSE p.ct END < limit  -- your limit here
   )
SELECT url
FROM   page
WHERE  url <> '';  -- excludes '' and NULL
Alternatively, if you need to manage locked, too, use this query with the UPDATE from above, as sketched below.
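A sketch of that combination (my assembly of the pieces above, under the same assumptions; the CTEs are repeated unchanged so the statement is self-contained):

WITH RECURSIVE s AS (
   SELECT source_id, row_number() OVER (ORDER BY random()) AS rn
   FROM   source
   )
, page(source_id, rn, ct, url) AS (
   SELECT 0, 0::bigint, 0, ''::text
   UNION ALL
   SELECT s.source_id, s.rn
        , CASE WHEN p.url <> '' THEN p.ct + 1 ELSE p.ct END
        , (SELECT url
           FROM   webpage
           WHERE  source_id = s.source_id
           AND    (last >= refreshFrequency) IS NOT TRUE
           AND    locked = FALSE
           AND    pg_try_advisory_xact_lock(hashtext(url))
           LIMIT  1)
   FROM   page p
   JOIN   s ON s.rn = p.rn + 1
   WHERE  CASE WHEN p.url <> '' THEN p.ct + 1 ELSE p.ct END < limit
   )
UPDATE webpage w
SET    locked = TRUE
FROM   page
WHERE  w.url = page.url
AND    page.url <> ''   -- skip the dummy row and empty results
RETURNING *;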
Further reading

You will love SKIP LOCKED in the upcoming Postgres 9.5:
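A minimal sketch of what that will allow (my illustration; it drops the one-url-per-source spread for brevity):

UPDATE webpage w
SET    locked = TRUE
FROM  (
   SELECT url
   FROM   webpage
   WHERE  (last >= refreshFrequency) IS NOT TRUE
   AND    locked = FALSE
   LIMIT  limit
   FOR    UPDATE SKIP LOCKED  -- rows locked by other workers are skipped, not waited on
   ) l
WHERE  w.url = l.url
RETURNING *;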
Related: