sql - PostgreSQL concurrent transaction issues -

- August 15, 2012

i'm building crawler. multiple crawling workers access same postgresql database. sadly i'm encountering issues main transaction presented here:

begin isolation level serializable;     update webpages     set locked = true     url in          (             select distinct on (source) url             webpages                             (                     last null                     or                     last < refreshfrequency                 )                 ,                 locked = false             limit limit         )     returning *; commit;

url url (string)
source domain name (string)
last last time page crawled (date)
locked boolean set indicate webpage being crawled (boolean)

i tried 2 different transaction isolation levels:

isolation level serializable, errors could not serialize access due concurrent update
isolation level read committed, duplicate urls concurrent transactions due data being "frozen" time transaction first committed (i think)

i'm new postgresql , sql in general i'm not sure fix issue.

update:
postgresql version 9.2.x.
webpage table definition:

create table webpages (   last timestamp time zone,   locked boolean default false,   url text not null,   source character varying(255) primary key );

clarification

the question leaves room interpretation. how understand task:

lock maximum of limit urls fulfill criteria , not locked, yet. spread out load on sources, every url should come different source.

db design

assuming separate table source: makes job faster , easier. if don't have such table, create it, it's proper design anyway:

create table source (   source_id serial not null primary key , source    text not null );  create table webpage (   source_id int not null references source   url       text not null primary key   locked    boolean not null default false,        -- may not needed   last      timestamp not null default '-infinity' -- makes query simpler );

alternatively can use recursive cte efficiently:

optimize group query retrieve latest record per user

basic solution advisory locks

i using advisory locks make safe , cheap in default read committed isolation level:

update webpage w set    locked = true  (    select (select url              webpage             source_id = s.source_id            ,   (last >= refreshfrequency) not true            ,    locked = false            ,    pg_try_advisory_xact_lock(url)  -- true free            limit  1     -- 1 url per source           ) url     (       select source_id  -- fk column in webpage         source       order  random()       limit  limit      --  random selection of "limit" sources       ) s       update    ) l  w.url = l.url returning *;

alternatively, work only advisory locks , not use table column locked @ all. run the select statement. locks kept until end of transaction. can use pg_try_advisory_lock() instead keep locks till end of session. update once set last when done (and possible release advisory lock).

other major points

in postgres 9.3 or later use lateral join instead of correlated subquery.
i chose pg_try_advisory_xact_lock() because lock can (and should) released @ end of transaction. detailed explanation advisory locks:
- postgres update ... limit 1
you less limit rows if sources have no more url crawl.
the random selection of sources wild educated guess, since information not available. if source table big, there faster ways:
- best way select random rows postgresql
refreshfrequency should called lastest_last, since it's not "frequency", timestamp or date.

recursive alternatve

to full limit number of rows if available, use recursive cte , iterate sources until found enough or no more can found.

as mentioned above, may not need column locked @ , operate advisory locks (cheaper). set last @ end of transaction, before start next round.

with recursive s (    select source_id, row_number() on (order random()) rn    source  -- might exclude "empty" sources ...    ) , page(source_id, rn, ct, url) (    select 0, 0, 0, ''::text   -- dummy init row    union    select s.source_id, s.rn         , case when t.url <> ''                p.ct + 1                else p.ct end  -- inc. if url found last round         , (select url              webpage             source_id = t.source_id            ,   (last >= refreshfrequency) not true            ,    locked = false  -- may not needed            ,    pg_try_advisory_xact_lock(url)  -- true free            limit  1           -- 1 url per source           ) url            -- try, may come empty      page p    join   s on s.rn = p.rn + 1     case when p.url <> ''                p.ct + 1                else p.ct end < limit  -- limit here    ) select url   page  url <> '';             -- exclude '' , null

alternatively, if need manage locked, too, use query above update.

Search This Blog

Shefl