How to add a URL suffix before performing a callback in Scrapy

- June 15, 2011

I have a crawler that works fine at collecting the URLs I am interested in. However, before retrieving the content of these URLs (i.e. the ones that satisfy rule no. 3), I would like to update them, i.e. add a suffix, say '/fullspecs', on the right-hand side. That means that, in fact, I would like to retrieve and further process (through the callback function) only the updated ones. How can I do that?

    rules = (
        Rule(LinkExtractor(allow=('something1'))),
        Rule(LinkExtractor(allow=('something2'))),
        Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')),
             callback='parse_archive'),
    )

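You can set the link extractor's process_value parameter to lambda x: x + '/fullspecs', or to a named function if you want something more complex. Note that process_value is an argument of the link extractor, not of Rule, so it goes inside the LinkExtractor(...) call.

You'd end up with:

    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                       process_value=lambda x: x + '/fullspecs'),
         callback='parse_archive')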
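For the "more complex" case, a named function can replace the lambda. The following is only a minimal sketch: the name add_fullspecs and the constant SUFFIX are hypothetical, and it assumes you want the suffix appended to the path (before any query string), without doubling it when a link already ends in '/fullspecs'.

```python
from urllib.parse import urlsplit, urlunsplit

SUFFIX = "/fullspecs"  # hypothetical constant; the suffix from the question


def add_fullspecs(url):
    """Append SUFFIX to the path part of url, leaving query and fragment intact."""
    parts = urlsplit(url)
    # Drop a trailing slash so we never produce '//fullspecs'.
    path = parts.path.rstrip("/")
    # Leave links that already carry the suffix unchanged.
    if not path.endswith(SUFFIX):
        path += SUFFIX
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

It would then be passed to the link extractor in place of the lambda, e.g. LinkExtractor(allow=('something3'), deny=('something4', 'something5'), process_value=add_fullspecs). Because it only touches the path, it should also work on relative links such as '/phones/x'.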

See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor
