How to add a URL suffix before performing a callback in Scrapy
I have a crawler that works fine at collecting the URLs I am interested in. However, before retrieving the content of these URLs (i.e. the ones that satisfy rule no. 3), I would like to update them, i.e. add a suffix - '/fullspecs' - on the right-hand side. That means that, in fact, I would like to retrieve and further process - through the callback function - only the updated ones. How can I do that?
rules = (
    Rule(LinkExtractor(allow=('something1'))),
    Rule(LinkExtractor(allow=('something2'))),
    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')), callback='parse_archive'),
)
You can set the link extractor's process_value parameter to lambda x: x + '/fullspecs', or to a function if you want something more complex.
You'd end up with:
Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'), process_value=lambda x: x + '/fullspecs'), callback='parse_archive')
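If the lambda is too blunt (for example, when some links carry a trailing slash or a query string), the "function" variant of process_value could look roughly like the sketch below. The name add_fullspecs and the example.com URLs are made up for illustration, and the trailing-slash and query-string handling is an assumption about what "more complex" might mean here, not something the question asks for.

```python
from urllib.parse import urlsplit, urlunsplit

SUFFIX = "/fullspecs"


def add_fullspecs(value):
    """Append SUFFIX to the path part of an extracted link value.

    Hypothetical helper: strips a trailing slash first, keeps any query
    string or fragment in place, and leaves already-suffixed links alone.
    """
    parts = urlsplit(value)
    path = parts.path.rstrip("/")
    if path.endswith(SUFFIX):
        return value
    return urlunsplit(parts._replace(path=path + SUFFIX))


# Assumed wiring into the third rule (needs Scrapy, so left as a comment):
# Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
#                    process_value=add_fullspecs),
#      callback='parse_archive')

if __name__ == "__main__":
    print(add_fullspecs("http://example.com/phones/x100/?ref=list"))
```

Because it is a plain function, it can be checked on a few sample links before it is plugged into the spider.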
See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor