How to add a URL suffix before performing a callback in Scrapy


I have a crawler that works fine at collecting the URLs I am interested in. However, before retrieving the content of these URLs (i.e. the ones that satisfy rule no. 3), I would like to update them, i.e. add a suffix, '/fullspecs', on the right-hand side. This means that, in fact, I would like to retrieve and further process, through the callback function, only the updated ones. How can I do that?

    rules = (
        Rule(LinkExtractor(allow=('something1'))),
        Rule(LinkExtractor(allow=('something2'))),
        Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5')),
             callback='parse_archive'),
    )

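You can set the process_value parameter to lambda x: x + '/fullspecs', or to a function if you want to do something more complex. Note that process_value is an argument of the link extractor, not of Rule, so it goes inside the LinkExtractor(...) call.

You'd end up with:

    Rule(LinkExtractor(allow=('something3'), deny=('something4', 'something5'),
                       process_value=lambda x: x + '/fullspecs'),
         callback='parse_archive')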
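If the lambda turns out to be too blunt (for instance, when some links end in a slash or carry a query string), a named function may be easier to reason about. The following is only a sketch: the name add_fullspecs and the choice to append the suffix to the path (rather than to the end of the whole string) are assumptions, not something the question specifies.

```python
from urllib.parse import urlsplit, urlunsplit

SUFFIX = "/fullspecs"


def add_fullspecs(url):
    """Return url with '/fullspecs' appended to its path, at most once."""
    parts = urlsplit(url)
    # Drop trailing slashes so we do not produce '//fullspecs'.
    path = parts.path.rstrip("/")
    # Leave URLs that already carry the suffix unchanged.
    if not path.endswith(SUFFIX):
        path += SUFFIX
    # Rebuild the URL, keeping any query string or fragment in place.
    return urlunsplit(parts._replace(path=path))
```

It would then be passed the same way as the lambda, i.e. process_value=add_fullspecs inside the LinkExtractor(...) call.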

See more at: http://doc.scrapy.org/en/latest/topics/link-extractors.html#basesgmllinkextractor

