Python Web Scraping Cookbook by Michael Heydt

Author: Michael Heydt
Language: eng
Format: epub
Tags: COMPUTERS / Programming Languages / Python, COMPUTERS / Data Processing, COMPUTERS / Web / Search Engines
Publisher: Packt
Published: 2018-02-16T12:16:06+00:00


Controlling the length of a crawl

The length of a crawl, in terms of the number of pages that will be parsed, can be controlled with the CLOSESPIDER_PAGECOUNT setting, which is provided by Scrapy's CloseSpider extension.

How to do it

We will be using the script in 06/07_limit_length.py. The script and scraper are the same as the NASA sitemap crawler with the addition of the following configuration to limit the number of pages parsed to 5:

if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'CLOSESPIDER_PAGECOUNT': 5
    })
    process.crawl(Spider)
    process.start()

When this is run, the following output will be generated (interspersed in the logging output). Note that more than five responses appear: CLOSESPIDER_PAGECOUNT closes the spider gracefully, so requests that are already in flight when the limit is hit are still allowed to complete.

<200 https://www.nasa.gov/exploration/systems/sls/multimedia/sls-hardware-being-moved-on-kamag-transporter.html>
<200 https://www.nasa.gov/exploration/systems/sls/M17-057.html>
<200 https://www.nasa.gov/press-release/nasa-awards-contract-for-center-protective-services-for-glenn-research-center/>
<200 https://www.nasa.gov/centers/marshall/news/news/icymi1708025/>
<200 https://www.nasa.gov/content/oracles-completed-suit-case-flight-series-to-ascension-island/>
<200 https://www.nasa.gov/feature/goddard/2017/asteroid-sample-return-mission-successfully-adjusts-course/>
<200 https://www.nasa.gov/image-feature/jpl/pia21754/juling-crater/>


