Python Web Scraping, Second Edition by Packt Publishing

Author: Packt Publishing
Language: eng
Format: mobi
Publisher: Packt Publishing
Published: 2017-05-29T09:14:54+00:00


def process_queue():
    while len(crawl_queue):
        url = crawl_queue.pop()
        ...

The first change is replacing our Python list with the new Redis-based queue, named RedisQueue. Because this queue handles duplicate URLs internally, the seen variable is no longer required. The loop condition now calls len on the RedisQueue to determine whether any URLs remain in the queue. Further logic changes to handle the depth and seen functionality are shown here:

## inside process_queue
if no_robots or rp.can_fetch(user_agent, url):
    depth = crawl_queue.get_depth(url) or 0
    if depth == max_depth:
        print('Skipping %s due to depth' % url)
        continue
    html = D(url, num_retries=num_retries)
    if not html:
        continue
    if scraper_callback:
        links = scraper_callback(url, html) or []
    else:
        links = []
    # filter for links matching our regular expression
    for link in get_links(html, link_regex) + links:
        if 'http' not in link:
            link = clean_link(url, domain, link)
        crawl_queue.push(link)
        crawl_queue.set_depth(link, depth + 1)

The full code can be seen at http://github.com/kjam/wswp/blob/master/code/chp4/threaded_crawler_with_queue.py.
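To see how a queue that tracks seen URLs and depths changes the crawl loop, it can help to run the same logic without Redis or the network. The following is a minimal sketch under stated assumptions, not the book's implementation: InMemoryQueue is a hypothetical in-memory stand-in that only mirrors the method names used above (push, pop, get_depth, set_depth, and len), LINKS is a made-up link graph that replaces downloading and parsing, and crawl is a hypothetical single-process loop. The sketch chooses first-in, first-out ordering and keeps the first depth recorded for each URL; both are design choices of this example.

```python
from collections import deque


class InMemoryQueue:
    """Hypothetical in-memory stand-in for a de-duplicating crawl queue."""

    def __init__(self):
        self._queue = deque()  # URLs waiting to be crawled
        self._seen = set()     # every URL ever pushed
        self._depth = {}       # URL -> depth at which it was first recorded

    def __len__(self):
        return len(self._queue)

    def push(self, url):
        # Duplicates are dropped here, so the caller needs no 'seen' variable.
        if url not in self._seen:
            self._seen.add(url)
            self._queue.append(url)

    def pop(self):
        # First in, first out (an assumption of this sketch).
        return self._queue.popleft()

    def set_depth(self, url, depth):
        # Keep the first depth recorded for a URL (an assumption of this sketch).
        self._depth.setdefault(url, depth)

    def get_depth(self, url):
        return self._depth.get(url)


# Made-up link graph standing in for downloaded pages and extracted links.
LINKS = {
    'http://example.com/': ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a': ['http://example.com/b', 'http://example.com/c'],
    'http://example.com/b': ['http://example.com/'],
    'http://example.com/c': ['http://example.com/d'],
    'http://example.com/d': [],
}


def crawl(start_url, links_by_page, max_depth):
    """Return the URLs processed, skipping any URL found at max_depth."""
    crawl_queue = InMemoryQueue()
    crawl_queue.push(start_url)
    crawl_queue.set_depth(start_url, 0)
    crawled = []
    while len(crawl_queue):
        url = crawl_queue.pop()
        depth = crawl_queue.get_depth(url) or 0
        if depth == max_depth:
            continue
        crawled.append(url)
        for link in links_by_page.get(url, []):
            crawl_queue.push(link)                  # ignored if already seen
            crawl_queue.set_depth(link, depth + 1)
    return crawled


if __name__ == '__main__':
    for url in crawl('http://example.com/', LINKS, max_depth=2):
        print(url)
```

Because push ignores URLs it has already accepted, the back-link from http://example.com/b to the start page is never queued a second time, and pages first discovered at max_depth are popped but not processed.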

This updated version of the threaded crawler can then be started using multiple processes with this snippet:

import multiprocessing


