Sitemap crawler

For our first simple crawler, we will use the sitemap discovered in the example website's robots.txt to download all the web pages. To parse the sitemap, we will use a simple regular expression to extract URLs within the <loc> tags.

We will need to update our code to handle encoding conversions as our current download function simply returns bytes. Note that a more robust parsing approach called CSS selectors will be introduced in the next chapter. Here is our first example crawler:

import re

def download(url, user_agent='wswp', num_retries=2, charset='utf-8'): 
    print('Downloading:', url) 
    request = urllib.request.Request(url) 
    request.add_header('User-agent', user_agent)
    try: 
        resp = urllib.request.urlopen(request)
        cs = resp.headers.get_content_charset()
        if not cs:
            cs = charset
        html = resp.read().decode(cs)
    except (URLError, HTTPError, ContentTooShortError) as e:
        print('Download error:', e.reason)
        html = None 
        if num_retries > 0: 
            if hasattr(e, 'code') and 500 <= e.code < 600: 
            # recursively retry 5xx HTTP errors 
            return download(url, num_retries - 1) 
    return html

def crawl_sitemap(url): 
    # download the sitemap file 
    sitemap = download(url) 
    # extract the sitemap links 
    links = re.findall('<loc>(.*?)</loc>', sitemap) 
    # download each link 
    for link in links: 
        html = download(link) 
        # scrape html here 
        # ...

Now, we can run the sitemap crawler to download all countries from the example website:

    >>> crawl_sitemap('http://example.webscraping.com/sitemap.xml')
Downloading: http://example.webscraping.com/sitemap.xml
Downloading: http://example.webscraping.com/view/Afghanistan-1
Downloading: http://example.webscraping.com/view/Aland-Islands-2
Downloading: http://example.webscraping.com/view/Albania-3
...

As shown in our download method above, we had to update the character encoding to utilize regular expressions with the website response. The Python read method on the response will return bytes, and the re module expects a string. Our code depends on the website maintainer to include the proper character encoding in the response headers. If the character encoding header is not returned, we default to UTF-8 and hope for the best. Of course, this decoding will throw an error if either the header encoding returned is incorrect or if the encoding is not set and also not UTF-8. There are some more complex ways to guess encoding (see: https://pypi.python.org/pypi/chardet), which are fairly easy to implement.

For now, the Sitemap crawler works as expected. But as discussed earlier, Sitemap files often cannot be relied on to provide links to every web page. In the next section, another simple crawler will be introduced that does not depend on the Sitemap file.

If you don't want to continue the crawl at any time you can hit Ctrl + C or c md + C to exit the Python interpreter or program execution.