Scraping and downloading multiple files from the web with Python

In recent posts, we discussed several methods to scrape and download resources from the web. If you only need a few files, iterating over the list sequentially is fine. Downloading a larger number of files, however, is significantly faster with a concurrent approach. This post presents both approaches and compares their running times.

Recap on how to download remote files

As mentioned, files can be downloaded from the internet with one of these modules: requests, wget or urllib. The first two are external modules that you have to install before using them in your source code, while the last one is built into Python. To use any of them, import it and call the relevant function to establish a connection with the remote server, then request the resource you want to download. Those are the basic steps for downloading remote files.
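As a minimal sketch of those basic steps, the built-in urllib module can fetch a resource and write it to disk in a few lines. The function name `download_with_urllib` and its `timeout` default are illustrative assumptions, not code from the earlier posts.

```python
import shutil
import urllib.request


def download_with_urllib(url, target_path, timeout=30):
    """Fetch `url` and write the response body to `target_path`.

    Hypothetical helper for illustration; the name and the timeout
    default are assumptions, not part of the original posts.
    """
    # urlopen establishes the connection and returns a file-like response
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with open(target_path, "wb") as f:
            # Copy in chunks so large files are not held in memory at once
            shutil.copyfileobj(response, f)
    return target_path
```

The requests-based function shown later in this post follows the same pattern: open the connection, then stream the body into a local file.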

Demo project

Let’s recall the demo project illustrated in the earlier post about parsing an HTML document using XPath with lxml. Please read through the steps there to set up and launch the project. If you are interested in the source code, have a look at commit add5d5f. The outcome is a list of URLs of the images of interest, which serves as input for the download module. At the time of writing, the list has 56 URLs in total but only 19 distinct images. In other words, the 56 URLs are not unique.

python parse-htmldoc-lxml.py
56
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_a4_avant_2_5059.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_a4_avant_2_5059.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/2016_audi_r8_e_tron_13538.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/2016_audi_r8_e_tron_13538.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_e_tron_7_5079.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_e_tron_7_5079.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_r8_13_5078.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_r8_13_5078.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_r8_v12_4_5081.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_r8_v12_4_5081.jpg
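Since the list contains duplicates, it may be worth removing them before downloading so that no image is fetched twice. The sketch below shows one way to do that while keeping the original order. Whether the demo project removes duplicates is not shown in this post, and `unique_urls` is a hypothetical helper name.

```python
def unique_urls(urls):
    """Return the URLs in their original order with duplicates removed."""
    # dict keys keep insertion order (Python 3.7+), so the first
    # occurrence of each URL is kept and later repeats are dropped
    return list(dict.fromkeys(urls))


# Shortened sample of the parser output above
base = "http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/"
urls = [
    base + "audi_a4_avant_2_5059.jpg",
    base + "audi_a4_avant_2_5059.jpg",
    base + "audi_r8_13_5078.jpg",
]
print(len(unique_urls(urls)))  # 2
```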

Implement download function

We implemented the download function based on the idea presented in [2]. It downloads a file from a given URL into the current working directory.

import requests
import os.path


def download_url(_url):
    print("downloading: ", _url)
    # assumes that the last segment after the / represents the file name
    # if url is abc/xyz/file.jpg, the file name will be file.jpg
    file_name_start_pos = _url.rfind("/") + 1
    file_name = _url[file_name_start_pos:]

    r = requests.get(_url, stream=True)
    if r.status_code == requests.codes.ok:
        with open(file_name, 'wb') as f:
            for data in r:
                f.write(data)
    return os.path.exists(file_name)

The function uses the requests module and returns True if the file exists on disk after the call and False otherwise. Note that it therefore also returns True when a file with the same name was already present from an earlier run. The returned value is useful for handling failures in the loops in the following steps.
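The file name extraction above assumes that the URL ends with the file name. As a hedged sketch, the variant below uses only the path component of the URL, so a query string or fragment does not end up in the file name. The helper name `file_name_from_url` and the fallback name `download.bin` are assumptions made for illustration.

```python
import posixpath
from urllib.parse import unquote, urlparse


def file_name_from_url(url, default="download.bin"):
    """Derive a local file name from the last path segment of `url`."""
    # Only look at the path so "?query" and "#fragment" parts are ignored
    path = urlparse(url).path
    # URL paths always use "/", so posixpath behaves the same on every platform
    name = posixpath.basename(unquote(path))
    # Fall back to a default when the URL ends with "/" and has no file name
    return name or default
```

For the URLs in this post the result is the same as with `rfind("/")`; the difference only shows up for URLs such as `abc/xyz/file.jpg?size=large`.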

Let’s download these images with the sequential and then the concurrent approach.

With the sequential approach

Below is the code to download all files sequentially.

def run_download_sequential(_urls):
    if _urls:
        for e in _urls:
            if e is not None:
                is_ok = download_url(e)
                if not is_ok:
                    # str.format fills {} placeholders, not %s
                    print('failed to download {}'.format(e))

Running the code on my MacBook Pro (2.9 GHz 6-Core Intel Core i9, 16 GB 2400 MHz DDR4), three runs took 68.74, 65.27 and 68.02 seconds to download the 19 images.
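The post does not show how the running time was measured. One simple way to take such a measurement, given here as an assumption rather than the method actually used, is to wrap the call in a small timing helper. The name `timed` is hypothetical.

```python
import time


def timed(func, *args, **kwargs):
    """Call `func` and return a (result, elapsed_seconds) tuple."""
    # perf_counter is a monotonic clock intended for measuring durations
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With the function above, a run could be measured as `_, seconds = timed(run_download_sequential, urls)`, where `urls` is the list produced by the demo project.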

With the concurrent approach

The images can be downloaded concurrently with a pool of threads by applying the following code:

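from multiprocessing.pool import ThreadPool


def run_download_parallel(_urls):
    # Run a pool of 5 worker threads; each worker takes the next URL from the list
    with ThreadPool(5) as pool:
        # imap_unordered yields each result as soon as its download finishes
        for r in pool.imap_unordered(download_url, _urls):
            print(r)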

Running the code on the same MacBook Pro (2.9 GHz 6-Core Intel Core i9, 16 GB 2400 MHz DDR4), three runs took 40.31, 37.19 and 32.57 seconds to download the 19 images.
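Threads help here because a download spends most of its time waiting on the network, and Python releases the global interpreter lock during blocking I/O. The sketch below illustrates the same idea with `concurrent.futures.ThreadPoolExecutor` from the standard library, which is a swapped-in alternative to the `ThreadPool` used above. To keep it runnable without network access, `fake_download` is a hypothetical stand-in that only sleeps, and the file names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url, delay=0.05):
    # Stand-in for download_url: sleeping mimics waiting on the network
    time.sleep(delay)
    return url, True


def run_fake_downloads(urls, max_workers=5):
    # Executor.map runs the calls on worker threads and, unlike
    # imap_unordered, returns the results in input order
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_download, urls))


if __name__ == "__main__":
    urls = ["file-%d.jpg" % i for i in range(10)]
    start = time.perf_counter()
    results = run_fake_downloads(urls)
    elapsed = time.perf_counter() - start
    print(len(results), "simulated downloads in", round(elapsed, 2), "s")
```

With five workers, the ten simulated downloads overlap instead of running one after another, which is the same effect the thread pool has on the real downloads.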

Analysis of the results

Across the three runs, the concurrent approach is consistently faster than the sequential one: it averages about 36.7 seconds against about 67.3 seconds, so it needs roughly 45% less time (about 1.8 times faster). The table below summarizes the running times, in seconds, of each approach.

Approach     Test 1 (s)            Test 2 (s)           Test 3 (s)
Sequential   68.74007987976074     65.27087998390198    68.01511979103088
Parallel     40.310486793518066    37.18907308578491    32.56927800178528
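The comparison can be checked with a few lines of arithmetic on the measured values. The sketch below computes the average of each row and the resulting speed-up; the variable names are chosen for illustration.

```python
from statistics import mean

# Running times in seconds, copied from the table above
sequential = [68.74007987976074, 65.27087998390198, 68.01511979103088]
parallel = [40.310486793518066, 37.18907308578491, 32.56927800178528]

avg_sequential = mean(sequential)
avg_parallel = mean(parallel)

# How many times faster the concurrent runs are on average
speedup = avg_sequential / avg_parallel
# Fraction of the sequential running time that is saved
time_saved = 1 - avg_parallel / avg_sequential

print("average sequential: %.2f s" % avg_sequential)
print("average parallel:   %.2f s" % avg_parallel)
print("speed-up:           %.2fx" % speedup)
print("time saved:         %.0f%%" % (time_saved * 100))
```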

Source code

The runnable source code can be found in commit 7bf0128. Comment out the line for the approach you do not want to run before running the script.

References

[1] How can I speed up fetching pages with urllib2 in python? https://stackoverflow.com/questions/3490173/how-can-i-speed-up-fetching-pages-with-urllib2-in-python, accessed on 7.9.2020

[2] How to Download Multiple Files Concurrently in Python, https://www.quickprogrammingtips.com/python/how-to-download-multiple-files-concurrently-in-python.html, accessed on 19.9.2020
