In recent posts, we discussed some methods to scrape and download resources from the web. If you only need a few files, iterating over the list sequentially is perfectly fine. However, downloading a large number of files is significantly faster with a concurrent approach. We present both approaches in the following paragraphs.
Recap on how to download remote files
As we said, there are several ways to download files from the internet using one of these modules: requests, wget or urllib. The first two are external modules you have to install before using them in your source code, while the last is a Python built-in module. Whichever module you choose, the basic steps are the same: import it, call the appropriate method to establish a connection with the remote server, and then fetch the intended resource. Let's take a closer look at the following snippet.
Let’s recall the demo project illustrated in the earlier post about parsing HTML documents using XPath with lxml. Please read through the steps there to set up and launch the project. If you are interested in the source code, have a look at commit add5d5f. The outcome is a list of URLs of the images of interest, which can be fed into the download module. At the time of writing, the list contains 56 URLs in total, but only 19 of them point to distinct images; in other words, the 56 URLs are not unique.
python parse-htmldoc-lxml.py
56
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_a4_avant_2_5059.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_a4_avant_2_5059.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/2016_audi_r8_e_tron_13538.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/2016_audi_r8_e_tron_13538.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_e_tron_7_5079.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_e_tron_7_5079.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_r8_13_5078.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_r8_13_5078.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_r8_v12_4_5081.jpg
http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_r8_v12_4_5081.jpg
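Since many of the scraped URLs are duplicates, it can be worth deduplicating the list before downloading. A minimal sketch (the variable names are my own; in the real project the list comes from the lxml/XPath parsing step):

```python
# Hypothetical subset of the scraped URLs; note the duplicates.
urls = [
    "http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_a4_avant_2_5059.jpg",
    "http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/audi_a4_avant_2_5059.jpg",
    "http://files.all-free-download.com//downloadfiles/wallpapers/1920_1200/2016_audi_r8_e_tron_13538.jpg",
]

# dict.fromkeys keeps insertion order (Python 3.7+) while dropping duplicates
unique_urls = list(dict.fromkeys(urls))
print(len(urls), len(unique_urls))  # 3 2
```

The same one-liner applied to the full list of 56 URLs would leave the 19 distinct images.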
Implement download function
We implemented the download method based on an idea presented in an earlier post: downloading a file from a given URL.
import requests
import os.path

def download_url(_url):
    print("downloading: ", _url)
    # assumes that the last segment after the / represents the file name
    # if url is abc/xyz/file.jpg, the file name will be file.jpg
    file_name_start_pos = _url.rfind("/") + 1
    file_name = _url[file_name_start_pos:]
    r = requests.get(_url, stream=True)
    if r.status_code == requests.codes.ok:
        with open(file_name, 'wb') as f:
            for data in r:
                f.write(data)
    return os.path.exists(file_name)
The method uses the requests module and returns True or False depending on whether the file was downloaded successfully. The return value is useful for handling failures in the loops below.
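To make the file-name extraction inside download_url concrete, here is the same rfind-based slicing in isolation (the URL is just an illustrative example):

```python
# The file name is everything after the last "/" in the URL,
# mirroring the rfind-based slicing inside download_url.
url = "http://example.com/wallpapers/audi_r8_13_5078.jpg"
file_name = url[url.rfind("/") + 1:]
print(file_name)  # audi_r8_13_5078.jpg
```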
Let’s download these images using both the sequential and the concurrent approach.
With the sequential approach
Below is the code to download all files sequentially.
def run_download_sequential(_urls):
    if _urls:
        for e in _urls:
            if e is not None:
                is_ok = download_url(e)
                if not is_ok:
                    print('failed to download {}'.format(e))
Running the code on my MacBook Pro (2.9 GHz 6-Core Intel Core i9, 16 GB 2400 MHz DDR4), three runs took 68.74, 65.27 and 68.02 seconds to download the 19 images.
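The post does not show the measurement code; a minimal sketch of how such timings could be captured (the helper name `timed` is my own) might look like:

```python
import time

def timed(func, urls):
    """Run func(urls) and report the elapsed wall-clock time.

    A sketch of how the timings above could be captured;
    pass run_download_sequential or run_download_parallel as func.
    """
    start = time.perf_counter()
    func(urls)
    elapsed = time.perf_counter() - start
    print("took {:.2f} seconds".format(elapsed))
    return elapsed
```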
With the concurrent approach
Downloading the images in parallel can be done with the following code:
from multiprocessing.pool import ThreadPool

def run_download_parallel(_urls):
    # Run 5 threads; each call takes the next element in the urls list
    results = ThreadPool(5).imap_unordered(download_url, _urls)
    for r in results:
        print(r)
Running the code on the same MacBook Pro (2.9 GHz 6-Core Intel Core i9, 16 GB 2400 MHz DDR4), three runs took 40.31, 37.19 and 32.57 seconds to download the 19 images.
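As an aside, the same thread-pool pattern can be written with the standard library's documented concurrent.futures API instead of multiprocessing.pool.ThreadPool. A hedged sketch (the helper name `run_parallel` is my own; unlike imap_unordered, `map` preserves input order):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(func, items, workers=5):
    # Same idea as ThreadPool(5).imap_unordered(download_url, _urls),
    # using the documented concurrent.futures API.
    # Pass download_url as func in the real project.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))

# usage with a stand-in function:
print(run_parallel(lambda x: x * 2, [1, 2, 3]))  # [2, 4, 6]
```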
Analysis of the results
Across the three runs, the results clearly show that the parallel approach takes only slightly more than half the time of the sequential one (roughly 37 seconds versus 67 on average). The table below summarizes the times captured for each approach.
| Approach   | Test 1 (s) | Test 2 (s) | Test 3 (s) |
|------------|------------|------------|------------|
| Sequential | 68.74      | 65.27      | 68.02      |
| Parallel   | 40.31      | 37.19      | 32.57      |
The runnable source code can be found at commit 7bf0128. Comment out the corresponding line to choose which approach to run before launching the script.
References
1. How can I speed up fetching pages with urllib2 in python?, https://stackoverflow.com/questions/3490173/how-can-i-speed-up-fetching-pages-with-urllib2-in-python, accessed on 7.9.2020
2. How to Download Multiple Files Concurrently in Python, https://www.quickprogrammingtips.com/python/how-to-download-multiple-files-concurrently-in-python.html, accessed on 19.9.2020