This post shows how to download a resource from a given URL using the requests module. Other modules can do the same job, but I focus on requests here and leave the other methods for you to explore. Let’s get started.
Introduction
Below is a simple snippet that downloads the Google logo shown on the Google search page from https://www.google.co.uk/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png
import requests

url = "https://www.google.co.uk/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png"
r = requests.get(url, allow_redirects=True)
# The resource is a PNG image, so save it with a matching extension.
with open("google.png", "wb") as f:
    f.write(r.content)
The file named google.png is saved in the current working directory. Piece of cake, right? In practice, we have to deal with more difficult situations, which I am going to show you now.
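For larger files, it can help to check the HTTP status and write the body in chunks instead of holding it all in memory. Below is a minimal sketch of that idea. The helper name save_response and the default chunk size are my own assumptions, and the response argument is assumed to behave like a requests.Response obtained with stream=True.

```python
def save_response(response, path, chunk_size=8192):
    """Write a streamed HTTP response to `path` and return the bytes written.

    `response` is assumed to behave like a requests.Response fetched with
    stream=True: it must provide raise_for_status() and iter_content().
    """
    # Fail early on 4xx/5xx instead of saving an error page to disk.
    response.raise_for_status()
    written = 0
    with open(path, "wb") as f:
        # iter_content yields the body piece by piece.
        for chunk in response.iter_content(chunk_size=chunk_size):
            f.write(chunk)
            written += len(chunk)
    return written
```

With requests, a call might look like save_response(requests.get(url, stream=True), "google.png").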
Not all URLs point to downloadable resources
In the real world, you will almost certainly run into resources that are protected and not meant to be downloaded. For example, YouTube videos are secured to prevent bulk downloading. People develop browser extensions or standalone applications to download YouTube videos; however, Google detects such violations and keeps tightening the protection of its data. Therefore, it is important to check whether the resource of interest can be downloaded before fetching it. The snippet below shows how to check that using the Content-Type field of the response headers returned for the URL.
import requests

def extract_content_type(_url):
    # Return the Content-Type response header, or None if the server omits it.
    r = requests.get(_url, allow_redirects=True)
    return r.headers.get("Content-Type")

url = "https://www.google.co.uk/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png"
print(extract_content_type(url))
url = "https://www.youtube.com/watch?v=ylk5AYyOcGI"
print(extract_content_type(url))
The output of the script above looks like:
image/png
text/html; charset=utf-8
The extract_content_type function returns the MIME type of the remote resource as a string. The first URL returns the expected value, image/png. For the YouTube URL, however, we would expect a video type, yet we get text/html. In other words, when the content type of a response is text/html, we would only download a plain text or HTML document rather than a file with a well-known MIME type such as image/png or video/mp4.
Define a function to verify a downloadable resource
As explained in the previous section, it is necessary to check whether a resource can be downloaded before requesting it.
Checking the Content-Type of the response header
The function below does what we need by checking the content type in the response headers.
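import requests

def is_downloadable(_url):
    """
    Does the url contain a downloadable resource
    """
    # A HEAD request fetches only the headers, not the body.
    h = requests.head(_url, allow_redirects=True)
    header = h.headers
    # Default to '' so a missing Content-Type does not raise on .lower().
    content_type = header.get('content-type', '')
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True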
Applied to the two URLs from the previous examples, this function returns False for the YouTube URL and True for the Google logo link.
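The same decision can be tried without any network access by working directly on a Content-Type value. The sketch below is one possible variant, not the function above: the names media_type and looks_downloadable are hypothetical, and treating a missing Content-Type as "not downloadable" is my own assumption.

```python
def media_type(content_type):
    """Return the lowercase media type from a Content-Type value, or None."""
    if not content_type:
        return None
    # Drop parameters such as "; charset=utf-8".
    return content_type.split(";", 1)[0].strip().lower()


def looks_downloadable(content_type):
    """Treat text/* and HTML-like media types as not downloadable."""
    mtype = media_type(content_type)
    if mtype is None:
        return False  # assumption: no Content-Type means "do not download"
    return not (mtype.startswith("text/") or "html" in mtype)


print(looks_downloadable("image/png"))                 # True
print(looks_downloadable("text/html; charset=utf-8"))  # False
```

The two sample values are the ones printed by the earlier script for the Google logo and the YouTube page.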
Restricting the file size of the downloaded resource
We might want another restriction on the resource, for example, only downloading files no larger than 100 MB. This can be done by inspecting the Content-Length field of the response headers. The fragment below reuses the header variable from is_downloadable.
    content_length = header.get('content-length', None)
    # Header values are strings, so convert to int before comparing.
    if content_length and int(content_length) > 1e8:  # 100 MB approx
        return False
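As a standalone illustration, the size check can be written as a small function that takes the raw Content-Length value. This is only a sketch: the names MAX_BYTES and within_size_limit are hypothetical, and allowing the download when the size is missing or malformed is an assumption that mirrors the fall-through behaviour of the fragment above.

```python
MAX_BYTES = 100 * 1000 * 1000  # 100 MB, matching the 1e8 limit above


def within_size_limit(content_length, max_bytes=MAX_BYTES):
    """Return False only when the declared size is known and too large."""
    if content_length is None:
        return True  # assumption: an unknown size is allowed
    try:
        size = int(content_length)  # header values arrive as strings
    except ValueError:
        return True  # assumption: a malformed value is treated as unknown
    return size <= max_bytes
```

Keep in mind that servers are not required to send Content-Length, so this check alone cannot guarantee an upper bound on the download.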
Getting the file name from the URL
Again, to obtain the file name of the resource, we can use the Content-Disposition field of the response headers.
import re
import requests

def get_filename_from_url(_url):
    """
    Get filename from content-disposition
    """
    r = requests.get(_url, allow_redirects=True)
    cd = r.headers.get('content-disposition')
    if not cd:
        return None
    # Capture everything after "filename=" in the header value.
    filename = re.findall('filename=(.+)', cd)
    if len(filename) == 0:
        return None
    return filename[0]
Parsing the file name out of the URL itself, combined with the above method of reading it from the Content-Disposition header, will work for most cases.
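One way to combine the two approaches is sketched below, using only the standard library so it can be tried on plain strings. The function names filename_from_url and pick_filename are hypothetical, the regular expression is a simplified one of my own that strips optional quotes, and the extended filename*= form of the header is deliberately ignored.

```python
import posixpath
import re
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of `url`, or None if there is none."""
    path = urlparse(url).path  # the query string is not part of the path
    name = posixpath.basename(unquote(path))
    return name or None


def pick_filename(content_disposition, url):
    """Prefer the Content-Disposition file name, else fall back to the URL."""
    if content_disposition:
        match = re.search(r'filename="?([^";]+)"?', content_disposition)
        if match:
            return match.group(1).strip()
    return filename_from_url(url)


logo = "https://www.google.co.uk/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png"
print(pick_filename(None, logo))  # googlelogo_color_272x92dp.png
print(pick_filename('attachment; filename="report.pdf"', logo))  # report.pdf
```

The header value in the second call is made up for illustration; the first call shows the fallback for a response that carries no Content-Disposition header.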
Voilà! If you have any feedback, please don’t hesitate to leave a comment in the comment box below.
