
Quickly download docs from google drive 2021

Recently I needed to collect a large volume of data from Google Drive. This is a quick write-up of how I used Python to download docs from Google Drive.

Enabling API Access

Authentication is often the hardest part of using an API, and this one was no exception. You can follow the guide here for full directions on how to set it up, but I’ve added some extra detail below.

Step 1: Setting up a Project

First, create a new project in the Google Cloud console. Click the project name (“Google docs downloader” in the following screenshot).

[Screenshot: the project name in the Google Cloud console]

You’ll get a popup like:

[Screenshot: the project selection popup]

Select “New project”, enter a name for your project and click “create”.

Click back into the project selection screen and select your new project.

Step 2: Enabling access to your API

Once you’re in your new project, select “APIs and services” from the menu on the left of the screen.

Select the “ENABLE APIS AND SERVICES” button.

Search for “drive” in the search box to get a response like:

Select Google Drive API, then click “Enable”.

Select the “Create credentials” button.

Select the Google Drive API from the dropdown list:

Complete the OAuth screen with the name of your App.

Select the scopes you want:

In this instance, I’ve selected read-only scopes so that my Drive data can’t be changed while I’m testing. Scopes restrict what your application can do, so choose them carefully.

Once you’ve picked your scopes click continue, then “Save and continue”.

Select “Desktop app” as your Application type.

Then click “create” and finally “download” to download your secret.

Rename the downloaded file as “credentials.json”.

Using the API

First, make sure that the “credentials.json” file is in your project folder. You’ll also need to install the appropriate libraries using:

  pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

Importing dependencies

from __future__ import print_function
import os.path

from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
from google.auth.transport.requests import Request
from google.oauth2.credentials import Credentials

import io
from googleapiclient.http import MediaIoBaseDownload

The first step is to import the libraries. Most of them are Google API libraries; we also import the io library to help us write the document to a file.

Entering your scope

When requesting an authorisation token, you need to state the level of access (or scope) you want. In our case, we only need to read from the Drive API, so we request the read-only scope.

# If modifying these scopes, delete the file token.json.
# Scopes: https://developers.google.com/drive/api/v2/about-auth
SCOPES = ['https://www.googleapis.com/auth/drive.readonly']

Identifying the document to download

In this example, we’re downloading a single document based on its document ID. If you navigate to a document in Google Drive, you’re likely to get a URL like “https://docs.google.com/document/d/1H2VA9MvMdfg6Dwfghb_ecEbxlgCm0DTdK_KPxigx8Ag”. The section after “/d/” is the document ID that we’re using.

DOCUMENT_ID = '1H2VA9MvMdfg6Dwfghb_ecEbxlgCm0DTdK_KPxigx8Ag' 
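
If you would rather paste the whole URL than copy the ID by hand, the ID can be pulled out with a regular expression. This is a minimal sketch: the helper name extract_document_id is hypothetical, and the pattern assumes the ID follows “/d/” and contains only letters, digits, hyphens and underscores.

```python
import re

# Assumption: the ID is the run of letters, digits, "-" and "_" after "/d/".
_DOC_ID_PATTERN = re.compile(r"/d/([A-Za-z0-9_-]+)")


def extract_document_id(url):
    """Return the ID following '/d/' in a Drive editor URL, or None if absent."""
    match = _DOC_ID_PATTERN.search(url)
    return match.group(1) if match else None


DOCUMENT_ID = extract_document_id(
    "https://docs.google.com/document/d/1H2VA9MvMdfg6Dwfghb_ecEbxlgCm0DTdK_KPxigx8Ag/edit"
)
```

The helper returns None rather than raising when the URL has no “/d/” segment, so the caller can decide how to report a bad link.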

Getting a Token

This section is common to most code that accesses Google APIs. If your “credentials.json” file is in the same folder as the application, the code below will verify the credentials and generate a new “token.json” file, which allows you to access the API. The token’s access is limited by the scope you set above. If you have already run this code and a token file exists, that token is reused (and refreshed if it has expired).

creds = None
# The file token.json stores the user's access and refresh tokens, and is
# created automatically when the authorization flow completes for the first
# time.
if os.path.exists('token.json'):
    creds = Credentials.from_authorized_user_file('token.json', SCOPES)
# If there are no (valid) credentials available, let the user log in.
if not creds or not creds.valid:
    if creds and creds.expired and creds.refresh_token:
        creds.refresh(Request())
    else:
        flow = InstalledAppFlow.from_client_secrets_file(
            'credentials.json', SCOPES)
        creds = flow.run_local_server(port=0)
    # Save the credentials for the next run
    with open('token.json', 'w') as token:
        token.write(creds.to_json())
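
Because a saved token keeps the scopes it was created with, changing SCOPES later means the old “token.json” has to go. The sketch below shows one way to check for that before reusing the file. The helper name token_covers_scopes is hypothetical, and it assumes the file written by creds.to_json() holds a "scopes" list, so treat it as illustrative rather than as part of the Google libraries.

```python
import json
import os


def token_covers_scopes(token_path, required_scopes):
    """Return True if the saved token file lists every required scope."""
    if not os.path.exists(token_path):
        return False
    with open(token_path) as token_file:
        saved = json.load(token_file)
    # Assumption: the file has a "scopes" list, as written by creds.to_json().
    return set(required_scopes) <= set(saved.get("scopes") or [])
```

If this returns False for an existing file, deleting “token.json” before running the authorisation code will trigger a fresh login with the new scopes.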

Creating a download service and downloading metadata

To download docs from Google Drive, we first need to create a download service using our credentials.

downloadService = build('drive', 'v3', credentials=creds)
results = downloadService.files().get(fileId=DOCUMENT_ID, fields="id, name, mimeType, createdTime").execute()
docMimeType = results['mimeType']

Once the download service is created, this code queries it to populate the “results” value by calling the “get” method on the “files” resource. The get method takes arguments to:
a) Identify a specific document (e.g. the fileId)
b) Request any fields you want to be returned by this query.

This initial call retrieves metadata only; we’ll handle the download later. Note that we also record the document’s MIME type so that we can pick the right export format.

Downloading the document

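The first step is to look up which export format is appropriate for the document’s MIME type, because Google-native files have to be exported to another format rather than downloaded as they are. The lookup in this example only covers Google Docs, so any other MIME type will raise a KeyError.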

mimeTypeMatchup = {
    "application/vnd.google-apps.document": {
        "exportType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "docExt": "docx"
    }
}
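
The lookup table above only maps Google Docs. As a sketch of how it might be extended, the version below adds entries for Sheets and Slides and returns None for an unmapped type instead of raising a KeyError. The names EXPORT_FORMATS and lookup_export_format are illustrative and not part of the original script; the extra rows are additions, so check the export formats page for the types your own files need.

```python
# Google-native MIME type -> Office export type and file extension.
# The spreadsheet and presentation rows are additions, not from the original script.
EXPORT_FORMATS = {
    "application/vnd.google-apps.document": {
        "exportType": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        "docExt": "docx",
    },
    "application/vnd.google-apps.spreadsheet": {
        "exportType": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "docExt": "xlsx",
    },
    "application/vnd.google-apps.presentation": {
        "exportType": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
        "docExt": "pptx",
    },
}


def lookup_export_format(doc_mime_type):
    """Return the export entry for a MIME type, or None if it is not mapped."""
    return EXPORT_FORMATS.get(doc_mime_type)
```

Returning None lets the caller skip or log files that cannot be exported, which matters once you loop over more than one document.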

Then we get the document’s name and file extension so that the downloaded file is given the right name.

exportMimeType = mimeTypeMatchup[docMimeType]['exportType']
docExt = mimeTypeMatchup[docMimeType]['docExt']
docName = results['name']

# Export formats: https://developers.google.com/drive/api/v3/ref-export-formats
request = downloadService.files().export_media(fileId=DOCUMENT_ID, mimeType=exportMimeType)
fh = io.FileIO(docName + "." + docExt, mode='w')

Next, we set up a request to export the media, based on the fileId and the MIME type needed for the export. Finally, we create a FileIO handler in write mode to create the file.
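
One thing to watch: a Drive document name can contain characters, such as “/”, that are not valid in a local file name, in which case opening the file would fail. A minimal sketch of a guard is below. The helper safe_filename and the “untitled” fallback are hypothetical choices, not part of the original script.

```python
import re


def safe_filename(doc_name, doc_ext):
    """Build '<name>.<ext>', replacing characters that are unsafe in file names."""
    # Replace path separators and the other characters Windows rejects.
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", doc_name).strip()
    # Fall back to a placeholder if nothing usable is left.
    return (cleaned or "untitled") + "." + doc_ext
```

It could then be used when opening the file, for example io.FileIO(safe_filename(docName, docExt), mode='w').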

The final download

Once we’ve set up our IO handler and built a request to download the file, we just need to loop through each chunk of data until the download has completed.

downloader = MediaIoBaseDownload(fh, request)
done = False
while done is False:
    status, done = downloader.next_chunk()
    print("Download %d%%." % int(status.progress() * 100))
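
The loop above never closes the file handle, which is easy to miss in a short script. As a sketch, the same loop can be wrapped in a function that closes the handle even when a chunk request fails. The function name run_download and the returned list of percentages are illustrative additions; it only assumes the downloader has the next_chunk() method used above.

```python
def run_download(downloader, file_handle):
    """Pull chunks until the download is done, then close the file handle."""
    progress_log = []
    try:
        done = False
        while not done:
            status, done = downloader.next_chunk()
            if status:
                percent = int(status.progress() * 100)
                progress_log.append(percent)
                print("Download %d%%." % percent)
    finally:
        # Close the handle even if a chunk request raises.
        file_handle.close()
    return progress_log
```

With the objects from this post, it would be called as run_download(MediaIoBaseDownload(fh, request), fh).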

And there we have it: by following these steps, you’ll be able to download docs from Google Drive using the API.
