How to Authenticate With Google Drive From JupyterHub
Saturn Cloud provides customizable, ready-to-use cloud environments for collaborative data teams.
Try Saturn Cloud and join thousands of users moving to the cloud without
having to switch tools.
Table of contents
- Google Drive
- Create a Google Drive account
- Service Account
- Authenticate Google Drive from JupyterHub
- Create a service account
- Authorize access to a Google Drive Folder using Service Account
- Deploy JupyterHub on AWS EKS
- Store your service account credentials using Kubernetes Secrets and configure your Kubernetes yaml file to obtain the secret on every pod
- Connect to Google Drive folder from JupyterHub using Google client library for Python
- Resources
Introduction:
Google Drive is a popular, free cloud storage service from Google that lets users store, manage, organize, sync, and access their files online. Many people keep their data on Google Drive because it is easy to access and to collaborate on. By the end of this article, you’ll understand how to authenticate with Google Drive from JupyterHub using a service account, and how to retrieve the files in a Google Drive folder from a JupyterHub environment hosted on Kubernetes.
Google Drive:
Google Drive offers several features that make it a versatile file management tool. Users can create and edit files directly in Google Drive using Google Docs, Sheets, and Slides, which are free web-based alternatives to Microsoft Office.
One of the main reasons for Google Drive’s popularity is that it is accessible and file-agnostic: you can store any kind of file, including PDFs, videos, images, binaries, code, and APKs, and access those files from anywhere on any device. Google Drive also lets users share files with others, making it easy to collaborate on a project with colleagues and friends. Another useful feature is its integration with other Google services such as Gmail, Google Calendar, and Google Photos: users can attach files from Google Drive to emails or events, or automatically back up photos and videos to Google Drive.
Additionally, files and folders stored in Google Drive can be accessed and retrieved through its REST API, or through SDKs and client libraries for languages including Python, Java, JavaScript, and PHP. These SDKs let you integrate Google Drive into your application, whether for internal or public use. Google Drive also supports OAuth 2.0, which allows you as a developer to authenticate users and obtain access to their files or folders stored on Google Drive.
In addition to data storage, collaboration, and file sharing, Drive offers other features, such as:
Offline access: Data stored on Google Drive can be accessed offline using the Google Drive app on desktop or mobile devices. Additionally, you can edit and create new files while offline, and changes will be synced to the cloud when you come back online.
Advanced search: Google Drive’s search function lets you find files by keyword, file type, owner, and more. It also uses optical character recognition (OCR), so text inside images and scanned documents can be searched as well.
Third-party app integration: Google Drive integrates with a wide range of third-party applications, including document editors, project management tools, and backup and sync services.
Version history: Google Drive automatically saves a version history of files, allowing you to view and restore previous versions. This is helpful when working on collaborative documents, as it lets a team see changes and revert to a previous version if needed.
Now that we have a basic understanding of Google Drive and its features, let’s move on to service accounts.
Service Accounts:
Service accounts are an integral component of Google Cloud Platform (GCP) services. Service accounts provide a way to authorize access to GCP services programmatically, without the need for user intervention.
Applications and scripts use service accounts to access Google resources under the service account’s own identity rather than through an individual user’s login. For instance, a web application can use a service account to access Google Drive data.
A service account key carries a set of configuration values such as a private key, client ID, token URI, and certificate URLs.
Using a service account with Google Drive provides several benefits:
No user authentication: Service accounts use server-to-server communication and do not require an interactive sign-in, which can make authentication simpler for unattended workloads.
Access to restricted resources: A service account can be granted access to restricted resources, such as specific Google Drive folders, and can be used to manage them.
Connect Google Drive with JupyterHub:
Follow the steps below to connect to, authorize access to, and extract data from a Google Drive folder from JupyterHub hosted on Kubernetes.
Requirements:
Step 1: Create a service account
- Navigate to your Google Cloud console and make sure you have an appropriate Google Cloud project created. Then go to IAM and Admin » Service Accounts.
- Next, click Create Service Account.
- Input your preferred service account name (e.g. “Saturn-cloud-Drive”).
You can also use the Service Account description field to give more details on the purpose of the account. Click Create and Continue.
Skip the “Grant this service account access to the project” step, because this service account does not need any project-level role.
If you want other users in your organization to be able to use the service account, you can add a user’s email address or a group name to grant them the same set of permissions.
For this project, we will leave it blank and click Done.
Now that we have created a service account, the next step is to create a key for it.
To use a service account from outside Google Cloud, such as on other platforms or on-premises (in our case, from JupyterHub), we must first establish the identity of the service account. Public/private key pairs provide a secure way of accomplishing this goal. When we create a service account key, the public portion is stored on Google Cloud, while the private portion is available only to us. For more information about public/private key pairs, see Service account keys.
- To create the new key, under the Actions column, click Manage Keys » Add Key » Create new key.
For the key type, select JSON; the JSON key file will then be downloaded to your computer. This file contains information such as the project_id, private_key, and token_uri. Treat it like a password, because anyone who holds it can act as the service account.
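Before uploading the key anywhere, it can help to confirm that the downloaded file looks like a service account key. The sketch below is a minimal check under stated assumptions: REQUIRED_FIELDS and check_service_account_key are illustrative names, the field list covers only the entries commonly needed for authentication, and the sample values are placeholders rather than a real key.

```python
import json

# Fields the downloaded JSON key is expected to contain (illustrative subset).
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email", "token_uri")


def check_service_account_key(raw_json: str) -> dict:
    """Parse a key file's text and verify the fields used for authentication."""
    info = json.loads(raw_json)
    missing = [field for field in REQUIRED_FIELDS if field not in info]
    if missing:
        raise ValueError(f"key file is missing fields: {', '.join(missing)}")
    if info["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {info['type']!r}")
    return info


# Placeholder values only; a real key file has more fields and a real private key.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "saturn-cloud-drive@my-project.iam.gserviceaccount.com",
    "token_uri": "https://oauth2.googleapis.com/token",
})
info = check_service_account_key(sample)
print(info["client_email"])
```

The client_email value is the service account email that the next step shares the Drive folder with.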
Step 2: Authorize access to a Google Drive Folder using Service Account
In this step, we will share a folder in our Google Drive with the service account we created. With the service account credentials, we can then access the files in that folder from any language that has a Google Drive SDK.
- Copy your service account email
- Open Google Drive, select a folder, and click Manage access. Paste your service account email into the Add people and groups text box and grant it the Editor role (Viewer is enough if you only need to read files).
Now we have successfully shared a folder with our service account. Next, let’s deploy JupyterHub, store the service account key as a Kubernetes Secret, and then access the files inside that folder.
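The code in Step 5 identifies the shared folder by its ID, which appears in the folder's URL after /folders/. The helper below is a small sketch for pulling that ID out of a copied link; extract_folder_id is an illustrative name, the URL is a made-up example, and the pattern assumes IDs consist of letters, digits, hyphens, and underscores.

```python
import re
from typing import Optional

# Matches the path segment that follows "/folders/" in a Drive folder link.
_FOLDER_ID_PATTERN = re.compile(r"/folders/([A-Za-z0-9_-]+)")


def extract_folder_id(url: str) -> Optional[str]:
    """Return the folder ID from a Drive folder URL, or None if none is found."""
    match = _FOLDER_ID_PATTERN.search(url)
    return match.group(1) if match else None


# Made-up example link; query parameters such as ?usp=sharing are ignored.
folder_id = extract_folder_id(
    "https://drive.google.com/drive/folders/1AbC_dEf-123?usp=sharing"
)
print(folder_id)
```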
Step 3: Deploy JupyterHub on AWS EKS
First, we have to set up JupyterHub on AWS EKS.
Follow the link below and skip to the Kubernetes section:
https://saturncloud.io/blog/jupyterhub_aws/
Note: Be sure to run the commands in the next step in the same terminal where you set up your Kubernetes cluster.
Step 4: Store your service account credentials using Kubernetes Secrets and configure your Kubernetes yaml file to obtain the secret on every pod
Before we begin, let’s look at what Kubernetes Secrets do. Then we will store our service account credentials in a Secret so that every JupyterHub user pod can access data inside the Google Drive folder.
Secrets:
A Kubernetes Secret is an object for storing sensitive data such as passwords, tokens, configuration files, or SSH keys, so that a pod can use this information without it being written into the pod specification or container image.
In Kubernetes, Secrets are stored in etcd, the database that holds a cluster’s configuration data. By default, Secret values are only base64-encoded, not encrypted, so protecting them depends on enabling encryption at rest and restricting access with RBAC. Even then, etcd may not be as secure as specialized tools for storing sensitive information, such as hardware security modules (HSMs) or key management systems (KMS).
Here are some of the benefits of using Secrets:
Configuration management: When you store credentials in Secrets, you can ensure that all instances of an application or all pods use the same credentials or configuration file, even if they are deployed across multiple nodes in the cluster. This helps prevent misconfigurations and ensures consistency across the environments in your cluster.
Security: Storing sensitive data such as passwords and tokens in Kubernetes Secrets keeps them out of manifests, images, and source code, which is safer than embedding them in clear text. Note that anyone with permission to read Secrets in the namespace, or to create pods that use them, can still retrieve the values, so access should be restricted.
Collaboration: By storing configuration data in Secrets, you make it easy for multiple teammates to work on the same application and use a central configuration.
Next, let’s store our service account credentials in a Secret so that they are available as an environment variable in every JupyterHub user pod.
- Create a secret and load your service account credentials JSON into it using the --from-file parameter. Make sure the JSON file (named secret.json here) is in the directory where you run the command.
$ kubectl create secret generic driveconfig --from-file=secret.json
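Alternatively, you can store the credentials in a ConfigMap using the command below. ConfigMaps are not intended for sensitive data, so prefer the Secret; if you do use a ConfigMap, the pod configuration must reference it with configMapKeyRef instead of secretKeyRef.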
$ kubectl create configmap driveconfig --from-file=secret.json
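To make clear what the kubectl create secret command above stores, the sketch below builds an approximately equivalent Secret manifest in Python. It is an illustration, not kubectl's actual implementation: build_secret_manifest is a hypothetical helper and the file content is a placeholder. The point to notice is that the file name becomes the key inside the Secret, and that the value is base64-encoded, which anyone can reverse.

```python
import base64
import json


def build_secret_manifest(secret_name: str, file_name: str, file_bytes: bytes) -> dict:
    """Build a Secret manifest similar to `kubectl create secret generic --from-file`."""
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name},
        "type": "Opaque",
        # The file name is used as the key; the content is base64-encoded.
        "data": {file_name: base64.b64encode(file_bytes).decode("ascii")},
    }


# Placeholder content standing in for the real service account key file.
manifest = build_secret_manifest(
    "driveconfig", "secret.json", b'{"type": "service_account"}'
)
print(json.dumps(manifest, indent=2))

# Decoding needs no key or password, so base64 is encoding, not encryption.
decoded = base64.b64decode(manifest["data"]["secret.json"]).decode("utf-8")
print(decoded)
```

This is why the pod configuration refers to the secret name (driveconfig) and to the key (secret.json) separately.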
- Open your JupyterHub configuration file
$ sudo nano <your yaml configuration file>
# e.g sudo nano /tmp/jupyterhub.yaml
- Make the following changes under the singleuser key in your JupyterHub configuration file
# before
singleuser:
  extraEnv: {}
# substitute the code above with the code below
singleuser:
  extraEnv:
    DRIVE-SA:
      valueFrom:
        secretKeyRef:
          name: <secret key name e.g driveconfig>
          key: <secret file name e.g secret.json>
Note: When a notebook server is created for a user, the environment variable will be available inside it.
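Inside a notebook, a quick way to confirm that the variable arrived is to read it and parse it as JSON, failing with a clear message if either step goes wrong. This is a minimal sketch: load_service_account_info is an illustrative name, and the demo passes a stand-in dictionary instead of the real environment so that it runs anywhere.

```python
import json
import os
from typing import Mapping


def load_service_account_info(
    env_var: str = "DRIVE-SA", environ: Mapping[str, str] = os.environ
) -> dict:
    """Read the service account JSON from an environment variable and parse it."""
    raw = environ.get(env_var)
    if not raw:
        raise RuntimeError(
            f"environment variable {env_var!r} is not set; check singleuser.extraEnv"
        )
    try:
        info = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise RuntimeError(f"{env_var!r} does not contain valid JSON") from exc
    if not isinstance(info, dict):
        raise RuntimeError(f"{env_var!r} must contain a JSON object")
    return info


# Stand-in environment with placeholder values, used only for this demo.
fake_env = {"DRIVE-SA": '{"type": "service_account", "project_id": "my-project"}'}
info = load_service_account_info(environ=fake_env)
print(info["project_id"])
```

In a real notebook, calling load_service_account_info() with no arguments reads from os.environ.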
- You can use the command below to forward the service port so that you can access JupyterHub in your browser at localhost:8080
$ kubectl --namespace=default port-forward service/proxy-public 8080:http
Step 5: Connect to Google Drive folder from JupyterHub using Google client library for Python
!pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib
!pip install oauth2client
# Import necessary modules and libraries
import json
import os

from googleapiclient.discovery import build
from oauth2client.service_account import ServiceAccountCredentials
# Retrieve the service account credentials from the environment variable
DRIVESA = os.environ.get('DRIVE-SA')
# Convert the service account credentials into a Python dictionary
DRIVESA_DICT = json.loads(DRIVESA)
# Set the scopes for the API request (scope URLs are lowercase)
SCOPES = ['https://www.googleapis.com/auth/drive.readonly']
# Define a function to initialize the Google Drive service object
def initialize_drive():
    """
    Initializes a Google Drive service object with the provided service account credentials

    Returns:
        An authorized Google Drive service object
    """
    # Create credentials object from the provided service account credentials
    credentials = ServiceAccountCredentials.from_json_keyfile_dict(
        DRIVESA_DICT, SCOPES)
    # Build the Google drive service object with the authorized credentials
    service = build('drive', 'v3', credentials=credentials)
    # Print the service object for debugging purposes
    print(service)
    return service
# Initialize the Google Drive service object
drive_service = initialize_drive()
# Define the ID of the Google Drive folder to retrieve files from
folder_id = 'xxxxxxxxxxxxxx'
# Retrieve a list of files from the specified Google Drive folder
results = drive_service.files().list(q=f"'{folder_id}' in parents and trashed = false",
fields="nextPageToken, files(id, name)").execute()
files = results.get('files', [])
# Loop through each file in the list of files and download its content
for file in files:
    # Retrieve the file ID and name
    file_id = file.get('id')
    file_name = file.get('name')
    # Create a request to download the file content
    request = drive_service.files().get_media(fileId=file_id)
    content = request.execute()
    # Save the file content to the local directory
    with open(file_name, 'wb') as f:
        f.write(content)
The code above initializes a Google Drive service object using a service account and retrieves a list of files from a specified folder. It then downloads the content of each file in the list and saves it to the local directory.
First, we import the necessary modules and libraries, including the Google authentication libraries, the Google Drive API client library, and the JSON library. We then retrieve the secret stored as an environment variable and assign it to a variable DRIVESA, which contains the JSON representation of the service account credentials.
The initialize_drive() function is then defined to create an authenticated Google Drive service object using the provided service account credentials. This function returns the service object.
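The oauth2client package used in the code above is deprecated in favor of google-auth. A minimal sketch of the same initialization with google-auth follows, assuming the google-auth and google-api-python-client packages are installed. build_drive_service is an illustrative name that does not appear in the original code, and the imports sit inside the function only so that the sketch can be loaded before the packages are installed.

```python
SCOPES = ["https://www.googleapis.com/auth/drive.readonly"]


def build_drive_service(service_account_info: dict):
    """Create a Drive v3 service object from a parsed service account key."""
    # Imported lazily; both packages must be installed before calling this.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    # Build credentials from the key dictionary, limited to read-only Drive access.
    credentials = service_account.Credentials.from_service_account_info(
        service_account_info, scopes=SCOPES
    )
    return build("drive", "v3", credentials=credentials)
```

With this helper, the call drive_service = build_drive_service(DRIVESA_DICT) could stand in for initialize_drive(), and the separate oauth2client install would no longer be needed.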
After the drive_service object is initialized, the code retrieves a list of files from a specified folder using the files().list() method with the specified query parameters.
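The files().list() call returns results one page at a time, and the code above reads only the first page even though it requests nextPageToken. For folders with many files, a loop along these lines could collect every page. It is a sketch under stated assumptions: list_folder_files is an illustrative name, and it expects a service object like the drive_service created earlier.

```python
def list_folder_files(service, folder_id, page_size=100):
    """Collect id and name for every non-trashed file in a folder, across pages."""
    query = f"'{folder_id}' in parents and trashed = false"
    files = []
    page_token = None  # None requests the first page
    while True:
        response = service.files().list(
            q=query,
            fields="nextPageToken, files(id, name)",
            pageSize=page_size,
            pageToken=page_token,
        ).execute()
        files.extend(response.get("files", []))
        # A missing nextPageToken means the last page has been read.
        page_token = response.get("nextPageToken")
        if not page_token:
            return files
```

For example, files = list_folder_files(drive_service, folder_id) could replace the single list() call.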
Next, the code loops through each file in the list of files and retrieves its file ID and name. It then creates a request to download the content of the file using the files().get_media() method with the specified file ID. The content of the file is retrieved using the execute() method of the request object.
Finally, the file is saved to the local directory using the open() function.
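Drive file names are not guaranteed to be safe local file names: a name may contain a slash, and one folder can hold several files with the same name. The helper below is one possible way to pick a local path before calling open(). safe_local_path is an illustrative name, and the replacement rules are an assumption rather than anything the Drive API prescribes.

```python
import os


def safe_local_path(file_name: str, file_id: str, directory: str = ".") -> str:
    """Return a local path for a Drive file that stays inside `directory`."""
    # Replace path separators so the name cannot point outside the directory.
    cleaned = file_name.replace("/", "_").replace("\\", "_").strip()
    if cleaned in ("", ".", ".."):
        cleaned = "untitled"
    path = os.path.join(directory, cleaned)
    if os.path.exists(path):
        # Drive allows duplicate names in a folder; the file ID keeps them apart.
        stem, ext = os.path.splitext(cleaned)
        path = os.path.join(directory, f"{stem}_{file_id}{ext}")
    return path


print(safe_local_path("reports/2023.csv", "abc123"))
```

In the download loop, open(safe_local_path(file_name, file_id), 'wb') could be used in place of open(file_name, 'wb').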
Resources:
About Saturn Cloud
Saturn Cloud is your all-in-one solution for data science & ML development, deployment, and data pipelines in the cloud. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Request a demo today to learn more.