
I would like to create regular Google Takeout backups (let's say every 3 months) and store them encrypted in some other cloud storage like DropBox or S3.

It does not have to be a cloud-to-cloud solution, though that would be preferred. It does not have to be 100% automated either; however, the more automated it is, the better.

bad_coder

5 Answers


This is a partial answer with partial automation. It may stop working in the future if Google chooses to crack down on automated access to Google Takeout. Features currently supported in this answer:

+---------------------------------------------+------------+---------------------+
|             Automation Feature              | Automated? | Supported Platforms |
+---------------------------------------------+------------+---------------------+
| Google Account log-in                       | No         |                     |
| Get cookies from Mozilla Firefox            | Yes        | Linux               |
| Get cookies from Google Chrome              | Yes        | Linux, macOS        |
| Request archive creation                    | No         |                     |
| Schedule archive creation                   | Kinda      | Takeout website     |
| Check if archive is created                 | No         |                     |
| Get archive list                            | Broken     | Cross-platform      |
| Download all archive files                  | Broken     | Linux, macOS        |
| Encrypt downloaded archive files            | No         |                     |
| Upload downloaded archive files to Dropbox  | No         |                     |
| Upload downloaded archive files to AWS S3   | No         |                     |
+---------------------------------------------+------------+---------------------+

Firstly, a cloud-to-cloud solution can't really work because there is no interface between Google Takeout and any known object storage provider. You've got to process the backup files on your own machine (which could be hosted in the public cloud, if you wanted) before sending them off to your object storage provider.

Secondly, as there is no Google Takeout API, an automation script needs to pretend to be a user with a browser to walk through the Google Takeout archive creation and download flow.


Automation Features

Google Account log-in

This is not yet automated. The script would need to pretend to be a browser and navigate possible hurdles such as two-factor authentication, CAPTCHAs, and other increased security screening.

Get cookies from Mozilla Firefox

I have a script for Linux users to grab the Google Takeout cookies from Mozilla Firefox and export them as environment variables. For this to work, the default/active profile must have visited https://takeout.google.com while logged in.

As a one-liner:

cookie_jar_path=$(mktemp) ; source_path=$(mktemp) ; firefox_profile=$(cat "$HOME/.mozilla/firefox/profiles.ini" | awk -v RS="" '{ if($1 ~ /^\[Install[0-9A-F]+\]/) { print } }' | sed -nr 's/^Default=(.*)$/\1/p' | head -1) ; cp "$HOME/.mozilla/firefox/$firefox_profile/cookies.sqlite" "$cookie_jar_path" ; sqlite3 "$cookie_jar_path" "SELECT name,value FROM moz_cookies WHERE host LIKE '%.google.com' AND (name LIKE 'SID' OR name LIKE 'HSID' OR name LIKE 'SSID' OR (name LIKE 'OSID' AND host LIKE 'takeout.google.com')) AND originAttributes LIKE '^userContextId=1' ORDER BY creationTime ASC;" | sed -e 's/|/=/' -e 's/^/export /' | tee "$source_path" ; source "$source_path" ; rm -f "$source_path" ; rm -f "$cookie_jar_path"

As a prettier Bash script:

#!/bin/bash
# Extract Google Takeout cookies from Mozilla Firefox and export them as envvars
#
# The browser must have visited https://takeout.google.com as an authenticated user.

# Warn the user if they didn't run the script with `source`
[[ "${BASH_SOURCE[0]}" == "${0}" ]] && echo 'WARNING: You should source this script to ensure the resulting environment variables get set.'

cookie_jar_path=$(mktemp)
source_path=$(mktemp)

# In case the cookie database is locked, copy the database to a temporary file.
# Edit the $firefox_profile variable below to select a specific Firefox profile.
firefox_profile=$(
    cat "$HOME/.mozilla/firefox/profiles.ini" |
    awk -v RS="" '{ if($1 ~ /^\[Install[0-9A-F]+\]/) { print } }' |
    sed -nr 's/^Default=(.*)$/\1/p' |
    head -1
)
cp "$HOME/.mozilla/firefox/$firefox_profile/cookies.sqlite" "$cookie_jar_path"

# Get the cookies from the database
sqlite3 "$cookie_jar_path" \
    "SELECT name,value FROM moz_cookies
     WHERE host LIKE '%.google.com' AND (
         name LIKE 'SID' OR
         name LIKE 'HSID' OR
         name LIKE 'SSID' OR
         (name LIKE 'OSID' AND host LIKE 'takeout.google.com')
     ) AND originAttributes LIKE '^userContextId=1'
     ORDER BY creationTime ASC;" |
    # Reformat the output into Bash exports
    sed -e 's/|/=/' -e 's/^/export /' |
    # Save the output into a temporary file
    tee "$source_path"

# Load the cookie values into environment variables
source "$source_path"

# Clean up
rm -f "$source_path"
rm -f "$cookie_jar_path"

Get cookies from Google Chrome

I have a script for Linux and possibly macOS users to grab the Google Takeout cookies from Google Chrome and export them as environment variables. The script assumes that the Python 3 venv module is available and that the default Chrome profile has visited https://takeout.google.com while logged in.

As a one-liner:

if [ ! -d "$venv_path" ] ; then venv_path=$(mktemp -d) ; fi ; if [ ! -f "${venv_path}/bin/activate" ] ; then python3 -m venv "$venv_path" ; fi ; source "${venv_path}/bin/activate" ; python3 -c 'import pycookiecheat, dbus' ; if [ $? -ne 0 ] ; then pip3 install git+https://github.com/n8henrie/pycookiecheat@dev dbus-python ; fi ; source_path=$(mktemp) ; python3 -c 'import pycookiecheat, json; cookies = pycookiecheat.chrome_cookies("https://takeout.google.com") ; [print("export %s=%s;" % (key, cookies[key])) for key in ["SID", "HSID", "SSID", "OSID"]]' | tee "$source_path" ; source "$source_path" ; rm -f "$source_path" ; deactivate

As a prettier Bash script:

#!/bin/bash
# Extract Google Takeout cookies from Google Chrome and export them as envvars
#
# The browser must have visited https://takeout.google.com as an authenticated user.

# Warn the user if they didn't run the script with `source`
[[ "${BASH_SOURCE[0]}" == "${0}" ]] && echo 'WARNING: You should source this script to ensure the resulting environment variables get set.'

# Create a path for the Chrome cookie extraction library
if [ ! -d "$venv_path" ]
then
    venv_path=$(mktemp -d)
fi

# Create a Python 3 venv, if it doesn't already exist
if [ ! -f "${venv_path}/bin/activate" ]
then
    python3 -m venv "$venv_path"
fi

# Enter the Python virtual environment
source "${venv_path}/bin/activate"

# Install dependencies, if they are not already installed
python3 -c 'import pycookiecheat, dbus'
if [ $? -ne 0 ]
then
    pip3 install git+https://github.com/n8henrie/pycookiecheat@dev dbus-python
fi

# Get the cookies from the database
source_path=$(mktemp)
read -r -d '' code << EOL
import pycookiecheat, json
cookies = pycookiecheat.chrome_cookies("https://takeout.google.com")
for key in ["SID", "HSID", "SSID", "OSID"]:
    print("export %s=%s" % (key, cookies[key]))
EOL
python3 -c "$code" | tee "$source_path"

# Clean up
source "$source_path"
rm -f "$source_path"
deactivate
[[ "${BASH_SOURCE[0]}" == "${0}" ]] && rm -rf "$venv_path"

If you sourced the script, the Python virtual environment at $venv_path is kept so that the exported variables stay available. Clean up the downloaded files by removing it once you are done:

rm -rf "$venv_path"

Request archive creation

This is not yet automated. The script would need to fill out the Google Takeout form and then submit it.

Schedule archive creation

There is no fully automated way to do this yet, but in May 2019, Google Takeout introduced a feature that automates the creation of 1 backup every 2 months for 1 year (6 backups total). This has to be done in the browser at https://takeout.google.com while filling out the archive request form:

[Screenshot: Google Takeout "Customize archive format" step]

Check if archive is created

This is not yet automated. If an archive has been created, Google sometimes sends an email to the user's Gmail inbox, but in my testing, this doesn't always happen for reasons unknown.

The only other way to check if an archive has been created is by polling Google Takeout periodically.
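As a rough, hedged sketch of such polling (assuming the cookie environment variables from the "Get cookies" sections above are set; Google may change the page or block this at any time, and unrelated page changes will cause false alarms):

# Watch the Takeout downloads page and report when its content changes
previous_hash=""
while true
do
    current_hash=$(curl -sL -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" \
        'https://takeout.google.com/settings/takeout/downloads' | sha256sum | cut -d ' ' -f 1)
    if [ -n "$previous_hash" ] && [ "$current_hash" != "$previous_hash" ]
    then
        echo "Takeout downloads page changed; a new archive may be ready."
    fi
    previous_hash="$current_hash"
    sleep 3600  # Check once per hour
done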

Get archive list

This section is currently broken.

Google stopped revealing the archive download links on the Takeout download page and has implemented a secure token that limits the download link retrieval for each archive to a maximum of 5 times.

I have a command to do this, assuming that the cookies have been set as environment variables in the "Get cookies" section above:

curl -sL -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" \
'https://takeout.google.com/settings/takeout/downloads' |
grep -Po '(?<=")https://storage.cloud.google.com/[^"]+(?=")' |
awk '!x[$0]++'

The output is a line-delimited list of URLs that lead to downloads of all available archives.
It's parsed from HTML with regex.

Download all archive files

This section is currently broken.

Google stopped revealing the archive download links on the Takeout download page and has implemented a secure token that limits the download link retrieval for each archive to a maximum of 5 times.

Here is the code in Bash to get the URLs of the archive files and download them all, assuming that the cookies have been set as environment variables in the "Get cookies" section above:

curl -sL -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" \
'https://takeout.google.com/settings/takeout/downloads' |
grep -Po '(?<=")https://storage.cloud.google.com/[^"]+(?=")' |
awk '!x[$0]++' |
xargs -n1 -P1 -I{} curl -LOJ -C - -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" {}

I've tested it on Linux, but the syntax should be compatible with macOS, too.

Explanation of each part:

  1. curl command with authentication cookies:

    curl -sL -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" \
  2. URL of the page that has the download links

    'https://takeout.google.com/settings/takeout/downloads' |
  3. Filter to match only the download links

    grep -Po '(?<=")https://storage.cloud.google.com/[^"]+(?=")' |
  4. Filter out duplicate links

    awk '!x[$0]++' |
  5. Download each file in the list, one by one:

    xargs -n1 -P1 -I{} curl -LOJ -C - -H "Cookie: SID=${SID}; HSID=${HSID}; SSID=${SSID}; OSID=${OSID};" {}

    Note: Parallelizing the downloads (changing -P1 to a higher number) is possible, but Google seems to throttle all but one of the connections.

    Note: -C - skips files that already exist, but it might not successfully resume downloads for existing files.

Encrypt downloaded archive files

This is not automated. The implementation depends on how you like to encrypt your files, and local disk space consumption is temporarily doubled for each file you encrypt, because the original and the encrypted copy exist side by side.
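As one possible sketch, using GnuPG symmetric encryption (the file glob is an assumption about the archive names; adapt this to your preferred tool):

# Encrypt each downloaded Takeout archive; FILE becomes FILE.gpg alongside it.
# Delete the originals only after verifying the encrypted copies.
for file in takeout-*.zip takeout-*.tgz
do
    [ -e "$file" ] || continue    # Skip patterns that matched nothing
    gpg --symmetric --cipher-algo AES256 --output "${file}.gpg" "$file"
done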

Upload downloaded archive files to Dropbox

This is not yet automated.
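If you want a starting point, here is a hedged sketch using the Dropbox HTTP API's /2/files/upload endpoint. It assumes DROPBOX_TOKEN holds an access token for a Dropbox app you created, and TAKEOUT_FILE.zip.gpg is a placeholder filename; note that this single-request endpoint only accepts files up to roughly 150 MB, so large Takeout archives would need the upload-session endpoints instead.

# Upload one encrypted archive into the "/Google Takeout" folder of the linked Dropbox
curl -s -X POST https://content.dropboxapi.com/2/files/upload \
    --header "Authorization: Bearer ${DROPBOX_TOKEN}" \
    --header "Dropbox-API-Arg: {\"path\": \"/Google Takeout/TAKEOUT_FILE.zip.gpg\", \"mode\": \"add\"}" \
    --header "Content-Type: application/octet-stream" \
    --data-binary @TAKEOUT_FILE.zip.gpg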

Upload downloaded archive files to AWS S3

This is not yet automated, but it should simply be a matter of iterating over the list of downloaded files and running a command like:

aws s3 cp TAKEOUT_FILE "s3://MYBUCKET/Google Takeout/"
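For instance, a minimal loop along those lines (assuming the AWS CLI is configured and the encrypted archives sit in the current directory; MYBUCKET and the file glob are placeholders):

# Upload every encrypted archive to S3, one at a time
for file in takeout-*.gpg
do
    [ -e "$file" ] || continue
    aws s3 cp "$file" "s3://MYBUCKET/Google Takeout/"
done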
Deltik

Google Takeout lets you schedule exports every two months, so six per year, for up to one year. You can choose to have the file added to a cloud drive or to have a download link emailed to you (downloads are kept for one week only).

To set it up, browse to https://takeout.google.com/settings/takeout?pli=1.

You may select the Google data to include in the backup. The supported cloud drives are Google Drive, Dropbox, OneDrive, and Box. The dump formats are zip or tgz.

You will find more information in the article How to download your Google data.


As Google Takeout does not provide an API, automating the launch of such a backup through the browser may stop working whenever its user interface changes.

It might be best to use Google Takeout to back up to some cloud disk and automate the download of the new files.

You may consult this answer of mine for ways of accessing Google Drive for syncing. It is probably possible to map Google Drive as a drive in Windows and use Windows scheduled tasks to sync new backups to the local disk (although I haven't tried this).

harrymc

Instead of direct APIs for backing up Google Takeout (which seems to be almost impossible to do as of now), you can back up your data to third-party storage solutions via Google Drive. Many Google services allow backup to Google Drive, and you can back up Google Drive using the following tools:

GoogleCL - GoogleCL brings Google services to the command line.

gdatacopier - Command line document management utilities for Google docs.

FUSE Google Drive - A FUSE user-space filesystem for Google Drive, written in C.

Grive - An independent open-source implementation of a Google Drive client. It uses the Google Document List API to talk to the servers in Google. The code is written in C++.

gdrive-cli - A command-line interface for GDrive. This uses the GDrive API, not the GDocs API, which is interesting. To use it, you need to register a chrome application. It must be at least installable by you, but need not be published. There is a boilerplate app in the repo you can use as a starting point.

python-fuse example - Contains some slides and examples of Python FUSE filesystems.

Most of these seem to be in the Ubuntu repositories. I've used FUSE, gdrive, and GoogleCL myself, and they all work fine. Depending on the level of control you want, this will be really easy or really complex; that's up to you. It should be straightforward to do from an EC2/S3 server. Just figure out the commands one by one for everything you need and put them in a script on a cron job.
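As a hedged illustration of the cron part only (the script path /home/user/bin/google-backup.sh and the schedule are placeholders; the script itself would contain whichever of the commands above you settle on):

# Added via `crontab -e`: run the backup script at 03:00 on the 1st of every third month
0 3 1 */3 * /home/user/bin/google-backup.sh >> /home/user/google-backup.log 2>&1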

If you don't want to work so hard, you can also just use a service like Spinbackup. I'm sure there are others just as good but I haven't tried any.

krowe

I found this question while searching for how to fix my Google Photos not showing up properly in Google Drive (which I'm already automatically backing up!).

So, to get your photos to show up in Google Drive, go to https://photos.google.com, open the settings, and set it to show photos in a folder in Drive.

Then use https://github.com/ncw/rclone to clone your entire Google Drive (which now includes photos as a 'normal' directory) down to your local storage.
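A minimal sketch of that step, assuming you have already run rclone config and named the remote gdrive (the remote name and the local path are placeholders):

# Mirror the whole Google Drive, including the Google Photos folder, to local storage
rclone sync gdrive: /mnt/backup/google-drive --progress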

djsmiley2kStaysInside

On the Google side, you can schedule the takeout export every 2 months, so it would be automatic then.

Once you have downloaded the data, how you deal with it (e.g. uploading it to another cloud storage) should be simple to automate. This is a whole topic of its own and depends on what you want, so it is maybe better suited for separate questions. I personally extract the data, put it into a Git Annex repo, and then sync that repo to whatever other media / cloud storage with the functions provided by Git Annex (it can do all of that already). A solution like Git Annex also has the advantage that it deduplicates files when you put multiple Google Takeouts into it.
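To make that concrete, here is a rough sketch of the Git Annex workflow I mean (the paths and the remote name are placeholders; configuring the remote itself is covered by the git-annex documentation):

# One-time setup of the annex repository
git init ~/takeout-annex && cd ~/takeout-annex
git annex init "takeout backups"

# Add extracted Takeout data; content is stored by checksum, so duplicates across
# multiple takeouts are only stored once
cp -r /path/to/extracted/takeout-2020-01 .
git annex add .
git commit -m "Add takeout 2020-01"

# Copy the content to another drive or cloud storage configured as a git-annex remote
git annex copy --to myremote
git annex sync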

The main remaining difficulty is how to automate the download of the Google Takeout files. This is especially annoying if the export contains a lot of data, which will be split into lots of files (in my case, more than 350 individual 2 GB files). This is what I will describe now:

I'm developing exactly this right now. Code is here on GitHub (chrome-ext-google-takeout-downloader).

This is a Google Chrome extension (my first one). Currently you would enable it via developer mode: on chrome://extensions/, enable "Developer mode", then click "Load unpacked" and select the directory of the extension.

You would go to the website with your recent takeout (https://takeout.google.com/settings/takeout/downloads) and start the download of the first file.

The extension will then wait for the download to finish and automatically start the download of the next part.

To make this work, you need to disable "Ask where to save each file before downloading" and select a download directory with enough space.

You also need to allow the website to download multiple files ("... automatic downloads of multiple files").

Unfortunately, Google asks you to re-enter your Google password every 10 minutes or so. You can automate this as well by storing the Google password in the extension, which will then enter it for you automatically. Do that at your own risk! Read through the code to understand what happens with the password.

The backup-google-takeout.py script is an example that you would run in the background with the option --poll-zip-dir, which automatically adds the content of the zip files to a Git Annex repository.


Also related: Reddit: Google Takeout Archives - Downloading ALL the zip files at once?:

However if you select either tgz or tbz as the archive format instead of zip the archives will be chunked into much bigger chunks (~50GB) so there will be a lot less to download.

Albert