Efficiently write a Pandas dataframe to Google BigQuery

Question

I'm trying to upload a pandas.DataFrame to Google Big Query using the pandas.DataFrame.to_gbq() function documented here. The problem is that to_gbq() takes 2.3 minutes while uploading directly to Google Cloud Storage takes less than a minute. I'm planning to upload a bunch of dataframes (~32) each one with a similar size, so I want to know what is the faster alternative.

This is the script that I'm using:

dataframe.to_gbq('my_dataset.my_table', 
                 'my_project_id',
                 chunksize=None, # I have tried with several chunk sizes, it runs faster when it's one big chunk (at least for me)
                 if_exists='append',
                 verbose=False
                 )

dataframe.to_csv(str(month) + '_file.csv') # the file size its 37.3 MB, this takes almost 2 seconds 
# manually upload the file into GCS GUI
print(dataframe.shape)
(363364, 21)

My question is, what is faster?

Upload Dataframe using pandas.DataFrame.to_gbq() function
Saving Dataframe as CSV and then upload it as a file to BigQuery using the Python API
Saving Dataframe as CSV and then upload the file to Google Cloud Storage using this procedure and then reading it from BigQuery

Update:

Alternative 1 seems faster than Alternative 2 , (using pd.DataFrame.to_csv() and load_data_from_file() 17.9 secs more in average with 3 loops):

def load_data_from_file(dataset_id, table_id, source_file_name):
    bigquery_client = bigquery.Client()
    dataset_ref = bigquery_client.dataset(dataset_id)
    table_ref = dataset_ref.table(table_id)
    
    with open(source_file_name, 'rb') as source_file:
        # This example uses CSV, but you can use other formats.
        # See https://cloud.google.com/bigquery/loading-data
        job_config = bigquery.LoadJobConfig()
        job_config.source_format = 'text/csv'
        job_config.autodetect=True
        job = bigquery_client.load_table_from_file(
            source_file, table_ref, job_config=job_config)

    job.result()  # Waits for job to complete

    print('Loaded {} rows into {}:{}.'.format(
        job.output_rows, dataset_id, table_id))

I'd suggest you to use the pydatalab package (your third approach). We achieved big speed improvements on downloading from bigquery with that package against pandas native function — Nico Albers, Feb 20 '18 at 15:43
Those times seem high. What version of pandas-gbq are you using? Version 0.3.0 should be materially faster at uploading — Maximilian, Feb 21 '18 at 11:15
@NicoAlbers I'm surprised if there were a material difference between the libraries - I've found pandas-gbq similar-to-slightly-faster. Do you have any examples? — Maximilian, Feb 21 '18 at 11:16
I recently started a thread on performance between python & BQ: https://github.com/pydata/pandas-gbq/issues/133 — Maximilian, Feb 21 '18 at 11:17
I just realized that comparison was with an older version, as soon as I find time, I'll compare that — Nico Albers, Feb 21 '18 at 11:26

enle lin · Accepted Answer · 2018-03-01T14:52:52.067

I did the comparison for alternative 1 and 3 in Datalab using the following code:

from datalab.context import Context
import datalab.storage as storage
import datalab.bigquery as bq
import pandas as pd
from pandas import DataFrame
import time

# Dataframe to write
my_data = [{1,2,3}]
for i in range(0,100000):
    my_data.append({1,2,3})
not_so_simple_dataframe = pd.DataFrame(data=my_data,columns=['a','b','c'])

#Alternative 1
start = time.time()
not_so_simple_dataframe.to_gbq('TestDataSet.TestTable', 
                 Context.default().project_id,
                 chunksize=10000, 
                 if_exists='append',
                 verbose=False
                 )
end = time.time()
print("time alternative 1 " + str(end - start))

#Alternative 3
start = time.time()
sample_bucket_name = Context.default().project_id + '-datalab-example'
sample_bucket_path = 'gs://' + sample_bucket_name
sample_bucket_object = sample_bucket_path + '/Hello.txt'
bigquery_dataset_name = 'TestDataSet'
bigquery_table_name = 'TestTable'

# Define storage bucket
sample_bucket = storage.Bucket(sample_bucket_name)

# Create or overwrite the existing table if it exists
table_schema = bq.Schema.from_dataframe(not_so_simple_dataframe)

# Write the DataFrame to GCS (Google Cloud Storage)
%storage write --variable not_so_simple_dataframe --object $sample_bucket_object

# Write the DataFrame to a BigQuery table
table.insert_data(not_so_simple_dataframe)
end = time.time()
print("time alternative 3 " + str(end - start))

and here are the results for n = {10000,100000,1000000}:

n       alternative_1  alternative_3
10000   30.72s         8.14s
100000  162.43s        70.64s
1000000 1473.57s       688.59s

Judging from the results, alternative 3 is faster than alternative 1.

Anonymous · Answer 2 · 2021-08-16T16:26:11.813

Having also had performance issues with to_gbq() I just tried the native google client and it's miles faster (approx 4x), and if you omit the step where you wait for the result, it's approx 20x faster.

Worth noting that best practice would be to wait for the result and check it, but in my case there's extra steps later on that validate the results.

I'm using pandas_gbq version 0.15 (the latest at the time of writing). Try this:

from google.cloud import bigquery
import pandas

df = pandas.DataFrame(
    {
        'my_string': ['a', 'b', 'c'],
        'my_int64': [1, 2, 3],
        'my_float64': [4.0, 5.0, 6.0],
        'my_timestamp': [
            pandas.Timestamp("1998-09-04T16:03:14"),
            pandas.Timestamp("2010-09-13T12:03:45"),
            pandas.Timestamp("2015-10-02T16:00:00")
        ],
    }
)

client = bigquery.Client()
table_id = 'my_dataset.new_table'

# Since string columns use the "object" dtype, pass in a (partial) schema
# to ensure the correct BigQuery data type.
job_config = bigquery.LoadJobConfig(schema=[
    bigquery.SchemaField("my_string", "STRING"),
])

job = client.load_table_from_dataframe(
    df, table_id, job_config=job_config
)

# Wait for the load job to complete. (I omit this step)
# job.result()

score 0 · Answer 3 · answered Jun 22 '22 at 00:05

0

You can use pandas.DataFrame.to_gbq()

Here is the doc

answered Jun 22 '22 at 00:05

Marc

522
1
5
15

Efficiently write a Pandas dataframe to Google BigQuery

3 Answers3

Linked