Bulk insert huge data into SQLite using Python

Question

I read this: Importing a CSV file into a sqlite3 database table using Python

and it seems that everyone suggests using line-by-line reading instead of using bulk .import from SQLite. However, that will make the insertion really slow if you have millions of rows of data. Is there any other way to circumvent this?

Update: I tried the following code to insert line by line, but the speed is not as good as I expected. Is there any way to improve it?

for logFileName in allLogFilesName:
    logFile = codecs.open(logFileName, 'rb', encoding='utf-8')
    for logLine in logFile:
        logLineAsList = logLine.split('\t')
        output.execute('''INSERT INTO log VALUES(?, ?, ?, ?)''', logLineAsList)
    logFile.close()
connection.commit()
connection.close()

score 55 · Answer 1 · answered Aug 27 '15 at 02:11


Since this is the top result on a Google search I thought it might be nice to update this question.

From the Python sqlite3 docs, you can use executemany():

import sqlite3

persons = [
    ("Hugo", "Boss"),
    ("Calvin", "Klein")
]

con = sqlite3.connect(":memory:")

# Create the table
con.execute("create table person(firstname, lastname)")

# Fill the table
con.executemany("insert into person(firstname, lastname) values (?,?)", persons)

I have used this method to commit over 50k row inserts at a time and it's lightning fast.
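Note that the snippet above uses an in-memory database and never calls commit(), so nothing survives once the program exits. Below is a minimal sketch of the same approach against a file on disk, followed by reading the rows back. The temporary directory and the file name `people.db` are assumptions made for illustration only.

```python
import os
import sqlite3
import tempfile

persons = [("Hugo", "Boss"), ("Calvin", "Klein")]

# Hypothetical location; any writable path works.
db_path = os.path.join(tempfile.mkdtemp(), "people.db")

con = sqlite3.connect(db_path)
con.execute("create table person(firstname, lastname)")
con.executemany("insert into person(firstname, lastname) values (?,?)", persons)
con.commit()  # without this, the pending inserts are discarded on close
con.close()

# Reopen the file and read the rows back.
con = sqlite3.connect(db_path)
rows = con.execute(
    "select firstname, lastname from person order by rowid"
).fetchall()
con.close()
print(rows)
```

Opening the same file with the sqlite3 command-line shell should then show the table under .schema.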

answered Aug 27 '15 at 02:11

Fred


How do I read back from this? What should I select or .open in sqlite3 to retrieve or read the values in this database? .schema does not show me this table when I open the sqlite3 CLI – VIshu Kamble Sep 26 '19 at 14:43
@VIshuKamble this creates an in-memory DB, which is lost after you close the program. If you want to persist the data, use `sqlite3.connect("path/to/dbFile")`. – jnovacho Oct 05 '19 at 09:24
@jnovacho no I meant when the program is running. But thanks! – VIshu Kamble Oct 09 '19 at 19:01
@ant1g `:memory:` made no difference in my benchmarking vs an SSD btw: https://stackoverflow.com/a/76659706/895245 – Ciro Santilli OurBigBook.com Jul 11 '23 at 07:47

alecxe · Accepted Answer · 2020-06-16T16:52:11.180

25

Divide your data into chunks on the fly using generator expressions, and do the inserts inside a transaction. Here's a quote from the SQLite optimization FAQ:

Unless already in a transaction, each SQL statement has a new transaction started for it. This is very expensive, since it requires reopening, writing to, and closing the journal file for each statement. This can be avoided by wrapping sequences of SQL statements with BEGIN TRANSACTION; and END TRANSACTION; statements. This speedup is also obtained for statements which don't alter the database.

Here's what your code might look like.
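A minimal sketch of that approach, not a definitive implementation: it assumes tab-separated log lines with four fields and a four-column `log` table, as in the question. The helper names `chunks` and `parse`, the chunk size, the column names, and the generated sample lines are all hypothetical stand-ins.

```python
import itertools
import sqlite3


def chunks(iterable, size=10000):
    """Yield lists of at most `size` items without loading everything at once."""
    iterator = iter(iterable)
    while True:
        chunk = list(itertools.islice(iterator, size))
        if not chunk:
            return
        yield chunk


def parse(lines):
    """Split tab-separated lines into 4-field rows (assumed log format)."""
    return (line.rstrip("\n").split("\t") for line in lines)


# Stand-in for the lines that would be read from the log files.
sample_lines = [f"{i}\tINFO\thost{i % 3}\tmessage {i}\n" for i in range(25000)]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE log (a, b, c, d)")

for chunk in chunks(parse(sample_lines)):
    # The context manager commits on success and rolls back on an exception,
    # so each chunk is written in its own transaction.
    with connection:
        connection.executemany("INSERT INTO log VALUES (?, ?, ?, ?)", chunk)

inserted = connection.execute("SELECT COUNT(*) FROM log").fetchone()[0]
connection.close()
print(inserted)
```

Because `parse` is a generator expression, only one chunk of rows is held in memory at a time; the chunk size trades memory against the number of commits.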

Also, SQLite itself can import CSV files.

edited Jun 16 '20 at 16:52

answered Aug 13 '13 at 21:53

alecxe


Can SQLite import many CSV files at once? I could not find a way to do that. – Shar Aug 13 '13 at 22:10
I don't think it is possible to import many CSV files at once. Splitting the data into chunks and inserting them in transactions should be the way to go, I think. – alecxe Aug 13 '13 at 22:34
The "import CSV files" link is dead – Dev Jun 15 '20 at 08:06

score 19 · Answer 3 · answered Aug 13 '13 at 21:51


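SQLite can do tens of thousands of inserts per second; just make sure to do all of them in a single transaction by surrounding the inserts with BEGIN and COMMIT. (With the default settings of Python's sqlite3 module, executemany() runs its inserts inside one implicitly opened transaction, but you still have to call commit() yourself.)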
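As an illustration of wrapping the inserts in an explicit BEGIN and COMMIT, here is a minimal sketch. It assumes `isolation_level=None`, which turns off the module's implicit transaction handling so that the two statements below are the only transaction boundaries. The table layout and the generated rows are hypothetical.

```python
import sqlite3

# isolation_level=None disables the module's implicit BEGIN, so the
# explicit BEGIN and COMMIT below fully control the transaction.
connection = sqlite3.connect(":memory:", isolation_level=None)
connection.execute("CREATE TABLE log (line INTEGER, message TEXT)")

rows = [(i, f"message {i}") for i in range(50000)]

connection.execute("BEGIN")
for row in rows:
    connection.execute("INSERT INTO log VALUES (?, ?)", row)
connection.execute("COMMIT")

in_transaction_after = connection.in_transaction
count = connection.execute("SELECT COUNT(*) FROM log").fetchone()[0]
connection.close()
print(count, in_transaction_after)
```

Without the BEGIN, each INSERT in this mode would be committed on its own, which is the slow case described in the quoted FAQ.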

As always, don't optimize before you know speed will be a problem. Test the easiest solution first, and only optimize if the speed is unacceptable.

answered Aug 13 '13 at 21:51

davidfg4


I just tried what you suggested by inserting line by line. The speed is not too bad, but it is still not as fast as I hoped it would be. Maybe my code was not written well enough. I updated it in the question above. Do you have any suggestions? – Shar Aug 13 '13 at 22:26

Ciro Santilli OurBigBook.com · Answer 4 · 2023-07-13T18:49:44.897

cursor.execute vs cursor.executemany minimal synthetic benchmark

main.py

from pathlib import Path
import sqlite3

f = 'tmp.sqlite'
n = 10000000
# Start from a fresh database file on every run.
Path(f).unlink(missing_ok=True)
connection = sqlite3.connect(f)
# Uncomment to print every SQL statement as it is executed.
#connection.set_trace_callback(print)
cursor = connection.cursor()
cursor.execute("CREATE TABLE t (x integer)")
# Variant 1: one execute() call per row.
#for i in range(n):
#    cursor.execute(f"INSERT INTO t VALUES ({i})")
# Variant 2: a single executemany() call fed by a generator.
cursor.executemany("INSERT INTO t VALUES (?)", ((str(i),) for i in range(n)))
connection.commit()
connection.close()

Results according to time main.py:

method	storage	time (s)
`executemany`	SSD	8.6
`executemany`	`:memory:`	9.8
`executemany`	HDD	10.4
`execute`	SSD	31
`execute`	`:memory:`	29

Baseline time of for i in range(n): pass: 0.3s.

Conclusions:

executemany is about 3x faster than execute
we are not I/O bound on execute vs executemany; we are likely memory bound. This might not be very surprising given that the final tmp.sqlite file is only 115 MB, and that my SSD can handle 3 GB/s
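The arithmetic behind these conclusions can be checked with the numbers reported above. This is a rough sketch that ignores journal traffic and treats the 3 GB/s nominal figure as 3000 MB/s; the variable names are illustrative.

```python
# Figures taken from the results table and the text above.
file_size_mb = 115
executemany_ssd_s = 8.6
execute_ssd_s = 31
executemany_memory_s = 9.8
execute_memory_s = 29
ssd_nominal_mb_per_s = 3000  # 3 GB/s nominal

# Speedup of executemany over execute on each storage type.
speedup_ssd = execute_ssd_s / executemany_ssd_s
speedup_memory = execute_memory_s / executemany_memory_s

# Effective write rate of the fastest Python run versus the SSD's nominal rate.
write_rate = file_size_mb / executemany_ssd_s
fraction_of_ssd = write_rate / ssd_nominal_mb_per_s

print(f"speedup on SSD: {speedup_ssd:.1f}x, in memory: {speedup_memory:.1f}x")
print(f"write rate: {write_rate:.1f} MB/s ({fraction_of_ssd:.2%} of nominal)")
```

The effective write rate comes out at well under 1% of the SSD's nominal bandwidth, which is consistent with the runs not being limited by disk I/O.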

If I enable connection.set_trace_callback(print) to log queries, as per "How can I log queries in Sqlite3 with Python?", and reduce n to 5, I see the exact same queries for both execute and executemany:

CREATE TABLE t (x integer)
BEGIN 
INSERT INTO t VALUES ('0')
INSERT INTO t VALUES ('1')
INSERT INTO t VALUES ('2')
INSERT INTO t VALUES ('3')
INSERT INTO t VALUES ('4')
COMMIT

so it does not seem that the speed difference is linked to transactions, as both appear to execute within a single transaction. There are some comments on automatic transaction control at: https://docs.python.org/3/library/sqlite3.html#transaction-control but they are not entirely clear; I hope the logs are correct.
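One way to check the transaction observation programmatically is to collect the traced statements in a list instead of printing them. The sketch below assumes the module's default (legacy) transaction handling. The helper names `traced_statements` and `count_prefix` are hypothetical, and unlike main.py the execute variant here uses a bound parameter rather than an f-string.

```python
import sqlite3


def traced_statements(use_executemany, n=5):
    """Run the inserts on an in-memory database and return the traced SQL."""
    statements = []
    connection = sqlite3.connect(":memory:")
    connection.set_trace_callback(statements.append)
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE t (x integer)")
    if use_executemany:
        cursor.executemany(
            "INSERT INTO t VALUES (?)", ((str(i),) for i in range(n))
        )
    else:
        for i in range(n):
            cursor.execute("INSERT INTO t VALUES (?)", (str(i),))
    connection.commit()
    connection.close()
    return statements


def count_prefix(statements, prefix):
    """Count traced statements that start with the given keyword."""
    return sum(1 for s in statements if s.strip().upper().startswith(prefix))


execute_log = traced_statements(use_executemany=False)
executemany_log = traced_statements(use_executemany=True)

for name, log in (("execute", execute_log), ("executemany", executemany_log)):
    print(name, count_prefix(log, "BEGIN"), count_prefix(log, "COMMIT"))
```

If both variants report a single BEGIN and a single COMMIT, the time difference has to come from somewhere other than transaction boundaries, for example per-call overhead in the Python layer.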

Insertion time baseline

I'll be looking out for the fastest possible method I can find to compare it to Python. So far, this generate_series approach is the winner:

f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" 'create table t(x integer)'
time sqlite3 "$f" 'insert into t select value as x from generate_series(1,10000000)'

It finished in as little as 1.4 s on the SSD, which is substantially faster than any Python method so far, and still far from being SSD-bound.
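generate_series is provided by the sqlite3 command-line shell and may not be available through Python's sqlite3 module, depending on how SQLite was built. As a swapped-in alternative (not the method benchmarked above), a recursive common table expression can generate the rows inside SQLite from Python. This is a minimal sketch with a much smaller n to keep the run short; the CTE name `series` is arbitrary.

```python
import sqlite3
import time

n = 100_000  # far smaller than the 10,000,000 rows used above

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE t (x integer)")

start = time.perf_counter()
# The rows are produced by SQLite itself, so no per-row data crosses
# the Python/SQLite boundary.
connection.execute(
    """
    WITH RECURSIVE series(value) AS (
        SELECT 1
        UNION ALL
        SELECT value + 1 FROM series WHERE value < ?
    )
    INSERT INTO t SELECT value FROM series
    """,
    (n,),
)
connection.commit()
elapsed = time.perf_counter() - start

count = connection.execute("SELECT COUNT(*) FROM t").fetchone()[0]
connection.close()
print(f"{count} rows in {elapsed:.2f} s")
```

This only helps when the data can be computed in SQL; for data that originates in files, the rows still have to pass through Python.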

Tested on Ubuntu 23.04, Python 3.11.2, Lenovo ThinkPad P51

SSD: Samsung MZVLB512HAJQ-000L7 512GB SSD, 3 GB/s nominal speed
HDD: Seagate ST1000LM035-1RK1 1TB, 140 MB/s nominal speed
