cursor.execute vs cursor.executemany minimal synthetic benchmark
main.py
from pathlib import Path
import sqlite3
import csv
f = 'tmp.sqlite'
n = 10000000
Path(f).unlink(missing_ok=True)
connection = sqlite3.connect(f)
#connection.set_trace_callback(print)
cursor = connection.cursor()
cursor.execute("CREATE TABLE t (x integer)")
#for i in range(n):
#    cursor.execute(f"INSERT INTO t VALUES ({i})")
cursor.executemany(f"INSERT INTO t VALUES (?)", ((str(i),) for i in range(n)))
connection.commit()
connection.close()
Results according to time main.py:
| method | storage | time (s) | 
| executemany | SSD | 8.6 | 
| executemany | :memory: | 9.8 | 
| executemany | HDD | 10.4 | 
| execute | SSD | 31 | 
| execute | :memory: | 29 | 
 
Baseline time of for i in range(n): pass: 0.3s.
Conclusions:
- executemanyis about 3x faster
- we are not I/O bound on executevsexecutemany, we are likely memory bound. This might not be very surprising given that the finaltmp.sqlitefile is only 115 MB, and that my SSD can handle 3 GB/s
If I enable connection.set_trace_callback(print) to log queries as per: How can I log queries in Sqlite3 with Python? and reduce n = 5 I see the exact same queries for both execute and executemany:
CREATE TABLE t (x integer)
BEGIN 
INSERT INTO t VALUES ('0')
INSERT INTO t VALUES ('1')
INSERT INTO t VALUES ('2')
INSERT INTO t VALUES ('3')
INSERT INTO t VALUES ('4')
COMMIT
so it does not seem that the speed difference is linked to transactions, as both appear to execute within a single transaction. There are some comments on automatic transaction control at: https://docs.python.org/3/library/sqlite3.html#transaction-control but they are not super clear, I hope the logs are correct.
Insertion time baseline
I'll be looking out for the fastest possible method I can find to compare it to Python. So far, this generate_series approach is the winner:
f="10m.sqlite"
rm -f "$f"
sqlite3 "$f" 'create table t(x integer)'
time sqlite3 "$f" 'insert into t select value as x from generate_series(1,10000000)'
finished as fast as 1.4s on SSD, therefore substantially faster than any Python method so far, and well beyond being SSD-bound.
Tested on Ubuntu 23.04, Python 3.11.2, Lenovo ThinkPad P51
- SSD: Samsung MZVLB512HAJQ-000L7 512GB SSD, 3 GB/s nominal speed
- HDD: Seagate ST1000LM035-1RK1 1TB, 140 MB/s nominal speed