What's the best way to plot a very large pandas dataframe?

Question

I have a large pandas dataframe of shape (696, 20531) and I'd like to plot all of its values in a histogram. Using df.plot(kind='hist') seems to take forever. Is there a better way to do this?

You could probably reduce your dataframe. Or do you need to plot 696 histograms in one plot? — Anton Protopopov, Feb 01 '16 at 06:49
I need to plot all the values in that dataframe in one plot, without considering each row as a separate histogram. — jp89, Feb 01 '16 at 06:51

score 4 · Answer 1 · answered Feb 01 '16 at 07:23

Use DataFrame.stack():

import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.randn(5, 10))
print(df.to_string())

          0         1         2         3         4         5         6         7         8         9
0 -0.760559  0.317021  0.325524 -0.300139  0.800688  0.221835 -1.258592  0.333504  0.669925  1.413210
1  0.082853  0.041539  0.255321 -0.112667 -1.224011 -0.361301 -0.177064  0.880430  0.188540 -0.318600
2 -0.827121  0.261817  0.817216 -1.330318 -2.254830  0.447037  0.294458  0.672659 -1.242452  0.071862
3  1.173998  0.032700 -0.165357  0.572287  0.288606  0.261885 -0.699968 -2.864314 -0.616054  0.798000
4  2.134925  0.966877 -1.204055  0.547440  0.164349  0.704485  1.450768 -0.842088  0.195857 -0.448882

df.stack().hist()
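If stacking is still slow on the full (696, 20531) frame, one alternative is to skip the intermediate Series and flatten the underlying array directly. This is a minimal sketch, assuming every column is numeric; the random data is only a stand-in for the asker's frame, and the bin count of 50 is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data with the shape from the question (random values, not the real data).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((696, 20531)))

# Flatten every cell into a single 1-D array, ignoring rows and columns.
values = df.to_numpy().ravel()

# One histogram over all values; counts and edges are returned for inspection.
fig, ax = plt.subplots()
counts, edges, _ = ax.hist(values, bins=50)
ax.set_xlabel("value")
ax.set_ylabel("count")
# Call fig.savefig(...) or plt.show() to view the figure.
```

Like df.stack(), this treats the whole frame as one pool of values, so the result is a single histogram rather than one per column.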

@kilojoules it is the histogram of all values in the dataframe. — Stop harming Monica, Feb 01 '16 at 08:58

score 1 · Answer 2 · answered Mar 20 '19 at 10:36

Another approach is to use DataFrame.sample(), which returns a random sample of n rows and is repeatable when you set random_state. You can then plot the sample instead of the full data. Note that sample() draws rows, so n cannot exceed the number of rows unless you pass replace=True. For example, to plot 1000 rows with repeatable randomness:

df.sample(n=1000,random_state=1).plot()
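For a frame like the one in the question, which has only 696 rows, sampling rows will not give 1000 points. A possible variation, sketched here under the assumption that a single histogram of all values is wanted, is to stack first and then sample individual values. The frame below is a smaller random stand-in, and the bin count is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import numpy as np
import pandas as pd

# Stand-in data: 696 rows as in the question, fewer columns to keep the example quick.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((696, 50)))

# Stack into one long Series of values, then draw 1000 of them reproducibly.
sampled = df.stack().sample(n=1000, random_state=1)

# Histogram of the sampled values only.
ax = sampled.hist(bins=30)
```

The histogram of a sample only approximates the full distribution, so rare values in the tails may be missing.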

score 0 · Answer 3 · answered Aug 06 '22 at 22:26

Plotting large datasets with pandas is often troublesome because of the memory overhead (more on that here).

A memory-efficient way to do it is to use DuckDB. You can store your data in a .parquet file and then use SQL to compute the bins and heights for your histogram.

You can use the following snippet as a template (replace SOME_COLUMN with your column name and both occurrences of 100.0 with your bin size):

select
  floor(SOME_COLUMN/100.0)*100.0,
  count(*) as count
from 'path/to/file.parquet'
group by 1
order by 1;

Then you can pass the results to matplotlib's bar function, which takes the bin positions and heights.
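To illustrate the bin-then-bar idea without a DuckDB install or a parquet file, here is a rough sketch that swaps in NumPy for the SQL step. The arithmetic mirrors the query above (floor, then group and count), but bin_size, the random values, and the variable names are assumptions made for the example, not part of the original answer.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np

bin_size = 0.5  # hypothetical bin width; plays the role of 100.0 in the SQL template

# Stand-in for SOME_COLUMN (random values, not real data).
rng = np.random.default_rng(0)
values = rng.standard_normal(100_000)

# Same arithmetic as floor(SOME_COLUMN / bin_size) * bin_size: the left edge of each bin.
bin_starts = np.floor(values / bin_size) * bin_size

# Equivalent of "group by 1 order by 1": sorted unique bin edges and their counts.
positions, heights = np.unique(bin_starts, return_counts=True)

# Draw one bar per bin, starting at its left edge.
fig, ax = plt.subplots()
ax.bar(positions, heights, width=bin_size, align="edge")
ax.set_xlabel("value")
ax.set_ylabel("count")
```

Only the aggregated positions and heights reach matplotlib, which is the point of the approach: the plotting step handles a few dozen bars instead of every raw value. Unlike the DuckDB version, this sketch still loads all values into memory for the binning step.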

I implemented this in a new package called JupySQL. It is essentially doing what I've described with a couple of extra things. Here, you can see an example and some memory benchmarks demonstrating that this approach is much more efficient.
