SXSSFWorkbook defaults to using inline strings instead of a shared strings table. That means SXSSFWorkbook writes text directly in the sheet even if it is multiple times the same text. XSSFWorkbook and Excel's GUI both are using a shared strings table where text gets an index and the same text is stored only once and the index is used in the sheet then. But that should not have that big impact to the file size of the resulting *.xlsx.
SXSSFWorkbook, as well as all other Office Open XML formatted files, apache poi creates, were zipped using org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream. That uses deflate as compression algorithm and Deflater.DEFAULT_COMPRESSION as the default compression level. One could overwrite protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) of SXSSFWorkbook to set another compression level. But that also should not have that big impact to the file size of the resulting *.xlsx.
Example Java code:
import java.io.File;
import java.io.OutputStream;
import java.io.FileOutputStream;
import org.apache.poi.ss.usermodel.*;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.apache.commons.compress.archivers.zip.Zip64Mode;
import org.apache.commons.compress.archivers.zip.ZipArchiveOutputStream;
import java.util.zip.Deflater;
class CreateSXSSFDifferentCompression {
static SXSSFWorkbook createSXSSFWorkbook(int compressionLevel, int rowAccessWindowSize,
boolean compressTmpFiles, boolean useSharedStringsTable) {
SXSSFWorkbook workbook = null;
if (compressionLevel != Deflater.DEFAULT_COMPRESSION) {
workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable) {
protected ZipArchiveOutputStream createArchiveOutputStream(OutputStream out) {
ZipArchiveOutputStream zos = new ZipArchiveOutputStream(out);
zos.setUseZip64(Zip64Mode.AsNeeded);
zos.setLevel(compressionLevel);
return zos;
}
};
} else {
workbook = new SXSSFWorkbook(null, rowAccessWindowSize, compressTmpFiles, useSharedStringsTable);
}
return workbook;
}
public static void main(String[] args) throws Exception {
SXSSFWorkbook workbook = null;
// uses Deflater.DEFAULT_COMPRESSION and inline strings
//workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, false);
// uses Deflater.DEFAULT_COMPRESSION and shared strings table
//workbook = createSXSSFWorkbook(Deflater.DEFAULT_COMPRESSION, 500, true, true);
// uses Deflater.BEST_COMPRESSION and inline strings
workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, false);
// uses Deflater.BEST_COMPRESSION and shared strings table
//workbook = createSXSSFWorkbook(Deflater.BEST_COMPRESSION, 500, true, true);
int ROWS_COUNT = 2000;
int COLS_COUNT = 1000;
Sheet sheet = workbook.createSheet("Test Sheet 1");
for (int i = 1 ; i <= ROWS_COUNT; i++) {
Row row = sheet.createRow(i);
//System.out.println("Row " + i);
for(int j = 1; j <= COLS_COUNT; j++) {
row.createCell(j).setCellValue("Test " + i);
}
}
FileOutputStream out = new FileOutputStream("./Excel.xlsx");
workbook.write(out);
out.close();
workbook.close();
workbook.dispose();
File file = new File("./Excel.xlsx");
System.out.println(file.length());
}
}
This results in Excel.xlsx file size of:
5,031,034 bytes
when Deflater.DEFAULT_COMPRESSION and inline strings are used.
4,972,663 bytes
when Deflater.DEFAULT_COMPRESSION and shared strings table are used.
4,972,915 bytes
when Deflater.BEST_COMPRESSION and inline strings are used.
And
4,966,749 bytes
when Deflater.BEST_COMPRESSION and shared strings table are used.
Used: Java 12, apache poi 4.1.2, Ubuntu Linux.
Neither I would call that huge for a spreadsheet of 2,000 rows x 1,000 columns nor I would call the impact of the different settings big.
And the entries are compressed very good.
If you look into the Excel.xlsx ZIP archive, you will find the uncompressed size of xl/worksheets/sheet1.xml 112,380,273 bytes when inline strings are used. The uncompressed size of xl/sharedStrings.xml is 138 bytes then and only contains the very basic XML.
If shared strings table is used, then the uncompressed size of xl/worksheets/sheet1.xml is 68,377,273 bytes and the uncompressed size of xl/sharedStrings.xml is 49,045 bytes and contains 2,000 entries.
If Excel itself saves *.xlsx files, it creates files having approximately the same file size when content is equal. So Excel itself uses the same compression level.
Of course one can compressing the *.xlsx files more when storing the Excel.xlsx into a *.zip archive again. But that would not be what Excel expects to be a *.xlsx file.
Microsoft states at What are the benefits of Open XML Formats?:
Compact files Files are automatically compressed and can be up to 75 percent smaller in some cases. The Open XML Format uses zip
compression technology to store documents, offering potential cost
savings as it reduces the disk space required to store files and
decreases the bandwidth needed to send files via e-mail, over
networks, and across the Internet. When you open a file, it is
automatically unzipped. When you save a file, it is automatically
zipped again. You do not have to install any special zip utilities to
open and close files in Office.
The important parts here are:
When you open a file, it is automatically unzipped. When you save a
file, it is automatically zipped again.
That means, if apache poi would compress the files using other methods or compression levels than Microsoft Office itself, then Microsoft Office would be unable to do so with files apache poi had created.
So, since apache poi creates files which Excel (Microsoft Office) is able opening directly, it uses the same compression method and compression level as Excel (Microsoft Office) will do.