7

Context

I have received around 1700 97-2003 excel files with a ton of data including some text information. I am trying to read them all into Stata before I clean and run some statistical analysis. But, Stata is unable to open them. The files have a .xls format with large amounts of text in them which I want to keep, and convert them into .csv files (hence the csv tag).

What I've tried

  • On closer look, every time I manually open a file in Excel I get the error "Format does not match extension", I have to click on "Open anyway" and manually save it before the file can be imported into Stata. I can't possibly do this for 1700 files. I am using a windows OS and all files are saved locally on my drive.

I additionally tried to write a bulk line of code on stata to save all files as csv or atleast an xls that can be imported into stata but I have had zero luck. Everytime I try to save them, I get a corrupted csv or an xls with the exact same error that pops up as " try xlCreateBook()" on stata and the "format does not match extension" error when I manually try to open the file.

For reference, under is the stata code I wrote.

cd "$dropbox\project8\data\2008-2009\A.Monthwise\dakota"
local oldfiles : dir "." files "*.xls" , respectcase
mac list _oldfiles

foreach fn of local oldfiles { local oldfn "fn'"' local res = ustrregexm("oldfn'"', "(.*).xls", .) if `res'==1 { local newfn = "new" + ustrregexs(1) + ".dta" di "Renamingoldfn' to newfn'"' copy"./oldfn'"'"./new/`newfn'"', public } }

Question

Is there a way I can fix this extesnion/format issue for 1700 files and convert everything in bulk, such that Stata can read it?

Lorien
  • 73

2 Answers2

15

The error "format does not match extension" means that the files are not truly .xls. As Excel can still open them, they are probably of another format that is supported. My guess would be .xlsx.

You may test by using a hex-editor (such as HxD) to check the file signature, which is in the first few bytes:

  • xlsx: First 4 bytes are 50 4B 03 04

  • xls: First 8 bytes are D0 CF 11 E0 A1 B1 1A E1

To rename all the files in the current folder at once, you could use the Command Prompt (CMD) with the command:

ren *.xls *.xlsx

If your file-signature is not one of the above, please add it to your post. It will help identify these files.


As the files were HTML, not xsl or even xslx, it's possible to do mass-conversion in Excel using VBA.

The article How to convert multiple xls files to xlsx files in Excel? contains this script that worked for the poster:

Sub ConvertToXlsx()
'Updateby Extendoffice
Dim strPath As String
Dim strFile As String
Dim xWbk As Workbook
Dim xSFD, xRFD As FileDialog
Dim xSPath As String
Dim xRPath As String
Set xSFD = Application.FileDialog(msoFileDialogFolderPicker)
With xSFD
.Title = "Please select the folder contains the xls files:"
.InitialFileName = "C:\"
End With
If xSFD.Show <> -1 Then Exit Sub
xSPath = xSFD.SelectedItems.Item(1)
Set xRFD = Application.FileDialog(msoFileDialogFolderPicker)
With xRFD
.Title = "Please select a folder for outputting the new files:"
.InitialFileName = "C:\"
End With
If xRFD.Show <> -1 Then Exit Sub
xRPath = xRFD.SelectedItems.Item(1) & "\"
strPath = xSPath & "\"
strFile = Dir(strPath & "*.xls")
Application.ScreenUpdating = False
Application.DisplayAlerts = False
Do While strFile <> ""
If Right(strFile, 3) = "xls" Then
Set xWbk = Workbooks.Open(Filename:=strPath & strFile)
xWbk.SaveAs Filename:=xRPath & strFile & "x", _
FileFormat:=xlOpenXMLWorkbook
xWbk.Close SaveChanges:=False
End If
strFile = Dir
Loop
Application.DisplayAlerts = True
Application.ScreenUpdating = True
End Sub
harrymc
  • 498,455
7

As the files appear to be HTML files (a wild way to store data in my opinion) you could look into methods that can scrape/parse webpages.

Stata

I have no Stata to play with, but found the readhtml package which has a function readhtmltable that appears to read tables on web pages (not sure if it can handle a local HTML file).

Python

Alternatively, I do have access to Python and pandas, which has a read_html method. I tried it out with the file F - A & N Islands_September.xls you uploaded and it works fairly well.

For this to work you need the lxml and pandas packages in a python environment.

import pandas as pd 
# This reads in the 'xls' file (which is actually HTML)
df = pd.read_html(r"c:\path\to\F - A & N Islands_September.xls")
# The result is a list with length one, so get the actual DataFrame with
df = df[0]
# Show the first few rows:
df.head()
                  Unnamed: 0_level_0 Unnamed: 1_level_0                                                   Unnamed: 2_level_0 Unnamed: 3_level_0       District                                               
                  Unnamed: 0_level_1 Unnamed: 1_level_1                                                   Unnamed: 2_level_1 Unnamed: 3_level_1 _A & N Islands Nicobar North and Middle Andaman South Andaman
0  M1 [Ante Natal Care Services ANC]                1.1                    Total number of pregnant women Registered for ANC              TOTAL            NaN     NaN                      NaN           NaN
1  M1 [Ante Natal Care Services ANC]              1.1.1  Of which Number registered within first trimester (within 12 weeks)              TOTAL            NaN     NaN                      NaN           NaN
2  M1 [Ante Natal Care Services ANC]                1.2                        Number of Pregnant women registered under JSY              TOTAL            NaN     NaN                      NaN           NaN
3  M1 [Ante Natal Care Services ANC]                1.3   Number of pregnant women received 3 ANC check ups during pregnancy              TOTAL            NaN     NaN                      NaN           NaN
4  M1 [Ante Natal Care Services ANC]              1.4.1          Number of pregnant women given TT1 during current pregnancy              TOTAL            NaN     NaN                      NaN           NaN

To batch convert the files to CSV, you could do something like this:

from pathlib import Path
import pandas as pd

Assuming your 'XLS' files are in subfolder data next to your python script

data = Path(r"./data") data.mkdir(parents=True, exist_ok=True) results = Path(r"./results") results.mkdir(parents=True, exist_ok=True)

Loop over all XLS files

for f in data.glob("*.xls"): outfile = results / f.with_suffix('.csv').name df = pd.read_html(f)[0] df.to_csv(outfile, index=False)