I want to read a .xlsx file, do some things with the data, and convert it to a dict so I can save it in a .json file. For that I am using Python 3 and pandas.
This is the code:
import pandas as pd
import json

# Read the code columns as strings so leading zeros are preserved.
xls = pd.read_excel(
    io="20codmun.xlsx",
    converters={
        "CODAUTO": str,
        "CPRO": str,
        "CMUN": str,
        "DC": str,
    },
)
print(xls)
# print(xls.columns.values)

outDict = {}
print(len(xls["NOMBRE"]))  # 8131 rows
for i in range(len(xls.index)):
    codauto = xls["CODAUTO"][i]
    cpro = xls["CPRO"][i]
    cmun = xls["CMUN"][i]
    dc = xls["DC"][i]
    aemetId = cpro + cmun
    outDict[xls["NOMBRE"][i]] = {
        "CODAUTO": codauto,
        "CPRO": cpro,
        "CMUN": cmun,
        "DC": dc,
        "AEMET_ID": aemetId,
    }
print(i)  # 8130
print(len(outDict))  # 8114 entries, SOME ENTRIES ARE LOST!!!!!
# print(outDict["Petrer"])

with open("data.json", "w") as outFile:
    json.dump(outDict, outFile)
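Since plain dict keys must be unique, any two rows that share a NOMBRE collapse into a single entry, with later rows overwriting earlier ones. A minimal sketch with made-up miniature data (the names and codes are illustrative; "Sada", for example, is a municipality name used in more than one province) showing how to spot such duplicates:

```python
import pandas as pd

# Tiny stand-in for the municipality table: two towns share a name.
df = pd.DataFrame({
    "NOMBRE": ["Sada", "Sada", "Aljucén"],
    "CPRO": ["15", "31", "06"],
})

# Rows whose NOMBRE appears more than once are silently merged
# when NOMBRE is used as a dict key.
dupes = df[df["NOMBRE"].duplicated(keep=False)]
print(len(dupes))  # 2

out = {row.NOMBRE: {"CPRO": row.CPRO} for row in df.itertuples()}
print(len(df), len(out))  # 3 2
```

Running the same `duplicated` check on the real DataFrame would show exactly which municipalities are being dropped.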
I add here the source of the .xlsx file (published by the Spanish government). Select "Fichero con todas las provincias". You have to delete the first row of the sheet.
As you can see, the pandas.DataFrame has 8131 rows and the final loop index is 8130, but the final dict has only 8114 entries, so some data is lost! The gap is most likely municipalities that share a NOMBRE: a dict key must be unique, so later rows with the same name overwrite earlier ones.
You can check that "Aljucén" is in the .xlsx file, but not in the .json one. Edit: Solved using json.dump(outDict, outFile, ensure_ascii=False). With the default ensure_ascii=True the entry was actually in the file, just escaped as "Aljuc\u00e9n", which is why a plain-text search for the accented spelling failed.
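The effect of `ensure_ascii` can be seen directly with `json.dumps`:

```python
import json

# By default json.dump escapes non-ASCII characters, so "Aljucén"
# is written as "Aljuc\u00e9n"; ensure_ascii=False keeps it readable.
print(json.dumps({"name": "Aljucén"}))                      # {"name": "Aljuc\u00e9n"}
print(json.dumps({"name": "Aljucén"}, ensure_ascii=False))  # {"name": "Aljucén"}
```

Note that both forms decode back to the same string with `json.loads`; the difference only matters when inspecting the raw file.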