We have some internal dashboards with a PHP backend that are used for uploading CSV files. Recently we
found that some CSVs fail to parse: the fgetcsv function returns false, which is frustrating because we cannot determine the actual problem in the CSV (e.g. at which line number it fails, or which characters it cannot digest).
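To illustrate the kind of diagnostic that is missing here, the following is a minimal sketch that scans a file line by line and reports every line containing bytes outside the 7-bit ASCII range. The function name `find_suspect_lines` is hypothetical, and the sketch assumes the mbstring extension is available for `mb_check_encoding`.

```php
<?php
// Hypothetical helper: list lines that contain non-ASCII bytes,
// and say whether each such line is valid UTF-8.
function find_suspect_lines(string $path): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    $suspects = [];
    $lineNo = 0;
    while (($line = fgets($handle)) !== false) {
        $lineNo++;
        // Without the /u modifier the pattern is matched byte by byte,
        // so this finds every byte in the range 0x80-0xFF.
        if (preg_match_all('/[\x80-\xFF]/', $line, $matches) > 0) {
            $suspects[$lineNo] = [
                'valid_utf8' => mb_check_encoding($line, 'UTF-8'),
                'bytes'      => array_values(array_unique(array_map('bin2hex', $matches[0]))),
            ];
        }
    }
    fclose($handle);

    return $suspects;
}
```

Line numbers here are physical lines, so a quoted CSV field that contains a newline counts as more than one line.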
We narrowed down the problem to character-set encoding: CSVs generated from Windows machines were failing. Linux's iconv command was able to fix the CSVs for us:
iconv -c --from-code=UTF-8 --to-code=ASCII path/to/uncleaned.csv > path/to/cleaned.csv
while its PHP equivalent did not work (we tried both the //IGNORE and //TRANSLIT options):
$uncleaned_csv_text = file_get_contents($source_data_csv_filename);
$cleaned_csv_text = iconv('UTF-8', 'ASCII/IGNORE//TRANSLIT', $uncleaned_csv_text);
file_put_contents($source_data_csv_filename, $cleaned_csv_text);
..
$headers = fgetcsv($source_data_csv_filename)
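For reference, `fgetcsv` reads from a stream handle returned by `fopen`, not from a path, so the abbreviated snippet above would need an `fopen` call somewhere in the elided part. A minimal sketch of reading the header row with explicit error checks follows; the function name `read_csv_headers` is hypothetical, and the delimiter, enclosure and escape arguments are simply the long-standing defaults spelled out.

```php
<?php
// Hypothetical helper: return the first row of a CSV file, or throw with a reason.
function read_csv_headers(string $path): array
{
    // fgetcsv needs this handle rather than the file name.
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }

    // Length 0 means "no line length limit".
    $headers = fgetcsv($handle, 0, ',', '"', '\\');
    fclose($handle);

    if ($headers === false) {
        throw new RuntimeException("No row could be read from $path");
    }

    return $headers;
}
```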
While we could use PHP's exec function to run the shell command:
- it is less than ideal
- the practice is forbidden in our organisation from a security viewpoint (Travis doesn't let it pass through)
Is there any alternative way to achieve this CSV 'cleaning'?
UPDATE-1
We explored several other options, none of which worked for us:
- regex-based cleaning
- forceutf8 package
- mb_convert_encoding (as suggested by discussions)
UPDATE-2
- Upon echoing the sha1 digest of the CSV's text before and after subjecting it to PHP's iconv function, we found that iconv is not making any change
- Also, in my case, mb_check_encoding on the original CSV's text outputs true regardless of the input query: windows-1252, ascii, utf-8
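One possible reading of these two observations, offered as a guess rather than a diagnosis: text that validates as ascii contains only 7-bit bytes, and such text is also valid UTF-8 and valid Windows-1252, so a UTF-8 to ASCII conversion would have nothing to change. A minimal sketch for checking this on the raw bytes follows; the function name `describe_bytes` is hypothetical, and the mbstring extension is assumed.

```php
<?php
// Hypothetical helper: summarise what the raw bytes of a string look like.
function describe_bytes(string $text): array
{
    return [
        // Digest for comparing the text before and after a conversion.
        'sha1'       => sha1($text),
        // True when no byte is in the range 0x80-0xFF, i.e. pure 7-bit ASCII.
        'is_ascii'   => preg_match('/[\x80-\xFF]/', $text) === 0,
        // True when the bytes form well-formed UTF-8.
        'valid_utf8' => mb_check_encoding($text, 'UTF-8'),
    ];
}
```

If `is_ascii` is true for the original file, identical digests before and after iconv are the expected outcome, and the parse failure would have to come from something other than the character set.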

