Recursively process zip archives to extract files while discarding specific format of files

Question

UPDATE: I noticed that many people are viewing this thread, which makes me believe that this situation is not so rare after all. Anyway, I had asked a similar/related question on SO here, which has pretty decent solutions too which might solve the problem in a better way.

On my Windows 7 machine, I have a directory full of downloaded dumps in ZIP archives. Each archive contains few text files, PDFs and rarely XML files. I want to extract all the contents of each ZIP archive into its respective folder(must be created during the process) while discarding/ignoring extraction of PDFs. After extraction of required files from an archive, processed zip must not be deleted(or I would like to know how I can control it in different situations).

If it helps to know, the number of archives in the directory is in the range of 60k-70k. Also, I need separate output directories because files in an archive may have same names as files in other.

For example,

I have all my archives like one.zip, two.zip,.. in, say, D:\data
I create a new folder for processed data, say, D:\extracted
Now the data from D:\data\one.zip should go to D:\extracted\one. Here, D:\extracted\one should be created automatically.
During this complete uncompression process, all the encountered PDFs should not be extracted(be ignored). There's no point in extracting and then deleting.
(Optional) A log file should be maintained at, say, D:\extracted. Idea is to use this file to resume processing from where it was left in case of an error.
(Optional) Script should let me decide whether I want to keep source archives or delete them after processing.

I already did some search to find a solution but couldn't find one. I came across few questions like these

but they were not of much help(I'm not a pro with Windows by the way). I'm open to installing safe and ad free 3rd party software(open-source) like 7-zip.

EDIT: Is there a tool readily available to do what I need, I already tried Multi Unpacker. It doesn't create new directories, it can't ignore *.pdf files. It's even slow to start with, I think it first reads all the archives at source before starting to process them.

Thanks in advance!

score 3 · Accepted Answer · edited Mar 20 '17 at 10:17

Modifying the answer found here, this piece of PowerShell script should do what you want. Just save it as a file with the Extension ".ps1". When calling it, just call it as ./filename.ps1 and it will extract the files to separate folders, delete the zip files and remove all files with .pdf extension. I have not tested if it works properly with recursive paths, but it should, please test it.

Edit: If you don't want your zip files to be deleted, remove or comment out (#) the line rmdir -Path $_.FullName -Force

Requirements: PowerShell, 7-zip and for you to set the 7-zip path in the file.

param([string]$folderPath="D:\Blah\files")

Get-ChildItem $folderPath -recurse | %{ 

    if($_.Name -match "^*.`.zip$")
    {
        $parent="$(Split-Path $_.FullName -Parent)";    
        write-host "Extracting $($_.FullName) to $parent"

        $arguments=@("e", "`"$($_.FullName)`"", "-o`"$($parent)\$($_.BaseName)`"");
        $ex = start-process -FilePath "`"C:\Program Files\7-Zip\7z.exe`"" -ArgumentList $arguments -wait -PassThru;

        if( $ex.ExitCode -eq 0)
        {
            write-host "Extraction successful, deleting $($_.FullName)"
            rmdir -Path $_.FullName -Force
            $arguments1="$($parent)\$($_.BaseName)\*.pdf"
            rmdir -Recurse -Path $arguments1
        }
    }
}

Recursively process zip archives to extract files while discarding specific format of files

1 Answers1

Linked