I am a novice at coding. Can someone help with a Python script using h5py that reads through all directories and sub-directories and merges multiple h5 files into a single h5 file?
- What have you tried? Where do you get stuck? – Tom de Geus Apr 16 '18 at 07:00
- Hi Tom, create_aggregate_file.py provided on GitHub uses HDF5_utils, and such a package does not exist. There is a reference on Stack Exchange and I tried this piece of code, but it fails as it does not read through all directories and subdirectories for the .h5 extension: d_names = os.listdir(os.getcwd()); d_struct = {}; for i in d_names: f = HDF5.File(i, 'r+'); d_struct[i] = f.keys(); f.close() – Sameer Apr 16 '18 at 08:02
1 Answer
What you need is a list of all datasets in the file. A recursive function is what is needed here: it extracts all datasets from a group, and whenever one of the entries turns out to be a group itself, it recursively does the same thing until all datasets are found. For example:
/
|- dataset1
|- group1
   |- dataset2
   |- dataset3
|- dataset4
In pseudo-code, your function would look like this:
def getdatasets(key, file):
  out = []
  for name in file[key]:
    path = join(key, name)
    if file[path] is dataset: out += [path]
    else:                     out += getdatasets(path, file)
  return out
For our example:
- /dataset1 is a dataset: add the path to the output, giving out = ['/dataset1']
- /group1 is not a dataset: call getdatasets('/group1', file)
  - /group1/dataset2 is a dataset: add the path to the output, giving nested_out = ['/group1/dataset2']
  - /group1/dataset3 is a dataset: add the path to the output, giving nested_out = ['/group1/dataset2', '/group1/dataset3']
  - This is added to what we already had: out = ['/dataset1', '/group1/dataset2', '/group1/dataset3']
- /dataset4 is a dataset: add the path to the output, giving out = ['/dataset1', '/group1/dataset2', '/group1/dataset3', '/dataset4']
This list can be used to copy all data to another file.
To make a simple clone, you could do the following:
import h5py
import numpy as np
# function to return a list of paths to each dataset
def getdatasets(key,archive):
  if key[-1] != '/': key += '/'
  out = []
  for name in archive[key]:
    path = key + name
    if isinstance(archive[path], h5py.Dataset):
      out += [path]
    else:
      out += getdatasets(path,archive)
  return out
# open HDF5-files
data     = h5py.File('old.hdf5','r')
new_data = h5py.File('new.hdf5','w')
# read as many datasets as possible from the old HDF5 file
datasets = getdatasets('/',data)
# get the group names from the list of datasets
groups = list(set([i[::-1].split('/',1)[1][::-1] for i in datasets]))
groups = [i for i in groups if len(i)>0]
# sort groups based on depth
idx    = np.argsort(np.array([len(i.split('/')) for i in groups]))
groups = [groups[i] for i in idx]
# create all groups that contain datasets that will be copied
for group in groups:
  new_data.create_group(group)
# copy datasets
for path in datasets:
  # - get group name
  group = path[::-1].split('/',1)[1][::-1]
  # - minimum group name
  if len(group) == 0: group = '/'
  # - copy data
  data.copy(path, new_data[group])
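The script above never closes the files; as a small follow-up (a minimal sketch that continues directly from the code above), you can close both files and then verify the clone:
# close both files so that everything is flushed to disk
data.close()
new_data.close()

# reopen both files read-only and check that the same datasets are present
with h5py.File('old.hdf5','r') as old, h5py.File('new.hdf5','r') as new:
  assert sorted(getdatasets('/',old)) == sorted(getdatasets('/',new))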
Further customizations are, of course, possible depending on what you want. You describe some combination of several files. In that case you would have to open the output file in append mode,
 new_data = h5py.File('new.hdf5','a')
and probably prepend something to each path, for example the name of the source file, so that datasets from different files do not collide; a sketch along these lines follows.
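For the original question of merging many .h5 files spread over directories and sub-directories, a minimal sketch could look like the one below. It reuses the getdatasets function from above; the output name merged.hdf5, the starting directory '.', and the choice to place each file's contents under a group named after the file are assumptions you will want to adapt.
import os
import h5py

# return a list of paths to all datasets in "archive", recursing into groups
def getdatasets(key, archive):
  if key[-1] != '/': key += '/'
  out = []
  for name in archive[key]:
    path = key + name
    if isinstance(archive[path], h5py.Dataset):
      out += [path]
    else:
      out += getdatasets(path, archive)
  return out

# open (or create) the merged output file in append mode
with h5py.File('merged.hdf5','a') as merged:
  # walk through all directories and sub-directories looking for HDF5 files
  for root, dirs, files in os.walk('.'):
    for fname in files:
      if not fname.endswith(('.h5', '.hdf5')):
        continue
      fpath = os.path.join(root, fname)
      # skip the output file itself
      if os.path.abspath(fpath) == os.path.abspath('merged.hdf5'):
        continue
      with h5py.File(fpath,'r') as source:
        # put everything from this file under a group named after the file,
        # so that identically named datasets from different files do not collide
        prefix = os.path.splitext(fname)[0]
        for path in getdatasets('/', source):
          # create the destination group (intermediate groups are created as needed)
          group = merged.require_group(prefix + path.rsplit('/',1)[0])
          # copy the dataset (including its attributes) into that group
          source.copy(path, group)
If several input files share the same name in different directories, the prefixes would collide; in that case you could build the prefix from the relative path of the file instead of just its name.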
 
    
