Problem
Hello. I am trying to use Hugging Face to do some malware classification. I have 5738 malware binaries in a directory. The paths to these binaries are stored in a list called files. I am trying to load these binaries into a Hugging Face datasets.Dataset object.
I created the Dataset like this:
dataset = datasets.Dataset.from_text(
    files,
    sample_by="document",
    encoding="latin1",
)
Since each file is supposed to represent a single instance, I used sample_by="document", which to my knowledge (confirmed by reading the source code) should treat each file in files as an individual example.
Strangely, the length of files and the number of rows in the resulting dataset are not the same:
>>> dataset.num_rows, len(files)
(27967, 5738)
I expected each file in files to map to exactly one row in dataset, but the dataset has roughly five times as many rows as there are files. Any idea what is going on here? Thanks!
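For comparison, here is a minimal sketch of a cross-check that bypasses the text loader entirely: read each file as bytes, decode it as latin1 (which maps every byte to a character, so decoding cannot fail), and collect exactly one entry per path. The helper name read_documents and the column names "path" and "text" are made up for illustration and are not part of the datasets API.

```python
from pathlib import Path


def read_documents(paths, encoding="latin1"):
    """Hypothetical helper: build one (path, text) entry per file.

    Each file is read whole as bytes and decoded in one piece, so the
    number of entries always equals the number of paths, regardless of
    any newline or NUL bytes inside the binaries.
    """
    columns = {"path": [], "text": []}
    for p in paths:
        raw = Path(p).read_bytes()
        columns["path"].append(str(p))
        columns["text"].append(raw.decode(encoding))
    return columns
```

The resulting dict of equal-length lists can be passed to datasets.Dataset.from_dict(read_documents(files)), which should give a dataset whose num_rows equals len(files). This holds every file in memory at once, so it is meant as a sanity check on the row count rather than as a replacement for from_text.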
Software
- datasets 2.12.0
- Python 3.10.6
- CentOS 9
 
    