Questions tagged [huggingface-datasets]

Use this tag for questions related to the datasets project from Hugging Face. [Project on GitHub][1]

[1]: https://github.com/huggingface/datasets

221 questions
                    
10 votes, 2 answers
How do I save a Huggingface dataset?
How do I write a HuggingFace dataset to disk? I have made my own HuggingFace dataset using a JSONL file:
Dataset({
    features: ['id', 'text'],
    num_rows: 18
})
I would like to persist the dataset to disk. Is there a preferred way to do this? Or, is…
asked by Campbell Hutcheson
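A minimal sketch of how a dataset built from a JSONL file could be persisted and reloaded with `save_to_disk`/`load_from_disk` (the file and directory paths here are placeholders):

```python
from datasets import load_dataset, load_from_disk

# Build a Dataset from a JSONL file ("data.jsonl" is a placeholder path).
ds = load_dataset("json", data_files="data.jsonl", split="train")

# Persist the Arrow-backed dataset to a directory on disk.
ds.save_to_disk("my_dataset")

# Later, reload it without re-parsing the original JSONL.
ds = load_from_disk("my_dataset")
print(ds)
```

Exporting back to a plain file with `ds.to_json("out.jsonl")` is another option when a JSONL copy is preferred over the Arrow directory.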
9 votes, 1 answer
Convert pandas dataframe to DatasetDict
I cannot find anywhere how to convert a pandas dataframe to type datasets.dataset_dict.DatasetDict, for optimal use in a BERT workflow with a huggingface model. Take these simple dataframes, for example.
train_df = pd.DataFrame({
     "label" : [1,…
asked by ADF
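One common way to get a `DatasetDict` from pandas is to wrap each split with `Dataset.from_pandas`; a small sketch with toy dataframes (column names are only illustrative):

```python
import pandas as pd
from datasets import Dataset, DatasetDict

train_df = pd.DataFrame({"text": ["good movie", "bad movie"], "label": [1, 0]})
test_df = pd.DataFrame({"text": ["great plot"], "label": [1]})

# Convert each dataframe to a Dataset, then group the splits in a DatasetDict.
dataset = DatasetDict({
    "train": Dataset.from_pandas(train_df, preserve_index=False),
    "test": Dataset.from_pandas(test_df, preserve_index=False),
})
print(dataset)
```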
6 votes, 1 answer
StableDiffusion Colab - How to "make sure you're logged in with `huggingface-cli login`?"
I'm trying to run the Colab example of the Huggingface StableDiffusion generative text-to-image…
asked by Twenkid
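For notebook environments such as Colab, one common way to satisfy that login check is `huggingface_hub.notebook_login` (the token string below is a placeholder):

```python
# Authenticate so gated model weights can be downloaded from within the notebook.
from huggingface_hub import notebook_login

notebook_login()  # prompts for an access token from https://huggingface.co/settings/tokens

# Non-interactive alternative:
# from huggingface_hub import login
# login(token="hf_...")  # placeholder token
```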
6 votes, 1 answer
How do I convert Pandas DataFrame to a Huggingface Dataset object?
I have the following df:
import pandas as pd
df = pd.DataFrame({"foo": ["bar", "baz"]})
How do I convert to a Huggingface Dataset?
asked by Vincent Claes
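A minimal sketch of this conversion using `Dataset.from_pandas`:

```python
import pandas as pd
from datasets import Dataset

df = pd.DataFrame({"foo": ["bar", "baz"]})

# Each dataframe column becomes a dataset feature.
ds = Dataset.from_pandas(df)
print(ds)  # Dataset({features: ['foo'], num_rows: 2})
```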
5 votes, 3 answers
Add new column to a HuggingFace dataset
My dataset has 5,000,000 rows, and I would like to add a column called 'embeddings' to it.
dataset = dataset.add_column('embeddings', embeddings)
The variable embeddings is a numpy memmap array of size (5000000, 512).
But I get this…
asked by albero
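A toy-sized sketch of `add_column`; the random array here stands in for the memmap in the question, and for millions of rows a batched `Dataset.map` that slices the memmap per batch may be more practical than materializing Python lists:

```python
import numpy as np
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"]})
embeddings = np.random.rand(3, 512).astype("float32")  # stand-in for the memmap

# add_column expects a column-like sequence: one row (a list of floats) per example.
ds = ds.add_column("embeddings", [row.tolist() for row in embeddings])
print(ds)
```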
5 votes, 1 answer
How to convert tokenized words back to the original ones after inference?
I'm writing an inference script for an already trained NER model, but I have trouble converting encoded tokens (their ids) back into the original words.
# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks…
asked by deonardo_licaprio
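With a fast tokenizer, `word_ids()` gives the mapping from each (sub)token back to the word it came from, which is one common way to recover the original words after NER inference; a sketch assuming a `bert-base-cased` checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # assumed checkpoint
words = "Amazon and Tesla are currently the best picks".split()

enc = tokenizer(words, is_split_into_words=True)
word_ids = enc.word_ids()  # one entry per token, None for special tokens

# Map each (sub)token back to the original word it came from.
for token, wid in zip(tokenizer.convert_ids_to_tokens(enc["input_ids"]), word_ids):
    print(token, "->", words[wid] if wid is not None else None)
```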
4 votes, 1 answer
Labeling model with huggingface Dataset
I have the following code
from scipy.spatial.distance import dice, directed_hausdorff
from sklearn.metrics import f1_score
from segments import SegmentsClient
from segments import SegmentsDataset
from datasets import load_dataset
from…
asked by Norhther
4 votes, 1 answer
How to drop sentences that are too long in Huggingface?
I'm going through the Huggingface tutorial, and it appears that the library has automatic truncation to cut sentences that are too long, based on a max value or other criteria.
How can I remove sentences for the same reason (sentences are too long,…
asked by Penguin
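Instead of truncating, overly long examples can be dropped with `Dataset.filter`; a sketch assuming a dataset with a "text" column and a `bert-base-uncased` tokenizer:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed checkpoint
ds = load_dataset("imdb", split="train")  # assumed dataset with a "text" column
max_length = 128

# Keep only the examples whose tokenized length fits within the limit.
ds = ds.filter(lambda ex: len(tokenizer(ex["text"])["input_ids"]) <= max_length)
print(ds)
```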
4 votes, 0 answers
max_steps and generative dataset huggingface
I am fine-tuning a model on my domain using both MLM and NSP. I am using TextDatasetForNextSentencePrediction for NSP and DataCollatorForLanguageModeling for MLM.
The problem is with TextDatasetForNextSentencePrediction, as it loads everything in…
asked by Prasanna
3 votes, 1 answer
How to use sample_by="document" argument with load_dataset in Huggingface Dataset?
Problem
Hello. I am trying to use huggingface to do some malware classification. I have 5738 malware binaries in a directory. The paths to these malware binaries are stored in a list called files. I am trying to load these binaries into a…
asked by Luke Kurlandski
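For the generic "text" builder, recent versions of `datasets` accept `sample_by="document"`, which yields one example per file instead of one per line; a sketch with placeholder file paths (the question's raw binaries would first need a text representation):

```python
from datasets import load_dataset

files = ["samples/mal_001.txt", "samples/mal_002.txt"]  # placeholder paths

# sample_by="document" makes each file a single example rather than one example per line.
ds = load_dataset("text", data_files=files, sample_by="document", split="train")
print(ds)
```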
3 votes, 1 answer
How to create a dataset object for multiple text inputs to the SetFit model?
The SetFit library accepts two inputs: "text" and "label" (https://huggingface.co/blog/setfit).
My goal is to train SetFit using two similarity inputs with a binary label (similar or not similar): ("text1", "text2", "similar/not")
The example of dataset…
asked by wenz
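Building the underlying dataset with two text columns is straightforward with `Dataset.from_dict`; how SetFit consumes a pair of columns depends on the SetFit version and trainer configuration, so the column names below are only illustrative:

```python
from datasets import Dataset

# Toy pair-classification data with two text columns and a binary label.
pairs = {
    "text1": ["the cat sat", "stock prices fell"],
    "text2": ["a cat was sitting", "the weather is nice"],
    "label": [1, 0],  # 1 = similar, 0 = not similar
}
ds = Dataset.from_dict(pairs)
print(ds)
```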
3 votes, 1 answer
Using huggingface load_dataset in Google Colab notebook
I am trying to load a training dataset in my Google Colab notebook but keep getting an error. This happens exclusively in Colab, since when I run the same notebook in VS Code there is no problem in loading.
Here is the code snippet which returns the…
asked by Luiz Felipe Bromfman
3 votes, 1 answer
Cast features to ClassLabel
I have a dataset of type dictionary, which I converted to a Dataset:
ds = datasets.Dataset.from_dict(bio_dict)
The shape now is:
Dataset({
    features: ['id', 'text', 'ner_tags', 'input_ids', 'attention_mask', 'label'],
    num_rows: 8805
})
When I…
asked by Yana
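Two commonly used paths for this: `cast_column` with an explicit `ClassLabel` for integer labels, or `class_encode_column` for string labels; a small sketch with made-up label names:

```python
from datasets import Dataset, ClassLabel

ds = Dataset.from_dict({"text": ["a", "b"], "label": [0, 1]})

# Cast an integer label column to a ClassLabel feature with explicit names.
ds = ds.cast_column("label", ClassLabel(names=["O", "B-ENT"]))
print(ds.features["label"])

# For a string label column, class_encode_column builds the mapping automatically:
# ds = ds.class_encode_column("label")
```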
3 votes, 0 answers
Huggingface datasets storing and loading image data
I have a huggingface dataset with an image column:
ds["image"][0]
When I save it to disk and load it later, I get the image column as…
asked by Vincent Claes
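One way to keep an image column decodable after a save/load round trip is to cast it to the `Image` feature type; a sketch with placeholder image paths (exact on-disk behavior can vary between `datasets` versions):

```python
from datasets import Dataset, Image, load_from_disk

# Placeholder paths; casting to Image() makes the column decode to PIL images.
ds = Dataset.from_dict({"image": ["imgs/cat.png", "imgs/dog.png"]})
ds = ds.cast_column("image", Image())

ds.save_to_disk("image_ds")
reloaded = load_from_disk("image_ds")
print(reloaded["image"][0])  # a PIL.Image object rather than a plain path
```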
3 votes, 1 answer
Predict over a whole dataset using Transformers
I'm trying to do zero-shot classification over a dataset with 5000 records. Right now I'm using a normal Python loop, but it is going painfully slow. Is there a way to speed up the process using Transformers or Datasets structures? This is how my code…
asked by ignacioct
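A common way to avoid the per-record Python loop is to stream the dataset column through the pipeline with a batch size; a sketch assuming a zero-shot pipeline and a dataset with a "text" column:

```python
from datasets import load_dataset
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")  # assumed model
ds = load_dataset("imdb", split="test[:100]")  # assumed dataset with a "text" column
labels = ["positive", "negative"]

# Feeding the whole column lets the pipeline batch internally instead of looping per record.
for out in clf(KeyDataset(ds, "text"), candidate_labels=labels, batch_size=16):
    print(out["labels"][0])
```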