I have hundreds of CSV files that I want to process in the same way. For simplicity, assume they are all in ./data/01_raw/ (e.g. ./data/01_raw/1.csv, ./data/01_raw/2.csv, etc.). I would rather not give each file its own catalog entry and track them individually when building my pipeline. Is there any way to read all of them in bulk by specifying something in the catalog.yml file?
1 Answer
You are looking for PartitionedDataSet. In your example, the catalog.yml might look like this:
my_partitioned_dataset:
  type: "PartitionedDataSet"
  path: "data/01_raw"
  dataset: "pandas.CSVDataSet"
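With that entry in place, Kedro passes the dataset to a node as a dictionary mapping each partition id (the file name relative to `path`) to a zero-argument load callable. A common pattern is to load every partition and concatenate the results. Below is a minimal sketch of such a node; the function name `concat_partitions` and the demo data are illustrative, and the demo uses plain lambdas in place of Kedro's real loaders so it runs without a Kedro project:

```python
import pandas as pd


def concat_partitions(partitions: dict) -> pd.DataFrame:
    """Combine all CSV partitions into a single DataFrame.

    `partitions` maps partition id (e.g. "1.csv") to a callable
    that loads that partition when invoked.
    """
    frames = [load() for _, load in sorted(partitions.items())]
    return pd.concat(frames, ignore_index=True)


# Standalone demo with fake partitions standing in for Kedro's loaders:
fake_partitions = {
    "1.csv": lambda: pd.DataFrame({"x": [1, 2]}),
    "2.csv": lambda: pd.DataFrame({"x": [3]}),
}
combined = concat_partitions(fake_partitions)
print(len(combined))  # 3 rows total
```

In a pipeline, you would wire `my_partitioned_dataset` as the node's input, and Kedro supplies the dictionary automatically; partitions are loaded lazily, only when their callable is invoked.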