Hugging Face (2): The Datasets Library
https://huggingface.co/docs/datasets/index

A Dataset provides fast random access to the rows, and memory-mapping so that loading even large datasets only uses a relatively small amount of device memory. But for really, really big datasets (> 100 GB) that won't even fit on disk or in memory, an IterableDataset allows you to access and use the dataset without waiting for it to download completely!

https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt#streaming-datasets
https://huggingface.co/docs/datasets/about_mapstyle_vs_iterable

1. Loading

1.1 Example: loading a JSON file

```python
## Download and decompress into the local datasets/ folder beforehand:
# wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-train.json.gz
# wget https://github.com/crux82/squad-it/raw/master/SQuAD_it-test.json.gz
# gzip -dkv SQuAD_it-*.json.gz

from datasets import load_dataset

squad_it_dataset = load_dataset("json", data_files="./datasets/SQuAD_it-train.json", field="data")
# The field argument is specific to the JSON loader: it names the key in the
# JSON file that holds the actual data.

squad_it_dataset  # a single file is loaded as the train split by default
squad_it_dataset.keys()
# dict_keys(['train'])

squad_it_dataset
# DatasetDict({
#     train: Dataset({
#         features: ['title', 'paragraphs'],
#         num_rows: 442
#     })
# })
```

1.2 Other ways to load JSON

```python
# (1) Load both splits at once
data_files = {"train": "SQuAD_it-train.json", "test": "SQuAD_it-test.json"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

# (2) Load the gzip-compressed files directly
data_files = {"train": "SQuAD_it-train.json.gz", "test": "SQuAD_it-test.json.gz"}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")

# (3) Load from remote URLs
url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {
    "train": url + "SQuAD_it-train.json.gz",
    "test": url + "SQuAD_it-test.json.gz",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```

`data_files` can also map each split to more than one file (see the sketch below): https://huggingface.co/docs/datasets/loading

...
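A minimal sketch of the multi-file form: each split may be given a list of paths (the shard file names below are hypothetical, chosen only for illustration).

```python
from datasets import load_dataset

# Hypothetical shard names for illustration; a split may list several files,
# and they are concatenated into that single split on load.
data_files = {
    "train": ["SQuAD_it-train-part1.json", "SQuAD_it-train-part2.json"],
    "test": "SQuAD_it-test.json",
}
squad_it_dataset = load_dataset("json", data_files=data_files, field="data")
```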
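The IterableDataset streaming mode mentioned at the top of this section works with the same remote files. A minimal sketch, assuming the JSON loader's `field` argument behaves in streaming mode the same way it does above:

```python
from datasets import load_dataset

url = "https://github.com/crux82/squad-it/raw/master/"
data_files = {"train": url + "SQuAD_it-train.json.gz"}

# streaming=True returns an IterableDataset: nothing is downloaded up front,
# and examples are fetched lazily as you iterate.
streamed = load_dataset("json", data_files=data_files, field="data", streaming=True)

# An IterableDataset has no random access (streamed["train"][0] is not allowed);
# consume it by iteration instead.
first_example = next(iter(streamed["train"]))
print(first_example.keys())
```

This trades random access for constant memory use, which is exactly the point of the "> 100 GB" scenario quoted at the start of the section.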