[Bug]: With checkpointing enabled, when the configured pipeline reaches its last operator and np > left samples, writing the checkpoint fails.
Before Reporting
I have pulled the latest code of the main branch and run it again; the bug still exists.
I have read the README carefully and no error occurred during installation. (Otherwise, we recommend asking a question using the Question template.)
Search before reporting
I have searched the Data-Juicer issues and found no similar bug reports.
OS
Ubuntu
Installation Method
From source
Data-Juicer Version
v1.0.2
Python Version
3.10
Describe the bug
With checkpointing enabled (use_checkpoint: true), when the configured pipeline reaches its last operator and only 0 or 1 samples remain (i.e. np > left samples), writing the checkpoint raises an error.
To Reproduce
Run: python tools/process_data.py --config configs/demo/process.yaml
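The failure does not look specific to Data-Juicer's operators: judging from the tracebacks below, it can likely be reproduced with the datasets library alone by saving a dataset that has fewer rows than num_proc. A minimal sketch (the toy data and output path are made up; the behavior assumes the datasets version from the logs):

from datasets import Dataset

# One row left, as in the 1-sample case in the logs below.
ds = Dataset.from_dict({"text": ["only one sample left"]})

# num_proc=2 requests 2 shards for a 1-row dataset; on the affected version
# this raises "IndexError: Index 1 out of range for dataset of size 1".
# With an empty dataset, the pool workers instead die with
# "pyarrow.lib.ArrowInvalid: Must pass at least one table".
ds.save_to_disk("/tmp/ckpt_demo", num_proc=2)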
Configs
# Process config example for dataset

# global parameters
project_name: 'demo-process'
dataset_path: "/xxx/data_juicer_main/ours/our_test_dataset/original_dataset.jsonl"  # path to your dataset directory or file
np: 2  # number of subprocesses to process your dataset
ds_cache_dir: '/xxx/data_juicer_main/huggingface'
export_path: './outputs/demo-process/demo-processed.jsonl'
op_fusion: false
use_checkpoint: true
#fusion_strategy: 'probe'

# process schedule
# a list of several process operators with their arguments
process:
  - clean_email_mapper:
  - clean_links_mapper:
  - clean_ip_mapper:
  - remove_repeat_sentences_mapper:
  - language_id_score_filter:      # keep text in a specific language with a language score above min_score
      lang: en                     # keep text in what language
      min_score: 0.8
  - words_num_filter:              # filter text whose number of words is out of the specified range
      lang: en                     # sample in which language
      tokenization: false          # whether to use a model to tokenize documents
      min_num: 10                  # the min number of the filter range
      max_num: 10000
  - maximum_line_length_filter:    # filter text whose maximum line length is out of the specified range
      min_len: 10                  # the min length of the filter range
      max_len: 10000
  - perplexity_filter:
      lang: en
      max_ppl: 8000
  - document_simhash_deduplicator:
      tokenization: space
      window_size: 6
      lowercase: true
      ignore_pattern: '\p{P}'
      num_blocks: 6
      hamming_distance: 4
  - document_deduplicator:         # deduplicate text samples using md5-hash exact matching
      lowercase: false             # whether to convert text to lower case
      ignore_non_character: false
  - random_selector:               # selector to randomly select samples
      select_ratio: 0.5            # the ratio to be sampled
      select_num: 10
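A note on the two settings that interact here: np: 2 is reused as num_proc for the checkpoint save (see ckpt_utils.py in the traceback below). Setting np: 1 would likely avoid the crash for the 1-sample case, since a single process writes a single shard, but probably not for the 0-sample case, and it slows every operator down; this is a stopgap guess, not a fix:

np: 1  # single-process stopgap: one shard per checkpoint save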
Logs
Error when 1 sample remains:
2024-07-08 05:07:09 | INFO | data_juicer.core.data:225 - [11/11] OP [random_selector] Done in 0.709s. Left 1 samples.
2024-07-08 05:07:09 | INFO | data_juicer.core.data:241 - Writing checkpoint of dataset processed by last op...
Saving the dataset (0/2 shards): 0%| | 0/1 [00:00<?, ? examples/s]
2024-07-08 05:07:09 | ERROR | __main__:19 - An error has been caught in function '<module>', process 'MainProcess' (1378079), thread 'MainThread' (140001744877376):
Traceback (most recent call last):
File "/data/data_juicer_main/tools/process_data.py", line 15, in main
executor.run()
│ └ <function Executor.run at 0x7f53718f9120>
└ <data_juicer.core.executor.Executor object at 0x7f5370f02680>
File "/data/data_juicer_main/data_juicer/core/executor.py", line 196, in run
dataset = dataset.process(
│ └ <function NestedDataset.process at 0x7f53718db490>
└ Dataset({
features: ['text', 'meta', 'dj__stats'],
num_rows: 11
})
File "/data/data_juicer_main/data_juicer/core/data.py", line 244, in process
checkpointer.save_ckpt(dataset)
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ │ num_rows: 1
│ │ })
│ └ <function CheckpointManager.save_ckpt at 0x7f53718f8940>
└ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f5370edd510>
File "/data/data_juicer_main/data_juicer/utils/ckpt_utils.py", line 124, in save_ckpt
ds.save_to_disk(self.ckpt_ds_dir, num_proc=self.num_proc)
│ │ │ │ │ └ 2
│ │ │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f5370edd510>
│ │ │ └ '/data/NewDataPilotEngine/data_juicer_main/outputs/demo-process/ckpt/latest'
│ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f5370edd510>
│ └ <function Dataset.save_to_disk at 0x7f53b5c013f0>
└ Dataset({
features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
num_rows: 1
})
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1547, in save_to_disk
for job_id, done, content in iflatmap_unordered(
└ <function iflatmap_unordered at 0x7f53b5ceee60>
File "/data/xxx/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 698, in iflatmap_unordered
async_results = [
File "/data/xxx/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 698, in
async_results = [
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1536, in
"shard": self.shard(num_shards=num_shards, index=shard_idx, contiguous=True),
│ │ │ └ 1
│ │ └ 2
│ └ <function Dataset.shard at 0x7f53b5c14040>
└ Dataset({
features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
num_rows: 1
})
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 4716, in shard
return self.select(
│ └ <function NestedDataset.select at 0x7f53718db6d0>
└ Dataset({
features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
num_rows: 1
})
File "/data/data_juicer_main/data_juicer/core/data.py", line 367, in select
return nested_obj_factory(super().select(*args, **kargs))
│ │ └ {'indices': range(1, 1), 'keep_in_memory': False, 'indices_cache_file_name': None, 'writer_batch_size': 1000}
│ └ ()
└ <function nested_obj_factory at 0x7f53718daf80>
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'indices': range(1, 1), 'keep_in_memory': False, 'indices_cache_file_name': None, 'writer_batch_size': 1000}
│ │ │ └ ()
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ │ num_rows: 1
│ │ })
│ └ <function Dataset.select at 0x7f53b5c035b0>
└ typing.Union
File "/data/xxx/lib/python3.10/site-packages/datasets/fingerprint.py", line 442, in wrapper
out = func(dataset, *args, **kwargs)
│ │ │ └ {'indices': range(1, 1), 'keep_in_memory': False, 'indices_cache_file_name': None, 'writer_batch_size': 1000, 'new_fingerprin...
│ │ └ ()
│ └ Dataset({
│ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ num_rows: 1
│ })
└ <function Dataset.select at 0x7f53b5c03520>
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3866, in select
return self._select_contiguous(start, length, new_fingerprint=new_fingerprint)
│ │ │ │ └ '988d5b9afaef6a64'
│ │ │ └ 0
│ │ └ 1
│ └ <function Dataset._select_contiguous at 0x7f53b5c03640>
└ Dataset({
features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
num_rows: 1
})
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 560, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
│ │ │ │ └ {'new_fingerprint': '988d5b9afaef6a64'}
│ │ │ └ (1, 0)
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ │ num_rows: 1
│ │ })
│ └ <function Dataset._select_contiguous at 0x7f53b5c03760>
└ typing.Union
File "/data/xxx/lib/python3.10/site-packages/datasets/fingerprint.py", line 442, in wrapper
out = func(dataset, *args, **kwargs)
│ │ │ └ {'new_fingerprint': '988d5b9afaef6a64'}
│ │ └ (1, 0)
│ └ Dataset({
│ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ num_rows: 1
│ })
└ <function Dataset._select_contiguous at 0x7f53b5c036d0>
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3926, in _select_contiguous
_check_valid_indices_value(start, len(self))
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats', '__dj__simhash', '__dj__hash'],
│ │ num_rows: 1
│ │ })
│ └ 1
└ <function _check_valid_indices_value at 0x7f53b5c00820>
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 622, in _check_valid_indices_value
raise IndexError(f"Index {index} out of range for dataset of size {size}.")
IndexError: Index 1 out of range for dataset of size 1.
Error when 0 samples remain:
2024-07-08 00:02:32 | INFO | data_juicer.core.data:225 - [10/11] OP [document_deduplicator] Done in 0.714s. Left 0 samples.
2024-07-08 00:02:33 | INFO | data_juicer.core.data:225 - [11/11] OP [random_selector] Done in 0.799s. Left 0 samples.
2024-07-08 00:02:33 | INFO | data_juicer.core.data:241 - Writing checkpoint of dataset processed by last op...
Saving the dataset (0/2 shards): 0 examples [00:00, ? examples/s]Process ForkPoolWorker-25:
Traceback (most recent call last):
File "/data/xxx/lib/python3.10/site-packages/multiprocess/process.py", line 315, in _bootstrap
self.run()
File "/data/xxx/lib/python3.10/site-packages/multiprocess/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/datapilotengine/lib/python3.10/site-packages/multiprocess/pool.py", line 114, in worker
task = get()
File "/data/xxx/lib/python3.10/site-packages/multiprocess/queues.py", line 371, in get
return _ForkingPickler.loads(res)
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 327, in loads
return load(file, ignore, **kwds)
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 313, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 525, in load
obj = StockUnpickler.load(self)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1318, in setstate
table = self._concat_blocks_horizontally_and_vertically(blocks)
Process ForkPoolWorker-26:
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1351, in _concat_blocks_horizontally_and_vertically
return cls._concat_blocks(pa_tables_to_concat_vertically, axis=0)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1331, in _concat_blocks
return pa.concat_tables(pa_tables, promote_options="default")
File "pyarrow/table.pxi", line 6256, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Must pass at least one table
Traceback (most recent call last):
File "/data/xxx/lib/python3.10/site-packages/multiprocess/process.py", line 315, in _bootstrap
self.run()
File "/data/xxx/lib/python3.10/site-packages/multiprocess/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/data/xxx/lib/python3.10/site-packages/multiprocess/pool.py", line 114, in worker
task = get()
File "/data/xxx/lib/python3.10/site-packages/multiprocess/queues.py", line 371, in get
return _ForkingPickler.loads(res)
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 327, in loads
return load(file, ignore, **kwds)
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 313, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/data/xxx/lib/python3.10/site-packages/dill/_dill.py", line 525, in load
obj = StockUnpickler.load(self)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1318, in setstate
table = self._concat_blocks_horizontally_and_vertically(blocks)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1351, in _concat_blocks_horizontally_and_vertically
return cls._concat_blocks(pa_tables_to_concat_vertically, axis=0)
File "/data/xxx/lib/python3.10/site-packages/datasets/table.py", line 1331, in _concat_blocks
return pa.concat_tables(pa_tables, promote_options="default")
File "pyarrow/table.pxi", line 6256, in pyarrow.lib.concat_tables
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Must pass at least one table
Saving the dataset (0/2 shards): 0 examples [00:00, ? examples/s]
2024-07-08 00:02:33 | ERROR | __main__:19 - An error has been caught in function '<module>', process 'MainProcess' (1219391), thread 'MainThread' (139747673184064):
Traceback (most recent call last):
File "/data/data_juicer_main/tools/process_data.py", line 15, in main
executor.run()
│ └ <function Executor.run at 0x7f1845cd9120>
└ <data_juicer.core.executor.Executor object at 0x7f18452126b0>
File "/data/data_juicer_main/data_juicer/core/executor.py", line 196, in run
dataset = dataset.process(
│ └ <function NestedDataset.process at 0x7f1845cbb490>
└ Dataset({
features: ['text', 'meta'],
num_rows: 5
})
File "/data/data_juicer_main/data_juicer/core/data.py", line 244, in process
checkpointer.save_ckpt(dataset)
│ │ └ Dataset({
│ │ features: ['text', 'meta', 'dj__stats'],
│ │ num_rows: 0
│ │ })
│ └ <function CheckpointManager.save_ckpt at 0x7f1845cd8940>
└ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f18452d15d0>
File "/data/data_juicer_main/data_juicer/utils/ckpt_utils.py", line 124, in save_ckpt
ds.save_to_disk(self.ckpt_ds_dir, num_proc=self.num_proc)
│ │ │ │ │ └ 2
│ │ │ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f18452d15d0>
│ │ │ └ '/data/NewDataPilotEngine/data_juicer_main/outputs/demo-process/ckpt/latest'
│ │ └ <data_juicer.utils.ckpt_utils.CheckpointManager object at 0x7f18452d15d0>
│ └ <function Dataset.save_to_disk at 0x7f188de9d3f0>
└ Dataset({
features: ['text', 'meta', 'dj__stats'],
num_rows: 0
})
File "/data/xxx/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1547, in save_to_disk
for job_id, done, content in iflatmap_unordered(
└ <function iflatmap_unordered at 0x7f188df8ae60>
File "/data/xxx/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 711, in iflatmap_unordered
raise RuntimeError(
RuntimeError: One of the subprocesses has abruptly died during map operation.To debug the error, disable multiprocessing.
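Reading the two tracebacks together: save_to_disk(num_proc=2) splits the checkpoint into 2 shards, so with 1 remaining sample, shard index 1 becomes select(range(1, 1)) and its start index 1 fails the bounds check against a dataset of size 1; with 0 remaining samples, the pool workers receive no Arrow tables to concatenate at all. A possible guard at the failing call site (data_juicer/utils/ckpt_utils.py, line 124 in the traceback) could clamp the shard count. This is only a sketch of the idea, not the project's actual fix; it assumes save_to_disk behaves correctly once num_proc does not exceed the row count:

# Sketch only: clamp num_proc before the save_to_disk call shown in the traceback.
# The surrounding method body is abbreviated; only the failing call is changed.
def save_ckpt(self, ds):
    # never request more shards than there are rows; fall back to a single
    # process for tiny or empty datasets (whether an empty dataset saves
    # cleanly even with num_proc=1 is untested here)
    num_proc = max(1, min(self.num_proc, len(ds)))
    ds.save_to_disk(self.ckpt_ds_dir, num_proc=num_proc)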
Screenshots
No response
Additional
No response