I am using LLaMA-Factory with DeepSpeed ZeRO stage 3 + QLoRA. I ran SFT training and saved checkpoints. Each checkpoint directory contains:

- `adapter_config.json`
- `adapter_model.safetensors`
- `chat_template.jinja`
- `special_tokens_map.json`
- `tokenizer.json`
- `trainer_state.json`
- `training_args.bin`
- `zero_to_fp32.py`
- `latest`
- `global_step*/`

I assumed `adapter_model.safetensors` held the adapter weights, but it appears to be empty. Can someone clarify what gets stored in the `global_step` folder, and how I can checkpoint only the adapter weights without saving the base model?
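For reference, here is one stdlib-only way to check whether a `.safetensors` file is actually empty, by reading its header directly (a sketch; the key names shown in the usage comment are illustrative):

```python
import json
import struct

def safetensors_tensors(path):
    """Return the tensor metadata from a .safetensors file.

    Per the safetensors format, the file starts with an 8-byte
    little-endian header length, followed by a JSON header mapping
    tensor names to dtype/shape/data_offsets.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # Drop the optional "__metadata__" entry; keep real tensors only.
    return {k: v for k, v in header.items() if k != "__metadata__"}

# Usage: list names and shapes. An effectively empty adapter file has
# no lora_* entries, or entries whose shapes contain 0.
# for name, info in safetensors_tensors("adapter_model.safetensors").items():
#     print(name, info["shape"], info["dtype"])
```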
I am training Qwen3-235B-Thinking on 4 A100s with CPU offloading. My DeepSpeed config:
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 5e8,
"stage3_max_reuse_distance": 5e8,
"stage3_gather_16bit_weights_on_model_save": false
}
}
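One note on the config above: with `stage3_gather_16bit_weights_on_model_save` set to `false`, DeepSpeed does not gather the partitioned 16-bit weights at save time, so the weight file written alongside the checkpoint can come out as a placeholder; the actual sharded model and optimizer states live under `global_step*/`. Setting that flag to `true` should produce gathered weights on each save (at the cost of a full gather). Alternatively, the bundled `zero_to_fp32.py` script reconstructs full fp32 weights from the shards after the fact. A hedged sketch of the invocation; paths are illustrative, and the exact CLI varies across DeepSpeed versions (older versions take an output file such as `pytorch_model.bin`, newer ones take an output directory and support `--safe_serialization`):

```shell
# Run from inside the checkpoint directory.
# Consolidates the sharded ZeRO-3 state in global_step*/ into full fp32 weights.
python zero_to_fp32.py . consolidated_output
```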