
DeepSpeed ZeRO-3 seems to save an empty adapter_model.safetensors #7900

@savitha-suresh

Description


I am using LLaMA-Factory with DeepSpeed stage 3 + QLoRA. I trained with SFT and stored checkpoints. Each checkpoint contained:

```
adapter_config.json        chat_template.jinja  special_tokens_map.json  trainer_state.json  zero_to_fp32.py
adapter_model.safetensors  latest               tokenizer.json           training_args.bin   global_step
```

I assumed that adapter_model.safetensors held the adapter weights, but the file appears to be empty.
Can someone please clarify what gets stored in the global_step folder, and how I can save only the adapter weights (and not the base model) when checkpointing?
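To confirm the file really contains no tensors, here is a minimal stdlib-only sketch I used (my own helper, not part of any library; it relies on the safetensors layout of an 8-byte little-endian header size followed by a JSON header):

```python
import json
import os
import struct

def safetensors_tensor_names(path):
    """List tensor names in a .safetensors file without extra dependencies.

    Layout: 8 bytes little-endian header size, then a JSON header mapping
    tensor names to dtype/shape/offset records.
    """
    with open(path, "rb") as f:
        (header_size,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_size).decode("utf-8"))
    # "__metadata__" is an optional non-tensor entry in the header
    return [name for name in header if name != "__metadata__"]

# Inspect the checkpoint file if it is present in the working directory
if os.path.exists("adapter_model.safetensors"):
    print(safetensors_tensor_names("adapter_model.safetensors"))
```

For my checkpoints this prints an empty list, which is why I believe the adapter weights are not actually in the file.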

I am using 4 A100s, training Qwen3-235B (thinking) with CPU offloading. My DeepSpeed config:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 5e8,
    "stage3_max_reuse_distance": 5e8,
    "stage3_gather_16bit_weights_on_model_save": false
  }
}
```
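One thing I suspect, based on the DeepSpeed docs for `stage3_gather_16bit_weights_on_model_save`: with that flag set to false, the 16-bit weights stay partitioned across ranks at save time (in the global_step shards), so the consolidated safetensors file ends up empty. A config variant I plan to try (my assumption, not yet verified on this setup):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```

Alternatively, the zero_to_fp32.py script saved alongside the checkpoint should be able to reconstruct full fp32 weights from the global_step shards after training.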
