OpenDCAI · haolpku · May 13, 2026 · May 13, 2026
diff --git a/docs/en/notes/guide/mixer/tutorial.md b/docs/en/notes/guide/mixer/tutorial.md
@@ -36,7 +36,7 @@ update_times: 2
 - `init_mixture_proportions`: Initial sampling proportions, required when `mixture_sample_rule='mixture'`.
 - `warmup_step`: Before the first dynamic proportion update, the model needs to perform `warmup_step` steps of regular training. This helps the model establish initial understanding of data distribution.
 - `update_step`: Frequency of domain proportion updates. After every `update_step` training steps, the Mixer will be triggered to update domain proportions for the next training phase.
-- `update_times`: Total number of dynamic data proportion calculations during the entire training process. Therefore, total training steps = `(update_times * update_step + warmup_step) * global_batch_size`
+- `update_times`: Number of dynamic data proportion updates per Flex epoch. Total steps are derived from `num_train_epochs` unless `train_step > 0`.
 
 ### Static Mixing Configuration
 
@@ -47,7 +47,7 @@ train_type: dynamic_mix
 static_mix: true                      # Whether to fix initial static mixing proportions (only effective in dynamic_mix trainer)
 mixture_sample_rule: mixture          # Initial sampling rule
 init_mixture_proportions: [0.7, 0.3]  # Initial proportions, can be adjusted by additional algorithms
-train_step: 3                         # Total training steps (only effective in dynamic_mix trainer), excluding warmup and update steps
+train_step: 3                         # fixed total steps; set to 0 to use num_train_epochs
 ```
 
 When static mixing is enabled, the training process will use fixed `init_mixture_proportions` without dynamic adjustment.
@@ -129,4 +129,4 @@ mixers:
 
 #### Key Points
 
-- `params`: All parameters defined under this block will be passed as keyword arguments to the `__init__` constructor of the `RandomMixer` class. For example, the `seed` value here will be passed to the `seed` parameter of the `__init__` method.
+- `params`: All parameters defined under this block will be passed as keyword arguments to the `__init__` constructor of the `RandomMixer` class. For example, the `seed` value here will be passed to the `seed` parameter of the `__init__` method.
diff --git a/docs/en/notes/guide/selector/quickstart.md b/docs/en/notes/guide/selector/quickstart.md
@@ -23,6 +23,10 @@ component_name: less
 warmup_step: 4
 update_step: 3
 update_times: 2
+num_train_epochs: 1.0
+train_step: 0
 
 eval_dataset: alpaca_zh_demo
-```
+```
+
+`update_times` is the number of dynamic selections per Flex epoch. Use `num_train_epochs: 1.0` for one Flex epoch; use `num_train_epochs: N` with `train_step: 0` for multi-epoch runs. If `train_step > 0`, it fixes the total number of steps and overrides `num_train_epochs`.
diff --git a/docs/en/notes/guide/selector/selector_delta_loss.md b/docs/en/notes/guide/selector/selector_delta_loss.md
@@ -126,6 +126,8 @@ update_times: 2
 eval_dataset: alpaca_zh_demo
 ```
 
+`update_times` is the number of dynamic selections per Flex epoch. Delta Loss requires `update_times >= 2`: the first selection builds initial losses, and later selections use delta loss. For multi-epoch runs, set `num_train_epochs: N` and keep `train_step: 0`.
+
 ---
 
 ### Step 4: Run Training
@@ -180,4 +182,3 @@ The merged model will be saved in:
 ## 3. Model Evaluation
 
 It is recommended to use the [DataFlow](https://github.com/OpenDCAI/DataFlow) [Model QA Evaluation Pipeline](https://opendcai.github.io/DataFlow-Doc/zh/guide/2k5wjgls/) for systematic evaluation of the generated model.
-
diff --git a/docs/en/notes/guide/selector/selector_less.md b/docs/en/notes/guide/selector/selector_less.md
@@ -151,10 +151,10 @@ eval_dataset: alpaca_zh_demo
 * `output_dir`: Output directory of dynamic fine-tuning (LoRA adapter).
 * `warmup_step`: Number of warmup steps before the first sample selection.
 * `update_step`: Number of steps between each dynamic data selection.
-* `update_times`: Total number of dynamic data selection iterations.
+* `update_times`: Number of dynamic data selection iterations per Flex epoch.
 * `eval_dataset`: Validation dataset.
 
-Both `dataset` and `eval_dataset` can be selected from `DataFlex/data/dataset_info.json` or local JSON files in ShareGPT/Alpaca format. Note: The training set size significantly affects computation cost. Total steps = `warmup_step + update_step × update_times`.
+Both `dataset` and `eval_dataset` can be selected from `DataFlex/data/dataset_info.json` or local JSON files in ShareGPT/Alpaca format. Note: the training set size significantly affects computation cost. Steps per Flex epoch = `warmup_step + update_step × update_times`. Total steps are derived from `num_train_epochs` unless `train_step > 0`.
 
 ---
 
@@ -212,4 +212,3 @@ The merged model will be saved in:
 ## 3. Model Evaluation
 
 It is recommended to use the [DataFlow](https://github.com/OpenDCAI/DataFlow) [Model QA Evaluation Pipeline](https://opendcai.github.io/DataFlow-Doc/zh/guide/2k5wjgls/) for systematic evaluation of the generated model.
-
diff --git a/docs/en/notes/guide/selector/selector_loss.md b/docs/en/notes/guide/selector/selector_loss.md
@@ -132,6 +132,8 @@ update_times: 2
 eval_dataset: alpaca_zh_demo
 ```
 
+`update_times` is the number of dynamic selections per Flex epoch. For multi-epoch runs, set `num_train_epochs: N` and keep `train_step: 0`; if `train_step > 0`, it fixes total steps and overrides `num_train_epochs`.
+
 ---
 
 ### Step 4: Run Training
@@ -185,4 +187,3 @@ The merged model will be saved in:
 ## 3. Model Evaluation
 
 It is recommended to use the [DataFlow](https://github.com/OpenDCAI/DataFlow) [Model QA Evaluation Pipeline](https://opendcai.github.io/DataFlow-Doc/zh/guide/2k5wjgls/) for systematic evaluation of the generated model.
-
diff --git a/docs/en/notes/guide/selector/selector_nice.md b/docs/en/notes/guide/selector/selector_nice.md
@@ -168,7 +168,7 @@ early_stopping_min_delta: 0.01
 **Parameter description:**
 
 * `component_name`: Must match the `nice` component in `components.yaml`, determining reward backend and projection dimensions.
-* `warmup_step` / `update_step` / `update_times`: Control the dynamic selection schedule; total steps = `warmup_step + update_step × update_times`.
+* `warmup_step` / `update_step` / `update_times`: Control the dynamic selection schedule per Flex epoch. Total steps are derived from `num_train_epochs` unless `train_step > 0`.
 * `eval_dataset`: Validation set (Alpaca/ShareGPT style); reward model is used for scoring during generation.
 * `output_dir`: Path to save LoRA adapters and caches.
 
@@ -226,4 +226,3 @@ The merged model will be saved to:
 
 It is recommended to use the [DataFlow](https://github.com/OpenDCAI/DataFlow)
 [Model QA Evaluation Pipeline](https://opendcai.github.io/DataFlow-Doc/en/guide/2k5wjgls/) to systematically evaluate the generated model, and to inspect the scoring logs in `cache_dir` to analyze the reward model’s sensitivity to different samples.
-
diff --git a/docs/en/notes/guide/selector/selector_offline_near.md b/docs/en/notes/guide/selector/selector_offline_near.md
@@ -153,7 +153,7 @@ update_times: 2
 **Notes:**
 
 * `component_name: near` enables the NEAR component.
-* `warmup_step / update_step / update_times` decide **when** and **how often** to re‑select the training subset; total steps ≈ `warmup_step + update_step × update_times`.
+* `warmup_step / update_step / update_times` decide **when** and **how often** to re-select the training subset in each Flex epoch. Total steps are derived from `num_train_epochs` unless `train_step > 0`.
 * total batch_size=device_number x per_device_train_batch_size x gradient_accumulation_steps
 
 ---
@@ -201,4 +201,3 @@ llamafactory-cli export llama3_lora_sft.yaml
 
 We recommend using the [DataFlow](https://github.com/OpenDCAI/DataFlow) QA evaluation pipeline to compare **NEAR** against **Less** and **random sampling**. 
 
-
diff --git a/docs/en/notes/guide/selector/selector_offline_tsds.md b/docs/en/notes/guide/selector/selector_offline_tsds.md
@@ -187,7 +187,7 @@ update_times: 2
 **Notes:**
 
 * `component_name: tsds` enables the TSDS component.
-* `warmup_step / update_step / update_times` decide **when** and **how often** to re‑select the training subset; total steps ≈ `warmup_step + update_step × update_times`.
+* `warmup_step / update_step / update_times` decide **when** and **how often** to re-select the training subset in each Flex epoch. Total steps are derived from `num_train_epochs` unless `train_step > 0`.
 * total batch_size=device_number x per_device_train_batch_size x gradient_accumulation_steps
 
 ---
@@ -235,4 +235,3 @@ llamafactory-cli export llama3_lora_sft.yaml
 
 We recommend using the [DataFlow](https://github.com/OpenDCAI/DataFlow) QA evaluation pipeline to compare **TSDS** against **Less** and **random sampling**. 
 
-
diff --git a/docs/en/notes/guide/selector/selector_zeroth.md b/docs/en/notes/guide/selector/selector_zeroth.md
@@ -123,10 +123,10 @@ eval_dataset: alpaca_zh_demo
 * `output_dir`: Output directory of dynamic fine-tuning (LoRA adapter).
 * `warmup_step`: Number of warmup steps before the first sample selection.
 * `update_step`: Number of steps between each dynamic data selection.
-* `update_times`: Total number of dynamic data selection iterations.
+* `update_times`: Number of dynamic data selection iterations per Flex epoch.
 * `eval_dataset`: Validation dataset.
 
-Both `dataset` and `eval_dataset` can be selected from `DataFlex/data/dataset_info.json` or local JSON files in ShareGPT/Alpaca format. Note: The training set size significantly affects computation cost. Total steps = `warmup_step + update_step × update_times`.
+Both `dataset` and `eval_dataset` can be selected from `DataFlex/data/dataset_info.json` or local JSON files in ShareGPT/Alpaca format. Note: the training set size significantly affects computation cost. Steps per Flex epoch = `warmup_step + update_step × update_times`. Total steps are derived from `num_train_epochs` unless `train_step > 0`.
 
 ---
 

diff --git a/docs/en/notes/guide/weighter/quickstart.md b/docs/en/notes/guide/weighter/quickstart.md
@@ -21,5 +21,8 @@ train_type: dynamic_weight
 components_cfg_file: src/dataflex/configs/components.yaml
 component_name: loss
 warmup_step: 1
-train_step: 3 # total train steps (including warmup)
-```
+num_train_epochs: 1.0
+train_step: 0 # set positive to fix total steps and override num_train_epochs
+```
+
+For multi-epoch runs, set `num_train_epochs: N` and keep `train_step: 0`. `warmup_step` is a global step threshold and does not reset each epoch.
diff --git a/docs/en/notes/guide/weighter/tutorial.md b/docs/en/notes/guide/weighter/tutorial.md
@@ -26,7 +26,8 @@ train_type: dynamic_weight   # Select trainer type. Available options:
 components_cfg_file: src/dataflex/configs/components.yaml
 component_name: loss  # Select component name, corresponding to components defined in components_cfg_file
 warmup_step: 1
-train_step: 3 # Total training steps (including warm_up)
+num_train_epochs: 1.0
+train_step: 0 # set positive to fix total steps and override num_train_epochs
 ```
 
 ### Parameter Details
@@ -35,7 +36,8 @@ train_step: 3 # Total training steps (including warm_up)
 - `component_name`: Defines the specific strategy for data weighting. For example, `loss` uses a loss-based weighter.
 - `components_cfg_file`: Defines the parameter file containing specific parameters for the corresponding strategy.
 - `warmup_step`: Before the first dynamic weighting, the model needs to perform `warmup_step` steps of regular training. This helps the model establish initial understanding of data distribution.
-- `train_step`: Total training steps (including warmup). Weight Trainer will dynamically weight samples at each training step after warmup completion.
+- `train_step`: Optional fixed total steps. If `train_step > 0`, it overrides `num_train_epochs`; for multi-epoch runs, keep `train_step: 0`.
+- `num_train_epochs`: Controls the number of epochs when `train_step: 0`. `warmup_step` is a global step threshold and does not reset each epoch.
 
 ## How to Add Custom Weighter in DataFlex
 
@@ -140,4 +142,4 @@ weighters:
 
 #### Key Points
 
-- `params`: All parameters defined under this block will be passed as keyword arguments to the `__init__` constructor of the `CustomWeighter` class. For example, the `strategy` value here will be passed to the `strategy` parameter of the `__init__` method.
+- `params`: All parameters defined under this block will be passed as keyword arguments to the `__init__` constructor of the `CustomWeighter` class. For example, the `strategy` value here will be passed to the `strategy` parameter of the `__init__` method.
diff --git a/docs/zh/notes/guide/mixer/tutorial.md b/docs/zh/notes/guide/mixer/tutorial.md
@@ -36,7 +36,7 @@ update_times: 2
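 - `init_mixture_proportions`: Proportions used for initial sampling; required when `mixture_sample_rule='mixture'`.
 - `warmup_step`: Before the first dynamic proportion update, the model first runs `warmup_step` steps of regular training. This helps the model build an initial understanding of the data distribution.
 - `update_step`: Frequency of domain proportion updates. After every `update_step` training steps, the Mixer is triggered to update the domain proportions used in the next training phase.
-- `update_times`: Total number of dynamic data proportion calculations over the entire training process. The total number of training steps is therefore `(update_times * update_step + warmup_step) * global_batch_size`
+- `update_times`: Number of dynamic data proportion calculations per Flex epoch. Total steps are derived from `num_train_epochs`; if `train_step > 0`, `train_step` takes precedence.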
 
 ### Static Mixing Configuration
 
@@ -47,7 +47,7 @@ train_type: dynamic_mix
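 static_mix: true                      # Whether to fix the initial static mixing proportions (only effective in the dynamic_mix trainer)
 mixture_sample_rule: mixture          # Initial sampling rule
 init_mixture_proportions: [0.7, 0.3]  # Initial proportions; can be adjusted by additional algorithms
-train_step: 3                         # Total training steps (only effective in the dynamic_mix trainer), not counting warmup and update steps
+train_step: 3                         # Fixed total steps; when set to 0, num_train_epochs controls the run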
 ```
 
 When static mixing is enabled, training uses the fixed `init_mixture_proportions` and no longer adjusts them dynamically.
@@ -129,4 +129,4 @@ mixers:
 
 #### Key Points
 
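-- `params`: All parameters defined under this block are passed as keyword arguments to the `__init__` constructor of the `RandomMixer` class. For example, the `seed` value here is passed to the `seed` parameter of the `__init__` method.
+- `params`: All parameters defined under this block are passed as keyword arguments to the `__init__` constructor of the `RandomMixer` class. For example, the `seed` value here is passed to the `seed` parameter of the `__init__` method.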
diff --git a/docs/zh/notes/guide/selector/quickstart.md b/docs/zh/notes/guide/selector/quickstart.md
@@ -23,6 +23,10 @@ component_name: less
 warmup_step: 4
 update_step: 3
 update_times: 2
+num_train_epochs: 1.0
+train_step: 0
 
 eval_dataset: alpaca_zh_demo
 ```
+
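+`update_times` is the number of dynamic selections per Flex epoch. For a single Flex epoch, use `num_train_epochs: 1.0`; for multiple epochs, use `num_train_epochs: N` and keep `train_step: 0`. If `train_step > 0`, it fixes the total number of steps and overrides `num_train_epochs`.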
diff --git a/docs/zh/notes/guide/selector/selector_delta_loss.md b/docs/zh/notes/guide/selector/selector_delta_loss.md
@@ -127,6 +127,8 @@ update_times: 2
 eval_dataset: alpaca_zh_demo
 ```
 
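+`update_times` is the number of dynamic selections per Flex epoch. Delta Loss requires `update_times >= 2`: the first selection establishes the initial loss, and later selections update the samples according to the delta loss. For multi-epoch training, set `num_train_epochs: N` and keep `train_step: 0`.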
+
 ---
 
 ### Step 4: Run Training
@@ -178,4 +180,3 @@ llamafactory-cli export llama3_lora_sft.yaml
 ## 3. Model Evaluation
 
 It is recommended to use the [Model QA Evaluation Pipeline](https://opendcai.github.io/DataFlow-Doc/zh/guide/2k5wjgls/) of [DataFlow](https://github.com/OpenDCAI/DataFlow) to systematically evaluate the generated model.
-
diff --git a/docs/zh/notes/guide/selector/selector_less.md b/docs/zh/notes/guide/selector/selector_less.md
@@ -146,12 +146,12 @@ eval_dataset: alpaca_zh_demo
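 * `output_dir`: Output path for the dynamic fine-tuning result (LoRA adapter).
 * `warmup_step`: Number of warmup steps run before the first training data selection.
 * `update_step`: Number of steps between dynamic training data selections.
-* `update_times`: Total number of dynamic data selections.
+* `update_times`: Number of dynamic data selections per Flex epoch.
 * `eval_dataset`: Validation dataset.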
 
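 `dataset` and `eval_dataset` can be chosen from the entries in `DataFlex/data/dataset_info.json`, or can be local JSON files in ShareGPT or Alpaca format. Note that with this method, the size of the training set significantly affects computation cost.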
 
-Total steps = `warmup_step + update_step × update_times`.
+Steps per Flex epoch = `warmup_step + update_step × update_times`. Total steps are derived from `num_train_epochs`; if `train_step > 0`, `train_step` takes precedence.
 
 ---
 
@@ -207,4 +207,3 @@ llamafactory-cli export llama3_lora_sft.yaml
 ## 3. Model Evaluation
 
 It is recommended to use the [Model QA Evaluation Pipeline](https://opendcai.github.io/DataFlow-Doc/zh/guide/2k5wjgls/) of [DataFlow](https://github.com/OpenDCAI/DataFlow) to systematically evaluate the generated model.
-
diff --git a/docs/zh/notes/guide/selector/selector_loss.md b/docs/zh/notes/guide/selector/selector_loss.md
@@ -132,6 +132,8 @@ update_times: 2
 eval_dataset: alpaca_zh_demo
 ```
 
+`update_times` is the number of dynamic selections per Flex epoch. For multi-epoch training, set `num_train_epochs: N` and keep `train_step: 0`; if `train_step > 0`, it fixes the total number of steps and overrides `num_train_epochs`.
+
 ---
 
 ### Step 4: Run Training
@@ -184,4 +186,3 @@ llamafactory-cli export llama3_lora_sft.yaml
 ## 3. Model Evaluation
 
 It is recommended to use the [Model QA Evaluation Pipeline](https://opendcai.github.io/DataFlow-Doc/zh/guide/2k5wjgls/) of [DataFlow](https://github.com/OpenDCAI/DataFlow) to systematically evaluate the generated model.
-
diff --git a/docs/zh/notes/guide/selector/selector_nice.md b/docs/zh/notes/guide/selector/selector_nice.md
@@ -162,7 +162,7 @@ early_stopping_min_delta: 0.01
 
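 **Parameter description:**
 * `component_name`: Must match the `nice` component in `components.yaml`; determines settings such as the reward backend and projection dimensions.
-* `warmup_step` / `update_step` / `update_times`: Determine the dynamic selection schedule; total steps = `warmup_step + update_step × update_times`.
+* `warmup_step` / `update_step` / `update_times`: Determine the dynamic selection schedule within each Flex epoch; total steps are derived from `num_train_epochs`, and if `train_step > 0`, `train_step` takes precedence.
 * `eval_dataset`: Validation set, which may be in Alpaca/ShareGPT style; the reward model is called for scoring during generation.
 * `output_dir`: Path where LoRA adapters and caches are saved.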
 

diff --git a/docs/zh/notes/guide/selector/selector_offline_near.md b/docs/zh/notes/guide/selector/selector_offline_near.md
@@ -173,7 +173,7 @@ update_times: 2
 **Parameter description:**
 
 * `component_name: near`: Enables the NEAR component.
-* `warmup_step / update_step / update_times`: Determine **when** and **how often** dynamic selection runs; total steps ≈ `warmup_step + update_step × update_times`.
+* `warmup_step / update_step / update_times`: Determine **when** and **how often** dynamic selection runs within each Flex epoch; total steps are derived from `num_train_epochs`, and if `train_step > 0`, `train_step` takes precedence.
 *  Total batch_size=device_number x per_device_train_batch_size x gradient_accumulation_steps
 
 
@@ -218,4 +218,4 @@ llamafactory-cli export llama3_lora_sft.yaml
 
 ## 9. Evaluation and Comparison
 
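-We recommend using the model QA evaluation pipeline of [DataFlow](https://github.com/OpenDCAI/DataFlow) to evaluate **NEAR** side by side with strategies such as **Less** and **random sampling**
+We recommend using the model QA evaluation pipeline of [DataFlow](https://github.com/OpenDCAI/DataFlow) to evaluate **NEAR** side by side with strategies such as **Less** and **random sampling**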
diff --git a/docs/zh/notes/guide/selector/selector_offline_tsds.md b/docs/zh/notes/guide/selector/selector_offline_tsds.md
@@ -209,7 +209,7 @@ update_times: 2
 **Parameter description:**
 
 * `component_name: tsds`: Enables the TSDS component.
-* `warmup_step / update_step / update_times`: Determine **when** and **how often** dynamic selection runs; total steps ≈ `warmup_step + update_step × update_times`.
+* `warmup_step / update_step / update_times`: Determine **when** and **how often** dynamic selection runs within each Flex epoch; total steps are derived from `num_train_epochs`, and if `train_step > 0`, `train_step` takes precedence.
 *  Total batch_size=device_number x per_device_train_batch_size x gradient_accumulation_steps
 
 ---
@@ -253,4 +253,4 @@ llamafactory-cli export llama3_lora_sft.yaml
 
 ## 9. Evaluation and Comparison
 
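-We recommend using the model QA evaluation pipeline of [DataFlow](https://github.com/OpenDCAI/DataFlow) to evaluate **TSDS** side by side with strategies such as **Less** and **random sampling**
+We recommend using the model QA evaluation pipeline of [DataFlow](https://github.com/OpenDCAI/DataFlow) to evaluate **TSDS** side by side with strategies such as **Less** and **random sampling**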
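
The step accounting these doc changes describe can be sketched in a few lines of Python. This is only an illustration, not DataFlex code: the function names are hypothetical, and the diff does not state how the total is derived from `num_train_epochs`. The sketch assumes it is `num_train_epochs` times the steps per Flex epoch, rounded up, and that selections fire after warmup and then every `update_step` steps.

```python
import math


def steps_per_flex_epoch(warmup_step: int, update_step: int, update_times: int) -> int:
    """Steps in one Flex epoch, per the formula stated in the selector docs."""
    return warmup_step + update_step * update_times


def total_steps(
    warmup_step: int,
    update_step: int,
    update_times: int,
    num_train_epochs: float = 1.0,
    train_step: int = 0,
) -> int:
    """Total training steps under the documented precedence rule.

    A positive train_step fixes the total and overrides num_train_epochs.
    ASSUMPTION: otherwise the total is num_train_epochs * steps per Flex
    epoch, rounded up; the real trainer may round or schedule differently.
    """
    if train_step > 0:
        return train_step
    per_epoch = steps_per_flex_epoch(warmup_step, update_step, update_times)
    return math.ceil(num_train_epochs * per_epoch)


def selection_offsets(warmup_step: int, update_step: int, update_times: int) -> list[int]:
    """Step offsets within one Flex epoch at which a selection would fire.

    ASSUMPTION: the first selection happens right after warmup, and each
    later one follows update_step steps after the previous selection.
    """
    return [warmup_step + i * update_step for i in range(update_times)]


if __name__ == "__main__":
    # Values from the selector quickstart: warmup_step=4, update_step=3, update_times=2.
    print(steps_per_flex_epoch(4, 3, 2))             # 10
    print(total_steps(4, 3, 2, num_train_epochs=3))  # 30 under the assumption above
    print(total_steps(4, 3, 2, train_step=3))        # 3, train_step takes precedence
    print(selection_offsets(4, 3, 2))                # [4, 7]
```

With the quickstart values, one Flex epoch is 10 steps. Setting `train_step: 3` would end the run before the first selection at step 4 under these assumptions, which is one reason to keep `train_step: 0` for selector runs.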