# Training parameter recommendations
Hardware: Dual Intel Xeon E5-2678 v3 (24 physical cores / 48 threads) + NVIDIA GTX 1050 Ti (4 GB VRAM)
Purpose: recommended, ready-to-apply parameter sets for the repository's two training flows:
- Card Model (`card_model/train_card_model.py`) — see `card_model/config.py`
- MCCFR Trainer (`mccfr_trainer.py`)
This document does not change any code; it lists the relevant variables with suggested values in two profiles (Quick/Dev and Balanced/Production). Edit the constants in the referenced files when you are ready.
Files to adjust (examples):
- Card Model config: [card_model/config.py](card_model/config.py#L65-L72)
- MCCFR trainer: [mccfr_trainer.py](mccfr_trainer.py#L76-L97)
---
## Summary recommendation for your machine (short)
- If you want fast iterations: use the `Quick / Dev` profile below.
- If you want longer runs for better final performance and have time: use the `Balanced / Production` profile.
---
## Card Model (histogram + equity) — variables in `card_model/config.py`
Two profiles: Quick / Dev (iterate fast) and Balanced / Production.
### Quick / Dev (recommended to iterate)
- `NUM_TRAIN_SAMPLES` = 200_000
- `NUM_VAL_SAMPLES` = 10_000
- `NUM_ROLLOUTS` = 200
- `BATCH_SIZE` = 1024
- `NUM_EPOCHS` = 32
- `LEARNING_RATE` = 1e-3
- `WEIGHT_DECAY` = 1e-4
- `LAMBDA_MSE` = 0.1
- `NUM_WORKERS` = 20 # used for dataset generation and DataLoader in this codebase; 20 is a good balance on 24 cores
Notes:
- `NUM_ROLLOUTS=200` reduces data-generation cost (fewer MC rollouts) so samples are cheaper to produce. Increase to 1000 for higher-quality labels if you have time.
- `BATCH_SIZE=1024` is safe for GTX 1050 Ti (4 GB VRAM). If you see OOM during CardModel training, reduce to 512.
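Put together, the Quick/Dev values above would look roughly like this in `card_model/config.py` (constant names and values are taken from this document; verify they match the actual file before editing):

```python
# Quick / Dev profile for card_model/config.py (sketch; names from this doc).
NUM_TRAIN_SAMPLES = 200_000   # MC-labelled training samples to generate
NUM_VAL_SAMPLES = 10_000
NUM_ROLLOUTS = 200            # Monte Carlo rollouts per label; 1000 for higher quality
BATCH_SIZE = 1024             # safe for a 4 GB GTX 1050 Ti; halve to 512 on OOM
NUM_EPOCHS = 32
LEARNING_RATE = 1e-3
WEIGHT_DECAY = 1e-4
LAMBDA_MSE = 0.1              # weight of the MSE term in the combined loss
NUM_WORKERS = 20              # leaves spare cores for the OS on 24 physical cores
```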
### Balanced / Production (longer training, better final quality)
- `NUM_TRAIN_SAMPLES` = 2_000_000
- `NUM_VAL_SAMPLES` = 100_000
- `NUM_ROLLOUTS` = 1000
- `BATCH_SIZE` = 4096
- `NUM_EPOCHS` = 64
- `LEARNING_RATE` = 5e-4
- `WEIGHT_DECAY` = 1e-4
- `LAMBDA_MSE` = 0.1
- `NUM_WORKERS` = 22
Notes:
- Production profile expects long wall-clock time and sustained CPU usage. With `NUM_WORKERS=22` you still leave 2 physical cores for OS/driver tasks.
- If training CardModel on the GPU causes OOM, fall back to CPU (`device=torch.device('cpu')`) or reduce `BATCH_SIZE`.
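The reduce-on-OOM fallback can be wrapped in a small retry helper. This is a generic pattern, not part of the repo's code; `train_one_epoch` is a stand-in for whatever per-epoch function the trainer exposes:

```python
# Generic OOM-fallback pattern (illustrative; not the repo's actual code).
def train_with_oom_fallback(train_one_epoch, batch_size, min_batch_size=128):
    """Run train_one_epoch(batch_size), halving the batch on CUDA OOM."""
    while batch_size >= min_batch_size:
        try:
            return train_one_epoch(batch_size)
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise  # unrelated error: re-raise
            batch_size //= 2  # halve and retry
    raise RuntimeError("Still OOM at the minimum batch size; fall back to CPU.")
```

If even the minimum batch does not fit, that is the point at which the CPU fallback above applies.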
---
## MCCFR Trainer (`mccfr_trainer.py`) — main self-play + network training
Two profiles: Quick / Dev and Balanced / Production.
### Quick / Dev (safe to test)
- `NUM_ITERATIONS` = 1_000
- `GAMES_PER_ITER` = 200
- `NUM_WORKERS` = 20 # worker processes for self-play traversals (use physical cores minus a few)
- `BUFFER_MAX_SIZE` = 500_000
- `MIN_BUFFER_SIZE_FOR_TRAIN` = 10_000
- `TRAIN_BATCH_SIZE` = 4_096
- `TRAIN_STEPS_PER_ITER` = 20
- `LEARNING_RATE` = 1e-3
- `WEIGHT_DECAY` = 1e-4
- `CLIP_GRAD_NORM` = 1.0
- `CARD_MODEL_CHECKPOINT` = `card_model/data/best_card_model.pt` (use existing checkpoint if available)
Why these values?
- `NUM_WORKERS=20` uses most physical cores while leaving a few cores for the main process and OS.
- `TRAIN_BATCH_SIZE=4096` is a conservative batch that should fit in 4 GB VRAM for the small CFR network and allow efficient training.
- Reduce `MIN_BUFFER_SIZE_FOR_TRAIN` for faster first training iterations during experiments.
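How `BUFFER_MAX_SIZE` and `MIN_BUFFER_SIZE_FOR_TRAIN` interact can be sketched as below. This is a hypothetical illustration of the gating logic, not the actual data structure in `mccfr_trainer.py`:

```python
from collections import deque
import random

# Illustrative replay-buffer gating: BUFFER_MAX_SIZE bounds memory, and
# MIN_BUFFER_SIZE_FOR_TRAIN delays training until enough samples exist.
BUFFER_MAX_SIZE = 500_000
MIN_BUFFER_SIZE_FOR_TRAIN = 10_000
TRAIN_BATCH_SIZE = 4_096

buffer = deque(maxlen=BUFFER_MAX_SIZE)  # oldest samples evicted automatically

def maybe_sample_batch(buffer):
    """Return a training batch, or None while the buffer is still warming up."""
    if len(buffer) < MIN_BUFFER_SIZE_FOR_TRAIN:
        return None  # skip training this iteration
    return random.sample(buffer, TRAIN_BATCH_SIZE)
```

Lowering `MIN_BUFFER_SIZE_FOR_TRAIN` simply makes `maybe_sample_batch` return batches sooner, at the cost of training on less diverse early data.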
### Balanced / Production (long-run)
- `NUM_ITERATIONS` = 50_000
- `GAMES_PER_ITER` = 500
- `NUM_WORKERS` = 20
- `BUFFER_MAX_SIZE` = 2_000_000
- `MIN_BUFFER_SIZE_FOR_TRAIN` = 100_000
- `TRAIN_BATCH_SIZE` = 8_192
- `TRAIN_STEPS_PER_ITER` = 50
- `LEARNING_RATE` = 5e-4
- `WEIGHT_DECAY` = 1e-4
- `CLIP_GRAD_NORM` = 1.0
Notes:
- The CFR network is compact; even with 4 GB of VRAM you can try `TRAIN_BATCH_SIZE` values in the 8k-16k range, depending on other GPU activity. Start with 8k and monitor GPU memory with `nvidia-smi`.
- `NUM_WORKERS=20` is still recommended; avoid setting `NUM_WORKERS` to the number of physical cores or higher, which adds scheduling/oversubscription overhead.
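The "start with 8k and monitor" advice can also be automated with a doubling probe. This is a generic sketch; `try_batch` is a stand-in for one forward/backward pass at a given batch size:

```python
# Doubling probe for the largest batch size that fits in GPU memory
# (illustrative; try_batch is a hypothetical one-step training callable).
def find_max_batch(try_batch, start=1024, limit=16_384):
    """Double the batch until try_batch raises OOM or limit is exceeded."""
    best = None
    size = start
    while size <= limit:
        try:
            try_batch(size)
            best = size   # this size fits; remember it and try double
            size *= 2
        except RuntimeError as e:
            if "out of memory" not in str(e).lower():
                raise
            break         # OOM: the previous size is the answer
    return best
```

In practice, leave some headroom below the probed maximum, since memory use varies with other GPU activity.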
---
## Suggested practical workflow (apply these before long runs)
1. For a first end-to-end test, use the **Quick / Dev** profile for both Card Model and MCCFR Trainer.
2. Generate CardModel training data once:
- Run `python card_model/train_card_model.py` (it will generate or load `card_model/data/train_data.npz`).
- If generation is too slow, reduce `NUM_TRAIN_SAMPLES` or `NUM_ROLLOUTS` in the Quick profile.
3. Train CardModel to obtain `card_model/data/best_card_model.pt`.
4. Use that checkpoint with `mccfr_trainer.py` (set `CARD_MODEL_CHECKPOINT` if you want to load it) and start MCCFR with Quick/Dev profile.
5. If both steps succeed and you want to scale up, switch to the Balanced/Production profile.
Example commands:
- Generate & train CardModel (from repo root):
```
python card_model/train_card_model.py
```
- Start MCCFR trainer (from repo root):
```
python mccfr_trainer.py
```
Monitor GPU memory while training with `nvidia-smi -l 2` and reduce `BATCH_SIZE` / `TRAIN_BATCH_SIZE` if you see OOM.
---
## Notes & cautions
- The repository hardcodes some constants in `card_model/config.py` and `mccfr_trainer.py`. This document lists the variables and recommended values — you must edit the constants in those files or override them in a wrapper script before running.
- For multi-process data generation and MCCFR traversal, the code uses the `spawn` start method to avoid CUDA forking issues. Keep that unchanged.
- If you plan to fully utilize all 24 cores for data generation, avoid launching heavy background tasks. Disk I/O during parallel generation can be significant; make sure you have enough temporary disk space for intermediate `.npz` files.
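The "override them in a wrapper script" option mentioned above follows a simple pattern: import the module, set its constants, then call its entry point. The entry-point name in `mccfr_trainer.py` is not specified here and must be checked; the sketch below uses a throwaway module so the pattern is self-contained:

```python
import types

# Pattern for overriding module-level constants from a wrapper script without
# editing the source. A real wrapper would `import mccfr_trainer as mod` and
# then call its entry point; a throwaway module stands in here.
mod = types.ModuleType("fake_trainer")
mod.NUM_ITERATIONS = 50_000      # pretend these are the file's defaults
mod.GAMES_PER_ITER = 500

overrides = {"NUM_ITERATIONS": 1_000, "GAMES_PER_ITER": 200}  # Quick/Dev
for name, value in overrides.items():
    setattr(mod, name, value)    # must run before the module's entry point
```

Note this only works if the trainer reads its constants at run time (module globals); if it binds them into local defaults at import, the constants must be edited in the file instead.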
---
## Quick reference: exact variables to set
- `card_model/config.py`:
- `NUM_TRAIN_SAMPLES`, `NUM_VAL_SAMPLES`, `NUM_ROLLOUTS`, `BATCH_SIZE`, `NUM_EPOCHS`, `LEARNING_RATE`, `NUM_WORKERS`, `WEIGHT_DECAY`.
- `mccfr_trainer.py`:
- `NUM_ITERATIONS`, `GAMES_PER_ITER`, `NUM_WORKERS`, `BUFFER_MAX_SIZE`, `MIN_BUFFER_SIZE_FOR_TRAIN`, `TRAIN_BATCH_SIZE`, `TRAIN_STEPS_PER_ITER`, `LEARNING_RATE`, `WEIGHT_DECAY`, `CARD_MODEL_CHECKPOINT`.
---
If you want, I can now write a small wrapper script that launches CardModel data generation and training, then launches MCCFR with the chosen profile (no code changes to core files — the wrapper will set values at runtime). Reply if you want that wrapper created.