
Training parameter recommendations

Hardware: Dual Intel Xeon E5-2678 v3 (24 physical cores / 48 threads) + NVIDIA GTX 1050 Ti (4 GB VRAM)

Purpose: recommended, ready-to-apply parameter sets for the repository's two training flows:

  • Card Model (card_model/train_card_model.py) — see card_model/config.py
  • MCCFR Trainer (mccfr_trainer.py)

This document does not modify any code; it lists the relevant variables and suggested values in two profiles (Quick/Dev and Balanced/Production). Edit the constants in the referenced files when you are ready.

Files to adjust:

  • card_model/config.py (Card Model hyperparameters)
  • mccfr_trainer.py (MCCFR self-play and training constants)

Summary recommendation for your machine (short)

  • If you want fast iterations: use the Quick / Dev profile below.
  • If you want longer runs for better final performance and have time: use the Balanced / Production profile.

Card Model (histogram + equity) — variables in card_model/config.py

Two profiles: Quick / Dev (iterate fast) and Balanced / Production.

Quick / Dev (iterate fast)

  • NUM_TRAIN_SAMPLES = 200_000
  • NUM_VAL_SAMPLES = 10_000
  • NUM_ROLLOUTS = 200
  • BATCH_SIZE = 1024
  • NUM_EPOCHS = 32
  • LEARNING_RATE = 1e-3
  • WEIGHT_DECAY = 1e-4
  • LAMBDA_MSE = 0.1
  • NUM_WORKERS = 20 # used for dataset generation and DataLoader in this codebase; 20 is a good balance on 24 cores

Notes:

  • NUM_ROLLOUTS=200 reduces data-generation cost (fewer MC rollouts) so samples are cheaper to produce. Increase to 1000 for higher-quality labels if you have time.
  • BATCH_SIZE=1024 is safe for GTX 1050 Ti (4 GB VRAM). If you see OOM during CardModel training, reduce to 512.
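The reduce-on-OOM advice above can also be automated with a simple halving loop. This is a hypothetical sketch, not code from the repository: `fake_train` stands in for the real training step, and the OOM signal is simulated with a plain RuntimeError (PyTorch's CUDA OOM error is a RuntimeError subclass whose message contains "out of memory").

```python
def find_workable_batch_size(train_one_epoch, start=1024, floor=64):
    """Halve the batch size until the training step stops raising OOM.

    train_one_epoch: callable taking a batch size; raises RuntimeError on OOM.
    Returns the first batch size that runs, or raises if even `floor` fails.
    """
    batch_size = start
    while batch_size >= floor:
        try:
            train_one_epoch(batch_size)
            return batch_size
        except RuntimeError as err:
            if "out of memory" not in str(err).lower():
                raise  # unrelated error: do not mask it
            batch_size //= 2
    raise RuntimeError(f"OOM even at batch size {floor}")


# Simulated training step: pretend anything above 512 overflows 4 GB VRAM.
def fake_train(batch_size):
    if batch_size > 512:
        raise RuntimeError("CUDA out of memory")


print(find_workable_batch_size(fake_train))  # -> 512 (fell back from 1024)
```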

Balanced / Production (longer training, better final quality)

  • NUM_TRAIN_SAMPLES = 2_000_000
  • NUM_VAL_SAMPLES = 100_000
  • NUM_ROLLOUTS = 1000
  • BATCH_SIZE = 4096
  • NUM_EPOCHS = 64
  • LEARNING_RATE = 5e-4
  • WEIGHT_DECAY = 1e-4
  • LAMBDA_MSE = 0.1
  • NUM_WORKERS = 22

Notes:

  • The Production profile expects long wall-clock times and sustained CPU usage. With NUM_WORKERS=22 you still leave two physical cores free for OS/driver tasks.
  • If training the CardModel on the GPU causes OOM, fall back to the CPU (device=torch.device('cpu')) or reduce BATCH_SIZE.
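Both Card Model profiles can be kept side by side as plain dictionaries and applied by name, for example from a small wrapper script. The keys mirror the constants in card_model/config.py; the `apply_profile` helper is a hypothetical sketch (here demonstrated against a SimpleNamespace stand-in rather than the real config module).

```python
import types

CARD_MODEL_PROFILES = {
    "quick": {  # fast iterations, cheaper labels
        "NUM_TRAIN_SAMPLES": 200_000, "NUM_VAL_SAMPLES": 10_000,
        "NUM_ROLLOUTS": 200, "BATCH_SIZE": 1024, "NUM_EPOCHS": 32,
        "LEARNING_RATE": 1e-3, "WEIGHT_DECAY": 1e-4,
        "LAMBDA_MSE": 0.1, "NUM_WORKERS": 20,
    },
    "production": {  # longer runs, higher-quality labels
        "NUM_TRAIN_SAMPLES": 2_000_000, "NUM_VAL_SAMPLES": 100_000,
        "NUM_ROLLOUTS": 1000, "BATCH_SIZE": 4096, "NUM_EPOCHS": 64,
        "LEARNING_RATE": 5e-4, "WEIGHT_DECAY": 1e-4,
        "LAMBDA_MSE": 0.1, "NUM_WORKERS": 22,
    },
}


def apply_profile(config_module, name):
    """Overwrite module-level constants with the chosen profile's values."""
    for key, value in CARD_MODEL_PROFILES[name].items():
        setattr(config_module, key, value)


# Demo: apply the quick profile to a stand-in for card_model.config.
cfg = types.SimpleNamespace()
apply_profile(cfg, "quick")
print(cfg.BATCH_SIZE)  # -> 1024
```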

MCCFR Trainer (mccfr_trainer.py) — main self-play + network training

Two profiles: Quick / Dev and Balanced / Production.

Quick / Dev (safe to test)

  • NUM_ITERATIONS = 1_000
  • GAMES_PER_ITER = 200
  • NUM_WORKERS = 20 # worker processes for self-play traversals (use physical cores minus a few)
  • BUFFER_MAX_SIZE = 500_000
  • MIN_BUFFER_SIZE_FOR_TRAIN = 10_000
  • TRAIN_BATCH_SIZE = 4_096
  • TRAIN_STEPS_PER_ITER = 20
  • LEARNING_RATE = 1e-3
  • WEIGHT_DECAY = 1e-4
  • CLIP_GRAD_NORM = 1.0
  • CARD_MODEL_CHECKPOINT = card_model/data/best_card_model.pt (use existing checkpoint if available)

Why these values?

  • NUM_WORKERS=20 uses most physical cores while leaving a few cores for the main process and OS.
  • TRAIN_BATCH_SIZE=4096 is a conservative batch that should fit in 4 GB VRAM for the small CFR network and allow efficient training.
  • Reduce MIN_BUFFER_SIZE_FOR_TRAIN for faster first training iterations during experiments.
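The buffer knobs interact in a simple way: self-play transitions accumulate in a bounded buffer, and network training only starts once MIN_BUFFER_SIZE_FOR_TRAIN samples are available. A minimal sketch of that gating, assuming the buffer evicts the oldest samples first (the actual eviction policy in mccfr_trainer.py may differ):

```python
import random
from collections import deque

BUFFER_MAX_SIZE = 500_000
MIN_BUFFER_SIZE_FOR_TRAIN = 10_000

# deque(maxlen=...) drops the oldest entries automatically when full.
buffer = deque(maxlen=BUFFER_MAX_SIZE)


def add_samples(samples):
    buffer.extend(samples)


def ready_to_train():
    return len(buffer) >= MIN_BUFFER_SIZE_FOR_TRAIN


def sample_batch(batch_size=4096):
    if not ready_to_train():
        return None  # keep collecting self-play data
    return random.sample(list(buffer), min(batch_size, len(buffer)))


add_samples(range(5_000))
print(ready_to_train())  # -> False: below the 10_000 threshold
```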

Balanced / Production (long-run)

  • NUM_ITERATIONS = 50_000
  • GAMES_PER_ITER = 500
  • NUM_WORKERS = 20
  • BUFFER_MAX_SIZE = 2_000_000
  • MIN_BUFFER_SIZE_FOR_TRAIN = 100_000
  • TRAIN_BATCH_SIZE = 8_192
  • TRAIN_STEPS_PER_ITER = 50
  • LEARNING_RATE = 5e-4
  • WEIGHT_DECAY = 1e-4
  • CLIP_GRAD_NORM = 1.0

Notes:

  • The CFR network is compact; even with 4 GB of VRAM you can try TRAIN_BATCH_SIZE up to 8k-16k depending on other GPU activity. Start with 8k and monitor GPU memory with nvidia-smi.
  • NUM_WORKERS=20 still recommended; avoid setting NUM_WORKERS >= number of physical cores to reduce scheduling/oversubscription overhead.
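The "physical cores minus a few" rule can be computed instead of hard-coded. Note that os.cpu_count() reports logical threads (48 on this machine with Hyper-Threading), so a sketch needs to halve it before subtracting headroom; `recommended_workers` is a hypothetical helper, not part of the repository:

```python
import os


def recommended_workers(logical_cpus=None, hyperthreading=True, reserve=4):
    """Physical cores minus a small reserve for the main process and OS."""
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    physical = logical_cpus // 2 if hyperthreading else logical_cpus
    return max(1, physical - reserve)


print(recommended_workers(logical_cpus=48))  # -> 20 (24 physical cores - 4)
```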

Suggested practical workflow (apply these before long runs)

  1. For a first end-to-end test, use the Quick / Dev profile for both Card Model and MCCFR Trainer.
  2. Generate CardModel training data once:
    • Run python train_card_model.py (it will generate or load card_model/data/train_data.npz).
    • If generation is too slow, reduce NUM_TRAIN_SAMPLES or NUM_ROLLOUTS in the Quick profile.
  3. Train CardModel to obtain card_model/data/best_card_model.pt.
  4. Use that checkpoint with mccfr_trainer.py (set CARD_MODEL_CHECKPOINT if you want to load it) and start MCCFR with Quick/Dev profile.
  5. If both steps succeed and you want to scale up, switch to the Balanced/Production profile.
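The workflow above can be scripted end to end. A minimal sketch, assuming both entry points are runnable from the repo root under the names used in this document; `run_pipeline` is a hypothetical wrapper, and `check=True` makes it stop at the first failing step:

```python
import subprocess
import sys
from pathlib import Path


def pipeline_commands():
    """The two training steps, in order, as argv lists."""
    return [[sys.executable, "train_card_model.py"],
            [sys.executable, "mccfr_trainer.py"]]


def run_pipeline(repo_root="."):
    """Run CardModel training, then MCCFR; raise on the first failure."""
    for cmd in pipeline_commands():
        subprocess.run(cmd, cwd=repo_root, check=True)
    # After step 1 this checkpoint should exist for MCCFR to load.
    return Path(repo_root, "card_model", "data", "best_card_model.pt")
```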

Command examples:

  • Generate & train the CardModel (from the repo root):
    python train_card_model.py
  • Start the MCCFR trainer (from the repo root):
    python mccfr_trainer.py

Monitor GPU memory while training with nvidia-smi -l 2 and reduce BATCH_SIZE / TRAIN_BATCH_SIZE if you see OOM.
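If you want an automated guard rather than watching nvidia-smi by hand, the used-memory figure can be polled via nvidia-smi's CSV query mode (nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits). The parser below is demonstrated against a canned sample string so it runs without a GPU; the sample value is illustrative, not a real measurement.

```python
import subprocess


def used_vram_mib(raw=None):
    """Return used VRAM in MiB for GPU 0.

    With raw=None this shells out to nvidia-smi; pass a string to parse
    captured output instead (useful for testing without a GPU).
    """
    if raw is None:
        raw = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=memory.used",
             "--format=csv,noheader,nounits"],
            text=True,
        )
    return int(raw.splitlines()[0].strip())


# Canned sample output (one line per GPU, value in MiB):
print(used_vram_mib("3512\n"))  # -> 3512, i.e. most of the 4096 MiB total
```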


Notes & cautions

  • The repository hardcodes some constants in card_model/config.py and mccfr_trainer.py. This document lists the variables and recommended values — you must edit the constants in those files or override them in a wrapper script before running.
  • For multi-process data generation and MCCFR traversal, the code uses the spawn start method to avoid CUDA forking issues. Keep that unchanged.
  • If you plan to fully utilize all 24 cores for data generation, avoid launching heavy background tasks. Disk I/O during parallel generation can be significant; make sure you have enough temporary disk space for intermediate .npz files.
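The disk-space caution above can be turned into a pre-flight check with shutil.disk_usage. This is a rough sketch: the per-sample byte size is an assumption (measure one of your actual generated .npz files and adjust), and the safety factor is meant to cover intermediate per-worker files.

```python
import shutil


def enough_disk_for_npz(path=".", num_samples=2_000_000,
                        bytes_per_sample=256, safety_factor=2.0):
    """Rough pre-flight disk check before parallel .npz generation.

    bytes_per_sample is an assumed average, not measured from the repo;
    safety_factor pads for intermediate per-worker files.
    """
    needed = int(num_samples * bytes_per_sample * safety_factor)
    free = shutil.disk_usage(path).free
    return free >= needed, needed, free


ok, needed, free = enough_disk_for_npz(num_samples=200_000)
print(needed)  # -> 102_400_000 bytes (~98 MiB) for the Quick profile
```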

Quick reference: exact variables to set

  • card_model/config.py:
    • NUM_TRAIN_SAMPLES, NUM_VAL_SAMPLES, NUM_ROLLOUTS, BATCH_SIZE, NUM_EPOCHS, LEARNING_RATE, NUM_WORKERS, WEIGHT_DECAY.
  • mccfr_trainer.py:
    • NUM_ITERATIONS, GAMES_PER_ITER, NUM_WORKERS, BUFFER_MAX_SIZE, MIN_BUFFER_SIZE_FOR_TRAIN, TRAIN_BATCH_SIZE, TRAIN_STEPS_PER_ITER, LEARNING_RATE, WEIGHT_DECAY, CARD_MODEL_CHECKPOINT.
