This document explains the full journey of building neural models from first principles: a single perceptron, its training dynamics and limitations, then a handcrafted multi-layer perceptron (MLP) with backpropagation used to solve XOR and classify Tic-Tac-Toe positions. It emphasizes under-the-hood mechanics (weight updates, gradient flow, dataset generation) rather than external ML libraries.
- Perceptron: linear separator; succeeds on AND; fails on XOR (proof of need for hidden layers).
- Automation: statistical view of perceptron convergence behavior (multiple trials).
- MLP Core: forward propagation + explicit manual backprop for arbitrary layer sizes.
- XOR Solution: minimal hidden layer enables non‑linear separability.
- Tic‑Tac‑Toe: synthetic state space generation; supervised classification.
- Tooling: serialization, inference, visualization, benchmarking.
- Perceptron CLI with custom argument parser (no `argparse`): creation, training, prediction, save/load.
- AND gate multi-trial automation: convergence phases, average accuracy curve.
- Minimal MLP: sigmoid activations, MSE loss, batch & mini‑batch gradient descent, optional learning rate decay, JSON persistence.
- XOR trainer script + benchmarking grid for convergence epoch statistics.
- Tic‑Tac‑Toe dataset generator enforcing legal game states (alternating turns, early stop at win, draw detection).
- Tic‑Tac‑Toe training script with validation split and early stopping by accuracy.
- Inference & curve visualization utilities (aggregated overlays).
Given inputs `x_1 … x_n`, the perceptron computes `ŷ = step(Σ_i w_i * x_i + b)`, where `step(z) = 1` if `z ≥ 0` and `0` otherwise.
Clarifying notation: `w_i` are the weights, `b` the bias, `y` the target label, `ŷ` the prediction, and `lr` the learning rate.
An alternate activation (tanh threshold) is provided for experimentation.
For a sample `(x, y)` the update is `w_i ← w_i + lr * (y - ŷ) * x_i` and `b ← b + lr * (y - ŷ)`.
Only misclassified samples update parameters (the classic perceptron rule, not gradient descent).
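A minimal sketch of this rule (the `predict`/`update` names here are illustrative; the actual implementation lives in `bs/my_perceptron.py`):

```python
def predict(weights, bias, x):
    """Step activation: 1 if w·x + b >= 0, else 0."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if z >= 0 else 0

def update(weights, bias, x, y, lr):
    """Classic perceptron rule: only misclassified samples change parameters."""
    err = y - predict(weights, bias, x)          # 0 when correct, ±1 when wrong
    weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    bias += lr * err
    return weights, bias
```

On AND's truth table this converges quickly; on XOR it cycles indefinitely because no linear boundary separates the classes.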
Layer sizes: `layer_sizes = [n_in, h1, …, n_out]` (arbitrary depth and width).
Internal storage:
- `weights[L][j][i]` = weight from neuron `i` in the previous layer to neuron `j` in the current layer.
- `biases[L][j]` = bias for neuron `j`.
- The activations list includes the input layer as index 0.
MSE is used for simplicity (cross-entropy would be more suitable for classification but would require softmax modifications).
Sigmoid derivative given activation `a`: `σ'(z) = a * (1 - a)` (no extra `exp` once `a` is stored).
Output deltas: `delta[out][j] = (a[j] - y[j]) * a[j] * (1 - a[j])`.
Hidden layer deltas (chain rule): `delta[L][j] = (Σ_k weights[L+1][k][j] * delta[L+1][k]) * a[L][j] * (1 - a[L][j])`.
Weight & bias gradients: `grad_w[L][j][i] = delta[L][j] * a[L-1][i]` and `grad_b[L][j] = delta[L][j]`.
Mini‑batch accumulation sums gradients across the batch; final update divides by batch size (average gradient).
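A list-based sketch of the backward pass implied by these formulas, used by the batch loop below (here `weights[L]` connects `activations[L]` to `activations[L+1]`; names are illustrative, not necessarily those in `mlp.py`):

```python
def backward(weights, activations, target):
    """Per-sample gradients for a sigmoid network trained with MSE.

    activations[0] is the input layer, activations[-1] the output layer.
    Returns (grad_w, grad_b) shaped like the weight / bias lists.
    """
    # Output deltas: (a - y) * a * (1 - a)
    deltas = [[(a - y) * a * (1 - a)
               for a, y in zip(activations[-1], target)]]
    # Hidden deltas, back to front (chain rule through the next layer's weights)
    for L in range(len(weights) - 1, 0, -1):
        nxt = deltas[0]
        deltas.insert(0, [
            sum(weights[L][k][j] * nxt[k] for k in range(len(nxt))) * a * (1 - a)
            for j, a in enumerate(activations[L])
        ])
    # Gradients: delta times the previous layer's activation
    grad_w = [[[d * a_prev for a_prev in activations[L]] for d in deltas[L]]
              for L in range(len(weights))]
    grad_b = deltas
    return grad_w, grad_b
```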
For each batch:
- Initialize accumulators for grad_w, grad_b.
- Forward each sample → activations.
- Compute deltas via backward pass.
- Accumulate gradients.
- After batch: apply averaged update.
- Track accuracy (#correct / #samples). Binary uses threshold; multi‑class uses argmax.
MLP training breaks early if accuracy >= target_accuracy.
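Putting the batch loop together, one epoch might look like the sketch below (it assumes a `net` whose `forward` returns all layer activations plus the `backward` sketch above; the real `train_epoch` may differ):

```python
def train_epoch(net, samples, targets, lr, batch_size):
    """One epoch of averaged mini-batch gradient descent; returns accuracy."""
    if batch_size <= 0:
        batch_size = len(samples)                      # full-batch fallback
    correct = 0
    for start in range(0, len(samples), batch_size):
        xs = samples[start:start + batch_size]
        ys = targets[start:start + batch_size]
        # Gradient accumulators shaped like the weights / biases
        gw = [[[0.0] * len(row) for row in layer] for layer in net.weights]
        gb = [[0.0] * len(layer) for layer in net.biases]
        for x, y in zip(xs, ys):
            acts = net.forward(x)
            dw, db = backward(net.weights, acts, y)
            for L in range(len(gw)):
                for j in range(len(gw[L])):
                    gb[L][j] += db[L][j]
                    for i in range(len(gw[L][j])):
                        gw[L][j][i] += dw[L][j][i]
            out = acts[-1]
            pred = int(out[0] >= 0.5) if len(out) == 1 else out.index(max(out))
            truth = int(y[0] >= 0.5) if len(y) == 1 else y.index(max(y))
            correct += (pred == truth)
        # Apply the averaged update
        n = len(xs)
        for L in range(len(gw)):
            for j in range(len(gw[L])):
                net.biases[L][j] -= lr * gb[L][j] / n
                for i in range(len(gw[L][j])):
                    net.weights[L][j][i] -= lr * gw[L][j][i] / n
    return correct / len(samples)
```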
```
bs/
  my_perceptron.py          # Perceptron CLI (custom argument parsing)
  mlp.py                    # MLP core: forward, backprop, train
  tictactoe.py              # Dataset generation & labeling helpers
scripts/
  automate_and_gate.py      # Batch AND gate perceptron trials
  train_xor_mlp.py          # XOR training script (MLP)
  benchmark_xor.py          # Hyperparameter convergence benchmark for XOR
  visualize_results.py      # Overlay multiple CSV curves
  infer_mlp.py              # Load JSON model & predict
  train_tictactoe_mlp.py    # Tic-Tac-Toe position classifier trainer
results/                    # Generated CSVs / plots / saved models
data/                       # AND & XOR gate truth tables
```

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .
```
Optional plotting via matplotlib (included in pyproject.toml).
Equation: `w_i ← w_i + lr * (y - ŷ) * x_i`, `b ← b + lr * (y - ŷ)`.
Training uses phases:
- Evaluate accuracy on the full dataset.
- Perform `--steps` random single-sample updates.
- Repeat until 100% accuracy or `--max-phases` is reached.
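In code, the phase loop is roughly the following sketch (assuming a perceptron object `p` exposing the predict/update logic sketched earlier):

```python
import random

def train_phases(p, dataset, steps, max_phases, lr):
    """Alternate full-dataset evaluation with bursts of random updates."""
    history = []
    for _ in range(max_phases):
        acc = sum(p.predict(x) == y for x, y in dataset) / len(dataset)
        history.append(acc)                      # per-phase accuracy history
        if acc == 1.0:
            break                                # converged
        for _ in range(steps):                   # random single-sample updates
            x, y = random.choice(dataset)
            p.update(x, y, lr)
    return history
```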
CLI (custom parser, no `argparse`):
```bash
# Train AND gate
python3 bs/my_perceptron.py --new 2 --train --steps 40 data/and_gate.txt --save perceptron_and.json

# Predict with saved model
python3 bs/my_perceptron.py --load perceptron_and.json data/and_gate.txt

# Demonstrate XOR failure (plateau)
python3 bs/my_perceptron.py --new 2 --train --steps 40 --max-phases 30 data/xor_gate.txt
```
Output includes per-phase accuracy history.
Prediction file format: lines may contain either just the inputs or inputs + label (0/1). In prediction mode the label, if present, is ignored, allowing reuse of the original training truth table without manual editing.
Purpose: quantify variability in phases needed for convergence by repeated random initialization.
Flags:
- `--trials`: number of independent runs
- `--steps`: updates per phase (same semantics as the perceptron CLI)
- `--lr`: learning rate for each perceptron created
- `--out`: output directory

Outputs:
- `and_gate_trials.csv`: columns `trial,phases_to_converge,weights,bias`
- `and_gate_accuracy_curve.csv`: phase vs. average accuracy (%) across trials
- Optional plot `and_gate_learning_curve.png` if matplotlib is installed.
Interpretation: Distribution of phases quantifies stability; fewer phases ⇒ weights closer to separating hyperplane early.
Flat list of 9 integers: X=1, O=-1, empty=0. Row‑major ordering.
- X always moves first.
- Players strictly alternate (counts differ by at most 1; O never exceeds X).
- Search halts further expansion after first win (no post‑win moves included).
- Draw: board full with no winner.
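Hedged sketches of the rule helpers referenced by the enumeration below (the real versions live in `bs/tictactoe.py`):

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 1 (X) or -1 (O) if a line is complete, else 0."""
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def draw(board):
    """Board full with no winner."""
    return 0 not in board and not winner(board)

def next_player(board):
    """X moves first, so X plays whenever the counts are equal."""
    return 1 if board.count(1) == board.count(-1) else -1

def legal(board):
    """Strict alternation with X first: counts differ by at most 1."""
    x, o = board.count(1), board.count(-1)
    return o <= x <= o + 1
```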
A runnable version of the recursive enumeration (using the helpers above):
```python
def rec(board, visited):
    visited.add(tuple(board))                   # record reachable state (dedupes)
    if winner(board) or draw(board):            # terminal: no post-win moves
        return
    player = next_player(board)
    for i in range(9):                          # try each empty cell
        if board[i] != 0:
            continue
        board[i] = player
        if legal(board) and tuple(board) not in visited:
            rec(board, visited)
        board[i] = 0                            # backtrack
```
This prunes illegal turn orders and duplicated states, producing a compact supervised dataset of reachable positions.
- Binary: `[1]` if a win has already occurred, else `[0]`.
- Multi: one-hot over `[ongoing, draw, win]`.
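A labeling sketch under these conventions (reusing the `winner`/`draw` helpers above; the real helpers are in `bs/tictactoe.py`):

```python
def label_binary(board):
    """[1] once a win exists on the board, else [0]."""
    return [1] if winner(board) else [0]

def label_multi(board):
    """One-hot over [ongoing, draw, win]."""
    if winner(board):
        return [0, 0, 1]
    if draw(board):
        return [0, 1, 0]
    return [1, 0, 0]
```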
Performs a train/validation split, mini-batch gradient descent, and optional plotting & model saving.
Flags:
```
--mode {binary,multi}
--hidden HIDDEN_SIZE
--epochs N_EPOCHS
--lr LEARNING_RATE
--batch BATCH_SIZE (0 = full dataset)
--lr-decay DECAY_FRACTION
--val VALIDATION_RATIO
--target TARGET_TRAIN_ACCURACY (early stop)
--out OUTPUT_DIR
--no-plot (suppress plot)
--save-model MODEL_PATH.json
```
Examples:
```bash
# Binary classification (win vs not yet win)
python3 scripts/train_tictactoe_mlp.py --mode binary --hidden 64 --epochs 60 --lr 0.25 \
  --batch 128 --lr-decay 0.002 --target 0.95 --out results --save-model results/tictactoe_binary.json

# Multi-class (ongoing / draw / win)
python3 scripts/train_tictactoe_mlp.py --mode multi --hidden 96 --epochs 80 --lr 0.3 \
  --batch 256 --lr-decay 0.001 --target 0.90 --out results --save-model results/tictactoe_multi.json
```
Outputs:
- CSV: `tictactoe_<mode>_curve.csv` (epoch,loss,accuracy)
- Plot: `tictactoe_<mode>_learning.png` (loss & train accuracy) unless `--no-plot`
- JSON model (if `--save-model`)
- Console summary: final train & validation accuracy plus sample validation predictions.

Hidden layer deltas: `delta[L][j] = (Σ_k weights[L+1][k][j] * delta[L+1][k]) * a[L][j] * (1 - a[L][j])`.
Gradients: `grad_w[L][j][i] = delta[L][j] * a[L-1][i]`, `grad_b[L][j] = delta[L][j]`.
Batch update: the gradient is averaged over the batch and applied with learning rate `lr`.
Mini-batch: specify `batch_size` in `train_epoch`; if ≤ 0, the full dataset is used.
Learning rate decay: when `lr_decay` > 0, the learning rate is reduced each epoch.
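For example, a training driver might combine these pieces like so (illustrative only: `net`, `X`, `Y`, and the multiplicative decay scheme are assumptions; the actual code in `mlp.py` may differ):

```python
# Hypothetical driver: per-epoch decay plus accuracy-based early stopping
lr, lr_decay, target_accuracy = 0.3, 0.002, 0.95
for epoch in range(60):
    acc = train_epoch(net, X, Y, lr, batch_size=128)  # sketch from earlier
    if lr_decay > 0:
        lr *= 1.0 - lr_decay        # one common decay choice (assumed)
    if acc >= target_accuracy:
        break                       # early stop on training accuracy
```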
Flags:
- `--hidden`: hidden layer size
- `--lr`: learning rate
- `--epochs`: max epochs
- `--no-plot`: disable plot
- `--out`: directory for `xor_mlp_curve.csv` & optional plot
- `--save-model`: JSON output path
Example:
```bash
python3 scripts/train_xor_mlp.py --hidden 4 --lr 0.5 --epochs 5000 --out results --save-model results/xor_model.json
```
CSV columns: epoch,loss,accuracy (accuracy in [0,1]).
Grid of hidden sizes × learning rates × trials.
Flags: --hidden, --lr, --epochs, --trials, --out.
Output CSV: hidden,lr,trial,epochs_to_converge.
Optional scatter plot if matplotlib present.
```bash
python3 scripts/infer_mlp.py --model results/xor_model.json --input 0 1
```
What it does: loads a saved JSON model and runs a forward pass on a single input vector you provide on the command line.
Usage:
```bash
python3 scripts/infer_mlp.py --model <path/to/model.json> --input <space-separated numbers>
```
- Input length must match the model's input size (`layer_sizes[0]`).
- Tic-Tac-Toe inputs are 9 numbers with X=1, O=−1, empty=0 (row-major).
Helpful flags:
- `--labels`: when the output has 3 classes, also prints the class name (order `[ongoing, draw, win]`).
- `--ttt-gt`: for 9-value inputs, computes and prints the Tic-Tac-Toe ground-truth class using the rule helpers.
Output interpretation:
- Binary models: prints one sigmoid value in [0,1] and a line `Binary Thresholded=0|1` (threshold 0.5).
- Multi-class (Tic-Tac-Toe): prints the list of three sigmoid values and a line `Argmax Class=<index>`. The class order used by training is `[ongoing, draw, win]`. Values do not necessarily sum to 1 (not softmax).
Examples:
```
Input=[0, 1]
Raw Output=[0.842]
Binary Thresholded=1

Input=[0, 0, 0, 0, 1, 0, 0, 0, 0]
Raw Output=[0.903, 0.001, 0.099]
Argmax Class=0   # class 0 = "ongoing"
```
With labels and ground truth:
```bash
python3 scripts/infer_mlp.py --model results/tictactoe_multi.json \
  --input 1 0 1 -1 -1 -1 0 1 0 --labels --ttt-gt
```
```
Input=[1.0, 0.0, 1.0, -1.0, -1.0, -1.0, 0.0, 1.0, 0.0]
Raw Output=[...]
Argmax Class=0
Argmax Label=ongoing
GroundTruth Class=2 Label=win (0=ongoing,1=draw,2=win)
```
If you prefer probability‑like outputs that sum to 1, switch the final layer to softmax (or add a softmax option for inference).
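A minimal softmax sketch for inference-time normalization (a suggestion, not part of the current pipeline; note that rescaling sigmoid outputs this way is not equivalent to training with softmax):

```python
import math

def softmax(values):
    """Normalize raw outputs into probabilities that sum to 1."""
    m = max(values)                          # shift for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# softmax([0.903, 0.001, 0.099]) ≈ [0.54, 0.22, 0.24]
```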
Overlay curves across all CSVs in a directory.
Flags: --dir, --prefix (filter by filename prefix), --metric (loss|accuracy), --out.
Accuracy displayed as percentage (internally multiplies by 100).
```bash
python3 scripts/visualize_results.py --dir results --metric accuracy --out results/overlay_accuracy.png
```
Structure:
```json
{
  "layer_sizes": [n_in, h1, ..., n_out],
  "learning_rate": 0.3,
  "weights": [[[...]]],
  "biases": [[...]]
}
```
Persist with Python `json.dump`; load using `MLP.from_dict` (shape consistency validated):
```python
with open('model.json', 'w') as f:
    json.dump(mlp.to_dict(), f)
```
Example for the XOR model:
```json
{
  "layer_sizes": [2,4,1],
  "learning_rate": 0.5,
  "weights": [[[...],[...]], ...],
  "biases": [[...], ...]
}
```
Load with `MLP.from_dict()` via `infer_mlp.py`.
All training scripts store epoch-level rows: epoch,loss,accuracy (accuracy ∈ [0,1]). Visualization multiplies accuracy by 100 for plotting.
Automation script for AND gate uses different CSVs: and_gate_trials.csv & and_gate_accuracy_curve.csv.
- Simplicity prioritized: raw Python lists for matrices; clear loops over vectorized operations.
- MSE for classification: fine on tiny problems; replace with cross‑entropy + softmax for multi‑class robustness.
- Sigmoid everywhere: introduces potential saturation; depth kept minimal to mitigate vanishing gradients.
- No momentum / Adam / weight decay: educational focus on base gradient descent.
- Early stopping uses accuracy only; could add validation loss patience.
- Deterministic dataset generation ensures reproducible Tic‑Tac‑Toe splits (random only in shuffling & weight init).
- XOR does not converge: increase the hidden size (`--hidden 4` or `8`) or adjust `--lr` (0.3–0.8). An extremely high `--lr` may destabilize training.
- Loss plateaus: try lowering the learning rate or using a smaller batch (set `batch_size` < dataset length).
- NaN values: usually from large weights + a high `lr`; restart with a lower `lr`.
- Visualization finds no files: confirm the CSVs are in `results/` and the metric columns are present.
```bash
# Perceptron AND training
python3 bs/my_perceptron.py --new 2 --train --steps 40 data/and_gate.txt --save perceptron_and.json

# Perceptron XOR failure
python3 bs/my_perceptron.py --new 2 --train --steps 40 --max-phases 30 data/xor_gate.txt

# XOR MLP training
python3 scripts/train_xor_mlp.py --hidden 4 --lr 0.5 --epochs 5000 --out results --save-model results/xor_model.json

# XOR benchmarking
python3 scripts/benchmark_xor.py --hidden 2 3 4 5 --lr 0.3 0.5 0.8 --epochs 5000 --trials 3 --out results/xor_benchmark.csv

# Inference
python3 scripts/infer_mlp.py --model results/xor_model.json --input 0 1

# Visualization overlay
python3 scripts/visualize_results.py --dir results --metric accuracy --out results/overlay_accuracy.png
```
- Cross-entropy + softmax output layer.
- Confusion matrix + precision/recall (macro + per class).
- Momentum / Adam optimizer abstraction.
- Gradient clipping for stability on larger hidden layers.
- Adjustable initialization (Xavier/He) to reduce early saturation.
- Export to ONNX or a lightweight inference format.
- Add command to compute Tic‑Tac‑Toe board feature importance (per input perturbation).