Haydar Kilic | Artificial Intelligence Engineering
This repository contains Jupyter notebooks that reinforce the lecture slides of the Generative AI course with Python code. Each notebook derives the formulas covered in the lectures from scratch, visualises them, and applies them to real data scenarios.
| Lecture | Topic | Notebook |
|---|---|---|
| Lecture 1 | Generative Modelling Fundamentals | [GAI_Lecture1_Notebook.ipynb] |
| Lecture 2 | Derivation of Generative Models (MAP · MLE · Beta-Binomial · Dirichlet) | [GAI_Lecture2_Notebook.ipynb] |
| Lecture 3 | Deep Generative Models (VAE · GAN · GMMN · Diffusion) | [GAI_Lecture3_Notebook.ipynb] |
| Lecture 4 | Transformers and Large Language Models (Attention · RoPE · Mini GPT · Scaling) | [GAI_Lecture4_Notebook.ipynb] |
The table will be updated as new lectures are added.
Lecture 1: Generative Modelling Fundamentals
Section 1 — Core Concepts
- Handwritten digit recognition: 28×28 pixel vector representation, train/test/validation split
- Polynomial regression and curve fitting (Vandermonde matrix, Least Squares)
- Overfitting / Underfitting and RMS error analysis
- Ridge Regularisation (L2 penalty, λ hyperparameter)
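The curve-fitting topics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the notebook's own code: the synthetic `sin(2πx)` data, the noise level, and the helper names `fit_polynomial` and `rms_error` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): noisy samples of sin(2*pi*x).
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

def fit_polynomial(x, t, degree, lam=0.0):
    """Least-squares polynomial weights with an optional L2 (ridge) penalty."""
    X = np.vander(x, degree + 1, increasing=True)  # Vandermonde design matrix
    # Ridge written as an augmented least-squares problem:
    # minimise ||X w - t||^2 + lam * ||w||^2
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(degree + 1)])
    t_aug = np.concatenate([t, np.zeros(degree + 1)])
    w, *_ = np.linalg.lstsq(X_aug, t_aug, rcond=None)
    return w

def rms_error(x, t, w):
    """Root-mean-square error of the polynomial with weights w."""
    X = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((X @ w - t) ** 2))

w_plain = fit_polynomial(x, t, degree=9)            # interpolates the noise
w_ridge = fit_polynomial(x, t, degree=9, lam=1e-3)  # penalised weights

print("train RMS (no penalty):", rms_error(x, t, w_plain))
print("train RMS (ridge):     ", rms_error(x, t, w_ridge))
print("weight norms:", np.linalg.norm(w_plain), np.linalg.norm(w_ridge))
```

The unpenalised degree-9 fit drives the training error towards zero at the cost of very large weights, which is the overfitting regime; the λ penalty trades a little training error for much smaller weights.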
Section 2 — Probability Theory
- Joint, marginal and conditional probability distributions
- Bayes' theorem — medical diagnosis and base rate fallacy
- Gaussian (Normal) distribution: PDF, CDF, numerical verification
- Maximum Likelihood Estimation (MLE) and bias
- Bayesian updating: coin flip prior → posterior
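The base rate fallacy in the medical diagnosis example can be checked numerically. The sketch below assumes a hypothetical test with 1% prevalence, 99% sensitivity and 95% specificity; these numbers and the function name `posterior_disease` are illustrative, not taken from the notebook.

```python
def posterior_disease(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    # Marginal probability of a positive test (the evidence).
    evidence = p_pos_given_disease * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_disease * prior / evidence

p = posterior_disease(prior=0.01, sensitivity=0.99, specificity=0.95)
print(f"P(disease | positive) = {p:.3f}")
```

Even with an accurate test, the posterior stays far below the sensitivity because the prior (base rate) is small.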
Section 3 — Decision Theory
- Minimum-error decision boundaries and posterior probabilities
- Reject Option and threshold θ
- Asymmetric loss matrix (medical diagnosis scenario)
- Generative / Discriminative / Discriminant model comparison
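A minimal sketch of the decision rules listed above, assuming a hypothetical 2×2 loss matrix in which missing a sick patient costs 100 times more than a false alarm. The names `LOSS` and `decide`, and the convention of returning `-1` for a rejected input, are assumptions for the example.

```python
import numpy as np

# Hypothetical loss matrix LOSS[true_class, decision]; classes: 0 = healthy, 1 = sick.
LOSS = np.array([[0.0, 1.0],
                 [100.0, 0.0]])

def decide(posterior, loss, theta=None):
    """Return the decision with minimum expected loss, or -1 to reject.

    posterior: p(C_k | x) for each class k.
    theta: reject threshold; reject when the largest posterior is below it.
    """
    posterior = np.asarray(posterior, dtype=float)
    if theta is not None and posterior.max() < theta:
        return -1
    expected_loss = posterior @ loss  # one expected loss per decision
    return int(np.argmin(expected_loss))

zero_one = 1.0 - np.eye(2)                      # minimum-error (0/1) loss
print(decide([0.95, 0.05], zero_one))           # picks the most probable class
print(decide([0.95, 0.05], LOSS))               # asymmetric loss changes the decision
print(decide([0.60, 0.40], zero_one, theta=0.8))  # too uncertain, rejected
```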
Lecture 2: Derivation of Generative Models
Section 1 — Learning from Positive Examples & The Number Game
- Concept learning = binary classification; posterior predictive distribution
- Strong sampling assumption: p(D|h) = (1/|h|)^N
- Size Principle: narrow hypothesis → high likelihood
- Prior, likelihood and posterior computation; Bayesian updating
- MAP estimation and N → ∞ behaviour (Dirac convergence)
- Bayesian Model Averaging (BMA) vs. Plug-In approach
- Mixture prior (π₀ parameter): rule-based vs. interval-based hypotheses
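The size principle can be illustrated with a toy version of the Number Game. The three hypotheses below, the uniform prior, and the function name `number_game_posterior` are assumptions chosen for the example; the notebook's hypothesis space is likely larger.

```python
# Hypothetical hypothesis space over the integers 1..100.
hypotheses = {
    "even numbers": [n for n in range(1, 101) if n % 2 == 0],
    "powers of two": [2 ** k for k in range(1, 7)],  # 2, 4, ..., 64
    "multiples of four": list(range(4, 101, 4)),
}

def number_game_posterior(data, hypotheses, prior=None):
    """Posterior p(h | D) under strong sampling: p(D | h) = (1 / |h|)^N."""
    names = list(hypotheses)
    if prior is None:
        prior = {h: 1.0 / len(names) for h in names}  # uniform prior (assumed)
    unnormalised = {}
    for h in names:
        extension = set(hypotheses[h])
        consistent = all(x in extension for x in data)
        likelihood = (1.0 / len(extension)) ** len(data) if consistent else 0.0
        unnormalised[h] = prior[h] * likelihood
    z = sum(unnormalised.values())
    return {h: v / z for h, v in unnormalised.items()}

post = number_game_posterior([16, 8, 2, 64], hypotheses)
for h, p in post.items():
    print(f"{h:18s} {p:.4f}")
```

Both "even numbers" and "powers of two" are consistent with the data, but the smaller hypothesis receives a much higher likelihood and therefore dominates the posterior.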
Section 2 — Beta-Binomial Model
- Bernoulli likelihood and sufficient statistics (N₁, N₀)
- Beta distribution: conjugate prior, various (a, b) parameters
- Sequential Bayesian updating: Beta(a,b) → Beta(N₁+a, N₀+b)
- MLE, MAP and posterior mean formulas; convergence as N grows
- Zero Count Problem and Laplace's rule of succession
- Posterior variance and credible interval: σ ∝ 1/√N
- Compound Beta-Binomial distribution: prediction of future trials
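The conjugate update and the three point estimates above fit in a short sketch. The function names and the example counts are assumptions; the formulas are the standard ones for a Beta(a, b) prior with N₁ successes and N₀ failures.

```python
def beta_posterior(n1, n0, a, b):
    """Conjugate update: Beta(a, b) prior -> Beta(n1 + a, n0 + b) posterior."""
    return n1 + a, n0 + b

def point_estimates(n1, n0, a, b):
    """Return (MLE, MAP, posterior mean) for the Bernoulli parameter theta."""
    a_post, b_post = beta_posterior(n1, n0, a, b)
    mle = n1 / (n1 + n0)
    map_est = (a_post - 1) / (a_post + b_post - 2)  # posterior mode
    mean = a_post / (a_post + b_post)
    return mle, map_est, mean

# Zero count problem: three trials, no successes observed.
mle, map_est, mean = point_estimates(n1=0, n0=3, a=2, b=2)
print(mle, map_est, mean)

# Laplace's rule of succession is the posterior mean under a uniform Beta(1, 1) prior.
_, _, laplace = point_estimates(n1=0, n0=3, a=1, b=1)
print(laplace)  # (n1 + 1) / (N + 2)
```

The MLE assigns zero probability to an event that simply has not been seen yet, while the MAP estimate and posterior mean stay strictly positive for this prior.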
Section 3 — Dirichlet-Multinomial
- Multinomial likelihood and Dirichlet prior
- Visualisation of the K=3 probability simplex (barycentric coordinates)
- Dirichlet-Multinomial update and posterior prediction
- Add-K smoothing (β): MLE → Laplace → uniform
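Add-K smoothing is the posterior predictive of a Dirichlet-Multinomial model with a symmetric Dirichlet(β) prior. A minimal sketch, with the function name `posterior_predictive` and the example counts chosen for illustration:

```python
import numpy as np

def posterior_predictive(counts, beta=1.0):
    """p(x = k | D) = (N_k + beta) / (N + K * beta) for a symmetric Dirichlet prior."""
    counts = np.asarray(counts, dtype=float)
    return (counts + beta) / (counts.sum() + beta * counts.size)

counts = [3, 0, 1]                               # K = 3 categories, one unseen
p_mle = posterior_predictive(counts, beta=0.0)   # plain MLE
p_laplace = posterior_predictive(counts, beta=1.0)
p_large = posterior_predictive(counts, beta=1e6) # approaches the uniform distribution
print(p_mle, p_laplace, p_large)
```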
Section 4 — Mixture Model
- Effect of the π₀ parameter on the posterior predictive distribution
Section 5 — MLE vs MAP vs Bayes Comparison
- Error analysis, convergence of θ estimates with N
Lecture 3: Deep Generative Models
Section 1 — Probabilistic Framework & MLE
- Real data simulation with a 2D Gaussian mixture
- Gaussian log-likelihood function
- MLE vs. bad model comparison
Section 2 — KL Divergence
- Closed-form Gaussian KL computation
- KL asymmetry: KL(p‖q) ≠ KL(q‖p)
- MLE ≡ KL minimisation relationship
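For two univariate Gaussians the KL divergence has a closed form, which makes the asymmetry easy to verify. The sketch below covers only the 1D case; the function name `kl_gaussian` and the example parameters are assumptions.

```python
import math

def kl_gaussian(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for p = N(mu_p, sigma_p^2) and q = N(mu_q, sigma_q^2)."""
    return (math.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)

kl_pq = kl_gaussian(0.0, 1.0, 1.0, 2.0)
kl_qp = kl_gaussian(1.0, 2.0, 0.0, 1.0)
print(f"KL(p||q) = {kl_pq:.4f}, KL(q||p) = {kl_qp:.4f}")
```

Swapping the arguments changes both the log-ratio term and the variance in the denominator, so the two directions generally give different values.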
Section 3 — Latent Space & Manifold Hypothesis
- MNIST: 784 pixels → ~10-dimensional manifold (PCA variance analysis)
- Latent space visualisation via 2D PCA projection
- Latent space arithmetic: z(7) − z(1) + z(0) ≈ z(6)
Section 4 — ELBO Derivation
- Closed-form KL computation and heat map
- Balance between reconstruction and KL terms
Section 5 — Variational Autoencoder (VAE)
- Encoder–Decoder architecture, Reparametrisation Trick
- Gradient flow diagram (why backprop works)
- Training on MNIST; 2D latent space visualisation
- β-VAE: KL regularisation effect; Posterior Collapse problem
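The reparametrisation trick and the closed-form KL term of the ELBO can be sketched without a deep learning framework. This NumPy version only shows the sampling and the KL computation, assuming a diagonal Gaussian encoder and a standard normal prior; it does not perform backpropagation, and the function names are illustrative.

```python
import numpy as np

def reparameterise(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives in eps, so z is a deterministic, differentiable
    function of mu and log_var.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

rng = np.random.default_rng(0)
mu = np.array([[0.0, 1.0]])
log_var = np.array([[0.0, -2.0]])
z = reparameterise(mu, log_var, rng)
print("z:", z, "KL:", kl_to_standard_normal(mu, log_var))
```

In a β-VAE the KL term would be multiplied by β before being added to the reconstruction loss.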
Section 6 — Generative Adversarial Networks (GAN)
- Generator + Discriminator architecture (LeakyReLU, BatchNorm)
- Optimal Discriminator formula and Nash equilibrium visualisation
- MNIST training; G/D loss curves and mode-collapse discussion
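For a fixed generator, the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)). The sketch below evaluates it for 1D Gaussians; the densities and helper names are assumptions made for the example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def optimal_discriminator(x, p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    pd, pg = p_data(x), p_g(x)
    return pd / (pd + pg)

x = np.linspace(-4.0, 4.0, 9)  # -4, -3, ..., 4

def p_data(x):
    return gaussian_pdf(x, 0.0, 1.0)

def p_g_shifted(x):
    return gaussian_pdf(x, 1.0, 1.0)

d_mismatch = optimal_discriminator(x, p_data, p_g_shifted)
d_equilibrium = optimal_discriminator(x, p_data, p_data)  # generator matches the data
print(d_mismatch.round(3))
print(d_equilibrium)
```

When the generator distribution equals the data distribution, D* is 1/2 everywhere, which corresponds to the equilibrium of the minimax game.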
Section 7 — GMMN & MMD
- Gaussian (RBF) kernel and MMD² computation (multi-scale)
- MMD intuition test: same / nearby / distant distributions
- Discriminator-free GMMN training (MMD loss only)
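A rough sketch of the multi-scale MMD² computation and the same / nearby / distant test. It uses the biased estimator and an assumed set of bandwidths `(0.5, 1.0, 2.0)`; the notebook may use different scales or the unbiased estimator.

```python
import numpy as np

def rbf_kernel(x, y, sigma):
    """Gaussian kernel matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / (2 sigma^2))."""
    sq_dist = (np.sum(x ** 2, axis=1)[:, None]
               + np.sum(y ** 2, axis=1)[None, :]
               - 2.0 * x @ y.T)
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma ** 2))

def mmd2(x, y, sigmas=(0.5, 1.0, 2.0)):
    """Biased estimate of MMD^2 using a sum of RBF kernels at several scales."""
    total = 0.0
    for s in sigmas:
        total += (rbf_kernel(x, x, s).mean()
                  + rbf_kernel(y, y, s).mean()
                  - 2.0 * rbf_kernel(x, y, s).mean())
    return total

rng = np.random.default_rng(0)
x = rng.standard_normal((300, 2))
same = mmd2(x, rng.standard_normal((300, 2)))
nearby = mmd2(x, rng.standard_normal((300, 2)) + 0.5)
distant = mmd2(x, rng.standard_normal((300, 2)) + 3.0)
print(same, nearby, distant)
```

In GMMN training this quantity, computed between generated and real batches, would serve as the only loss.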
Section 8 — Diffusion Models (DDPM)
- Forward process: β schedule, closed-form q(x_t|x_0)
- SimpleUNet: time embedding + skip-connection noise estimator
- DDPM training (MSE loss) and reverse process sampling
- Step-by-step denoising visualisation
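The closed-form forward process q(x_t | x_0) = N(√ᾱ_t · x_0, (1 − ᾱ_t) I) can be sketched as follows. The linear β schedule from 1e-4 to 0.02 over 1000 steps is an assumption (a common default), and the function name `q_sample` is illustrative.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # assumed linear beta schedule
alphas_bar = np.cumprod(1.0 - betas)  # cumulative product of (1 - beta_t)

def q_sample(x0, t, rng):
    """Draw x_t ~ q(x_t | x_0) in one step and return the noise that was used."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 2))
x_t, eps = q_sample(x0, t=500, rng=rng)
print("signal weight at t=500:", np.sqrt(alphas_bar[500]))
print("signal weight at t=T-1:", np.sqrt(alphas_bar[-1]))
```

During training, the returned `eps` is the regression target: the noise estimator receives `x_t` and `t` and is trained with an MSE loss to predict it.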
Section 9 — Model Comparison & FID
- Fréchet Inception Distance computation (PCA feature space)
- Radar chart: Quality / Diversity / Speed / Stability / Latent Control
- Generative model chronology (1985–2022)
- Comprehensive comparison table
Lecture 4: Transformers and Large Language Models
Section 1 — RNN vs Transformer: Vanishing Gradients
- Simulation of |dL/dh_t| ≈ |W_hh|^(T-t) exponential decay in simple RNNs
- Vanishing / stable / exploding regimes (|W_hh| = 0.85 / 1.00 / 1.15)
- Transformer O(1) connection distance: direct access to every token pair
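The three regimes above follow directly from the scalar approximation |dL/dh_t| ≈ |W_hh|^(T−t). A tiny sketch, with the function name and the sequence length T = 50 chosen for illustration:

```python
import numpy as np

def gradient_magnitude(w_hh, T):
    """Approximate gradient scale |w_hh| ** (T - t) for distances T - t = 0 .. T-1."""
    distance = np.arange(T)
    return np.abs(w_hh) ** distance

T = 50
# Gradient scale reaching the earliest time step in each regime.
regimes = {w: gradient_magnitude(w, T)[-1] for w in (0.85, 1.00, 1.15)}
for w, g in regimes.items():
    print(f"|W_hh| = {w:.2f} -> gradient scale {g:.3e}")
```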
Section 2 — Encoder–Decoder and the Information Bottleneck
- Cosine similarity loss at different sequence lengths with a GRU encoder
- RNN Enc-Dec single-vector bottleneck vs. Attention context vector comparison
- Visual explanation of c_t = Σ α_{t,i} · h_i
Section 3 — Bahdanau (Additive) Attention Mechanism
- From-scratch BahdanauAttention: W_s, W_h, v parameterised scoring
- e_{t,i} = vᵀ tanh(W_s·s_{t-1} + W_h·h_i) → softmax → context vector
- English→German translation simulation: 4×4 attention heatmap
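The additive scoring rule e_{t,i} = vᵀ tanh(W_s·s_{t-1} + W_h·h_i) can be written in NumPy as below. This is a sketch under assumed shapes with random weights, not the notebook's `BahdanauAttention` class.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def bahdanau_attention(s_prev, H, W_s, W_h, v):
    """Additive attention.

    s_prev: (d_s,) previous decoder state, H: (T, d_h) encoder states,
    W_s: (d_a, d_s), W_h: (d_a, d_h), v: (d_a,).
    Returns the context vector c_t (d_h,) and the weights alpha (T,).
    """
    scores = np.tanh(s_prev @ W_s.T + H @ W_h.T) @ v  # e_{t,i}, shape (T,)
    alpha = softmax(scores)
    context = alpha @ H                               # c_t = sum_i alpha_i * h_i
    return context, alpha

rng = np.random.default_rng(0)
T_src, d_s, d_h, d_a = 4, 6, 5, 8   # assumed sizes
context, alpha = bahdanau_attention(
    rng.standard_normal(d_s),
    rng.standard_normal((T_src, d_h)),
    rng.standard_normal((d_a, d_s)),
    rng.standard_normal((d_a, d_h)),
    rng.standard_normal(d_a),
)
print(alpha, context.shape)
```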
Section 4 — Scaled Dot-Product Attention (Q, K, V)
- Attention(Q,K,V) = softmax(QK^T / √d_k) · V step-by-step implementation
- Importance of √d_k scaling: entropy analysis (unscaled softmax collapses as d_k grows)
- Dimension analysis: (B, T, d_model) → Q/K/V → (B, T, d_k) → Z
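A minimal NumPy sketch of scaled dot-product attention on batched inputs. The function signature and the optional boolean `mask` argument are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: (B, T, d_k). mask: boolean, True where attention is allowed;
    every row must keep at least one allowed position.
    """
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (B, T, T)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
B, T, d_k = 2, 4, 8
Q, K, V = (rng.standard_normal((B, T, d_k)) for _ in range(3))
Z, weights = scaled_dot_product_attention(Q, K, V)
print(Z.shape, weights.sum(axis=-1))
```

Without the 1/√d_k factor the variance of the scores grows with d_k, which pushes the softmax towards a one-hot distribution.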
Section 5 — Multi-Head Attention
- Single large W_q/W_k/W_v matrix approach; split_heads → (B, n_heads, T, d_k)
- 4-head attention maps: Position / Syntax / Semantics / Distance
- Parameter analysis: 4 × d_model² weights
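The `split_heads` reshape mentioned above can be sketched in NumPy. The name `split_heads` comes from the list; the signature, the inverse `merge_heads`, and the sizes are assumptions.

```python
import numpy as np

def split_heads(x, n_heads):
    """(B, T, d_model) -> (B, n_heads, T, d_k) with d_k = d_model // n_heads."""
    B, T, d_model = x.shape
    d_k = d_model // n_heads
    return x.reshape(B, T, n_heads, d_k).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(B, n_heads, T, d_k) -> (B, T, n_heads * d_k); inverse of split_heads."""
    B, n_heads, T, d_k = x.shape
    return x.transpose(0, 2, 1, 3).reshape(B, T, n_heads * d_k)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 16))  # B = 2, T = 5, d_model = 16
heads = split_heads(x, n_heads=4)
print(heads.shape)
```

With one large projection matrix per Q, K and V, the heads are just slices of the projected tensor, so splitting costs a reshape rather than extra parameters.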
Section 6 — Positional Encoding (Sinusoidal, RoPE, ALiBi)
- PE_{pos,2i} = sin(pos/10000^{2i/d}), PE_{pos,2i+1} = cos(…): matrix visualisation
- Wave frequencies: low dimension = high frequency; PE similarity matrix
- RoPE: relative positional encoding via 2D rotation; q^T_m k_n ∝ f(m-n)
- ALiBi: e_{ij} = q_i^Tk_j − m·|i−j| linear penalty; per-head slope m_h = 2^{−8h/n_heads} for head h
- Comparison table: Sinusoidal / Learned / RoPE / ALiBi
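The three encodings above can each be sketched in a few lines. The function names, the 2D RoPE case with a single assumed frequency `theta`, and the head indexing from 1 to `n_heads` are illustrative choices, not the notebook's implementation.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]  # even dimension indices 2i
    angles = pos / (10000.0 ** (two_i / d_model))
    pe = np.zeros((max_len, d_model))          # d_model assumed even
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope_rotate(x, pos, theta=0.1):
    """Rotate a 2D vector by the angle pos * theta (one RoPE frequency pair)."""
    c, s = np.cos(pos * theta), np.sin(pos * theta)
    return np.array([[c, -s], [s, c]]) @ x

def alibi_bias(n_heads, T):
    """Per-head linear distance penalty -m_h * |i - j|, shape (n_heads, T, T)."""
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    distance = np.abs(np.arange(T)[:, None] - np.arange(T)[None, :])
    return -slopes[:, None, None] * distance

pe = sinusoidal_pe(max_len=50, d_model=16)
q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
# The rotated dot product depends only on the offset between the positions.
print(rope_rotate(q, 3) @ rope_rotate(k, 7), rope_rotate(q, 10) @ rope_rotate(k, 14))
print(alibi_bias(n_heads=8, T=4)[0])
```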
Section 7 — Feed-Forward Network & Activation Functions
- ReLU → GELU → Swish/SiLU → SwiGLU(x,W,V) = Swish(xW) ⊙ xV
- Gradient analysis: dead neuron problem in ReLU for x<0 region
- d_ff = 4×d_model expansion rule and FFN parameter growth
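The activation chain above can be sketched in NumPy. GELU is written with the common tanh approximation, which is an assumption about which variant is intended; the weight shapes in the SwiGLU example are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    """GELU, tanh approximation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def swish(x):
    """Swish / SiLU: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu(x, W, V):
    """SwiGLU(x, W, V) = Swish(xW) * (xV), elementwise product."""
    return swish(x @ W) * (x @ V)

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32   # d_ff = 4 * d_model
x = rng.standard_normal((3, d_model))
W = rng.standard_normal((d_model, d_ff))
V = rng.standard_normal((d_model, d_ff))
print(swiglu(x, W, V).shape)
print(relu(np.array([-2.0])), gelu(np.array([-2.0])), swish(np.array([-2.0])))
```

For negative inputs ReLU returns exactly zero with zero gradient, whereas GELU and Swish keep a small non-zero response.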
Section 8 — Layer Normalization: LayerNorm vs RMSNorm / Pre-LN vs Post-LN
- LN(x) = γ·(x−μ)/√(σ²+ε)+β vs. RMSNorm(x) = γ·x/RMS(x) (no β, ~10% faster)
- std/mean comparison at different input scales
- Pre-LN (modern) vs Post-LN (original): gradient distribution histogram
- BN vs LN vs RMSNorm: preference analysis in sequence models
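The two normalisation formulas can be compared directly. A minimal NumPy sketch, with `eps = 1e-5` and scalar γ and β assumed for brevity:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """LN(x) = gamma * (x - mu) / sqrt(var + eps) + beta over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma=1.0, eps=1e-5):
    """RMSNorm(x) = gamma * x / RMS(x); no mean subtraction and no beta."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)) + 3.0   # non-zero mean input
for scale in (0.1, 1.0, 10.0):
    ln, rn = layer_norm(scale * x), rms_norm(scale * x)
    print(f"scale {scale:5.1f}: LN mean {ln.mean():+.3f} std {ln.std():.3f} | "
          f"RMSNorm mean {rn.mean():+.3f} rms {np.sqrt((rn ** 2).mean()):.3f}")
```

LayerNorm removes the mean, while RMSNorm only rescales, so its output keeps the sign and relative offset of the input.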
Section 9 — Attention Masking: Full vs Causal
- make_full_mask (Bidirectional): BERT/RoBERTa — every token attends to every other
- make_causal_mask (lower triangular): GPT — only past visible, future −∞
- Masking → model family → task matching table (Encoder / Decoder / Enc-Dec)
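The two masks can be sketched as boolean matrices. The function names come from the list above, but the NumPy signatures and the `masked_softmax` helper are assumptions, not the notebook's code.

```python
import numpy as np

def make_full_mask(T):
    """Bidirectional mask: every token may attend to every other token."""
    return np.ones((T, T), dtype=bool)

def make_causal_mask(T):
    """Lower-triangular mask: position i may attend to positions j <= i only."""
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    """Set disallowed scores to -inf, then apply a row-wise softmax."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

T = 4
scores = np.zeros((T, T))   # uniform scores make the masking effect visible
print(masked_softmax(scores, make_full_mask(T)))
print(masked_softmax(scores, make_causal_mask(T)))
```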
Section 10 — Full Transformer Block (From-Scratch Implementation)
- TransformerEncoderBlock: Pre-LN + MHA + FFN + Residual
- TransformerEncoder: N layers, learned PE, final LayerNorm
- Parameter analysis for 3 model configurations (Small / BERT-mini / BERT-base)
- #params ≈ 12 × N × d²_model estimation formula
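The 12 × N × d²_model rule comes from 4·d² attention weights plus 8·d² feed-forward weights per block when d_ff = 4·d_model. A small sketch that ignores biases, LayerNorm parameters and embeddings; the function names are illustrative.

```python
def block_params(d_model, d_ff=None):
    """Weight count of one Transformer block, biases and norms ignored."""
    if d_ff is None:
        d_ff = 4 * d_model
    attention = 4 * d_model ** 2       # W_q, W_k, W_v, W_o
    feed_forward = 2 * d_model * d_ff  # up and down projections
    return attention + feed_forward

def approx_transformer_params(n_layers, d_model):
    """#params ~ 12 * N * d_model^2."""
    return 12 * n_layers * d_model ** 2

# BERT-base sized configuration: N = 12 layers, d_model = 768.
print(f"{approx_transformer_params(12, 768):,}")
```

For the BERT-base configuration this gives roughly 85M weights in the blocks; the token and position embeddings are not included in the estimate.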
Section 11 — Mini GPT: Character-Level Language Model
- GPTDecoderBlock: Causal MHA + Pre-LN + FFN
- MiniGPT: tok_emb + pos_emb + 3 decoder blocks + lm_head (weight tying)
- Autoregressive generate(): top-k sampling + temperature control
- 500-step training on Turkish text: loss curve + attention map
- Generated text samples at different temperatures (0.5 / 1.0 / 1.5)
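One step of top-k sampling with temperature can be sketched as follows. The function name `sample_next_token` and the example logits are assumptions; the notebook's `generate()` presumably wraps a similar step in an autoregressive loop.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, rng=None):
    """Sample a token id from temperature-scaled, optionally top-k filtered logits."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(logits, dtype=float) / temperature
    if top_k is not None:
        kth_largest = np.sort(logits)[-top_k]
        # Keep the top_k largest logits (ties at the threshold are also kept).
        logits = np.where(logits >= kth_largest, logits, -np.inf)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

rng = np.random.default_rng(0)
logits = [2.0, 1.0, 0.5, -1.0]
for temperature in (0.5, 1.0, 1.5):
    token, probs = sample_next_token(logits, temperature, top_k=2, rng=rng)
    print(temperature, token, probs.round(3))
```

Lower temperatures sharpen the distribution towards the most likely token, and higher temperatures flatten it, which matches the more conservative versus more varied text samples.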
Section 12 — Hyperparameter Analysis & Scaling Laws
- Real LLM table: BERT-base/large, GPT-2, GPT-3, LLaMA-2 7B/70B
- Scaling law: L ∝ N^{−0.076} log-log visualisation
- d_model vs number of heads (d_k = d_model/h ≈ 64–128 rule)
- GPT vs BERT comparison table: architecture, task, context, usage
- Modern LLM block: RMSNorm + Pre-LN + SwiGLU + RoPE
# Clone the repository
git clone https://github.com/HAYDARKILIC/generative_artificial_intelligence
cd generative_artificial_intelligence
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux/macOS
# venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
# Launch Jupyter
jupyter notebook

numpy>=2.0
matplotlib>=3.7
scipy>=1.11
scikit-learn>=1.3
jupyter>=1.0
ipykernel>=6.0
torch>=2.0
torchvision>=0.15
tqdm>=4.65
The `requirements.txt` file is included in the repository.
⚠️ `torch` and `torchvision` are required from Lecture 3 onwards. For GPU support, select a CUDA-compatible version at pytorch.org.
generative-ai/
├── README.md
├── requirements.txt
├── GAI_Lecture1_Notebook.ipynb # Lecture 1 — Generative Modelling Fundamentals
├── GAI_Lecture2_Notebook.ipynb # Lecture 2 — MAP · MLE · Beta-Binomial · Dirichlet
├── GAI_Lecture3_Notebook.ipynb # Lecture 3 — VAE · GAN · GMMN · Diffusion
├── GAI_Lecture4_Notebook.ipynb # Lecture 4 — Transformer · Attention · Mini GPT · LLM
└── (future lecture notebooks will be added here)
Pattern Recognition and Machine Learning – Christopher M. Bishop (1st Ed., 2006), Ch. 1–2
Machine Learning: A Probabilistic Perspective – Kevin P. Murphy (1st Ed., 2012), Ch. 3
Deep Learning – Goodfellow, Bengio, Courville (1st Ed., 2016), Ch. 20.10.3
Deep Learning – Goodfellow, Bengio, Courville (1st Ed., 2016), Ch. 20.10.4
Probabilistic Machine Learning: Advanced Topics – Kevin P. Murphy (1st Ed., 2023), Ch. 25
Natural Language Processing with Transformers – Lewis et al. (1st Ed., 2022), Ch. 1–2
Speech and Language Processing – Jurafsky & Martin (3rd Ed., draft), Ch. 3, 10, 11
Generative AI — Haydar Kılıç, Artificial Intelligence Engineering