
AutoDiff: A Comprehensive Tutorial

This tutorial provides a comprehensive guide to understanding the AutoDiff framework for automatic differentiation, covering its design principles, implementation details, and practical applications.

Table of Contents

  1. Introduction to Automatic Differentiation
    1. What is a Derivative and Why Do We Need It?
    2. A Simple Example: Understanding the Chain Rule
  2. Design Principles: SOLID in AutoDiff
    1. Single Responsibility Principle (SRP)
    2. Open-Closed Principle (OCP)
    3. Liskov Substitution Principle (LSP)
    4. Interface Segregation Principle (ISP)
    5. Dependency Inversion Principle (DIP)
  3. Core Concepts of Automatic Differentiation
    1. Forward Mode vs. Reverse Mode
    2. The Dual Number Concept
  4. Expression System (Stage 1)
    1. Expression Representation
    2. Practical Application: Newton's Method
  5. Dual Numbers (Bridge Component)
    1. Practical Application: Taylor Series Approximation
  6. Computational Graph (Stage 2)
    1. Practical Application: Gradient Descent Optimization
  7. Forward Mode Implementation
    1. Practical Application: Computing Partial Derivatives
  8. Reverse Mode Implementation
    1. Practical Application: Neural Network Layer
  9. Control Flow Differentiation
    1. Practical Application: Differentiating Through a Conditional (ReLU)
  10. Expression Optimization
    1. Practical Application: Optimizing a Complex Expression
  11. Benchmark Analysis
    1. Understanding the Benchmark
    2. Simplified Benchmark Implementation
    3. Benchmark Results Analysis
    4. Practical Implications
  12. Advanced Applications
    1. Constrained Optimization Using Lagrange Multipliers
  13. Conclusion

1. Introduction to Automatic Differentiation

Automatic differentiation (AD) is a computational technique for calculating exact derivatives of functions without resorting to numerical approximations or manual derivation.

1.1 What is a Derivative and Why Do We Need It?

In many scientific and engineering applications, we need to compute derivatives of functions:

  • In optimization, derivatives help us find the minimum or maximum of functions
  • In machine learning, gradients guide the training of neural networks
  • In physics simulations, derivatives represent rates of change in physical systems

Computing derivatives by hand can be tedious and error-prone, while numerical approximations (like finite differences) suffer from precision issues. Automatic differentiation solves these problems by providing a way to compute exact derivatives algorithmically.

1.2 A Simple Example: Understanding the Chain Rule

Consider a simple function: f(x) = sin(x²).

To compute the derivative manually, we apply the chain rule:

  1. Let g(x) = x² and h(g) = sin(g)
  2. Apply the chain rule: f'(x) = h'(g) * g'(x)
  3. Compute the individual derivatives: h'(g) = cos(g) and g'(x) = 2x
  4. Substitute back: f'(x) = cos(x²) * 2x

While this is manageable for simple functions, it becomes increasingly complex for nested functions with multiple variables.
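
To see why finite differences are not a substitute, the short standalone program below (plain C++ using only <cmath> and <iostream>, no AutoDiff headers) compares the exact derivative cos(x²) · 2x with a forward finite-difference approximation at several step sizes:

#include <cmath>
#include <iostream>

int main() {
    double x = 1.5;
    double exact = std::cos(x * x) * 2.0 * x;  // f'(x) = cos(x²) · 2x

    // Forward finite difference: (f(x + h) - f(x)) / h
    for (double h : {1e-2, 1e-6, 1e-12}) {
        double approx = (std::sin((x + h) * (x + h)) - std::sin(x * x)) / h;
        std::cout << "h = " << h
                  << ", approximation = " << approx
                  << ", error = " << std::abs(approx - exact) << std::endl;
    }

    return 0;
}

No step size recovers the exact value: a large h incurs truncation error, while a very small h loses digits to floating-point cancellation. Automatic differentiation sidesteps this trade-off by applying the chain rule exactly.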

2. Design Principles: SOLID in AutoDiff

The AutoDiff framework is built around the SOLID principles of object-oriented software design.

2.1 Single Responsibility Principle (SRP)

Each component in AutoDiff has a clear, focused responsibility:

  • Expression classes represent mathematical expressions
  • DualNumber handles forward-mode differentiation
  • GraphNode manages computational graph operations
  • ExpressionOptimizer focuses exclusively on expression optimization

Example:

#include "expression.h"

// Creating a variable - its sole responsibility is to represent a variable in an expression
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);

// Setting its value
x->set_value(3.0);

2.2 Open-Closed Principle (OCP)

The framework is designed to be open for extension but closed for modification:

  • New mathematical operations can be added by deriving from BinaryOperation or UnaryOperation
  • New optimizations can be implemented by extending the ExpressionOptimizer
  • Custom functions can be integrated through the CustomFunctionNode interface

Example of adding a new unary operation:

#include "expression.h"

// Define a new operation (CustomFunction)
template <typename T>
class CustomFunction : public ad::expr::UnaryOperation<T> {
public:
    using ad::expr::UnaryOperation<T>::UnaryOperation;

    T evaluate() const override {
        T x = this->operand_->evaluate();
        return x * x + x; // f(x) = x² + x
    }

    ad::expr::ExprPtr<T> differentiate(const std::string& variable) const override {
        // f'(x) = 2x + 1
        auto term1 = std::make_unique<ad::expr::Multiplication<T>>(
            std::make_unique<ad::expr::Constant<T>>(2),
            this->operand_->clone()
        );
        auto term2 = std::make_unique<ad::expr::Constant<T>>(1);
        auto deriv = std::make_unique<ad::expr::Addition<T>>(std::move(term1), std::move(term2));
        
        // Chain rule: multiply by derivative of inner expression
        return std::make_unique<ad::expr::Multiplication<T>>(
            std::move(deriv),
            this->operand_->differentiate(variable)
        );
    }

    ad::expr::ExprPtr<T> clone() const override {
        return std::make_unique<CustomFunction<T>>(this->operand_->clone());
    }

    ad::expr::ExprPtr<T> clone_with(ad::expr::ExprPtr<T> new_operand) const override {
        return std::make_unique<CustomFunction<T>>(std::move(new_operand));
    }

    std::string func_name() const override { return "custom_func"; }
};

2.3 Liskov Substitution Principle (LSP)

Objects in AutoDiff are replaceable with their subtypes without affecting program correctness:

  • All expressions follow the Expression interface contract
  • All operations provide consistent differentiation behavior
  • Graph nodes follow a consistent interface for forward/backward passes

Example:

#include "expression.h"

// Function that works with any Expression type
template <typename T>
T evaluate_and_differentiate(const ad::expr::Expression<T>& expr, const std::string& var) {
    T value = expr.evaluate();
    T derivative = expr.differentiate(var)->evaluate();
    return value + derivative;
}

// Usage with different expression types
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
auto sin_x = std::make_unique<ad::expr::Sin<double>>(x->clone());
auto x_squared = std::make_unique<ad::expr::Pow<double>>(
    x->clone(),
    std::make_unique<ad::expr::Constant<double>>(2)
);

// Both work with the same function
double result1 = evaluate_and_differentiate(*sin_x, "x");
double result2 = evaluate_and_differentiate(*x_squared, "x");

2.4 Interface Segregation Principle (ISP)

Interfaces in AutoDiff are focused and minimal:

  • Expression provides only essential methods for evaluation and differentiation
  • GraphNode separates forward and backward passes
  • Separate mechanisms for expressing mathematical relationships vs. computing gradients

Example:

#include "expression.h"

// BinaryOperation only defines what's needed for a binary operation
// Use case: Adding two expressions
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
auto y = std::make_unique<ad::expr::Variable<double>>("y", 3.0);
auto sum = std::make_unique<ad::expr::Addition<double>>(x->clone(), y->clone());

// clone_with method provides a clean way to create a new binary operation
// with modified operands
auto z = std::make_unique<ad::expr::Variable<double>>("z", 4.0);
auto new_sum = sum->clone_with(x->clone(), z->clone());

2.5 Dependency Inversion Principle (DIP)

High-level modules in AutoDiff depend on abstractions, not concrete implementations:

  • Algorithms operate on the Expression interface, not concrete types
  • Optimization strategies depend on abstract expression interfaces
  • Differentiation relies on abstract rule definitions

Example:

#include "expression.h"
#include "optimizer.h"

// Create an expression
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
auto expr = std::make_unique<ad::expr::Sin<double>>(
    std::make_unique<ad::expr::Pow<double>>(
        x->clone(),
        std::make_unique<ad::expr::Constant<double>>(2)
    )
);

// Optimizer works with the abstract Expression interface
ad::optimizer::ExpressionOptimizer<double> optimizer;
auto optimized_expr = optimizer.optimize(std::move(expr));

3. Core Concepts of Automatic Differentiation

3.1 Forward Mode vs. Reverse Mode

AD has two primary modes of operation:

Forward Mode:

  • Computes the derivative alongside the function evaluation
  • Efficient when there are few inputs and many outputs
  • Propagates derivatives from inputs to outputs

Reverse Mode:

  • First computes the function value, then propagates derivatives backward
  • Efficient when there are many inputs and few outputs
  • Used extensively in neural networks (backpropagation)
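
As a rule of thumb, building the full Jacobian of a function with n inputs and m outputs takes n forward-mode passes (one per input) or m reverse-mode passes (one per output). This trade-off is measured directly in the benchmark section later in the tutorial.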

Example for forward mode:

#include "forward_mode.h"

// Create a forward mode session
ad::forward::ForwardMode<double> fm;

// Compute derivative with respect to x
auto x_val = 2.0;
auto y_val = 3.0;

// Create variables (x with derivative 1, y with derivative 0)
auto x_var = fm.variable(x_val);
auto y_const = fm.constant(y_val);

// Compute f(x, y) = x * y + sin(x)
auto xy = x_var * y_const;
auto sin_x = sin(x_var);
auto result = xy + sin_x;

// Result contains both value and derivative
double function_value = result.value;  // 6.9093
double derivative_x = result.deriv;    // 2.5839 (y + cos(x), with cos(2) ≈ -0.4161)

Example for reverse mode:

#include "reverse_mode.h"
#include "computational_graph.h"

// Create a reverse mode session
ad::reverse::ReverseMode<double> rm;

// Add variables to the graph
auto x = rm.add_variable("x", 2.0);
auto y = rm.add_variable("y", 3.0);

// Build the computational graph
auto xy = ad::graph::multiply<double>(x, y);
auto sin_x = ad::graph::sin<double>(x);
auto output = ad::graph::add<double>(xy, sin_x);

// Compute the function value
double result = output->forward();  // 6.9093

// Compute gradients with respect to all variables
auto gradients = rm.compute_gradients(output);
double dx = gradients["x"];  // 2.5839
double dy = gradients["y"];  // 2.0

3.2 The Dual Number Concept

Forward mode AD often uses a mathematical construct called dual numbers. A dual number is an extension of real numbers with a nilpotent element ε (where ε² = 0).
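
Arithmetic on dual numbers carries derivatives along automatically. Writing a dual number as a + a'ε, the identity ε² = 0 gives

(a + a'ε)(b + b'ε) = ab + (ab' + a'b)ε

and, more generally, f(a + a'ε) = f(a) + f'(a)·a'ε. In the example below, x = 2 + 1ε and y = 3 + 0ε, so x * y = 6 + 3ε: the value 6.0 together with the derivative 3.0.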

Example:

#include "dual_number.h"

// Create dual numbers
ad::core::DualNumber<double> x(2.0, 1.0);  // Value 2.0, derivative 1.0
ad::core::DualNumber<double> y(3.0, 0.0);  // Value 3.0, derivative 0.0

// Perform operations
auto sum = x + y;        // Value 5.0, derivative 1.0
auto product = x * y;    // Value 6.0, derivative 3.0
auto sin_x = sin(x);     // Value sin(2.0), derivative cos(2.0)

// The derivative part gives us the derivative of the operation
double derivative_of_product = product.derivative();  // 3.0

4. Expression System (Stage 1)

The expression system represents mathematical expressions as trees of operations.

4.1 Expression Representation

Example:

#include "expression.h"

// Create variables
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
auto y = std::make_unique<ad::expr::Variable<double>>("y", 3.0);

// Build the expression: f(x, y) = x * y + sin(x)
auto xy = std::make_unique<ad::expr::Multiplication<double>>(x->clone(), y->clone());
auto sin_x = std::make_unique<ad::expr::Sin<double>>(x->clone());
auto expr = std::make_unique<ad::expr::Addition<double>>(std::move(xy), std::move(sin_x));

// Evaluate the expression
double result = expr->evaluate();  // result = 2 * 3 + sin(2) = 6.9093

// Compute derivative with respect to x
auto dx = expr->differentiate("x");
double derivative_x = dx->evaluate();  // derivative_x = y + cos(x) = 3 + cos(2) = 2.5839

4.2 Practical Application: Newton's Method

Newton's method for finding roots requires both function evaluation and differentiation:

#include <iostream>
#include "expression.h"

int main() {
    // Create the expression: f(x) = x³ - 4x² + 5x - 2 = (x - 1)²(x - 2)
    auto x = std::make_unique<ad::expr::Variable<double>>("x", 3.0);
    
    auto x_cubed = std::make_unique<ad::expr::Pow<double>>(
        x->clone(),
        std::make_unique<ad::expr::Constant<double>>(3)
    );
    
    auto four = std::make_unique<ad::expr::Constant<double>>(4);
    auto x_squared = std::make_unique<ad::expr::Pow<double>>(
        x->clone(),
        std::make_unique<ad::expr::Constant<double>>(2)
    );
    auto four_x_squared = std::make_unique<ad::expr::Multiplication<double>>(
        std::move(four),
        std::move(x_squared)
    );
    
    auto five = std::make_unique<ad::expr::Constant<double>>(5);
    auto five_x = std::make_unique<ad::expr::Multiplication<double>>(
        std::move(five),
        x->clone()
    );
    
    auto two = std::make_unique<ad::expr::Constant<double>>(2);
    
    auto term1 = std::make_unique<ad::expr::Subtraction<double>>(
        std::move(x_cubed),
        std::move(four_x_squared)
    );
    auto term2 = std::make_unique<ad::expr::Subtraction<double>>(
        std::move(five_x),
        std::move(two)
    );
    auto f = std::make_unique<ad::expr::Addition<double>>(
        std::move(term1),
        std::move(term2)
    );
    
    // Newton's method: x_{n+1} = x_n - f(x_n) / f'(x_n)
    double x_n = 3.0;  // Initial guess; avoid x = 1, where f and f' both vanish
    for (int i = 0; i < 10; i++) {
        x->set_value(x_n);
        double f_x = f->evaluate();
        auto df = f->differentiate("x");
        double df_x = df->evaluate();
        
        double x_next = x_n - f_x / df_x;
        double error = std::abs(x_next - x_n);
        
        std::cout << "Iteration " << i << ": x = " << x_n 
                  << ", f(x) = " << f_x 
                  << ", f'(x) = " << df_x 
                  << ", error = " << error << std::endl;
        
        if (error < 1e-10) {
            std::cout << "Converged to root: " << x_next << std::endl;
            break;
        }
        
        x_n = x_next;
    }
    
    return 0;
}

5. Dual Numbers (Bridge Component)

Dual numbers serve as a bridge between the expression system and the computational graph.

5.1 Practical Application: Taylor Series Approximation

#include <iostream>
#include <cmath>
#include "dual_number.h"

// Function to compute e^x using Taylor series approximation
template <typename T>
T exp_taylor(const T& x, int n_terms) {
    T result = T(1);  // First term: x^0 / 0!
    T term = T(1);
    
    for (int i = 1; i < n_terms; i++) {
        term = term * x / T(i);  // x^i / i!
        result = result + term;
    }
    
    return result;
}

int main() {
    // Evaluate at x = 1.0
    double x_val = 1.0;
    
    // Create a dual number with derivative 1.0
    ad::core::DualNumber<double> x(x_val, 1.0);
    
    // Compute approximation using 5 terms
    ad::core::DualNumber<double> result = exp_taylor(x, 5);
    
    // Compare with standard exp function
    double std_exp = std::exp(x_val);
    double std_exp_derivative = std_exp;  // Derivative of e^x is e^x
    
    std::cout << "Taylor approximation of e^" << x_val << " with 5 terms:" << std::endl;
    std::cout << "Value: " << result.value() << " (error: " << std::abs(result.value() - std_exp) << ")" << std::endl;
    std::cout << "Derivative: " << result.derivative() << " (error: " << std::abs(result.derivative() - std_exp_derivative) << ")" << std::endl;
    
    // Compute approximation using 10 terms
    result = exp_taylor(x, 10);
    
    std::cout << "\nTaylor approximation of e^" << x_val << " with 10 terms:" << std::endl;
    std::cout << "Value: " << result.value() << " (error: " << std::abs(result.value() - std_exp) << ")" << std::endl;
    std::cout << "Derivative: " << result.derivative() << " (error: " << std::abs(result.derivative() - std_exp_derivative) << ")" << std::endl;
    
    return 0;
}

6. Computational Graph (Stage 2)

The computational graph is the foundation of reverse mode automatic differentiation.
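
Each node computes its value during the forward pass and, during the backward pass, passes the gradient it receives on to its inputs via the chain rule. A single backward pass therefore yields the gradient with respect to every variable in the graph, which is exactly what the gradient descent example below relies on.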

6.1 Practical Application: Gradient Descent Optimization

#include <iostream>
#include "reverse_mode.h"
#include "computational_graph.h"

int main() {
    // Function to minimize: f(x, y) = (x - 2)² + 2(y - 1)²
    ad::reverse::ReverseMode<double> rm;
    
    // Add variables
    auto x = rm.add_variable("x", 0.0);  // Starting point: (0, 0)
    auto y = rm.add_variable("y", 0.0);
    
    // Constants
    auto two = std::make_shared<ad::graph::ConstantNode<double>>(2.0);
    
    // Build the expression
    auto x_minus_2 = ad::graph::subtract<double>(
        x,
        std::make_shared<ad::graph::ConstantNode<double>>(2.0)
    );
    auto y_minus_1 = ad::graph::subtract<double>(
        y,
        std::make_shared<ad::graph::ConstantNode<double>>(1.0)
    );
    
    auto term1 = ad::graph::multiply<double>(
        x_minus_2,
        x_minus_2
    );
    auto term2 = ad::graph::multiply<double>(
        two,
        ad::graph::multiply<double>(y_minus_1, y_minus_1)
    );
    
    auto f = ad::graph::add<double>(term1, term2);
    
    // Gradient descent parameters
    double learning_rate = 0.1;
    int n_iterations = 20;
    
    std::cout << "Gradient descent optimization:" << std::endl;
    std::cout << "Initial point: x = " << x->get_value() << ", y = " << y->get_value() << std::endl;
    
    for (int i = 0; i < n_iterations; i++) {
        // Compute function value
        double f_val = f->forward();
        
        // Compute gradients
        auto gradients = rm.compute_gradients(f);
        double dx = gradients["x"];
        double dy = gradients["y"];
        
        // Update variables
        double x_val = x->get_value();
        double y_val = y->get_value();
        
        x->set_value(x_val - learning_rate * dx);
        y->set_value(y_val - learning_rate * dy);
        
        std::cout << "Iteration " << i+1 << ": f(x,y) = " << f_val 
                  << ", (x,y) = (" << x->get_value() << "," << y->get_value() << ")"
                  << ", gradients = (" << dx << "," << dy << ")" << std::endl;
    }
    
    std::cout << "Final point: x = " << x->get_value() << ", y = " << y->get_value() << std::endl;
    std::cout << "Minimum value: f(x,y) = " << f->forward() << std::endl;
    std::cout << "Expected minimum: (x,y) = (2,1), f(2,1) = 0" << std::endl;
    
    return 0;
}

7. Forward Mode Implementation

Forward mode AD computes derivatives alongside function values.
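
Because each forward pass propagates only one directional derivative, obtaining all partial derivatives of a multivariate function takes one pass per input. The example below therefore rebuilds the expression three times, seeding x, y, and z in turn as the variable of differentiation.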

7.1 Practical Application: Computing Partial Derivatives

#include <iostream>
#include "forward_mode.h"

int main() {
    // Function: f(x, y, z) = x*y*z + sin(x)*cos(y)*exp(z)
    ad::forward::ForwardMode<double> fm;
    
    // Variable values
    double x_val = 1.0;
    double y_val = 2.0;
    double z_val = 0.5;
    
    std::cout << "Computing partial derivatives of f(x,y,z) = x*y*z + sin(x)*cos(y)*exp(z)" << std::endl;
    std::cout << "at point (x,y,z) = (" << x_val << "," << y_val << "," << z_val << ")" << std::endl;
    
    // Compute ∂f/∂x
    auto x = fm.variable(x_val);  // Derivative = 1
    auto y = fm.constant(y_val);  // Derivative = 0
    auto z = fm.constant(z_val);  // Derivative = 0
    
    auto xyz = x * y * z;
    auto sin_x = sin(x);
    auto cos_y = cos(y);
    auto exp_z = exp(z);
    auto term2 = sin_x * cos_y * exp_z;
    auto f = xyz + term2;
    
    double f_val = f.value;
    double df_dx = f.deriv;
    
    // Compute ∂f/∂y
    x = fm.constant(x_val);  // Derivative = 0
    y = fm.variable(y_val);  // Derivative = 1
    z = fm.constant(z_val);  // Derivative = 0
    
    xyz = x * y * z;
    sin_x = sin(x);
    cos_y = cos(y);
    exp_z = exp(z);
    term2 = sin_x * cos_y * exp_z;
    f = xyz + term2;
    
    double df_dy = f.deriv;
    
    // Compute ∂f/∂z
    x = fm.constant(x_val);  // Derivative = 0
    y = fm.constant(y_val);  // Derivative = 0
    z = fm.variable(z_val);  // Derivative = 1
    
    xyz = x * y * z;
    sin_x = sin(x);
    cos_y = cos(y);
    exp_z = exp(z);
    term2 = sin_x * cos_y * exp_z;
    f = xyz + term2;
    
    double df_dz = f.deriv;
    
    // Output results
    std::cout << "f(x,y,z) = " << f_val << std::endl;
    std::cout << "∂f/∂x = " << df_dx << std::endl;
    std::cout << "∂f/∂y = " << df_dy << std::endl;
    std::cout << "∂f/∂z = " << df_dz << std::endl;
    
    return 0;
}

8. Reverse Mode Implementation

Reverse mode AD builds a computational graph and propagates gradients backward.

8.1 Practical Application: Neural Network Layer

#include <iostream>
#include <vector>
#include <random>
#include "reverse_mode.h"
#include "computational_graph.h"

int main() {
    // Neural network layer: output = activation(W·input + b)
    // For simplicity, we'll use a linear activation (identity function)
    
    // Define dimensions
    const int input_dim = 3;
    const int output_dim = 2;
    
    // Create input data
    std::vector<double> input_data = {0.5, -0.2, 0.1};
    
    // Create weight matrix (randomly initialized)
    std::vector<std::vector<double>> weights(output_dim, std::vector<double>(input_dim));
    std::vector<double> biases(output_dim);
    
    std::random_device rd;
    std::mt19937 gen(rd());
    std::normal_distribution<> d(0, 0.1);
    
    // Randomly initialize weights and biases
    for (int i = 0; i < output_dim; i++) {
        biases[i] = d(gen);
        for (int j = 0; j < input_dim; j++) {
            weights[i][j] = d(gen);
        }
    }
    
    // Build the computational graph for the layer
    ad::reverse::ReverseMode<double> rm;
    
    // Inputs are variables so we can differentiate with respect to them
    std::vector<std::shared_ptr<ad::graph::VariableNode<double>>> inputs;
    for (int j = 0; j < input_dim; j++) {
        inputs.push_back(rm.add_variable("input" + std::to_string(j), input_data[j]));
    }
    
    // output[i] = b[i] + sum_j W[i][j] * input[j]  (linear activation)
    std::vector<std::shared_ptr<ad::graph::GraphNode<double>>> outputs;
    for (int i = 0; i < output_dim; i++) {
        std::shared_ptr<ad::graph::GraphNode<double>> out =
            std::make_shared<ad::graph::ConstantNode<double>>(biases[i]);
        for (int j = 0; j < input_dim; j++) {
            auto w_ij = std::make_shared<ad::graph::ConstantNode<double>>(weights[i][j]);
            out = ad::graph::add<double>(out, ad::graph::multiply<double>(w_ij, inputs[j]));
        }
        outputs.push_back(out);
        std::cout << "output[" << i << "] = " << out->forward() << std::endl;
    }
    
    // Compute the gradient of each output with respect to the inputs
    for (int i = 0; i < output_dim; i++) {
        std::cout << "For output[" << i << "]:" << std::endl;
        auto gradients = rm.compute_gradients(outputs[i]);
        
        for (int j = 0; j < input_dim; j++) {
            std::cout << "  d(output[" << i << "])/d(input[" << j << "]) = " 
                      << gradients["input" + std::to_string(j)] << std::endl;
        }
    }
    
    return 0;
}

9. Control Flow Differentiation

AutoDiff supports differentiating through control flow structures like loops and conditionals.
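
In the example below, the hard branch in ReLU is expressed as a conditional graph node with a small smoothing parameter, so the output, and hence its gradient, varies continuously across the switching point instead of jumping.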

9.1 Practical Application: Differentiating Through a Conditional (ReLU)

#include <iostream>
#include <vector>
#include "reverse_mode.h"
#include "computational_graph.h"
#include "control_flow.h"

int main() {
    // Demonstrate differentiating through a ReLU activation function
    // ReLU(x) = max(0, x) = { x if x > 0, 0 otherwise }
    
    std::cout << "Differentiating through a ReLU activation function" << std::endl;
    std::cout << "==================================================" << std::endl;
    
    // Create a reverse mode session
    ad::reverse::ReverseMode<double> rm;
    
    // Create a variable node for our input
    auto x = rm.add_variable("x", 0.0);
    
    // Create a zero constant node
    auto zero = std::make_shared<ad::graph::ConstantNode<double>>(0.0);
    
    // Create a conditional node to implement ReLU
    // We'll use a smoothing parameter to make it differentiable
    auto relu = ad::graph::make_conditional<double>(
        x,           // condition: x > 0
        x,           // true branch: return x
        zero,        // false branch: return 0
        0.01         // smoothing parameter
    );
    
    // Test at different values
    std::vector<double> test_values = {-2.0, -1.0, -0.5, -0.1, 0.0, 0.1, 0.5, 1.0, 2.0};
    
    std::cout << "x\tReLU(x)\tGradient" << std::endl;
    std::cout << "------------------------" << std::endl;
    
    for (double val : test_values) {
        // Set the input value
        x->set_value(val);
        
        // Forward pass
        double relu_val = relu->forward();
        
        // Backward pass (compute gradient)
        auto gradients = rm.compute_gradients(relu);
        double drelu_dx = gradients["x"];
        
        // Print results
        std::cout << val << "\t" << relu_val << "\t" << drelu_dx << std::endl;
    }
    
    // Compare with analytical solution
    std::cout << "\nAnalytical solution for comparison:" << std::endl;
    std::cout << "For ReLU(x), the derivative is:" << std::endl;
    std::cout << "- 0 when x < 0" << std::endl;
    std::cout << "- 1 when x > 0" << std::endl;
    std::cout << "- Undefined at x = 0 (but often set to 0 or 0.5)" << std::endl;
    
    // Demonstrate using ReLU in a simple neural network
    std::cout << "\nSimple neural network with ReLU activation" << std::endl;
    std::cout << "===========================================" << std::endl;
    
    // Create weight and bias
    auto w = rm.add_variable("w", 0.5);
    auto b = rm.add_variable("b", -0.3);
    
    // Create a simple neuron: y = ReLU(w*x + b)
    auto wx = ad::graph::multiply<double>(w, x);
    auto wxb = ad::graph::add<double>(wx, b);
    auto neuron_output = ad::graph::make_conditional<double>(
        wxb,         // condition: w*x + b > 0
        wxb,         // true branch: return w*x + b
        zero,        // false branch: return 0
        0.01         // smoothing parameter
    );
    
    // Test for different input values
    std::cout << "\nx\tw*x+b\tReLU(w*x+b)\tdReLU/dx\tdReLU/dw\tdReLU/db" << std::endl;
    std::cout << "-------------------------------------------------------------" << std::endl;
    
    for (double val : {-2.0, -1.0, 0.0, 1.0, 2.0}) {
        // Set the input value
        x->set_value(val);
        
        // Forward pass
        double wxb_val = wxb->forward();
        double neuron_val = neuron_output->forward();
        
        // Backward pass (compute gradients)
        auto gradients = rm.compute_gradients(neuron_output);
        double dy_dx = gradients["x"];
        double dy_dw = gradients["w"];
        double dy_db = gradients["b"];
        
        // Print results
        std::cout << val << "\t" << wxb_val << "\t" << neuron_val << "\t\t" 
                  << dy_dx << "\t\t" << dy_dw << "\t\t" << dy_db << std::endl;
    }
    
    std::cout << "\nExplanation:" << std::endl;
    std::cout << "- When w*x + b <= 0, the output and all gradients are close to 0" << std::endl;
    std::cout << "- When w*x + b > 0, the gradients show how the output changes with each parameter:" << std::endl;
    std::cout << "  * dReLU/dx ≈ w (when active)" << std::endl;
    std::cout << "  * dReLU/dw ≈ x (when active)" << std::endl;
    std::cout << "  * dReLU/db ≈ 1 (when active)" << std::endl;
    std::cout << "- The transition between active/inactive regions is smooth due to the smoothing parameter" << std::endl;
    
    return 0;
}

10. Expression Optimization

The optimizer in AutoDiff improves the performance of expression evaluation by applying various optimization techniques.

10.1 Practical Application: Optimizing a Complex Expression

#include <iostream>
#include <chrono>
#include "expression.h"
#include "optimizer.h"

int main() {
    // Create a complex expression with opportunities for optimization:
    // f(x) = (x + 0) * 1 + (x * 0) + (x - x) + sin(x)^2 + cos(x)^2
    // This can be simplified to: f(x) = x + 0 + 0 + sin(x)^2 + cos(x)^2
    // Furthermore, sin(x)^2 + cos(x)^2 = 1, so f(x) = x + 1 (1.5 at x = 0.5)
    
    auto x = std::make_unique<ad::expr::Variable<double>>("x", 0.5);
    
    // Term 1: (x + 0) * 1
    auto zero = std::make_unique<ad::expr::Constant<double>>(0.0);
    auto one = std::make_unique<ad::expr::Constant<double>>(1.0);
    auto x_plus_zero = std::make_unique<ad::expr::Addition<double>>(x->clone(), std::move(zero));
    auto term1 = std::make_unique<ad::expr::Multiplication<double>>(std::move(x_plus_zero), std::move(one));
    
    // Term 2: x * 0
    auto zero2 = std::make_unique<ad::expr::Constant<double>>(0.0);
    auto term2 = std::make_unique<ad::expr::Multiplication<double>>(x->clone(), std::move(zero2));
    
    // Term 3: x - x
    auto term3 = std::make_unique<ad::expr::Subtraction<double>>(x->clone(), x->clone());
    
    // Term 4: sin(x)^2
    auto sin_x = std::make_unique<ad::expr::Sin<double>>(x->clone());
    auto two = std::make_unique<ad::expr::Constant<double>>(2.0);
    auto term4 = std::make_unique<ad::expr::Pow<double>>(std::move(sin_x), std::move(two));
    
    // Term 5: cos(x)^2
    auto cos_x = std::make_unique<ad::expr::Cos<double>>(x->clone());
    auto two2 = std::make_unique<ad::expr::Constant<double>>(2.0);
    auto term5 = std::make_unique<ad::expr::Pow<double>>(std::move(cos_x), std::move(two2));
    
    // Combine terms
    auto sum1 = std::make_unique<ad::expr::Addition<double>>(std::move(term1), std::move(term2));
    auto sum2 = std::make_unique<ad::expr::Addition<double>>(std::move(sum1), std::move(term3));
    auto sum3 = std::make_unique<ad::expr::Addition<double>>(std::move(sum2), std::move(term4));
    auto expr = std::make_unique<ad::expr::Addition<double>>(std::move(sum3), std::move(term5));
    
    // Evaluate without optimization
    auto start = std::chrono::high_resolution_clock::now();
    double result_unoptimized = expr->evaluate();
    auto end = std::chrono::high_resolution_clock::now();
    auto time_unoptimized = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    
    // Create optimizer
    ad::optimizer::ExpressionOptimizer<double> optimizer;
    
    // Optimize the expression
    auto optimized_expr = optimizer.optimize(expr->clone());
    
    // Evaluate with optimization
    start = std::chrono::high_resolution_clock::now();
    double result_optimized = optimized_expr->evaluate();
    end = std::chrono::high_resolution_clock::now();
    auto time_optimized = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    
    // Print results
    std::cout << "Unoptimized result: " << result_unoptimized 
              << " (time: " << time_unoptimized << " ns)" << std::endl;
    std::cout << "Optimized result: " << result_optimized 
              << " (time: " << time_optimized << " ns)" << std::endl;
    std::cout << "Speedup: " << static_cast<double>(time_unoptimized) / time_optimized << "x" << std::endl;
    
    // Print optimization metrics
    std::cout << optimizer.get_metrics_report() << std::endl;
    
    return 0;
}

11. Benchmark Analysis

The included benchmark file (benchmark_forward_vs_backward_crossover.cpp) compares the performance of forward and reverse mode AD for different input/output dimensions.

11.1 Understanding the Benchmark

The benchmark simulates a simple neural network layer computation:

y = W·x + b

where:

  • x is an input vector of size input_dim
  • W is a weight matrix of size output_dim × input_dim
  • b is a bias vector of size output_dim
  • y is the output vector of size output_dim

11.2 Simplified Benchmark Implementation

#include <iostream>
#include <vector>
#include <chrono>
#include "forward_mode.h"
#include "reverse_mode.h"
#include "computational_graph.h"

template <typename T>
double benchmark_forward(int input_dim, int output_dim, int iterations) {
    // Initialize values
    std::vector<T> x_values(input_dim);
    std::vector<std::vector<T>> W_values(output_dim, std::vector<T>(input_dim));
    std::vector<T> b_values(output_dim);
    
    // Fill with simple values
    for (int i = 0; i < input_dim; i++) {
        x_values[i] = 0.1 * (i + 1);
    }
    
    for (int i = 0; i < output_dim; i++) {
        b_values[i] = 0.01 * i;
        for (int j = 0; j < input_dim; j++) {
            W_values[i][j] = 0.01 * (i + 1) * (j + 1);
        }
    }
    
    // Vectors for forward mode
    std::vector<ad::forward::ForwardVar<T>> x_forward(input_dim);
    std::vector<std::vector<ad::forward::ForwardVar<T>>> W_forward(output_dim, std::vector<ad::forward::ForwardVar<T>>(input_dim));
    std::vector<ad::forward::ForwardVar<T>> b_forward(output_dim);
    std::vector<ad::forward::ForwardVar<T>> y_forward(output_dim);
    std::vector<std::vector<T>> gradients(input_dim, std::vector<T>(output_dim));
    
    // Start timing
    auto start = std::chrono::high_resolution_clock::now();
    
    for (int iter = 0; iter < iterations; iter++) {
        // Reset all derivatives
        for (int i = 0; i < input_dim; i++) {
            x_forward[i] = ad::forward::ForwardVar<T>(x_values[i], T(0));
        }
        
        for (int i = 0; i < output_dim; i++) {
            b_forward[i] = ad::forward::ForwardVar<T>(b_values[i], T(0));
            for (int j = 0; j < input_dim; j++) {
                W_forward[i][j] = ad::forward::ForwardVar<T>(W_values[i][j], T(0));
            }
        }
        
        // Compute gradients for each input
        for (int k = 0; k < input_dim; k++) {
            // Set derivative for current input to 1.0
            x_forward[k].deriv = T(1);
            
            // Compute layer outputs
            for (int i = 0; i < output_dim; i++) {
                // Start with bias
                y_forward[i] = b_forward[i];
                
                // Add weighted inputs
                for (int j = 0; j < input_dim; j++) {
                    y_forward[i] = y_forward[i] + W_forward[i][j] * x_forward[j];
                }
                
                // Store gradient
                gradients[k][i] = y_forward[i].deriv;
            }
            
            // Reset derivative for current input
            x_forward[k].deriv = T(0);
        }
    }
    
    auto end = std::chrono::high_resolution_clock::now();
    
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() / static_cast<double>(iterations);
}

template <typename T>
double benchmark_reverse(int input_dim, int output_dim, int iterations) {
    // Initialize values
    std::vector<T> x_values(input_dim);
    std::vector<std::vector<T>> W_values(output_dim, std::vector<T>(input_dim));
    std::vector<T> b_values(output_dim);
    
    // Fill with simple values
    for (int i = 0; i < input_dim; i++) {
        x_values[i] = 0.1 * (i + 1);
    }
    
    for (int i = 0; i < output_dim; i++) {
        b_values[i] = 0.01 * i;
        for (int j = 0; j < input_dim; j++) {
            W_values[i][j] = 0.01 * (i + 1) * (j + 1);
        }
    }
    
    // Start timing
    auto start = std::chrono::high_resolution_clock::now();
    
    for (int iter = 0; iter < iterations; iter++) {
        // Create fresh computational graph for each iteration
        ad::reverse::ReverseMode<T> rm;
        
        // Add variables to the graph
        std::vector<std::shared_ptr<ad::graph::VariableNode<T>>> x_nodes;
        std::vector<std::vector<std::shared_ptr<ad::graph::VariableNode<T>>>> W_nodes;
        std::vector<std::shared_ptr<ad::graph::VariableNode<T>>> b_nodes;
        
        for (int i = 0; i < input_dim; i++) {
            x_nodes.push_back(rm.add_variable("x" + std::to_string(i), x_values[i]));
        }
        
        for (int i = 0; i < output_dim; i++) {
            std::vector<std::shared_ptr<ad::graph::VariableNode<T>>> row;
            for (int j = 0; j < input_dim; j++) {
                row.push_back(rm.add_variable("W" + std::to_string(i) + "_" + std::to_string(j), W_values[i][j]));
            }
            W_nodes.push_back(row);
            b_nodes.push_back(rm.add_variable("b" + std::to_string(i), b_values[i]));
        }
        
        // Build the computational graph for all outputs
        std::vector<std::shared_ptr<ad::graph::GraphNode<T>>> y_nodes;
        
        for (int i = 0; i < output_dim; i++) {
            // Start with the bias node
            std::shared_ptr<ad::graph::GraphNode<T>> output = b_nodes[i];
            
            // Add weighted inputs
            for (int j = 0; j < input_dim; j++) {
                auto term = ad::graph::multiply<T>(W_nodes[i][j], x_nodes[j]);
                output = ad::graph::add<T>(output, term);
            }
            y_nodes.push_back(output);
        }
        
        // Compute gradients for each output
        for (int i = 0; i < output_dim; i++) {
            auto grads = rm.compute_gradients(y_nodes[i]);
        }
    }
    
    auto end = std::chrono::high_resolution_clock::now();
    
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() / static_cast<double>(iterations);
}

int main() {
    const int iterations = 10;  // Number of iterations to average over
    
    std::cout << "Benchmarking forward vs. reverse mode AD for different input/output dimensions" << std::endl;
    std::cout << "==========================================================================" << std::endl;
    std::cout << "Each test runs for " << iterations << " iterations, and times are in milliseconds per iteration" << std::endl;
    std::cout << "\nInput dim, Output dim, Forward time (ms), Reverse time (ms), Ratio (reverse/forward)" << std::endl;
    std::cout << "------------------------------------------------------------------------" << std::endl;
    
    // Test various input/output dimension combinations
    std::vector<std::pair<int, int>> test_cases = {
        {1, 1},
        {1, 10},
        {10, 1},
        {10, 10},
        {20, 1},
        {1, 20},
        {20, 20},
        {50, 1},
        {1, 50},
        {50, 50}
    };
    
    for (const auto& [input_dim, output_dim] : test_cases) {
        double forward_time = benchmark_forward<double>(input_dim, output_dim, iterations);
        double reverse_time = benchmark_reverse<double>(input_dim, output_dim, iterations);
        double ratio = reverse_time / forward_time;
        
        std::cout << input_dim << ", " << output_dim << ", " 
                  << forward_time << ", " << reverse_time << ", " << ratio << std::endl;
    }
    
    return 0;
}

11.3 Benchmark Results Analysis

Key insights from the benchmark:

  1. Single Output, Multiple Inputs: As the number of inputs increases, reverse mode becomes increasingly efficient compared to forward mode. For small input dimensions (1-5), forward mode is still faster, but beyond a certain threshold (around 10-15 inputs), reverse mode takes the lead.

  2. Multiple Outputs, Single Input: Forward mode consistently outperforms reverse mode in this scenario, as expected theoretically. The performance gap grows with the number of outputs.

  3. Equal Inputs and Outputs: For square matrices (equal input and output dimensions), forward mode tends to be better for small dimensions, but reverse mode becomes more efficient as the dimensions grow.

  4. The Crossover Point: The benchmark locates the point where the ratio of reverse-mode time to forward-mode time reaches 1.0, i.e. the crossover at which the two modes perform equally.

11.4 Practical Implications

These benchmark results have important practical implications:

  1. Algorithm Selection: Choose forward mode when the number of outputs exceeds the number of inputs, and reverse mode in the opposite case.

  2. Neural Network Training: For neural networks, where we typically have many weights (inputs) but only one loss value (output), reverse mode (backpropagation) is the clear choice.

  3. Hybrid Approaches: For complex functions with multiple inputs and outputs, a hybrid approach might be optimal, using forward mode for some computations and reverse mode for others.

12. Advanced Applications

12.1 Constrained Optimization Using Lagrange Multipliers
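
The method folds the constraints into a single objective, the Lagrangian L(x, y, z, λ1, λ2) = f(x, y, z) + λ1·g1(x, y, z) + λ2·g2(x, y, z), and searches for a point where the partial derivatives of L with respect to all five unknowns vanish. Reverse mode AD fits this well: one backward pass through the Lagrangian graph produces the gradient with respect to the original variables and the multipliers simultaneously. The example below builds the Lagrangian as a computational graph and updates all five unknowns from those gradients.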

#include <iostream>
#include <vector>
#include "reverse_mode.h"
#include "computational_graph.h"

int main() {
    // Problem: Minimize f(x, y, z) = x^2 + y^2 + z^2
    // Subject to: g1(x, y, z) = x + y + z - 1 = 0
    //             g2(x, y, z) = x^2 + y^2 - z = 0
    
    // Create a reverse mode AD session
    ad::reverse::ReverseMode<double> rm;
    
    // Add variables
    auto x = rm.add_variable("x", 0.5);  // Initial guess
    auto y = rm.add_variable("y", 0.5);
    auto z = rm.add_variable("z", 0.0);
    auto lambda1 = rm.add_variable("lambda1", 0.0);  // Lagrange multiplier
    auto lambda2 = rm.add_variable("lambda2", 0.0);  // Lagrange multiplier
    
    // Build the objective function: f(x, y, z) = x^2 + y^2 + z^2
    auto x_squared = ad::graph::multiply<double>(x, x);
    auto y_squared = ad::graph::multiply<double>(y, y);
    auto z_squared = ad::graph::multiply<double>(z, z);
    
    auto sum_xy = ad::graph::add<double>(x_squared, y_squared);
    auto objective = ad::graph::add<double>(sum_xy, z_squared);
    
    // Build constraint 1: g1(x, y, z) = x + y + z - 1 = 0
    auto sum_xyz = ad::graph::add<double>(
        ad::graph::add<double>(x, y),
        z
    );
    auto constraint1 = ad::graph::subtract<double>(
        sum_xyz,
        std::make_shared<ad::graph::ConstantNode<double>>(1.0)
    );
    
    // Build constraint 2: g2(x, y, z) = x^2 + y^2 - z = 0
    auto constraint2 = ad::graph::subtract<double>(
        ad::graph::add<double>(x_squared, y_squared),
        z
    );
    
    // Build the Lagrangian: L = f + lambda1*g1 + lambda2*g2
    auto lambda1_g1 = ad::graph::multiply<double>(lambda1, constraint1);
    auto lambda2_g2 = ad::graph::multiply<double>(lambda2, constraint2);
    
    auto lagrangian = ad::graph::add<double>(
        objective,
        ad::graph::add<double>(lambda1_g1, lambda2_g2)
    );
    
    // Gradient descent to find the critical point of the Lagrangian
    const int max_iterations = 100;
    const double learning_rate = 0.01;
    const double tolerance = 1e-6;
    
    std::cout << "Constrained optimization using Lagrange multipliers" << std::endl;
    std::cout << "Minimizing f(x, y, z) = x^2 + y^2 + z^2" << std::endl;
    std::cout << "Subject to: x + y + z - 1 = 0 and x^2 + y^2 - z = 0" << std::endl;
    std::cout << "==========================================================" << std::endl;
    
    for (int i = 0; i < max_iterations; i++) {
        // Compute gradients
        auto gradients = rm.compute_gradients(lagrangian);
        
        // Update variables using gradient descent
        double x_val = x->get_value();
        double y_val = y->get_value();
        double z_val = z->get_value();
        double lambda1_val = lambda1->get_value();
        double lambda2_val = lambda2->get_value();
        
        double dx = gradients["x"];
        double dy = gradients["y"];
        double dz = gradients["z"];
        double dlambda1 = gradients["lambda1"];
        double dlambda2 = gradients["lambda2"];
        
        // Compute constraint values
        double constraint1_val = x_val + y_val + z_val - 1.0;
        double constraint2_val = x_val*x_val + y_val*y_val - z_val;
        
        // Check convergence
        double gradient_norm = std::sqrt(dx*dx + dy*dy + dz*dz + dlambda1*dlambda1 + dlambda2*dlambda2);
        double constraint_norm = std::sqrt(constraint1_val*constraint1_val + constraint2_val*constraint2_val);
        
        if (i % 10 == 0 || i == max_iterations - 1 || gradient_norm < tolerance) {
            std::cout << "Iteration " << i << ":" << std::endl;
            std::cout << "  (x, y, z) = (" << x_val << ", " << y_val << ", " << z_val << ")" << std::endl;
            std::cout << "  f(x, y, z) = " << objective->forward() << std::endl;
            std::cout << "  Constraint 1 = " << constraint1_val << std::endl;
            std::cout << "  Constraint 2 = " << constraint2_val << std::endl;
            std::cout << "  Gradient norm = " << gradient_norm << std::endl;
            std::cout << "  Constraint norm = " << constraint_norm << std::endl;
            std::cout << std::endl;
        }
        
        if (gradient_norm < tolerance && constraint_norm < tolerance) {
            std::cout << "Converged after " << i << " iterations!" << std::endl;
            break;
        }
        
        // Update values
        x->set_value(x_val - learning_rate * dx);
        y->set_value(y_val - learning_rate * dy);
        z->set_value(z_val - learning_rate * dz);
        lambda1->set_value(lambda1_val - learning_rate * dlambda1);
        lambda2->set_value(lambda2_val - learning_rate * dlambda2);
    }
    
    // Verify the solution
    double x_final = x->get_value();
    double y_final = y->get_value();
    double z_final = z->get_value();
    double obj_final = objective->forward();
    
    std::cout << "Final solution:" << std::endl;
    std::cout << "  (x, y, z) = (" << x_final << ", " << y_final << ", " << z_final << ")" << std::endl;
    std::cout << "  f(x, y, z) = " << obj_final << std::endl;
    std::cout << "  Constraint 1 = " << x_final + y_final + z_final - 1.0 << std::endl;
    std::cout << "  Constraint 2 = " << x_final*x_final + y_final*y_final - z_final << std::endl;
    
    return 0;
}

13. Conclusion

In this tutorial, we've covered the design principles and implementation details of automatic differentiation in the AutoDiff framework.

We started with the core concepts of automatic differentiation and the differences between forward and reverse mode. We then explored the three main components of the framework: the expression system, dual numbers, and the computational graph.

We looked at how forward mode uses dual numbers to simultaneously compute function values and derivatives, and how reverse mode builds a computational graph to efficiently compute gradients with respect to many variables.

We also covered control flow differentiation, expression optimization, and performance benchmarks.

Throughout the tutorial, we've seen how automatic differentiation combines mathematical elegance with computational efficiency to solve a wide range of problems in scientific computing, optimization, and machine learning.

By understanding both the theoretical foundations and practical implementation details of automatic differentiation, you are now equipped to use the AutoDiff framework effectively in your own applications, whether for optimization, machine learning, or any other field that requires the computation of derivatives.

From a design perspective, we've seen how the SOLID principles are applied in the AutoDiff framework to create a flexible, extensible, and maintainable codebase. Each component has a single responsibility, the system is open for extension but closed for modification, objects are replaceable with their subtypes, interfaces are focused and minimal, and high-level modules depend on abstractions rather than concrete implementations.

These design principles, combined with the mathematical foundation of automatic differentiation, make AutoDiff a powerful tool for a wide range of applications.
