Home
This tutorial provides a comprehensive guide to understanding the AutoDiff framework for automatic differentiation, covering its design principles, implementation details, and practical applications.
- Introduction to Automatic Differentiation
- Design Principles: SOLID in AutoDiff
- Core Concepts of Automatic Differentiation
- Expression System (Stage 1)
- Dual Numbers (Bridge Component)
- Computational Graph (Stage 2)
- Forward Mode Implementation
- Reverse Mode Implementation
- Control Flow Differentiation
- Expression Optimization
- Benchmark Analysis
- Advanced Applications
- Conclusion
Automatic differentiation (AD) is a computational technique for calculating exact derivatives of functions without resorting to numerical approximations or manual derivation.
In many scientific and engineering applications, we need to compute derivatives of functions:
- In optimization, derivatives help us find the minimum or maximum of functions
- In machine learning, gradients guide the training of neural networks
- In physics simulations, derivatives represent rates of change in physical systems
Computing derivatives by hand can be tedious and error-prone, while numerical approximations (like finite differences) suffer from precision issues. Automatic differentiation solves these problems by providing a way to compute exact derivatives algorithmically.
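To make the precision issue concrete, here is a small standalone sketch (plain C++ with only the standard library, independent of AutoDiff) that compares a central finite-difference approximation of the derivative of sin(x) with the exact value cos(x). Shrinking the step size reduces truncation error but eventually amplifies floating-point rounding error, so the approximation can never be made exact.
#include <cmath>
#include <iostream>
int main() {
    double x = 1.0;
    double exact = std::cos(x); // exact derivative of sin(x)
    // Central difference: (sin(x + h) - sin(x - h)) / (2h)
    for (double h : {1e-1, 1e-4, 1e-8, 1e-12}) {
        double approx = (std::sin(x + h) - std::sin(x - h)) / (2.0 * h);
        std::cout << "h = " << h << "  |error| = " << std::abs(approx - exact) << std::endl;
    }
    // Large h: truncation error dominates; tiny h: cancellation error dominates.
    // Automatic differentiation avoids this trade-off by propagating exact derivatives.
    return 0;
}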
Consider a simple function: f(x) = sin(x²).
To compute the derivative manually, we apply the chain rule:
- Let g(x) = x² and h(g) = sin(g)
- Apply the chain rule: f'(x) = h'(g) * g'(x)
- Compute the individual derivatives: h'(g) = cos(g) and g'(x) = 2x
- Substitute back: f'(x) = cos(x²) * 2x
While this is manageable for simple functions, it becomes increasingly complex for nested functions with multiple variables.
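As a preview of how the framework removes this manual work, the sketch below builds f(x) = sin(x²) from the expression classes introduced later in this tutorial (Variable, Constant, Pow, Sin) and lets differentiate() apply the chain rule; at x = 2 the derivative evaluates to cos(4) · 4.
#include <iostream>
#include "expression.h"
int main() {
    // Build f(x) = sin(x^2) as an expression tree
    auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
    auto x_squared = std::make_unique<ad::expr::Pow<double>>(
        x->clone(),
        std::make_unique<ad::expr::Constant<double>>(2)
    );
    auto f = std::make_unique<ad::expr::Sin<double>>(std::move(x_squared));
    // differentiate() applies the chain rule mechanically
    auto df = f->differentiate("x");
    std::cout << "f(2)  = " << f->evaluate() << std::endl;  // sin(4)
    std::cout << "f'(2) = " << df->evaluate() << std::endl; // cos(4) * 2 * 2
    return 0;
}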
The AutoDiff framework is built on established software engineering principles, specifically the SOLID principles.
Each component in AutoDiff has a clear, focused responsibility:
- Expression classes represent mathematical expressions
- DualNumber handles forward-mode differentiation
- GraphNode manages computational graph operations
- Optimizer focuses exclusively on expression optimization
Example:
#include "expression.h"
// Creating a variable - its sole responsibility is to represent a variable in an expression
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
// Setting its value
x->set_value(3.0);
The framework is designed to be open for extension but closed for modification:
- New mathematical operations can be added by deriving from BinaryOperation or UnaryOperation
- New optimizations can be implemented by extending the ExpressionOptimizer
- Custom functions can be integrated through the CustomFunctionNode interface
Example of adding a new unary operation:
#include "expression.h"
// Define a new operation (CustomFunction)
template <typename T>
class CustomFunction : public ad::expr::UnaryOperation<T> {
public:
using ad::expr::UnaryOperation<T>::UnaryOperation;
T evaluate() const override {
T x = this->operand_->evaluate();
return x * x + x; // f(x) = x² + x
}
ad::expr::ExprPtr<T> differentiate(const std::string& variable) const override {
// f'(x) = 2x + 1
auto term1 = std::make_unique<ad::expr::Multiplication<T>>(
std::make_unique<ad::expr::Constant<T>>(2),
this->operand_->clone()
);
auto term2 = std::make_unique<ad::expr::Constant<T>>(1);
auto deriv = std::make_unique<ad::expr::Addition<T>>(std::move(term1), std::move(term2));
// Chain rule: multiply by derivative of inner expression
return std::make_unique<ad::expr::Multiplication<T>>(
std::move(deriv),
this->operand_->differentiate(variable)
);
}
ad::expr::ExprPtr<T> clone() const override {
return std::make_unique<CustomFunction<T>>(this->operand_->clone());
}
ad::expr::ExprPtr<T> clone_with(ad::expr::ExprPtr<T> new_operand) const override {
return std::make_unique<CustomFunction<T>>(std::move(new_operand));
}
std::string func_name() const override { return "custom_func"; }
};
Objects in AutoDiff are replaceable with their subtypes without affecting program correctness:
- All expressions follow the Expression interface contract
- All operations provide consistent differentiation behavior
- Graph nodes follow a consistent interface for forward/backward passes
Example:
#include "expression.h"
// Function that works with any Expression type
template <typename T>
T evaluate_and_differentiate(const ad::expr::Expression<T>& expr, const std::string& var) {
T value = expr.evaluate();
T derivative = expr.differentiate(var)->evaluate();
return value + derivative;
}
// Usage with different expression types
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
auto sin_x = std::make_unique<ad::expr::Sin<double>>(x->clone());
auto x_squared = std::make_unique<ad::expr::Pow<double>>(
x->clone(),
std::make_unique<ad::expr::Constant<double>>(2)
);
// Both work with the same function
double result1 = evaluate_and_differentiate(*sin_x, "x");
double result2 = evaluate_and_differentiate(*x_squared, "x");
Interfaces in AutoDiff are focused and minimal:
- Expression provides only essential methods for evaluation and differentiation
- GraphNode separates forward and backward passes
- Separate mechanisms for expressing mathematical relationships vs. computing gradients
Example:
#include "expression.h"
// BinaryOperation only defines what's needed for a binary operation
// Use case: Adding two expressions
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
auto y = std::make_unique<ad::expr::Variable<double>>("y", 3.0);
auto sum = std::make_unique<ad::expr::Addition<double>>(x->clone(), y->clone());
// clone_with method provides a clean way to create a new binary operation
// with modified operands
auto z = std::make_unique<ad::expr::Variable<double>>("z", 4.0);
auto new_sum = sum->clone_with(x->clone(), z->clone());
High-level modules in AutoDiff depend on abstractions, not concrete implementations:
- Algorithms operate on the Expression interface, not concrete types
- Optimization strategies depend on abstract expression interfaces
- Differentiation relies on abstract rule definitions
Example:
#include "expression.h"
#include "optimizer.h"
// Create an expression
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
auto expr = std::make_unique<ad::expr::Sin<double>>(
std::make_unique<ad::expr::Pow<double>>(
x->clone(),
std::make_unique<ad::expr::Constant<double>>(2)
)
);
// Optimizer works with the abstract Expression interface
ad::optimizer::ExpressionOptimizer<double> optimizer;
auto optimized_expr = optimizer.optimize(std::move(expr));
AD has two primary modes of operation:
Forward Mode:
- Computes the derivative alongside the function evaluation
- Efficient when there are few inputs and many outputs
- Propagates derivatives from inputs to outputs
Reverse Mode:
- First computes the function value, then propagates derivatives backward
- Efficient when there are many inputs and few outputs
- Used extensively in neural networks (backpropagation)
As a rule of thumb, differentiating a function with n inputs and m outputs costs roughly n forward-mode passes or m reverse-mode passes; the benchmark section later in this tutorial quantifies where the crossover between the two modes lies.
Example for forward mode:
#include "forward_mode.h"
// Create a forward mode session
ad::forward::ForwardMode<double> fm;
// Compute derivative with respect to x
auto x_val = 2.0;
auto y_val = 3.0;
// Create variables (x with derivative 1, y with derivative 0)
auto x_var = fm.variable(x_val);
auto y_const = fm.constant(y_val);
// Compute f(x, y) = x * y + sin(x)
auto xy = x_var * y_const;
auto sin_x = sin(x_var);
auto result = xy + sin_x;
// Result contains both value and derivative
double function_value = result.value; // 6.9093
double derivative_x = result.deriv; // 2.5839 (= y + cos(x) = 3 + cos(2))
Example for reverse mode:
#include "reverse_mode.h"
#include "computational_graph.h"
// Create a reverse mode session
ad::reverse::ReverseMode<double> rm;
// Add variables to the graph
auto x = rm.add_variable("x", 2.0);
auto y = rm.add_variable("y", 3.0);
// Build the computational graph
auto xy = ad::graph::multiply<double>(x, y);
auto sin_x = ad::graph::sin<double>(x);
auto output = ad::graph::add<double>(xy, sin_x);
// Compute the function value
double result = output->forward(); // 6.9093
// Compute gradients with respect to all variables
auto gradients = rm.compute_gradients(output);
double dx = gradients["x"]; // 2.5839
double dy = gradients["y"]; // 2.0
Forward mode AD often uses a mathematical construct called dual numbers. A dual number is an extension of real numbers with a nilpotent element ε (where ε² = 0).
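To see why this works, here is a minimal, self-contained dual-number sketch (a simplified illustration, not the library's ad::core::DualNumber): carrying the ε coefficient through ordinary arithmetic reproduces the product and chain rules automatically.
#include <cmath>
#include <iostream>
// A value a plus a derivative part b, representing a + b*eps with eps^2 = 0
struct Dual {
    double val;
    double der;
};
Dual operator*(Dual x, Dual y) {
    // (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, because eps^2 = 0
    return {x.val * y.val, x.val * y.der + x.der * y.val};
}
Dual sin(Dual x) {
    // sin(a + b*eps) = sin(a) + cos(a)*b*eps, so the chain rule falls out of the algebra
    return {std::sin(x.val), std::cos(x.val) * x.der};
}
int main() {
    Dual x{2.0, 1.0};    // seed dx/dx = 1
    Dual y = sin(x * x); // f(x) = sin(x^2)
    std::cout << y.val << " " << y.der << std::endl; // sin(4) and cos(4) * 4
    return 0;
}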
Example:
#include "dual_number.h"
// Create dual numbers
ad::core::DualNumber<double> x(2.0, 1.0); // Value 2.0, derivative 1.0
ad::core::DualNumber<double> y(3.0, 0.0); // Value 3.0, derivative 0.0
// Perform operations
auto sum = x + y; // Value 5.0, derivative 1.0
auto product = x * y; // Value 6.0, derivative 3.0
auto sin_x = sin(x); // Value sin(2.0), derivative cos(2.0)
// The derivative part gives us the derivative of the operation
double derivative_of_product = product.derivative(); // 3.0
The expression system represents mathematical expressions as trees of operations.
Example:
#include "expression.h"
// Create variables
auto x = std::make_unique<ad::expr::Variable<double>>("x", 2.0);
auto y = std::make_unique<ad::expr::Variable<double>>("y", 3.0);
// Build the expression: f(x, y) = x * y + sin(x)
auto xy = std::make_unique<ad::expr::Multiplication<double>>(x->clone(), y->clone());
auto sin_x = std::make_unique<ad::expr::Sin<double>>(x->clone());
auto expr = std::make_unique<ad::expr::Addition<double>>(std::move(xy), std::move(sin_x));
// Evaluate the expression
double result = expr->evaluate(); // result = 2 * 3 + sin(2) = 6.9093
// Compute derivative with respect to x
auto dx = expr->differentiate("x");
double derivative_x = dx->evaluate(); // derivative_x = y + cos(x) = 3 + cos(2) = 2.5839
Newton's method for finding roots requires both function evaluation and differentiation:
#include <iostream>
#include <cmath>
#include "expression.h"
int main() {
// Create the expression: f(x) = x³ - 4x² + 5x - 2
auto x = std::make_unique<ad::expr::Variable<double>>("x", 1.0);
auto x_cubed = std::make_unique<ad::expr::Pow<double>>(
x->clone(),
std::make_unique<ad::expr::Constant<double>>(3)
);
auto four = std::make_unique<ad::expr::Constant<double>>(4);
auto x_squared = std::make_unique<ad::expr::Pow<double>>(
x->clone(),
std::make_unique<ad::expr::Constant<double>>(2)
);
auto four_x_squared = std::make_unique<ad::expr::Multiplication<double>>(
std::move(four),
std::move(x_squared)
);
auto five = std::make_unique<ad::expr::Constant<double>>(5);
auto five_x = std::make_unique<ad::expr::Multiplication<double>>(
std::move(five),
x->clone()
);
auto two = std::make_unique<ad::expr::Constant<double>>(2);
auto term1 = std::make_unique<ad::expr::Subtraction<double>>(
std::move(x_cubed),
std::move(four_x_squared)
);
auto term2 = std::make_unique<ad::expr::Subtraction<double>>(
std::move(five_x),
std::move(two)
);
auto f = std::make_unique<ad::expr::Addition<double>>(
std::move(term1),
std::move(term2)
);
// Newton's method: x_{n+1} = x_n - f(x_n) / f'(x_n)
double x_n = 1.0;
for (int i = 0; i < 10; i++) {
x->set_value(x_n);
double f_x = f->evaluate();
auto df = f->differentiate("x");
double df_x = df->evaluate();
double x_next = x_n - f_x / df_x;
double error = std::abs(x_next - x_n);
std::cout << "Iteration " << i << ": x = " << x_n
<< ", f(x) = " << f_x
<< ", f'(x) = " << df_x
<< ", error = " << error << std::endl;
if (error < 1e-10) {
std::cout << "Converged to root: " << x_next << std::endl;
break;
}
x_n = x_next;
}
return 0;
}
Dual numbers serve as a bridge between the expression system and the computational graph.
#include <iostream>
#include <cmath>
#include "dual_number.h"
// Function to compute e^x using Taylor series approximation
template <typename T>
T exp_taylor(const T& x, int n_terms) {
T result = T(1); // First term: x^0 / 0!
T term = T(1);
for (int i = 1; i < n_terms; i++) {
term = term * x / T(i); // x^i / i!
result = result + term;
}
return result;
}
int main() {
// Evaluate at x = 1.0
double x_val = 1.0;
// Create a dual number with derivative 1.0
ad::core::DualNumber<double> x(x_val, 1.0);
// Compute approximation using 5 terms
ad::core::DualNumber<double> result = exp_taylor(x, 5);
// Compare with standard exp function
double std_exp = std::exp(x_val);
double std_exp_derivative = std_exp; // Derivative of e^x is e^x
std::cout << "Taylor approximation of e^" << x_val << " with 5 terms:" << std::endl;
std::cout << "Value: " << result.value() << " (error: " << std::abs(result.value() - std_exp) << ")" << std::endl;
std::cout << "Derivative: " << result.derivative() << " (error: " << std::abs(result.derivative() - std_exp_derivative) << ")" << std::endl;
// Compute approximation using 10 terms
result = exp_taylor(x, 10);
std::cout << "\nTaylor approximation of e^" << x_val << " with 10 terms:" << std::endl;
std::cout << "Value: " << result.value() << " (error: " << std::abs(result.value() - std_exp) << ")" << std::endl;
std::cout << "Derivative: " << result.derivative() << " (error: " << std::abs(result.derivative() - std_exp_derivative) << ")" << std::endl;
return 0;
}
The computational graph is the foundation of reverse mode automatic differentiation.
#include <iostream>
#include "reverse_mode.h"
#include "computational_graph.h"
int main() {
// Function to minimize: f(x, y) = (x - 2)² + 2(y - 1)²
ad::reverse::ReverseMode<double> rm;
// Add variables
auto x = rm.add_variable("x", 0.0); // Starting point: (0, 0)
auto y = rm.add_variable("y", 0.0);
// Constants
auto two = std::make_shared<ad::graph::ConstantNode<double>>(2.0);
// Build the expression
auto x_minus_2 = ad::graph::subtract<double>(
x,
std::make_shared<ad::graph::ConstantNode<double>>(2.0)
);
auto y_minus_1 = ad::graph::subtract<double>(
y,
std::make_shared<ad::graph::ConstantNode<double>>(1.0)
);
auto term1 = ad::graph::multiply<double>(
x_minus_2,
x_minus_2
);
auto term2 = ad::graph::multiply<double>(
two,
ad::graph::multiply<double>(y_minus_1, y_minus_1)
);
auto f = ad::graph::add<double>(term1, term2);
// Gradient descent parameters
double learning_rate = 0.1;
int n_iterations = 20;
std::cout << "Gradient descent optimization:" << std::endl;
std::cout << "Initial point: x = " << x->get_value() << ", y = " << y->get_value() << std::endl;
for (int i = 0; i < n_iterations; i++) {
// Compute function value
double f_val = f->forward();
// Compute gradients
auto gradients = rm.compute_gradients(f);
double dx = gradients["x"];
double dy = gradients["y"];
// Update variables
double x_val = x->get_value();
double y_val = y->get_value();
x->set_value(x_val - learning_rate * dx);
y->set_value(y_val - learning_rate * dy);
std::cout << "Iteration " << i+1 << ": f(x,y) = " << f_val
<< ", (x,y) = (" << x->get_value() << "," << y->get_value() << ")"
<< ", gradients = (" << dx << "," << dy << ")" << std::endl;
}
std::cout << "Final point: x = " << x->get_value() << ", y = " << y->get_value() << std::endl;
std::cout << "Minimum value: f(x,y) = " << f->forward() << std::endl;
std::cout << "Expected minimum: (x,y) = (2,1), f(2,1) = 0" << std::endl;
return 0;
}
Forward mode AD computes derivatives alongside function values.
#include <iostream>
#include "forward_mode.h"
int main() {
// Function: f(x, y, z) = x*y*z + sin(x)*cos(y)*exp(z)
ad::forward::ForwardMode<double> fm;
// Variable values
double x_val = 1.0;
double y_val = 2.0;
double z_val = 0.5;
std::cout << "Computing partial derivatives of f(x,y,z) = x*y*z + sin(x)*cos(y)*exp(z)" << std::endl;
std::cout << "at point (x,y,z) = (" << x_val << "," << y_val << "," << z_val << ")" << std::endl;
// Compute ∂f/∂x
auto x = fm.variable(x_val); // Derivative = 1
auto y = fm.constant(y_val); // Derivative = 0
auto z = fm.constant(z_val); // Derivative = 0
auto xyz = x * y * z;
auto sin_x = sin(x);
auto cos_y = cos(y);
auto exp_z = exp(z);
auto term2 = sin_x * cos_y * exp_z;
auto f = xyz + term2;
double f_val = f.value;
double df_dx = f.deriv;
// Compute ∂f/∂y
x = fm.constant(x_val); // Derivative = 0
y = fm.variable(y_val); // Derivative = 1
z = fm.constant(z_val); // Derivative = 0
xyz = x * y * z;
sin_x = sin(x);
cos_y = cos(y);
exp_z = exp(z);
term2 = sin_x * cos_y * exp_z;
f = xyz + term2;
double df_dy = f.deriv;
// Compute ∂f/∂z
x = fm.constant(x_val); // Derivative = 0
y = fm.constant(y_val); // Derivative = 0
z = fm.variable(z_val); // Derivative = 1
xyz = x * y * z;
sin_x = sin(x);
cos_y = cos(y);
exp_z = exp(z);
term2 = sin_x * cos_y * exp_z;
f = xyz + term2;
double df_dz = f.deriv;
// Output results
std::cout << "f(x,y,z) = " << f_val << std::endl;
std::cout << "∂f/∂x = " << df_dx << std::endl;
std::cout << "∂f/∂y = " << df_dy << std::endl;
std::cout << "∂f/∂z = " << df_dz << std::endl;
return 0;
}
Reverse mode AD builds a computational graph and propagates gradients backward.
#include <iostream>
#include <vector>
#include <random>
#include "reverse_mode.h"
#include "computational_graph.h"
int main() {
// Neural network layer: output = activation(W·input + b)
// For simplicity, we'll use a linear activation (identity function)
// Define dimensions
const int input_dim = 3;
const int output_dim = 2;
// Create input data
std::vector<double> input_data = {0.5, -0.2, 0.1};
// Create weight matrix (randomly initialized)
std::vector<std::vector<double>> weights(output_dim, std::vector<double>(input_dim));
std::vector<double> biases(output_dim);
std::random_device rd;
std::mt19937 gen(rd());
std::normal_distribution<> d(0, 0.1);
for (int i = 0; i < output_dim; i++) {
biases[i] = d(gen);
for (int j = 0; j < input_dim; j++) {
weights[i][j] = d(gen);
}
}
// Build the computational graph: inputs are registered as variables named
// "input0", "input1", ... (matching the gradient lookups below), while the
// weights and biases enter the graph as constants here
ad::reverse::ReverseMode<double> rm;
std::vector<std::shared_ptr<ad::graph::VariableNode<double>>> inputs;
for (int j = 0; j < input_dim; j++) {
inputs.push_back(rm.add_variable("input" + std::to_string(j), input_data[j]));
}
std::vector<std::shared_ptr<ad::graph::GraphNode<double>>> outputs;
for (int i = 0; i < output_dim; i++) {
std::shared_ptr<ad::graph::GraphNode<double>> out =
std::make_shared<ad::graph::ConstantNode<double>>(biases[i]);
for (int j = 0; j < input_dim; j++) {
auto w_ij = std::make_shared<ad::graph::ConstantNode<double>>(weights[i][j]);
out = ad::graph::add<double>(out, ad::graph::multiply<double>(w_ij, inputs[j]));
}
outputs.push_back(out);
std::cout << "output[" << i << "] = " << out->forward() << std::endl;
}
// Compute the Jacobian entries d(output[i])/d(input[j]) via reverse mode
for (int i = 0; i < output_dim; i++) {
std::cout << "For output[" << i << "]:" << std::endl;
auto gradients = rm.compute_gradients(outputs[i]);
for (int j = 0; j < input_dim; j++) {
std::cout << " d(output[" << i << "])/d(input[" << j << "]) = "
<< gradients["input" + std::to_string(j)] << std::endl;
}
}
return 0;
}
AutoDiff supports differentiating through control flow structures like loops and conditionals.
#include <iostream>
#include <vector>
#include "reverse_mode.h"
#include "computational_graph.h"
#include "control_flow.h"
int main() {
// Demonstrate differentiating through a ReLU activation function
// ReLU(x) = max(0, x) = { x if x > 0, 0 otherwise }
std::cout << "Differentiating through a ReLU activation function" << std::endl;
std::cout << "==================================================" << std::endl;
// Create a reverse mode session
ad::reverse::ReverseMode<double> rm;
// Create a variable node for our input
auto x = rm.add_variable("x", 0.0);
// Create a zero constant node
auto zero = std::make_shared<ad::graph::ConstantNode<double>>(0.0);
// Create a conditional node to implement ReLU
// We'll use a smoothing parameter to make it differentiable
auto relu = ad::graph::make_conditional<double>(
x, // condition: x > 0
x, // true branch: return x
zero, // false branch: return 0
0.01 // smoothing parameter
);
// Test at different values
std::vector<double> test_values = {-2.0, -1.0, -0.5, -0.1, 0.0, 0.1, 0.5, 1.0, 2.0};
std::cout << "x\tReLU(x)\tGradient" << std::endl;
std::cout << "------------------------" << std::endl;
for (double val : test_values) {
// Set the input value
x->set_value(val);
// Forward pass
double relu_val = relu->forward();
// Backward pass (compute gradient)
auto gradients = rm.compute_gradients(relu);
double drelu_dx = gradients["x"];
// Print results
std::cout << val << "\t" << relu_val << "\t" << drelu_dx << std::endl;
}
// Compare with analytical solution
std::cout << "\nAnalytical solution for comparison:" << std::endl;
std::cout << "For ReLU(x), the derivative is:" << std::endl;
std::cout << "- 0 when x < 0" << std::endl;
std::cout << "- 1 when x > 0" << std::endl;
std::cout << "- Undefined at x = 0 (but often set to 0 or 0.5)" << std::endl;
// Demonstrate using ReLU in a simple neural network
std::cout << "\nSimple neural network with ReLU activation" << std::endl;
std::cout << "===========================================" << std::endl;
// Create weight and bias
auto w = rm.add_variable("w", 0.5);
auto b = rm.add_variable("b", -0.3);
// Create a simple neuron: y = ReLU(w*x + b)
auto wx = ad::graph::multiply<double>(w, x);
auto wxb = ad::graph::add<double>(wx, b);
auto neuron_output = ad::graph::make_conditional<double>(
wxb, // condition: w*x + b > 0
wxb, // true branch: return w*x + b
zero, // false branch: return 0
0.01 // smoothing parameter
);
// Test for different input values
std::cout << "\nx\tw*x+b\tReLU(w*x+b)\tdReLU/dx\tdReLU/dw\tdReLU/db" << std::endl;
std::cout << "-------------------------------------------------------------" << std::endl;
for (double val : {-2.0, -1.0, 0.0, 1.0, 2.0}) {
// Set the input value
x->set_value(val);
// Forward pass
double wxb_val = wxb->forward();
double neuron_val = neuron_output->forward();
// Backward pass (compute gradients)
auto gradients = rm.compute_gradients(neuron_output);
double dy_dx = gradients["x"];
double dy_dw = gradients["w"];
double dy_db = gradients["b"];
// Print results
std::cout << val << "\t" << wxb_val << "\t" << neuron_val << "\t\t"
<< dy_dx << "\t\t" << dy_dw << "\t\t" << dy_db << std::endl;
}
std::cout << "\nExplanation:" << std::endl;
std::cout << "- When w*x + b <= 0, the output and all gradients are close to 0" << std::endl;
std::cout << "- When w*x + b > 0, the gradients show how the output changes with each parameter:" << std::endl;
std::cout << " * dReLU/dx ≈ w (when active)" << std::endl;
std::cout << " * dReLU/dw ≈ x (when active)" << std::endl;
std::cout << " * dReLU/db ≈ 1 (when active)" << std::endl;
std::cout << "- The transition between active/inactive regions is smooth due to the smoothing parameter" << std::endl;
return 0;
}
The optimizer in AutoDiff improves the performance of expression evaluation by applying various optimization techniques.
#include <iostream>
#include <chrono>
#include "expression.h"
#include "optimizer.h"
int main() {
// Create a complex expression with opportunities for optimization:
// f(x) = (x + 0) * 1 + (x * 0) + (x - x) + sin(x)^2 + cos(x)^2
// (x + 0) * 1 simplifies to x, while x * 0 and x - x vanish, so f(x) = x + sin(x)^2 + cos(x)^2
// Furthermore, sin(x)^2 + cos(x)^2 = 1, so the result should be x + 1 (here 1.5 for x = 0.5)
auto x = std::make_unique<ad::expr::Variable<double>>("x", 0.5);
// Term 1: (x + 0) * 1
auto zero = std::make_unique<ad::expr::Constant<double>>(0.0);
auto one = std::make_unique<ad::expr::Constant<double>>(1.0);
auto x_plus_zero = std::make_unique<ad::expr::Addition<double>>(x->clone(), std::move(zero));
auto term1 = std::make_unique<ad::expr::Multiplication<double>>(std::move(x_plus_zero), std::move(one));
// Term 2: x * 0
auto zero2 = std::make_unique<ad::expr::Constant<double>>(0.0);
auto term2 = std::make_unique<ad::expr::Multiplication<double>>(x->clone(), std::move(zero2));
// Term 3: x - x
auto term3 = std::make_unique<ad::expr::Subtraction<double>>(x->clone(), x->clone());
// Term 4: sin(x)^2
auto sin_x = std::make_unique<ad::expr::Sin<double>>(x->clone());
auto two = std::make_unique<ad::expr::Constant<double>>(2.0);
auto term4 = std::make_unique<ad::expr::Pow<double>>(std::move(sin_x), std::move(two));
// Term 5: cos(x)^2
auto cos_x = std::make_unique<ad::expr::Cos<double>>(x->clone());
auto two2 = std::make_unique<ad::expr::Constant<double>>(2.0);
auto term5 = std::make_unique<ad::expr::Pow<double>>(std::move(cos_x), std::move(two2));
// Combine terms
auto sum1 = std::make_unique<ad::expr::Addition<double>>(std::move(term1), std::move(term2));
auto sum2 = std::make_unique<ad::expr::Addition<double>>(std::move(sum1), std::move(term3));
auto sum3 = std::make_unique<ad::expr::Addition<double>>(std::move(sum2), std::move(term4));
auto expr = std::make_unique<ad::expr::Addition<double>>(std::move(sum3), std::move(term5));
// Evaluate without optimization
auto start = std::chrono::high_resolution_clock::now();
double result_unoptimized = expr->evaluate();
auto end = std::chrono::high_resolution_clock::now();
auto time_unoptimized = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
// Create optimizer
ad::optimizer::ExpressionOptimizer<double> optimizer;
// Optimize the expression
auto optimized_expr = optimizer.optimize(expr->clone());
// Evaluate with optimization
start = std::chrono::high_resolution_clock::now();
double result_optimized = optimized_expr->evaluate();
end = std::chrono::high_resolution_clock::now();
auto time_optimized = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
// Print results
std::cout << "Unoptimized result: " << result_unoptimized
<< " (time: " << time_unoptimized << " ns)" << std::endl;
std::cout << "Optimized result: " << result_optimized
<< " (time: " << time_optimized << " ns)" << std::endl;
std::cout << "Speedup: " << static_cast<double>(time_unoptimized) / time_optimized << "x" << std::endl;
// Print optimization metrics
std::cout << optimizer.get_metrics_report() << std::endl;
return 0;
}
The included benchmark file (benchmark_forward_vs_backward_crossover.cpp) compares the performance of forward and reverse mode AD for different input/output dimensions.
The benchmark simulates a simple neural network layer computation:
y = W·x + b
where:
- x is an input vector of size input_dim
- W is a weight matrix of size output_dim × input_dim
- b is a bias vector of size output_dim
- y is the output vector of size output_dim
Because the layer is linear, the Jacobian is simply ∂y_i/∂x_j = W_ij; both modes produce the same derivatives, so the benchmark measures only the cost of obtaining them.
#include <iostream>
#include <vector>
#include <chrono>
#include "forward_mode.h"
#include "reverse_mode.h"
#include "computational_graph.h"
template <typename T>
double benchmark_forward(int input_dim, int output_dim, int iterations) {
// Initialize values
std::vector<T> x_values(input_dim);
std::vector<std::vector<T>> W_values(output_dim, std::vector<T>(input_dim));
std::vector<T> b_values(output_dim);
// Fill with simple values
for (int i = 0; i < input_dim; i++) {
x_values[i] = 0.1 * (i + 1);
}
for (int i = 0; i < output_dim; i++) {
b_values[i] = 0.01 * i;
for (int j = 0; j < input_dim; j++) {
W_values[i][j] = 0.01 * (i + 1) * (j + 1);
}
}
// Vectors for forward mode
std::vector<ad::forward::ForwardVar<T>> x_forward(input_dim);
std::vector<std::vector<ad::forward::ForwardVar<T>>> W_forward(output_dim, std::vector<ad::forward::ForwardVar<T>>(input_dim));
std::vector<ad::forward::ForwardVar<T>> b_forward(output_dim);
std::vector<ad::forward::ForwardVar<T>> y_forward(output_dim);
std::vector<std::vector<T>> gradients(input_dim, std::vector<T>(output_dim));
// Start timing
auto start = std::chrono::high_resolution_clock::now();
for (int iter = 0; iter < iterations; iter++) {
// Reset all derivatives
for (int i = 0; i < input_dim; i++) {
x_forward[i] = ad::forward::ForwardVar<T>(x_values[i], T(0));
}
for (int i = 0; i < output_dim; i++) {
b_forward[i] = ad::forward::ForwardVar<T>(b_values[i], T(0));
for (int j = 0; j < input_dim; j++) {
W_forward[i][j] = ad::forward::ForwardVar<T>(W_values[i][j], T(0));
}
}
// Compute gradients for each input
for (int k = 0; k < input_dim; k++) {
// Set derivative for current input to 1.0
x_forward[k].deriv = T(1);
// Compute layer outputs
for (int i = 0; i < output_dim; i++) {
// Start with bias
y_forward[i] = b_forward[i];
// Add weighted inputs
for (int j = 0; j < input_dim; j++) {
y_forward[i] = y_forward[i] + W_forward[i][j] * x_forward[j];
}
// Store gradient
gradients[k][i] = y_forward[i].deriv;
}
// Reset derivative for current input
x_forward[k].deriv = T(0);
}
}
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() / static_cast<double>(iterations);
}
template <typename T>
double benchmark_reverse(int input_dim, int output_dim, int iterations) {
// Initialize values
std::vector<T> x_values(input_dim);
std::vector<std::vector<T>> W_values(output_dim, std::vector<T>(input_dim));
std::vector<T> b_values(output_dim);
// Fill with simple values
for (int i = 0; i < input_dim; i++) {
x_values[i] = 0.1 * (i + 1);
}
for (int i = 0; i < output_dim; i++) {
b_values[i] = 0.01 * i;
for (int j = 0; j < input_dim; j++) {
W_values[i][j] = 0.01 * (i + 1) * (j + 1);
}
}
// Start timing
auto start = std::chrono::high_resolution_clock::now();
for (int iter = 0; iter < iterations; iter++) {
// Create fresh computational graph for each iteration
ad::reverse::ReverseMode<T> rm;
// Add variables to the graph
std::vector<std::shared_ptr<ad::graph::VariableNode<T>>> x_nodes;
std::vector<std::vector<std::shared_ptr<ad::graph::VariableNode<T>>>> W_nodes;
std::vector<std::shared_ptr<ad::graph::VariableNode<T>>> b_nodes;
for (int i = 0; i < input_dim; i++) {
x_nodes.push_back(rm.add_variable("x" + std::to_string(i), x_values[i]));
}
for (int i = 0; i < output_dim; i++) {
std::vector<std::shared_ptr<ad::graph::VariableNode<T>>> row;
for (int j = 0; j < input_dim; j++) {
row.push_back(rm.add_variable("W" + std::to_string(i) + "_" + std::to_string(j), W_values[i][j]));
}
W_nodes.push_back(row);
b_nodes.push_back(rm.add_variable("b" + std::to_string(i), b_values[i]));
}
// Build the computational graph for all outputs
std::vector<std::shared_ptr<ad::graph::GraphNode<T>>> y_nodes;
for (int i = 0; i < output_dim; i++) {
// Start with the bias node
std::shared_ptr<ad::graph::GraphNode<T>> output = b_nodes[i];
// Add weighted inputs
for (int j = 0; j < input_dim; j++) {
auto term = ad::graph::multiply<T>(W_nodes[i][j], x_nodes[j]);
output = ad::graph::add<T>(output, term);
}
y_nodes.push_back(output);
}
// Compute gradients for each output
for (int i = 0; i < output_dim; i++) {
auto grads = rm.compute_gradients(y_nodes[i]);
}
}
auto end = std::chrono::high_resolution_clock::now();
return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() / static_cast<double>(iterations);
}
int main() {
const int iterations = 10; // Number of iterations to average over
std::cout << "Benchmarking forward vs. reverse mode AD for different input/output dimensions" << std::endl;
std::cout << "==========================================================================" << std::endl;
std::cout << "Each test runs for " << iterations << " iterations, and times are in milliseconds per iteration" << std::endl;
std::cout << "\nInput dim, Output dim, Forward time (ms), Reverse time (ms), Ratio (reverse/forward)" << std::endl;
std::cout << "------------------------------------------------------------------------" << std::endl;
// Test various input/output dimension combinations
std::vector<std::pair<int, int>> test_cases = {
{1, 1},
{1, 10},
{10, 1},
{10, 10},
{20, 1},
{1, 20},
{20, 20},
{50, 1},
{1, 50},
{50, 50}
};
for (const auto& [input_dim, output_dim] : test_cases) {
double forward_time = benchmark_forward<double>(input_dim, output_dim, iterations);
double reverse_time = benchmark_reverse<double>(input_dim, output_dim, iterations);
double ratio = reverse_time / forward_time;
std::cout << input_dim << ", " << output_dim << ", "
<< forward_time << ", " << reverse_time << ", " << ratio << std::endl;
}
return 0;
}
Key insights from the benchmark:
- Single Output, Multiple Inputs: As the number of inputs increases, reverse mode becomes increasingly efficient compared to forward mode. For small input dimensions (1-5), forward mode is still faster, but beyond a certain threshold (around 10-15 inputs), reverse mode takes the lead.
- Multiple Outputs, Single Input: Forward mode consistently outperforms reverse mode in this scenario, as expected theoretically. The performance gap grows with the number of outputs.
- Equal Inputs and Outputs: For square matrices (equal input and output dimensions), forward mode tends to be better for small dimensions, but reverse mode becomes more efficient as the dimensions grow.
- The Crossover Point: The benchmark identifies the exact point where the ratio of reverse mode time to forward mode time equals 1.0. This is the crossover point where the two methods have equal performance.
These benchmark results have important practical implications:
- Algorithm Selection: Choose forward mode when the number of outputs exceeds the number of inputs, and reverse mode in the opposite case (a small helper sketching this rule follows the list).
- Neural Network Training: For neural networks, where we typically have many weights (inputs) but only one loss value (output), reverse mode (backpropagation) is the clear choice.
- Hybrid Approaches: For complex functions with multiple inputs and outputs, a hybrid approach might be optimal, using forward mode for some computations and reverse mode for others.
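The selection rule can be captured in a few lines; the helper below is hypothetical, shown for illustration only, and is not part of the AutoDiff API:
enum class AdMode { Forward, Reverse };
// Pick a differentiation mode from the problem shape alone: forward-mode cost grows with
// the number of inputs, reverse-mode cost grows with the number of outputs.
inline AdMode pick_mode(int num_inputs, int num_outputs) {
    return (num_inputs <= num_outputs) ? AdMode::Forward : AdMode::Reverse;
}
The next example turns to an advanced application: using reverse mode AD together with Lagrange multipliers to solve a constrained optimization problem.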
#include <iostream>
#include <vector>
#include <cmath>
#include "reverse_mode.h"
#include "computational_graph.h"
int main() {
// Problem: Minimize f(x, y, z) = x^2 + y^2 + z^2
// Subject to: g1(x, y, z) = x + y + z - 1 = 0
// g2(x, y, z) = x^2 + y^2 - z = 0
// Create a reverse mode AD session
ad::reverse::ReverseMode<double> rm;
// Add variables
auto x = rm.add_variable("x", 0.5); // Initial guess
auto y = rm.add_variable("y", 0.5);
auto z = rm.add_variable("z", 0.0);
auto lambda1 = rm.add_variable("lambda1", 0.0); // Lagrange multiplier
auto lambda2 = rm.add_variable("lambda2", 0.0); // Lagrange multiplier
// Build the objective function: f(x, y, z) = x^2 + y^2 + z^2
auto x_squared = ad::graph::multiply<double>(x, x);
auto y_squared = ad::graph::multiply<double>(y, y);
auto z_squared = ad::graph::multiply<double>(z, z);
auto sum_xy = ad::graph::add<double>(x_squared, y_squared);
auto objective = ad::graph::add<double>(sum_xy, z_squared);
// Build constraint 1: g1(x, y, z) = x + y + z - 1 = 0
auto sum_xyz = ad::graph::add<double>(
ad::graph::add<double>(x, y),
z
);
auto constraint1 = ad::graph::subtract<double>(
sum_xyz,
std::make_shared<ad::graph::ConstantNode<double>>(1.0)
);
// Build constraint 2: g2(x, y, z) = x^2 + y^2 - z = 0
auto constraint2 = ad::graph::subtract<double>(
ad::graph::add<double>(x_squared, y_squared),
z
);
// Build the Lagrangian: L = f + lambda1*g1 + lambda2*g2
auto lambda1_g1 = ad::graph::multiply<double>(lambda1, constraint1);
auto lambda2_g2 = ad::graph::multiply<double>(lambda2, constraint2);
auto lagrangian = ad::graph::add<double>(
objective,
ad::graph::add<double>(lambda1_g1, lambda2_g2)
);
// Gradient descent on (x, y, z) combined with gradient ascent on the multipliers to find a stationary point of the Lagrangian
const int max_iterations = 100;
const double learning_rate = 0.01;
const double tolerance = 1e-6;
std::cout << "Constrained optimization using Lagrange multipliers" << std::endl;
std::cout << "Minimizing f(x, y, z) = x^2 + y^2 + z^2" << std::endl;
std::cout << "Subject to: x + y + z - 1 = 0 and x^2 + y^2 - z = 0" << std::endl;
std::cout << "==========================================================" << std::endl;
for (int i = 0; i < max_iterations; i++) {
// Compute gradients
auto gradients = rm.compute_gradients(lagrangian);
// Update variables using gradient descent
double x_val = x->get_value();
double y_val = y->get_value();
double z_val = z->get_value();
double lambda1_val = lambda1->get_value();
double lambda2_val = lambda2->get_value();
double dx = gradients["x"];
double dy = gradients["y"];
double dz = gradients["z"];
double dlambda1 = gradients["lambda1"];
double dlambda2 = gradients["lambda2"];
// Compute constraint values
double constraint1_val = x_val + y_val + z_val - 1.0;
double constraint2_val = x_val*x_val + y_val*y_val - z_val;
// Check convergence
double gradient_norm = std::sqrt(dx*dx + dy*dy + dz*dz + dlambda1*dlambda1 + dlambda2*dlambda2);
double constraint_norm = std::sqrt(constraint1_val*constraint1_val + constraint2_val*constraint2_val);
if (i % 10 == 0 || i == max_iterations - 1 || gradient_norm < tolerance) {
std::cout << "Iteration " << i << ":" << std::endl;
std::cout << " (x, y, z) = (" << x_val << ", " << y_val << ", " << z_val << ")" << std::endl;
std::cout << " f(x, y, z) = " << objective->forward() << std::endl;
std::cout << " Constraint 1 = " << constraint1_val << std::endl;
std::cout << " Constraint 2 = " << constraint2_val << std::endl;
std::cout << " Gradient norm = " << gradient_norm << std::endl;
std::cout << " Constraint norm = " << constraint_norm << std::endl;
std::cout << std::endl;
}
if (gradient_norm < tolerance && constraint_norm < tolerance) {
std::cout << "Converged after " << i << " iterations!" << std::endl;
break;
}
// Update values: descend on the primal variables, ascend on the multipliers
x->set_value(x_val - learning_rate * dx);
y->set_value(y_val - learning_rate * dy);
z->set_value(z_val - learning_rate * dz);
lambda1->set_value(lambda1_val + learning_rate * dlambda1);
lambda2->set_value(lambda2_val + learning_rate * dlambda2);
}
// Verify the solution
double x_final = x->get_value();
double y_final = y->get_value();
double z_final = z->get_value();
double obj_final = objective->forward();
std::cout << "Final solution:" << std::endl;
std::cout << " (x, y, z) = (" << x_final << ", " << y_final << ", " << z_final << ")" << std::endl;
std::cout << " f(x, y, z) = " << obj_final << std::endl;
std::cout << " Constraint 1 = " << x_final + y_final + z_final - 1.0 << std::endl;
std::cout << " Constraint 2 = " << x_final*x_final + y_final*y_final - z_final << std::endl;
return 0;
}
In this tutorial, we've covered the design principles and implementation details of automatic differentiation in the AutoDiff framework.
We started with the core concepts of automatic differentiation and the differences between forward and reverse mode. We then explored the three main components of the framework: the expression system, dual numbers, and the computational graph.
We looked at how forward mode uses dual numbers to simultaneously compute function values and derivatives, and how reverse mode builds a computational graph to efficiently compute gradients with respect to many variables.
We also covered control flow differentiation, expression optimization, and performance benchmarks.
Throughout the tutorial, we've seen how automatic differentiation combines mathematical elegance with computational efficiency to solve a wide range of problems in scientific computing, optimization, and machine learning.
By understanding both the theoretical foundations and practical implementation details of automatic differentiation, you are now equipped to use the AutoDiff framework effectively in your own applications, whether for optimization, machine learning, or any other field that requires the computation of derivatives.
From a design perspective, we've seen how the SOLID principles are applied in the AutoDiff framework to create a flexible, extensible, and maintainable codebase. Each component has a single responsibility, the system is open for extension but closed for modification, objects are replaceable with their subtypes, interfaces are focused and minimal, and high-level modules depend on abstractions rather than concrete implementations.
These design principles, combined with the mathematical foundation of automatic differentiation, make AutoDiff a powerful tool for a wide range of applications.