# Optical Flow: Classical to Deep Learning Implementation

## Introduction

Optical flow is one of the foundational problems in computer vision: given two consecutive video frames, estimate how each pixel moves from one to the next. The human visual system tracks moving objects effortlessly, but doing the same computationally requires algorithms that detect and quantify motion at the pixel level.

## Classical Methods and Their Mathematics

### The Lucas-Kanade Method

The Lucas-Kanade algorithm estimates optical flow from an equation that relates changes in pixel intensity to motion. It is built on two key assumptions:

1. **Brightness Constancy**: A pixel keeps its intensity as it moves between frames
2. **Spatial Coherence**: Neighboring pixels share the same motion

For small displacements, brightness constancy linearizes to the optical flow equation:
```
Ix * u + Iy * v + It = 0
```
where `Ix` and `Iy` are the spatial image gradients, `It` is the temporal gradient, and (u, v) is the flow vector we want to compute. This is a single equation in two unknowns, so it cannot be solved at one pixel alone. Spatial coherence supplies the missing constraints: every pixel in a small window contributes one equation, and the resulting overdetermined system is solved by least squares.

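
The optical flow equation can be checked numerically. The sketch below is a NumPy-only illustration under assumed conditions: a smooth synthetic image (the `image` function is made up for this example) and a known sub-pixel shift. It shows that the residual of the equation is much smaller than the temporal gradient itself when the motion is small.

```python
import numpy as np

# Assumed synthetic setup: a smooth analytic image and a known sub-pixel shift.
u_true, v_true = 0.3, -0.2
y, x = np.mgrid[0:64, 0:64].astype(np.float64)

def image(x, y):
    return np.sin(0.1 * x) + np.cos(0.07 * y)

I1 = image(x, y)
I2 = image(x - u_true, y - v_true)   # content of I1 moved by (u_true, v_true)

# Spatial gradients (central differences) and temporal gradient
Iy, Ix = np.gradient(I1)             # np.gradient returns d/drow first, then d/dcol
It = I2 - I1

# Residual of the optical flow equation, skipping the one-sided border estimates
residual = (Ix * u_true + Iy * v_true + It)[1:-1, 1:-1]
print("max |It|      :", np.abs(It).max())
print("max |residual|:", np.abs(residual).max())
```

The residual is not exactly zero because the equation is a first-order approximation; it grows as the displacement gets larger.
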
Here is a straightforward per-pixel implementation:

```python
import cv2
import numpy as np

def lucas_kanade_flow(I1, I2, window_size=15):
    """Dense Lucas-Kanade flow between two grayscale frames of equal size."""
    I1 = I1.astype(np.float64)
    I2 = I2.astype(np.float64)

    # Spatial gradients; scale=1/8 normalizes the 3x3 Sobel kernel so the
    # result is in intensity units per pixel
    Ix = cv2.Sobel(I1, cv2.CV_64F, 1, 0, ksize=3, scale=1 / 8)
    Iy = cv2.Sobel(I1, cv2.CV_64F, 0, 1, ksize=3, scale=1 / 8)
    # Temporal gradient
    It = I2 - I1

    # Flow fields, left at zero wherever the system is ill-conditioned
    u = np.zeros(I1.shape, dtype=np.float32)
    v = np.zeros(I1.shape, dtype=np.float32)

    half = window_size // 2
    for i in range(half, I1.shape[0] - half):
        for j in range(half, I1.shape[1] - half):
            # Gather the gradients inside the window centered on (i, j)
            window = (slice(i - half, i + half + 1),
                      slice(j - half, j + half + 1))
            ix = Ix[window].ravel()
            iy = Iy[window].ravel()
            it = It[window].ravel()

            # One equation per window pixel: A @ [u, v] = b
            A = np.stack([ix, iy], axis=1)
            b = -it
            ATA = A.T @ A

            # Solve the normal equations only when the smaller eigenvalue
            # of A^T A shows the window has gradient in two directions
            if np.linalg.eigvalsh(ATA).min() >= 1e-6:
                u[i, j], v[i, j] = np.linalg.solve(ATA, A.T @ b)

    return u, v
```

This implementation:
1. Computes image gradients with Sobel operators (`Ix`, `Iy`) and the frame difference (`It`)
2. Considers a window of surrounding pixels for each pixel
3. Solves a least-squares problem to find the motion vector
4. Checks the eigenvalues of `A.T @ A` so that flow is only estimated where the system is well-conditioned, for example at corners rather than in flat regions or along straight edges

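
To see the least-squares step in isolation, the following sketch treats a whole synthetic frame as a single Lucas-Kanade window and recovers a known shift. It uses only NumPy; the `frame` function and the chosen shift are assumptions made for illustration, and `np.gradient` stands in for the Sobel operator.

```python
import numpy as np

# Assumed test data: a smooth synthetic frame and a copy shifted by a known
# sub-pixel amount, treated as one large Lucas-Kanade window.
true_flow = np.array([0.3, -0.2])            # (u, v) in pixels
y, x = np.mgrid[0:64, 0:64].astype(np.float64)

def frame(x, y):
    return np.sin(0.1 * x) + np.cos(0.07 * y)

I1 = frame(x, y)
I2 = frame(x - true_flow[0], y - true_flow[1])

Iy, Ix = np.gradient(I1)                     # row derivative first, then column
It = I2 - I1

# Drop the border, where np.gradient falls back to one-sided differences
inner = (slice(1, -1), slice(1, -1))
A = np.stack([Ix[inner].ravel(), Iy[inner].ravel()], axis=1)
b = -It[inner].ravel()

# Least-squares solution of A @ [u, v] = b
estimated_flow, *_ = np.linalg.lstsq(A, b, rcond=None)
print("estimated (u, v):", estimated_flow)
```

The estimate should land close to the true shift. A small bias remains because the linearization ignores higher-order terms.
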
### The Farnebäck Method

Farnebäck's algorithm is a dense classical method. It approximates each pixel neighborhood with a quadratic polynomial and estimates displacement from how the polynomial coefficients change between frames. An image pyramid lets it handle larger motions:

```python
import cv2

def farneback_flow(prev, curr):
    """Dense flow between two single-channel 8-bit frames; returns an HxWx2 float32 array."""
    flow = cv2.calcOpticalFlowFarneback(
        prev, curr,
        None,             # No initial flow estimate
        pyr_scale=0.5,    # Scale factor between pyramid levels
        levels=3,         # Number of pyramid levels, including the original image
        winsize=15,       # Averaging window size
        iterations=3,     # Iterations per pyramid level
        poly_n=5,         # Pixel neighborhood size for polynomial expansion
        poly_sigma=1.2,   # Gaussian sigma for smoothing derivatives
        flags=0
    )
    return flow
```

The key parameters control:

1. **Multi-scale Analysis**:
   - `pyr_scale`: Scale factor between pyramid levels (0.5 means each level is half the size of the previous one)
   - `levels`: Number of pyramid levels, counting the original image (more levels handle larger motions)

2. **Local Approximation**:
   - `winsize`: Averaging window size; larger values are more robust to noise and fast motion but blur the flow field
   - `poly_n`: Size of the pixel neighborhood used for the polynomial expansion, typically 5 or 7
   - `poly_sigma`: Standard deviation of the Gaussian used to smooth the derivatives that form the basis of the polynomial expansion

3. **Refinement**:
   - `iterations`: Number of iterations at each pyramid level

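
The effect of `pyr_scale` and `levels` is easiest to see with numbers. The helper below is a hypothetical illustration, not part of OpenCV: it lists the approximate size of each pyramid level and shows how a displacement shrinks at the coarsest level. The rounding rule is an assumption of this sketch.

```python
def pyramid_shapes(height, width, pyr_scale=0.5, levels=3):
    """Approximate (height, width) of each pyramid level, finest first.

    Rounding is an assumption of this sketch; OpenCV's internal sizes
    may differ by a pixel.
    """
    shapes = []
    for level in range(levels):
        scale = pyr_scale ** level
        shapes.append((max(1, round(height * scale)),
                       max(1, round(width * scale))))
    return shapes

shapes = pyramid_shapes(480, 640, pyr_scale=0.5, levels=3)
print(shapes)  # [(480, 640), (240, 320), (120, 160)]

# A displacement of d pixels at full resolution shrinks to
# d * pyr_scale**(levels - 1) pixels at the coarsest level.
coarsest_motion = 40 * 0.5 ** (3 - 1)
print(coarsest_motion)  # 10.0
```

With these settings a 40-pixel motion becomes a 10-pixel motion at the coarsest level, which is within reach of a 15-pixel window.
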
## Deep Learning Approaches

### FlowNet: End-to-End Flow Estimation

FlowNet showed that a convolutional network can learn to estimate optical flow directly from data. The FlowNetS variant stacks the two frames along the channel axis and passes them through an encoder-decoder:

```python
import torch.nn as nn

# `conv`, `deconv`, and `predict_flow` are helper factories assumed to be
# defined elsewhere (a conv block with optional batch norm, a transposed
# conv block, and a layer that outputs 2 flow channels).

class FlowNetS(nn.Module):
    def __init__(self, batchNorm=True):
        super().__init__()

        # Encoder (deeper layers up to the 1024-channel bottleneck are omitted)
        self.conv1 = conv(batchNorm, 6, 64, kernel_size=7, stride=2)
        self.conv2 = conv(batchNorm, 64, 128, kernel_size=5, stride=2)
        self.conv3 = conv(batchNorm, 128, 256, kernel_size=5, stride=2)

        # Decoder with skip connections
        self.deconv5 = deconv(1024, 512)
        self.deconv4 = deconv(1026, 256)  # 512 skip + 512 upsampled + 2 flow
        self.deconv3 = deconv(770, 128)   # 512 skip + 256 upsampled + 2 flow

        # Flow prediction at each decoder scale
        self.predict_flow6 = predict_flow(1024)
        self.predict_flow5 = predict_flow(1026)
        self.predict_flow4 = predict_flow(770)
```

The architecture consists of:

1. **Encoder Path**:
   - Takes a 6-channel input (two RGB frames concatenated)
   - Downsamples progressively with stride-2 convolutions while increasing the number of feature channels
   - Uses large initial kernels (7x7, then 5x5) to cover a wide receptive field early, which helps with large motions
   - Optionally applies batch normalization to stabilize training

2. **Decoder Path**:
   - Upsamples through deconvolution (transposed convolution) layers
   - Uses skip connections from the encoder to preserve fine details
   - Concatenates the upsampled coarser flow prediction, which adds 2 channels to each decoder input (e.g., 1026 = 512 skip + 512 upsampled + 2 flow)

3. **Multi-scale Prediction**:
   - Flow is predicted at multiple resolutions
   - Coarse predictions handle large motions
   - Fine predictions refine details
   - The loss is computed at all scales

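
The channel counts and feature-map sizes in the architecture can be reproduced with simple arithmetic. The sketch below is plain Python with hypothetical helper names. The skip widths (512 channels) follow the published FlowNetS layout, and the padding rule `(kernel_size - 1) // 2` is an assumption about how the `conv` helper pads.

```python
FLOW_CHANNELS = 2  # (u, v)

def decoder_input_channels(skip_channels, upsampled_channels):
    """Channels entering a decoder stage: encoder skip features,
    upsampled decoder features, and the upsampled coarser flow."""
    return skip_channels + upsampled_channels + FLOW_CHANNELS

# Assumed skip widths, following the published FlowNetS layout
stage5 = decoder_input_channels(skip_channels=512, upsampled_channels=512)
stage4 = decoder_input_channels(skip_channels=512, upsampled_channels=256)
print(stage5, stage4)  # 1026 770


def conv_output_size(size, kernel_size, stride):
    """Spatial size after a convolution padded with (kernel_size - 1) // 2."""
    padding = (kernel_size - 1) // 2
    return (size + 2 * padding - kernel_size) // stride + 1

# Three stride-2 layers (kernels 7, 5, 5) reduce a 384-pixel side to 48.
size = 384
for k in (7, 5, 5):
    size = conv_output_size(size, k, stride=2)
print(size)  # 48
```

The two printed channel counts match the `deconv4`/`predict_flow5` and `deconv3`/`predict_flow4` arguments in the class above.
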
### RAFT Architecture

RAFT (Recurrent All-Pairs Field Transforms) set a new state of the art when it was published in 2020 by building a correlation volume over all pixel pairs and refining the flow iteratively. The sketch below simplifies its feature extraction:

```python
import torch.nn as nn

class RAFTFeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # `ResNet18` is a placeholder backbone, assumed to return
        # 256-channel features at 1/8 of the input resolution
        self.backbone = ResNet18()
        self.conv1 = nn.Conv2d(256, 128, 1)  # Matching features
        self.conv2 = nn.Conv2d(256, 256, 1)  # Context features

    def forward(self, x):
        # Extract features at 1/8 resolution
        x = self.backbone(x)
        # Project into feature and context branches
        feat = self.conv1(x)
        ctx = self.conv2(x)
        return feat, ctx
```

RAFT innovates through:

1. **Feature Extraction**:
   - A shared backbone (ResNet18 in this sketch; the published RAFT uses its own small residual encoder) processes both frames
   - Feature and context pathways are kept separate
   - Features are used to compute correlations between the two frames
   - Context features, taken from the first frame in the published model, guide the update step

2. **All-Pairs Correlation**:
```python
import torch
import torch.nn.functional as F

def compute_correlation_volume(feat1, feat2, num_levels=4):
    """Compute a 4D all-pairs correlation volume and its pyramid."""
    b, c, h, w = feat1.shape
    f1 = feat1.reshape(b, c, h * w)
    f2 = feat2.reshape(b, c, h * w)

    # Dot product between every pixel of frame 1 and every pixel of frame 2,
    # scaled by sqrt(c) as in the reference RAFT implementation
    corr = torch.matmul(f1.transpose(1, 2), f2) / c ** 0.5
    # Conceptually (b, h, w, h, w); stored as one h x w map per frame-1 pixel
    corr = corr.view(b * h * w, 1, h, w)

    # Pyramid: level 0 is full resolution; each further level halves the
    # frame-2 dimensions by average pooling
    corr_pyramid = [corr]
    for _ in range(num_levels - 1):
        corr = F.avg_pool2d(corr, kernel_size=2, stride=2)
        corr_pyramid.append(corr)

    return corr_pyramid
```

This creates a 4D correlation volume that:
- Captures the similarity between every pixel in the first frame and every pixel in the second
- Enables large displacement handling
- Provides multi-scale correlation information through the pyramid

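
The all-pairs idea can be illustrated without a deep learning framework. The following NumPy sketch uses assumed toy features (random, unit-normalized, with the second frame rolled two pixels to the right) to build the 4D volume with `einsum`, look up the best match for one pixel, and pool one pyramid level.

```python
import numpy as np

rng = np.random.default_rng(0)
c, h, w = 32, 6, 8

# Assumed toy features, unit-normalized per pixel so the true match scores highest
feat1 = rng.standard_normal((c, h, w))
feat1 /= np.linalg.norm(feat1, axis=0, keepdims=True)
feat2 = np.roll(feat1, shift=2, axis=2)      # frame 2: content moved 2 px right

# corr[i, j, k, l] = <feat1[:, i, j], feat2[:, k, l]> / sqrt(c)
corr = np.einsum("cij,ckl->ijkl", feat1, feat2) / np.sqrt(c)
print(corr.shape)  # (6, 8, 6, 8)

# Best match in frame 2 for pixel (row 3, column 1) of frame 1
i, j = 3, 1
best = tuple(int(n) for n in np.unravel_index(np.argmax(corr[i, j]), (h, w)))
print(best)  # (3, 3)

# One pyramid level: average 2x2 blocks over the frame-2 dimensions only
pooled = corr.reshape(h, w, h // 2, 2, w // 2, 2).mean(axis=(3, 5))
print(pooled.shape)  # (6, 8, 3, 4)
```

The memory cost is the main design trade-off: the volume has `(h * w) ** 2` entries, which is why it is built at reduced resolution.
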
3. **Iterative Updates**:
```python
import torch.nn as nn

# `ConvGRU` and `FlowHead` are placeholder modules assumed to be defined
# elsewhere: a convolutional GRU cell and a small head that maps the
# hidden state to a 2-channel flow update.

class RAFTUpdater(nn.Module):
    def __init__(self):
        super().__init__()
        self.gru = ConvGRU(hidden_dim=128)
        self.flow_head = FlowHead(hidden_dim=128)

    def forward(self, net, inp, corr, flow):
        # Update the hidden state from context features and correlations
        net = self.gru(net, inp, corr)
        # Predict a residual flow update
        delta_flow = self.flow_head(net)
        return net, flow + delta_flow
```

The updater:
- Carries a hidden state alongside the current flow estimate
- Refines the estimate over multiple iterations
- Uses a GRU so that each iteration can build on information from earlier ones
- Predicts incremental updates that are added to the current flow

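
The gating inside a GRU update can be sketched in a few lines of NumPy. This is a simplified stand-in, not the RAFT update block: it uses 1x1 convolutions (a matrix multiply per pixel), random weights, and random inputs in place of real correlation and context features. All names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h, x, Wz, Wr, Wq):
    """One GRU update applied independently at every pixel.

    h: (hidden, H, W) hidden state, x: (inputs, H, W) input features.
    Each weight matrix has shape (hidden, hidden + inputs), i.e. a 1x1 convolution.
    """
    hx = np.concatenate([h, x], axis=0)
    z = sigmoid(np.einsum("oc,chw->ohw", Wz, hx))    # update gate
    r = sigmoid(np.einsum("oc,chw->ohw", Wr, hx))    # reset gate
    rhx = np.concatenate([r * h, x], axis=0)
    q = np.tanh(np.einsum("oc,chw->ohw", Wq, rhx))   # candidate state
    return (1.0 - z) * h + z * q                     # blend old and candidate state

rng = np.random.default_rng(0)
hidden, inputs, H, W = 8, 4, 5, 6
Wz, Wr, Wq = (0.1 * rng.standard_normal((hidden, hidden + inputs)) for _ in range(3))

h = np.tanh(rng.standard_normal((hidden, H, W)))     # hidden state starts in (-1, 1)
for _ in range(12):                                  # refinement iterations
    x = rng.standard_normal((inputs, H, W))          # stand-in for correlation + context
    h = gru_step(h, x, Wz, Wr, Wq)

print(h.shape)  # (8, 5, 6)
```

Because the new state is a gated blend of the old state and a `tanh` candidate, it stays bounded no matter how many iterations run.
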
## Training and Evaluation

### Loss Functions

The standard metric for optical flow is the endpoint error (EPE), the Euclidean distance between predicted and ground-truth flow vectors, averaged over all pixels:

```python
import torch

def endpoint_error(pred_flow, gt_flow):
    """
    Calculate the average endpoint error.
    pred_flow, gt_flow: Bx2xHxW tensors
    """
    # Per-pixel Euclidean distance between flow vectors
    epe = torch.norm(pred_flow - gt_flow, p=2, dim=1)
    # Mean over batch and pixels
    return epe.mean()
```

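
A small worked example makes the metric concrete. The NumPy version below mirrors the function above (the name `endpoint_error_np` is made up for this sketch) and applies it to a prediction that is off by 3 pixels horizontally and 4 pixels vertically everywhere.

```python
import numpy as np

def endpoint_error_np(pred_flow, gt_flow):
    """Mean endpoint error for Bx2xHxW arrays."""
    return np.linalg.norm(pred_flow - gt_flow, axis=1).mean()

gt = np.zeros((1, 2, 4, 4))
pred = gt.copy()
pred[:, 0] += 3.0   # every pixel is off by 3 px horizontally
pred[:, 1] += 4.0   # and by 4 px vertically

print(endpoint_error_np(pred, gt))  # 5.0
```

Each pixel's error is the length of a 3-4-5 triangle's hypotenuse, so the mean is 5 pixels.
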
For multi-scale training, the loss is a weighted sum of the EPE at each prediction scale:

```python
import torch.nn.functional as F

def multiscale_loss(flow_preds, flow_gt, weights):
    """
    Compute the weighted EPE across multiple scales.
    flow_preds: list of Bx2xhxw predictions, one per scale
    weights: one loss weight per prediction
    """
    loss = 0
    for flow, weight in zip(flow_preds, weights):
        # Resize the ground truth to the prediction's resolution
        # (flow values themselves are not rescaled here)
        scaled_gt = F.interpolate(
            flow_gt,
            size=flow.shape[-2:],
            mode='bilinear',
            align_corners=False
        )
        # Accumulate the weighted EPE at this scale
        loss += weight * endpoint_error(flow, scaled_gt)
    return loss
```

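
The weighting can be traced by hand on a tiny case. This NumPy sketch is an illustration under assumptions: it downsamples by block averaging instead of bilinear interpolation, the three prediction scales are hypothetical, and the weights are illustrative values rather than ones taken from any paper.

```python
import numpy as np

def epe(pred, gt):
    return np.linalg.norm(pred - gt, axis=1).mean()

def downsample(flow, factor):
    """Block-average a Bx2xHxW array by an integer factor (H, W divisible)."""
    b, c, h, w = flow.shape
    return flow.reshape(b, c, h // factor, factor, w // factor, factor).mean(axis=(3, 5))

gt = np.zeros((1, 2, 8, 8))
gt[:, 0] = 2.0                                  # uniform 2 px motion to the right

# Hypothetical predictions at 1/4, 1/2, and full resolution, all predicting zero flow
preds = [np.zeros((1, 2, 2, 2)), np.zeros((1, 2, 4, 4)), np.zeros((1, 2, 8, 8))]
weights = [0.25, 0.5, 1.0]                      # illustrative weights only

loss = sum(
    wgt * epe(p, downsample(gt, gt.shape[-1] // p.shape[-1]))
    for p, wgt in zip(preds, weights)
)
print(loss)  # 3.5
```

Every scale has an EPE of 2, so the total is 2 * (0.25 + 0.5 + 1.0) = 3.5. Larger weights on finer scales push training toward accurate full-resolution output.
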
## Conclusion

The evolution of optical flow algorithms shows a clear progression:
1. Classical methods built on mathematical principles and explicit assumptions
2. Early deep learning approaches such as FlowNet replaced hand-crafted features with learned ones
3. Modern architectures like RAFT combine learned features with all-pairs correlation and iterative refinement

Each approach offers different trade-offs in:
- Accuracy vs. computational cost
- Large vs. small motion handling
- Training data requirements
- Real-time performance capabilities

Choose your method based on your specific requirements for these factors.