This project implements a Sum of Squared Differences (SSD) tracker for live video. A user-selected ROI serves as the initial template, and the tracker iteratively solves for the displacement vector $\mathbf{u} = [u, v]^T$ that minimizes the alignment error across frames.
The ROI is selected on the first frame using cv2.selectROI(), then the template is extracted from the initial image. For the next frame, the tracker estimates the displacement
$$\mathbf{u} = [u, v]^T.$$
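The selection-and-crop step can be sketched as below. In the live tracker the ROI tuple comes from `cv2.selectROI`, which returns `(x, y, w, h)`; here a synthetic frame and a hard-coded ROI stand in so the snippet runs on its own (all names are illustrative):

```python
import numpy as np

def extract_template(frame, roi):
    """Crop the template patch from a frame given an (x, y, w, h) ROI."""
    x, y, w, h = roi
    return frame[y:y + h, x:x + w].copy()

# In the live tracker: roi = cv2.selectROI("frame", first_frame)
# Here a synthetic grayscale frame and a hard-coded ROI stand in.
frame = np.arange(240 * 320, dtype=np.float32).reshape(240, 320)
roi = (50, 40, 64, 48)            # x, y, width, height
template = extract_template(frame, roi)
print(template.shape)             # (48, 64): rows = h, cols = w
```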
The update is solved iteratively until convergence or until the maximum number of iterations is reached. A practical stopping rule compares consecutive updates:
$$\frac{\|\mathbf{u}_k\|}{\|\mathbf{u}_{k-1}\|} < \text{threshold} \quad \text{or} \quad k = \text{max\_iterations}.$$
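A minimal NumPy sketch of the 2-DoF update loop with this consecutive-update stopping rule (all names are illustrative, and a synthetic frame stands in for live video):

```python
import numpy as np

def sample_bilinear(img, xs, ys):
    """Sample img at float coordinates (xs, ys) with bilinear interpolation."""
    x0 = np.clip(np.floor(xs).astype(int), 0, img.shape[1] - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, img.shape[0] - 2)
    du, dv = xs - x0, ys - y0
    return ((1 - du) * (1 - dv) * img[y0, x0] + du * (1 - dv) * img[y0, x0 + 1]
            + (1 - du) * dv * img[y0 + 1, x0] + du * dv * img[y0 + 1, x0 + 1])

def track_translation(template, frame, max_iters=50, threshold=1e-3):
    """Estimate u = [ux, uy] such that frame(x + u) matches template(x)."""
    h, w = template.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    u = np.zeros(2)
    prev_norm = None
    for k in range(max_iters):
        warped = sample_bilinear(frame, xs + u[0], ys + u[1])
        gy, gx = np.gradient(warped)                     # image gradients
        e = (template - warped).ravel()                  # residual T - I(W)
        SD = np.stack([gx.ravel(), gy.ravel()], axis=1)  # steepest-descent images
        du = np.linalg.solve(SD.T @ SD, SD.T @ e)        # 2x2 normal equations
        u += du
        norm = np.linalg.norm(du)
        if prev_norm is not None and norm / max(prev_norm, 1e-12) < threshold:
            break                                        # consecutive-update test
        prev_norm = norm
    return u

# Demo: recover a known (2, 1) pixel shift from a synthetic image.
fys, fxs = np.mgrid[0:100, 0:100].astype(float)
frame = np.sin(fxs / 7.0) + np.cos(fys / 5.0)
tys, txs = np.mgrid[0:80, 0:80].astype(float)
template = sample_bilinear(frame, txs + 2.0, tys + 1.0)
u_est = track_translation(template, frame)
print(u_est)  # close to the true shift (2, 1)
```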
After each iteration, the ROI is shifted by $\mathbf{u}$ and the updated bounding box is drawn on the subsequent frame. Iterating the update improves accuracy; a single, non-iterated step may fail when the inter-frame motion is large.
A live tracker processes frames in real time. Two strategies are evaluated and compared in the following airplane scene: keeping a fixed template from the first frame (left picture), and updating the template every frame using the current ROI (right picture). Updating the template adapts to appearance changes (lighting or deformation), but it can also propagate drift when one frame is misaligned.
Applying the tracker to longer sequences shows the typical failure modes. The 2-DoF tracker performs well under smooth motion, consistent lighting, and distinct ROI textures, but it struggles with rapid motion, occlusion, and strong perspective or appearance changes. To handle these cases, higher-DoF warps can model rotation, scale, and projective deformation.
For higher-DoF tracking, a common choice is an 8-DoF homography warp. The mapping from input coordinates (x,y) to output coordinates (u,v) is written in homogeneous form:
$$\begin{bmatrix} u \\ v \\ w \end{bmatrix} = \begin{bmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.$$
Expanding and normalizing by $w$ gives the projective coordinates:
$$u = \frac{h_1 x + h_2 y + h_3}{h_7 x + h_8 y + 1}, \qquad v = \frac{h_4 x + h_5 y + h_6}{h_7 x + h_8 y + 1}.$$
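This mapping and the division by $w$ can be sketched in a few lines of NumPy (the function name and the example homography are illustrative):

```python
import numpy as np

def apply_homography(H, pts):
    """Map Nx2 points through a 3x3 homography and normalize by w."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coords
    mapped = pts_h @ H.T                              # rows are [u, v, w]
    return mapped[:, :2] / mapped[:, 2:3]             # divide by w

H = np.array([[1.0, 0.0, 5.0],    # translation by (5, 3) ...
              [0.0, 1.0, 3.0],
              [1e-3, 0.0, 1.0]])  # ... plus a mild projective term h7
pts = np.array([[0.0, 0.0], [10.0, 20.0]])
print(apply_homography(H, pts))
```

With an identity homography the points map to themselves, which makes a convenient sanity check.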
Warping requires sampling intensity values at subpixel coordinates, so bilinear interpolation is used. Let $(u, v)$ lie within a pixel cell with corner intensities $I_{00}, I_{01}, I_{10}, I_{11}$ (first subscript indexing $v$, second indexing $u$) and fractional offsets $du$ and $dv$:
$$I(u, v) = (1 - du)(1 - dv)\,I_{00} + du\,(1 - dv)\,I_{01} + (1 - du)\,dv\,I_{10} + du\,dv\,I_{11}.$$
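A direct scalar implementation of this weighting, as a sanity check at the center of a 2x2 cell (names are illustrative):

```python
import numpy as np

def bilinear(img, u, v):
    """Interpolate img at a single subpixel location (u, v) = (col, row)."""
    x0, y0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - x0, v - y0
    I00, I01 = img[y0, x0], img[y0, x0 + 1]          # first subscript: v, second: u
    I10, I11 = img[y0 + 1, x0], img[y0 + 1, x0 + 1]
    return ((1 - du) * (1 - dv) * I00 + du * (1 - dv) * I01
            + (1 - du) * dv * I10 + du * dv * I11)

img = np.array([[0.0, 1.0],
                [2.0, 3.0]])
print(bilinear(img, 0.5, 0.5))  # cell center: (0 + 1 + 2 + 3) / 4 = 1.5
```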
The optimization follows a Lucas–Kanade / Gauss–Newton derivation. The brightness constancy assumption is
$$I(W(x; p)) = T(x),$$
and the SSD objective is
$$E(p) = \sum_{x} \left[ T(x) - I(W(x; p)) \right]^2.$$
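In code the SSD objective is a short reduction; here `warped` stands in for the warped image $I(W(x; p))$ sampled on the template grid (names are illustrative):

```python
import numpy as np

def ssd_error(template, warped):
    """SSD objective: sum over x of [T(x) - I(W(x; p))]^2."""
    r = template.astype(float) - warped.astype(float)
    return float(np.sum(r * r))

T = np.array([[1.0, 2.0], [3.0, 4.0]])
W = np.array([[1.0, 2.0], [3.0, 6.0]])
print(ssd_error(T, W))  # single mismatch of 2 -> (4 - 6)^2 = 4.0
```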
Linearizing the warped image around $p$ gives
$$I(W(x; p + \Delta p)) \approx I(W(x; p)) + \nabla I(W(x; p))\, J\, \Delta p,$$
where $J = \partial W / \partial p$ is the Jacobian of the warp with respect to its parameters.
Define the residual $e(x) = T(x) - I(W(x; p))$. Substituting the approximation yields a linearized least-squares problem and the normal equation
$$H\,\Delta p = b, \qquad H = \sum_{x} SD(x)\,SD(x)^T, \qquad b = \sum_{x} SD(x)\,e(x),$$
where the steepest-descent direction is $SD(x) = J^T \nabla I(W(x; p))$. For numerical stability, $H$ is factorized using a QR decomposition and the system is solved by back-substitution. Parameters update as
$$p \leftarrow p + \Delta p,$$
iterating until $\|\Delta p\| < \epsilon$.
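The QR-plus-back-substitution solve can be sketched as follows: factor $H = QR$, then solve $R\,\Delta p = Q^T b$. A hand-rolled back-substitution is shown for clarity; in practice `np.linalg.solve` or `np.linalg.lstsq` would serve, and all names here are illustrative:

```python
import numpy as np

def back_substitute(R, y):
    """Solve the upper-triangular system R x = y from the last row up."""
    n = len(y)
    x = np.zeros(n)
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - R[i, i + 1:] @ x[i + 1:]) / R[i, i]
    return x

def gauss_newton_step(SD, e):
    """Solve H dp = b with H = SD^T SD and b = SD^T e, via QR of H."""
    H = SD.T @ SD
    b = SD.T @ e
    Q, R = np.linalg.qr(H)
    return back_substitute(R, Q.T @ b)

rng = np.random.default_rng(0)
SD = rng.standard_normal((100, 8))   # steepest-descent images, 8 warp params
e = rng.standard_normal(100)         # residuals
dp = gauss_newton_step(SD, e)        # one Gauss-Newton update for p
```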
In one controlled test case, the 8-DoF tracker achieved a noticeably lower mean squared error (MSE) than a 6-DoF affine baseline and a non-iterative variant that skips Gauss–Newton refinement.
The gap is clearest when the ROI undergoes perspective tilt and scale change: the full homography model with iterative updates stays locked onto the target, while simpler or non-iterative variants drift over time. Full math derivations and code are available in the GitHub repository.