Harris affine region detector
In the fields of computer vision and image analysis, the Harris-affine region detector belongs to the category of feature detection. Feature detection is a preprocessing step of several algorithms that rely on identifying characteristic points, or interest points, in order to make correspondences between images, recognize textures, categorize objects or build panoramas.
The Harris-affine detector can identify similar regions between images that are related through affine transformations and have different illuminations. These affine-invariant detectors should be capable of identifying similar regions in images taken from different viewpoints that are related by a simple geometric transformation: scaling, rotation and shearing. These detected regions have been called both invariant and covariant. On one hand, the regions are detected invariant of the image transformation; on the other hand, the regions change covariantly with the image transformation [1]. The distinction between these two naming conventions matters less than the underlying idea: the interest points are designed so that they are compatible across images taken from several viewpoints. Other detectors that are affine-invariant include Hessian-affine regions, maximally stable extremal regions, the Kadir-Brady saliency detector, edge-based regions (EBR) and intensity-extrema-based regions (IBR).
Mikolajczyk and Schmid (2002) first described the Harris-Affine detector as it is used today in An Affine Invariant Interest Point Detector [2]. Earlier works in this direction include the use of affine-adapted feature points for matching by Baumberg [3] and the first use of scale-invariant feature points by Lindeberg [4]. The Harris-Affine detector relies on the combination of corner points detected through Harris corner detection, multi-scale analysis through Gaussian scale-space and affine normalization using an iterative affine shape adaptation algorithm. The algorithm is iterative: starting from initial points found by the scale-invariant Harris-Laplace detector, it repeatedly re-estimates the scales, spatial location and affine shape of each region until a stopping criterion is met.
The Harris-Affine detector relies heavily on both the Harris measure and a Gaussian scale-space representation. Therefore, a brief examination of both follows. For more exhaustive derivations, see corner detection and Gaussian scale-space or their associated papers [4][5].
The Harris corner detector algorithm relies on a central principle: at a corner, the image intensity will change largely in multiple directions. This can alternatively be formulated by examining the changes of intensity due to shifts in a local window. Around a corner point, the image intensity will change greatly when the window is shifted in an arbitrary direction. Following this intuition and through a clever decomposition, the Harris detector uses the second moment matrix as the basis of its corner decisions. (See corner detection for a more complete derivation.) The matrix A has also been called the autocorrelation matrix and has values closely related to the derivatives of image intensity.

$$A(\mathbf{x}) = \sum_{p,q} w(p,q) \begin{bmatrix} I_x^2(p,q) & I_x I_y(p,q) \\ I_x I_y(p,q) & I_y^2(p,q) \end{bmatrix}$$
where Ix and Iy are the respective derivatives (of pixel intensity) in the x and y direction. The off-diagonal entries are the product of Ix and Iy, while the diagonal entries are squares of the respective derivatives. The weighting function w(x,y) can be uniform, but is more typically an isotropic, circular Gaussian,

$$w(x,y) = g(x,y,\sigma) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

that acts to average in a local region while weighting those values near the center more heavily.
As it turns out, this A matrix describes the shape of the autocorrelation measure due to shifts in window location. Thus, if we let λ1 and λ2 be the eigenvalues of A, then these values will provide a quantitative description of how the autocorrelation measure changes in space: its principal curvatures. As Harris and Stephens (1988) point out, the A matrix centered on corner points will have two large, positive eigenvalues [5]. Rather than extracting these eigenvalues using methods like singular value decomposition, the Harris measure based on the trace and determinant is used:

$$R = \det(A) - \alpha \operatorname{trace}^2(A) = \lambda_1 \lambda_2 - \alpha (\lambda_1 + \lambda_2)^2$$
where α is a constant. Corner points have large, positive eigenvalues and would thus have a large Harris measure. Thus, corner points are identified as local maxima of the Harris measure that are above a specified threshold.

$$\{x_c\} = \{\, x_c \mid R(x_c) > R(x_i),\ \forall x_i \in W(x_c) \,\}, \qquad R(x_c) > t_{threshold}$$

where $\{x_c\}$ is the set of all corner points, R(x) is the Harris measure calculated at x, $W(x_c)$ is an 8-neighbor set centered around $x_c$ and $t_{threshold}$ is a specified threshold.
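As a minimal sketch of the Harris measure described above, the following pure-Python example builds the matrix A for a Gaussian-weighted window of a small synthetic image and evaluates R = det(A) - α·trace²(A). The image, the central-difference derivatives, the window radius, σ = 1 and α = 0.04 are all illustrative assumptions, not values taken from the text.

```python
import math

def gaussian_weight(dx, dy, sigma):
    """Isotropic Gaussian window weight w(dx, dy)."""
    norm = 2.0 * math.pi * sigma * sigma
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma)) / norm

def pixel(img, x, y):
    """Read a pixel, replicating the border for out-of-range indices."""
    h, w = len(img), len(img[0])
    return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

def harris_measure(img, x, y, sigma=1.0, radius=2, alpha=0.04):
    """R = det(A) - alpha * trace(A)^2 for the window centred on (x, y)."""
    a_xx = a_xy = a_yy = 0.0
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            px, py = x + dx, y + dy
            # Central differences approximate the derivatives I_x and I_y.
            ix = 0.5 * (pixel(img, px + 1, py) - pixel(img, px - 1, py))
            iy = 0.5 * (pixel(img, px, py + 1) - pixel(img, px, py - 1))
            w = gaussian_weight(dx, dy, sigma)
            a_xx += w * ix * ix
            a_xy += w * ix * iy
            a_yy += w * iy * iy
    det = a_xx * a_yy - a_xy * a_xy
    trace = a_xx + a_yy
    return det - alpha * trace * trace

# Synthetic 15x15 image: a bright quadrant whose corner sits at (7, 7).
SIZE, STEP = 15, 7
image = [[1.0 if (x >= STEP and y >= STEP) else 0.0 for x in range(SIZE)]
         for y in range(SIZE)]

r_corner = harris_measure(image, 7, 7)   # gradients in both directions
r_edge = harris_measure(image, 7, 11)    # gradient in x only
r_flat = harris_measure(image, 2, 2)     # no gradient at all
print(r_corner, r_edge, r_flat)
```

In this toy image the measure is positive at the corner, negative on the straight edge (one eigenvalue is zero, so only the trace term survives) and zero in the flat area, which is the behaviour the thresholded local-maximum rule relies on.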
A Gaussian scale-space representation of an image is the set of images that result from convolving a Gaussian kernel of various sizes with the original image. In general, the representation can be formulated as:

$$L(\mathbf{x}, s) = G(s) \otimes I(\mathbf{x})$$

where G(s) is an isotropic, circular Gaussian kernel as defined above. The convolution with a Gaussian kernel smooths the image using a window the size of the kernel. A larger scale, s, corresponds to a smoother resultant image. Mikolajczyk and Schmid (2001) point out that derivatives and other measurements must be normalized across scales [6]. A derivative of order m, $D_{i_1, \dots, i_m}$, must be normalized by a factor $s^m$ in the following manner:

$$D_{i_1, \dots, i_m}(\mathbf{x}, s) = s^m L_{i_1, \dots, i_m}(\mathbf{x}, s)$$

These derivatives, or any arbitrary measure, can be adapted to a scale-space representation by calculating this measure using a set of scales recursively, where the nth scale is $s_n = k^n s_0$. See scale space for a more complete description.
The Harris-Laplace detector combines the traditional 2D Harris corner detector with the idea of a Gaussian scale-space representation in order to create a scale-invariant detector. Harris corner points are good starting points because they have been shown to have good rotational and illumination invariance in addition to identifying the interesting points of the image [7]. However, the points are not scale invariant, and thus the second-moment matrix must be modified to reflect a scale-invariant property. Let us denote $M = \mu(\mathbf{x}, \sigma_I, \sigma_D)$ as the scale-adapted second-moment matrix used in the Harris-Laplace detector [8]:

$$M = \mu(\mathbf{x}, \sigma_I, \sigma_D) = \sigma_D^2 \, g(\sigma_I) \otimes \begin{bmatrix} L_x^2(\mathbf{x}, \sigma_D) & L_x L_y(\mathbf{x}, \sigma_D) \\ L_x L_y(\mathbf{x}, \sigma_D) & L_y^2(\mathbf{x}, \sigma_D) \end{bmatrix}$$

where $g(\sigma_I)$ is the Gaussian kernel of scale σI and $\mathbf{x} = (x, y)$. Similar to the Gaussian scale-space, $L(\mathbf{x})$ is the Gaussian-smoothed image. The $\otimes$ operator denotes convolution. $L_x(\mathbf{x}, \sigma_D)$ and $L_y(\mathbf{x}, \sigma_D)$ are the derivatives in their respective direction applied to the smoothed image and calculated using a Gaussian kernel with scale σD. In terms of our Gaussian scale-space framework, the σI parameter determines the current scale at which the Harris corner points are detected.
Building upon this scale-adapted second-moment matrix, the Harris-Laplace detector is a twofold process: applying the Harris corner detector at multiple scales and automatically choosing the characteristic scale.
The algorithm searches over a fixed number of predefined scales. This set of scales is defined as:

$$\sigma_1 \dots \sigma_n = k^1 \sigma_0 \dots k^n \sigma_0$$
Mikolajczyk and Schmid (2004) use k = 1.4. For each integration scale, σI, chosen from this set, the appropriate differentiation scale is chosen to be a constant factor of the integration scale: σD = sσI. Mikolajczyk and Schmid (2004) used s = 0.7 [8]. Using these scales, the interest points are detected using a Harris measure on the $\mu(\mathbf{x}, \sigma_I, \sigma_D)$ matrix. The cornerness, like the typical Harris measure, is defined as:

$$\mathit{cornerness} = \det(\mu(\mathbf{x}, \sigma_I, \sigma_D)) - \alpha \operatorname{trace}^2(\mu(\mathbf{x}, \sigma_I, \sigma_D))$$
Like the traditional Harris detector, corner points are those local (8 point neighborhood) maxima of the cornerness that are above a specified threshold.
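The scale set used by the multi-scale Harris step can be sketched in a few lines. The values k = 1.4 and s = 0.7 come from the text above; the base scale `sigma0 = 1.0` and the number of scales `n = 5` are assumptions chosen only for illustration.

```python
def integration_scales(sigma0, k=1.4, n=5):
    """Return sigma_1 ... sigma_n with sigma_i = k**i * sigma0."""
    return [sigma0 * k ** i for i in range(1, n + 1)]

def differentiation_scale(sigma_i, s=0.7):
    """Differentiation scale as a constant fraction of the integration scale."""
    return s * sigma_i

scales = integration_scales(1.0)
# Each entry pairs an integration scale with its differentiation scale.
pairs = [(sigma_i, differentiation_scale(sigma_i)) for sigma_i in scales]
for sigma_i, sigma_d in pairs:
    print(f"sigma_I = {sigma_i:.3f}, sigma_D = {sigma_d:.3f}")
```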
An iterative algorithm based on Lindeberg (1998) both spatially localizes the corner points and selects the characteristic scale [4]. The iterative search has three key steps, which are carried out for each point $\mathbf{x}$ that was initially detected at scale σI by the multi-scale Harris detector (k indicates the kth iteration):

1. Choose the scale $\sigma_I^{(k+1)}$ that maximizes the Laplacian-of-Gaussians (LoG) over a predefined range of neighboring scales. The neighboring scales are typically chosen from a range that is within a two scale-space neighborhood. That is, if the original points were detected using a scaling factor of 1.4 between successive scales, a two scale-space neighborhood is the range $t \in [0.7, \dots, 1.4]$. Thus the Gaussian scales examined are $\sigma_I^{(k+1)} = t\,\sigma_I^{(k)}$. The LoG measurement is defined as:

   $$|LoG(\mathbf{x}, \sigma_I)| = \sigma_I^2 \left| L_{xx}(\mathbf{x}, \sigma_I) + L_{yy}(\mathbf{x}, \sigma_I) \right|$$

   where $L_{xx}$ and $L_{yy}$ are the second derivatives in their respective directions. The $\sigma_I^2$ factor (as discussed above in Gaussian scale-space) is used to normalize the LoG across scales and make these measures comparable, thus making a maximum relevant. Mikolajczyk and Schmid (2001) demonstrate that the LoG measure attains the highest percentage of correctly detected corner points in comparison to other scale-selection measures [6]. The scale which maximizes this LoG measure in the two scale-space neighborhood is deemed the characteristic scale, $\sigma_I^{(k+1)}$, and used in subsequent iterations. If no extremum (maximum) of the LoG is found, the point is discarded from future searches.
2. Using the characteristic scale, the point is spatially localized: $\mathbf{x}^{(k+1)}$ is chosen such that it maximizes the Harris corner measure (cornerness as defined above) within an 8×8 local neighborhood.
3. Stopping criterion: $\sigma_I^{(k+1)} = \sigma_I^{(k)}$ and $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)}$.

If the stopping criterion is not met, then the algorithm repeats from step 1 using the new k + 1 points and scale. When the stopping criterion is met, the found points represent those that maximize the LoG across scales (scale selection) and maximize the Harris corner measure in a local neighborhood (spatial selection). It is important to note that although Harris points may not be localized at the same position across scales, they ultimately all converge to the same scale-invariant point. That is to say, a corner point that is detected at multiple scales may not be at the same coordinates at each scale. However, through the selection of characteristic scale and spatial localization, the points will converge [6].
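The scale-selection step can be illustrated on a case where the scale-normalized LoG is known in closed form. For an assumed test image f(x) = exp(-|x|² / (2σ₀²)), a Gaussian blob of width σ₀, smoothing with a Gaussian of scale s gives a normalized LoG magnitude at the blob centre of 2s²σ₀² / (σ₀² + s²)², which peaks at s = σ₀. The sketch below searches the candidate factors t ∈ {0.7, ..., 1.4} around the previous scale. The blob model, the step of 0.1 between factors, and the rule that a maximum on the border of the range counts as "no extremum" are assumptions made for illustration.

```python
T_FACTORS = (0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4)

def normalized_log_at_blob_centre(scale, blob_sigma):
    """sigma^2 * |L_xx + L_yy| at the centre of a Gaussian blob of width blob_sigma."""
    total_var = blob_sigma ** 2 + scale ** 2
    return 2.0 * scale ** 2 * blob_sigma ** 2 / total_var ** 2

def select_scale(prev_scale, blob_sigma):
    """Pick the candidate t * prev_scale with the largest normalized LoG.

    Returns None when the best response lies on the border of the search
    range, i.e. no interior maximum was found (assumed discard rule).
    """
    candidates = [t * prev_scale for t in T_FACTORS]
    responses = [normalized_log_at_blob_centre(s, blob_sigma) for s in candidates]
    best = max(range(len(candidates)), key=lambda i: responses[i])
    if best in (0, len(candidates) - 1):
        return None
    return candidates[best]

# Iterate until the selected scale stops changing (the stopping criterion
# on the scale; spatial localization is omitted in this sketch).
scale = 1.6
for _ in range(10):
    new_scale = select_scale(scale, blob_sigma=2.0)
    if new_scale is None or new_scale == scale:
        break
    scale = new_scale
final_scale = scale
print(final_scale)
```

Because only a discrete set of factors is searched, the iteration settles on the candidate nearest to the blob width rather than on σ₀ exactly.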
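The Harris-Laplace detected points are scale invariant and work well for isotropic regions that are viewed from the same viewing angle. In order to be invariant to arbitrary affine transformations (and viewpoints), the mathematical framework must be revisited. The second-moment matrix $\mu$ is defined more generally for anisotropic regions:

$$\mu(\mathbf{x}, \Sigma_I, \Sigma_D) = \det(\Sigma_D) \, g(\Sigma_I) \otimes \left( \nabla L(\mathbf{x}, \Sigma_D) \, \nabla L(\mathbf{x}, \Sigma_D)^T \right)$$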
where ΣI and ΣD are covariance matrices defining the integration and the differentiation Gaussian kernel scales, respectively. Although this may look significantly different from the second-moment matrix in the Harris-Laplace detector, it is in fact the same construction. The earlier μ matrix was the 2D-isotropic version in which the covariance matrices ΣI and ΣD were 2x2 identity matrices multiplied by factors σI and σD, respectively. In the new formulation, one can think of the Gaussian kernels as multivariate Gaussian distributions as opposed to a uniform Gaussian kernel. A uniform Gaussian kernel can be thought of as an isotropic, circular region. Similarly, a more general Gaussian kernel defines an ellipse (an ellipsoid in higher dimensions). In fact, the eigenvectors and eigenvalues of the covariance matrix define the rotation and size of the ellipse. Thus this representation allows us to completely define an arbitrary elliptical affine region over which we want to integrate or differentiate.
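To make the link between a covariance matrix and its elliptical region concrete, the sketch below recovers the semi-axis lengths and the orientation of the ellipse x^T Σ⁻¹ x = 1 from a symmetric 2x2 matrix, using the closed-form eigen-decomposition. The function name and the example matrix (axes 2 and 1, rotated by 30 degrees) are hypothetical and chosen only for illustration.

```python
import math

def ellipse_from_covariance(sxx, sxy, syy):
    """Semi-axes and major-axis angle of the ellipse x^T Sigma^{-1} x = 1.

    Sigma = [[sxx, sxy], [sxy, syy]] must be symmetric positive definite.
    """
    mean = 0.5 * (sxx + syy)
    root = math.hypot(0.5 * (sxx - syy), sxy)
    lam_max, lam_min = mean + root, mean - root
    # Orientation of the eigenvector belonging to the larger eigenvalue.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.sqrt(lam_max), math.sqrt(lam_min), angle

# Build Sigma = R diag(4, 1) R^T for a rotation of 30 degrees.
theta = math.radians(30.0)
c, s = math.cos(theta), math.sin(theta)
sigma = (4.0 * c * c + 1.0 * s * s, 3.0 * c * s, 4.0 * s * s + 1.0 * c * c)

major, minor, angle = ellipse_from_covariance(*sigma)
print(major, minor, math.degrees(angle))
```

An isotropic matrix (a multiple of the identity) yields equal semi-axes, which is the circular special case used by Harris-Laplace.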
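The goal of the affine invariant detector is to identify regions in images that are related through affine transformations. We thus consider a point $\mathbf{x}_L$ and the transformed point $\mathbf{x}_R = A\mathbf{x}_L$, where A is an affine transformation. In the case of images, both $\mathbf{x}_R$ and $\mathbf{x}_L$ live in $\mathbb{R}^2$ space. The second-moment matrices are related in the following manner [10]:

$$\begin{aligned} \mu(\mathbf{x}_L, \Sigma_{I,L}, \Sigma_{D,L}) &= A^T \mu(\mathbf{x}_R, \Sigma_{I,R}, \Sigma_{D,R}) A \\ M_L = \mu(\mathbf{x}_L, \Sigma_{I,L}, \Sigma_{D,L}) &= A^T M_R A \\ \Sigma_{I,R} = A \Sigma_{I,L} A^T \quad &\text{and} \quad \Sigma_{D,R} = A \Sigma_{D,L} A^T \end{aligned}$$

where $\Sigma_{I,b}$ and $\Sigma_{D,b}$ are the covariance matrices for the b reference frame. If we continue with this formulation and enforce that

$$\Sigma_{I,L} = \sigma_I M_L^{-1}, \qquad \Sigma_{D,L} = \sigma_D M_L^{-1}$$

where σI and σD are scalar factors, one can show that the covariance matrices for the related point are similarly related:

$$\Sigma_{I,R} = \sigma_I M_R^{-1}, \qquad \Sigma_{D,R} = \sigma_D M_R^{-1}$$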
By requiring the covariance matrices to satisfy these conditions, several nice properties arise. One of these properties is that the square root of the second-moment matrix, $M^{1/2}$, will transform the original anisotropic region into isotropic regions that are related simply through a pure rotation matrix R. These new isotropic regions can be thought of as a normalized reference frame. The following equations formulate the relation between the normalized points $\mathbf{x}_R'$ and $\mathbf{x}_L'$:

$$A = M_R^{-1/2} R \, M_L^{1/2}, \qquad \mathbf{x}_R' = M_R^{1/2} \mathbf{x}_R, \qquad \mathbf{x}_L' = M_L^{1/2} \mathbf{x}_L, \qquad \mathbf{x}_R' = R \, \mathbf{x}_L'$$
The rotation matrix can be recovered using gradient methods like those in the SIFT descriptor. As discussed with the Harris detector, the eigenvalues and eigenvectors of the second-moment matrix, $M = \mu(\mathbf{x}, \Sigma_I, \Sigma_D)$, characterize the curvature and shape of the pixel intensities. That is, the eigenvector associated with the largest eigenvalue indicates the direction of largest change and the eigenvector associated with the smallest eigenvalue defines the direction of least change. In the 2D case, the eigenvectors and eigenvalues define an ellipse. For an isotropic region, the region should be circular in shape and not elliptical. This is the case when the eigenvalues have the same magnitude. Thus a measure of the isotropy around a local region is defined as the following:

$$\mathcal{Q} = \frac{\lambda_{\min}(M)}{\lambda_{\max}(M)}$$

where λ denotes the eigenvalues. This measure has the range $[0, 1]$. A value of 1 corresponds to perfect isotropy.
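The following sketch checks the normalization argument numerically for 2x2 matrices. It computes the isotropy measure before and after normalizing with the square root of the second-moment matrix, and verifies that the map left between the two normalized frames is a pure rotation. The matrices `A` and `M_R` are hypothetical example values, and the closed-form square root used here holds only for 2x2 symmetric positive-definite matrices.

```python
import math

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose(p):
    return [[p[0][0], p[1][0]], [p[0][1], p[1][1]]]

def inverse(p):
    det = p[0][0] * p[1][1] - p[0][1] * p[1][0]
    return [[p[1][1] / det, -p[0][1] / det], [-p[1][0] / det, p[0][0] / det]]

def sqrt_spd(m):
    """Principal square root of a 2x2 symmetric positive-definite matrix."""
    s = math.sqrt(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    t = math.sqrt(m[0][0] + m[1][1] + 2.0 * s)
    return [[(m[0][0] + s) / t, m[0][1] / t], [m[1][0] / t, (m[1][1] + s) / t]]

def isotropy(m):
    """Q = lambda_min / lambda_max of a symmetric 2x2 matrix."""
    mean = 0.5 * (m[0][0] + m[1][1])
    root = math.hypot(0.5 * (m[0][0] - m[1][1]), m[0][1])
    return (mean - root) / (mean + root)

def normalized_moment(m):
    """Second-moment matrix after mapping points with x' = M^{1/2} x."""
    w = inverse(sqrt_spd(m))
    return matmul(matmul(transpose(w), m), w)

A = [[1.2, 0.5], [-0.3, 0.9]]     # affine map x_R = A x_L (example values)
M_R = [[2.0, 0.4], [0.4, 1.0]]    # second-moment matrix at x_R (example values)
M_L = matmul(matmul(transpose(A), M_R), A)   # M_L = A^T M_R A

q_left = isotropy(M_L)                               # anisotropic: below 1
q_left_normalized = isotropy(normalized_moment(M_L)) # isotropic: close to 1

# R = M_R^{1/2} A M_L^{-1/2} relates the two normalized frames.
R = matmul(matmul(sqrt_spd(M_R), A), inverse(sqrt_spd(M_L)))
RtR = matmul(transpose(R), R)   # should be the identity if R is a rotation
print(q_left, q_left_normalized, RtR)
```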
Using this mathematical framework, the Harris-Affine detector algorithm iteratively discovers the second-moment matrix that transforms the anisotropic region into a normalized region in which the isotropic measure is sufficiently close to one. The algorithm uses this shape adaptation matrix, U, to transform the image into a normalized reference frame. In this normalized space, the interest points' parameters (spatial location, integration scale and differentiation scale) are refined using methods similar to the Harris-Laplace detector. The second-moment matrix is computed in this normalized reference frame and should have an isotropic measure close to one at the final iteration. At every kth iteration, each interest region is defined by several parameters that the algorithm must discover: the $U^{(k)}$ matrix, position $\mathbf{x}^{(k)}$, integration scale $\sigma_I^{(k)}$ and differentiation scale $\sigma_D^{(k)}$. Because the detector computes the second-moment matrix in the transformed domain, it is convenient to denote this transformed position as $\mathbf{x}_w^{(k)}$, where $U^{(k)} \mathbf{x}_w^{(k)} = \mathbf{x}^{(k)}$.
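1. The detector initializes the search space with the points detected by the Harris-Laplace detector: $U^{(0)}$ is the identity matrix, and $\mathbf{x}^{(0)}$, $\sigma_D^{(0)}$, and $\sigma_I^{(0)}$ are those from the Harris-Laplace detector.
2. Apply the shape adaptation matrix of the previous iteration, $U^{(k-1)}$, to generate the normalized reference frame, in which the point is $\mathbf{x}_w^{(k-1)}$ with $U^{(k-1)} \mathbf{x}_w^{(k-1)} = \mathbf{x}^{(k-1)}$. For the first iteration, apply $U^{(0)}$.
3. Select the integration scale, $\sigma_I^{(k)}$, using a method similar to the Harris-Laplace detector. The scale is chosen as the scale that maximizes the Laplacian of Gaussian (LoG). The search space consists of the scales within two scale-spaces of the previous iteration's scale.

   $$\sigma_I^{(k)} = \underset{\sigma_I = t\sigma_I^{(k-1)},\; t \in [0.7, \dots, 1.4]}{\operatorname{argmax}} \, \sigma_I^2 \left| L_{xx}(\mathbf{x}, \sigma_I) + L_{yy}(\mathbf{x}, \sigma_I) \right|$$

4. Select the differentiation scale, $\sigma_D^{(k)}$. In order to reduce the search space and degrees of freedom, the differentiation scale is taken to be related to the integration scale through a constant factor: $\sigma_D^{(k)} = s\sigma_I^{(k)}$. Because differentiation acts at a finer scale than integration, the constant factor is less than one. Mikolajczyk and Schmid (2001) note that too small a factor will make smoothing (integration) too significant in comparison to differentiation, and a factor that is too large will not allow the integration to average the covariance matrix [6]. It is common to choose $s \in [0.5, 0.75]$. From this set, the chosen scale will maximize the isotropic measure $\mathcal{Q} = \lambda_{\min}(\mu) / \lambda_{\max}(\mu)$.

   $$\sigma_D^{(k)} = \underset{\sigma_D = s\sigma_I^{(k)},\; s \in [0.5, \dots, 0.75]}{\operatorname{argmax}} \, \frac{\lambda_{\min}(\mu(\mathbf{x}_w^{(k)}, \sigma_I^{(k)}, \sigma_D))}{\lambda_{\max}(\mu(\mathbf{x}_w^{(k)}, \sigma_I^{(k)}, \sigma_D))}$$

   where $\mu(\mathbf{x}_w^{(k)}, \sigma_I^{(k)}, \sigma_D)$ is the second-moment matrix evaluated in the normalized reference frame. This maximization process causes the eigenvalues to converge to the same value.
5. Spatial localization: select the point $\mathbf{x}_w^{(k)}$ that maximizes the Harris corner measure (cornerness) within an 8-point neighborhood around the previous $\mathbf{x}_w^{(k-1)}$ point.

   $$\mathbf{x}_w^{(k)} = \underset{\mathbf{x}_w \in W(\mathbf{x}_w^{(k-1)})}{\operatorname{argmax}} \, \det(\mu(\mathbf{x}_w, \sigma_I^{(k)}, \sigma_D^{(k)})) - \alpha \operatorname{trace}^2(\mu(\mathbf{x}_w, \sigma_I^{(k)}, \sigma_D^{(k)}))$$

   where $W(\mathbf{x}_w^{(k-1)})$ is the set of 8-nearest neighbors of the previous iteration's point in the normalized reference frame. Because the spatial localization was done in the U-normalized reference frame, the newly chosen point must be transformed back to the original reference frame. This is achieved by transforming a displacement vector and adding this to the previous point:

   $$\mathbf{x}^{(k)} = \mathbf{x}^{(k-1)} + U^{(k-1)} \cdot (\mathbf{x}_w^{(k)} - \mathbf{x}_w^{(k-1)})$$

6. As mentioned above, the square root of the second-moment matrix defines the transformation that generates the normalized reference frame, so this matrix is saved: $\mu_i^{(k)} = \mu^{-1/2}(\mathbf{x}_w^{(k)}, \sigma_I^{(k)}, \sigma_D^{(k)})$. The transformation matrix U is updated: $U^{(k)} = \mu_i^{(k)} \cdot U^{(k-1)}$. In order to ensure that the image gets sampled correctly and that the image is expanded in the direction of the least change (smallest eigenvalue), the maximum eigenvalue is fixed: $\lambda_{\max}(U^{(k)}) = 1$. Using this updating method, the final U matrix takes the following form:

   $$U = \prod_k \mu_i^{(k)} \cdot U^{(0)} = \prod_k (\mu^{-1/2})^{(k)} \cdot U^{(0)}$$

7. If the stopping criterion is not met, continue to the next iteration at step 2. Because the algorithm iteratively solves for the U-normalization matrix that transforms an anisotropic region into an isotropic region, it stops when the isotropic measure, $\mathcal{Q} = \lambda_{\min} / \lambda_{\max}$, is sufficiently close to its maximum value 1. Sufficiently close implies the following stopping condition:

   $$1 - \frac{\lambda_{\min}(\mu_i^{(k)})}{\lambda_{\max}(\mu_i^{(k)})} < \varepsilon_C$$

   where $\varepsilon_C$ is a specified threshold.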
The computational complexity of the Harris-Affine detector is broken into two parts: initial point detection and affine region normalization. The initial point detection algorithm, Harris-Laplace, has complexity $\mathcal{O}(n)$, where n is the number of pixels in the image. The affine region normalization algorithm automatically detects the scale and estimates the shape adaptation matrix, U. This process has complexity $\mathcal{O}((m+k)p)$, where p is the number of initial points, m is the size of the search space for the automatic scale selection and k is the number of iterations required to compute the U matrix [8].
Some methods exist to reduce the complexity of the algorithm at the expense of accuracy. One method is to eliminate the search in the differentiation scale step. Rather than choose a factor s from a set of factors, the sped-up algorithm chooses the scale to be constant across iterations and points: $\sigma_D = s\sigma_I$ with s held constant. Although this reduction in search space might decrease the complexity, this change can severely affect the convergence of the U matrix.
One can imagine that this algorithm might identify duplicate interest points at multiple scales. Because the Harris-affine algorithm looks at each initial point given by the Harris-Laplace detector independently, there is no discrimination between identical points. In practice, it has been shown that these points will ultimately all converge to the same interest point. After identifying all interest points, the algorithm accounts for duplicates by comparing the spatial coordinates ($\mathbf{x}$), the integration scale σI, the isotropic measure $\lambda_{\min}(U) / \lambda_{\max}(U)$ and skew [8]. If these interest point parameters are similar within a specified threshold, then they are labeled duplicates. The algorithm discards all these duplicate points except for the interest point that is closest to the average of the duplicates. Typically 30% of the Harris-Affine points are distinct and dissimilar enough to not be discarded [8].
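The duplicate-removal rule can be sketched as a simple grouping pass. Everything specific below is an assumption made for illustration: the `Region` record, the tolerance values, the greedy grouping against the first member of each group, and the use of spatial distance to decide which point is "closest to the average". The source does not specify these details.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Region:
    x: float
    y: float
    scale: float      # integration scale sigma_I
    isotropy: float   # lambda_min(U) / lambda_max(U)
    skew: float

def similar(a, b, pos_tol=1.5, scale_tol=0.2, iso_tol=0.1, skew_tol=0.1):
    """True if two regions agree in all compared parameters (hypothetical tolerances)."""
    return (math.hypot(a.x - b.x, a.y - b.y) <= pos_tol
            and abs(a.scale - b.scale) <= scale_tol * max(a.scale, b.scale)
            and abs(a.isotropy - b.isotropy) <= iso_tol
            and abs(a.skew - b.skew) <= skew_tol)

def remove_duplicates(regions):
    """Group similar regions and keep the one closest to each group's mean position."""
    groups = []
    for region in regions:
        for group in groups:
            if similar(region, group[0]):
                group.append(region)
                break
        else:
            groups.append([region])
    kept = []
    for group in groups:
        mean_x = sum(r.x for r in group) / len(group)
        mean_y = sum(r.y for r in group) / len(group)
        kept.append(min(group, key=lambda r: math.hypot(r.x - mean_x, r.y - mean_y)))
    return kept

regions = [
    Region(10.0, 10.0, 2.0, 0.80, 0.0),
    Region(10.4, 10.0, 2.1, 0.82, 0.0),
    Region(11.0, 10.0, 1.9, 0.78, 0.0),
    Region(50.0, 50.0, 3.0, 0.50, 0.1),
]
kept = remove_duplicates(regions)
print(kept)
```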
Mikolajczyk and Schmid (2004) showed that a substantial fraction of the initial points (40%) do not converge. The algorithm detects this divergence by stopping the iterative algorithm if the inverse of the isotropic measure is larger than a specified threshold: $\lambda_{\max}(U) / \lambda_{\min}(U) > t_{diverge}$. Mikolajczyk and Schmid (2004) use $t_{diverge} = 6$. Of those that did converge, the typical number of required iterations was 10 [2].
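The per-iteration decision (stop, give up, or keep iterating) can be written as a small function of the eigenvalues involved. The divergence threshold of 6 is the value quoted above; the convergence tolerance `eps_c = 0.05` and the function name are assumptions for illustration.

```python
def iteration_status(mu_eigs, u_eigs, eps_c=0.05, t_diverge=6.0):
    """Classify one interest point after an iteration.

    mu_eigs: eigenvalues of the current normalized second-moment matrix.
    u_eigs:  eigenvalues of the accumulated shape adaptation matrix U.
    """
    mu_min, mu_max = min(mu_eigs), max(mu_eigs)
    u_min, u_max = min(u_eigs), max(u_eigs)
    # Inverse isotropy of U too large: the region is degenerating.
    if u_max / u_min > t_diverge:
        return "diverged"
    # Isotropic measure close enough to 1: the region is normalized.
    if 1.0 - mu_min / mu_max < eps_c:
        return "converged"
    return "continue"

print(iteration_status((0.97, 1.0), (0.5, 1.0)))
print(iteration_status((0.5, 1.0), (0.1, 1.0)))
print(iteration_status((0.5, 1.0), (0.5, 1.0)))
```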
Quantitative analysis of affine region detectors takes into account both the accuracy of point locations and the overlap of regions across two images. Mikolajczyk and Schmid (2004) extend the repeatability measure of Schmid et al. (1998) as the ratio of point correspondences to the minimum number of detected points in the two images [11][8].

$$R_{score} = \frac{C(A,B)}{\min(n_A, n_B)}$$
where C(A,B) is the number of corresponding points in images A and B, and $n_A$ and $n_B$ are the numbers of detected points in the respective images. Because each image represents 3D space, it might be the case that one image contains objects that are not in the second image and whose interest points therefore have no chance of corresponding. In order to make the repeatability measure valid, one removes these points and considers only points that lie in both images; $n_A$ and $n_B$ only count those points such that $\mathbf{x}_A = H \cdot \mathbf{x}_B$. For a pair of images related through a homography matrix H, two points, $\mathbf{x}_a$ and $\mathbf{x}_b$, are said to correspond if:

1. The error in pixel location, $\|\mathbf{x}_a - H \cdot \mathbf{x}_b\|$, is below a small threshold.
2. The overlap error of the two affine regions, $\epsilon_S$, is below a specified threshold. For affine regions, this overlap error is the following:

   $$\epsilon_S = 1 - \frac{\mu_a \cap (A^T \mu_b A)}{\mu_a \cup (A^T \mu_b A)}$$

   where A is the local affine approximation of the homography, and $\mu_a$ and $\mu_b$ are the recovered elliptical regions whose points satisfy $\mathbf{x}^T \mu \mathbf{x} = 1$.

This measure takes a ratio of areas: the area of overlap (intersection) and the total area (union). Perfect overlap would have a ratio of one and thus $\epsilon_S = 0$. Different scales affect the region of overlap and thus must be taken into account by normalizing the area of each region of interest. Regions with an overlap error as high as 50% can still be matched with a good descriptor [1].

A second measure, a matching score, more practically assesses the detector's ability to identify matching points between images. Mikolajczyk and Schmid (2005) use a SIFT descriptor to identify matching points. In addition to being the closest points in SIFT-space, two matched points must also have a sufficiently small overlap error (as defined in the repeatability measure). The matching score is the ratio of the number of matched points to the minimum of the total detected points in each image [1]:

$$M_{score} = \frac{M(A,B)}{\min(n_A, n_B)}$$

where M(A,B) is the number of matching points.

Mikolajczyk et al. (2005) have done a thorough analysis of several state-of-the-art affine region detectors: Harris-Affine, Hessian-Affine, MSER [12], IBR and EBR [13], and salient [14] detectors [1]. Mikolajczyk et al. analyzed both structured images and textured images in their evaluation. Linux binaries of the detectors and their test images are freely available at their webpage. For the results of Mikolajczyk et al. (2005), see A comparison of affine region detectors, which gives a more quantitative analysis.
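The two evaluation quantities discussed above, the overlap error between elliptical regions and the repeatability score, can be approximated with a short sketch. The overlap error is estimated here by counting samples on a regular grid, which is an assumed numerical shortcut rather than the evaluation code of the cited papers. Both ellipses are assumed to be already expressed in the same image frame (that is, the second region has already been mapped through the local affine transformation), and the grid bounds must contain both regions.

```python
def inside(m, centre, x, y):
    """True if (x, y) lies within the ellipse (p - c)^T m (p - c) <= 1."""
    dx, dy = x - centre[0], y - centre[1]
    return m[0][0] * dx * dx + 2.0 * m[0][1] * dx * dy + m[1][1] * dy * dy <= 1.0

def overlap_error(m_a, c_a, m_b, c_b, lo=-10.0, hi=10.0, steps=400):
    """Estimate 1 - area(A intersect B) / area(A union B) on a regular grid."""
    intersection = union = 0
    h = (hi - lo) / steps
    for i in range(steps):
        y = lo + (i + 0.5) * h
        for j in range(steps):
            x = lo + (j + 0.5) * h
            in_a = inside(m_a, c_a, x, y)
            in_b = inside(m_b, c_b, x, y)
            if in_a and in_b:
                intersection += 1
            if in_a or in_b:
                union += 1
    return 1.0 - intersection / union

def repeatability(correspondences, n_a, n_b):
    """Ratio of corresponding points to the smaller number of detections."""
    return correspondences / min(n_a, n_b)

unit_circle = [[1.0, 0.0], [0.0, 1.0]]        # radius 1
double_circle = [[0.25, 0.0], [0.0, 0.25]]    # radius 2

err_same = overlap_error(unit_circle, (0.0, 0.0), unit_circle, (0.0, 0.0))
err_nested = overlap_error(unit_circle, (0.0, 0.0), double_circle, (0.0, 0.0))
err_disjoint = overlap_error(unit_circle, (-3.0, 0.0), unit_circle, (3.0, 0.0))
print(err_same, err_nested, err_disjoint, repeatability(30, 50, 40))
```

For the nested circles the exact value is 1 - π/(4π) = 0.75, so the grid estimate gives a quick sanity check on the sampling density.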
[1] - Presentation slides from Mikolajczyk et al. on their 2005 paper.
[2] - Cordelia Schmid's Computer Vision Lab
[3] - Code, test Images, bibliography of Affine Covariant Features maintained by Krystian Mikolajczyk and the Visual Geometry Group from the Robotics group at the University of Oxford.
[4] - Bibliography of feature (and blob) detectors maintained by USC Institute for Robotics and Intelligent Systems
[5] - Digital implementation of Laplacian of Gaussian