Page 1 of7
Chapter 9: Shape & Object Perception
INTRODUCTION: THE THREE-STAGE MODEL
• Chapter 8 describes a primal sketch, a simple, sparse description of local image features. However,
most visual tasks require representations of discrete, meaningful objects.
• Various V1 neurons may respond to different parts of an outline of a table – how are they integrated?
Recognition of a table must also occur regardless of viewpoint, size, location, or illumination – must focus
on the invariant properties of the object itself.
• E.g. Reflectance is intrinsic to the object; illuminance is extrinsic to the environment.
• Visual processing requires at least two more levels of analysis: shape representation, and object
representation for a three-stage model.
• Both of these representations require knowledge: implicit, embodied knowledge (built into the
sensory system, like cones having a “knowledge” of sunlight wavelengths), innate knowledge from
evolution, and acquired knowledge from experience. This gives a statistical representation of the world – what
is the most likely, the most useful expectation based on past trials?
• Helmholtz: Measured conduction speed in a nerve, contributed to colour theory, invented the
ophthalmoscope, and put forward the notion of “unconscious inference” – perceptions are the result of
hypotheses tested against sensory information, built up over time.
• Apperceptive agnosia: Deficient shape representation, inability to name, copy, or match simple shapes.
• Associative agnosia: Can copy and match shapes, but not identify objects from their images –
recognition is possible from touch or sound.
• Structuralism: Perception can be explained as the sum of
basic sensations, which are directly generated from physical
properties of the stimulus. Perception is bottom-up.
• Wilhelm Wundt: Brought empirical methods to psychology.
• Apparent Motion: Zoetrope toy combined multiple separate
images to create a “movie”.
• Illusory Contours: A stimulus is seen even though there are
• Structuralism also cannot explain priming or ambiguous images –
the image stimulus is the same, but different things are seen.
• Figure-Ground Segregation: Ambiguity with
• Gestalt psychologists (Koffka, Köhler, Wertheimer)
believed that perception is more than the sum of its parts.
• They proposed rules (heuristics, principles) of perceptual
organization that demonstrate how the visual system has preferences for grouping parts of images
together on the basis of certain visual properties.
• Gestalt laws do not account for physiological or computational mechanisms.
• Gestalt laws generally reflect the properties of real-world objects.
• Law of Proximity: Nearby elements tend to be grouped together. It reflects an
assumption that objects are made of cohesive, opaque material.
• Law of Similarity: Visually similar elements, in terms of visual texture like size and colour, tend to be
grouped together. It reflects an assumption that objects are made of
relatively few materials.
• Law of Common Fate: Elements that change or move together tend to
be grouped together. It reflects an assumption that objects are made of
cohesive, opaque material, and that when an object moves, all its parts move together.
• Law of Good Continuation: Grouping is biased to favor smoothly
varying contours. It reflects the assumption that natural object shapes
tend to vary smoothly rather than sharply.
• Law of Closure: Invisible object parts are likely to be similar to the
visible ones. Closure favours simple, basic shapes: a perfect square or circle is assumed rather than an amorphous shape.
• Law of Familiarity: Things are more likely to form groups if they are familiar or meaningful.
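The law of proximity above can be illustrated with a minimal sketch. It assumes elements lie along a single dimension and that a fixed gap threshold decides grouping; the function name `group_by_proximity` and the `max_gap` parameter are hypothetical, chosen only for illustration, and are not a model of the underlying physiology.

```python
def group_by_proximity(positions, max_gap):
    """Group 1-D element positions: a new group starts whenever the gap
    to the previous element exceeds max_gap (an assumed threshold)."""
    groups = []
    for x in sorted(positions):
        if groups and x - groups[-1][-1] <= max_gap:
            groups[-1].append(x)   # close enough: join the current group
        else:
            groups.append([x])     # too far away: start a new group
    return groups

dots = [0, 1, 2, 6, 7, 8, 12, 13]
print(group_by_proximity(dots, max_gap=2))
# -> [[0, 1, 2], [6, 7, 8], [12, 13]]
```

The same dots regroup differently if `max_gap` changes, which is one way to see that proximity grouping depends on relative rather than absolute spacing.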
• The figure is more “thing-like” than the ground. It is in front of the ground. The separating contours
belong to the figure. The ground is of unformed material and
extends behind the figure.
• Gestalt figure-ground assignment principles:
o Surroundedness: The surrounding region is the ground
o Size: The smaller region is likely to be the figure
o Symmetry: A symmetrical region is likely to be the figure
o Parallelism: A region with parallel contours is likely to be the figure
Shape Segmentation Processes
• The outputs of spatial frequency-tuned filters can be used to unify image
regions on the basis of visual texture, using a filter–rectify–filter (FRF)
• FRF processing can explain grouping on the basis of proximity and similarity:
o When texture elements are closer together in one region than in another, the spatial frequency
content (and average luminances) of the two regions differ
o Changes in the output of spatial frequency filters can help group based on similarity in size,
shape, or color
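A rough one-dimensional sketch of FRF processing may help. It assumes a two-tap difference kernel as the fine-scale first filter, full-wave rectification (absolute value), and a coarse left-versus-right comparison as the second filter; these particular kernels, the helper names, and the test signal are illustrative assumptions, not the filters used by the visual system. The two regions are given the same mean luminance (0.5), so only their spatial frequency content differs.

```python
def convolve_valid(signal, kernel):
    """Sliding dot product of kernel over signal, without padding."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]

def frf(signal, fine_kernel, coarse_half_width):
    first = convolve_valid(signal, fine_kernel)      # stage 1: fine-scale linear filter
    rectified = [abs(v) for v in first]              # stage 2: rectification
    w = coarse_half_width
    coarse_kernel = [-1.0 / w] * w + [1.0 / w] * w   # stage 3: coarse-scale edge filter
    return convolve_valid(rectified, coarse_kernel)

fine_texture = [0, 1] * 16                        # 32 samples, period 2
coarse_texture = [0, 0, 0, 0, 1, 1, 1, 1] * 4     # 32 samples, period 8
image_row = fine_texture + coarse_texture         # true texture boundary at index 32

half_width = 8
response = frf(image_row, fine_kernel=[-1, 1], coarse_half_width=half_width)
peak = max(range(len(response)), key=lambda i: abs(response[i]))
boundary = peak + half_width   # centre of the coarse filter at its strongest response
print(boundary)
```

The second-stage response is zero deep inside the fine texture and largest in magnitude within a sample of the true boundary, even though a simple luminance average would not separate the two regions.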
Motion & Depth-Based Segmentation
• The law of common fate can be explained by specialized motion and depth processes, in which
cooperative interactions between motion- or depth-sensitive neurons lead to segmentation on the basis
of common motion or stereoscopic disparity
• Segmentation can also be achieved using symbolic computations rather than image-based
• Marr’s original conception of the primal sketch was as a symbolic representation that consisted of a list
of local features or primitives in the image (edges, bars, etc.), each with its properties (e.g., position,
orientation, etc.)
• Collections of primitives can be grouped on the basis of similarities in their symbolic properties, such as
average local intensity, size, density, and orientation
• This grouping makes a new set of symbolic primitives representing the larger spatial structure of the
• Segmentation processes divide images into regions on the basis of shared texture, motion, or depth
• Contours between image regions are also important because they describe the boundaries of objects; for
example, silhouettes often allow identification.
• The visual system must integrate local information about the position and orientation of edges over
relatively long distances in the image, such as in the law of good continuation.
Image-Based Contour Integration
• 1) Collector units are higher-order neurons in visual cortex that respond to extended contours in
images by summing sequences of local edge responses from smooth contours.
• 2) Cooperative interactions use lateral mutual facilitation between neurons responding to local edge
segments, enhancing activity in the presence of long, smooth contours.
• 3) Feedback from extrastriate cortical areas modulates the activity of striate cells in such ways as to
enhance responses to long contours.
• Although all three theories agree that contour integration requires interaction between local responses
to contour segments, there is still no consensus on whether the integration involves feed-forward,
feedback, or lateral information flow.
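As a toy illustration of the second theory (cooperative interactions), the sketch below gives each local edge unit a boost for every immediate neighbour along a path whose orientation is similar to its own. It is a deliberate simplification: it ignores position and collinearity, runs a single round of facilitation, and the `base`, `gain`, and `tolerance` values are arbitrary assumptions rather than measured quantities.

```python
def angle_diff(a, b):
    """Smallest difference between two orientations in degrees (modulo 180)."""
    d = abs(a - b) % 180
    return min(d, 180 - d)

def facilitate(orientations, base=1.0, gain=0.5, tolerance=20):
    """Activity of each edge unit after one round of lateral facilitation
    from its immediate neighbours along the path."""
    activity = []
    for i, theta in enumerate(orientations):
        neighbours = [orientations[j] for j in (i - 1, i + 1)
                      if 0 <= j < len(orientations)]
        support = sum(1 for n in neighbours if angle_diff(theta, n) <= tolerance)
        activity.append(base + gain * support)
    return activity

smooth = [0, 10, 20, 30, 40, 50]     # orientation changes gradually
jagged = [0, 90, 10, 100, 30, 120]   # orientation jumps between neighbours

print(facilitate(smooth))  # -> [1.5, 2.0, 2.0, 2.0, 2.0, 1.5]
print(facilitate(jagged))  # -> [1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```

Units on the smooth contour end up more active than units on the jagged one, which is the qualitative effect the cooperative account predicts for long, smooth contours.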
Symbolic Contour Integration
• Marr proposed a symbolic process of contour integration called curvilinear aggregation that
generates the bounding contours of shapes in two stages.
• First, local primitive features are grouped into a contour segment only if they have matching orientation,
contrast, type, and are close together. These are entered as a “node” in the symbolic description of the
• Then, the contour segment nodes are evaluated for possible matches with other nodes.
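The first stage described above can be sketched as a greedy grouping over symbolic primitives. This is only an illustration under stated assumptions: the `Primitive` fields, the `max_dist` and `max_angle` thresholds, and the greedy order-dependent matching are hypothetical choices, not Marr's actual procedure, and the second stage (matching segment nodes with each other) is not implemented.

```python
from dataclasses import dataclass
from math import hypot

@dataclass(frozen=True)
class Primitive:
    x: float
    y: float
    orientation: float  # degrees, modulo 180
    contrast: int       # +1 or -1 (sign of contrast)
    kind: str           # e.g. "edge" or "bar"

def compatible(a, b, max_dist=1.5, max_angle=15.0):
    """True if two primitives match in type, contrast, and orientation, and are close."""
    d = abs(a.orientation - b.orientation) % 180
    return (a.kind == b.kind
            and a.contrast == b.contrast
            and min(d, 180 - d) <= max_angle
            and hypot(a.x - b.x, a.y - b.y) <= max_dist)

def aggregate(primitives):
    """Append each primitive to the first segment whose last element it matches;
    otherwise start a new segment (one 'node' per segment)."""
    segments = []
    for p in primitives:
        for seg in segments:
            if compatible(seg[-1], p):
                seg.append(p)
                break
        else:
            segments.append([p])
    return segments

primitives = [
    Primitive(0, 0, 0, +1, "edge"),
    Primitive(1, 0, 5, +1, "edge"),
    Primitive(2, 0, 10, +1, "edge"),
    Primitive(3, 0, 10, +1, "bar"),     # different type: not joined
    Primitive(10, 10, 90, +1, "edge"),  # far away and differently oriented
]
segments = aggregate(primitives)
print([len(s) for s in segments])  # -> [3, 1, 1]
```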
• Shape segmentation alone is not sufficient for object representation. Natural scenes are cluttered with
numerous surfaces, many of which occlude each other; as a result, segmented areas and shapes will
often reflect the spatial arrangement of surfaces as well as their own shapes.
• So surface parsing is an essential part of shape representation. Shapes that are
disconnected in the image, but belong to the same surface on an object, are grouped
together by the visual system.
• Depth cues provide information about the depth ordering of different regions, especially stereoscopic
• Parsing can also be based on intersections between contours that form
T-junctions, which are strong occlusion cues.
• If several alternative surface interpretations of an image are available, the
visual system selects the most likely interpretation, the generic
viewpoint (typical perspective used), rather than an accidental
viewpoint (highly unusual perspective on an object)
Crowding
• Parsing may break down when surface features are very close together. Features crowd each other
out, so that individual elements cannot be distinguished or even counted
• This may reflect how finely we can focus our spatial attention, or be a result of inappropriate integration
of shape features in peripheral vision.
• Images of objects reflect both intrinsic factors that define the character of an individual object (shape
and surface properties), and extrinsic factors that change the image of an object (viewpoint, light
source, occluding surfaces).
• View-Independent Theories: The visual system removes extrinsic influences, and represents an
object by its intrinsic factors: a structural description of its component parts (symbolic, list of
descriptors and properties), and the relations between those parts.
Generalized Cones (Marr & Nishihara)
• The basic descriptor for all object parts is a 3-D generalized cone. Generalized cones can form a
variety of shapes, corresponding to the volume created by moving a cross-section of constant shape, but
variable width, along an ax