Chapter 7: Attention & Categorization(1)

Overview: In the four major sections of this chapter, we examine the links between the topics of generalization and discrimination, and attention and conceptual knowledge. The first section starts with the traditional topics of generalization and discrimination. The Hull and Spence Algebraic Summation Theory approach to generalization is examined, and contrasted with Sutherland and Mackintosh's Attentional Approach, based on Krechevsky's notion of Hypothesis Learning. We will see much evidence favoring a claim we have already met in earlier chapters: Attention plays a central role in learning. The second section follows up on this theme by examining attentional processes in humans. Filter and Capacity Models of Attention will be presented, and the section will close with an account of Attention and Automaticity. In the third section, we will look at categorization in humans, starting with a description of how Hypothesis Theory (H Theory) applies to single-feature categories, and moving on to more complex categories. We will examine the evidence for and against the claim that humans have well-defined rather than fuzzy (or probabilistic) categories. Finally, the last section will present experiments on categorization in animals.

I. Algebraic Summation Theory versus Attentional Hypothesis Testing

        A basic finding that we have looked at in previous chapters on classical and operant conditioning concerns generalization, the tendency of an animal to respond to stimuli other than the one it was trained with. What does such a result mean? There are a number of possible explanations for generalization. It may mean, for example, that an animal has trouble distinguishing amongst various stimuli. Perhaps its memory of the original training stimulus is imperfect, so that the animal will treat a number of other, similar stimuli as equivalent. On this account, additional training ought to result in better and better memory, and thus, less and less generalization. Such an approach was taken by Lashley and Wade.

        Another possibility is that generalization is an innate feature of an animal's perceptual apparatus. We saw one such mechanism by which this may occur when we discussed Pavlov's theory of generalization: He believed stimulus centers in the brain were organized topographically on a principle of similarity, so that objects that resemble one another would excite regions in the cortex that were near one another. On this account, spill-over of excitation or activation from one center to neighboring centers would yield a generalization gradient. Like Pavlov, Hull believed that generalization was innate, although being a good behaviorist, he did not discuss internal causes (such as brain topography) underlying this principle.

        A third possibility is that an animal responds to a complex bundle of stimuli, rather than a single object, so that the response becomes a probabilistic function of the number of stimuli present that were part of the original bundle. We saw something like this approach when we discussed Guthrie's theory. If we regard this bundle as a list of features or attributes, then we can claim that the more of this list we find in a given situation, the more probable the response becomes. If we train a pigeon to peck at the letter R for example, we might expect some pecking to the letter P because it shares some of the same attributes (a vertical line; a half-oval closed to the left), but less or no pecking to V (which shares only a line slanting down to the right, although the slanted line for R is a half line rather than a full line).
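
        The feature-list account can be made concrete with a small sketch. The feature labels below are illustrative assumptions rather than an established coding of letter shapes, and the overlap measure (the share of the trained stimulus's features that the test stimulus also has) is only one simple way of expressing the idea that the more of the list we find, the more probable the response becomes. The sketch also ignores the point that the slanted line in R is a half line rather than a full line.

```python
# Hypothetical feature lists; the labels are assumptions made for illustration.
FEATURES = {
    "R": {"vertical line", "half-oval closed to the left", "line slanting down to the right"},
    "P": {"vertical line", "half-oval closed to the left"},
    "V": {"line slanting down to the right", "line slanting down to the left"},
}


def feature_overlap(trained: str, test: str) -> float:
    """Share of the trained stimulus's features that the test stimulus also has."""
    trained_features = FEATURES[trained]
    shared = trained_features & FEATURES[test]
    return len(shared) / len(trained_features)


# A pigeon trained on R: overlap is highest for R itself, lower for P, lowest for V.
for letter in ("R", "P", "V"):
    print(letter, round(feature_overlap("R", letter), 2))
```

        On this toy measure, P shares two of R's three listed features and V only one, which matches the ordering of pecking described above.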

        Finally, perhaps generalization gradients inform us about animals' categories and hypotheses about their world. Human categories tend to exhibit something very like a generalization gradient in that we judge some things to be better examples of the category than others. Perhaps we are teaching animals categories, so that their generalization gradients let us know which objects they regard as good examples of the category (i.e., having high generalization), and which they regard as poor examples (i.e., having low or intermediate generalization).

        These are the issues that will concern us in this chapter. We start with two very different approaches to explaining generalization. One, Algebraic Summation Theory, arises from the work of Hull and Spence, and was mentioned earlier when we discussed effective reaction potential. This approach claims that generalization is innate and need reflect absolutely nothing about an animal's categories or concepts. Rather, it claims that there is a basic principle of similarity that governs habit strength much as there is a basic principle of gravity that governs planetary orbits. Once a habit is formed, this principle states that physically similar stimuli will acquire the ability to trigger the habit in proportion to the degree of their similarity. The other, Attentional Hypothesis Testing, arises originally from the work of Krechevsky and of Lashley. It claims that generalization is a measure of the success of discrimination learning, which in turn is based on a hypothesis-testing approach: The animal selects certain stimulus values to attend, and generalization gradients in part reflect its knowledge of and familiarity with stimulus differences.

        Before going on to these theories, however, let us start with some basics of generalization and discrimination.

A. Generalization

        As we have seen earlier, generalization gradients are obtained when we test animals with stimuli that they were not actually trained to respond to. There are in fact a number of ways of testing for gradients, and there is some argument in the field about which technique is best. The usual technique is to employ a within-subjects design in which we expose the same animal to a variety of different stimuli after training (typically during an extinction session), and track the number of responses to each. If an animal has been trained, for example, on a variable interval schedule, extinction is likely to take long enough that we can obtain a reasonable number of responses to many different stimuli. Alternatively, if we are worried that performance with one stimulus may be affected by performance with another, we may employ, instead, a between-subjects design in which each animal or group of animals is exposed to just one new stimulus. In this design, we look at generalization by comparing performance across groups. In any case, with each of these designs, there may be a bit of discrimination training going on, as the animal may be learning that these new stimuli fail to be associated with the outcome. The mere presence of new stimuli makes this situation potentially different from original acquisition, in which only one stimulus was present.

        A variant approach concerns the nature of the stimuli present. In most of the cases we have discussed so far, we have implicitly assumed novel stimuli that are presumably equally complex to the original stimulus. Thus, in the Guttman and Kalish experiment discussed in Chapter 4, pigeons were trained with one color, and then tested with other colors. But we need not restrict ourselves to such an approach. Often, we teach an animal to respond to a complex stimulus, and then systematically change components of that stimulus (or present just the components) to assess what it was the animal responded to. So, we might present a face as a stimulus, and then change the eyes or the nose or the hairline or the mouth to see whether the animal was responding to the whole complex, or to just one of its elements. You saw an example of this earlier in Chapter 6, when we discussed the experiment by Welker and McAuley: Rats extinguished more rapidly when their experimental environment or mode of transportation changed than when the only difference involved removing the food they got for bar pressing.

        Some rather interesting questions may be asked using this technique, as we can see from a further example by Povinelli and Eddy. They trained chimpanzees to obtain a reward by holding a hand out in front of an experimenter. They subsequently presented the animal with two experimenters, only one of whom was looking at the chimp. The question they asked was whether chimpanzees would realize that it made sense just to 'beg' in front of the person who could see them. (That is, do chimpanzees understand that other entities can see what they are doing only if the eyes are oriented towards them?) They found that chimps begged randomly, and appeared not to know that the person looking away could not see their hand gesture. In subsequent generalization tests, chimps chose at random between someone covering his or her head with a bucket and someone holding a bucket at the side; someone with a blindfold over the eyes and someone with a blindfold over the mouth, etc. In fact, the only condition in which the chimps were able to correctly discriminate without further training involved a person facing the chimp versus one facing away. If chimpanzees are aware that we, like they, pick up information through our eyes, they failed to use that information in choosing a likely person to beg a reward from.

        As the last two examples illustrate, generalization tests may be used to answer a variety of questions concerning what an animal pays attention to, and what its beliefs may be. That is a very far cry indeed from the earlier tests we looked at that used generalization to assess stimulus similarity. Nevertheless, many experiments still explore the issue of stimulus similarity and its likely influence on responding. For those experiments, several issues become particularly important. One is how to measure stimulus similarity, and a second is how to measure response similarity.

        Concerning stimulus similarity, do we look at physical similarity or psychological similarity? Things that may be physically similar will not necessarily have the same level of psychological or perceptual similarity. Knowing which of these to track is certainly important, if we wish to come up with strong models of generalization. In the realm of music, for example, two notes one octave apart strike people as more similar than notes less than an octave apart, even though physically they represent a bigger difference in frequency.

        And finally, concerning response similarity, a host of issues arise, the most compelling of which involves whether to track absolute or relative generalization gradients. In an absolute gradient, we directly measure some physical quality of the response such as the number of responses per unit time, the strength of a response, its latency, etc. In a relative gradient, on the other hand, we measure relative proportions. Thus, if we train a pigeon to peck at, say, a red light, we could compare the numbers of pecks per ten-minute session with red and orange lights if we want to use absolute gradients, or we can ask what percentage of the pecks are given to red (and what to orange) if we wish to look at relative gradients. Which we use turns out to be important, as different results are sometimes obtained.
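
        The difference between the two kinds of gradient amounts to a simple calculation, sketched below. The peck counts are made-up numbers chosen only for illustration, and the function name is hypothetical; the relative gradient is taken here to be each stimulus's percentage of all responses, as in the red and orange example above.

```python
def relative_gradient(absolute_counts: dict[str, int]) -> dict[str, float]:
    """Convert response counts per stimulus into percentages of all responses."""
    total = sum(absolute_counts.values())
    if total == 0:
        raise ValueError("no responses recorded")
    return {stimulus: 100 * count / total for stimulus, count in absolute_counts.items()}


# Hypothetical pecks per ten-minute session (illustrative numbers, not data).
pecks = {"red": 120, "orange": 40}

print(pecks)                     # absolute gradient: raw counts
print(relative_gradient(pecks))  # relative gradient: red 75.0, orange 25.0
```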

        Below, we will generally restrict ourselves to findings concerning absolute gradients that, for the sake of convenience, are measured on the basis of physical similarity of the underlying stimulus dimension or dimensions. As you read about generalization findings elsewhere, keep in mind that these distinctions are important, and be aware of the choices the researchers made in reporting their findings.

 Features Of A Gradient

        By way of a quick review (see Chapter 4, pp. 99-100), recall that there may be both gradients of excitation and gradients of inhibition. A gradient of excitation occurs when a response is reinforced in the presence of a stimulus, and a gradient of inhibition occurs when a response fails to get an expected reinforcement (or is followed by a punisher) in the presence of some stimulus. In a gradient of excitation, we track the extent to which other stimuli excite the response, and in a gradient of inhibition, we track the extent to which other stimuli inhibit the response. In both cases, of course, generalization occurs when a novel stimulus exhibits excitation or inhibition different from its normal baseline or operant level (the level of responding with that stimulus in the absence of the type of training discussed above).

        In a gradient of excitation, three features are characteristic: (1) the greatest responding (the peak) normally occurs with the trained stimulus; (2) the gradient typically exhibits a relatively smooth drop-off of responding as novel stimuli become more different (so that the gradient is shaped a bit like a tent or a bell-shaped curve); and (3) the gradient is symmetrical (the left half of the curve looks like a mirror image of the right half). These features may be seen in the generalization gradients Guttman and Kalish found (see Chapter 4, Figure 1). Gradients of inhibition display similar characteristics except that instead of a peak, of course, we look for a valley: the point at which the least responding occurs. A gradient of inhibition would normally look like a gradient of excitation turned upside down.
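
        As a rough illustration, the three features can be written as checks on a list of response counts ordered along the stimulus dimension. This is only a sketch under stated assumptions: 'smooth drop-off' is approximated as responding that never increases as we move away from the trained stimulus, symmetry is judged by comparing stimuli equally distant from the trained one, and the function name, the tolerance parameter, and the example numbers are all hypothetical.

```python
def describe_excitatory_gradient(
    responses: list[float], trained_index: int, tolerance: float = 0.0
) -> dict[str, bool]:
    """Check the three characteristic features of a gradient of excitation."""
    peak_index = responses.index(max(responses))

    left = responses[: trained_index + 1]  # up to and including the trained stimulus
    right = responses[trained_index:]      # the trained stimulus and beyond
    rises_to_trained = all(a <= b for a, b in zip(left, left[1:]))
    falls_from_trained = all(a >= b for a, b in zip(right, right[1:]))

    # Compare stimuli that lie equally far from the trained stimulus on each side.
    span = min(trained_index, len(responses) - 1 - trained_index)
    symmetrical = all(
        abs(responses[trained_index - k] - responses[trained_index + k]) <= tolerance
        for k in range(1, span + 1)
    )

    return {
        "peak_at_trained_stimulus": peak_index == trained_index,
        "smooth_drop_off": rises_to_trained and falls_from_trained,
        "symmetrical": symmetrical,
    }


# Hypothetical response counts; the trained stimulus is the middle one (index 2).
typical = [10, 40, 90, 40, 10]
shifted = [5, 20, 60, 95, 30]  # peak displaced away from the trained stimulus

print(describe_excitatory_gradient(typical, trained_index=2))
print(describe_excitatory_gradient(shifted, trained_index=2))
```

        The first list passes all three checks; the second fails them because its highest value no longer falls on the trained stimulus.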

        In gradients of excitation, the width and the height of the curve tell us something about the extent of generalization. A bell-shaped curve that extends over more stimuli (i.e., that is wider) indicates greater generalization: The animal is obviously responding to more stimuli. At the same time, a steep curve indicates greater discernment: With steep curves, animals are responding strongly to some stimuli, but weakly to others (as opposed to responding moderately strongly to all stimuli). All things being equal, less generalization is typically associated with steeper, narrower curves. Figure 1 illustrates this by showing a wide, shallow curve and a narrow, steep curve. In each case, we have assumed a pigeon that is emitting 250 responses. But as you can see, most of these responses in the case of the narrow curve are being given to just a few stimuli. For purposes of comparison, Figure 2 graphs exactly the same findings using a relative rather than an absolute generalization gradient. Relative gradients tend to soften somewhat the differences found in absolute gradients. Thus, the low generalization curve in Figure 2 no longer looks quite as steep or narrow as it did in Figure 1.
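
        The contrast between a wide, shallow curve and a narrow, steep one can be sketched numerically. The two lists below are invented for illustration and are not the values plotted in Figure 1; they are chosen only so that each totals 250 responses. The summary measures (the peak's share of all responses, and the number of stimuli receiving at least 10% of them) and the function name are assumptions introduced here, not standard indices.

```python
def summarize(gradient: list[int]) -> dict[str, float]:
    """Summarize how concentrated responding is across the test stimuli."""
    total = sum(gradient)
    return {
        "total_responses": total,
        # Fraction of all responses given to the most-responded-to stimulus.
        "peak_share": max(gradient) / total,
        # Number of stimuli that each receive at least 10% of all responses.
        "stimuli_with_10_percent_or_more": sum(1 for r in gradient if r / total >= 0.10),
    }


# Hypothetical response counts across nine test stimuli (trained stimulus in the middle).
wide = [18, 24, 30, 34, 38, 34, 30, 24, 18]   # high generalization
narrow = [0, 5, 20, 60, 80, 60, 20, 5, 0]     # low generalization

print(summarize(wide))
print(summarize(narrow))
```

        With these invented numbers, the narrow curve gives nearly a third of its 250 responses to the peak stimulus and spreads substantial responding over only three stimuli, whereas the wide curve spreads it over five.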

        Although we have discussed several features that are normally found with generalization gradients, you ought to be aware that there are occasionally clear exceptions to these characteristics. In particular, as you will see below, discrimination training in which the animal is reinforced in the presence of one stimulus but not another will sometimes change a curve's symmetry, and result in a peak at a different spot (a phenomenon termed peak shift). In normal generalization tests in which novel stimuli are presented centered around the original reinforced stimulus, the curve will tend to be 'squashed up' on the side containing the S-, and the new peak will tend to occur on the other side. More about this below, as it has been the subject of much theorizing.

 Innate Versus Learned Gradients

        As you will see later, there are many, many studies that support the claim that experience of various sorts will steepen a gradient. But what kind of gradient occurs in the absence of experience? That question has been the subject of some controversy. On the one hand, people like Hull and Pavlov have claimed that tent-shaped or bell-shaped gradients are innate. On the other hand, Lashley and Wade have pointed out that a steepening gradient reflects a transition to perfect discrimination in which one and only one stimulus is responded to. The steeper and narrower the gradient, the more perfect the animal's response is. Their argument thus asserts that tent-shaped gradients arise as a result of learning to tell things apart. At the start of training, before the animal has learned to discriminate among stimuli, it ought to treat all stimuli along the relevant dimension as equivalent. So, rather than exhibit a tent-shaped gradient, animals at this point ought to have a flat-line gradient in which the same proportion of responses is given to all similar stimuli.

        Jenkins and Harrison conducted a famous study relevant to this issue. One group of pigeons (the discrimination learning group) was explicitly trained to discriminate between two sounds: They were reinforced for pecking in the presence of a 1000 Hz tone, but not a 950 Hz tone. A second group simply had approach or appetitive learning, in which only the 1000 Hz tone was presented. In this latter group, the pigeons were placed in a Skinner box, but their pecking was prevented at certain periods of time because the lights would be off. When the lights came on, so would a 1000 Hz tone, and in these periods, pecking at a key resulted in reinforcement.

        Both groups, of course, were subsequently exposed to a variety of sounds ranging from 300 Hz to 3500 Hz. What happened was that the discrimination learning group (the one exposed to two sounds) showed the tent-shaped gradient, but the approach learning group did not: They pecked equally regardless of the sound in the background. Jenkins and Harrison's results thus fit in with Lashley and Wade's learning theory, rather than Hull's innateness claim.

        Unfortunately, if you compare Jenkins and Harrison's experiment with Guttman and Kalish's study, you will notice a contradiction. Guttman and Kalish's pigeons also constituted an approach learning group, since they were reinforced in the presence of a given color, and had no discrimination training in which another color failed to yield reinforcement. Nevertheless, Guttman and Kalish's pigeons showed the classic tent-shaped gradient, instead of the flat-line gradient of Jenkins and Harrison's approach learning group. So why the contradiction?

        A number of theorists investigated one possible explanation for this contradiction. On this explanation, pigeons and other animals have had plenty of experience learning to tell colors apart (particularly insofar as we know that belongingness effects in the taste aversions paradigm seem to involve visual appearance for birds) long before they get to an experiment such as that run by Guttman and Kalish. So, perhaps we ought to see what happens when we prevent animals from having experience with colors.

        And that is what Riley and Leuin did: They raised birds in monochromatic environments. In one experiment, their subjects were in an environment where filters allowed only light of a certain wavelength (589 nanometers) to get through. Ten days after hatching, the birds were trained to peck at a disk lit up at this wavelength. Then a week later, the birds were given a generalization test involving a disk lit up at 569 nanometers, and another disk at 550 nanometers. Quail, ducks, goslings, pigeons, etc., all pecked more to the 569 nanometer disk than to the 550 nanometer disk, consistent with a tent-shaped gradient. Thus, in this and many similar experiments (some involving fitting birds with filtering contact lenses!), non-flat gradients to color occurred even when there had been no prior exposure to color differences. Such a finding fits in with the claim that graded (i.e., tapering or tent-shaped) generalization is innate.

        But that leaves us with the problem of accounting for the flat-line gradients Jenkins and Harrison found in their study. Kerr, Ostapoff, and Rubel repeated Riley and Leuin's method, except that they raised birds in single-sound environments before testing for generalization to different sounds. And they verified the Jenkins and Harrison results: These animals displayed flat gradients in the absence of learning to discriminate.

        And to complicate matters further, flat gradients may be found with color under certain circumstances. Peterson also raised ducklings in monochromatic environments (589 nanometers), and compared their responding on a generalization test to ducklings raised in a normal, color-filled world. In this study, unlike the Riley and Leuin study, the generalization gradients for the monochromatic group were relatively flat, especially when compared to very impressive and steep gradients for the control group. However, Peterson used a much wider range of colors on the generalization test, and a careful examination of these data suggests that there may have been some reduced responding to the colors that Riley and Leuin tested, consistent with their claim.

        Thus, graded gradients may have multiple causes. They may sometimes arise from innate factors, and may at other times reflect experience. But in any case, whatever their initial causes, there is no doubt that perceptual experience and further training may influence the shape of a gradient. We turn next to a brief discussion of the effects of factors such as these.

 Factors That Influence Gradients

        TRAINING. Let us start with experience, as provided in the form of amount of reinforced training. Generally speaking, the more pairings there have been of a stimulus and a reinforcer, the steeper the generalization gradient becomes. Hearst and Koresko, for example, trained four groups of pigeons to peck at a disk for reinforcement using a VI-1 minute schedule. The disk had a line on it going up and down. Their design was as follows:

            Group             Amount of Training

            1                     2 50-minute sessions
            2                     4 50-minute sessions
            3                     7 50-minute sessions
            4                     14 50-minute sessions

        Following training, each group was put through the same generalization test in which they were exposed to 9 disks that differed in orientation of the line. The gradient for Group 1 was quite broad, and showed only slight curvature around the original upright line. But, as training increased, the gradient became considerably steeper. In Group 4, for example, there was more than a three-fold increase in responding from a horizontal to a vertical line, but in Group 1, the increase was only about 50%.

        If we define training in terms of number of reinforced responses, then the same relationship holds when we compare different partial reinforcement schedules. As the interval or ratio decreases, the generalization gradient tends to steepen. But schedules with lower intervals or ratios, of course, yield more reinforced trials, all things being equal. We saw in Chapter 6 that one set of theories of partial reinforcement effects claimed they were due to generalization: Schedules with high ratios or intervals more nearly resemble extinction than schedules with low ratios or intervals. Here we find some further evidence supporting such a claim.

        DELAY OF TESTING. Another factor that will influence the shape of the generalization gradient concerns the delay between training and the generalization test. Thomas and his colleagues have conducted a number of experiments on this relationship, and they generally find that as this interval increases, generalization increases, as well (i.e., the gradient broadens rather than narrows). Although the following result does not always occur (as may be seen by comparing Thomas & Lopez's results against those of Thomas et al.), there is some evidence that a broader gradient occurs because the animal increases its responding to other novel stimuli (rather than decreases its responding to the original stimulus with which it was trained). Such a result would be consistent with forgetting some of the features of the training stimulus that distinguish it from surrounding stimuli. For purposes of comparison, consider how your ability to tell Guthrie and Hull apart might differ right before versus two weeks after a test on their theories. With the passage of time, loss of information results in a memory becoming less distinct, and so, more easily confused with something else that has some overlap of features.

        (Similar results occur with training involving an aversive outcome and gradients of inhibition. We concentrate here on reinforcement-based training.)

        DIFFERENTIATION. Is differential reinforcement (in which one stimulus is preferentially marked due to an outcome) necessary for generalization to decrease? The answer appears to be "no." In fact, several studies suggest that simple exposure to stimuli prior to discrimination training will steepen gradients. This is in part an outcome of work done by E. Gibson involving perceptual learning. Similar in some respects to observational learning, perceptual learning involves the notion that organisms learn from their perceptual experiences. Reinforcement is not needed in such an approach; the learning occurs because of its adaptive value: An animal that learns about the perceptual constancies in its environment is better able to survive. Thus, the theory assumes attention to perceptual features in the absence of specific rewards.

        One aspect of this theory relevant to our current concerns involves stimulus differentiation. An animal that is more familiar with certain stimuli ought to be able to discriminate them more easily (assuming non-trivial stimuli). Because of perceptual learning, the animal ought to pick up more and more features that characterize stimuli, and make them different. As a common-sense example, think of the many new faces you saw in this class on its first day. After meeting someone for the first time, people sometimes have difficulty recognizing that person on a subsequent occasion; there is a moderate level of familiarity about the face, but not enough to trigger a confident "hello." But as you repeatedly encounter these same individuals, their faces become both more familiar and more distinct, and so, easier to recognize in other contexts.

        The Peterson experiment mentioned above may be taken as one example consistent with the claims of stimulus differentiation. Another comes from Gibson, Walk, and Tighe. They set up a design with rats as follows:

            Group                 Phase 1                            Phase 2

            experimental          triangles & circles in cage        discrimination
            control               no triangles & circles             discrimination

In this experiment, the experimental group was exposed to the triangles and circles for 90 days before being put through discrimination training in which it had to respond to one of these, but not the other. Gibson et al. found faster learning of the discrimination in this group, presumably because it had already learned something about the differences between these symbols (i.e., reduced generalization).

        Although there is much evidence for stimulus differentiation, we ought to note two caveats. First, differentiation tends to be more successful if non-differential reinforcement is present. That is, hanging these symbols above where rats eat or drink is a good idea, presumably because the food (associated equally with both symbols) helps call attention to the surrounding context. It acts as a motivator for the animal to note the features of its environment. And second, stimulus differentiation in the design presented above seems to require that the context in Phase 2 differ from the context in Phase 1. If you think back to the chapters on classical conditioning, you will see one reason why this may be so. Earlier, we found that pre-exposing a stimulus led to latent inhibition. Thus, while pre-exposure can result in the animal learning about the perceptual features of a given stimulus, it may also result in the animal learning that this stimulus ought to be irrelevant for important biological consequences, since it has predicted no such consequences so far, in this particular environment. But a move to a different environment allows us to assess the extent to which differentiation has occurred when the stimulus's predictive value is not negated.

        DISCRIMINATION TRAINING. We have already noted above that discrimination training (involving differential reinforcement whereby some stimulus or stimuli fail to be paired with an outcome) will change the shape of a gradient. The gradient may become non-symmetric, and may exhibit peak shift. It will also steepen. A good example of this is the Jenkins and Harrison experiment mentioned earlier. They actually ran several different discrimination groups. In one, pigeons had to discriminate between a 1000 Hz and a 950 Hz tone. In another, pigeons had to discriminate between a (1000 Hz) tone present and a tone absent condition. A much narrower gradient was found in the condition involving two tones.

        The Jenkins and Harrison finding regarding type of discrimination training fits in nicely with some of the discussion of stimulus differentiation. When the animal discriminates between presence and absence of a tone, that becomes the most salient perceptual feature. Hence, other tones are similar to the reinforced tone in representing a distinctive departure from the non-reinforcing silence. But when the animal has to distinguish among tone frequencies, then a different level of perceptual feature comes into play. Such findings exhibit aspects of relational learning (see below): learning in which the animal compares stimuli with one another to find a distinguishing feature or characteristic that will enable telling them apart. Which stimuli are presented along with a training stimulus may help determine which features are first found or attended.

        As peak shift is an important test case for Hull's and Spence's theory, however, we defer further discussion of discrimination training effects to the section on algebraic summation theory.

        DEPRIVATION. Another finding that we discussed earlier ought to be recalled. Animals that have been deprived of an outcome tend to exhibit steeper generalization gradients. Deprivation is a good motivator for paying attention to the various stimuli, and trying to 'get things right.'

        STIMULUS FACTORS. Finally, as the Guttman and Kalish experiment makes clear (see Figure 1 in Chapter 4), training with different stimuli can result in different gradients. How steep or narrow a gradient will be will depend in part on an animal's perceptual systems. We are not equally sensitive to all physical differences. Thus, the effects of stimulus factors are in part bound up with issues of psychological or psychophysical similarity, and with features of a given stimulus that may make it inherently more salient than other stimuli.

B. Discrimination

        In discrimination training, we have so far essentially presented just one method whereby an animal is reinforced in the presence of one stimulus, but not in the presence of another. There are, in fact, a host of methods that may be employed. We start with a brief catalog of these.

Methods For Training A Discrimination

        An initial distinction that needs to be made concerns how many stimuli are present on a given trial. If there are two or more stimuli present, only one of which needs to be responded to, then we have what is termed a simultaneous discrimination. For example, in the earlier study on sensitivity to contingency in pigeons conducted by Killeen (Chapter 4, p. 115), pigeons were presented with a left button and a right button, and had to choose which one of these two to peck for their reward. Similarly, the Povinelli and Eddy study represents a simultaneous discrimination, in that only one of the two people facing the chimp could see it, and thus reward its begging gesture. The task in a simultaneous discrimination is to determine which stimulus to respond to amongst the various stimuli present.

        Normally, the animal will initially respond to both stimuli, and thus discover that one of these consistently fails to yield the desired outcome. Over the course of training, then, we will see responding first rise to both stimuli, and then drop off to the S-. This is normal discrimination training. In a variant of this procedure termed errorless discrimination training, however, we typically start with the S- being so reduced in intensity compared to the S+ that it is virtually non-salient. Over the course of a large number of trials, we gradually increase the intensity of the S- until it equals that of the S+. If this is done slowly enough, then we may discover that our animal has never actually made a false response to the S- (hence the name errorless).

        An alternative approach is to present just one stimulus at a time. In this case, we have a successive discrimination. Successive discrimination techniques open up several different possibilities, however. If the successive discrimination is like the simultaneous discrimination in that only one stimulus is associated with a reinforcer, then we have what is termed a go-no go situation. In this case, you go (respond) when the correct stimulus (the S+) is present, but do not go (withhold a response) when any other stimulus is present. In successive discriminations, however, we also open up the possibility of requiring several different responses, depending on which stimulus is present. A triangle, for example, could be used to signal pressing a left-hand button, whereas a circle could be used to signal pressing a right-hand button. In the simplest version of this situation, a reinforcer can be obtained on each trial (assuming you are using a continuous reinforcement schedule) so long as the animal knows which response to make to which stimulus. This type of set-up is referred to as a choice situation. Unlike the simultaneous condition or the go-no go situation, each stimulus in the choice situation could, in principle, be associated with the same degree of excitation or inhibition. Or put another way, in a choice situation, the stimulus acts as an occasion setter indicating which response is appropriate.

        There is a certain looseness about the word "stimulus" in the above presentations. Successive discrimination is also possible when multiple stimuli are present, so long as those multiple stimuli together act as a stimulus complex that signals a single response rather than separate responses or a choice between them. Thus, whether more than one stimulus is present is not really the distinguishing characteristic of successive versus simultaneous discriminations. Rather, successive discriminations involve selecting an appropriate response to make to the given stimulus ensemble, whereas simultaneous discriminations involve choosing which of several stimulus ensembles to respond to.

        To illustrate the point about a stimulus ensemble, let us consider yet another type of discrimination, conditional discrimination, involving successive discrimination. In conditional discrimination, at least two stimuli (or stimulus dimensions) are typically present. The reaction to one will depend on the presence of the other. For example, in an experiment by Nissen, chimps were trained to discriminate between a large square and a small square. Only one of these yielded a reinforcer, but which one that was depended on a second stimulus characteristic. Because this was a successive discrimination procedure, each trial involved just one of the four complex stimuli below:

            Stimulus                             Outcome

            large black square            reinforcement
            small black square            no reinforcement
            large white square            no reinforcement
            small white square            reinforcement

        In this example, color was critical in signaling whether the large square or the small square called for the go response: If the squares were black, then large was the S+ and small the S-; but if the squares were white, this assignment reversed. So, color in this case was the occasion setter informing the animal about the meaning of size.

        In Nissen's experiment, color and size were combined in the same stimulus as different dimensions of that stimulus. One could argue that these actually represent four different stimuli with four different associative links (though see below for evidence that animals do learn about individual dimensions such as size or color). However, conditional discrimination does not depend on such a set-up. More commonly in conditional learning, separate stimuli are presented. Thus, we could have conducted the Nissen experiment in the following way:

            Stimulus 1             Stimulus 2                     Outcome

            1000 Hz tone         large square                 reinforcement
            1000 Hz tone         small square                no reinforcement
            900 Hz tone           large square                no reinforcement
            900 Hz tone           small square                reinforcement

        In this design, unlike Nissen's, we now have an animal make different responses in the presence of the same physical object. That is, what the animal ought to do when faced with the second stimulus (a large or a small square) will be conditional on what the first stimulus was (the 1000 Hz or 900 Hz tone). Rats, pigeons, and chimps can learn these types of conditional discriminations.

        A final way of looking at the ease and success of discrimination training involves transfer tasks. In a transfer task, we ask whether or how prior discrimination training on one problem affects later discrimination training with another problem. Sometimes the problems are very similar. Indeed, a favorite type of problem involves a reversal shift, in which the animal learns to do the exact opposite of what it did earlier. So, if it was trained to respond to a square but not a triangle, in a reversal shift, it would have to respond to the triangle, but not the square. Sometimes the problems are very different. Does training on the dimension of color in one situation help a color problem involving very different stimuli, and very different responses, for example? And sometimes the training involves combining problems. In a technique called acquired distinctiveness of cues, for example, we try to speed up the process of discrimination by compounding the two stimuli we want the animal to discriminate with very different stimuli that we know from earlier training are easily distinguished.

        Transfer training per se isn't really a method of discrimination training, but it has sometimes been used as such. Thus, we might try to teach the animal a pattern of responding (a bit similar to Hulse's studies in Chapter 6 that looked at sensitivity to patterns in partial reinforcement schedules) by alternating a series of problems in a certain way. We may, for example, try to teach the animal to systematically switch to the other stimulus after every successful response, a discrimination procedure that requires multiple problems and a transfer set-up. Such work on learning sets (also called learning-to-learn) has been done by Harlow, and we will examine it more closely below.

        In any case, let's look at some of the factors that influence how easily discriminations are learned.

Factors That Influence Discrimination Learning

        STIMULUS SIMILARITY. The first factor will be a familiar one, particularly from our discussion of discrimination effects in extinction. The more similar two stimuli are, the longer it takes to learn a discrimination between them. As similarity increases, it becomes more difficult to perceive what makes two things different. And in the extreme, if we ask our animals to make a discrimination that is beyond their ability, we may obtain something called experimental neurosis. Pavlov discovered this when he attempted to get dogs to discriminate between two very similar ovals. They became extraordinarily agitated and snappish. In fact, they subsequently failed to make an easy discrimination they had earlier proved capable of, telling a circle from a relatively elliptical oval. If you are interested, there is an article by Gantt on this topic that examines the results of a number of studies.

        From the point of view of Gibson's Perceptual Learning Theory, the more different two stimuli are, the easier it should be to find a perceptual dimension or feature that helps separate them. As a sample experiment, we might consider the results on training relative numerosity. In this paradigm, subjects typically choose between two displays that contain a number of items. Their discrimination may involve deciding which display has the smaller number of items, for example. Typically, as the number of items in the displays becomes more similar, the problem increases in difficulty, as evidenced by training time, time to choose, and number of mistakes made in choosing. Kraemer, for example, finds that pigeons have more trouble with displays having similar numbers of items, and in the human literature, a well-known finding is that people generally take longer to decide which of two single-digit numbers (the numbers 1 through 9) is larger (see, e.g., Parkman) when the numbers are close together in value (5 and 6 will be a more difficult pair than 4 and 7, for example).

        Closely related to stimulus similarity is the notion of feature or cue salience. In a discrimination involving complex stimuli in which some feature or cue has to be attended to in order to discriminate them, discrimination will occur more rapidly if the cue is salient or dominant. For example, if we try to teach humans to discriminate two types of artificial flowers, a problem involving colors or leaf shapes will be learned sooner than a problem involving the angles of branches going off of the main stem (Trabasso). The dimensions of color and leaf shape are dominant, salient dimensions for us in categorizing flowers (perhaps because these are important dimensions in discriminating real flowers). As such, they are the features we will first be drawn to; they will likely serve as our first hypotheses regarding how these artificial flowers are to be distinguished.

        EXPERIENCE. Prior experience will also have a profound effect on learning, as may be evident from the example above regarding artificial flowers. You will see in a later section below that animals and humans tend to first pay attention to a dimension that has worked in the past. So, discrimination learning can be profoundly influenced by past experience, if the current problem seems at all similar to one encountered earlier. In this case, we say that there is transfer of training. If the transfer speeds the learning of the new problem, it is referred to as positive transfer (or facilitation). If it slows the learning of the new problem, we call it negative transfer (or interference).

        The theoretical issue we will face here is whether such transfer can be accounted for solely through a principle of generalization.

        In a sense, the rest of this major section will be devoted to the effects of experience. There are a number of findings here. For example, consistent with observational learning and perceptual learning, animals that observe other animals making a discrimination will learn that discrimination more rapidly (the Kohn and Dennis study mentioned in Chapter 5). Also, stimulus differentiation will speed up discrimination learning, as we saw in the Gibson, Walk, and Tighe study. However, for now, we will mention an additional finding: the easy-to-hard effect. Basically, this finding states that positive transfer can occur when a later discrimination involves a harder problem on the same dimension. An experiment by Marsh can serve to illustrate this. Here is the design:

            Group         Problem 1                                     Problem 2

            1                 easy brightness discrimination     hard color discrimination
            2                 easy color discrimination              hard color discrimination
            control                                                               hard color discrimination

        By comparing Groups 1 and 2 with our control group, we can assess the effects of prior discrimination learning. We will find reasonably good positive transfer in Group 2, but not Group 1.

        DIFFERENTIAL ATTENTION TO S+. In a typical discrimination problem, we can ask whether the animal focuses more on the stimulus that works (the S+), or on the stimulus that doesn't (the S-). Is it more important to avoid the frustration of not getting a reward than it is to actually get the reward? Transfer studies can help us decide this issue, as well. Thus, in one experiment by Hearst, two groups of pigeons were taught an initial discrimination. For Group 1, the two stimuli were an empty circle and a circle with a small vertical line in it. The circle with the small line served as the S+ in this group. But for Group 2, the stimuli were an empty circle and a circle with a long vertical line in it, and this group had the empty circle as the S+.

        Both groups were subsequently given a second discrimination problem in which the two stimuli were the circles with the lines in them. In this second problem, the small-line circle was the S+ and the large-line circle was the S-. What you should note about this procedure is that each of our two groups had been exposed to one of these stimuli before. Group 1 had learned about the small-line circle since that stimulus was also its S+, and Group 2 had learned about the large-line circle (which was Group 2's S-). In theory, then, each group could have had the same experience with the consequences of a given stimulus, which might have been expected to make the discrimination learning easy in Problem 2 because of positive transfer. The design for this study is given below (and I have used uppercase words to show you that each group in Problem 2 was being exposed to one stimulus-response association identical to what it had received in Problem 1):

            Problem        Group                S+                        S-

                1                1                 SMALL line         empty circle
                                  2                 empty circle          LARGE line

                2                1                 SMALL line         large line
                                  2                 small line              LARGE line

        Hearst found that Group 1 learned the second problem more rapidly. There is a strong suggestion here that the animals learned more about the differentiated features of the S+ in initial acquisition.

        We can point to several studies with humans that yield the same conclusion. Suppose we give people a discrimination in which they receive pairs of stimuli such as:

            [Figure: a large black X on the left, and a small orange T on the right]

There are actually quite a few features that may be relevant to forming the proper discrimination in this case. For example, will the correct reinforced answer be large letters? If so, our subject ought to choose the symbol on the left. Will it be Xs? Again, the choice then is to pick the object on the left. Will it be black? That also corresponds to choosing the left-hand object. And finally, perhaps our subject has the very simple idea that only things on the left result in reinforcement. Thus, as you can see from this example, four different features or hypotheses would lead our subject to choose the left-hand symbol. By the same token, four other features (small, right, orange, T) will lead to a decision to respond to the object on the right (if you are viewing this in black and white, the orange color appears as a very light grey). The task of the subject is thus to determine which of these eight possible features is the one that constitutes the S+.

        In this type of experiment, Levine and others have found that how fast the person solves the problem depends on the relative proportion of positive versus negative feedback. Positive feedback (being told "yes" when you choose the correct object) will result in faster solutions than negative feedback (being told "no" when you pick the wrong object). Thus, in a sense, positive feedback is more informative; we are more successful in picking up information about the positive object.

        That may strike you as a no-brainer. But if you go back to the example above and think about it, you will quickly realize that it doesn't matter whether we tell you "yes" or "no!" Choosing either object and getting feedback about it will be equally informative from a logical point of view. So, if you choose the X above and are told "yes," you know (1) that it's either X or black or large or left, and (2) that it can't be T or orange or small or right. But if you choose T above, instead, and are told "no," then you know exactly the same thing! In other words, right and wrong choices ought to be equally helpful in solving the problem. That they're not argues for a bias towards learning about the S+.
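
        The logic of this argument can be sketched in a few lines of Python. This is only an illustration under stated assumptions: exactly one of the eight features is the S+, and the names left_object, right_object, and remaining_hypotheses are made up for the example rather than taken from any actual experimental procedure.

```python
# Features of the two objects in the example pair (taken from the text).
left_object = {"X", "black", "large", "left"}
right_object = {"T", "orange", "small", "right"}
all_hypotheses = left_object | right_object  # the eight candidate features


def remaining_hypotheses(chosen, other, feedback):
    """Return the hypotheses still viable after one choice and its feedback.

    Assumption: exactly one feature is the S+. A "yes" means that feature
    belongs to the chosen object; a "no" means it belongs to the other one.
    """
    consistent = chosen if feedback == "yes" else other
    return all_hypotheses & consistent


# Case 1: choose the X (left object) and hear "yes".
after_yes = remaining_hypotheses(left_object, right_object, "yes")

# Case 2: choose the T (right object) and hear "no".
after_no = remaining_hypotheses(right_object, left_object, "no")

print(sorted(after_yes))
print("Same information either way:", after_yes == after_no)
```

        Either trial leaves the same four hypotheses standing, which is why the observed advantage for "yes" trials cannot be a matter of logic alone.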

        In another example, Craik and Tulving had people see a word and answer "yes" or "no" to a simple question about it (Is the word typed in blue? Does it rhyme with "weight"? Is it a type of fish? etc.). Later, people were given a surprise memory test in which they were asked to recognize the words they had seen in the earlier part of the experiment. Consistent with our bias towards positive instances, Craik and Tulving found that the positive words (the ones people responded "yes" to) were better remembered than the negative words (the ones they responded "no" to).

        REINFORCEMENT DIFFERENTIATION. Finally (to reiterate a point brought up in Chapter 5), in successive choice discriminations, learning will be enhanced to the extent that the reinforcers associated with the different choices also differ. That is, if we want our animals to make different responses when seeing a circle and a triangle, then we would be wise to provide them with different reinforcers such as food for the response to the circle, and drink for the response to the triangle. Peterson's group has done a lot of work on this issue (see the related Peterson experiment on acquired stimulus equivalence discussed towards the end of Chapter 5), but Trapold gets the credit for bringing everyone's attention to this effect. You may wish to consult the discussion on p. 154 (including the findings of Carlson and Wielkiewicz) in which an explanation of this finding is given in terms of expectancies in memory that are less likely to be confused with one another.

C. The Spence-Hull Associational Approach

        Now that we've briefly reviewed some of the findings and procedures relevant to generalization and discrimination, let us turn to a consideration of some of the major theories attempting to handle these findings. We will look specifically at an associational theory, algebraic summation theory, based on Hull's work, and modified by Spence, and we will later compare this to a representational-level cognitive theory.

Basic Assumptions

        The algebraic summation theory is essentially based on the notion of effective reaction potential: It states that both generalization and discrimination will depend on the presence of excitation and inhibition associated with any given stimulus. Essentially, we can claim four basic postulates governing this theory. First, each additional reinforcement increases the strength of an association between a present stimulus and the response the animal made just prior to receiving the reinforcer. This, of course, is a postulate of excitation. Second, each additional non-reinforced trial increases the inhibition between a stimulus and the response the animal made that failed to obtain reinforcement. This, of course, is a postulate of inhibition. Third, inhibition and excitation generalize to other stimuli in proportion to their similarity to the excited or inhibited stimulus. That is the postulate of generalization. And finally, the extent to which a response to a given stimulus is triggered or suppressed will depend on the sum of the excitation and the inhibition connected with or generalizing to that stimulus. This fourth postulate is the postulate of effective reaction potential (or algebraic summation).

        There are several additional points to note about the theory. It claims that both inhibition and excitation grow continuously, so that each additional trial ought to have an effect on habit strength. The theory is thus sometimes referred to as Continuity Theory, to contrast it with some of the other theories that claim sudden or insight-based learning. And it also has nothing to say about the attentional capacities or mechanisms of an organism and how these might relate to learning. It is thus a non-attentional theory. It claims that learning, whether excitatory or inhibitory, involves an association with a specific, physical stimulus. It is thus (no surprise here!) a behaviorist and an associational theory. And it relies on physical similarity of stimuli to explain generalization gradients. By virtue of being a behaviorist, associational theory, it has to make such a claim. This is a result of the positivist, peripheralist approach to learning: Only observable features or events may be included in the theory's description of what goes on.

        As you already know, Hull had claimed that generalization was innate, a claim that failed to hold up in a lot of the later work inspired by Lashley and Wade's claim. But whether generalization is learned or innate ought not to be a major sticking point for assessing algebraic summation theory. Even if generalization gradients change over the course of learning, so long as we can still figure out how much excitation and how much inhibition there is, the theory claims that we ought to be able to calculate the strength or probability of a response to a given stimulus.

        With this as background, let's look at some of its successful predictions.


        We'll start with an example. Suppose that we train one group of animals to respond to a stimulus display consisting of 5 circles. We can then perform generalization tests looking at their responses to displays of 2 circles, 7 circles, 11 circles, etc. Let us set up a generalization gradient that represents an idealized pattern of results in which we've trained the animal to an excitatory level of 50 responses above normal baseline (say, 50 more licks of water per 10-minute period than our rat would normally take). Our potential gradient might then yield the following results:

        Display:         1         2         3         4         5         6         7         8         9         10
        Licks:            0         5         15       40       50       40       15       5          0         0

        Let us now take a second group and inhibit licking to the 6-circle display by associating licking in the presence of that display with something unpleasant. Again, we will train these animals sufficiently to inhibit licking by 50 responses below baseline (i.e., 50 less licks than our animal would normally take). We will represent these decreases in responding by a negative sign. In this case, an idealized generalization gradient (a gradient of inhibition, of course) might involve the following values:

        Display:         1         2         3         4         5         6         7         8         9         10
        Licks:            0         0         -5       -15      -40      -50      -40     -15      -5          0

        Finally, let us take a third group and run them through discrimination training in which one stimulus (the 5-circle stimulus) is associated with reinforcement, and another stimulus (the 6-circle stimulus) is associated with the unpleasant outcome (which in this case could even be lack of an expected reinforcement). In other words, in Group 3 we combine the training procedures of Groups 1 and 2 above. According to the postulate of algebraic summation, we need simply combine the generalized inhibition and excitation to see what will happen. So, adding the respective values above ought to yield the following predicted generalization gradient for our discrimination group:

                                                                           S+      S-
        Display:         1         2         3         4         5         6         7         8         9         10
        Licks:            0         5         10        25      10      -10      -25      -10      -5          0
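
        The addition behind this table can be checked with a short Python sketch. The numbers are the idealized values from the tables above; the variable names are our own, and the code is meant only as an illustration of the summation postulate, not as a model of real data.

```python
# Idealized values copied from the tables above (licks relative to baseline).
displays = list(range(1, 11))
excitation = [0, 5, 15, 40, 50, 40, 15, 5, 0, 0]          # Group 1: S+ = 5 circles
inhibition = [0, 0, -5, -15, -40, -50, -40, -15, -5, 0]   # Group 2: S- = 6 circles

# Postulate of algebraic summation: net tendency = excitation + inhibition.
net = [e + i for e, i in zip(excitation, inhibition)]

# The display with the greatest net excitation is the predicted peak.
peak_display = displays[net.index(max(net))]

for d, n in zip(displays, net):
    print(f"{d:>2} circles: {n:>4}")
print("Predicted peak:", peak_display, "circles")
```

        Note that the largest net value falls on a display other than the trained S+, a point taken up in the discussion that follows.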

        What do the generalization gradients look like for Groups 1 and 3? Looking just at excitatory responding (i.e., the gradients of excitation tracking responding above the baseline level), we should find something like the gradients presented in Figure 3. In this figure, the gradient that is presented in a solid color corresponds to the condition in which discrimination was taught (i.e., in which there was both excitation and inhibition), whereas the other gradient corresponds to what would happen with just excitation present. The discriminative gradient exhibits three important features that are predicted by algebraic summation theory. The first is a phenomenon we mentioned earlier called peak shift: The discriminative gradient's peak is no longer at the original S+. Thus, the peak is now at the 4-circle display in the solid gradient, even though we reinforced our animals for responding to the 5-circle display. The reason, of course, is that there has been much generalization of inhibition from the 6-circle display to the 5-circle display, canceling out a lot of its excitation.

        Second, the solid gradient is no longer symmetrical. It is squashed up on the side where the S- is, due again to inhibition generalizing to the stimuli on that side.

        And third, the peak has shifted to the side opposite from the S-.

        There is actually a fourth prediction associated with algebraic summation, although it is not graphed in Figure 3: Generally speaking, the closer together the S+ and the S- are, the greater ought to be the peak shift.
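
        A rough sketch of this fourth prediction, reusing the idealized gradient shape from the 5-circle example, is given below. The lookup table GRADIENT and the function names are illustrative assumptions of ours, as is the simplification that excitation and inhibition have gradients of exactly the same shape.

```python
# Idealized gradient shape from this section's example: strength as a function
# of distance (in number of circles) from the trained stimulus.
# Distances greater than 3 are treated as 0.
GRADIENT = {0: 50, 1: 40, 2: 15, 3: 5}


def net_gradient(s_plus, s_minus, displays=range(1, 11)):
    """Net tendency per display: excitation around S+ minus inhibition around S-."""
    return {
        d: GRADIENT.get(abs(d - s_plus), 0) - GRADIENT.get(abs(d - s_minus), 0)
        for d in displays
    }


def peak_displays(s_plus, s_minus):
    """All displays sharing the highest net value (ties are possible)."""
    net = net_gradient(s_plus, s_minus)
    top = max(net.values())
    return [d for d, value in net.items() if value == top]


# Move the S- progressively farther from an S+ of 5 circles.
for s_minus in (6, 7, 8):
    print(f"S+ = 5, S- = {s_minus}: peak at {peak_displays(5, s_minus)}")
```

        With these made-up numbers, the peak sits a full step away from the S+ when the S- is adjacent to it, is shared between the S+ and its neighbor when the S- is two steps away, and returns to the S+ itself when the S- is three steps away.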

        So, do these results occur as predicted by the theory? Figure 4 presents part of a famous experiment by Hanson on this question. Hanson taught pigeons to peck at a certain wavelength (550 nanometers) for a reinforcement. In the absence of discrimination training (i.e., when the 550 nanometer key was the only key present during training), the generalization gradient looked like the dark blue-line gradient in Figure 4. But when a second group of pigeons was given discrimination training in which an S- of 590 nanometers was also present, then Hanson obtained the type of generalization seen in the solid gradient. The peak for that gradient has clearly shifted to the side opposite from 590 nanometers (the S-), and is now somewhere around 540 nanometers. Hanson also tested several other groups with different S- wavelengths closer to the S+; they all displayed peak shift. And although the asymmetry is a bit hard to see in this example, it is obviously present in some of the other gradients. Thus, the initial results don't appear to be far off what algebraic summation would predict (although there is a discrepant result here which we'll talk about later: Can you spot it?).

        We've mentioned earlier that particularly strong tests of theories occur when people can get them to make unusual predictions; in that case, researchers place more emphasis on the results, and gain greater confidence in the theory. In fact, such a prediction is available from algebraic summation theory. It is this: A procedure that results in successful discrimination learning in the absence of any responding to the S- should result in no peak shift! The reason, from the point of view of Hull's and Spence's approach, is quite simple. In order for inhibition to occur, the animal must make a non-reinforced response (recall, for example, the relevance of the Seward and Levy experiment on latent extinction here). So, no response, no inhibition. And if there is no inhibition, then there can be no generalization of inhibition that causes the peak to change, or the gradient to become asymmetric.

        But you already know such a procedure from our discussion of discrimination techniques earlier in this chapter: errorless discrimination training. To remind you, this technique involves starting with a non-salient S- that will not attract a response, and gradually increasing its salience until it finally matches the S+. Terrace has experimented with this technique, and reports that it is possible to train a discrimination in the absence of any responding to the S-. More to the point, Terrace reports failing to obtain a peak shift. The theory accordingly seems to correctly predict that peak shift depends on inhibition.

        What if we can train a discrimination through the normal procedure (unsuccessful responding to S-) but somehow diminish the inhibition? The theory would also predict lessened or no peak shift in this instance. A clever study that seems to do this was conducted by Lyons, Klipec, and Steinsultz. Their study doesn't quite fit within the Hull-Spence framework, but it is certainly suggestive. They concentrated on the emotional aspects of inhibition, arguing that animals tend to avoid things that are emotionally frustrating or unpleasant, and connecting peak shift to this emotional component. So, to dampen the negative emotionality associated with making a wrong response, they had animals undergo discrimination training while under the influence of a tranquilizer (chlorpromazine). Although the animals made responses to the S- in the course of learning the discrimination, they did not display a peak shift.

        The phenomenon of peak shift is also of interest because it is theoretically capable of providing an explanation for a study on relational learning by Köhler conducted on both birds and chimps. In this study, Köhler started by teaching a brightness discrimination. Animals had to learn to pick the brighter light. Then, for the generalization test, he presented the animals with a choice between the light they were trained on, and yet a brighter light. To illustrate this schematically, let us use levels of grey as our stimuli. A corresponding experimental design would then be something like the following:

            Training:     S-               S+                         Generalization Test

                              dark grey     medium grey               medium grey     light grey

        In this design, the animal might do several different things on the generalization test. Standard behaviorist theory would seem to predict that it ought to respond most to the same stimulus it was trained with, as that is where the greatest habit strength is to be found. Recall that associations form to individual stimuli in the standard approach. But Köhler argued instead that the animal was basing its response on comparing the two stimuli to determine their relationship to one another. During training, the comparison results in the animal responding to the lighter stimulus. Thus, during the generalization test, the animal should also compare the generalization stimuli, and respond to the lighter. If it does so, however, it will pick the novel stimulus (the light grey) over the formerly reinforced stimulus (the medium grey). And that is what Köhler's subjects did.

        This type of finding is sometimes referred to as a transposition. For Köhler, who was a Gestalt psychologist, the pattern of stimuli was what was important, not their individual identities. Transpositions preserve the same abstract pattern. So, when a melody is played in two different keys, for example, we still recognize it as the same melody. But, the finding seems to present a problem for associational theorists who claim learning involves forging associations to single stimuli, and who do not wish to posit a mechanism by which animals may process abstract information different from the actual present physical features.

        Köhler's view is an alternative to the algebraic summation account of what learning involves. But for the moment, do note that peak shift can explain his findings without the need to posit abstract information, or processing some sort of relationship between stimuli that animals can discover. Thus, according to algebraic summation theory, the dark grey stimulus (the S-) will cause the peak to move away from the S+ to the light grey stimulus. We can verify this by playing with the same numbers we used for the earlier demonstration of peak shift:

                Stimulus:                 dark grey (S-)     medium grey (S+)     light grey

                Excitation Gradient:            40                  50                40
                Inhibition Gradient:           -50                 -40               -15

                Algebraic Summation:           -10                  10                25

And as you can see from this example, there will be more net excitation to the light grey than to any other stimulus, so it should receive most of the responding.
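The arithmetic can be checked with a short Python sketch. The gradient values are the ones used in the table; the stimulus labels and the function name are illustrative assumptions rather than part of the theory's notation. The code adds the two gradients and reports where net excitation is greatest.

```python
# Excitation and inhibition values copied from the worked example in the text.
# The labels and names below are illustrative only.
stimuli = ["dark grey (S-)", "medium grey (S+)", "light grey"]
excitation = [40, 50, 40]
inhibition = [-50, -40, -15]


def algebraic_summation(excitation, inhibition):
    """Return net excitation at each stimulus (excitation plus inhibition)."""
    return [e + i for e, i in zip(excitation, inhibition)]


net = algebraic_summation(excitation, inhibition)

# The predicted peak of responding is the stimulus with the most net excitation.
peak = stimuli[net.index(max(net))]

for name, value in zip(stimuli, net):
    print(f"{name}: {value}")
print("Predicted peak of responding:", peak)
```

The peak falls on the novel light grey rather than on the trained S+, which is the transposition result obtained without any appeal to a relation between stimuli.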

        Peak shift phenomena aren't the only successful predictions of the theory, but they are the most dramatic. There are other predictions as well. We will close out this section with two more. Both concern reversal shifts, transfer paradigms in which the go/no-go decisions are reversed in a second discrimination problem. The first is quite straightforward, and plays off of the finding that learning curves typically start off with a relatively flat section in which nothing systematic appears to be happening. This section is termed the presolution period, because there is as yet no evidence in the animal's behavior that it has changed its responding. However, Hull believed that there was a critical value of net excitation that was required to trigger a response. Until effective reactive potential reached this threshold, no systematic responding would be seen (accounting for the presolution period). That did not mean that no association was forming; the habit strength was certainly changing as the animal made random responses, some of which were correct, and others incorrect. On this account, reversing the stimuli in the presolution period ought to slow learning, since the habit strengths in the presolution period would be the opposite of what was required after the switch.

        Let's make this example a bit more concrete by providing an experimental design:

            Group                Presolution Period                 Further Training On

        experimental         S+ = triangle                               S+ = circle
                                       S- = circle                                    S- = triangle

        control                   S+ = circle                                   S+ = circle
                                       S- = triangle                                S- = triangle

        For the experimental group, the triangle in the presolution period ought to gain some excitatory habit strength, and the circle ought to gain some inhibitory habit strength. But these are the exact opposite associations from what will be required later, since we want our animals at the end to respond to the circle, and not the triangle. Consistent with Hull's theory, a number of people (including Sutherland and Mackintosh, who are definitely not Hullians!) find that presolution reversals cause negative transfer: The control group shows faster acquisition. Consistent with our earlier discussion of the learning-performance distinction, the lack of systematic performance in the presolution period cannot be taken as evidence that no learning is going on.

        The second study is a bit more interesting. Suppose we train our animals on two discrimination problems. For our problems, we choose stimuli that differ on two dimensions (color and character), and we assign the outcomes as follows:

                    Problem 1                     Problem 2

        S+         ##### (dark)                   ##### (light)
        S-         +++++ (light)                  +++++ (dark)

        In this case, we have four different associations, according to a stimulus-response model of the sort discussed by Hull and Spence. Schematically, we can list these habits as:

                            Stimulus ##### (dark)   ----> Approach Response
                            Stimulus ##### (light)  ----> Approach Response
                            Stimulus +++++ (light)  ----> Avoidance Response
                            Stimulus +++++ (dark)   ----> Avoidance Response

        Given that our animals have learned this first set of problems, let us now transfer them to a new set that involves the same stimuli, but different mappings with the responses. There are in fact two different types of mappings we might try to put our animals through. In a reversal shift, we simply switch the responses, so that each S- becomes the new S+, and vice versa. But in a non-reversal shift, we keep the responses for one of the problems identical while switching the responses for the other. The two new problems can be diagramed this way:
             Type of Shift                 Problem 1             Problem 2

            Reversal             S+         +++++ (light)         +++++ (dark)
                                 S-         ##### (dark)          ##### (light)

            Non-Reversal         S+         ##### (dark)          +++++ (dark)
                                 S-         +++++ (light)         ##### (light)

        The associations that have to be present to solve this second set of discrimination problems are as follows:

            Shift                       Association

            Reversal:                   Stimulus ##### (dark)   ----> Avoidance Response
                                        Stimulus ##### (light)  ----> Avoidance Response
                                        Stimulus +++++ (light)  ----> Approach Response
                                        Stimulus +++++ (dark)   ----> Approach Response

            Non-Reversal:               Stimulus ##### (dark)   ----> Approach Response
                                        Stimulus +++++ (dark)   ----> Approach Response
                                        Stimulus ##### (light)  ----> Avoidance Response
                                        Stimulus +++++ (light)  ----> Avoidance Response

        The question that numerous investigators asked was, which type of shift would be easier to learn? If you compare these associations, you will see that there appears to be an obvious answer: All of the associations are different from the original associations in reversal shifts, but only two associations are different in non-reversal shifts. Thus, according to models like that of Hull and Spence, non-reversal shifts ought to be easier. Kendler and Kendler did a number of experiments using this type of set-up with quite a few species of animals (including relatively young humans), and found precisely that result: Non-reversal shifts resulted in positive transfer, compared to reversal shifts.
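The counting argument can be made explicit with a small Python sketch. It assumes the four stimuli are the two characters crossed with two shades; the labels "dark" and "light" are an assumption chosen to agree with the later remark that, after a non-reversal shift, the dark things give reinforcement. The sketch simply counts how many stimulus-response associations differ from the original ones under each type of shift.

```python
APPROACH, AVOID = "approach", "avoid"

# Each stimulus is (character, shade). The shade labels are an assumption.
# Problem 1 pairs dark # with light +; Problem 2 pairs light # with dark +.
original = {
    ("#", "dark"): APPROACH,
    ("#", "light"): APPROACH,
    ("+", "light"): AVOID,
    ("+", "dark"): AVOID,
}

# Reversal shift: every S+ becomes an S-, and vice versa.
reversal = {
    stimulus: (AVOID if response == APPROACH else APPROACH)
    for stimulus, response in original.items()
}

# Non-reversal shift: Problem 1 is unchanged, Problem 2 is switched.
non_reversal = dict(original)
non_reversal[("#", "light")] = AVOID
non_reversal[("+", "dark")] = APPROACH


def changed_associations(old, new):
    """Count stimuli whose required response differs between two mappings."""
    return sum(1 for stimulus in old if old[stimulus] != new[stimulus])


reversal_changes = changed_associations(original, reversal)
non_reversal_changes = changed_associations(original, non_reversal)

print("Associations to relearn after a reversal shift:", reversal_changes)
print("Associations to relearn after a non-reversal shift:", non_reversal_changes)
```

On a pure stimulus-response account, fewer changed associations means less to unlearn, which is why the non-reversal shift is predicted to be easier.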

        You might be tempted to ask why this finding is important. It actually attempts to contrast a theory such as algebraic summation with a more cognitive, relational theory like Köhler's. Thus, Köhler would claim that the animal in initial learning (the very first two problems) is learning to notice shapes of things: It discovers that pound signs (#) are correct, and that plus signs (+) are not. A reversal shift still involves the same abstract difference, so a Köhlerian approach might be tempted to claim that reversal shifts ought to be easier than giving the animal a new abstract relationship (color: the dark things give reinforcement) that has to be learned. But of course, the Kendlers find that doesn't happen. Thus, we seem to have some competitive hypothesis testing that disconfirms a relational approach, and further supports a behaviorist claim that Köhler's results with relational learning must really have been due to algebraic summation causing peak shift.

        At least, in animals and very young humans. Because, as a matter of fact, the Kendlers found that reversal shifts were solved more rapidly by older kids and adults. So, they obtained evidence of some major differences in the forms of learning humans and animals undergo. Is human and animal learning really all that different? We turn next to some problems with algebraic summation.


        We earlier presented results having to do with peak shift, presolution reversals, relational learning, and reversal shifts. We will organize the discussion of problems around these areas, as well.

        Let us start with the phenomenon of peak shift. There are actually a number of embarrassments for the algebraic summation theory here. Some of these surround the finding of peak shift during normal discrimination training, and others concern what is happening during so-called errorless discrimination training.

        Terrace and others had claimed that errorless discrimination training resulted in canceling peak shift due to lack of inhibition (negative emotion in particular, in Terrace's case). Both of these claims have been challenged by others' findings. Several experiments, for example, have claimed a peak shift after errorless training, contrary to Terrace. Also, the question of inhibition and the question of what constitutes "errorless" responding have been raised. Even if pigeons do not ever actually peck the S- key in errorless discrimination, they do often orient towards it, bob towards it, and in other ways, make movements with respect to it that indicate something like a proto-response. Can we safely exclude such acts when we claim that animals make no "errors" in this paradigm? More to the point, Rilling finds that animals will learn a response whose only reinforcer is removal of the S- in an errorless discrimination task. The message of this finding is that there must be some inhibitory or negative emotional quality associated with the S-, since, otherwise, its removal ought not to act as a negative reinforcer.

        What about normal discrimination training? The number of reinforced and non-reinforced trials in discrimination training ought to be the major determiner of excitatory and inhibitory associative strength, according to the Hull-Spence approach. However, it turns out that whether we get peak shift also depends on the order in which the S+ and S- trials are given. If the S- trials are blocked together rather than mixed up with the S+ trials, there is little or no evidence of peak shift, even though both conditions give the animal the same exposure to S+ and S-.

        An even more interesting and damaging finding is that peak shift does not always move to the side opposite from the S-. If we present more novel stimuli on the S- side during the generalization test than on the S+ side, then the peak shifts to the S- side. Thomas's group has conducted a fair number of these experiments, in accord with an adaptation-level theory of generalization that Thomas has proposed (though be cautioned that many of these studies used human subjects who, as we have already seen in the work of the Kendlers, may not operate quite in the same way as other species). According to this theory, the subject is acquiring information during both acquisition and generalization about the average stimulus presented in an experiment (excluding stimuli deliberately associated with non-reinforcement). Responding will consequently tend to adapt to that average or central tendency (keep this in mind: It will become of particular relevance when we discuss human categories below!). One prediction of this theory is that normal peak shift occurs because taking out the S- means there are more stimuli on the S+ side (so the average slightly favors that side).

        Unlike the algebraic summation theory, Thomas's theory claims that peak shift ought to be found even when there is no discrimination training! As a test of this idea, consider a study by Thomas and Jones. Their human subjects were all given reinforced training with a 525 nanometer stimulus. For the generalization test, people got different ranges of test stimuli, only one of which had 525 nanometers in the middle. In this experiment, the peak shifted with the range. If the test stimuli were generally lower than 525, then the peak shifted to a lower value. And if they were higher, then it shifted to a higher value.
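A rough Python sketch shows the direction of the predicted shift. The 525 nanometer training value comes from the study; the three test series are hypothetical, since the actual ranges are not given here, and weighting every experienced stimulus equally is a simplifying assumption rather than a claim about Thomas's model. The sketch only computes the average stimulus experienced under each test range.

```python
TRAINING_NM = 525  # reinforced training wavelength, as described in the text

# Hypothetical test series (in nanometers); only the middle one is centered on 525.
test_series = {
    "lower range": [505, 510, 515, 520, 525],
    "centered range": [515, 520, 525, 530, 535],
    "higher range": [525, 530, 535, 540, 545],
}


def adaptation_level(training, tests):
    """Average of all stimuli experienced: the training value plus each test value."""
    experienced = [training] + list(tests)
    return sum(experienced) / len(experienced)


levels = {
    name: adaptation_level(TRAINING_NM, tests)
    for name, tests in test_series.items()
}

for name, level in levels.items():
    if level < TRAINING_NM:
        direction = "below"
    elif level > TRAINING_NM:
        direction = "above"
    else:
        direction = "at"
    print(f"{name}: average {level:.1f} nm, {direction} the training value")
```

The average moves with the test range even though no S- was ever presented, which is the pattern the theory expects the peak of responding to follow.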

        We earlier saw that presolution reversals resulted in negative transfer, in accord with algebraic summation theory. A similar line of reasoning predicts that reversal shifts following massive training ought also to yield negative transfer. In fact, the more training there is with S+ and S-, the more interference there ought to be due to the incredibly strong habits that have been conditioned. However, people such as Reid and Mackintosh have shown that massively increased training paradoxically can have the reverse result: It can make learning a reversal shift easier! This finding, inconsistent with algebraic summation theory, is called the overlearning reversal effect.

        The central tendency findings of Thomas and his group suggest that we may be doing relational processing after all. Maybe Köhler was right in claiming relational learning based on comparing a variety of stimuli. In fact, subsequent studies of relational learning have attempted to show such an effect when peak shift could not account for the results. Two experiments that are relevant here involve studies by Gonzalez, Gentry, and Bitterman, and by Lawrence and DeRivera.

        Gonzalez et al. did a clever experiment in which they presented pigeons with three lit keys at once. Only the key with the middle color (550 nanometers), however, was the one that provided reinforcement. Thus, in this study, there were two S-'s (540 and 560 nanometers), one on either side of the S+. As you might expect, this resulted in a very narrow gradient when just colors centered around the S+ were presented for a generalization test. However, these experimenters subsequently gave their animals a generalization test involving different groups of three colors. The pigeons had apparently learned the relationship "central color" since their peak was generally the middle color, regardless of which three colors were used. Such a finding is not consistent with algebraic summation, since groups of colors above the 560 (or below the 540) nanometer key ought to show generalization primarily from the nearer negative stimulus. The algebraic summation theory in this case falsely predicts that the peak ought to be with the color furthest away from the S-, and not the middle color.

        Lawrence and DeRivera, in contrast, used a different approach. In their experiment, rats had to take a left turn or a right turn depending on a complex stimulus consisting of two shades of grey. The shades on any trial were drawn from a set of seven different shades. In Figure 5, the top row presents stimuli similar to the ones the rats were trained with. Basically, a middle shade of grey (Shade 4) was always on the bottom. If the animal saw a lighter shade on top (as in the three stimuli on the left), then it was supposed to turn to the right. But if it saw a darker shade on top (as in the three stimuli on the right), then it had to turn to the left for its reward.

        One could argue that because the middle shade is present for both left and right, it will be irrelevant to the animal. Alternatively, following relational learning, one could argue that the animal is learning a relation between the two shades (darker on top; lighter on top), and is responding on the basis of that relation. To test these ideas out, Lawrence and DeRivera gave their rats test trials involving stimuli such as the two cards on the bottom. For the card on the left, both shades have been equally associated with turning to the right (since they're light shades); for the card on the right, of course, both shades have been equally associated with turning to the left (since they're dark shades). But what Lawrence and DeRivera found was that rats presented with the left card turned left, but those presented with the other card turned right. These results are consistent with rats making a "darker than" or "lighter than" relational comparison, and responding on the basis of that decision. Unlike the earlier Köhler study, there was no S- here that could have accounted for these results in terms of inhibition, or of negative emotionality.

        Finally, what about the Kendlers' finding that adult humans and non-human animals displayed different types of learning preferences for reversal and non-reversal shifts? Mackintosh and Little examined a slightly different paradigm involving what are called intradimensional (IDS, for short) and extradimensional (EDS) shifts. The procedure for these is actually quite similar to what the Kendlers did, except that we won't use the same identical stimuli in the transfer part of the experiment. Thus, we start out by training animals to make a discrimination on multidimensional stimuli. If you refer to Figure 6, note that Group 1 has two problems to solve: discriminating between a blue rectangle and a yellow circle, and discriminating between a blue circle and a yellow rectangle. If you look at which stimulus in each of these problems is labeled S+, you will see that the animal is being rewarded for learning a discrimination involving color: Blue is correct, and yellow is not. On the other hand, in Group 2, the correct discrimination is shape: These animals are being rewarded for choosing the rectangle over the circle, regardless of what color each is.

        So, we now take both groups and move them to the new set of problems presented in the bottom half of Figure 6. These involve very different shapes than earlier, and very different colors (precluding generalization as an account of the results). But if you look at the S+ assignments, the dimension or relation of shape is still relevant: The animal needs to respond to the plus signs and avoid the triangles. For the animals placed on this set of problems, whether something is red or green is irrelevant. For Group 2, the transfer problem is an intradimensional shift: The dimension stays the same, though new stimuli are presented. But for Group 1, the transfer problem is an extradimensional shift: They need to move from responding to the dimension of color to responding to the dimension of shape. Reversal shifts are a type of IDS, of course, and non-reversal shifts may be regarded as a type of EDS. In this situation where stimulus generalization or overlap of stimulus associations will not predict performance, both animals and humans generally find IDS easier to learn. This result is consistent with relational learning rather than algebraic summation theory if we add that whatever relation has proven important in the past will be first tried out on a current problem, if at all possible. Thus, IDS problems support a claim of learning about abstract dimensions such as color or shape, rather than physical values such as yellow or triangular.

        I earlier asked you to find a discrepancy with algebraic summation theory's claims in Hanson's peak shift data. Did you find it? In Figure 4, the new, shifted peak is much higher than the original, unlike what we would expect to happen (see Figure 3, where the new peak is much smaller than the old peak). Such a finding is called a behavioral contrast. It is a common feature of peak shift accompanying discrimination training. We may be able to get algebraic summation theory to handle such a finding by talking about incentive motivation (K), but clearly more is going on than was expressed in the four simple postulates of algebraic summation theory. That doesn't necessarily represent a strong disconfirmation of the theory. As a strategy, many theorists first start out with an over-simplified theory, and push it to see how far it will go in explaining phenomena. As it starts breaking down, they add complexity to it. But the preference remains for simple theories.

D. Sutherland & Mackintosh's Attentional Approach

        An alternative account of discrimination learning was suggested by one of Tolman's students, Krechevsky, and around the same time by Karl Lashley. According to both theorists, learning wasn't at all about habit strength gradually changing. Rather, animals were forming hypotheses about how stimuli differed. In a hypothesis-testing approach (see also the next major section below on categorization in humans), a feature or characteristic or relationship is attended to (for example, the hypothesis that red things give reinforcement), and evaluated. The process of evaluation involves collecting evidence about whether the hypothesis works. Thus, the animal responds in accordance with the hypothesis, and when sufficient evidence has accumulated to disconfirm the hypothesis, another hypothesis is sought, and attempted. The picture is thus of an organism learning in much the same way that a scientist does. Does a hypothesis provide sufficient contingency (to tie this in to more modern approaches we've looked at in past chapters) between the response to a given stimulus and the outcome? (To give you a feel for the time period, Krechevsky, Lashley, and Köhler were all working with these ideas in the '20s and '30s; the notion of an active organism rather than a passive receptor of stimulus-response associations was thus appealing to a number of people at this time.)

        There are, of course, a number of major differences between this approach and that of Hull and Spence. We have already looked at some of these differences, particularly insofar as they involve Köhler's ideas on relational learning. But two additional ideas ought to be stressed at this point. One is that learning on this account will be intimately related to what the animal pays attention to. As in humans, animals are assumed to have a limited capacity for noticing or attending to things in their environments (indeed, presumably a smaller capacity than we do). Thus, only what the animal pays attention to may be coded as a hypothesis. As a simplifying assumption, most theorists start with a model in which animals attend to only a single dimension or feature. And second, learning, as in Guthrie's theory, should have an all-or-none quality (and recall that Voeks showed that all-or-none learning did occasionally occur!). Until the animal has selected the proper hypothesis, no contingent responding will be obtained. Once it has selected the proper hypothesis, however, its learning is essentially complete (although various factors may still influence performance). For this reason, this approach is sometimes labeled non-continuity theory, to contrast it with the gradualist, incremental continuity theories of Hull and Spence.

Basic Assumptions

        A more modern version of the attentional approach was formulated by Sutherland and Mackintosh in the early '70s. Their account was consistent with a good deal of work coming out of the human attentional literature at that time. They basically posited two stages of learning. One had to do with attending the proper dimension, and the other had to do with learning what response to make to the various values on this dimension.

        On their account, there are specialized attentional mechanisms called analyzers that process information about different stimulus dimensions. When a stimulus is presented, it presumably activates various analyzers that detect the stimulus's shape, orientation, size, color, etc. However, these analyzers don't all activate at the same levels. The salience of a given dimension will in part determine how strongly the analyzer kicks in. Thus, we have an account for how salience can influence discrimination learning in procedures such as errorless discrimination training. A very low level of salience will hardly activate an analyzer, so that its activity cannot dominate the animal's attention. But in addition to salience, experience will also affect the activity level of an analyzer. When an analyzer is consistently associated with reinforcement (or successful responding in general), its activity is boosted. Thus, what an animal pays attention to will depend on the relative activity of its analyzers.

        The activity of an analyzer, of course, can be taken as mapping into the hypothesis an animal evaluates. When that hypothesis proves fruitful, that analyzer becomes even more dominant in its activity. But when the hypothesis fails to work, the analyzer's activity level is decreased, and another analyzer must then capture the animal's attention.

        As you can see from this description, there is a very simple prediction to be made here: Whatever has worked in the past is what the animal will tend to try out in the present. Because something such as shape, for example, worked in the past, the shape analyzer is likely to be stronger than other analyzers. We already saw one verified prediction that fits this claim: the results with IDS and EDS (Mackintosh and Little). IDS involves the same analyzers (although different values of the dimension the analyzer is responding to), so that the animal being put through IDS training ought to have an advantage. It is already looking for specific hypotheses concerning shape.
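A toy Python sketch can illustrate why the first stage favors IDS. It is only a sketch under stated assumptions: the strength values, the fixed decrement after a failed hypothesis, and the function name are all hypothetical, and the strongest analyzer is simply taken to capture attention on each trial. The code counts how many trials pass before the relevant analyzer becomes dominant.

```python
def trials_before_attending(strengths, relevant, decrement=2, max_trials=50):
    """Count failed trials before the relevant analyzer becomes the strongest.

    strengths: mapping from dimension name to analyzer strength (hypothetical units)
    relevant:  the dimension that actually predicts reinforcement
    """
    strengths = dict(strengths)  # copy so the caller's values are untouched
    for trial in range(max_trials):
        # The strongest analyzer captures the animal's attention on this trial.
        dominant = max(strengths, key=strengths.get)
        if dominant == relevant:
            return trial
        # The hypothesis based on the wrong dimension fails, so that analyzer weakens.
        strengths[dominant] -= decrement
    return max_trials


# Prior training has strengthened a different analyzer in each group.
ids_group = {"shape": 9, "color": 4}  # previously rewarded for shape
eds_group = {"shape": 4, "color": 9}  # previously rewarded for color

# Both groups now face a new problem in which shape is the relevant dimension.
ids_trials = trials_before_attending(ids_group, relevant="shape")
eds_trials = trials_before_attending(eds_group, relevant="shape")

print("IDS group, wasted trials before attending to shape:", ids_trials)
print("EDS group, wasted trials before attending to shape:", eds_trials)
```

With these made-up values the IDS group attends to shape immediately, while the EDS group must first weaken its color analyzer through several unrewarded trials.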

        And that brings us to the second stage. In this second stage, the animal has to focus on various features or values of the proper dimension, and has to acquire knowledge about which response it has to make. Presumably, by accidentally making a response (in the case of a naive animal first being put through training), it discovers that a consequence such as obtaining food occurs. So, hypothesis testing ensues in terms of developing an appropriate response, and connecting that response with the right stimulus. The Kohn and Dennis observational learning study discussed in Chapter 5 is relevant here. In one of their groups, rats observed other rats making a discriminative choice response to one of two stimuli. Observational learning in this case presumably activated the appropriate analyzers because of vicarious reinforcement. These animals started learning about both the correct stimulus and the correct response. But Kohn and Dennis played a nasty trick on this group: They had to learn a reversal shift! That is, they had to approach the stimulus they observed others avoiding. And as you may recall, this group showed negative transfer: They were the slowest to learn.

        Thus, learning will be affected by factors in both the first and the second stage. Another group of rats in Kohn and Dennis's study who did not see the stimuli learned faster, even though they had to go through Stage 1 learning cold compared to our group above. So, as you can see from this type of analysis, paying attention to the proper dimension isn't the only important factor. If you form a specific but wrong hypothesis about a feature-response contingency, it can cause massive interference. Some such mechanism may account for why most animals (though primates are sometimes an exception!) find reversal shifts so difficult. In reversal and non-reversal shifts, there is a trade-off between Stage 1 and Stage 2 learning.

        Anyway, with this as a short introduction to attentional hypothesis testing, let us look at some of the findings.

Some Findings

        We will start with a rather whimsical experiment by Reynolds. Reynolds set up a discrimination learning experiment with pigeons that involved two keys. One had a green background, and the other had a red background. On top of the green key was a white circle, and on top of the red key was a white triangle. The subjects were reinforced for pecking at the red key with the white triangle.

        Now, let's look at several possibilities regarding learning. According to the Hull-Spence theory of learning, the most likely result is that the animal forms an association to the stimulus complex, the red background/white triangle stimulus. If that is the case, a generalization gradient ought to show some reduced but still strong responding to a stimulus that has just the red color (it is somewhat similar to the S+), or one that has just an illuminated white triangle (again, it is somewhat similar). Alternatively, maybe one of these was a lot more salient than the others, in which case we might find much more generalization to the more salient stimulus. Color might be more salient than subtleties of shape for pigeons, for example, so we might find significantly more generalization to color for all pigeons that go through this experiment.

        But, attentional theory makes a different prediction. First, on the assumption of limited attentional capacity, we would expect these birds to attend one dimension, and not both (the assumption of selective attention). So, we would predict the animal responds to red or to triangle, but that the combination of the two is not terribly important. We can call this winner-take-all processing: The strongest analyzer (the winner) draws all of the attention, so that the animal effectively engages in hypothesis testing (looking at one thing at a time). And second, if there is no prior reason to believe that the shape and color analyzers have become differentially activated due to experience or salience, then by accident some pigeons may have their attention dominated by the shape analyzer, and others by the color analyzer.

        So what happened? Reynolds reports on two pigeons who were given generalization tests following training on the discrimination task. The tests involved exposing the pigeons to one of four buttons: red, green, lit triangle, and lit circle. In terms of pecks per minute, one pigeon simply avoided red, treating it as equal to green and circle (which were parts of the S-). And the other pigeon showed a similar pattern, except that it avoided triangle. Put simply, the first pigeon's hypothesis was clearly that it was supposed to peck at triangles, and the second pigeon's hypothesis was that it was supposed to peck at red things. This study thus displayed strong evidence of selective attention.

        Curiously enough, the study also showed more responding for both pigeons to the complex (the red and the triangle) than to whichever element that pigeon responded to on later generalization trials. The fact that each bird ignored one of these elements when it was by itself, however, may suggest that this element in the complex served as an occasion setter, much as contextual cues will help remind us (by priming memory) of what we are supposed to do. Thus, for the bird with the hypothesis red, the presence of the triangle may have acted as a retrieval cue that helped the bird remember what it had done in the past when faced with red things. A red key, of course, might also help to prime, but the priming will be stronger when all cues are present.

        A second experiment we can discuss in this context involved a clever study by Lawrence. In this experiment, Lawrence used a complex transfer design that avoided both stimulus generalization and response generalization. In a first phase, rats learned to jump to a chamber, whereas in a second phase they learned to run a T-maze. Running and turning to the left or right is a completely different response than jumping to a chamber, so no response generalization would be expected: Animals in the second phase have to learn a completely new response. And by the same token, chambers and T-mazes represent quite different stimuli, precluding generalization on that basis. If you earlier jumped to a black chamber instead of a white chamber, how would this help you decide what to do when placed in a black rather than a white T-maze?

        Since Lawrence's design was rather complex, let's diagram part of his experiment:

            Group    Simultaneous Discrimination        Successive Discrimination

            1        black vs. white chamber            black vs. white T-maze
            2        large vs. small chamber            black vs. white T-maze
            3        rough vs. smooth floor chamber     black vs. white T-maze

Here, in the first phase, the rat had to jump to one of two chambers from an elevated stand. This, of course, is a choice discrimination in which jumping to the wrong chamber results in lack of reinforcement. In fact, animals who jumped to the wrong chamber were unable to get into it, and fell into a supporting net beneath the chambers. From the point of view of algebraic summation theory, the two types of discrimination are so different that we ought to predict no generalization (i.e., no transfer) from one to the other.

        But what about successive discrimination? In this transfer task, all of the animals had to learn that one color was associated with a turn to the left (because that is what resulted in reinforcement), and another was associated with a turn to the right. In fact, rather than the two T-mazes listed above, Lawrence actually put his animals in four mazes on various trials. Thus, there were two black mazes, one with a rough floor and the other with a smooth floor, and two white mazes, similarly distinguished by the roughness of the floor. The roughness of the floor was irrelevant to what the animal had to do to get food, however. Only color was contingent with location of the reinforcer.

        You should be able to predict what happened in this latter task. Group 1's analyzer for color should have been strongly active, whereas Group 3's analyzer for texture should have been strongly active. Thus, placed in these mazes, these groups should have been initially attending very different things about the mazes. Specifically, Group 1 should properly have been checking out color as its first hypothesis, and that should have speeded Stage 1 learning. But Group 3 should have checked out the wrong hypothesis first (that texture is somehow linked with what the animal is supposed to do), and that should have slowed learning. The results were in line with these predictions: Group 1 learned the maze problem more rapidly than Group 3.

        And what about Group 2? Their analyzers for size are active coming in to the maze problem, but since the mazes don't differ on size, that analyzer quickly becomes deactivated. Some of these animals would have subsequently checked out texture first, and others, color. Thus, averaging over the animals that chose the right hypothesis and those that chose the wrong hypothesis, we ought to find that Group 2's learning is somewhere in between Group 1 and Group 3. And that is what happened.
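
        The ordering of the three groups can be put in a small sketch. Treat it as an illustration under stated assumptions rather than as Lawrence's or Sutherland and Mackintosh's own formalism: the function name, the dimension labels, and the idea of scoring a group by the average number of wrong hypotheses it checks before it gets to color are all made up for this example.

```python
from itertools import permutations

# Dimensions on which the T-mazes actually differ; only color predicts food.
MAZE_DIMENSIONS = ["color", "texture"]
RELEVANT = "color"


def wrong_hypotheses_before_color(trained_on):
    """Average number of wrong hypotheses checked before color.

    Assumption (illustrative only): the animal first tests the dimension
    it was trained on if the mazes vary on it; otherwise it picks an
    order over the available dimensions at random.
    """
    if trained_on in MAZE_DIMENSIONS:
        rest = [d for d in MAZE_DIMENSIONS if d != trained_on]
        orders = [[trained_on] + list(p) for p in permutations(rest)]
    else:
        # The trained analyzer (e.g., size) finds nothing to work on.
        orders = [list(p) for p in permutations(MAZE_DIMENSIONS)]
    return sum(order.index(RELEVANT) for order in orders) / len(orders)


group_costs = {
    "Group 1 (color)": wrong_hypotheses_before_color("color"),
    "Group 2 (size)": wrong_hypotheses_before_color("size"),
    "Group 3 (texture)": wrong_hypotheses_before_color("texture"),
}
for group, cost in group_costs.items():
    print(group, cost)
```

        Under these assumptions, Group 1 wastes no hypotheses, Group 3 wastes one, and Group 2 falls halfway between, which is the ordering Lawrence observed.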

        Let us take one more series of studies that fit in with attentional hypothesis testing. This series involves work done by Harlow on learning sets. In these experiments, Harlow effectively asks the question of how quickly and efficiently animals can learn to evaluate hypotheses. Harlow sets up his experiment so that his animal subjects (typically monkeys, but the same results have been found in a number of other species) have to learn a long series of simultaneous discrimination problems. Once they have acquired one problem, they are moved to the next. However, one of the questions Harlow asked was whether his animals would be capable of a win-stay lose-shift strategy. That strategy is in some sense the backbone of a hypothesis testing approach: As long as what you are doing is correct, stick with it (win-stay); when it leads to a wrong answer, change it (lose-shift). What Harlow found was that his animals were capable of acquiring this strategy so that on later trials involving new problems, they effectively needed to make just one response to figure out which stimulus to pick.
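
        The win-stay lose-shift rule is simple enough to write out. What follows is a minimal sketch, not Harlow's procedure: the object names ("cube" and "funnel"), the number of trials, and the function names are hypothetical, and the first trial is assumed to be a pure guess.

```python
import random


def win_stay_lose_shift(previous_choice, was_rewarded, options):
    """Pick the next choice on a two-object discrimination problem."""
    if previous_choice is None:
        return random.choice(options)      # first trial: just guess
    if was_rewarded:
        return previous_choice             # win-stay
    return next(o for o in options if o != previous_choice)  # lose-shift


def run_problem(correct, options, n_trials=6, seed=0):
    """Return a list of True/False outcomes for one new problem."""
    random.seed(seed)
    choice, rewarded, outcomes = None, False, []
    for _ in range(n_trials):
        choice = win_stay_lose_shift(choice, rewarded, options)
        rewarded = (choice == correct)
        outcomes.append(rewarded)
    return outcomes


outcomes = run_problem(correct="cube", options=["cube", "funnel"])
print(outcomes)
```

        Whatever happens on the first trial, every later trial in the list comes out correct: one response is all the information the rule needs.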

        More interestingly, Harlow also asked whether his animals could acquire a win-shift lose-stay strategy. In this latter series of studies, a problem is always followed by a reversal shift problem with the same stimuli, after which a new problem is given, and then followed by its reversal shift. So, once you have solved a problem with new objects, you will now be placed on the next problem in which you will need to choose the stimulus exactly opposite to the one you just picked. In a particularly devious version of this procedure, we get you to the point where there are just two trials on each set of stimuli, the second being the reversal shift. Here, to do well, you must immediately pick the other stimulus if you chose correctly on the first trial (win-shift), and continue to pick the same stimulus if you chose incorrectly (lose-stay), since that loser will be the winner on the next go-around.
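
        The two-trial version can be checked with a few lines of code. This is only a sketch of the logic described above; the object names and function names are hypothetical.

```python
def win_shift_lose_stay(first_choice, first_rewarded, options):
    """Second-trial choice when trial 2 is always a reversal of trial 1."""
    other = next(o for o in options if o != first_choice)
    return other if first_rewarded else first_choice


def second_trial_correct(first_choice, correct_on_trial_1, options):
    """Check whether the rule picks the winner on the reversal trial."""
    first_rewarded = (first_choice == correct_on_trial_1)
    # Reversal shift: the object that lost on trial 1 wins on trial 2.
    correct_on_trial_2 = next(o for o in options if o != correct_on_trial_1)
    second_choice = win_shift_lose_stay(first_choice, first_rewarded, options)
    return second_choice == correct_on_trial_2


options = ["cube", "funnel"]
results = [second_trial_correct(first, "cube", options) for first in options]
print(results)  # prints [True, True]
```

        Either first choice leads to a correct second choice, even though following the rule after a win means abandoning the object that was just reinforced.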

        Harlow found that win-shift lose-stay strategies were also learnable. Such strategies pose a particular problem for theories like algebraic summation, because they require an animal to consistently choose the stimulus with the presumably lower habit strength. Nevertheless, they are the type of learning strategy consistent with a hypothesis-testing approach and with a relational approach. Animals can apparently learn about complex sequential patterns of stimuli (a relational approach), a view that should remind you of Hulse's work (discussed in Chapter 6) with multiple schedules of reinforcement.

        Although the attentional, hypothesis-testing models appear to be well-supported by a lot of data, you should not think that these models have no problems of their own. Their view of selective attention (in which the animal attends only one dimension at a time), in particular, is probably overstated. For example, in a study by Thomas, Freeman, Svinicki, Burr, and Lyons, pigeons were given a simultaneous discrimination involving two keys of different colors. Each key had the same vertical line on top of it. Based on the winner-take-all assumption, we would have predicted a strongly activated analyzer for color, and deactivation of the analyzer for orientation. But, when Thomas et al. gave their animals a generalization test involving just lines tilted at various degrees from upright, they found the usual generalization gradient. Thus, these birds had also apparently learned something about the line (a very different result than that of Reynolds).

        On the other hand, perhaps Gibson's idea of stimulus differentiation comes in here. Perhaps the more experience an animal has with a dimension, the more it can start to attend other features that are present in the stimulus. A claim like this would assert that a finding like Thomas et al. will depend critically on the amount of training. Moreover, additional training involving obtaining the reinforcer makes the expectancy associated with the response and with all of the present stimulus elements that much stronger, so that when the expectancy is violated, it is that much more surprising. (Compare your reactions to someone who says "hello" to you five days in a row, and then doesn't on the sixth day, with someone who says "hello" to you 5 months in a row, and then fails to do so: The latter is a much stronger cue that something is out of whack.) Some process like this may account for the overlearning reversal effect. On this account (and I think Harlow's work with learning sets fits in, as well), experience ought to free you not only to learn more and more things about the stimulus, but also to revise your confidence in its link with a certain outcome, thus affecting how much evidence you will subsequently need to decide that a hypothesis is no longer tenable.

        In any case, most theorists today view the Sutherland and Mackintosh model (and variations of attentional theory) as more viable than algebraic summation.

        With this as background, let us turn to studies of human attention.

II. Attentional Processes in Humans

        Does the human literature also support the predominant conclusion obtained from the animal studies that attention is related to learning? That issue (and how attention works) will be the focus of this relatively brief section.

A. Preliminary Memory Trace Model

        We need to start with an overview of a standard model for human memory. The model is standard in the sense that it captures some of the features that most memory theorists agree on. (It follows roughly the same organization of levels as the multistore model of Atkinson and Shiffrin discussed in the next chapter.) This model is schematically presented in Figure 7.

        In this generic model, we identify three different levels of memory systems. Two of these systems, working memory (or short-term memory) and long-term memory, should be familiar from our discussion of Wagner's priming model. Working memory (WM or STM) is a highly fragile system (by which I mean that information -- a memory trace -- that enters it can easily be lost due to disruption). Working memory is also highly limited in capacity: It can only hold a small amount of information. As in Wagner, it is the site of active processing or thinking, and involves what we are conscious of at the moment (though it need not be restricted to consciousness). Memories in working memory need not become permanent. You can use working memory to temporarily store a phone number you've looked up, so that you can dial it from memory. After dialing, however, that phone number will likely be forgotten. Working memory can also hold past memories that have become activated. Thus, we talk about you retrieving into working memory a long-term memory of what you were doing last Friday evening. The information in working memory is encoded into some meaningful form. In other words, it has already been analyzed, processed, and recognized by the brain. So, a certain set of soundwave frequencies may have been recognized as the word psychology (acoustic encoding), and you may also have retrieved information about the meaning of this word, and who you know who is a psychology major (semantic encoding). You might even pull out an image of a prototypical psychologist's office with a couch (visual encoding). As these examples indicate, the three major processes involved in discussing memory are encoding, storage, and retrieval. Storage is temporary in working memory.

        The other system that should be familiar is long-term memory (LTM). There are all sorts of different encodings in long-term memory, and there are all sorts of memory traces in long-term memory. Long-term memory is regarded as relatively permanent, and in most theories, relatively inactive. In order for working memory to work, it has to locate information in long-term memory and use that information in an appropriate fashion. New information that enters working memory may become permanent, in which case you have formed a memory trace for that information not only in working memory, but also in long-term memory. As in Wagner, the idea is that memory traces in working memory may activate similar traces in long-term memory (in which case you have been primed or reminded of something), and sometimes these traces involve automatic routines. Thus, permanent learning involves creating new traces in long-term memory. As in Wagner, for this to happen, episodes in working memory have to be processed or rehearsed in some fashion.

        Finally, we add a series of memories that are called sensory memories, and that involve sensory registration. These are the interface between the outside world and your working memory. The idea is that physical energies (sound waves, light waves, etc.) cause various sensory neurons to react (changing their patterns of electrical activity). These congregations of sensory neurons may fire long after the actual physical energy that caused them to react in the first place has gone. That is why they are referred to as memory systems. They are maintaining a record of the activity that happened in the recent past. They serve a very important function by doing so: It takes a bit of time for the brain to analyze the electrical activity it is receiving, and to make sense of it. By maintaining a representation of the pattern for some period of time, these sensory systems ensure that the brain has sufficient time to process these patterns, and recognize the information they provide about events in the outside world. Thus, they allow the brain to recognize a stimulus potentially long after the actual stimulus has gone. Obviously, in these memory systems (corresponding to visual neurons, auditory neurons, taste neurons, etc.), the encoding is nothing more than a raw pattern of electrical activity: It has not been interpreted as meaningful in the sensory memory; that sensory memory is just holding some transformation of external physical energy for the brain to analyze and make sense of. In this sense, we refer to sensory memories in this model as precategorical.

        So, where does attention fit in, here? At any given moment, you are being bombarded by a huge amount of sensory stimulation (sights, sounds, touch, pressure, etc.). But, working memory can only process a minute fraction of that information because of the bottleneck in its capacity. Thus, there is a need for a process by which some of the incoming information is selected out for further processing. That process is what we normally call attention. It implies an ability to focus on certain tasks or events while ignoring others.

        Given that, let us ask where such selection occurs. Is it before information gets to working memory, or after? And how successful is it?

B. Evidence from Focused and Divided Attention Tasks

        We'll start with how successful selection is. That is, can we focus attention, and what are the implications of such focusing for learning? These questions have been the concern of experiments on divided and focused attention. In divided attention, people have to do several things at once. We look at how successful they are, compared to people who are only asked to do one of those things. Specifically, when we come back and check for what was learned, did having to do a second task affect what got into long-term memory? Similarly, in focused attention tasks, we ask people to look at or listen to one stream of information out of multiple streams. The question here is, does anything besides what people focus on get into long-term memory? To see, let's look at some studies.

        The first study we'll look at is the most fun. It involves the infamous shadowing task developed by Cherry. Here is the procedure: Our subjects have headphones on. There are words streaming in on both their left and right ears. In order to focus their attention, Cherry asks them to repeat back (at the same time that they are hearing it) everything that comes in on one ear (say, the left ear). The technical terminology for this is the attended channel or attended ear. And the issue is, what, if anything, do people pick up from the unattended channel or unattended ear?

        What Cherry found was that people did not remember any of the words on the unattended ear. They did recognize major physical changes (such as a change of voice from male to female, or a change of intensity from loud to soft), but not the individual words themselves. In fact, halfway through the experiment, Cherry started sneaking in words in German (these subjects spoke only English): Nobody noticed! Cherry also snuck in reversed, backwards English words, and nobody noticed that either. So, from this study, it looks like information that is paid attention to will likely be learned, but stimuli that are not attended will not. Note how this contradicts a model like algebraic summation theory that effectively claims all stimuli on a given trial where RF is present will have their associations affected by the presence (or lack) of a reinforcer. These results, however, fit in nicely with the attentional hypothesis testing model in that attention is required for memory (our measure in Cherry's study of whether there had been any learning).

        Moray did a similar study. In his experiment, people were also asked to shadow. However, he repeated some words in the unattended ear up to 35 times. After the experiment, Moray gave a recognition test for words in the two ears. People were able to recognize words they had shadowed, but not words that were in the unattended ear. If these had gotten in, we would have expected that words repeated 35 times should have higher recognition rates than words presented only once or twice. This did not happen, further supporting Cherry's finding that attention and learning are intimately connected.

        What about divided attention? Mowbray had one group of subjects read a story. Another group both read the story and listened to another story at the same time. When Mowbray checked for what people remembered about the story in the two groups, he found considerably worse memory in the dual-task group. So, the corollary to focusing attention being good for learning appears to be that dividing attention is bad. Although we did not discuss this in the earlier section, there is evidence that such a principle holds for animals, as well. The more features there are distinguishing two stimuli in a discrimination learning experiment, the more slowly the animal learns the discrimination.

        Another study on divided attention comes from Axelrod and Guzy. They had a very simple task: Listen to clicks over earphones, and estimate their rate. As did Mowbray, they used a divided and a focused attention group. The divided attention group (the dichotic or two-eared group in a task involving listening) had the clicks divided between the right and left ears. The monotic (one-eared) or focused attention group heard all the clicks in one ear. Not surprisingly, both groups increased their estimates of click rate as the clicks actually started coming faster. However, the increase was higher for the focused attention group. Why? Supposedly, the divided attention group had to switch back and forth between ears to listen to and estimate the rate of the clicks. When the clicks were coming too fast, they were losing some of the information because of the time taken to switch attention. Thus, for them, fewer clicks got into their working memory at the fast rates. So of course, they gave lower estimates.
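
        The switching explanation can be made concrete with a toy simulation. Everything numerical here is an assumption for illustration, not a value from Axelrod and Guzy: clicks are evenly spaced, dichotic clicks alternate ears, and the listener needs a hypothetical 200 ms after registering a click before attention can register one in the other ear.

```python
def perceived_clicks(rate_per_sec, duration_sec, dichotic, switch_ms=200):
    """Count the clicks that reach working memory.

    Times are kept in whole milliseconds. `switch_ms` is a made-up
    switching time used only to illustrate the argument.
    """
    n_clicks = rate_per_sec * duration_sec
    if not dichotic:
        return n_clicks            # one ear: no switching, nothing lost
    interval_ms = 1000 // rate_per_sec
    heard, ready_at = 0, 0
    for i in range(n_clicks):
        t = i * interval_ms
        if t >= ready_at:          # attention has arrived at this ear
            heard += 1
            ready_at = t + switch_ms
    return heard


for rate in (2, 5, 10, 20):
    print(rate,
          perceived_clicks(rate, 10, dichotic=False),
          perceived_clicks(rate, 10, dichotic=True))
```

        At slow rates the two groups register the same number of clicks. Once the clicks come faster than attention can switch, the dichotic count stops growing while the monotic count keeps climbing, which is the pattern of lower estimates described above.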

        And since we have concentrated on auditory attention, let us consider yet a further study by Neisser and Becklen. They also used a divided and a focused attention task. In their study, people watched two films superimposed on one another. One involved a hand-slapping game, and the other involved a ball-passing game. People in the focused attention condition had to keep track of the number of significant events in one or the other game, but people in the dual-task condition had to keep track of both (i.e., the number of ball passes, and the number of hand slaps). Performance deteriorated greatly in the dual-task condition. Thus, the findings with visual attention are in line with the findings regarding auditory attention.

        Thus, from these studies, we obtain some preliminary conclusions: (1) It takes time to switch attention, and when you do, you may lose some information (i.e., it may never get into working memory); (2) if you try to pay attention to too many things at once, you will get overloaded and lose information; (3) you can focus attention, but it seems to be based on physical features such as attend-to-stuff-on-the-left or attend-to-stuff-on-the-right, since we are sensitive to physical changes but not meaningful changes in the unattended channel; and (4) what you pay attention to will have a chance of being learned (getting into working memory, and from there, maybe long-term memory). As with the animals, it appears that attention is a prerequisite for learning.

C. Filter Models of Attention

        So, given the basic findings, how can we incorporate these into specific models of how attention operates? There have been a number of models posited over the years. Generally, we can break these down into two broad sorts of classes (though see the chapter on connectionism for yet a third approach), filter models and capacity models. Filter models claim that attention acts as a filter that lets some information through while blocking or tuning out other information. Capacity models involve later theorizing, and claim that attention is equivalent to the amount of mental energy you have for doing something. As there is a limited amount of capacity or energy, the more you try to do, the less energy you have available for any given task (a bit like overloading an electrical circuit, where all the appliances start operating less efficiently, and the lights start dimming). Within the filter models, there is a further breakdown depending on whether a theorist claims that attention is precategorical (operating as an early filter between the sensory memories and the working memory), or postcategorical (operating as a late filter once information has actually gotten in to working memory). Let us take a brief tour of the major models.

Broadbent's Blocking Filter Model

        Broadbent's model was really the first modern model of attention. In this model, an early, precategorical filter basically allowed one stream of information to get through from sensory memory to working memory: Everything else in sensory memory was blocked out perfectly (which is why this is called the blocking filter model). Moreover, Broadbent claimed that we focused or switched attention by tuning in a physical characteristic (just like a radio tuned to a certain physical frequency allows the reception of information being carried on that bandwidth). How Broadbent's model handles the results in the previous section should be obvious: They all fit in nicely with it. Divided attention tasks can be explained by the need to constantly switch the filter, which should result in losing information. Since the filter may be kept tuned to a single channel in a focused attention task, however, the information in that channel will have a very good chance of getting in. Moreover, information in the unattended channel will not be processed because the filter doesn't let it through. At the same time, since filters require tracking physical changes to maintain focus (or switch attention), our systems will be sensitive to those.

        There are a number of similarities between Broadbent's model and the attentional theories of animal learning we looked at earlier. In particular, Broadbent's claim that we generally focused on just one stream is certainly reminiscent of a winner-take-all postulate in which one analyzer only draws attention to itself. But while reminiscent, it should not be mistaken for exactly the same idea: Multiple information (including multiple dimensional values) can presumably stream in on the channel that is currently active.

        In an experiment on this model, Broadbent gave subjects simple words to listen to. Subjects heard three words flow in to the right ear, and three flow in to the left, in pairs. In a switching group, he asked people to recall the words chronologically in pairs (i.e., the first pair consisting of the word on the left and the word on the right; then the second pair; and finally, the third pair). In a non-switching group, he asked people to recall all the words on one side before recalling the three words on the other. The non-switching group did a lot better than the switching group.

        If we relate his model to Figure 7, there will be a filter between the sensory memories and the short-term memory that allows one channel of information through. Because the sensory memories decay rapidly, however, switching back and forth between two sensory channels means that the memories are decaying in the time it takes to switch. Thus, requiring a lot of switches should mean that you are more likely to lose information by the time you try to recall your last pair.
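
        To see why the two recall instructions differ so much on this account, it helps simply to count the switches each one demands. The sketch below does only that; the word lists are hypothetical, and the decay of sensory memory during each switch is left as the verbal argument given above rather than modeled.

```python
def count_switches(recall_order):
    """Count how often the filter must change ears during recall."""
    ears = [ear for ear, _ in recall_order]
    return sum(1 for a, b in zip(ears, ears[1:]) if a != b)


left = ["dog", "tree", "cup"]     # hypothetical word lists
right = ["pen", "road", "hat"]

# Switching group: recall pair by pair (left, right, left, right, ...).
by_pair = [item
           for lw, rw in zip(left, right)
           for item in (("L", lw), ("R", rw))]
# Non-switching group: all of one ear, then all of the other.
by_ear = [("L", w) for w in left] + [("R", w) for w in right]

print(count_switches(by_pair), count_switches(by_ear))  # prints "5 1"
```

        Five switches against one means five chances, rather than one, for the contents of the unattended sensory memory to fade before the filter gets back to them.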

        Unfortunately for Broadbent, several results came in that were inconsistent with the notion of a perfectly blocking filter. This led to a new model. Before discussing the new model, however, let's talk about some of the problems Broadbent's model faced.

        There is a phenomenon called the cocktail party effect. When you walk into a large party, you cannot hear all the conversations at once. If you try, it sounds like one huge buzzing noise. But, obviously, we can focus in on a specific conversation and apparently block out the other stuff going on (same with being in a crowded, noisy restaurant). So far, that is compatible with Broadbent's model. However, if someone across the room or the restaurant happens to mention your name, you may become aware of that and suddenly switch attention to that conversation. In other words, people do seem to pick up some unattended material that can cause their attention to switch. That suggests that some type of information is getting through on an unattended channel.

        Is there a way of testing the reality of the cocktail party effect in the laboratory? The answer is "yes." Thus, Moray in the experiment mentioned earlier involving word repetitions also occasionally snuck in the subject's name. At least a third of the time, people did hear their own names, though they didn't pick up on anything else. Similarly, Neisser did a study on visual attention in which a focused attention group read a story that appeared on even lines, while a second story was presented on odd lines. Consistent with the results of earlier focused attention studies, Neisser found little evidence of people picking up details from the unattended story. But if their names were in that story somewhere, then they did notice that, at least some of the time.

        A third study conducted by Underwood was rather interesting. Underwood had subjects shadow one ear, but he also required them to do a divided attention task. Not only were they supposed to shadow, but they were also supposed to press a button whenever they heard a number. Numbers, of course, could come in on the attended (shadowed) ear or the unattended ear. People detected numbers on the unattended ear only about 8% of the time, compared to nearly 90% on the shadowed ear. That partially fits in with earlier results, although that 8% suggests that some information may be getting through. However, people in this task might also be doing a bit of switching every now and then, and occasionally picking up a number because they've momentarily switched attention to the other ear.

        However, Underwood also asked Moray to serve as a subject. Moray by this time had designed quite a large number of experiments on shadowing, and had become well practiced in this technique. And when Moray did the task, his detection rate for numbers in the unshadowed ear was much higher than the rates of the unpracticed subjects. So, the attentional system appeared significantly more flexible than Broadbent had given it credit for.

Treisman's Attenuator Model

        The next model that was developed attempted to handle such findings as the cocktail party effect. It involved some (though not a whole lot of) change in Broadbent's model, and Broadbent in fact also subsequently adopted it. This was the attenuator model of Anne Treisman.

        Treisman, like Broadbent, adopted an early filter model (i.e., a precategorical model involving a filter between a sensory channel and working memory), and also claimed that people could focus only on the basis of physical features. In her model, however, unattended channels are not perfectly blocked. Rather, they are tuned down low enough that the signals coming through those channels are very weak. As an example, a weak TV signal from another channel will sometimes appear on top of the channel you're actually watching; it's annoying, but you can still follow the desired channel. Or as another example, you can occasionally hear someone else's barely audible conversation when you're on the phone, and there's some cross-talk.

        There were a few other additions to the model that we need to talk about, however. Thus, Treisman suggested that there were a series of filters specialized for different stimulus features, and not just one filter. Moreover, associated with this model was the notion of a lexical dictionary.

        A lexical dictionary is the part of the memory system that stores the patterns (visual, auditory, etc.) that we can recognize. When you hear the word "dance," for example, a DANCE pattern in the lexical dictionary (in LTM) gets activated, and provides information about where to find further information about what dances are. But, different entries in lexical memory have different thresholds of activation. Words or patterns that are very familiar will have very low thresholds for becoming fully activated. In contrast, unfamiliar words or patterns will have high thresholds (they will need more processing to be fully activated). And these notions of activation and threshold can account for findings like the cocktail party effect.

        How do they do so? The entries in the lexical dictionary become activated when they enter working memory. Normally, weak signals will result in weak activation, so that those concepts never get above their threshold levels. Occasionally, however, a weak signal will come in that activates a concept with a very low threshold. When that happens, even though the signal is very weak (because it came through one of the attenuated channels), it may nevertheless be just strong enough to push a concept above its threshold. And when a concept is above threshold, it becomes recognized for further processing, and enters consciousness.

        On this account, of course, Treisman's quite reasonable claim is that one of the entries that will have an extraordinarily low threshold is your own name. And that explains why you sometimes are aware of your name when it comes in on the unattended channel (the cocktail party effect).
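
        The difference between a blocking filter and an attenuating one comes down to a single number, as the following sketch tries to show. The thresholds, the signal strength, and the attenuation value are invented for illustration; Treisman's model does not specify them.

```python
def reaches_awareness(signal_strength, attended, threshold, attenuation=0.2):
    """Return True if an entry in the lexical dictionary crosses threshold.

    Assumption (illustrative numbers only): an unattended channel passes
    a fixed fraction `attenuation` of the signal. Broadbent's blocking
    filter corresponds to attenuation = 0.
    """
    effective = signal_strength if attended else signal_strength * attenuation
    return effective >= threshold


# Hypothetical thresholds: your own name is far easier to activate.
thresholds = {"own name": 0.1, "table": 0.5}

for word, threshold in thresholds.items():
    print(word,
          reaches_awareness(1.0, attended=True, threshold=threshold),
          reaches_awareness(1.0, attended=False, threshold=threshold))
```

        With these numbers, both words are recognized on the attended channel, but only the low-threshold entry gets through on the unattended one. Setting attenuation to 0 blocks even your own name, which is the prediction the cocktail party effect contradicts.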

        Is there any further evidence in support of these ideas? Quite a lot, actually. The notion of threshold, in particular, is a well defined one. Consistent with this notion, we find in lexical decision tasks that how long it takes people to recognize a word correlates with the frequency of that word in common language. Words that are used a lot will have lower thresholds than words that are used only rarely. So, presumably, less common, less frequent words will need to be kept in working memory a bit longer before their corresponding entries in the lexical dictionary are activated above threshold, providing the information that they, indeed, are recognizable patterns.

        But there is also one more idea that has found a lot of support in experiments using the lexical decision task (LDT). This is the notion of spreading activation (similar in some respects to Pavlov's notion of activation or excitation spreading amongst similar stimulus centers). In the lexical dictionary, concepts that become activated spread some of that activation to semantically or associatively related concepts in the lexical dictionary. Thus, activating one concept starts getting other related concepts closer to their thresholds. John Anderson has put it quite nicely: He claims that activation is a method that prepares information that is likely to be relevant to whatever it is we are currently doing.

        With this as background, consider the following experiment with LDT by Meyer and Schvaneveldt. People get a string of letters on a screen, and have to press a button to indicate whether the string forms a word in English. The question is, how long do lexical decisions take? They find that a priming word (also called the prime) presented just before the string of letters people have to judge will speed decision times for related words. This finding is called the priming effect. In the following design, for example, Group 2 ought to be faster at judging the target word (the time to judge that word starts, by the way, the moment it appears on screen): The prime word is related to the target word in Group 2, but not particularly in Group 1:

                        Group 1         Group 2

        prime           butter          doctor
        target word     NURSE           NURSE

        I have spent some time on spreading activation at this point, because it also forms a part of Treisman's account, and allows her to explain some interesting findings. To see this, consider yet another experiment by Treisman and Geffen. In their study (a type of experiment referred to as a switching experiment), they gave their people a shadowing task, but of course presented sentences to the unattended ear as well as the attended ear. The following two sentences are examples of what their people might get:

        Rondstadt has a lovely voice, but her selection of music is strange.
        Swann caught the ball and proceeded to run for a touchdown.

        Midway through these sentences (when each hit the second clause), Treisman and Geffen switched the sentences so that they traded ears. Thus, if people started out shadowing, say, the Rondstadt sentence in the left ear, then half way through, they would be hearing the Swann sentence in that ear. Accordingly, what they ought to be repeating in their shadowing would be:

        Rondstadt has a lovely voice and he proceeded to run for the touchdown.

        What actually happened, however, is that people followed the Rondstadt sentence for a few more words when it switched ears, even though it was now supposedly in the unattended or attenuated ear. When the sentence switches ears, so do the shadowers, at least for the first few words. Then they realize they've made a mistake, get confused, and try to pick up the correct ear, again.

        Note that you could interpret these results as suggesting that people can focus attention on the basis of meaning (that is, once information has become fully recognized in working memory): a postcategorical filter. In some sense, this seems the natural interpretation. But, it doesn't immediately suggest why people generally do so poorly in remembering unattended information.

        But if we now take note of the spreading activation concept, then we have an easy way of explaining these results while still maintaining that people use an early filter. That explanation is that strongly activated terms in working memory will prime corresponding concepts in the lexical dictionary. If a weak signal on the attenuated channel corresponding to one of these concepts now comes through, that signal along with the activation from priming may push the signal above threshold. Thus, much as doctor will prime nurse, hospital, illness, and golf, so will Rondstadt prime music, singer, song, Jerry Brown, and female personal pronouns like she and her. These words associated with the weak signal now become part of consciousness and may mistakenly be repeated before the subject realizes his or her mistake. It's a bit like trying to listen to a distant radio station; when another station fades in, you may momentarily think you're listening to the station you wanted, and wonder why they messed up the music in mid-song.
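
        The threshold argument above can be sketched in a few lines of code. This is only a toy illustration: the threshold, the signal strengths, and the size of the priming boost are arbitrary numbers I have assumed for the example, not values from Treisman's work.

```python
# Toy illustration of an attenuated signal crossing threshold with help from
# priming. All numbers are arbitrary assumptions, not measured values.

THRESHOLD = 1.0  # hypothetical activation needed for recognition


def is_recognized(signal_strength, priming=0.0, threshold=THRESHOLD):
    """Return True when bottom-up signal plus priming reaches threshold."""
    return signal_strength + priming >= threshold


attended = 1.2       # full-strength signal on the attended channel
attenuated = 0.6     # weakened signal on the unattended channel
priming_boost = 0.5  # activation spread from a related, already active concept

print(is_recognized(attended))                   # attended word gets through
print(is_recognized(attenuated))                 # unprimed attenuated word does not
print(is_recognized(attenuated, priming_boost))  # primed attenuated word does
```

        The point of the sketch is simply that neither the weak signal nor the priming is enough alone; it is their sum that crosses threshold.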

Attention as a Late Filter: Deutsch-Norman Model

        So much for the early filter models. Not everyone buys into these. Indeed, some people have claimed that attention operates at a late stage, once all stimuli have been processed and identified in working memory. The most famous of these is the Deutsch-Norman memory-activation model. In their model, the bottleneck that requires selective attention has nothing to do with capacity limitations in working memory. Instead, they claimed a limitation on how many different responses people could execute at a given time. Thus, they claimed that people chose some streams or channels of information to attend to on the basis of pertinence. Pertinence refers to priming or activation of concepts in long-term memory because you are expecting to deal with related issues (the fact that you are reading this means that you are expecting information about attentional models and learning; that is what is pertinent to you at the moment, and some of the related ideas you already have on this topic should thus be partially activated). When a concept is activated both because of pertinence and because it is actually encountered in a visual or auditory task, the activation from the two sources (internal and external) sums. So, we can regard the filter here as one that lets very strongly activated concepts through. Some information is accordingly selected for further processing, while the rest is dismissed or ignored. Even though the latter information has reached a level of meaningful analysis, because it is not processed further, it will be rapidly forgotten, accounting for the standard results of divided and focused attention tasks.

Johnston & Heinz's Flexible Filter Model

        Finally, let us look at a model that in some sense represents a transition to the next phase of theorizing about attention. This is the flexible filter model proposed by Johnston and Heinz. Essentially, this model claims that we have both kinds of filters: an early filter and a late filter. Moreover, we can choose which kind to use (hence the flexibility). What makes this a transitional model is that Johnston and Heinz tie what type of filter you use into a notion of effort: The late filter requires more work, attentional capacity, or effort. Thus, if at all possible, Johnston and Heinz claimed that people would prefer to use an early filter. That would account, as well, for why people generally don't seem to pick up on meaningful information in the unattended channel.

        Johnston and Wilson tested this model in a clever way. They set up an experiment in which some people could choose an early filter, but others had to use a late filter. In their experiment, subjects listened over headphones to streams of words coming into both ears. They were given the task of pressing a button whenever a word that belonged to a certain category appeared. So, if the category was clothing, they had to press for words like shoe, sock, shirt, belt, etc. One group of subjects was told that a target word would always be in the left ear. Another group had their target words appear sometimes on the left, and sometimes on the right. Thus, the first group could choose to block out the right ear, but the second group had to try to listen to both.

        Note that the words can sometimes be ambiguous. Sock can be an article of clothing, or it can be an action that you take in a soccer game. When an ambiguous target word appeared in one ear, one of three different types of words simultaneously appeared in the other. First, it could be a neutral word like Tuesday. If you are paying attention to that word, it shouldn't prime sock. Second, the other word could be an appropriate word like smelly. Appropriate words are semantically related to the target word as an article of clothing, and so should help prime the right meaning. Finally, the other word was sometimes an inappropriate word like punches. Inappropriate words are related to the non-relevant meaning, and so should prime the wrong meaning of the target word, and thus make it more likely that you will miss that word.

        Figure 8 presents the number of words that were detected as a function of what the other word was (neutral, appropriate, or inappropriate). Let's look first at the neutral word. What you can see there is that the focused attention group (those who knew the target word would be in the left ear) did better than the divided attention group (who had to listen to both ears). This supports the claim that late filters take up more work or capacity, and thus result in poorer performance. Second, if you look at the divided attention group, you can see that they did better with an appropriate word, and worse with an inappropriate word. So, apparently, they were processing meanings in both ears, as a late filter model claims ought to be possible. What you hear in one ear may influence how you interpret what you hear in the other. This is a straightforward priming effect. Finally, if you look at the focused attention group, they were totally uninfluenced by the other word: they showed no priming. So, consistent with the flexible filter model, we apparently can use an early filter. Indeed, this study suggests that people generally choose to do so, where possible, presumably because it takes less effort.

D. Attention as Capacity

        Most people today have moved away from the notion of a filter that lets some information through while blocking other information. Instead, one of the favorite metaphors is that attention involves how much processing resources or capacity you have for doing something. The first of these models (and still the major model in this group) is the capacity model of Kahneman.

        According to Kahneman, we have a single central pool of processing resources that all tasks must compete for. Because the pool is limited in capacity, we have need of an allocation policy that determines which tasks will be given what resources. Presumably, some tasks that are of momentary importance to us will be allocated a lot of resources, while other tasks that are of less importance will be allocated only a few resources. Since how fast and how successfully a task completes depends on the resources allocated to it, we can now predict the basic findings in the attention literature. Namely, in focused attention tasks, some tasks are given virtually all the resources they need. But, in divided attention tasks, there are too many tasks competing for too few resources, so everything slows down. Or to put it another way, two tasks will interfere with one another when they compete for the limited resources available, and there are not enough resources for both.

        What determines the allocation policy? First, it is influenced by how much capacity is available at any given time. Although the capacity is limited, the total amount does vary somewhat. In particular, it looks like arousal of the central nervous system suddenly frees up more capacity. So, when you are startled or suddenly scared, you are capable of doing things faster, presumably because of the additional capacity that is infused because of the adrenalin now rushing through the body.

        Second, the allocation policy is also determined by what Kahneman calls enduring dispositions. Enduring dispositions are long-standing, relatively permanent allocation policies. Secret agents like James Bond, for example, are likely to have long-standing allocation policies to process and recognize anything that sounds like the click of a gun about to go off (well, in the movies, anyway). More prosaically, we all have a long-standing allocation policy that gives resources to recognizing the sound or sight of our own names, since we normally have to take some action when we hear our names being spoken. As you can see, this predicts the cocktail party effect.

        Finally, Kahneman talks about momentary intentions also influencing the allocation policy. Momentary intentions play a role similar to attention switching in the filter models. They are the strategic decisions that subjects make about what is important, and what is not. So, in Johnston and Wilson's experiment, subjects in the focused attention group could form a momentary intention that only words coming in on the left ear were important, whereas subjects in the divided attention group had to form an intention to listen to both ears. Or to take another example, if you are reading a book while the TV is on, which of those two sources you pay more attention to (that is, allocate more resources to) will depend on your momentary intentions: Which is more important? If the book is a text and you have a test the next day, you will probably allocate more resources to reading. If the book is a formula Stephen King novel and a new Red Dwarf episode is on, you may form a momentary intention to allocate more resources to the tube. Momentary intentions can also account for the Treisman and Geffen sentence-switching result: If you are processing information about a female singer, you allocate resources to tracking semantic concepts and information related to that person.

        So, the question: How on earth does anyone figure out how much capacity is allocated to a given task? Although there are several techniques for doing this, the favorite technique is to use a specific type of divided attention task that technically involves what's called the dual-task technique. Specifically, we ask people to do two tasks at once. One of these tasks is called the primary task. That task is the one whose capacity demands we want to measure. The other task is typically a very simple task, and is the instrument by which we (indirectly) measure what the primary task is taking. That other task is referred to as the secondary task. The basic idea is simple: The more capacity the primary task needs, the less will be available for the secondary task. So if we have two different primary tasks but the same secondary task, and if the secondary task is accomplished more slowly with one of the primaries, it must be because that primary task requires more capacity.

        Since that last bit was a bit abstract, let's look at a specific example involving an experiment by Tyler, Hertel, McCallum, and Ellis. Their secondary task was very simple: A light would come on at various points, and their subjects would have to press a button as soon as the light came on. So, they measured simple reaction time to the light (from the point when the light is turned on to the point when the button is pressed). The primary task involved solving anagrams. They gave their subjects relatively easy anagrams and relatively hard anagrams for the same word. An example of one of their hard anagrams was CROODT. As for an easy version of this anagram, once you've solved it, interchange the third and last letters to see what they used. (I'm not giving it to you, since I want you to try to solve the relatively hard anagram). What they found was that people had longer reaction times to the light when they received hard anagrams. Thus, as the difficulty of the primary task increased, subjects allocated more processing resources to it, leaving fewer for the secondary task. As a result, the secondary task slowed down.
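
        The logic of the dual-task technique can be sketched as a toy single-pool model. Everything numerical here is an assumption of mine for illustration: the pool size, the baseline reaction time, the demand values for easy and hard anagrams, and the rule that reaction time scales inversely with spare capacity. None of it comes from Kahneman or from Tyler et al.

```python
# Toy single-pool capacity model. The capacity units and the reaction-time
# formula are illustrative assumptions, not part of Kahneman's account.

TOTAL_CAPACITY = 10.0  # hypothetical size of the central resource pool
BASE_RT_MS = 200.0     # hypothetical reaction time with the whole pool free


def secondary_rt(primary_demand, total=TOTAL_CAPACITY, base_rt=BASE_RT_MS):
    """Reaction time on the secondary task given the primary task's demand."""
    if not 0 <= primary_demand < total:
        raise ValueError("primary demand must leave some capacity spare")
    spare = total - primary_demand
    # Assumption: reaction time grows as the share of spare capacity shrinks.
    return base_rt * total / spare


easy_rt = secondary_rt(primary_demand=2.0)  # e.g., an easy anagram
hard_rt = secondary_rt(primary_demand=6.0)  # e.g., a hard anagram
print(easy_rt, hard_rt)  # the hard primary leaves less capacity, so RT is longer
```

        Reading the sketch backwards gives the inference the technique relies on: if the secondary task is slower under one primary than another, that primary is presumed to be taking more of the pool.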

        There are plenty of other examples here. To go back to the lexical decision task, if we make that our primary while using a reaction time task as our secondary, we'll find that we get faster times on the secondary when the words on the primary are more common. Again, with words with low thresholds, fewer processing resources need to be allocated to recognizing those words. Or to give you another example, you can recognize someone you know well from one or two facial features, but you may have to process a number of facial features to recognize someone you don't know well.

        Finally, to give you an update on the capacity model, today people also tend to believe that there are specialized pools of resources that are dedicated to certain types of processing, and not just a single central pool that everything competes for. The reason for believing in specialized pools involves work like that done by Segal and Fusella. They also used dual-task methodology. But, for their primary task, they trained their subjects to maintain either a visual or an auditory image in working memory. The secondary task was reaction time to detecting a very faint light. What they found was that reaction time was slower for people maintaining a visual image. The normal conclusion that we would come away with is that visual imagery must take up more capacity than auditory imagery. So, to verify this, Segal and Fusella adopted a new secondary task: reaction time to a very faint sound. If visual imagery takes up more capacity, then reaction times for the sound ought also to be slower in the visual imagery group. However, the results reversed: Reaction times were slower in the auditory imagery group. Segal and Fusella interpret this as supporting the claim that there are specialized pools dedicated to visual and auditory processing. Two tasks involving visual processing will compete for the same resources, and two tasks involving auditory processing will also compete. With a visual and an auditory task, however, people can use separate pools, so that the two do not compete.
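
        To see why the reversal matters, it may help to lay the two sets of predictions side by side. The following is a minimal sketch, assuming made-up demand values in which visual imagery costs more than auditory imagery (as the faint-light result alone would suggest); only the ordering of the predictions matters, not the numbers.

```python
# Toy contrast between a single central pool and modality-specific pools.
# The demand values are hypothetical; only the ordering of predictions matters.

IMAGERY_DEMAND = {"visual": 3.0, "auditory": 2.0}  # assumed, arbitrary units


def single_pool_slowdown(imagery, detection):
    """One shared pool: the imagery task costs the same whatever is detected."""
    return IMAGERY_DEMAND[imagery]


def separate_pools_slowdown(imagery, detection):
    """Separate pools: imagery only hurts detection in its own modality."""
    return IMAGERY_DEMAND[imagery] if imagery == detection else 0.0


# "visual" detection stands for the faint light, "auditory" for the faint sound.
for detection in ("visual", "auditory"):
    for imagery in ("visual", "auditory"):
        print(detection, imagery,
              single_pool_slowdown(imagery, detection),
              separate_pools_slowdown(imagery, detection))
```

        Under the single-pool function, visual imagery produces the larger slowdown for both the light and the sound. Under the separate-pools function, the larger slowdown switches to auditory imagery when the secondary task is the sound, which is the pattern Segal and Fusella reported.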

E. Attention and Automaticity

        One of the findings we looked at earlier suggested that animals may process information about several stimulus dimensions rather than just one. When we discussed this finding by Thomas et al., I suggested that perhaps animals are increasingly capable of paying attention to more information as they gain experience with a stimulus, a claim also consistent with Gibson's perceptual learning theory (and her views about stimulus differentiation), and a claim consistent with findings like the overlearning reversal effect. What about practice effects in humans?

        A number of theorists have in fact suggested that practice does enable significant increases in performance. In one of Neisser's experiments, for example, people were dual-tasked, with one of the tasks consisting of taking down dictation. Initially, the effects of task load interfered sufficiently to cause a major deterioration in handwriting. In addition, subjects in this experiment could only recall a few words of the many they had written down. But with massive practice, both handwriting and memory improved.

        Does practice simply result in being able to do things faster or more efficiently? And if speed is the only effect of practice (in terms of coordinating movements and actions more smoothly), why did Neisser also find an effect on recall? What is it specifically that practice can accomplish? In one quite interesting answer to this question, Schneider and Shiffrin have claimed that the right type of practice can result in the formation of a new process: one that does not require attention, so that it frees up attentional resources for doing other things. Such a process is referred to as an automatic process, as opposed to control or effortful processes that do require attentional resources.

Schneider & Shiffrin's Model of Automaticity

        According to Schneider and Shiffrin, automatic processes have become so well-practiced that they no longer operate in working memory. Since working memory calls on attentional capacity for what it does, this is effectively a claim that automatic processes do not require or compete for attention. Control processes, on the other hand, are effortful, and do compete for a limited pool of available resources.

        Schneider and Shiffrin cite several characteristics that distinguish the two types of processes. Among these are the following:

(a) control processes can interfere with one another, but any number of automatic processes can be performed simultaneously without interference, so long as they don't require competing responses;

(b) control processes are typically slow, but automatic processes are typically quite rapid;

(c) control processes are easy to modify, but automatic processes are very hard to change;

(d) control processes are easy to stop, but automatic processes are extraordinarily difficult to stop or inhibit once they've started; and

(e) control processes often involve consciousness; automatic processes can operate outside of consciousness.

        Their first point basically relies on the claim that control processes require attentional resources. Thus, doing several control processes at once corresponds to dividing attention, with predictable results. But since attention isn't important for automatic processes, a number of these may be active at the same time.

        Their second and fifth points are related: Speed of a control process will depend on how much attentional capacity it receives. In addition, control processes typically have several stages or sequences that have to be put together and monitored by the subject, further increasing the overall processing time. But an automatic process is a sequence of stages in long-term memory that have become very tightly connected so that activation of one immediately triggers the next. All that is needed is for the initial part of the automatic process to become activated because the right stimulus configuration is present in working memory. When that happens, the process goes off rapidly without further working memory involvement. Because there need be no further working memory involvement and because automatic processes do not have to be monitored (and because automatic processes can often finish in extraordinarily short times), they do not always intrude on consciousness (which partly involves thinking about what we're doing).

        The third and fourth points, of course, also depend in part on speed of the process, and the ability to monitor it. For Schneider and Shiffrin, automatic processes represent incredibly strong associations in long-term memory that are acquired over thousands of trials, so changing them will not be easy. But aside from that, in order to inhibit or change a process, we need to be able to react to it before it has completed. In addition to the strong associative connections, the speed of an automatic process will likely prevent us from meeting that condition: The process will have finished before we can intervene.

        So, how are automatic processes acquired? According to Schneider and Shiffrin, the answer is that automatic processes always first start out as control processes. That is, automatic processes have to be learned, and in order for learning to take place, attention has to be used. But in addition to attention, Schneider and Shiffrin claim that only a certain very special type of practice can cause a control process to eventuate in an automatic process. This type of practice is called consistent mapping. In consistent mapping, we always react to a stimulus in the same way. Thus, another way of putting it which will strike a chord from our discussion of classical and instrumental conditioning is that consistent mapping requires an extraordinarily high contingency.

        Consistent mapping, of course, isn't the only type of practice there is. The other type is called varied mapping. Varied mapping involves responding to a stimulus in different ways. For example, if your task is the LDT, whenever you see the word shirt, you will react to it consistently: Shirt always requires you to press the YES key indicating that it is a word in English, rather than the NO key indicating that it is a non-word. So, shirt is consistently mapped to the this-is-a-word decision. To take another example, if every time the animal receives a tone CS it gets a food UCS, the tone is consistently mapped to food. But if the contingency is zero so that tone doesn't predict food or its absence, then this is the same as saying that tone is mapped both to food and to lack of food. Thus, because it is mapped to two different things, tone in this latter example displays varied mapping. I am, of course, bringing in this example deliberately: I want you to start thinking of the possibility (particularly consistent with Wagner's memory model) that some types of learning in animals may result in automaticity, and that learning conditions in which there is high contingency are precisely those conditions under which we would expect automatic processes to develop.

        So, practice isn't enough; only consistent mapping practice can lead to automaticity.

        How well does this model hold up? In one experiment Schneider and Shiffrin asked their people to maintain one or four items in their working memories. We will call these the memory set. They then presented one or four items as probes. The task in this case was to indicate whether any of the probes was among the memory set. Probes that were among the memory set are called targets; probes that were not are called distractors.

        As a first pass through how people might do this, we can hypothesize that people compare each probe against the memory set looking for a match. But note that the number of comparisons that must be made increases as the number of probes or the number of items in the memory set increases. With a memory set of 1 item and a probe of 1 item, only one comparison need be made; with a memory set of 4 items and 4 probes, then we may have to make as many as 16 comparisons before being able to say yes or no (each probe is compared to each memory set item). The number of comparisons that has to be made is referred to as the task load.

        If a target is present, then the model we have presented would claim that we ought to find it, on average, half way through our comparisons. But if the target is absent, then we will have to make every last comparison to verify that we have only distractors. This type of process, called a serial self-terminating search, makes several predictions. First, as the number of potential comparisons increases, so should the reaction time: More comparisons take up more time. So, there should be an effect of load. And second, the reaction times should be increasingly longer when we have just distractors. When there are 8 comparisons to be made, you will have to make about 4.5 of these to find a target, but all 8 if only distractors are present. But when there are 16 comparisons to be made, the numbers will be nearly twice as large: 8.5 for a target being present compared to 16 for there being just distractors. The search is termed serial because we check each comparison one by one; so, the more comparisons, the longer. It is called self-terminating because we stop when we find what we're looking for.
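
        The figures of 4.5 and 8.5 fall out of a simple average. Here is a minimal sketch of the arithmetic, assuming exactly one matching comparison that is equally likely to turn up at any position in the sequence (an assumption I am adding to make the calculation concrete). The function names are mine, for illustration only.

```python
from fractions import Fraction


def task_load(memory_set_size, num_probes):
    """Maximum number of comparisons: each probe against each memory item."""
    return memory_set_size * num_probes


def expected_comparisons(load, target_present):
    """Expected comparisons in a serial self-terminating search.

    Assumes a single match that is equally likely at positions 1..load,
    so the average stopping point is (1 + 2 + ... + load) / load.
    """
    if target_present:
        return Fraction(load + 1, 2)
    return Fraction(load)  # no match: every comparison must be made


print(task_load(1, 1), task_load(4, 4))
for load in (8, 16):
    print(load,
          float(expected_comparisons(load, target_present=True)),
          float(expected_comparisons(load, target_present=False)))
```

        Note that the gap between the target-present and target-absent cases widens with load (3.5 comparisons at a load of 8, 7.5 at a load of 16), which is the second prediction described above.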

        In any case, Schneider and Shiffrin gave their subjects thousands of trials, so that all subjects were extraordinarily well practiced. One group, however, got consistent mapping, and the other got varied mapping. In consistent mapping, items that were targets on one trial were always targets on other trials (i.e., they were always consistently mapped to the this-is-a-member response). Similarly, items that were distractors on one trial were always distractors on all the other trials (they were always mapped to the this-is-not-a-member response). In varied mapping, on the other hand, the letter T might be a target on one trial (because, for instance, the probes are T, S, C, and K and the memory set is B, X, T, and F), but on a later trial it might be a distractor because the probe set is H, T, C, and P and the memory set is C, L, G, and F. (And note that C is also inconsistently mapped here: It is a distractor on the first trial, but a target on the second!) So, on this later trial, the reaction to T is "doesn't match," although it was "matches" on the earlier trial.
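
        The distinction between the two kinds of mapping can be checked mechanically. The following sketch runs the two example trials from the paragraph above and labels each probe letter as consistently or variably mapped. The function names are hypothetical, and this is only an illustration of the definition, not Schneider and Shiffrin's procedure.

```python
def classify(probes, memory_set):
    """Label each probe as a target (in the memory set) or a distractor."""
    return {p: ("target" if p in memory_set else "distractor") for p in probes}


def mapping_type(trials):
    """Report whether each probe item kept the same role on every trial."""
    roles = {}
    for probes, memory_set in trials:
        for item, role in classify(probes, memory_set).items():
            roles.setdefault(item, set()).add(role)
    return {item: ("consistent" if len(seen) == 1 else "varied")
            for item, seen in roles.items()}


# The two example trials from the text: (probes, memory set).
trials = [
    (["T", "S", "C", "K"], {"B", "X", "T", "F"}),
    (["H", "T", "C", "P"], {"C", "L", "G", "F"}),
]

result = mapping_type(trials)
print(result["T"], result["C"], result["S"])
```

        T and C each play both roles across the two trials, so they come out as varied; a letter like S, which is only ever a distractor here, comes out as consistent.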

        Schneider and Shiffrin found that people in the varied mapping condition did indeed engage in serial self-terminating searches. There was an effect of task load, and the fact that the searches were self-terminating is consistent with control processes being stoppable. But the interesting result is that people in the consistent mapping condition showed no effect of task load! Their reaction times were uniformly faster than those of the varied mapping group, and it didn't matter whether they had to do one comparison or sixteen. In further experiments, Schneider and Shiffrin verified that asking people to do two control processes together hurt the processing compared to asking people to do an automatic and a control process together, consistent with their claim that automatic processes don't compete for attention.

        Let's look at two more studies relevant to Schneider and Shiffrin. These were conducted by Nissen and Bullemer. In their first experiment, they ran two groups of subjects. Their task was to press a key every time a light came on. There were four different lights, and four keys. In the consistent mapping task, the same sequence of 10 positions repeated over and over again. In the varied mapping task, the lights appeared in a random sequence. Both groups got plenty of practice at this task, as each had 800 trials. What Nissen and Bullemer found was that reaction times speeded up a little bit for the varied mapping group (from about 380 msec to 340 msec). For the consistent mapping group, however, reaction times speeded up considerably (to under 180 msec). So, to repeat, practice by itself isn't enough; consistent practice is what really allows a speed-up.

        What about Schneider and Shiffrin's claim that learning requires attention? To test this claim, a second experiment by Nissen and Bullemer used the same procedure as the first, except that subjects were dual-tasked. For their second task, people heard low or high tones presented during the position task, and were asked every once in a while to state how many of each type of tone they had heard since last being asked. Although this might not strike you as a difficult task, in combination with the original task it was apparently difficult enough to prevent the development of automaticity. Nissen and Bullemer found that people in the consistent mapping condition performed much like those in the varied mapping condition. Thus, dividing attention can affect learning, including the learning of an automatic process.

        Finally, Schneider tells a funny (or tragic, depending on your point of view) story about one of his subjects. She had spent huge numbers of hours automating the ability to search for the letter K in page after page of text. When she finally automated, her report of the experience was that the Ks popped out at her; she didn't have to look for them. And indeed, she was equally fast with Ks at the end of a page as with Ks at the beginning, in contrast to her initial performance (where, like the rest of us, she first found the Ks at the top of a page). Later that day, she went home and tried to study for a midterm. But, for hours afterwards, whenever she opened the book, she was unable to read, because all the Ks kept popping out at her. This illustrates the difficulty of modifying or stopping an automatic process. If you want to give yourself a feel for what it's like, play Tetris for an hour and a half or two, and then open a book. If you are like me, you will now find your attention focusing on the spaces between words, and processing those spaces as the forms for the blocks that come tumbling down in Tetris.

        In short, practice improves performance a bit, but by itself it doesn't cause automaticity. Also, attention is required for automaticity, so attention seems to be needed for learning. Finally, to tie this back to the animal literature we've studied, I suspect that much classical and operant conditioning involves automating responses to certain stimuli.

        As you can see, attention and learning are intimately related for these theorists: What you pay attention to may make it into long-term memory, but information you haven't attended will very likely not make it, for whatever reason. There is, however, one report (widely cited) that seems to contradict this conclusion. It involves a study by McKay.

        McKay gave his subjects a focused attention task involving shadowing. They heard sentences in the attended ear that were sometimes ambiguous (for example, "They threw stones at the bank."). At the same time that an ambiguous word ("bank") was presented, an appropriate or inappropriate disambiguating word ("money" or "water") also was spoken in the unattended ear. Later, McKay presented his subjects with pairs of sentences, and asked them to indicate which of the pair they had shadowed. A sample pair of sentences for the example I gave you might be as follows:

        (a) They threw stones at the credit union.
        (b) They skipped rocks at the river.

        Basically, McKay found some evidence that the word presented in the unattended ear influenced your interpretation of the sentence you shadowed. Note that while this seems similar to the Johnston and Wilson study, it is actually quite different: People here can choose an early filter, so they should act like Johnston and Wilson's focused attention group (the one that showed no priming), rather than the divided attention group. So, apparently, it is not always the case that learning requires attention.

        Or is it? Unfortunately (something most texts don't bother pointing out), there was a serious flaw in this procedure, as pointed out by Newstead and Dennis. Specifically, McKay gave his subjects only isolated words. So, there would be silence in the unattended ear, and then every once in a while, a word would slip in. But, one of the things we know about attention is that sudden changes cause a shift of attention. Newstead and Dennis accordingly claimed that it is quite likely that people momentarily shifted attention when the word on the unattended channel came in. If that happened, it really wasn't unattended.

        To test their idea, they replicated McKay's experiment, but added a new condition of their own: Some subjects were in an isolated-word condition, just as in McKay (where we would predict attentional shifts), and others were in a condition where a continual stream of words was presented to the unattended ear (so that there would be no attentional shifts). In both cases, a disambiguating word was presented in the other ear when the ambiguous word appeared in the attended ear. They found the same results as McKay when isolated words were presented in the supposedly unattended ear. But when there was a continual stream of words, then a very different result happened. Namely, there was no evidence that people were biased one way or the other by the disambiguating word. Thus, despite some initial evidence questioning the conclusion that attention is needed for learning and memory, a cleaner methodology turns out to support that conclusion.

        Thus, the Schneider and Shiffrin approach still seems viable.

Hasher & Zacks

        An approach that seems very different at first blush but is actually compatible with Schneider and Shiffrin is the work on automaticity done by Hasher and Zacks. Rather than concentrate on learned automatic processes the way Schneider and Shiffrin do, Hasher and Zacks focus on automatic processes that appear to be unlearned, and that seem to operate in a different fashion. In particular, Schneider and Shiffrin claim that automatic processes by themselves cannot change the contents of long-term memory because they don't use attentional resources: All learning for Schneider and Shiffrin has to go through control processes and working memory. But that is not true for Hasher and Zacks's automatic processes, as you will see in a moment.

        I will not spend a great deal of time on their model. But I do want to bring up several points they make. One is similar to what you've already heard from Schneider and Shiffrin: Automatic processes do not rely on attentional resources, so several automatic processes can be performed simultaneously. Based on that, a second criterion that is relevant is that automatic processes are very little affected by intentional instructions (i.e., being told to learn, or trying to alter the strategy by which you learn). Incidental and intentional learning ought to be equal for an automatic process, whereas intentional learning should be better for a control process (incidental learning involves creating new memories in the absence of a deliberate intention to do so). And finally, automatic processes will show no developmental trend: People of all ages will display the same finding.

        One of the candidates they have identified as an automatic process involves sensitivity to situational frequency. A number of studies show that people who get a surprise test for how often they heard something do about as well as people who know that there will be a test (the incidental versus intentional learning condition). Whether situational frequency is really completely uninfluenced by intentional learning instructions is a matter of some debate. But to give you a feel for these experiments, let me describe one of my studies.

        In my experiment, I tried to manipulate whether something would be in the focus of attention or not, as a way of getting at the intentional-incidental distinction. So, I told people coming in to an experiment that they would be participating in several different short studies during the course of the hour. One of these would be a memory study in which they would have to remember a list of words (the intentional manipulation). I then read a long list of words to them at the rate of about 2 words every 5 seconds, after which they tried to write down as many of the words as they could remember. Some of these words were presented up to 4 times. I then told them a little bit about the purpose of the experiment before going on to the next study. Specifically, I told them that the reason words were being repeated was to look at whether giving people additional chances to rehearse a word would result in better memory. Now this wasn't the real reason for the experiment. Its purpose was to let me present some additional new words as examples during this mini-lecture ("So, when you hear link, for example, you start repeating link to yourself over and over."). Presenting these new words and repeating several of them was the incidental condition, in which people would not be expected to be strongly focused on the words.

        We then went on to the next study, which involved a ratings task whose sole purpose was to distract people for 15 minutes. Finally, I handed out a surprise frequency estimation sheet in which I asked how many times people had heard a given word in the experiment. Some people were asked just to tell me about the frequencies of words in the list they had been asked to study, and others were asked to tell me just about the frequencies of the words they had heard in the explanation of what that experiment had been about. Basically, the results showed that people were sensitive to both list and explanation frequency (intentional and incidental learning), and that they couldn't tell them apart. But as this example also shows, the incidentally presented words got into long-term memory without the need for much in the way of attentional resources.

Logan's Race Model

        A final theory of automaticity we will consider very briefly is the race model of Gordon Logan. It forms a major contrast to the two models above. On Logan's account, people engage in two different processes that race with one another when they are given some task. One of these processes involves what Logan terms computation: figuring out what to do, and then doing it. The other involves what Logan calls retrieval: remembering what you did in the past for this task, and then doing it. Logan's point is that consistent mapping will build up more and more memories relevant to the task, so that the retrieval process becomes increasingly likely to beat the computation process. Thus, for Logan, automaticity is some point along a continuum of various mixtures of computation and retrieval. At one end of this continuum, people have to complete the process primarily by computation. At the other, they get to complete the process almost always by retrieval. And in the spots in between, as more and more memories build up, the probability that retrieval wins the race increases.
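        To get a feel for the race, here is a minimal simulation sketch. It is not Logan's own formulation: the exponential finishing times, the two mean values, and the function names are all assumptions I have made purely for illustration. The only idea taken from the account above is that every stored memory runs in the race, so more memories give retrieval more chances to finish first.

```python
import random

def race_trial(n_traces, rng, computation_mean=1.0, retrieval_mean=1.5):
    """Simulate one trial and return (response_time, winner)."""
    # Assumption: finishing times are exponential with the given means.
    computation_time = rng.expovariate(1.0 / computation_mean)
    # Every stored memory trace races independently; the fastest one counts.
    retrieval_times = [rng.expovariate(1.0 / retrieval_mean)
                       for _ in range(n_traces)]
    fastest_retrieval = min(retrieval_times, default=float("inf"))
    if fastest_retrieval < computation_time:
        return fastest_retrieval, "retrieval"
    return computation_time, "computation"

def retrieval_win_rate(n_traces, n_trials=5000, seed=0):
    """Proportion of simulated races won by retrieval."""
    rng = random.Random(seed)
    wins = sum(race_trial(n_traces, rng)[1] == "retrieval"
               for _ in range(n_trials))
    return wins / n_trials

rates = {n: retrieval_win_rate(n) for n in (0, 1, 5, 25)}
for n, rate in rates.items():
    print(f"{n:2d} stored traces -> retrieval wins {rate:.0%} of races")
```

        With no stored memories, computation has to win every race; as the number of traces grows, the share of races won by retrieval climbs toward (but never quite reaches) 100%, which is the continuum described above.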

        Note how this accounts for Schneider and Shiffrin's findings. With massive memories, there should be no load effect, because retrieval is basically a one-step process. And additionally, this can only work with consistent mapping, because varied mapping will result in inconsistent memories that give you competing answers in retrieval, thus allowing computation to win the race. Computation, of course, will be subject to load effects.

        But there is also a significant difference with Schneider and Shiffrin's account that deserves mention. Logan claims that automaticity should start developing after only a few trials, instead of the thousands and thousands of trials that Schneider and Shiffrin require. Like Schneider and Shiffrin, however, Logan claims that attention is important: In order to form a memory of something, you have to attend it.

        It is very tempting to speculate that the speedup of latencies over learning trials seen in high-contingency classical and instrumental conditioning reflects this mixture of computation and retrieval. Regardless of which theory of automaticity proves correct in the long run, however, it seems undeniable that different results are obtained with consistent mapping. And it would be highly surprising were that finding not of relevance to the work on contingency in the animal literature. Indeed, as you saw earlier in Wagner's explanation of the CS pre-exposure effect, the consistent mapping of a CS with background contextual cues seems to result in the CS no longer receiving much working memory attention: It becomes increasingly likely to be processed automatically, and thus increasingly likely not to be given the attentional resources necessary for joint rehearsals with the UCS (required, on Wagner's account, for forging a CS-UCS link).

        And if automaticity theories are correct, then automating should free up more capacity, so that other things receive attention that they didn't get earlier. In this respect, an experiment by D'Amato on monkeys is suggestive (and fits a point made earlier). He compared animals given various levels of overtraining on a discrimination involving complex, multidimensional stimuli. As training progressed, these animals increasingly learned about the second dimension. That is quite consistent with a model claiming that capacity is being freed up to be deployed on other things.

        In short, learning occurs when sufficient attentional resources are available for control processing. At some point, if a process becomes a learned automatic process, then it no longer directly contributes to new learning. Some such mechanism may account for why learning appears to reach an asymptote in classical and instrumental conditioning.

III. Categorization in Humans

        Certain features of some of the generalization and discrimination models resemble what we do as humans: how we learn, and how we learn to sort things into categories. In particular, many of our categories exhibit what is called graded structure, a type of generalization in which objects are judged typical of a category to the extent that they resemble the average category member. Also, some experiments have been conducted specifically on a hypothesis-testing theory of human learning. These are some of the issues that will absorb us in this section.

A. Single-Feature Categories & Hypothesis-Testing

        The simplest human categories are a bit like some of the generalization and discrimination problems we have asked our animals to take on in that they consist generally of a single feature whose discovery we track. For example, the feature may be red, in which case the category is red things, or it may be animate, in which case the category is living things. Two examples of a single-feature category are vertebrates and invertebrates. The relevant feature here is a spine or backbone. Vertebrates have one; invertebrates don't. When the category is defined in terms of presence of a feature, we say that it involves an affirmative rule. On the other hand, when it is defined in terms of the absence of a specified feature, then we say that it involves a rule of denial.

        There are a number of people who study various aspects of how quickly these types of categories are discovered. They look at how many dimensions are present, for example, or how salient the relevant feature is, etc. We will concentrate on one individual here, however, as he has developed a hypothesis-testing theory (also called H Theory) that is reminiscent of the type of hypothesis testing some people claim animals engage in. His name is Levine.

        He sets up discrimination problems that are like the simultaneous discriminations animals sometimes get. So, to use an example mentioned a bit earlier, let us assume that we get the following two stimuli:

            X (large, black)     T (small, orange)

        As with the animal discrimination experiments, we ask you to pick one. Only one of these is associated with reinforcement. But instead of giving you pigeon food, we use as our reinforcer the word "yes" and as our lack of reinforcement the word "no" (as in, "No, that's the wrong one.") The task is to discover which of the many features here yields the reinforcer. What is the category? Orange things? (Grey, if you're looking at this in black and white.) Small things? Etc.

        Generally, Levine starts off with a very simple model. It asserts that people choose a hypothesis, and then test it out. The way in which they evaluate the hypothesis is through a win-stay lose-shift rule. (Sound familiar?) That is, if the hypothesis leads you to select an object that leads to a "yes," then you will keep the hypothesis as still viable. But when it leads you to select an incorrect answer, you will need to find a new hypothesis.
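        The win-stay lose-shift rule is simple enough to write down in a few lines. What follows is only a sketch under my own assumptions: the function name, the list of eight hypotheses (taken from the two-stimulus example), and the fixed random seed are illustrative, and the pool to sample from after a "no" is handed in by the caller rather than decided by the rule itself.

```python
import random

HYPOTHESES = ["large", "small", "left", "right", "T", "X", "black", "orange"]

def win_stay_lose_shift(current, feedback, candidates, rng):
    """Return the hypothesis held after one feedback trial.

    candidates is the pool to sample from after a "no"; how that pool
    is built is left to the caller.
    """
    if feedback == "yes":
        return current                 # win-stay: keep a confirmed hypothesis
    return rng.choice(candidates)      # lose-shift: sample a replacement

rng = random.Random(1)                 # fixed seed so the run is repeatable
current = "orange"
others = [h for h in HYPOTHESES if h != current]
after_yes = win_stay_lose_shift(current, "yes", others, rng)
after_no = win_stay_lose_shift(current, "no", others, rng)
print("after 'yes':", after_yes, "| after 'no':", after_no)
```

        In this particular run the pool passed in leaves out the hypothesis that was just rejected; passing in the full list instead would let the same hypothesis come back.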

        Now we also need to talk about how much memory for previously disconfirmed hypotheses we want to allow. This involves what is called the sampling rule for choosing a new hypothesis. On the simplest model, we will assume that people effectively don't bother to remember previously disconfirmed hypotheses. This is exceedingly unlikely, but the point is that we can set up the model, and then collect actual data that will either confirm or disconfirm it. So, with the simplest model, when you need to pick a new hypothesis because you have been told "no," then you will randomly choose a hypothesis from the original set. But since this is the original set, the hypothesis you have just given up is part of it, also. And that, in turn, means that there is a measurable (though normally small) probability that you will choose the same hypothesis all over again. This type of model is called sampling with replacement, since you replace the disconfirmed hypothesis into the pool of potential hypotheses before sampling another one.

        We don't have to look at just this model, of course. We can also build in some memory by talking about different sampling rules that will all include sampling without replacement. Here, you are at least smart enough not to choose the same hypothesis immediately after you have disconfirmed it. So, you sample a new hypothesis, when you have to, without first replacing the current hypothesis you have just given up on. If the original pool consists of 8 hypotheses, for example, then sampling with replacement means you are always picking from a pool of eight hypotheses, whereas sampling without replacement means the pool is getting smaller.

        In the example above, the pool actually does start out as eight hypotheses: Large things; small things; things on the left; things on the right; Ts; Xs; black things; and orange things. If people sample with replacement, then there is a 1 in 8 chance (12.5%) of picking the same hypothesis all over again. But if they sample without replacement, the probability should be 0. By observing those probabilities, we can see what people really do.

        Now, I had said that there were several possible models for sampling without replacement. You can generate a bunch of possibilities for yourself. For instance, maybe we can only remember the last three wrong hypotheses, so we keep those out of the pool. Such a model would predict that we can systematically narrow the pool from eight down to five, but that it will stay at five thereafter. So, after a while, if you have not yet solved the problem, a wrong hypothesis will be back in the pool. Let's ignore the more complicated of these models and concentrate on just some simple variants. One is that you simply avoid the current wrong hypothesis. But on this model, a wrong hypothesis two hypotheses ago would be back in the pool. So, this model would claim the pool goes from eight to seven, and stays there. The observation that corresponds to this model is that there ought to be a one in seven chance (14.3%) of choosing a hypothesis that was proven wrong earlier, several hypotheses ago. Alternatively, maybe you have perfect memory for everything that was proven wrong. On this model, the pool goes from eight down to seven, then down to six, five, four, etc.
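        The predictions of these sampling rules can be checked with a little arithmetic. The sketch below is my own illustration, not part of Levine's procedure: the function names, the memory parameter, and the particular sequence of rejected hypotheses are all assumed. It reports the pool size under each rule and the chance of re-picking a hypothesis that was rejected several hypotheses ago.

```python
from fractions import Fraction

HYPOTHESES = ["large", "small", "left", "right", "T", "X", "black", "orange"]

def pool_after(rejected, memory):
    """Hypotheses still available, given the rejected ones (oldest first).

    memory = 0     -> sampling with replacement (nothing remembered)
    memory = k     -> only the last k rejected hypotheses are excluded
    memory = None  -> perfect memory (every rejected hypothesis excluded)
    """
    if memory is None:
        excluded = set(rejected)
    elif memory == 0:
        excluded = set()
    else:
        excluded = set(rejected[-memory:])
    return [h for h in HYPOTHESES if h not in excluded]

def chance_of_picking(hypothesis, rejected, memory):
    """Probability that a random draw from the pool is this hypothesis."""
    pool = pool_after(rejected, memory)
    return Fraction(pool.count(hypothesis), len(pool))

# Hypothetical history: "orange" was rejected first, "small" most recently.
rejected = ["orange", "T", "left", "small"]
for label, memory in [("with replacement", 0), ("avoid current only", 1),
                      ("last three", 3), ("perfect memory", None)]:
    pool = pool_after(rejected, memory)
    chance = chance_of_picking("orange", rejected, memory)
    print(f"{label:20s} pool size {len(pool)}, P(re-pick 'orange') = {chance}")
```

        The pool sizes come out as eight, seven, five, and four, and the chance of returning to the long-rejected hypothesis is 1 in 8 with replacement, 1 in 7 when only the current wrong hypothesis is avoided, and zero with perfect memory.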

        So, how do we test these ideas out? Levine came up with a clever method called the blank trials procedure. Essentially, these are trials on which we provide no feedback (hence the word blank). Our assumption is that during blank trials people just stick with whatever hypothesis they had. The point is that if we are clever enough, we can figure out from people's performance what hypothesis that is.

        The blank trials procedure regarding the example I gave you earlier will require four trials to figure out someone's hypothesis. One group of four (but not the only group!) that will work is as follows:

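            1st Trial       T (small, orange)     X (large, black)
            2nd Trial       T (large, black)      X (small, orange)
            3rd Trial       X (large, orange)     T (small, black)
            4th Trial       X (small, black)      T (large, orange)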

        Basically, there are 16 possible patterns of responding on these four trials. But only eight of these correspond to hypotheses. Namely, any pattern in which there is an even number of lefts or rights for this particular set will correspond to a hypothesis. Thus, the pattern left, left, right, right (LLRR) is the hypothesis that you need to respond to the T. (And of course, RRLL is the hypothesis that it's Xs.) The pattern LRLR clearly corresponds to the hypothesis that it's orange things (and the reverse pattern, RLRL, the hypothesis that it's black things). The pattern LRRL corresponds to small things (and RLLR is large things), and finally, LLLL and RRRR correspond to the hypotheses things on the left and things on the right. But note that a pattern like LRRR doesn't pull up any obvious hypothesis, nor will the other seven patterns in which there is an odd number of left or right choices. You can verify this for yourself: Below are the 16 possible response patterns on the blank trials procedure (BTP). Only the first 8 of these should make any sense:

                            Possible Response Patterns On BTP

            Pattern   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16

            Trial 1   L   R   L   R   L   R   L   R   L   R   R   L   R   L   R   L
            Trial 2   L   R   L   R   R   L   R   L   R   L   L   R   R   L   R   L
            Trial 3   L   R   R   L   R   L   L   R   R   L   R   L   L   R   R   L
            Trial 4   L   R   R   L   L   R   R   L   R   L   R   L   R   L   L   R
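        If you would rather let a computer do the verifying, the decoding step can be sketched as follows. One loud assumption: the size and color of each letter on the four blank trials are my reconstruction, chosen so that the patterns named above (LLRR for T, LRLR for orange, LRRL for small) come out right. The names in the code are likewise just illustrative.

```python
from itertools import product

# Assumed layout of the four blank trials: (left item, right item).
# Each item is (letter, size, color); sizes and colors are a reconstruction.
BLANK_TRIALS = [
    (("T", "small", "orange"), ("X", "large", "black")),
    (("T", "large", "black"), ("X", "small", "orange")),
    (("X", "large", "orange"), ("T", "small", "black")),
    (("X", "small", "black"), ("T", "large", "orange")),
]
HYPOTHESES = ["left", "right", "T", "X", "small", "large", "orange", "black"]

def predicted_pattern(hypothesis):
    """Side chosen on each blank trial by someone holding this hypothesis."""
    pattern = ""
    for left_item, _right_item in BLANK_TRIALS:
        if hypothesis in ("left", "right"):
            pattern += "L" if hypothesis == "left" else "R"
        else:
            # The two items differ on every dimension, so a feature that
            # is not on the left item must be on the right item.
            pattern += "L" if hypothesis in left_item else "R"
    return pattern

PATTERN_TO_HYPOTHESIS = {predicted_pattern(h): h for h in HYPOTHESES}

def decode(pattern):
    """Return the hypothesis matching a response pattern, or None."""
    return PATTERN_TO_HYPOTHESIS.get(pattern)

all_patterns = ["".join(p) for p in product("LR", repeat=4)]
interpretable = [p for p in all_patterns if decode(p) is not None]
print(len(interpretable), "of", len(all_patterns), "patterns fit a hypothesis")
print("LLRR ->", decode("LLRR"), "| LRRR ->", decode("LRRR"))
```

        Half of the 16 patterns decode to a hypothesis and the other half decode to nothing, which is where the chance baseline of 8 out of 16 comes from.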

        So, how do we go about testing some of these ideas about the sampling rule and the use of hypotheses? The first test is easy. We use the blank trials procedure to see whether people use a hypothesis. Because there are 16 possible response patterns, only eight of which are hypotheses, if people respond at random, then they ought to have a 50% probability of showing a hypothesis-consistent pattern (8 out of 16). But when we use blank trials, we actually find that 98% of the response patterns fit hypotheses. So, people apparently do use hypotheses rather than responding randomly.

        Next, we can ask about the claim that they test hypotheses. That is, do people use the win-stay lose-shift strategy? Testing this is also simple. We use blank trials to see what Hypothesis 1 is, then we deliberately give one group negative feedback and another group positive feedback, and then we use another set of 4 blank trials to see what Hypothesis 2 is. Schematically:

            Group               Procedure

            + Feedback          BTP: H1
                                Feedback Trial ("Yes")
                                BTP: H2

            - Feedback          BTP: H1
                                Feedback Trial ("No")
                                BTP: H2

        In these two groups, the question we ask is whether H1 and H2 are identical. If people engage in win-stay, then they should be. But if they engage in lose-shift, then there ought to be a high probability that H1 and H2 differ. In fact, there is about a 95% probability that people stick with the same hypothesis following positive feedback, consistent with win-stay. And following negative feedback, there is a 98% probability that the second hypothesis (H2) differs from the first (H1), supporting lose-shift.

        If you were thinking carefully about that last result, you should also have realized that it answers another one of our questions, one regarding the sampling rule. Put broadly, do people sample with or without replacement? If they sample with replacement, there should be a 12.5% probability of choosing the same hypothesis all over again following negative feedback (see above). But as we have just seen, the probability is actually considerably smaller (2%), so that we can now state that people generally avoid a disconfirmed hypothesis, at least on the immediately following choice.

        Finally, in terms of asking how they sample without replacement (i.e., how much memory to build in to our model), the research has shown a developmental sequence in humans. Fairly young children use a strategy of hypothesis checking: They choose a hypothesis, and when it is proven wrong, move on to another hypothesis at random. In terms of attention and memory, that places a burden on the child, as there will be a need to try to remember all the wrong hypotheses. A child who has been unfortunate in choosing hypotheses, for example, may have to remember 7 wrong hypotheses in order to finally get the correct feature. (And of course, we can provide whatever feedback we wish to ensure that subjects don't solve the problem until the bitter end, since we are in charge of the feedback they get.) That is attention to and memory of a lot of things for a young kid!

        Older kids have developed a more efficient way of playing this game. They use a strategy of dimension checking in which they check out both hypotheses associated with a particular dimension. So, for example, a child may start off with the dimension of color, and systematically check out black, and then orange. The reason this strategy is more efficient is that there are only four dimensions in this problem (and in general, there will be fewer dimensions than hypotheses in any problem), so that you need keep track just of the dimensions you've already tried, and not the randomly organized hypotheses.

        Finally, yet older kids and adults try, if the circumstances permit (they won't, for example, when we dual-task them!), to use a strategy of global focusing. On this strategy, any feedback trial may be used to narrow down the remaining hypotheses by half! Thus, to go back to the first example we used, suppose that on the first trial our subject chooses the item on the left below:

            X (large, black)     T (small, orange)

        If the feedback is positive ("yes"), then the person using global focusing knows that the only four possibilities are black, X, left, and large, since these are the only features of that stimulus. But if the feedback is negative ("no"), then our global focuser knows that the correct answer cannot be any of these four possibilities, and that it must therefore be either orange, T, right, or small. With an initial pool of eight hypotheses, the first feedback trial ought to narrow the pool to four, the second feedback trial ought to narrow it to two, and the third feedback trial ought to narrow it down to the remaining answer. So, at most, three feedback trials will be needed to solve the problem. But of course, the memory burden here includes remembering what is still possible, and what isn't. People aren't perfect global focusers, but the results suggest that they do try to evaluate several different dimensions simultaneously. And as we would expect from a claim that the pool is narrowing down, people get faster and faster at choosing a hypothesis. And that is not so far different from some of the evidence like Thomas et al. suggesting that animals, even though they engage in something like hypothesis testing, can learn about several different dimensions. In this sense, the notion that only one thing is attended to and that learning is all-or-none with respect to this one thing may be an oversimplification.
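        The narrowing that a perfect global focuser performs can be sketched with set operations. This is an idealization under my own assumptions: the function name is made up, and only the first trial below comes from the example in the text. The second and third trials are hypothetical stimuli I chose so that each one splits the remaining pool evenly.

```python
def narrow(pool, chosen_features, feedback):
    """Update the set of viable hypotheses after one feedback trial."""
    if feedback == "yes":
        # The answer must be one of the chosen item's features.
        return pool & chosen_features
    # The answer cannot be any of the chosen item's features.
    return pool - chosen_features

pool = {"large", "small", "left", "right", "T", "X", "black", "orange"}

# (features of the chosen item, feedback); trials 2 and 3 are hypothetical.
trials = [
    ({"X", "large", "black", "left"}, "no"),
    ({"T", "small", "black", "left"}, "no"),
    ({"X", "large", "orange", "left"}, "yes"),
]

sizes = []
for chosen_features, feedback in trials:
    pool = narrow(pool, chosen_features, feedback)
    sizes.append(len(pool))
    print(feedback, "->", sorted(pool))
```

        Note that the halving depends on the stimuli: a trial cuts the pool in half only when the chosen item carries exactly half of the hypotheses that are still viable. With a less favorable pair of stimuli, the pool shrinks more slowly.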

        Levine's work hasn't gone completely unchallenged. There are some studies that suggest that people have trouble verbalizing which hypotheses are still viable, and that asking them to explicitly discuss hypotheses can actually slow learning. On the other hand, it's not clear that hypothesis testing need be something done at an explicitly conscious level, although most theories of hypothesis testing are couched in terms that suggest conscious decision-making. Regardless of the outcome of this debate, we can state that the evidence seems to strongly favor the claim that people and animals act as if they are evaluating hypotheses.

        One more point that is reminiscent of the animal literature: People do better after positive feedback than after negative feedback, even though global focusing with either type of feedback allows the same narrowing of the pool, and should thus be equally efficient. But as you will recall, animals also seemed to learn more about the positive cases.

B. Well-Defined Categories

        At least implicitly, hypothesis theory was concerned with the acquisition of what is called a well-defined category. Such categories may be characterized by a simple rule that specifies the relevant features. By virtue of there being a rule of this sort, category membership, in turn, is very precise: Either something is, or it is not, a member of the category. Anything that meets the rule is a member, and anything that fails to meet the rule is not. Within the category, all members ought to have the same status, since they all meet the rule. But outside the category, we can clearly specify that some things are more like the category than others, depending on how much of the rule they match. In short, categories have sharp, crisp boundaries, and all members are equally good instances of the category.

        One of the first definitions of a well-defined category was provided by Aristotle, whom we've met before in his role as the founder of Empiricism. According to this definition, categories must have defining features that are individually necessary and jointly sufficient. That is, the rule consists of a list of features, each of which must be present. When all of those features are present, the object in question fits the category; it doesn't matter what other features may or may not be there. Such categories are sometimes called classical categories. They are also called conjunctive categories because you need the conjunction of all of the features. Finally, the rule that lists these features is called a conjunctive rule, although informally, it is sometimes known as an AND-rule.

        A common example often given for a conjunctive category is bachelor. As a first pass through specifying the rule for what is a bachelor, we may specify the features unmarried and male. By carrying a rule around with us that specifies the necessary features, we cut down on the need for storing separate instances. You do not have to recall every car you've ever driven in order to drive a friend's car: Cars generally all have the same defining features, so that we can drive one on the basis of having learned how to drive others. Moreover, we can recognize new objects we've never seen before on the basis of identifying features that tell us they're cars. Thus, categories allow us to generalize our responses and learning to novel instances.

        But conjunctive categories aren't the only types of well-defined categories. If we restrict ourselves to just rules based on two features, and if we do not worry about whether to specify presence versus absence of a feature, then there are actually four different rules that logicians and philosophers have identified as providing types of well-defined categories. In addition to conjunctive or classical categories, these are disjunctive categories; conditional categories; and biconditional categories.

        A disjunctive category is a category based on a disjunctive rule (also known as an OR-rule). In such categories, there is again a list of features. But only one of these is needed. Thus, we have relaxed the Aristotelian criterion that each feature be individually necessary: Any one of the features is sufficient by itself. That is why it is called an OR-rule. For example (although it involves more complex categories than the two-feature categories we are limiting the discussion to), a strike in baseball can be an event in which the batter swings at the ball and misses, or an event in which the batter fails to swing at a ball that passes within a certain zone over home plate.

        In a conditional category based on a conditional rule (an IF-rule), on the other hand, there is a certain feature that is used as a test or a gate, if you will, for membership. If something has that feature, then there will be another feature it must also have in order to be a member. In this case, both features will be present. So far, that sounds like a conjunctive rule. But where it differs from a conjunctive rule is that if the test feature is not present, then we assume that the object is a member of the category. That may sound a bit bizarre, so consider the test as being a way of deciding what gets excluded from the category: Everything is included except for those objects that have the test feature, but not the second feature. A real-life example that will illustrate this is the category customer, as defined by a certain restaurant: Men must wear jackets. The dimensions that we are dealing with here are gender and apparel, and the values for those respective dimensions are male and female (for gender), and has-jacket and no-jacket (for apparel). So, if you're male, then you have to have a jacket to get in. That means males with jackets get in, and males without do not. But what about females? Since the rule didn't explicitly specify a test for females, they automatically get in, and it doesn't matter whether they wear jackets or not.

        The biconditional category is based on the biconditional rule (an IF-AND-ONLY-IF-rule, typically abbreviated IFF-rule). This is the equivalent of having two conditional rules. If our features are A and B, then the rule for categorization is A IFF B. So, an example of this rule, applying again to the category customer, might be male IFF tie. Informally, males must wear ties, and only males can wear ties. In this example, both features act as gates for exclusion: If it has feature A, then it must have feature B; if it has feature B, then it must have feature A. So, customers of this fine establishment would include males with ties and females without ties. Unlike the conditional rule, females with ties would be excluded from the category, using this rule.

        These four rules divide the potential members and non-members of a category in very different ways. To see that, let us take a universe of four objects that is defined by two dimensions: color and size. For color, we will choose black and orange. For size, we will choose large and small. Thus (and we will use Xs to represent these four mystery objects: X marks the spot...), the four objects in our universe are as follows:

                large black X     small black X          large orange X     small orange X

        Let us now set up four rules corresponding to the four categories to see how they divide this limited universe up into members of the category, and non-members. We'll take the features large and orange as the features mentioned in the rules. In this case, the rules will yield the following:

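        Rule                                    Members                                      Non-Members

        conjunction (large and orange)          large orange                                 large black, small black, small orange

        disjunction (large or orange)           large orange, large black, small orange      small black

        conditional (if large then orange)      large orange, small black, small orange      large black

        biconditional (large IFF orange)        large orange, small black                    large black, small orange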

        As you can see, the category membership is quite different in these different situations.
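        If it helps to see the table generated mechanically, here is a minimal Python sketch that treats each rule as a truth function over the two features large and orange. The names UNIVERSE, RULES, and members are illustrative choices of mine, not anything taken from Bourne's work, and the sketch assumes the four-object universe described above.

```python
from itertools import product

# The four objects of the toy universe: every combination of size and color.
UNIVERSE = list(product(["large", "small"], ["black", "orange"]))

# Each rule maps two truth values (is it large? is it orange?) to membership.
RULES = {
    "conjunction (large and orange)": lambda large, orange: large and orange,
    "disjunction (large or orange)": lambda large, orange: large or orange,
    "conditional (if large then orange)": lambda large, orange: (not large) or orange,
    "biconditional (large IFF orange)": lambda large, orange: large == orange,
}

def members(rule):
    """Return the objects in UNIVERSE that satisfy the given rule."""
    return [(size, color) for size, color in UNIVERSE
            if rule(size == "large", color == "orange")]

for name, rule in RULES.items():
    inside = members(rule)
    outside = [obj for obj in UNIVERSE if obj not in inside]
    print(f"{name}: members={inside}, non-members={outside}")
```

        Note that the conditional rule is written as (not large) or orange: the only object it excludes is the one that passes the test feature (large) but lacks the second feature (orange).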

        Bourne studied these four rules, and how easily they were learned. Interestingly enough, he found that for naive people who were unfamiliar with formal logic, the rules were acquired in the order listed above: conjunctive rules were learned most rapidly, and the biconditional rules took the longest. But with just a bit of practice, people became equally adept at learning problems with each of these rules.

        Bruner, Goodnow, and Austin were also famous for their work on well-defined categories. They gave people experiments in which artificial concepts were presented that involved dimensions such as shape, border, number, etc. They asked a question similar to the one Levine asked: How did people acquire these categories? And like Levine, they found that people used hypotheses and strategies to test out ideas about what the categories might be. This work was one of the important lines of research in the late 50s that questioned the then-prevalent behaviorist approach. Up until that time, categories were regarded as nothing more than generalization of responses on the basis of stimulus similarity. So, if a combination of stimulus elements was associated with a response such as This is a cat, then other things would be judged as cats to the extent that they were physically similar. Bruner, Goodnow, and Austin showed that it wasn't a matter of stimulus similarity, but of deliberate testing of hypotheses.

        While we are on the subject, do note that one type of category listed above, the biconditional category, is particularly difficult for an associational approach to explain. This category is sometimes also called the exclusive-or (or XOR) category. It is the equivalent of stating that we need one or the other feature to be a member, but not both. Thus, in the example above, we could also describe the biconditional category as orange XOR small (i.e., orange or small, but not both). Why is this a problem? The two members of the category that are given the same response (this is a member) have no physical feature in common with one another! And the two non-members are at least more similar to one of the members than the two members are to one another. So, to put it mildly, if XOR categories are real human categories (and there is much evidence to suggest that they are), then a theory that specifies categorization specifically on the basis of physical generalization misses something important.
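        The two claims in this paragraph can be checked directly. The sketch below is only an illustration under the assumptions of our four-object universe (the function names are mine): it confirms that large IFF orange picks out the same objects as orange XOR small, and counts how many feature values any two objects share.

```python
from itertools import product

# Objects as (size, color) pairs, as in the four-object universe above.
universe = list(product(["large", "small"], ["black", "orange"]))

def biconditional(size, color):
    # large IFF orange
    return (size == "large") == (color == "orange")

def orange_xor_small(size, color):
    # orange or small, but not both
    return (color == "orange") != (size == "small")

def shared_features(a, b):
    """Count the dimensions on which two objects have the same value."""
    return sum(x == y for x, y in zip(a, b))

category = [obj for obj in universe if biconditional(*obj)]
outsiders = [obj for obj in universe if not biconditional(*obj)]

# The two descriptions pick out exactly the same objects.
same_category = all(biconditional(*o) == orange_xor_small(*o) for o in universe)

print("same category under both descriptions:", same_category)
print("features shared by the two members:", shared_features(*category))
for outsider in outsiders:
    print(outsider, "shares with each member:",
          [shared_features(outsider, m) for m in category])
```

        The two members share no feature value at all, while each non-member shares one value with each member, which is exactly what makes this category awkward for an account based purely on physical similarity.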

        In any case, a number of findings and theories started converging on a very different idea of categorization. These suggested that categories weren't well-defined at all. We turn to these next.

C. Fuzzy Categories

        Bruner, Goodnow, and Austin had done some preliminary work on probabilistic categories, categories in which a feature was sometimes, but not always, present. (This should remind you a bit of the issue of contingency: How strong must a contingency be in order for classical or instrumental learning to take place?) They found (as have many other researchers since) that people were able to learn probabilistic categories. This work, in conjunction with a general shift in researchers' interests, resulted in a very different approach from the one we have discussed above. The shift involved a number of people becoming interested in natural language categories, as opposed to the artificial categories that were normally studied by people like Levine or Bruner et al. Natural language categories are the categories that are expressed in our languages: They are the categories that we deal with in our everyday lives.

        The philosopher Wittgenstein had criticized the classical definition of categories. In his highly influential critique, he pointed out that many natural language categories simply did not seem to have a shared set of defining features. One of the famous examples he used was the category game. He defied anyone to locate a feature common to games, but not to non-games. Can it be competition between at least two people? Hardly, as there are single-player games such as solitaire. Can it be competition per se, even if, as in solitaire, it may be competition against oneself, or against the draw of the cards? Hardly, as kids play such games as make-believe or dressing-up in their parents' clothes, without this element of competition. Can it then be simply entertainment? Some games are deadly serious (chess matches, for example, or war games), and in any case, there are certainly things that qualify as entertainment, but that we would not call games (such as going to the movies).

        Moreover, Wittgenstein pointed out that we could analyze a family in a similar manner. If you imagine a large family reunion and look at all the people gathered there, you will be able to find people who have no feature in common; they completely fail to resemble one another in precisely the manner that two games such as dressing-up and football may fail to resemble one another. But they are still part of the same family. Wittgenstein adopted this metaphor of a family for human categories, as well. Essentially, he claimed that what made very different things games was the fact that they were essentially all members of the same family.

        What does it mean to be a member of the same family? If we go back to our family reunion and examine people carefully, we should see that people do resemble one another to varying degrees. Some people may have most of the family features (the eyes; the nose; the hair color; the chin; etc.), while others will have some of the family features. Thus, while our hypothetical two people discussed above failed to resemble one another, they were nevertheless part of the family because they did have some of the family features. In other words, the reason they failed to resemble one another was that they had different family features.

        So, Wittgenstein claimed that a natural language category was a family resemblance structure. Things were in the category because they resembled the family more than they resembled other families. And it is this notion of family resemblance that will prove important in much of the later research in psychology. The essential idea is that we can determine what the typical family member is. That is, what are the common family features? Once we have identified those, we need only compare other individuals to this typical member, to see whether they have features that overlap. As there are more and more overlapping features, we have a better and better representative of the family, someone who embodies more of the features. Thus, contrary to the well-defined category approach, this view of categories claims that category membership may be expressed as a matter of degree. There will be typical members of a category or family (members that have most of the features), and there will be less typical or even atypical members of the category or family (members that have only a few features). Categories ought therefore to exhibit graded structure: degrees of membership or 'goodness' as examples of the category. Graded structure is often referred to as the degree of typicality. On this account, dressing-up is not a typical game.

        But there is another implication here, as well. Suppose there are two family reunions going on in the park, and you see an individual wandering around. Which reunion does that individual belong to? Presumably, we will compare that individual to each family to assess her resemblance or typicality. But suppose we find a moderately low level of typicality or resemblance to one family. Is the individual a member, or just someone who happens to be in the park on other business? As this example shows, a rule that specifies categorization on the basis of degree of similarity will have some unclear cases. There will be situations in which we are not sure whether something is a member of the category or not. In this sense, natural language categories have what are called fuzzy boundaries, as opposed to the crisp boundaries of well-defined categories. There are grey areas in which there is a lot of uncertainty regarding category membership.

        What evidence is there of such fuzzy boundaries in our real-world categories? A linguist, Labov, gave people drawings of a variety of different objects and asked them to indicate what each was. In one of his conditions, he started out with a cup, and systematically increased the width of this object in different drawings. The question he was interested in, of course, was when people would stop calling it a cup, and start calling it a bowl. What he found was that there was no sharp boundary at which people stopped calling it one thing and started calling it another. Instead, as the width increased, the proportion of times people called it a cup declined much like what we might expect to find in a generalization gradient. At extreme widths, he started seeing an increase in use of bowl, but again, it was a gradual increase, rather than a sudden, dramatic shift in categorization.

        Labov's experiment illustrates the notion of degree of membership, and the corresponding fuzzy boundary. But it also illustrates another aspect. In a second condition, he used the same drawings, but asked people to imagine the objects filled with, say, mashed potatoes. In this condition, the gradients were different. There was a faster decline in the use of cup as width increased, and a faster rise in the use of bowl. Thus, context can influence our categories, suggesting a degree of flexibility, in contrast to a well-defined approach that requires always applying the same test (does it fit the rule?).

        Roth and Shoben also showed such flexibility in one of their studies. They asked people to rate various beverages for how typical they were. Roth and Shoben were particularly interested in the beverages coffee, tea, and milk. When people are asked to rate typicality of beverage in a context such as something secretaries have with their mid-morning break, coffee comes out on top, followed by tea, and then milk. But when people have to rate typicality in the context of a truck driver having a beverage and a doughnut for breakfast, milk and tea switch positions.

        And finally (although we could mention many more such studies), consider a study by McCloskey and Glucksberg. They asked people to categorize various things. Amongst the decisions people had to make was whether a stroke was a disease. About half of the people in their experiment claimed it was, and in a later follow-up condition, many people changed their minds about what they had said. As in Labov's experiment, this study illustrates fuzzy boundaries that will lead people to classify objects sometimes one way, and sometimes another.

Prototype Theory

        One of the major changes in research in categorization occurred in the '70s when Eleanor Rosch built a psychological model of prototypes based on the work of Wittgenstein. Rosch argued that we could view categorization as the formation of a prototype that represented the central tendency or the average representative of a category. Deciding whether something was a member of the category or not would then be a matter of assessing its similarity to the prototype. If the similarity proved sufficiently high, then it would be classified as a member.

        A first question naturally arises: How do we determine what the prototype is? Rosch took over the notion of family resemblance from Wittgenstein, and claimed that the prototype ought to be the item with the highest family resemblance to all the other known members of the category. Computing family resemblance scores involves comparing each object to every other object in the category. (Or psychologically, perhaps checking to see how many features are primed in common between one object and a variety of other objects.) As an example, consider the following five items that are members of the same category:

        ABC    ACE     CDE     ACD     CDF

        The letters in these items symbolically represent features (A, for example, might represent the feature large). What we can do is compare each object to the other four, and count the number of feature matches we have in common to come up with a family resemblance score. Doing this will yield the following results:

        Item:                     Family Resemblance Score

        ABC                     2 (A) + 0 (B) + 4 (C) = 6
        ACE                     2 (A) + 4 (C) + 1 (E) = 7
        CDE                     4 (C) + 2 (D) + 1 (E) = 7
        ACD                     2 (A) + 4 (C) + 2 (D) = 8
        CDF                     4 (C) + 2 (D) + 0 (F) = 6

        Thus, since ACD has a higher family resemblance score than any other item, it is the one item that has the most similarity to everything else, and therefore, the one item (among the five actual objects) which is the best candidate for the prototype. As for the other items, each seems to resemble the prototype to the same degree, because each has an overlap of two features with it. But if we want to get a bit more technical, then we can also look at importance of the features. The most common feature is C (all five items have C), followed by A and D (which three of the items have), and then by E (common to 2 items), and finally, B and F (present in only one). So, based on importance or centrality of the features, we would say that the objects ACE and CDE are more similar to the prototype (in terms of having frequent features) than the objects ABC or CDF. In fact, our family resemblance scores told us this: The higher the family resemblance score, the more similar something ought to be to the prototype.
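        The scoring procedure is simple enough to write out. The following is a minimal sketch of the counting scheme used in the table above (for each feature of an item, count how many of the other items also have it, then sum); the function name family_resemblance is my own label, and this is not meant as Rosch's actual scoring software.

```python
ITEMS = ["ABC", "ACE", "CDE", "ACD", "CDF"]

def family_resemblance(item, items):
    """Sum, over the item's features, of how many OTHER items share each feature."""
    others = [other for other in items if other != item]
    return sum(sum(feature in other for other in others) for feature in item)

scores = {item: family_resemblance(item, ITEMS) for item in ITEMS}
best = max(scores, key=scores.get)

print(scores)  # {'ABC': 6, 'ACE': 7, 'CDE': 7, 'ACD': 8, 'CDF': 6}
print(best)    # ACD

# Raw overlap with the prototype: every other item shares exactly two features.
overlap = {item: len(set(item) & set(best)) for item in ITEMS if item != best}
print(overlap)
```

        The overlap counts show why raw overlap with ACD cannot separate ACE and CDE from ABC and CDF, whereas the family resemblance scores can.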

        Another way of doing this is to calculate the real average, based on these five objects. In terms of what the central tendency of a category will be, it ought to have the following six features at the following probabilities (I'm averaging over how often a feature appears):

        Feature             Average

        A                   3/5 = .6
        B                   1/5 = .2
        C                   5/5 = 1.0
        D                   3/5 = .6
        E                   2/5 = .4
        F                   1/5 = .2

        So, if we define our theoretical prototype as an object that is likely to have certain features (because their probabilities are over .5), then we again find that this will also yield an object that has the features ACD, which happens to be the object we located by using family resemblance scores. However, I have gone through both ways of calculating a prototype deliberately, because you ought now to have a sense that the two methods need not result in the same thing. The use of family resemblance scores will always converge on one (sometimes more) actual object in the category. But the use of mathematical probabilities may converge on a non-existent object. With the right set of objects, for example, we might have found that the central tendency would have involved four features, although all of our objects have just three. Normally, with rich categories having a lot of members, we assume that both methods will come up with sufficiently similar results: The object selected through family resemblance calculations ought to be quite close to the theoretical average calculated in the second method. So, we will take the sometimes potentially wrong approach of claiming that a prototype is one of the objects of the category (though you will see a study below in which this won't be the case).
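        To make the contrast between the two methods concrete, here is a hedged sketch of the averaging method. The .5 cutoff is the one used above; the function names are mine, and the second set of items is a hypothetical category I made up (it does not come from any study) purely to show a case in which the averaged prototype is not one of the actual objects.

```python
from collections import Counter

def feature_averages(items):
    """Proportion of items in which each feature appears."""
    counts = Counter(feature for item in items for feature in set(item))
    return {feature: counts[feature] / len(items) for feature in sorted(counts)}

def theoretical_prototype(items, threshold=0.5):
    """Features whose average exceeds the threshold, joined into one string."""
    averages = feature_averages(items)
    return "".join(f for f, p in averages.items() if p > threshold)

chapter_items = ["ABC", "ACE", "CDE", "ACD", "CDF"]
print(feature_averages(chapter_items))
print(theoretical_prototype(chapter_items))       # ACD

# Hypothetical category (not from the chapter): every feature appears in
# three of the four items, so the averaged prototype has four features
# even though each actual item has only three.
hypothetical_items = ["ABC", "ABD", "ACD", "BCD"]
print(theoretical_prototype(hypothetical_items))  # ABCD
```

        In the hypothetical category, no actual object can be the averaged prototype, so a family resemblance calculation (which must select an existing object) and the averaging method necessarily disagree.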

        In any case, Rosch and her associates conducted a study in which they gave people a bunch of items from different categories such as toys, weapons, furniture, vehicles, etc. They asked this group to list features that came to mind when they thought of each object. Based on people's feature lists, they calculated family resemblance scores. These family resemblance scores constitute their predictions about how people should judge the similarity of any two objects, and more to the point, how people should judge the typicality of each object for its category (i.e., its degree of resemblance to the prototype). To test their theory out, then, they took yet another group of subjects and asked them to rate each of these objects for typicality (goodness of exemplar). How typical is a bicycle, for example, for the category vehicle? What about a car? A plane? Rollerblades? The orderings they obtained from the second group who did not list features correlated highly with the orderings obtained from the family resemblance scores of the first group who did.

        Do people really use prototypes, and are they capable of calculating central tendency? A famous study by Posner and Keele suggests affirmative answers to both questions. Posner and Keele created several decks of cards. Each deck was constructed from a base card that was essentially a drawing made up of large dots placed in different spots. Some of the base cards might be meaningful drawings (making up the shape of the letter F, for example, or a diamond shape), and others would represent random placements of the dots on the card. What Posner and Keele did was to systematically create other cards from these base cards by distorting the base card drawings. There were a number of levels of distortion, and these involved how many dots were randomly selected for moving to new positions, how far they were moved, and in what directions.

        So here is the experiment: Let us call the non-base cards distortion cards. We take half of the distortion cards at each level of distortion (from mild to severe), shuffle them together with similar distortion cards based on three or so other base cards, and give the resulting deck to our subjects. We ask them to sort the cards out into 4 or 5 groups (as appropriate: see below), and simply tell them whether they are placing a card in the correct group ("yes" or "no" feedback). We repeat this process (painful as it may be for the subject) until all our subjects can successfully group the cards. The rule for placing cards into piles (which we know, but our subjects do not) is that all cards that are distortions of the same base card need to be placed together.

        Now stop for a moment and think about what someone like Hull might say would be going on. Each individual card serves as a stimulus for a response (which is the pile in which it is placed). So, the people in this experiment are acquiring a lot of links. Moreover, we might want to predict that they will also be able to sort the new cards they haven't seen (remember, we gave them only half of the distortion cards) somewhat successfully because of generalization. But of course, the strongest habit strength ought to involve the specific cards people did learn about, so their performance should be more successful with those. Can you figure out what prototype theory predicts?

        Posner and Keele at this point shuffled all the cards together, including the distortion cards people hadn't seen, and including the base cards. They then gave the deck to their subjects to sort again, but this time they asked them how sure they were that they had seen a card before. Let's talk about sorting or classifying behavior first. People in fact could reasonably successfully sort the new cards that they hadn't seen earlier. (There was another finding here, but I'll hold it off, for the moment.) But the real news came in their confidence ratings. The cards people were most sure they had seen were the base cards, which had never been presented in the initial task! And moreover, their subjects did not systematically distinguish in their confidence ratings between the new distortions and the old distortions: Both new and old distortions, on average, received the same confidence rating that they had been presented in the experiment. But, the less distorted a card was, whether old or new, the more sure people were that they had seen it earlier.

        So, what is going on, here? Posner and Keele claim that how people sorted or classified cards was not by learning individual stimulus-response links, but by creating average dot patterns for the piles they were sorting into. Since the dot movements were in random directions, averaging over the distortions should have yielded something very similar to the original base card, the central tendency. Thus, once people constructed average cards in their working memories, they subsequently classified by comparing one of the stimulus cards to the 4 or 5 average cards they had created, and picked the pile corresponding to the average the card most resembled.
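        The averaging argument can be illustrated with a small simulation. This is only a sketch under simplifying assumptions of my own, not a reconstruction of Posner and Keele's materials: every dot is moved (rather than a random subset), the grid size, number of dots, and shift size are arbitrary, and the averaging step assumes we know which dot in one card corresponds to which dot in another.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def make_base(n_dots=9, size=50):
    """A hypothetical base card: n_dots placed at random on a size x size grid."""
    return [(random.uniform(0, size), random.uniform(0, size)) for _ in range(n_dots)]

def distort(card, max_shift):
    """Move every dot a random distance (up to max_shift) in a random direction."""
    moved = []
    for x, y in card:
        angle = random.uniform(0, 2 * math.pi)
        shift = random.uniform(0, max_shift)
        moved.append((x + shift * math.cos(angle), y + shift * math.sin(angle)))
    return moved

def average_card(cards):
    """Dot-by-dot mean position over a set of cards."""
    n = len(cards)
    return [(sum(c[i][0] for c in cards) / n, sum(c[i][1] for c in cards) / n)
            for i in range(len(cards[0]))]

def distance(card_a, card_b):
    """Mean distance between corresponding dots of two cards."""
    return sum(math.dist(p, q) for p, q in zip(card_a, card_b)) / len(card_a)

base = make_base()
distortions = [distort(base, max_shift=5.0) for _ in range(20)]
prototype = average_card(distortions)

typical_distortion_error = sum(distance(d, base) for d in distortions) / len(distortions)
prototype_error = distance(prototype, base)
print(round(typical_distortion_error, 2), round(prototype_error, 2))
```

        Because the random movements tend to cancel one another out, the averaged card should land closer to the never-seen base card than a typical distortion does, which is the sense in which averaging recovers the central tendency.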

        We can now see why confidence ratings came out the way they did. Although people had never actually seen the base cards, they had re-created them in their working memories in order to do the task. Thus, in some sense they had actually mentally 'seen' these cards on the majority of the trials, making them very familiar. These were the cards people learned about for the most part, not the actual cards being presented. Consequently, the closer something was to one of these mental prototypes, the more confident people were that they had seen it. Their failure was thus a failure of what we call reality monitoring: being able to tell whether you actually saw something, or merely imagined it.

        The Posner and Keele study provides reasonable evidence for the claim that people can create a prototype, and that they do not need to be exposed to the actual prototype in order to do so! It additionally provides evidence for the claim that categorization involves some sort of process like matching to a prototype, in at least some cases. Do by all means note the similarity of this approach to the Thomas et al. finding regarding peak shift, in which we discovered that peak shift might also represent a sensitivity to central tendency of a category. So on this account, the type of typicality effects that Rosch and others find in these experiments may have more than a passing resemblance to the similarity functions of a generalization gradient.

        Rosch also noted another characteristic of categories that was relevant to prototype theory. This is the fact that categories in the real world are arranged in a hierarchy in which large categories include smaller categories as members, and these, in turn, include yet smaller members. Although we could easily identify many different levels of hierarchy, Rosch concentrated on three levels, the basic, subordinate, and superordinate levels. A subordinate level category for Rosch is a highly specialized category that typically involves a modifier in its name (though this need not always be true!). Thus, dining room chair and classroom chair would be examples of subordinate level categories. Superordinate level categories, on the other hand, are very high level categories that include many, many different things. Furniture is an example Rosch gives for a superordinate level. According to Rosch, there is a tradeoff involved in having one or the other of these categories as our preferred category to carry around. The subordinate level category provides a very rich and detailed level of information about its members, but there are so many subordinate level categories that we would have to carry around a huge number of prototypes. Also, the prototypes for a lot of these (kitchen table; dining room table) would be quite similar, making classification difficult. The superordinate level, on the other hand, reduces the cognitive load by requiring us to carry around only a few categories, but the level of detail is nowhere near as rich or useful. In particular, similarity to the prototype cannot be used with a superordinate level as easily, since at this level, there are many things that will not resemble the prototype. Thus, for the category furniture, what is the prototype that will resemble such good and typical examples of this category as table, chair, bed, dresser, etc.?

        Accordingly, Rosch gives privileged status to a level of category in between these two that she calls the basic level category. It is a compromise between having huge numbers of lower-level categories with rich information and a few higher-level categories with sparse information. As to what constitutes the basic level, she claims that it is the highest level at which a prototype may be identified that will resemble the other typical members of the category, thus enabling use of a similarity function to categorize.

        In agreement with Rosch, basic level categories do appear to be the first categories children acquire. Moreover, basic level categories also appear generally to be coded in language by single words such as chair or table. Even in sign language, basic level categories appear to be coded by single signs rather than combinations of signs. Thus, some further indirect evidence for her theory comes from the finding that a concern with similarity in categorization predicts which types of categories are more fundamental.

        Although we have spoken of prototypes as being equivalent to feature lists (and a matching rule as involving looking for shared or overlapping features), Rosch herself was somewhat ambiguous about her use of the term prototype. In some cases, she talked about a prototype as being like an image. And indeed, in one experiment, she superimposed multiple drawings on top of one another to see whether people could recognize the depicted objects. These drawings were outlines of objects from the same category chosen at random from magazine photos. When the objects were from a basic level category, people were much more likely to correctly identify them than when they were from a superordinate level. Whether this indeed constitutes great evidence for an imagery explanation of prototypes is questionable, to me, at least; but it does support the claim that basic level categories are, well, more basic.

        Rosch and her colleagues have also looked at the correlations among people's typicality ratings. If categories such as these are to be useful when we communicate to one another via language, then we need to share roughly the same notions of what the category means. And she does find that typicality ratings correlate highly. Everyone, for example, will agree that bear and lion and horse are typical animals, and that coral and sponge and protozoan are not.

        As to flexibility (recall the Labov study in which category labels for cup versus bowl changed when we asked people to name objects that were empty or filled with food), we could easily claim that flexibility involves temporarily shifting the prototype, so that typicality changes, as well. This hypothesis, termed refocusing by Roth and Shoben, can account for our ability to know that parrot and parakeet are more likely to be typical birds for someone living in the Caribbean than someone living in Idaho. Indeed, Barsalou and Sewell have shown that people can shift their assessments of typicality when asked to adopt a different point of view.

        As influential as Rosch's view of prototypes has been, however, it too fails to handle all categories and all findings. Barsalou, in particular, has posed a number of challenges to her view. For one, he has suggested that Rosch used the wrong statistics to assess how similar people's estimates of typicality were. When he uses the corrected statistics, Barsalou finds that the correlation between people's typicality ratings is not all that high. And perhaps more disastrously, people do not appear to be completely internally consistent: Their own typicality assessments shift over the course of a week or so. (There is still high within-person consistency, but it is not perfect by any means.) He has investigated several alternative characteristics that could account for typicality ratings in addition to similarity to central tendency. Among these have been frequency of instantiation and proximity to ideals. Frequency of instantiation refers to the number of times someone thinks of a given object when exposed to the category. An ideal represents the best possible theoretical representative in certain categories, rather than the central tendency. For example, compare an ideal athlete with an average athlete to get a feel for how different these two points or persons will be in the category. Barsalou points out that some typicality ratings may naturally be geared towards comparing similarity to the ideal. It turns out that each of these three factors (similarity to central tendency; frequency of instantiation; proximity to ideal) accounts for some of the typicality effects under certain circumstances. So, there is not just one cause of typicality.

        Barsalou also points out that typicality in some categories seems to have nothing to do with physical similarity, raising questions about a similarity-to-prototype rule. He distinguishes taxonomic categories, the types of categories that exist in the natural world, from goal-derived categories that involve human activities of various sorts. In a goal-derived category, objects are included in the category because they fulfill various goals. For an example, take the following list:

            beer    chicken    frisbee    blanket    ice    napkins    radio    spoons

        In what way are these items similar to one another? Physically, the items aren't similar at all. But they are members of a coherent category. The category is organized around some interconnected goals that you will agree are coherent when you hear what the category is. Here it is: Things to take on a picnic. Among the goals are eating, drinking, and entertainment, and the objects in the list above are associated with satisfying one or another of these goals.

        A final point Barsalou makes is that categories are so flexible that we can make them up on the spot, and agree on typicality, at least to some extent, regarding some of their members. Such categories are termed ad hoc categories. An example of an ad hoc category that Barsalou likes to mention is ways to avoid getting hit by the mob. We can all clearly come up with instances or examples, and we can all clearly agree that some of these are good, and others not so good.

        For Barsalou, concepts in memory have context invariant information, and context flexible information that depends on or is primed by some contexts, but not others. That partly accounts for the flexibility we find in categorization, and the variability we find in people's ratings. For example, round may be a good candidate for the type of context invariant information that would almost always be retrieved when dealing with the concept basketball; but buoyant would normally not, although basketballs certainly have this property. Under the right circumstances, however, this property can become active ("He threw the basketball into the pool.").

Exemplar Models

        Barsalou's notion of frequency of instantiation carried the suggestion that we are perhaps sensitive not just to central tendency, but also to the different ranges found in various categories. Posner and Keele had actually obtained a result in their experiment with the dot patterns that came to the same conclusion. In a detail of their experiment I withheld earlier, they actually exposed groups to different levels of distortions. Thus, we can simplify the design to the following:

        Group    Patterns During Learning    Generalization Test

        1        low distortions only        new low & high distortions
        2        low & high distortions      new low & high distortions

        From our discussion of Rosch, we would expect people to come up with the same prototypes in each group. So, when we give them the full deck after initial learning, including the new distortions, they theoretically ought to perform equally, since they need only compare a new card to the 4 or 5 prototypes they have created. But as you might have guessed, the groups are not equal. In particular, Group 1 (which had not been exposed to the high distortion cards during initial learning) had some trouble classifying the high distortion cards. Thus, not only do people pick up information about central tendency; they also learn something about the range of objects that would be expected in the category. When something is outside that range, they are not as well-equipped to deal with it.

        There are a number of ways in which range information could be included in prototype models of categorization. But we'll now turn to a very different class of models that provides a very natural explanation for sensitivity to range. These models are called exemplar models. They have been sponsored and supported in large part by Medin and his colleagues (e.g., Medin and Schaffer), and by Nosofsky.

        In direct contradiction to a minimalist prototype model that claims representation of the category only by an average example, exemplar models claim that the category is represented by a series of examples (or exemplars). Thus, instead of average information, we have individual information. And since we do have such individual information, it is also clear that we will have access to information about the range of variation within a category.

        Most exemplar models (and Medin's and Nosofsky's work is a good illustration of this) also claim that categorization is based on a rule of similarity. Thus, a young child trying to decide whether something is a dog or cat will presumably compare the mystery animal to all the dogs and cats he has encountered, and will derive an average value for similarity to a category. That is different from doing a simple comparison to a prototype. But, it might be expected to yield similar classifications, at least on some occasions. However, other theorists (Reed, for example) have also proposed alternative rules of classification that become possible in an exemplar but not a prototype model. Thus, in some sense, an exemplar model inherently supports greater flexibility of categorization.

        For example, one of the possible rules that has been studied is what is called the nearest neighbor rule: When you have an object whose category isn't directly known, classify it in the same category as the exemplar to which it has the highest similarity (its nearest neighbor). In some cases, the nearest neighbor's category may have a central tendency (if we want to calculate family resemblance scores) that will actually be less similar to the mystery object than the central tendency of some other category. As this example illustrates, it is certainly possible to get exemplar models to predict different types of categorization than prototype models.
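        The contrast between the two rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exemplars, the two dimensions, and the category names X and Y are all made up, and similarity is treated simply as the inverse of Euclidean distance (smaller distance, higher similarity). Category X contains one outlying exemplar, and the mystery object sits right next to it.

```python
import math

# Hypothetical exemplars on two made-up dimensions (not data from any study).
EXEMPLARS = {
    "X": [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0)],  # (8, 8) is an outlier
    "Y": [(6.0, 6.0), (6.0, 7.0), (7.0, 6.0), (7.0, 7.0)],
}

def distance(a, b):
    """Euclidean distance; assumed here to be the inverse of similarity."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def central_tendency(items):
    """Average value on each dimension (a simple stand-in for a prototype)."""
    n = len(items)
    return tuple(sum(item[i] for item in items) / n for i in range(len(items[0])))

def nearest_neighbor_category(probe):
    """Category of the single stored exemplar closest to the probe."""
    best_distance, best_category = None, None
    for category, items in EXEMPLARS.items():
        for item in items:
            d = distance(probe, item)
            if best_distance is None or d < best_distance:
                best_distance, best_category = d, category
    return best_category

def nearest_prototype_category(probe):
    """Category whose central tendency is closest to the probe."""
    return min(EXEMPLARS, key=lambda c: distance(probe, central_tendency(EXEMPLARS[c])))

probe = (8.5, 8.5)
print(nearest_neighbor_category(probe))   # X: the outlier (8, 8) is the nearest neighbor
print(nearest_prototype_category(probe))  # Y: Y's central tendency is closer than X's
```

        With these made-up values, the nearest neighbor rule puts the mystery object in X, while a comparison to central tendencies puts it in Y.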

        As the discussion has been rather abstract so far, let us consider a more concrete example. One of the situations Medin and Schaffer made famous is called the 5/4 problem. In this problem, we teach people two categories that involve objects with values on four dimensions. Typically, we will restrict ourselves to just two values per dimension. Thus, using D1 through D4 to represent the four respective dimensions and using 1 or 0 to represent the two values along each of these dimensions, we might teach people the following category assignments:

                 Category A                          Category B

                 D1    D2    D3    D4                D1    D2    D3    D4

        A1:      1     1     1     0         B1:     1     0     0     0
        A2:      1     0     1     0         B2:     0     1     0     1
        A3:      1     0     1     1         B3:     0     0     0     1
        A4:      1     1     0     1         B4:     0     0     1     0
        A5:      0     1     1     1

        On a prototype account, the prototype for Category A should be all 1s on each dimension, and the prototype for Category B should be all 0s. But item A2 is equally similar to both of these prototypes, as is item B2. Nevertheless, people do seem capable of learning to classify these in their proper categories. Moreover, note that the two theories will make different predictions about how a mystery object whose description is 1 0 0 1 will be classified. It is equally similar to both prototypes, but it will actually have higher average similarity to the exemplars of Category B. To see this, assume first that we use similarity to prototype. Counting the number of feature matches for 1 0 0 1 with either 1 1 1 1 (Category A) or 0 0 0 0 (Category B) will give us similarity scores of 2 in each case.

        But now count up the number of matches to each item in Category A, and divide by 5 to get an average. This yields:

            [ 1 (A1) + 2 (A2) + 3 (A3) + 3 (A4) + 1 (A5) ] / 5 = 10/5 = 2

Doing the same for Category B gives us:

            [ 3 (B1) + 2 (B2) + 3 (B3) + 1 (B4) ] / 4 = 9/4 = 2.25

And on this basis, similar as you might have first thought these models to be in their average similarity calculations (if radically different in their assumptions about memory), you can see that they actually make different categorization predictions. Both Medin and Nosofsky have shown in a number of such cases that the exemplar model's predictions seem to fit the data better.
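        The arithmetic above can be checked with a short Python sketch. It assumes, as the text does, that similarity is just the number of matching feature values; the function and variable names are made up for illustration, and published exemplar models use more elaborate similarity rules than this simple count.

```python
# The 5/4 category structure from the table above (values on D1..D4).
CATEGORY_A = {
    "A1": (1, 1, 1, 0),
    "A2": (1, 0, 1, 0),
    "A3": (1, 0, 1, 1),
    "A4": (1, 1, 0, 1),
    "A5": (0, 1, 1, 1),
}
CATEGORY_B = {
    "B1": (1, 0, 0, 0),
    "B2": (0, 1, 0, 1),
    "B3": (0, 0, 0, 1),
    "B4": (0, 0, 1, 0),
}
PROTOTYPE_A = (1, 1, 1, 1)
PROTOTYPE_B = (0, 0, 0, 0)

def matches(x, y):
    """Number of dimensions on which two items have the same value."""
    return sum(1 for a, b in zip(x, y) if a == b)

def average_similarity(probe, category):
    """Mean number of matches between the probe and every stored exemplar."""
    return sum(matches(probe, item) for item in category.values()) / len(category)

probe = (1, 0, 0, 1)

# Prototype comparison: a tie.
print(matches(probe, PROTOTYPE_A), matches(probe, PROTOTYPE_B))  # 2 2

# Exemplar comparison: Category B wins.
print(average_similarity(probe, CATEGORY_A))  # 2.0
print(average_similarity(probe, CATEGORY_B))  # 2.25
```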

        But can an exemplar model handle data suggesting that people can decide which of two objects is more typical? That is, a prototype model has a seemingly natural mechanism for generating typicality ratings since it bases those (not always correctly, as we have seen from Barsalou) on similarity to the prototype. How on earth are similarity or typicality ratings obtained in a strict exemplar model? If you will recall, typicality was really obtained in two ways in Rosch's original work. She claimed that typicality arose from the process of comparing an item to the prototype. But in actual fact, she obtained family resemblance scores that were the basis for predicting typicality by comparing each item in a category to every other item. And as you can see from my description, that process of calculation works from an exemplar model. (Of course, Rosch didn't mean to say that this is the way her people actually arrived at family resemblance; but in practice, she used some exemplar assumptions to predict typicality, and thus inadvertently demonstrated that exemplar models could handle typicality effects, as well.) Estes in a recent book on his exemplar model has modified some assumptions about how similarity ratings are obtained, but he essentially also shows that exemplar models easily predict typicality or similarity to a category. This is an important result, as there is no representation of the category by itself in a real exemplar model. There is nothing like a prototype or central tendency that is privileged in terms of being the one item we would compare others to.

        In a number of cases, exemplar models simply make better and more reasonable predictions. One of these should strike you as intuitively appealing. Thus, think of what a typical spoon looks like. For most people, this will likely be a teaspoon or a tablespoon. Now imagine a large wooden spoon, and a metal spoon otherwise identical to the wooden spoon except for the material it is made out of. Most people will argue that the wooden spoon is a better example of the category than the otherwise identical metal spoon. But that is a very odd prediction for a prototype model to make, as a large metal spoon is more similar to the prototype on any measure of physical similarity than is the wooden spoon. But of course, we are familiar with wooden spoons as exemplars, and thus an exemplar model can handle this particular finding well.

        You will see in a later chapter on memory that several people such as Tulving have proposed a theory of memory in which every different occasion on which we encounter something results in the formation of a different memory. Indeed, you have already encountered something of the sort in the discussion of Logan's race model in an earlier section of this chapter. Such models, termed multi-trace models (since they posit multiple memory traces for similar events or repetitions of events rather than the strengthening of a single mental representation), are quite consistent with an exemplar theory approach. Different exemplars are stored along with an indication of the category they belong to, just as different experiences are stored separately. On such models, there need be no generalizing or abstracting process that results in a representation of the category separate from its individual members.

        Also (although this will really be true, as well, for more sophisticated versions of a prototype model), exemplar models can handle category flexibility results (such as the Barsalou or Roth and Shoben results) by claiming that certain dimensions get weighted more strongly than others in certain situations. To tie this in to the well-defined categories we discussed earlier, you may wish to note a possibility that several theorists have mentioned. In the right circumstances, a context manipulation that weights one dimension extraordinarily strongly may result in something like a hypothesis-theory rule specifying a certain feature. The advantage of an exemplar model in this case is that the 'rule' can be formulated along with several exceptions to it that are still members. Whether rule-plus-exception models are really different from exemplar models is an issue that Nosofsky, among others, is concerned with.
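        To see how dimension weighting could turn an exemplar comparison into something that behaves like a single-feature rule, consider the following Python sketch. It reuses the 5/4 stimuli from earlier in this section. The weights are hypothetical, and the weighted match count is a deliberate simplification chosen for illustration, not the similarity rule of any particular published model.

```python
# The 5/4 stimuli (values on D1..D4).
CATEGORY_A = [(1, 1, 1, 0), (1, 0, 1, 0), (1, 0, 1, 1), (1, 1, 0, 1), (0, 1, 1, 1)]
CATEGORY_B = [(1, 0, 0, 0), (0, 1, 0, 1), (0, 0, 0, 1), (0, 0, 1, 0)]

def weighted_similarity(x, y, weights):
    """Sum of the weights of the dimensions on which x and y match."""
    return sum(w for a, b, w in zip(x, y, weights) if a == b)

def average_similarity(probe, exemplars, weights):
    """Mean weighted similarity of the probe to every stored exemplar."""
    return sum(weighted_similarity(probe, e, weights) for e in exemplars) / len(exemplars)

def classify(probe, weights):
    """Pick the category with the higher average weighted similarity."""
    sim_a = average_similarity(probe, CATEGORY_A, weights)
    sim_b = average_similarity(probe, CATEGORY_B, weights)
    if sim_a == sim_b:
        return "tie"
    return "A" if sim_a > sim_b else "B"

probe = (1, 0, 0, 1)

# Equal attention to all four dimensions: the probe goes to B (2.0 versus 2.25).
print(classify(probe, (1, 1, 1, 1)))   # B

# Hypothetical context that weights D1 very heavily: the probe now goes to A,
# as if the subject were following the rule "D1 = 1 means Category A".
print(classify(probe, (10, 1, 1, 1)))  # A
```

        Note that the stored exemplars A5 and B1 violate the D1 'rule', yet they remain in memory as members of their own categories, which is the sense in which the rule comes with exceptions.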

Theory-Based Categorization

        Finally, there are still many theorists who embrace and defend both prototype and exemplar models. Some defend hybrid models in which there is both a central tendency and exemplars: the central tendency in these models may be important in distinguishing between categories, whereas the exemplars become important for distinguishing within a category. Nevertheless, a number of theorists have started becoming concerned with the issue of what is termed conceptual coherence: The 'sense' in which a bunch of objects that are grouped together seem to fit together for important, underlying reasons. Medin, who is now one of these, has pointed out that items in a category seem to belong together for reasons that are very different from the rather casual fact that they may resemble one another. The question is what these reasons might be. And the answer that Medin and others like him give is that categories make sense because of underlying causal and explanatory structures that link the objects together into a coherent whole.

        As this is rather abstract, let's look at an example from a study by Medin and Shoben. Among other things, they asked people to judge the similarity of the colors black, white, and grey. In some of their conditions, grey and black were judged to be more similar than grey and white, and in other conditions, the exact opposite finding occurred. But since these colors stay identical, why is it that their relative similarities change?

        Once you know the manipulation, the answer will not surprise you. Medin and Shoben had some people judge color samples in the context of color of hair, and others judged color samples in the context of what the sky might look like. Those contexts engage your understanding of change in hair and sky color, and it is the underlying causal principles in each case that drive your evaluation of similarity. Thus, in hair color, the aging process will turn hair grey, and then white. But many people are already mature when their hair turns grey, so that the age differences ought to be less for grey-haired and white-haired individuals. And in an analogous account, a storm system will be ushered in by a change in clouds from white to grey, and then as the storm gets fierce, to black.

        Similarity is thus itself flexible, and may signal the operation of some underlying principle (in the case of biological species such as cats and dogs, for example, shared genetic material that results in certain overt similarities).

        Moreover, similarity may not be really important to determining what a category is, at all! That is an important point made by Armstrong, Gleitman, and Gleitman. They gave people categories, and found typicality effects for all of their categories. But, since these categories were well-defined and included such things as odd number and women, just what does it mean that some things are better odd numbers or women than others? If you think about it, 3 will also strike you as a better odd number than 1017, but here is a case where similarity to a prototype odd number such as 1 will not work, as 2 is even more similar to 1 than 3. Armstrong et al. thus found typicality effects in some cases that we know have to be well-defined. So, similarity to prototype may be a way of sometimes evaluating items in a category, but it need not be a way of defining them.

        A very similar point has been made by Smith and Medin. They distinguish between the core definition of the concept, and what they term the identification procedure. Their point is that we may often be lazy enough to use visual appearances or physical similarity to classify, but that we also know that categories aren't the same things as similarity structures. Secondary sex characteristics such as facial hair, for example, may be used to classify sex, but a core definition of sex (though not necessarily gender!) will involve assessing presence of X and Y chromosomes.

        Anyway, let us end this section by briefly considering one more study. This was an experiment by Hastie and Rehder. They argued that a theory-driven approach to categorization would claim that categories ought to be learned more easily, and differently, when the features of various members of the category could be explained by underlying causes. Thus, in one of their conditions, they taught people about a new (made-up) species of ant, the Keyhoe ant, living on a volcanic island. There were four features that identified these ants: they had very high levels of iron sulphates; their immune systems were hyperactive; their blood was unusually thick; and they built nests at extraordinarily fast rates.

        Hastie and Rehder systematically examined two possible organizations of these features. In the common cause condition, the first feature was used to explain the other three. In the common effect condition, each of the first three features was used to explain the fourth (so that the fourth had three causes). So, on common cause, for example, the reason the ants had high levels of iron sulphates was that their diet on this island was rich in these. But, the iron sulphate molecules were not recognized by the immune system, triggering an immunity response. Also, blood high in iron will be very thick. Notice how the first feature has just been used to provide an explanation of the next two features. And in similar fashion, the first feature in the common cause condition was also used to generate a plausible explanation for the last feature. So, the features should have hung together as a reasonable package, as opposed to the usual categorization procedure in which people would have had to learn the features as a random assortment that happens to be there for Keyhoe ants.

        So what happened? Hastie and Rehder found that the first feature was more important for categorization in people who got the common cause, but the last feature was more important for people who got common effect. Moreover, these four features were not treated as independent features by over half of their subjects: For the common cause subjects, people were particularly sensitive to combinations of the first and second, first and third, and first and fourth features (i.e., the cause and its effects). The common effects subjects, in analogous fashion, were particularly sensitive to combinations of the first and fourth, second and fourth, and third and fourth features (what they got as causes and the effect). Many people thus did exhibit sensitivity to the underlying causal relations making these feature bundles coherent.

        There is one final result Hastie and Rehder obtained. They not only looked at the ability of people to classify new ants into several different species, but they also asked people to make similarity judgments. The similarity judgments did not predict categorization results. Thus, in harmony with the point made above, similarity or typicality need not be the primary mechanism of categorization.

IV. Animal Categories

        So what does all of this have to do with generalization gradients and discrimination learning? On similarity-based approaches to categorization (such as prototype models and exemplar models), typicality effects look very much like generalization gradients. And even if we adopt the claim that typicality effects or similarity functions represent, at best, an identification procedure rather than knowledge of the core concept, the similarity of results at least suggests the possibility that animals classify objects in ways reminiscent of what humans do. And that is a very far cry indeed from models like Hull and Spence that claim generalization is simply a basic finding with no additional significance or explanation needed.

        Shepard has made the claim that generalization provides information about animals' categories, although others such as Herrnstein have also put forth similar ideas. Shepard refers to generalization as a cognitive act by which animals assert equivalences between certain events that effectively place them into the same status. He gives the example of a bird that eats a tasty moth for the first time (so the bird is being reinforced for responding to the stimulus of the moth). When it sees another moth with slightly different markings, it will be faced with deciding whether the new moth fits into the category of edible tasty things, or non-edible (and potentially dangerous) things. If the bird uses a similarity metric to make its classifications, then its behavior is essentially informing us about the assessed probability that the second moth will fit into the edible category.

        If you have paid careful attention to some of the concept-learning and categorization studies above, you may have noticed a similarity between these and discrimination learning: Some items are associated with one response (the category name or label, in the case of categorization; a reinforcer in the case of discrimination), and others are associated with another response (the other category name; the absence of a reinforcer). But is this an important similarity? Certainly, we may feel a bit uncomfortable in talking about separate categories when there are only two stimuli. Nevertheless, as the experiment with complex stimuli by Reynolds showed, pigeons learned about a feature that separated the two stimuli (different features were chosen by the different pigeons), and such a type of feature learning is easily consistent with a simple category rule of affirmation (in which a category is defined by the presence of one feature). Still, most theorists would argue that categorization involves an ability to go beyond physical stimuli to some sort of abstract relationship that a number of stimuli may in fact have in common. But even here, we saw that animals were capable of relational learning of just this sort (for example, Gonzales, Gentry, and Bitterman; Lawrence and DeRivera), and we also saw evidence of abstraction across different stimuli in the transfer studies (IDS, for example, or Lawrence's study).

        Arguably, generalization does not always have to be an instance of categorization, or a cognitive act in Shepard's sense. An animal that learns an association to a certain shade of red may have that association primed to the extent that another wavelength similar to the original triggers the memory or the association. Generalization gradients and discrimination experiments on relatively simple stimuli are thus ambiguous in their interpretations. But interesting questions can be asked about gradients based upon more complex stimuli, including what level of categorization various species of animals are capable of. Some theorists have indeed been asking these questions.

        In humans, of course, generalization may occur on the basis of non-physical features, so that we certainly can claim that gradients sometimes tie into the same types of abstract features that are discussed in the literature on categorization. Lacey and Smith, for example, show semantic generalization whereby a generalization gradient is found for concepts similar to a training concept. People who are shocked when the word cow appears will show evidence of anxiety in the presence of words such as farm, horse, or plow, for example.

        But can animals actually acquire natural categories of the sort humans have? If they do have an ability to acquire these categories, then we gain at least a little bit of confidence in going back to our generalization gradients and speculating that they, too, represent processes of categorization. So, to finish up this brief section, let's look at several studies on animal categories.

        Perhaps the classic studies (since followed up by a plethora of studies) were done by Herrnstein and his associates. In one series of famous studies by Herrnstein, Loveland, and Cable, pigeons were given discrimination training on a long series of pictures that differed in whether an object was present or absent. But the object was a different object in each case; the only commonality was that all of the objects were members of the same category. So, for example, Herrnstein et al. exposed their pigeons to 40 pictures in which some type of tree was present, and 40 pictures in which there were no trees. Sometimes, these pictures showed only part of a tree; sometimes, the tree was in the background; sometimes the tree was in the foreground. Despite the large variation in the 40 training photos in types of trees, position of the tree, etc., the pigeons were able to acquire the concept tree: They showed very good classification for new photos on the basis of whether the new photos had trees in them. In the Herrnstein et al. studies, pigeons were also able to learn the category human female (the S+ pictures had women in them) and water (the S+ pictures had water in them).

        In a later, similar study, Herrnstein and deVilliers used photos that were taken by a scuba diver on a series of dives. The photos had been taken independently of the experiment, and were standard vacation-dive underwater shots. Forty photos that had fish in them were chosen, as were another 40 photos of underwater scenes in which fish were absent (shots of coral heads, for example, or starfish, turtles, shrimp, etc.). The fish came in a variety of sizes, shapes, colors, numbers, etc. So, trying to pick out a single perceptual feature that would discriminate fish from non-fish photos would be difficult, at best. Nevertheless, the pigeons in this study also acquired the concept. They were able to transfer their responses to new, never-before-seen pictures such that they generally pecked when a fish was in the scene.

        Is such learning exactly the same as discrimination learning? Or to put this another way, are these animals basically learning exemplar-like categories in which they acquire the correct response to 40 individual stimuli, and because of generalization transfer that response to new stimuli? If categories (including exemplar categories) have some underlying coherence, then we might predict that natural categories that presumably are coherent (like fish) should be easier to learn than arbitrary categories that lack such coherence. And indeed, that was the issue in another experiment by Herrnstein and deVilliers. They took another group of pigeons and randomly rearranged the 80 pictures into two sets, one of which comprised the S+ set. The pigeons did indeed finally learn this set, but it took them much longer, consistent with the claim that the learning in this case involved a different process.

        As another example, consider a study by Cerella. In this experiment, pictures of leaves served as the stimuli. Pigeons were taught to discriminate between an oak leaf (the S+) and other leaves (the S-) that came from different types of trees. Unlike the Herrnstein experiments, these subjects saw a single stimulus (i.e., a single exemplar) as the representative of the category. They did, however, see 40 non-oak leaves for the S- condition. But when they were tested on a later generalization test, they were able to transfer their categorization to other oak leaves of different sizes and shapes. So apparently, they showed evidence of knowing about the category oak leaf.

        You might think that Cerella's results were perhaps due to pigeons learning to avoid the type of leaf used as the S- (elm, sycamore, and sassafras leaves). Maybe they just learned to peck at anything that wasn't an elm, sycamore, or sassafras leaf. In a further follow-up study, Cerella reinforced the animals for pecking at a screen that showed a single oak leaf. There were a number of trials with this one leaf. No S- trials involving other types of leaves were presented. Thus, this experimental condition really is like the generalization experiments we have looked at earlier, in which we test for generalization following simple approach or appetitive learning. The news in this latter experiment was that the pigeons during generalization tests pecked at other oak leaf pictures, but not at pictures of non-oak leaves. So, experience with a single exemplar in this case served to set the category.

        Are categories specifically limited to humans? Do they require the use of language? These studies suggest that they aren't, and that they don't. To quote Herrnstein:

(T)he most challenging puzzles surrounding categorization have more to do with stimulus generalization and discrimination than with the specialized capacities of the human mind. (1984, p. 234).

        On the other hand, pigeons' categories and humans' categories need not be the same types of things. Cerella also taught pigeons to discriminate Charlie Brown (in the comic Peanuts) from other characters. The pigeons also learned this discrimination easily, and pecked at partially blocked Charlie Browns, etc. However, they also pecked at a drawing of Charlie with his head upside down, relative to his body. We would not tend to do that, since we would view such a stimulus as no longer legitimately symbolizing a person. These pigeons apparently didn't 'get' the idea that the head has to be oriented in a certain way with respect to the body.

        An even more suggestive study was done by Aydin and Pearce. They attempted to train pigeons on a fuzzy discrimination (although the studies above arguably deal with fuzzy features). In any case, Aydin and Pearce chose three features, and then reinforced pigeons for pecking at stimuli that contained any two of these. So, the correct stimuli varied in presence of features, but there was a family resemblance type of structure in this experiment. Were pigeons able to learn this discrimination? Most definitely, they were. But more impressively, the pigeons on a later transfer test responded most to a new stimulus they had not seen before, but which contained all three features. If you think about it, this result should remind you of what Posner and Keele found. The new stimulus with all three features in some sense should be the central tendency or prototype that develops when you are exposed to different combinations of any two features. Thus, the animal learning literature does show some evidence for existence of a prototype-like category (and the formation of a prototype).
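        The logic of why the all-three-features stimulus behaves like a prototype can be sketched in Python. This is only a simple additive feature count written for illustration, not the model Aydin and Pearce themselves tested, and the feature labels f1, f2, and f3 are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical labels for the three features.
FEATURES = ("f1", "f2", "f3")

# Reinforced training stimuli: every stimulus containing exactly two features.
training = [frozenset(pair) for pair in combinations(FEATURES, 2)]

# How often each feature appeared in a reinforced stimulus.
weight = {f: sum(1 for stimulus in training if f in stimulus) for f in FEATURES}

def score(stimulus):
    """Summed reinforcement history of the features present in the stimulus."""
    return sum(weight[f] for f in stimulus)

for stimulus in training:
    print(sorted(stimulus), score(stimulus))      # each training stimulus scores 4

novel = frozenset(FEATURES)                       # never shown during training
print(sorted(novel), score(novel))                # the all-three stimulus scores 6
```

        On this crude account, the novel stimulus collects more feature support than any stimulus the pigeons were actually trained on, which matches the direction of the transfer result.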

        Another procedure by which to test for abstract concepts involves what is generally called delayed-matching-to-sample. This procedure, which relies on memory, presents a stimulus for a brief period of time. The stimulus is then removed, and some time later, the animal is presented with two (or more) stimuli, one of which normally matches the first. The usual task is to choose the matching sample.

        In a more interesting variation of this problem, we teach what is called oddity matching (or the oddity problem). Here, the animal needs to choose the stimulus that doesn't match, rather than the one that matches. So, for example, we may have a display of three keys in front of a pigeon. The middle key lights up red or green. Then, it goes out and the two other keys, after some delay, come on, one red, and the other green. Our subject has to choose the key that has the different color from the one that had come on earlier.

        Honig has done a number of these experiments. He finds, for example, that pigeons trained on this problem will show good transfer to new colors (say, blue and yellow) without need for excessive training. They have apparently learned the general concept of different. But at the same time, Honig reports that they seem to limit such a concept to the dimension they were trained with. Changing to a new dimension (such as shape or texture) will require renewed learning for the oddity problem. In a sense, the oddity problem is reminiscent of Harlow's work on win-shift problems.

        Many people today are studying animal concepts and categories, especially in primates. I have concentrated on pigeon studies here primarily to show you that a relatively simple species is nevertheless capable of some extraordinarily sophisticated learning. With primates, people have been studying such things as number concepts (can chimps count?), order concepts, three-dimensional object recognition from two-dimensional pictorial representations, and whether primates have a concept of the self or of others as aware entities. It is exciting work that suggests a high-level cognitive apparatus may often be involved in such apparently low-level phenomena as associative learning. But based on what you have learned so far regarding classical and instrumental conditioning, that should not surprise you.

        We can do no better than to end this chapter with another quote from Herrnstein; one which concisely summarizes the theme of this section:

Human language may depend on categorization, but the evidence shows clearly that categorization does not depend on language. Generalization of discriminative stimuli shades into categorization as the contingency of reinforcement is associated with actual objects (or representations of them), rather than the usual abstracted lights or tones of psychological research. The complexity and variability of natural stimuli uncovers a capacity to see through stimulus variation even in relatively simple organisms...To categorize, which is to detect recurrences in the environment despite variations in local stimulus energies, must be so enormous an evolutionary advantage that it may well be universal among living organisms. (1984, p. 257)

        To which I will add my own coda: Studies using impoverished stimuli may give impoverished reflections of the learning capacities of our subjects. In human learning, the categories and pre-existing knowledge that humans bring to the laboratory often influence the results the experimenter obtains. We should expect no less of our animal subjects. The study of animal learning can only benefit from a systematic exploration of a species' categories and ways of perceiving the world. Learning and memory are too tightly interconnected to ignore one while studying the other.


Partial Bibliography

Armstrong, S.L., Gleitman, L.R., & Gleitman, H. (1983). What some concepts might not be. Cognition, 13, 263-308.

Atkinson, R.C., & Shiffrin, R.M. (1968). Human memory: A proposed system and its control processes. In K.W. Spence & J.T. Spence (Eds.), The psychology of learning and motivation: Advances in research and theory (Vol 2). NY: Academic.

Aydin, A., & Pearce, J.M. (1994). Prototype effects in categorization by pigeons. Journal of Experimental Psychology: Animal Behavior Processes, 20, 264-277.

Axelrod, & Guzy

Barsalou, L.W. (1985). Ideals, central tendency, and frequency of instantiation as determinants of graded structure in categories. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11, 629-654.

Barsalou, L.W. (1991). Deriving categories to achieve goals. In G.H. Bower (Ed.), The psychology of learning and motivation (Vol 27). CA: Academic.

Barsalou, L.W., & Sewell, D.R.

Bourne, L.E. (1970). Knowing and using concepts. Psychological Review, 77, 546-556.

Broadbent, D.E. (1954). The role of auditory localization in attention and memory span. Journal of Experimental Psychology, 47, 191-196.

Broadbent, D.E. (1957). A mechanical model for human attention and immediate memory. Psychological Review, 64, 205-215.

Broadbent, D.E. (1958). Perception and communication. Oxford: Pergamon.

Bruner, J.S., Goodnow, J.J., & Austin, G.A. (1956). A study of thinking. NY: Wiley.

Carlson, J.G., & Wielkiewicz, R.M. (1972). Delay of reinforcement in instrumental discrimination learning of rats. Journal of Comparative and Physiological Psychology, 81, 365-370.

Carlson, J.G., & Wielkiewicz, R.M. (1976). Mediators of the effects of magnitude of reinforcement. Learning and Motivation, 7, 184-196.

Cerella, J. (1979). Visual classes and natural categories in the pigeon. Journal of Experimental Psychology: Human Perception and Performance, 5, 68-77.

Cherry, E.C. (1953). Some experiments on the recognition of speech with one and with two ears. Journal of the Acoustical Society of America, 25, 975-979.

Craik, F.I.M., & Tulving, E. (1975). Depth of processing and the retention of words in episodic memory. Journal of Experimental Psychology: General, 104, 268-294.

Deutsch, J.A., & Deutsch, D. (1963). Attention: Some theoretical considerations. Psychological Review, 70, 80-90.

Gibson, E.J. (1969). Perceptual learning and development. NY: Appleton.

Gibson, E.J., Walk, R.D., & Tighe, T.J. (1959). Enhancement and deprivation of visual stimulation during rearing as factors in visual discrimination learning. Journal of Comparative and Physiological Psychology, 52, 74-81.

Gonzalez, R.C., Gentry, B.V., & Bitterman, M.E. (1954). Relational discrimination of intermediate size in the chimpanzee. Journal of Comparative and Physiological Psychology, 47, 385-388.

Guthrie, E.R. (1952). The psychology of learning (Rev. ed.). NY: Harper & Row.

Guttman, N., & Kalish, H.I. (1956). Discriminability and stimulus generalization. Journal of Experimental Psychology, 51, 79-88.

Hanson, H.M. (1959). Effects of discrimination training on stimulus generalization. Journal of Experimental Psychology, 58, 321-334.

Harlow, H.F. (1949). The formation of learning sets. Psychological Review, 56, 51-65.

Harlow, H.F. (1959). Learning sets and error factor theory. In S. Koch (Ed.), Psychology: A study of a science (Vol. 2). NY: McGraw-Hill.

Hasher, L., & Zacks, R. T. (1979). Automatic and effortful processes in memory. Journal of Experimental Psychology: General, 108, 356-388.

Hasher, L., & Zacks, R. T. (1984). Automatic processing of fundamental information: The case of frequency of occurrence. American Psychologist, 39, 1372-1388.

Hastie, R., & Rehder, B. (1998, November). Explanation-based category representations. Paper presented at the 39th Annual Meeting of the Psychonomic Society, Dallas, TX.

Hearst, E. (1971). Differential transfer of excitatory versus inhibitory pretraining to intradimensional discrimination learning in pigeons. Journal of Comparative and Physiological Psychology, 75, 206-215.

Herrnstein, R.J. (1979). Acquisition, generalization, and discrimination reversal of a natural concept. Journal of Experimental Psychology: Animal Behavior Processes, 5, 116-129.

Herrnstein, R.J. (1984). Objects, categories, and discriminative stimuli. In H.L. Roitblat, T.G. Bever, & H.S. Terrace (Eds.), Animal cognition. NJ: Erlbaum.

Herrnstein, R.J., & deVilliers, P.A. (1980). Fish as a natural category for people and pigeons. In G.H. Bower (Ed.), The psychology of learning and motivation (Vol 14). NY: Academic.

Herrnstein, R.J., Loveland, D.H., & Cable, C. (1976). Natural concepts in pigeons. Journal of Experimental Psychology: Animal Behavior Processes, 2, 285-302.

Hearst, E., & Koresko, M.B. (1968). Stimulus generalization and amount of prior training on variable-interval reinforcement. Journal of Comparative and Physiological Psychology, 66, 133-138.

Honig, W.K. (1965). Discrimination, generalization, and transfer on the basis of stimulus differences. In D.J. Mostofsky (Ed.), Stimulus generalization. CA: Stanford U. Press.

Hull, C.L. (1943). Principles of behavior. NY: Appleton-Century-Crofts.

Hulse, S.

Jenkins, H.M., & Harrison, R.H. (1962). Generalization gradients of inhibition following auditory discrimination learning. Journal of the Experimental Analysis of Behavior, 5, 435-441.

Johnston, W.A., & Heinz, S.P. (1978). Flexibility and capacity demands of attention. Journal of Experimental Psychology: General, 107, 420-435.

Johnston, W.A., & Wilson,

Kahneman, D. (1973). Attention and effort. NJ: Prentice-Hall.

Kendler, & Kendler

Kerr, L.M., Ostapoff, E.M., & Rubel, E.W. (1979). Influence of acoustic experience on the ontogeny of frequency generalization gradients in the chicken. Journal of Experimental Psychology: Animal Behavior Processes, 5, 97-115.

Killeen, P. (1978). Superstition: A matter of bias, not detectability. Science, 199, 88-90.

Köhler, W. (1955). Simple structural functions in the chimpanzee and in the chicken. In W.D. Ellis (Ed.), A source book on Gestalt psychology. London: Routledge & Kegan-Paul.

Kohn, B., & Dennis, M. (1972). Observation and discrimination learning in the rat: Specific and nonspecific effects. Journal of Comparative and Physiological Psychology, 78, 292-296.

Kraemer, P.J. (1984). Forgetting of visual discriminations by pigeons. Journal of Experimental Psychology: Animal Behavior Processes, 10, 530-542.

Krechevsky, I. (1932). Hypotheses in rats. Psychological Review, 39, 516-532.

Labov, W. (1973). The boundaries of words and their meanings. In C.-J.N. Bailey & R.W. Shuy (Eds.), New ways of analyzing variation in English. DC: Georgetown U. Press.

Lacey, J.I., & Smith, R.L. (1954). Conditioning and generalization of unconscious anxiety. Science, 120, 1045-1052.

Lashley, K.S. (1929). Brain mechanisms and intelligence. Chicago: U. Chicago Press.

Lashley, K.S., & Wade, M. (1946). The Pavlovian theory of generalization. Psychological Review, 53, 72-87.

Lawrence, D.H. (1949). Acquired distinctiveness of cues. I. Transfer between discriminations on the basis of familiarity with the stimulus. Journal of Experimental Psychology, 39, 770-784.

Lawrence, D.H., & DeRivera, J. (1954). Evidence for relational transposition. Journal of Comparative and Physiological Psychology, 47, 465-471.

Levine, M. (1966). Hypothesis behavior by humans during discrimination learning. Journal of Experimental Psychology, 71, 331-338.

Levine, M. (1975). A cognitive theory of learning. NJ: Erlbaum.

Logan, G. D. (1988). Toward an instance theory of automatization. Psychological Review, 95, 492-527.

Lyons, J., Klipec, W.D., & Steinsultz, G. (1973). The effect of chlorpromazine on discrimination performance and the peak shift. Physiological Psychology, 1, 121-124.

Mackintosh, N.J. (1969). Further analysis of the overlearning reversal effect. Journal of Comparative and Physiological Psychology Monograph Supplement, 67 (Pt. 2), 1-18.

Mackintosh, N.J., & Little, L. (1969). Intradimensional and extradimensional shift learning by pigeons. Psychonomic Science, 14, 5-6.

Marsh, G. (1969). An evaluation of three explanations for the transfer of discrimination effect. Journal of Comparative and Physiological Psychology, 68, 268-275.

McCloskey, M.E., & Glucksberg, S. (1978). Natural categories: Well-defined or fuzzy sets? Memory & Cognition, 6, 462-472.

McKay, D.C. (1973). Aspects of a theory of comprehension, memory, and attention. Quarterly Journal of Experimental Psychology, 25, 22-40.

Medin, D.L., & Schaffer, M.M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Medin, D.L., & Shoben, E.J.

Meyer, D.E., & Schvaneveldt, R.W. (1976). Meaning, memory structure, and mental processes. Science, 192, 27-33.

Moray, N. (1959). Attention in dichotic listening: Affective cues and the influence of instructions. Quarterly Journal of Experimental Psychology, 11, 56-60.


Neisser, U. (1969). Selective reading: A method for the study of visual attention. Nineteenth International Congress of Psychology, London.

Neisser, U., & Becklen, R. (1975). Selective looking: Attending to visually specified events. Cognitive Psychology, 7, 480-494.

Newstead, & Dennis

Nissen, H.W. (1951). Analysis of a complex conditional reaction in chimpanzee. Journal of Comparative and Physiological Psychology, 44, 9-16.

Nissen, M.J., & Bullemer, P. (1987). Attentional requirements of learning: Evidence from performance measures. Cognitive Psychology, 19, 1-32.

Norman, D.A. (1968). Toward a theory of memory and attention. Psychological Review, 75, 522-536.

Nosofsky, R.M. (1991). Tests of an exemplar model for relating perceptual classification and recognition memory. Journal of Experimental Psychology: Human Perception and Performance, 17, 3-27.

Parkman, J.M. (1971). Temporal aspects of digit and letter inequality judgments. Journal of Experimental Psychology, 91, 191-205.

Pavlov, I. (1927). Conditioned reflexes. Oxford: Oxford U. Press.

Peterson, G.B. (1984). How expectancies guide behavior. In H.L. Roitblat, T.G. Bever, & H.S. Terrace (Eds.), Animal cognition (pp. 135-148). NJ: Erlbaum.

Peterson, N. (1962). Effects of monochromatic rearing on the control of responding by wavelength. Science, 136, 774-775.

Posner, M.I., & Keele, S.W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Povinelli, D.J., & Eddy, T.J. (1996). What young chimpanzees know about seeing. Monographs of the Society for Research in Child Development, 61 (2), Serial No. 247.

Reid, L.S. (1953). The development of noncontinuity behavior through continuity learning. Journal of Experimental Psychology, 46, 107-112.

Reynolds, G.S. (1961). Attention in the pigeon. Journal of the Experimental Analysis of Behavior, 4, 203-208.

Riley, D.A., & Leuin, T.C. (1971). Stimulus generalization gradients in chickens reared in monochromatic light and tested with a single wavelength value. Journal of Comparative and Physiological Psychology, 75, 399-402.

Rilling, M. (1977). Stimulus control and inhibitory processes. In W.K. Honig & J.E.R. Staddon (Eds.), Handbook of operant behavior. NJ: Prentice-Hall.

Rosch, E. (1973). Natural categories. Cognitive Psychology, 4, 328-350.

Rosch, E. (1975). Cognitive representations of semantic categories. Journal of Experimental Psychology: General, 104, 192-233.

Rosch, E., & Mervis, C.B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, 7, 573-605.

Rosch, E., Mervis, C.B., Gray, W.D., Johnsen, D.M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-440.

Roth, E., & Shoben, E.J.

Schneider, W., & Shiffrin, R.M. (1977). Controlled and automatic human information processing. I. Detection, search, and attention. Psychological Review, 84, 1-66.

Segal, S.J., & Fusella, V. (1970). Influence of imagined pictures and sounds on detection of visual and auditory signals. Journal of Experimental Psychology, 83, 458-474.

Seward, J.P., & Levy, N. (1949). Sign learning as a factor in extinction. Journal of Experimental Psychology, 39, 660-668.


Smith, E.E., & Medin, D.L. (1981). Categories and concepts. MA: Harvard U. Press.

Spence, K.W. (1936). The nature of discrimination learning in animals. Psychological Review, 43, 427-449.

Sutherland, N.S., & Mackintosh, N.J. (1971). Mechanisms of animal discrimination learning. NY: Academic.

Terrace, H.S. (1963). Discrimination learning with and without "errors." Journal of the Experimental Analysis of Behavior, 6, 1-27.

Thomas, D.R. (1993). A model for adaptation-level effects on stimulus generalization. Psychological Review, 100, 658-673.

Thomas, D.R., Freeman, F., Svinicki, J.G., Burr, D.E.S., & Lyons, J. (1970). Effects of extradimensional training on stimulus generalization. Journal of Experimental Psychology Monograph, 83, 1-21.

Thomas, D.R., & Jones, C.G. (1962). Stimulus generalization as a function of the frame of reference. Journal of Experimental Psychology, 64, 77-80.

Thomas, D.R., & Lopez, L.J. (1962). The effects of delayed testing on generalization slope. Journal of Comparative and Physiological Psychology, 55, 541-544.

Thomas, D.R., Mood, K., Morrison, S., & Wiertelak, E. (1991). Peak shift revisited: A test of alternative interpretations. Journal of Experimental Psychology: Animal Behavior Processes, 17, 130-140.

Thomas, D.R., Windell, B.T., Bakke, I., Kreye, J., Kimose, E., & Aposhyan, H. (1985). Long-term memory in pigeons. I. The role of discrimination problem difficulty assessed by discrimination measures. II. The role of stimulus modality assessed by generalization slope. Learning and Motivation, 16, 464-477.

Trabasso, T.R.

Trapold, M.A. (1970). Are expectancies based upon different positive reinforcing events discriminably different? Learning and Motivation, 1, 129-140.

Treisman, A.M. (1960). Contextual cues in selective listening. Quarterly Journal of Experimental Psychology, 12, 242-248.

Treisman, A.M., & Geffen, G. (1967). Selective attention and cerebral dominance in responding to speech messages. Quarterly Journal of Experimental Psychology, 19, 1-17.

Tyler, Hertel, McCallum, & Ellis,

Underwood, B.

Voeks, V.W. (1950). Formalization and clarification of a theory of learning. Journal of Psychology, 30, 341-363.

Welker, R.L., & McAuley, K. (1978). Reductions in resistance to extinction and spontaneous recovery as a function of changes in transportational and contextual stimuli. Animal Learning and Behavior, 6, 451-457.

Wittgenstein, L. (1953). Philosophical investigations. Oxford: Blackwell.


1. Chapter © 1998 by Claude G. Cech