Chapter 4: The Basic Findings In Instrumental/Operant Conditioning(1)

 
Overview: This chapter is arranged in four major sections. The first presents the background to Instrumental Conditioning, covering the early work by Watson and Thorndike, the basic findings, what they thought was going on, and what some of the standard paradigms for Instrumental Conditioning are. The second presents many of the basic principles that determine when conditioning will occur, and whether it will be excitatory or inhibitory. The third discusses important exceptions to these principles, and examines the complex interactions that can arise. Finally, the fourth section briefly examines several alternative accounts of the type of association that can form. Several additional accounts of learning are introduced here; most notably, Tolman's Cognitive Expectancy Approach and Skinner's Radical Behaviorism. This section closes by examining some of the interrelationships between Instrumental and Classical Conditioning.
 

I. Introduction To Instrumental Conditioning

        We now shift from the topic of classical conditioning to that of instrumental (or as Skinner terms it, operant) conditioning. This topic will prove a bit more complex in its findings. As you will see, however, many of the ideas that were important in classical conditioning will prove relevant here. Indeed, there has long been a debate over whether classical and operant conditioning ought to be regarded as truly different forms of learning. They appear to differ in the sense that classical conditioning generally involves the presence of reflex actions, whereas instrumental conditioning generally involves modifications of voluntary behavior contingent on presence of reinforcers or punishers. Whether that is a sufficient reason to distinguish them is arguable, as we will see later. My sense of the field today is that most theorists would like to see similar theories explain the results in both. Thus, it will not surprise you, for example, that a modified version of the Rescorla-Wagner model has also been proposed for instrumental conditioning.

        Let's start with some historical background.
 

A. Background: Two Early Views Of Instrumental Conditioning

        We will look at two quite different claims about the nature of instrumental conditioning. One comes from Watson, the author of the 1913 behaviorist manifesto, Psychology as the behaviorist views it, and the second comes from Thorndike, who can probably safely be credited with conducting the first truly sophisticated and careful observations of complex animal learning. Their accounts differ in ways that prefigured an important debate about what was needed for learning to occur.

        First, however, let us distinguish instrumental conditioning from classical conditioning. In instrumental conditioning, an animal makes one of a number of possible responses in the presence of some stimulus complex or context. That response may lead to some outcome. We typically define learning in this circumstance as an alteration in some observed characteristic of the response such as its frequency, latency, or amplitude. We will revisit this definition in more detail later, once we have examined several theories of what gets acquired, and why. For now, we can talk about instrumental conditioning as the type of learning involved in navigating a maze, choosing the correct one of several doors to run to, or even performing some response that will be successful in avoiding a future shock. In instrumental conditioning, new responses may be taught that differ from any reflexive response already in the animal's behavioral repertoire.

Watson: Contiguity of S & R

        As you know from Chapter 1, Watson attempted to redefine the field of psychology in response to then-current mentalism. We have looked at several of the assumptions he brought along. Basically, he was an extreme environmentalist who believed that most -- if not all -- of our actions were under the learned control of associations. Based on his knowledge of work being done by people like Pavlov, Watson certainly believed that living things were born with a repertoire of reflexes. However, association quickly acquired control of previously reflexive responses, and indeed helped modify those responses to create new responses. Some idea of Watson's radicalism on this point may be gathered from a very famous quote (1926, p. 10):
Give me a dozen healthy infants, well-formed, and my own specified world to bring them up in and I'll guarantee to take any one at random and train him to become any type of specialist I might select -- doctor, lawyer, artist, merchant-chief and, yes, even beggar-man and thief, regardless of his talents, penchants, tendencies, abilities, vocations, and race of his ancestors.
This was radical in at least two related ways. First, from a scientific perspective, it clearly denied the relevance of genetic or inherited influences on current behavior. And second, from a social perspective, it was about as different a position as one could expect to find from the then-prevailing attitudes about race and class.

        In any case, Watson's primary idea was that an association could form between a stimulus and a response (in addition to the type of association found in classical conditioning). But he was a strict contiguity theorist on the issue of S-R associations: A response made in the presence of a stimulus might associate with it, and under certain circumstances, would be likely to be seen when that stimulus recurred. Those circumstances were defined by essentially two principles. The first, a principle of frequency, stated that the association strengthened each time the response was made to the stimulus, so that all things being equal, a frequent response was much more likely to be emitted by the animal than a less frequent response. In addition, however, there was a principle of recency: All things being equal, a recent response was more likely to be emitted than a less recent response.
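
        To make the two principles concrete, here is a minimal sketch in Python. It is an illustration only, not Watson's own formalization: the function name predict_response, the maze responses, and especially the rule that frequency decides first while recency merely breaks ties are my assumptions, since nothing above says how the two principles ought to be combined.

```python
from collections import Counter

def predict_response(history):
    """Pick the response favored by frequency, then recency.

    history: responses already made to one stimulus, oldest first.
    Combining the principles this way is an assumption for illustration.
    """
    if not history:
        return None
    counts = Counter(history)
    # Position of the most recent occurrence of each response.
    last_seen = {response: i for i, response in enumerate(history)}
    # Compare by frequency first; recency only matters when counts tie.
    return max(counts, key=lambda r: (counts[r], last_seen[r]))

maze_choices = ["left", "right", "right", "left", "right"]
print(predict_response(maze_choices))
print(predict_response(["left", "right"]))
```

        With the first history, "right" is selected because it was made three times against two; with one response of each kind, the more recent response wins. Note that no outcome appears anywhere in the sketch.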

        What you should particularly note about the brief description of Watson's system above is the complete lack of any reference to a reinforcer, a perhaps surprising omission to students who have been introduced to the idea that instrumental/operant conditioning is in large part about the effects of rewards and punishments. That wasn't so for Watson, and it has not always been so for later theorists as widely divergent as Guthrie and Tolman (see below and in the next chapter). But on a preliminary and casual analysis of classical conditioning, the notion of a reward or punishment does not seem greatly relevant in discussing whether the association forms. (Nevertheless, some theorists refer to the UCS as a reinforcer on a broad definition that a reinforcer is what makes a response more likely; presence of the UCS, whether in excitatory-appetitive or excitatory-aversive conditioning, certainly accomplishes that!) Why, then, ought we to include it in instrumental conditioning?

        And even though Watson talked about associations between stimuli and responses, he also allowed for the possibility of associations between responses themselves. Thus, in the case of animals learning to run a maze, the analysis of what is going on will involve a complex series of muscle movements involving motor responses. (Much of our behavior is complex, rather than the execution of simple responses to individual stimulus triggers.) Rather than talk about external stimuli controlling each succeeding muscle movement that gets the animal from Point A to Point B in the maze, Watson claimed that a chain of responses could be linked together that would be initially set off by an external stimulus. Of course, to the extent that any response also involves internal stimulation, one could still analyze chains in terms of stimulus-response links, so that each muscle movement in the chain serves as the response to the previous movement, and the stimulus for the next.

        We have already discussed in Chapter 1 Watson's insistence that thinking could be reduced to subvocal speech. He also conducted experiments in emotional conditioning. In a famous study with Rayner, Watson conditioned a young child, Little Albert, to be afraid of a white rat. Every time Albert played (apparently happily, at first) with the rat, an experimenter would creep up behind Albert and strike a metal bar, making a loud clanging noise that frightened Albert and caused him to cry. After several such occasions (six, in fact), Albert started to cry at the sight of the rat. Note how this could be analyzed from the point of view of classical conditioning: The noise caused the apparent emotional response of fear, whereas the rat served as the CS.

        Given what you know of Watson's views on mentalism, you may be somewhat surprised to discover him talking about the topic of emotions. However, for Watson, emotions were not underlying mentalistic events, but rather, the behavioral components (the crying, the whimpering, the shaking, etc.) observed in reaction to certain stimuli. Thus, Watson maintained a perfect consistency with respect to his position that positivism required dealing strictly with a behavioral level. Perhaps that is why he did not talk about reinforcers. Thorndike, a contemporary of Watson's, was developing a theory of learning based on reinforcers, and although he defined them in a sufficiently behavioristic fashion, he was nevertheless attacked by others for apparently sneaking mentalistic terms back into hard-nosed scientific psychology.

        To reiterate a point I made earlier, later behaviorists have on occasion adopted a strict contiguity approach to learning. Most notable among these as a successor to Watson was Guthrie, whose principle of conditioning stated (1952, p. 23):

A combination of stimuli which has accompanied a movement will on its recurrence tend to be followed by that movement. Note that nothing is here said about...reinforcement or pleasant effects.
As we will see later, such approaches were in part a reaction to work by Tolman and his colleagues suggesting that learning could occur in the absence of rewards or punishers. The question that faces such theorists then becomes one of explaining how and why rewards and punishers seem to influence the course of learning.

Thorndike & Puzzleboxes: Reinforcement-Based Learning

        Rewards and punishers, in contrast, played a pivotal role in the work of Thorndike, who is often credited with founding the field of instrumental conditioning. Thorndike published a monograph in 1898 on his studies with animals such as cats. He set up an experimental apparatus termed a puzzle box: a cage in which the animal was placed, and which could be escaped through the performance of a simple response such as pulling on a rope attached to a door. These studies really involved the first careful, detailed observations of what animals in general learned, as opposed to anecdotal stories collected of amazing things animals did that obviously proved their intelligence. (Television still plays into that sort of approach, needless to say!)

        Thorndike asked a very simple question: Would escape from a puzzle box exhibit any signs of intelligence? Would it display evidence of insight, in which the animal would be able to glance about its environment, understand that the rope was attached to the door, and realize that it needed only to pull on the rope to get out? To answer this question, Thorndike repeatedly placed animals in the same puzzle box, and measured how long it took them to escape. And what he found was that the time to escape decreased only gradually. By the end of the experiment, after 20 or so trials, cats would easily leave the box by performing the appropriate response as soon as they were placed in it. But, their history clearly demonstrated that this had to have been a learned response. In particular, Thorndike pointed out that an animal making the correct response on a given trial early in training would not necessarily choose that same response as its first response on the next trial. So, rather than insight, he concluded that learning involved trial-and-error.

        Trial-and-error refers to the gradual accumulation of correct responses through a slow process of trying out all sorts of possibilities, and slowly weeding out the ones that do not work. As did Watson, Thorndike thought animals were acquiring associations between stimulus configurations (such as the puzzle box) and certain responses. But unlike Watson, he claimed that an additional factor was important in the acquisition of these associations: They would depend on the outcome of the animal's actions. This involved a principle Thorndike termed the Law of Effect. Put briefly, this law claimed that an association between a stimulus and a response would strengthen if the response were followed by a satisfactory state of affairs, and would weaken if the response were followed by an unsatisfactory state of affairs. Thus, Thorndike deliberately included Bentham's notion of hedonistic value as a principle governing the formation of an association, in contrast to Watson. Rather than being a simple contiguity theory, this was a reinforcement theory: In modern terms, learning of an association will occur when there is a reinforcer following a response.
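
        The Law of Effect can be sketched as a simple update rule. What follows is a toy illustration under stated assumptions, not Thorndike's own model: the response names, the step size, the floor on strength, the proportional choice rule, and the simplification that the cat tries every response once per trial are all hypothetical.

```python
def apply_law_of_effect(strengths, response, satisfying, step=0.1):
    """Return updated S-R strengths after one response and its outcome."""
    updated = dict(strengths)
    # A satisfying outcome strengthens the association; an annoying one weakens it.
    change = step if satisfying else -step
    # Assumed floor: no response ever becomes completely impossible.
    updated[response] = max(0.05, updated[response] + change)
    return updated

def choice_probability(strengths, response):
    """Share of total strength held by one response (an assumed choice rule)."""
    return strengths[response] / sum(strengths.values())

strengths = {"pull_rope": 1.0, "scratch_bars": 1.0, "push_wall": 1.0}
curve = []
for trial in range(20):
    # Only rope-pulling opens the door, so only it is followed by a satisfier.
    for response in list(strengths):
        strengths = apply_law_of_effect(
            strengths, response, satisfying=(response == "pull_rope")
        )
    curve.append(round(choice_probability(strengths, "pull_rope"), 2))

print(curve)
```

        The printed values rise step by step rather than jumping to certainty after the first success, which is the gradual pattern Thorndike took as evidence against insight.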

        There are, of course, a number of interpretations available to account for how a reinforcer might operate according to the law of effect. One of the first to come to most people's minds is a teleological or purposive explanation: The animal performs a response because it desires the outcome. But of course, desiring an outcome is a mental state that involves an object not present at the time the animal is performing the response. That type of an explanation would violate the positivist program Watson insisted everyone follow. Thus, as an alternative, we might propose that a positive outcome has an automatic effect of strengthening the association: The animal does not perform the response because it wants the outcome, but rather because the response is strongly associated to the stimulus that is present.

        Here is what Thorndike actually said regarding satisfying and unsatisfying states (1913, p. 2):

By a satisfying state of affairs is meant one which the animal does nothing to avoid, often doing things which maintain or renew it. By an annoying state of affairs is meant one which the animal does nothing to preserve, often doing things which put an end to it.
Although he was accused of using hopelessly mentalistic terms in describing learning as depending on satisfactory or unsatisfactory states, his actual definition provided a clear behavioral test for determining when one or the other state was present. In that sense, it ought to have troubled people no more than Watson's use of the term "emotional."

        Note too that Thorndike did not include the outcome in the association. As we will see, other theorists have claimed that associations to the outcome may also form, so that we can have S-R associations, R-O associations, and even S-O associations. To anticipate how such a model might differ from Thorndike's, a strong S-R association may exist despite a highly unpleasant or unsatisfying outcome: The presence of an R-O association in that event may serve to inhibit the R excited by presence of an associated stimulus.

        Thorndike also proposed another principle, the Law of Exercise (sometimes called the Law of Use). This was essentially a principle of practice, somewhat similar to Watson's notion of frequency: An association would strengthen if practiced. Both laws were revised in his later work: the Law of Effect was essentially restricted to satisfactory outcomes, and the Law of Use was modified to include outcomes rather than simple exercise.

        Thorndike also spoke of the value of different satisfactory states, so that strong satisfiers would do a better job of strengthening an association than weak satisfiers. And as an interesting historical footnote, he actually contradicted one of the major principles of strict contiguity by proposing an early version of belongingness by which some things would be more likely to associate together than others.

        In some sense, Skinner may be regarded as Thorndike's intellectual successor. Skinner proposed similar ideas involving the law of reinforcement and the law of punishment. According to Skinner, a reinforcer was any event that, following a response, made that response more likely, whereas a punisher was any event that had the opposite effect. To try to identify reinforcers and punishers in a way that wasn't completely circular (and also wasn't mentalistic), Skinner imposed a condition of transituationality: A reinforcer or punisher, once identified in terms of its effects on one response, also has to be shown capable of having a similar effect in other situations, on other responses. Otherwise, we find ourselves defining a response as that which, when followed by a reinforcer, increases in frequency. And that type of definition, of course, reciprocally defines responses and reinforcers in terms of one another in an uninteresting, circular fashion.

        With this as background, let us look at some of the basic findings in instrumental conditioning.
 

B. Some Basic Findings

Generalization, Discrimination, & Contrasts

        Many of the basic findings will prove familiar, although there will also be some additional results of interest. But in any case, as was true of classical conditioning, we obtain generalization, discrimination, and contrasts.

        The usual procedure for obtaining generalization involves pairing a response with an outcome in the presence of a specific stimulus, and then presenting other stimuli to see whether there is a similar response to them. As outcomes may be of two sorts (reinforcers and punishers), we may obtain two different types of generalization gradients. The gradient associated with use of a reinforcer is termed the gradient of excitation, whereas the gradient associated with use of a punisher is termed the gradient of inhibition. In an excitatory gradient, we look for responding to novel stimuli that is above the background or baseline or operant level; and in a gradient of inhibition, we look for responding below normal.

        Typically, when a response has been reinforced in the presence of the stimulus, that stimulus is referred to as S+. Similarly, when the response has been punished, the stimulus is referred to as S-. Watson and Rayner used an S- with Little Albert. In their work, they also reported obtaining generalization: Albert developed fear reactions to other stimuli (such as rabbits and coats) involving the features white and fur. Although they had planned on reversing the fear conditioning, Albert's mother removed him from the daycare where they were doing their experiments.

        A good example of a gradient of excitation may be found in the work of Guttman and Kalish. They took four different groups of pigeons and trained them to peck at a colored key. The key differed in color for the four groups (530, 550, 580, or 600 nanometers). Then, in a generalization test, Guttman and Kalish presented a series of 11 colors, one at a time, and simply counted the number of pecks per 6 minute period that each color received. These colors included the original (the S+), 5 colors above the S+ in wavelength, and 5 colors below. Their results appear in Figure 1.

        Several features of these results should be noted. First, the stimulus that received the most pecks for each group was S+: That is where the peak of each generalization gradient may be found. (As you will see later, this need not always be the case. Certain experiences such as discrimination training may alter the peak and shape of a generalization gradient.) Second, there was a relatively smooth drop-off of responding as the wavelengths of the stimuli increasingly differed from the S+. And finally, the curves were symmetric: The left-hand side of each curve looked approximately like the right-hand side.

        Similar features may be found in a gradient of inhibition. Rather than look for a peak, however, we search for a valley representing the lowest level of responding. Here, as the stimuli increasingly differ, we ought to find increasing recovery of responding. Thus, a gradient of inhibition looks a bit like an upside down gradient of excitation. In each, the idea is that similarity of stimuli maps into similarity of responses.
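
        The shapes of the two gradients can be sketched in a few lines of Python. This is a hedged illustration, not a fit to the Guttman and Kalish data: the bell-shaped similarity function, the width, the baseline and peak response rates, and the evenly spaced test wavelengths are all assumptions of mine.

```python
import math

def similarity(stimulus, trained, width=15.0):
    """Bell-shaped similarity between a test stimulus and the trained one (assumed shape)."""
    return math.exp(-((stimulus - trained) ** 2) / (2 * width ** 2))

def excitation_gradient(stimulus, s_plus, baseline=10.0, peak=100.0):
    """Responding rises above baseline as the stimulus approaches S+."""
    return baseline + (peak - baseline) * similarity(stimulus, s_plus)

def inhibition_gradient(stimulus, s_minus, baseline=10.0, floor=0.0):
    """Responding falls below baseline as the stimulus approaches S-."""
    return baseline - (baseline - floor) * similarity(stimulus, s_minus)

# Hypothetical test series: 11 wavelengths (nm) centered on a trained value of 550.
wavelengths = [550 + 10 * k for k in range(-5, 6)]
for w in wavelengths:
    print(w, round(excitation_gradient(w, 550), 1), round(inhibition_gradient(w, 550), 1))
```

        In the printed table the second column peaks at the trained stimulus and falls off symmetrically, whereas the third column bottoms out there and recovers toward baseline on either side.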

        As was true of classical conditioning, there will be occasions in which we want to train an animal to treat apparently similar stimuli as if they were different. As a parent, you might think there to be good reason to train a child to fear rats without desiring that such reactions extend also to rabbits or cats. The standard technique for teaching a discrimination in instrumental/operant conditioning will prove similar to that introduced in classical conditioning: We present the outcome whenever the organism makes the response in the presence of one stimulus, but not in the presence of another. To introduce technical terms, the stimulus that signals an effective response (effective in the sense of producing an outcome) is called the discriminative stimulus or discriminative cue (SD). The stimulus that should come to signal an ineffective response is normally represented with a delta symbol. As I am posting this to the web where delta symbols are a bit tricky to insert into normal text, I will adopt the practice of using S+ and S- in this situation, as well.

        There are other techniques to train a discrimination, as you will see in a later chapter. Rather than associate one stimulus with no outcome, we can associate it with the need for a different response. Thus, perhaps the animal will need to turn to the left for food when a red light is present, but will need to turn to the right for food when an orange light appears. Such a technique is referred to as choice discrimination. Alternatively, we might slowly introduce the second stimulus into the animal's environment, presenting it initially at very low levels of intensity. If the intensity is slowly increased, we may find that our animal has never responded to it, thus forgoing generalization. (You should be wondering whether there is something like latent inhibition going on with this procedure!) This technique is referred to as errorless discrimination. Each technique appears to have different effects on the generalization gradient. In particular, the standard technique using S+ and S- seems to cause the peak to move away from S+, and to the side opposite S-, a phenomenon referred to as peak shift. Moreover, peak shift is typically associated with a gradient that is no longer symmetrical: The gradient appears to be 'bunched up' on the S- side.
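
        One way to see how a peak shift could arise is to suppose that a gradient of excitation around S+ and a weaker gradient of inhibition around S- simply subtract. I stress that this subtraction rule is an assumption introduced here for illustration, as are the bell-shaped curves, the wavelengths, and the weight given to inhibition; the sketch is not offered as the accepted explanation.

```python
import math

def bell(stimulus, center, width=15.0):
    """Bell-shaped gradient centered on a trained stimulus (assumed shape)."""
    return math.exp(-((stimulus - center) ** 2) / (2 * width ** 2))

def net_responding(stimulus, s_plus=550, s_minus=560, inhibition_weight=0.6):
    """Excitation around S+ minus weaker inhibition around S- (assumed rule)."""
    return bell(stimulus, s_plus) - inhibition_weight * bell(stimulus, s_minus)

# Hypothetical test stimuli in 5 nm steps on both sides of S+ and S-.
test_stimuli = list(range(500, 601, 5))
peak = max(test_stimuli, key=net_responding)

print("S+ = 550, S- = 560, peak of net responding =", peak)
for w in (peak - 10, peak, peak + 10):
    print(w, round(net_responding(w), 3))
```

        Under these assumed values, the highest net responding falls on the side of S+ away from S-, and responding drops more steeply on the S- side of the new peak than on the other side.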

        Finally, we may also note the existence of contrast effects associated with these phenomena. There are several types of contrasts found in instrumental conditioning. One that accompanies peak shift is termed behavioral contrast. Hanson, for example, compared discrimination learning and non-discrimination learning groups. The discrimination-learning group displayed a peak shift. Their responding to the S+ dropped off considerably. But the responding to the stimulus that was the new peak increased dramatically. This group displayed about twice as many responses to this untrained stimulus compared to the control group that did not have discrimination training. Thus, in a behavioral contrast, responding occurs to a novel stimulus at a greater level than would be expected on the basis of simple generalization.

Inhibition In Extinction & Punishment

        Extinction in instrumental conditioning will involve essentially the same process of decoupling that we saw in classical conditioning. That is, we generally remove the outcome, after which we ought to see the response return to normal or baseline levels (sometimes referred to as operant levels). As is the case for acquisition (and for classical conditioning), the learning curve for extinction typically involves diminishing returns. How long it takes a response to return to pre-learning (baseline) levels is referred to as its resistance to extinction: Responses that quickly return to baseline levels have a low resistance to extinction, whereas those that take a long time to return to baseline have a high resistance. Resistance to extinction will depend on a number of factors including the value of the outcome (see below), the energy required to make the response (more physically demanding responses generally have lower resistance), and the past history of training (responses learned under partial reinforcement conditions generally have greater resistance to extinction than those acquired under continuous reinforcement conditions: see below and the later chapter on partial reinforcement and extinction).
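
        A curve with diminishing returns, and the notion of resistance to extinction, can be sketched as follows. The proportional decay rule, the response rates, the decay rates, and the tolerance used to decide that responding is "back at baseline" are assumptions chosen for illustration, not values from any study.

```python
def extinction_curve(start, baseline, rate, trials):
    """Response level across extinction trials, starting from the trained level."""
    levels = [start]
    for _ in range(trials):
        # Each trial removes a fixed fraction of what remains above baseline,
        # so the absolute drop shrinks from trial to trial (diminishing returns).
        levels.append(levels[-1] - rate * (levels[-1] - baseline))
    return levels

def resistance_to_extinction(start, baseline, rate, tolerance=1.0, max_trials=1000):
    """Number of trials until responding is within `tolerance` of baseline."""
    level = start
    for trial in range(1, max_trials + 1):
        level -= rate * (level - baseline)
        if level - baseline <= tolerance:
            return trial
    return max_trials

print(extinction_curve(50.0, 10.0, 0.5, 6))
print(resistance_to_extinction(50.0, 10.0, 0.5))  # fast decay: low resistance
print(resistance_to_extinction(50.0, 10.0, 0.2))  # slow decay: high resistance
```

        In this sketch a factor such as partial reinforcement would correspond to a smaller decay rate, and hence to more trials before the response returns to baseline.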

        As was true in classical conditioning, extinction in instrumental conditioning is viewed as a type of inhibitory learning. Following extinction, we obtain the same patterns of spontaneous recovery and relearning that we did with classical conditioning: An extinguished response tends to recur after a while (spontaneous recovery), arguing against any claim that the association acquired during the acquisition phase had actually been destroyed or forgotten. Similarly, pairing the extinguished response with the outcome results in much faster acquisition (relearning), another argument suggesting extinction does not destroy the original learning.

        Moreover, a stimulus associated with extinction appears to act as an aversive stimulus for the animal, suggesting some degree of inhibition. Daly, for example, found that rats would learn to escape from a place where they had earlier expected a reinforcer. When the reward was no longer available, the contextual cues associated with that location were sufficient to motivate the animal to avoid them by learning some new response getting it out of that situation.

        Complicating the picture somewhat is the fact that an outcome may be a punisher. Punishment, of course, often suppresses a response. There has been an argument extending as far back as Thorndike concerning the effectiveness of punishment. Many people have reported that punishers seem to have, at best, temporary suppressive effects on on-going behavior. However, that issue appears to involve the intensity of the punisher. There is now plenty of evidence that highly aversive punishers may have long-lasting effects. According to Bolles, stimuli present when a punisher occurs may become conditioned danger signals that will tend to interfere with on-going behavior by activating the animal's instinctive defenses (SSDRs: species-specific defense reactions). Rats, for example, will run, freeze, or fight. So, in this case, a conditioned suppression-like reaction may occur because one of these responses will be incompatible with other excitatory responses such as pressing a lever for food.

        In the case of a punished response, of course, extinction of that response by no longer associating it with an aversive outcome ought to inhibit the stimulus's ability to act as a danger signal triggering an SSDR. Inhibition of aversion in this case means seeing less aversion.

        One more point while we are (briefly) on the subject of punishment: One of the difficulties theorists have had with the effects of punishment (and with positing a general principle that punished responses decrease in frequency) may be seen from a study by Brown, Martin, and Morrow. They taught rats to run an alleyway to escape shock. Basically, the alleyway was electrified, so the animals needed to run to the goal box (the only non-electrified, safe portion of the alleyway). When the shock was turned off, there was fairly quick extinction of running.

        However, two other groups of rats were also put through an extinction procedure. For one of these groups, the shock was turned off only in the start box and remained on in the rest of the alleyway, so that they would actually be punishing themselves for venturing out of the start area. The other group had the final 2 feet (of a 10-foot alleyway) electrified, so that they would be punished by trying to get to the goal box. Curiously enough, these two groups did not extinguish anywhere near as rapidly: By the 6th day of extinction, they were still running to the goal box, giving themselves needless shocks. Thus, punishment sometimes can actually prolong the response being punished. This effect is called vicious circle behavior.

        Within the framework of a model such as Bolles's theory, a finding like that of Brown et al. may be accounted for in terms of shock continuing to trigger the rat's running SSDR. It is also possible that vicious circle behavior may ensue because of multiple mechanisms. Thus, in another experiment, Badia and Culbertson set up a situation in which shock could be signaled or unsignaled. In signaled shock, a stimulus will come on slightly before the shock. In this study, they allowed rats to learn a response whose only reinforcer involved shocks being signaled. Their animals acquired the response. Moreover, Badia, Culbertson, and Harsh found that given a choice between unsignaled mild shocks of short duration and signaled shocks of longer duration and higher intensity, the animals still performed the response, thus apparently subjecting themselves to more punishment than was necessary. This type of vicious circle behavior seems different from that of Brown et al. Rather than involve danger signals triggering SSDRs, it seems to implicate a tradeoff between severity of the shock and predicting when it ought to occur. On the other hand, Bolles also talks about safety signals that indicate a period free from danger. In the unsignaled condition, there are no safety signals. Thus, this type of vicious circle behavior may well result from an organism's search for safety signals. That ought to remind you a bit of the work on compensatory or antagonistic conditioning, and its adaptive value: Being in a highly aroused and tense physiological state because of the continual presence of danger is physiologically stressful; safety signals help moderate the wear and tear.

Mediated Learning & Secondary Reinforcers

        In classical conditioning, we discussed several types of mediated learning (higher-order conditioning; sensory preconditioning) involving building chains of associations that would allow distant events to become associated together (recall also the Dwyer et al. study presented at the end of the last chapter). One of the major mechanisms for mediation in instrumental conditioning is secondary reinforcement (although, as we will see, there are certainly aspects of classical conditioning that govern this mechanism). Secondary reinforcement is learned reinforcement: an otherwise neutral stimulus that acquires the ability to motivate new learning or performance. Primary reinforcement, by way of contrast, is assumed to operate reflexively because of an organism's genetic makeup.

        Skinner may be credited with first making the distinction between primary and secondary reinforcers. The standard example of secondary reinforcers operating in human societies is the use of money. In our society, money includes round pieces of metal and rectangular pieces of paper that have an extraordinary power to motivate behavior. In other societies, different objects serve a similar function (tooled shells, for instance). These objects are not valuable in themselves (aside from aesthetic considerations of design, etc.), but supposedly take on their value by means of serving as a medium of exchange for intrinsically valuable goods such as food or drink. Presumably, they acquire their reinforcing properties by being associated with primary reinforcers.

        Essentially, then, secondary reinforcers are believed to be conditioned through a process of classical conditioning involving the following set-up:

            CS (neutral stimulus) & UCS (primary reinforcer such as food)

Once we have established a pairing between the CS and a primary reinforcer, we may then test for its value as a secondary reinforcer. Our experimental design would be as follows:

            Group                 Classical Conditioning         Instrumental Acquisition

            experimental     CS & primary RF                 R in presence of S+ followed by CS
            control               (Nothing)                               R in presence of S+ followed by CS

If we see an increase in responding in the experimental group compared to the control group, then our CS has acquired reinforcing properties. This example should make clear why this is an instance of mediated learning: The effect of the CS in the experimental group occurs by virtue of its link with the primary reinforcer or UCS. When that link weakens, the value of the CS as a secondary reinforcer ought also to weaken. Thus, in times of inflation when more money is required to buy the same food, the reinforcing properties of a dollar or five dollars weaken.

        On a classical conditioning analysis of secondary reinforcement, we would expect to obtain findings like those we've already seen in the previous chapters. Several examples of such findings might be mentioned. One involves a study by Egger and Miller. They trained rats using a design similar to this one:

            Group         Classical Conditioning                 Instrumental Acquisition

            1                 CS1 --> CS2 --> primary RF         R followed by CS1 or CS2
            2                 CS1 --> CS2 --> primary RF         R followed by CS1 or CS2
                               CS1 --> ..... --> no RF

As you can see from this design, Group 2 had discrimination training in the sense that presence or absence of CS2 was relevant to predicting presence or absence of the UCS. Not surprisingly, given its better signal value, CS2 turned out to be the secondary reinforcer for the instrumental acquisition phase in this group. But what about Group 1? CS2 certainly has better contiguity with the UCS. However, in terms of signal value, it is not adding anything to what CS1 already predicts. Thus, it is redundant, and we would predict from models like Kamin's or Mackintosh's that CS2 will be blocked. Consistent with this prediction, the secondary reinforcer for instrumental acquisition in Group 1 is CS1, and not CS2.

        Another example is rather cute. It comes from the Brelands, former students of Skinner's who tried to train animals to perform in commercials using the principles they had learned. It is also cute because it involves the notion of money as a secondary reinforcer. In one instance, they attempted to train a pig to roll a (fake) coin into a piggy bank. During the training, the coin was paired with a primary reinforcer, since they wanted to use the coin as a secondary reinforcer for the responses involved in rolling. The procedure worked for a brief while, but then the pig started treating the coin as if it were similar to the food it had been paired with: It tried to root the coin just as it would have rooted real food. This result, termed instinctive drift, is perhaps one of the clearest demonstrations of the involvement of classical conditioning in secondary reinforcers, though it also serves to remind us that Watson's claim about reflexes quickly becoming overwhelmed by learned associations radically overstates the case.

        Secondary reinforcers play an important role in certain aspects of therapy and classroom behavior. In clinical and educational settings, behavior modification techniques based on principles of conditioning are used to try to change unacceptable behavior. These techniques typically include a component of secondary reinforcement by which objects such as poker chips may be accumulated for making desired responses (or avoiding undesired responses), and later traded for privileges such as snacks, movies, pencils, etc. Use of such secondary reinforcers involves the construction of what is called a token economy.

        Other findings relevant to the involvement of classical conditioning in secondary reinforcement include the intensity of the primary reinforcer (more intense primary reinforcers yield more effective secondary reinforcers), the number of times the putative secondary reinforcer is paired with the primary reinforcer, and the delay between these events. You should be able to figure out why models such as Rescorla-Wagner or Wagner's rehearsal model, for example, would support these findings.
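
        As a rough illustration of the Rescorla-Wagner part of that exercise, here is a minimal Python sketch. It assumes the standard single-CS update rule, in which associative strength V changes on each pairing by alpha*beta*(lambda - V), and it treats the asymptote lambda as standing in for the intensity of the primary reinforcer. The function name and the parameter values are my own illustrative choices, not taken from any study, and the sketch says nothing about delay.

```python
def secondary_reinforcer_strength(n_pairings, lam, alpha_beta=0.3):
    """Associative strength V of a CS after n_pairings with a primary reinforcer.

    Uses the single-CS Rescorla-Wagner update: V <- V + alpha_beta * (lam - V).
    lam (the asymptote) is assumed to grow with the intensity of the primary
    reinforcer; alpha_beta is an arbitrary learning-rate value.
    """
    v = 0.0
    for _ in range(n_pairings):
        v += alpha_beta * (lam - v)
    return v


if __name__ == "__main__":
    # Compare a weaker and a stronger primary reinforcer at several
    # numbers of CS-reinforcer pairings.
    for lam in (0.5, 1.0):
        for n in (1, 5, 20):
            v = secondary_reinforcer_strength(n, lam)
            print(f"lambda={lam}, pairings={n}: V={v:.3f}")
```

        In this sketch, V rises with the number of pairings and levels off at lambda, so both more pairings and a more intense primary reinforcer leave the CS with more strength to lend to its role as a secondary reinforcer.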

        One more important phenomenon while we are on the subject of mediated conditioning and secondary reinforcement: Most behavior involves a complex series of responses executed in a certain rapid and relatively smooth order. How is it that each single response can be reinforced? There hardly seems time for that. And how is it that organisms in real environments (rather than the laboratory where a researcher can control reinforcers and stimuli) acquire such complex organizations? The answer to these questions involves the concept of chaining, and will prove to rely heavily on secondary reinforcers.

        We briefly introduced the notion of response chains earlier. An example will illustrate this concept. Let's set ourselves the task of teaching pigeons to Time Warp. The Time Warp is the dance from the Rocky Horror Picture Show. Like all dances, it may be regarded as a series of steps in a chain. In the case of the Time Warp, there are 5 steps (The Rocky Horror Show, 1975):

It's just a jump to the left, and then a step to the right. With your hands on your hips, you bring your knees in tight. But it's the pelvic thrust that really drives you insane. Let's do the Time Warp again.

Normally, we would try to teach a chain backwards. So, we will train the last step first. That involves teaching the pigeon a pelvic thrust. We have our response here, but we need a stimulus and a reinforcer. Let's use a red light for the stimulus (seems appropriate, huh?), and some drink for the reinforcer. Our design then is:

            Phase         Stimulus (CS)         Response                 Reinforcer (UCS)

            1                 red light                  pelvic thrust             drink

        Note particularly that I have also labeled the stimulus a CS, and the reinforcer a UCS. This is meant to suggest that classical conditioning will be going on simultaneously with instrumental conditioning: The stimulus is paired not only with the response, but also with the outcome. Thus, as a result of instrumental conditioning, the animal should do a pelvic thrust to the red light. But, as a result of classical conditioning, the red light ought to become a secondary reinforcer. And that should suggest to you the rest of the design. Here it is in full:

            Phase         Stimulus (CS)         Response                 Reinforcer (UCS)

            1                 red light                  pelvic thrust             drink
            2                 blue light                 knees in tight           red light
            3                 green light              wings on hips            blue light
            4                 yellow light             step to right              green light
            5                 white light               jump to left               yellow light

So, if you look for the moment just at Phase 2, notice that we will reward the pigeon for bringing its knees in tight by following that response with the red light. If the red light is a secondary reinforcer, then the animal will acquire the response. And note too that the red light also serves as the signal for the next step after knees in tight: the pelvic thrust. And finally, note that in Phase 2, we ought to obtain second-order conditioning: Two CSs (the blue and red lights) are being paired. If successful, this means that the blue light now also becomes a secondary reinforcer.

        At the end of this, the sequence will be that a white light serves as the signal for a jump to the left; that's reinforced by the yellow light (thanks to fourth-order conditioning) which also signals Step 2 (a step to the right); that's reinforced by the green light (thanks to third-order conditioning), which signals Step 3 (wings on hips); that's reinforced by the blue light (thanks to second-order conditioning), which signals Step 4 (knees in tight); and that is reinforced by the red light (thanks to first-order conditioning), which finally signals the last step of the dance.
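
        The backward-chaining design can also be summarized as a small data structure. The Python sketch below is only an illustration: the function name build_backward_chain and the tuple layout are hypothetical. It takes the dance steps in the order they are performed and the cue lights in the order they are trained, and builds the phases so that each phase's reinforcer is the cue trained in the phase before it.

```python
def build_backward_chain(steps_in_order, cues_in_training_order, primary_reinforcer):
    """Return training phases as (phase, cue, response, reinforcer) tuples.

    steps_in_order: responses in the order they are performed.
    cues_in_training_order: one cue per step, listed in the order trained
    (so the first cue belongs to the LAST step of the chain).
    """
    if len(steps_in_order) != len(cues_in_training_order):
        raise ValueError("need exactly one cue per step")
    phases = []
    reinforcer = primary_reinforcer
    # Train the last step first, then work backwards through the chain.
    for phase, (response, cue) in enumerate(
        zip(reversed(steps_in_order), cues_in_training_order), start=1
    ):
        phases.append((phase, cue, response, reinforcer))
        # The cue just trained becomes the secondary reinforcer for the next phase.
        reinforcer = cue
    return phases


steps = ["jump to left", "step to right", "wings on hips",
         "knees in tight", "pelvic thrust"]
cues = ["red light", "blue light", "green light", "yellow light", "white light"]
phases = build_backward_chain(steps, cues, "drink")
for row in phases:
    print(row)

# Performance order is the reverse of training order.
performance_cues = [cue for _, cue, _, _ in reversed(phases)]
print(performance_cues)
```

        Reading the phases from last to first gives the order in which the finished dance is actually performed, starting with the white light.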

        We haven't talked about a control experiment for this, but our control would be something like the following:

            Phase         Stimulus (CS)           Response                 Reinforcer (UCS)

            1                 red light                    pelvic thrust               drink
            2                 blue light                   knees in tight            green light
            3                 yellow light               wings on hips             white light
            4                 orange light              step to right               purple light
            5                 white light                 jump to left                green light

        In this control experiment, only the first step ought to be acquired. The secondary reinforcer from the first phase is never used in the later phases, and none of these is ever paired with a primary reinforcer. Indeed, based on work in discrimination training, we might predict that the other colors would become somewhat inhibitory (since they tend to signal absence of UCS).

        But in real-world chains, of course, such individual discriminative cues and reinforcers do not always appear to be present (although you could argue that they are present in a dance in terms of the auditory stimuli represented in the music!). And we can solve that mystery by going back and analyzing responses as having stimulus components. Responses are also being associated with a UCS, so that doing a response can act as a secondary reinforcer! Thus, jumping to the left may be reinforced by stepping to the right, eliminating the need for all of these intervening light stimuli. If you thought our pigeon was caught in a very awkward situation, you were right: By considering the stimulus components of a response, we find a way to make the concept of response chains a lot more realistic, and their execution smoother.

Interference

        Because of the nature of instrumental conditioning, it is possible to have several different responses associated with the same stimulus, or the same responses and stimuli associated with several different outcomes. Under those conditions, sometimes complex patterns of results may be found. In particular, certain combinations of events appear to result in interference. We will consider two sorts of interference briefly in this section, and then revisit the issue in later chapters. The two involve response competition and approach-avoidance conflicts.

        Response competition involves one response interfering with or competing with another. In fact, response competition is one of the theories regarding the process of extinction (see the chapter on partial reinforcement and extinction). The basic idea here is that the animal is being cued to perform incompatible responses.

        An excellent example of response competition occurs in an experiment by Fowler and Miller. They trained rats to run to a goal box. During extinction, all of the rats were shocked on entering the goal box, but half of them were shocked on their front paws, and the other half were shocked on their rear paws. The animals shocked on their front paws jerked back, whereas the animals shocked on their rear paws jerked forwards. Moving forward is a response compatible with running into the goal box, but moving backwards is an incompatible response. Despite the fact that both groups received shock or punishment for entering the goal box, the front-paws group extinguished more rapidly. The new response caused by the shock in this case interfered with the old response.

        Other examples of response competition come from work with humans in the verbal learning paradigm. Here, subjects are often asked to learn a list of word pairs, and tested on how successful they are at recalling the second word when presented with the first as a retrieval cue. So, if you studied a pair such as SHORT-LAKE, the experimenter might say SHORT, and you would need to reply with LAKE. As you may gather, we can identify the first word of a pair as the stimulus term, and the second as the response term. Numerous studies show interference when we ask people to learn several lists in which the same stimulus words are present, but there are different response words. Response competition will certainly not turn out to be the sole explanation of these findings (see, for example, Melton & Irwin, and Postman's review). But it assuredly handles some of what is going on, as we find intrusions of the earlier responses during learning of the later responses.

        As for approach-avoidance conflicts, we may ask what happens when a response is associated with both an aversive and an appetitive outcome. That situation happens more frequently than you might think. In discrimination training, for example, we try to alter the excitatory generalization to the S- by associating it with lack of a reward. But, that means that the inhibition building up for S- may also generalize to the S+, canceling it out, to some extent (one of the explanations for peak shift). Thus, discrimination training involves two stimuli, each of which may be claimed to have some excitatory and some inhibitory components.

        What ought to happen should thus reflect, in some sense, the summation of the excitation and inhibition, as was the case in use of the summation test in classical conditioning (see the discussion of algebraic summation theory in the chapter on attention and categorization for more details).

        As an interesting footnote, Dollard and Miller tried to combine aspects of Freudian psychoanalytic theory and learning theory to describe some of the conflicts humans might be subject to. They identified several different types of conflicts, one of which they termed an approach-avoidance conflict. In this situation, there is a tradeoff between the positive and negative components of making a response. As an example, we might take a rat running down an alleyway to obtain some food. Suppose that the goal box is associated both with food and with a shock. What will the rat do? One analysis of this situation (culled from several different studies) appears in Figure 2.

        In this figure, we look at some measure of strength of a response as a function of how far from the goal the animal is. There are actually two opposed tendencies graphed in this figure: The tendency to approach the goal for a reinforcer, and the tendency to go away from the goal due to punishment. The solid line represents a typical, idealized avoidance gradient: The closer an animal is to an aversive or noxious stimulus, the more vigorously it leaves. As it gets further and further away, its response (running, for example) gets weaker and weaker. In contrast, the approach gradient graphed by the dotted line demonstrates the reverse finding: the closer to a desired reinforcement, the faster or more vigorously the animal approaches it.

        Several additional features of Figure 2 are important. One is that the avoidance gradient is typically steeper than the gradient of approach, and the other is that in this figure, the lines cross. And because they do, we obtain an approach-avoidance conflict, with the spot at which the lines cross representing the conflict point.

        If you look to the right of the conflict point, you will see that approach is stronger than avoidance. Thus, right of this spot, the animal should tend to head towards the goal. But once it passes the conflict point and approaches, then avoidance becomes stronger, driving it back. So, the model predicts that an animal will waver around the conflict point, developing large amounts of frustration in the process. There will be some tendency here for the animal to simply escape this situation, if that is at all an option.

        Finally, an increase or decrease in the amount of reinforcement or punishment in this model will essentially move the relevant gradient up or down. Increasing the punisher, for example, should move the solid line up, and that will result in the conflict point (the spot at which the lines cross) moving further away from the goal. In like fashion, increasing reinforcement moves the approach gradient up, causing the spot at which the lines cross to occur closer to the goal.
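
        The geometry of this model is easy to check numerically. The Python sketch below assumes, purely for illustration, that both gradients are straight lines that fall off with distance from the goal. The intercepts and slopes are made-up numbers rather than values from Dollard and Miller or from Figure 2, and the function name is hypothetical.

```python
def conflict_point(approach_at_goal, approach_slope, avoid_at_goal, avoid_slope):
    """Distance from the goal at which two linear gradients cross.

    Each gradient is modeled as strength(d) = at_goal - slope * d, where d is
    distance from the goal. Returns None unless avoidance is both steeper than
    approach and stronger than approach at the goal, which is the case the
    text describes.
    """
    if avoid_slope <= approach_slope or avoid_at_goal <= approach_at_goal:
        return None
    return (avoid_at_goal - approach_at_goal) / (avoid_slope - approach_slope)


# Illustrative (made-up) gradients: avoidance is steeper than approach.
baseline = conflict_point(10.0, 1.0, 16.0, 3.0)
# A stronger punisher raises the avoidance gradient.
more_punishment = conflict_point(10.0, 1.0, 20.0, 3.0)
# A larger reinforcer raises the approach gradient.
more_reward = conflict_point(14.0, 1.0, 16.0, 3.0)
print(baseline, more_punishment, more_reward)
```

        With these numbers, raising the avoidance gradient pushes the crossing point further from the goal, and raising the approach gradient pulls it closer, as the model predicts.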

        This is by no means all that Dollard and Miller have to say about what approach-avoidance conflicts entail. You may be interested in reading their book on personality and psychotherapy for more information.

The Partial Reinforcement Effect

        A final basic phenomenon we will discuss in this section involves the partial reinforcement effect. Skinner and his colleagues studied various aspects of reinforcement-based learning under what they claimed were 'real-world' conditions. Specifically, they asked what would happen to learning when reinforcers appeared on only some trials. The findings were quite interesting. Namely, with partial reinforcement, there was greater resistance to extinction.

        We will look at partial reinforcement in more detail in a later chapter. For now, let me mention that a number of variables interact with partial reinforcement. In particular, amount of reinforcement will prove to play a pivotal role. In studies such as those conducted by Roberts, we find that animals that have been continuously reinforced will display increased resistance to extinction with small reinforcers. Roberts looked at extinction of alleyway running in rats whose reinforcers ranged from 1 to 25 food pellets. Over 36 extinction trials, there was little evidence of a change in the 1-pellet group, whereas the 25-pellet group was performing at well below half its rate prior to extinction. However, this effect appears to depend on how much training an animal has had during the acquisition phase; it assumes a fairly substantial amount of acquisition (see D'Amato). In contrast, animals that have been partially reinforced will display increased resistance with large reinforcers (see, for example, Ratliff and Ratliff). The first result, in particular, strikes many people as counterintuitive on first coming across it. After all, shouldn't large reinforcers result in better learning, and shouldn't better learning be longer-lasting learning?

        There are in fact a number of explanations for the partial reinforcement effect. For the moment, however, I will mention one to help you remember the results. This is Amsel's Frustration Hypothesis, cited in the first chapter. According to Amsel, continuously reinforced animals will experience more frustration when they lose a large reinforcer. And since frustration acts as an aversive stimulus, these animals will avoid whatever it is that is causing the frustration. So, with more frustration, there is faster learning of avoidance. But in contrast, animals in partial reinforcement are being trained to tolerate frustration. With larger reinforcers, they are trained specifically to handle greater and greater frustration. Thus, when they are placed in the highly frustrating situation of extinction (in which the expected reward fails to materialize), they will be better able to adapt to this situation.

        Number of trials during acquisition also has different effects for continuously and partially reinforced groups. For continuously reinforced groups, more training results in less resistance, whereas for partially reinforced groups, more training results in greater resistance. The increased training in partially reinforced groups translates into better training to tolerate frustration, but the increased training in continuously reinforced groups translates into higher expectation of a reward (and thus, a ruder awakening when it is no longer there).

        In short, extinction is frustrating, because expected rewards don't occur. How much resistance to extinction you will have will thus depend partly on how much frustration you experience during extinction, and on how much frustration you have been trained to tolerate during acquisition. The amount of frustration experienced during extinction depends on the size of the reinforcer you expected. (Not getting an expected $50 is a lot more frustrating than not getting an expected $1.) In addition, continuously reinforced animals have not been trained to tolerate any frustration whatsoever.
 

C. Some Basic Paradigms

        We have already introduced two general paradigms involving acquisition and extinction. In acquisition, an outcome is typically paired with making a response in the presence of a stimulus; in extinction, that pairing typically ceases. Within this broad framework (particularly with respect to acquisition), we may distinguish several additional paradigms.

        In appetitive or approach learning, the animal makes a response that results in a desired reward. This is the type of learning involving reinforcement that we have implicitly and explicitly discussed so far. But it is not the only paradigm based on reinforcement. Another that deserves particular note is omission training, in which an animal has to suppress or withhold a response in order to get its reward. Sheffield, for example, trained dogs to salivate in the presence of a tone associated with food, and then shifted them to omission training. In this latter phase, the dogs had to avoid salivating to the tone for several seconds to get the food. Omission training is typically difficult at first, and displays a relatively slow learning curve. However, there are several studies suggesting that in the long run, it will be as effective as extinction in decreasing the frequency of a response. Omission training is sometimes referred to as negative punishment to indicate that making the response is associated with removal of a reinforcer (which thus acts as a punishment).

        Another paradigm based on reinforcement is escape learning. In escape learning, the animal learns a response that gets it away from punishment, either by turning off the punisher, or by allowing the animal to leave the area where the punishment was administered. Escape learning is closely associated with another paradigm, avoidance learning. In avoidance learning, the punishment is intermittent rather than continuous. If the animal makes the proper response before the punishment comes on, it will succeed in canceling that punishment. In avoidance learning, animals typically start out by escaping the aversive stimulation (making a response during the punishment that stops it), and then come to make the response early enough that they subsequently successfully avoid the aversive stimulation.

        Punishment training (or aversive learning), of course, involves the administration of an unpleasant, aversive outcome following a response. Thus, punishment training, omission training, and extinction all have in common reducing the level of a given response, whereas appetitive learning, escape learning, and avoidance learning attempt to increase response level. There are some obvious interplays in paradigms here, depending on which response you focus on. Often, aspects of several different paradigms combine: One response may be punished while another is reinforced.

        We may also distinguish between signaled and unsignaled learning. A discrete, distinct stimulus is present in signaled learning, but not in unsignaled learning. Thus, for example, in unsignaled avoidance, shocks can occur at regular intervals that could be avoided if the animal responds shortly before the shock's onset. There is no physical stimulus signaling the shock; the animal in this case needs to rely on an internal sense of time. In unsignaled conditions, features such as time or the contextual cues presumably act as stimuli.

        Another paradigm, transfer training, will prove important, especially when we focus on discrimination in a later chapter. In transfer training, we look at the effects of learning one task on another. Transfer might be nonexistent (zero), positive (facilitation: the learning is faster), or negative (inhibition: there is interference). In addition, transfer effects might be proactive (in which we look at the effect of an earlier task on the learning or performance of a later task), or retroactive (in which we look at how the later task influences performance on the earlier one).

        A final paradigm involves shaping. Normally, approach learning applies to responses that are not especially frequent to start with, since we want to track an increase in frequency as one of our measures of learning. Thus, we find ourselves in the following situation: We sit in the lab, watching our animal subject, waiting for it to make the desired response so that we can administer the reinforcer.

        Such a procedure will obviously be inefficient. In some cases (such as a pig rolling a coin), the wait may be very long indeed! Hence, a technology has developed that involves increasing the probability of having the animal emit that response so that we can then train it further through reinforcement. This technology, called shaping, requires reinforcing successive approximations to the desired response.

        Shaping works as follows. We start out by identifying a high-frequency component of the response we want, and we reinforce that. So, if we want our rat to press a bar on the left side of an experimental chamber, then a high-frequency component would involve having the rat be in the left half of the chamber. While it is exploring its environment, we reinforce for crossing over to the left. Then, as it increases its time on the left, we drop the reinforcer. That will cause the behavior to become more variable. We await some response yet closer to what we want to train (such as being near the bar), and when that occurs we reintroduce the reinforcer. And then, of course, we cycle the process through again in order to obtain yet a closer approximation (such as touching the bar). Shaping is a very powerful technique, not only because of its ability to 'coax' low frequency responses out of an animal, but also -- and especially -- because of its ability to mold a response that is not normally part of the animal's repertoire! Thus, by combining shaping and chaining, instrumental conditioning allows us to train totally new responses, rather than just transfer stimulus control of an old response to a new stimulus.
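
        The logic of shaping can be written down as a simple rule for when to deliver the reinforcer. The following Python sketch is a deliberately simplified bookkeeping exercise, not a simulation of a real animal: the list of approximations, the function name shaping_session, and the rule of raising the criterion after a fixed number of reinforced responses are all assumptions made for illustration.

```python
def shaping_session(observed, approximations, raise_after=2):
    """Decide which observed behaviors earn a reinforcer during shaping.

    approximations: behaviors ordered from crudest to the target response.
    A behavior is reinforced if it is at least as close to the target as the
    current criterion. After `raise_after` reinforcers at the current
    criterion, the criterion moves to the next approximation, so the earlier
    approximation no longer pays off.
    """
    rank = {behavior: i for i, behavior in enumerate(approximations)}
    criterion = 0
    count = 0
    log = []
    for behavior in observed:
        level = rank.get(behavior, -1)  # unrelated behaviors rank below everything
        reinforced = level >= criterion
        log.append((behavior, approximations[criterion], reinforced))
        if reinforced:
            count += 1
            if count >= raise_after and criterion < len(approximations) - 1:
                criterion += 1
                count = 0
    return log


approximations = ["in left half", "near bar", "touch bar", "press bar"]
observed = ["grooming", "in left half", "in left half", "in left half",
            "near bar", "near bar", "touch bar", "in left half"]
log = shaping_session(observed, approximations)
for behavior, criterion, reinforced in log:
    print(f"{behavior:>13} | criterion: {criterion:<12} | reinforced: {reinforced}")
```

        Notice that the third visit to the left half goes unreinforced: by then the criterion has moved on to being near the bar, which is the point at which the text says we drop the reinforcer and wait for a closer approximation.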
 

D. A Note About Terminology: Operant vs. Instrumental Learning

        Finally, we ought to note a distinction that is sometimes made between what is termed instrumental conditioning, and operant conditioning. In instrumental conditioning, the emphasis is on a discrete trial, a situation in which there is a clear starting point and a clear terminus. We may measure how long it takes the animal to make the response during the trial, or we may measure the relative probability of the animal's success. So, to take Thorndike's puzzlebox apparatus, the start of the trial occurs when a cat is placed in the puzzlebox, and it ends when the cat has made the escape response. How long this takes is what we are interested in. Similarly, in maze learning, the trial starts with the animal being placed in the start box, and ends when the animal has found its way to the goal box. (Or alternatively, we can specify the trial as what happens in some amount of time from when we have placed the animal in the start box. Where has it gotten to in, say, an allowed 30 seconds? Learning here will show up as increased probability of having made the correct response within the time frame of the trial.) In a third example, choice discrimination, the trial starts when the animal is exposed to two stimuli, and ends when the animal makes a response relevant to one of them. Our interest in this situation typically involves whether the animal has chosen the correct stimulus.

        There are no discrete trials in operant conditioning, on the other hand. A standard apparatus for operant conditioning involves a Skinner Box, a chamber with something that can be manipulated (a key to peck; a bar to press; a lever to move); various discriminative stimuli that may be turned on or off (lights; noises); and means to automatically administer reinforcements or punishments (food or shock dispensers connected to the bar, for example). Particularly with respect to such simple responses as pressing a bar to obtain food, the interest will be more in how rapidly those responses are executed. We don't stop the animal between responses in order to set up another trial. Rather, we typically look at characteristics of response rate over time.

        This distinction between discrete and continuous trials might also be expressed in a slightly different manner. On a discrete trial, you can succeed only once (or perhaps 8 times if we use an apparatus like Olton's radial maze, discussed in the previous chapter), whereas on continuous trials you have the opportunity to obtain virtually unlimited reinforcements. So, the difference between instrumental and operant conditioning in part involves whether there is a constraint on how many reinforcing events an animal can seek out. That having been said, I will generally treat these as equivalent.
 
 

II. Basic Requirements For Effective Conditioning

        Many of the principles for effective conditioning will prove familiar from our discussion of classical conditioning. Thus, number of pairings of a response with an outcome will prove important in characterizing how quickly we see changes in characteristics of the response (its strength, its amplitude, its latency, its probability, etc.) that signal evidence of learning. By the same token, number of times a response fails to be followed by an outcome will be important, not only in describing the course of extinction, but also (as in classical conditioning) in describing the contingency between a response and an outcome. Below, we will briefly consider additional principles having to do with temporal contiguity, outcome characteristics, and contingency.

        Before we do, however, we ought to note several features that make the situation a bit more interesting. First, of course, is the issue of partial reinforcement. We will delay fuller discussion of that to a later chapter. Second, there is the fact that in operant conditioning, an animal is effectively in charge of whether to emit the response or withhold it. Obviously, researchers in classical conditioning may easily arrange pairings of the CS and the UCS to achieve any desired contingency. But in operant conditioning, controlling how many times the reinforcer occurs when a response is emitted versus when a response is not emitted is clearly trickier. Third, because of the presence of three events (stimulus, response, outcome), there are three potential associations to worry about (S-R, S-O, and R-O). That means that we can ask about temporal contiguity (or contingency) not just of response and outcome, but also of stimulus and response, and of stimulus and outcome. The situation thus becomes significantly more complex.
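
        One common way to put a number on a response-outcome contingency is the difference between the probability of the outcome given a response and the probability of the outcome given no response. The Python sketch below assumes that index (often written as delta-P), which is my choice for illustration rather than something specified in this chapter; the observation format and the function name are hypothetical.

```python
def response_outcome_contingency(observations):
    """Compute P(O | R) - P(O | no R) from a list of (responded, outcome) pairs.

    observations: list of (responded, outcome) booleans, one pair per
    observation interval. Returns None if the animal always or never
    responded, because one of the two probabilities is then undefined.
    """
    with_r = [outcome for responded, outcome in observations if responded]
    without_r = [outcome for responded, outcome in observations if not responded]
    if not with_r or not without_r:
        return None
    p_o_given_r = sum(with_r) / len(with_r)
    p_o_given_no_r = sum(without_r) / len(without_r)
    return p_o_given_r - p_o_given_no_r


# Made-up session: 10 intervals with a response (8 reinforced) and
# 10 intervals without a response (1 reinforced).
session = ([(True, True)] * 8 + [(True, False)] * 2
           + [(False, True)] * 1 + [(False, False)] * 9)
print(response_outcome_contingency(session))

# If the animal responds in every interval, the contingency cannot be estimated.
print(response_outcome_contingency([(True, True)] * 5))
```

        The second call illustrates why the animal's control over responding makes matters trickier: the researcher cannot estimate, let alone arrange, the contingency unless the animal supplies intervals both with and without the response.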

        Not all theorists believe that all three associations form. Thorndike, to remind you, accepted only an S-R association, as did Watson. But, researchers such as Rescorla have made a very strong case that the other associations are there, as well. Thus, Colwill and Rescorla used the devaluation paradigm on a reinforcer after the response had been acquired. If a reinforcer's only function is to stamp in the association (as claimed by Thorndike), devaluing the reinforcer ought not to influence the response the animal gives to the stimulus. In the abstract, the design for this type of experiment would be similar to the following:

            Group           Acquisition Phase    Phase 2      Test Phase

            experimental    R to S for RF        RF & LiCl    R to S?
            control         R to S for RF        (Nothing)    R to S?

However, Colwill and Rescorla found a much less vigorous response following devaluation. This must have its effect on an R-O or S-O association.

        What about an S-O association? From our discussion of chaining and higher-order conditioning, you already know that this association forms. Further evidence of this comes from the Rescorla study mentioned in the previous chapter, in which a stimulus that caused higher levels of responding during extinction became inhibitory, as measured by the summation and retardation tests. We had earlier read about a classical conditioning version of that study, but Rescorla also ran the same study with an instrumental conditioning set-up, and obtained the same results. Because the S-R association in these types of experiments is rapidly relearned following extinction while the S remains inhibitory, Rescorla claims that the inhibition doesn't involve the S-R link! And as a final example, consider a classic study by Seward and Levy on a phenomenon termed latent extinction. In their study, two groups of rats learned to run to a goal box for a reward. Following acquisition, one group had the experience of being placed directly in the goal without the reward. Then, both groups were put through extinction:

            Group           Acquisition    Phase 2               Phase 3

            experimental    run for RF     put in goal, no RF    extinction
            control         run for RF     (Nothing)             extinction

In this experiment, the control group extinguished more slowly than the experimental group. Presumably, the stimulus elements of the goal box had now become associated with some inhibition for the experimental group, making their running to it less desirable.

        Below, given the theoretical importance of reinforcement in operant conditioning, we will concentrate on principles having to do with its presence relative to the response.
 

A. Temporal Parameters

        A principle of fairly long standing (and which forms a part of many behavior-level theories) has been that there must be temporal contiguity between the response and the outcome. In fact, many studies report what is called a gradient of reinforcement in approach or appetitive learning: The longer the delay between the response and the reinforcer, the weaker the learning.

        A well-known experiment demonstrating the gradient of reinforcement was conducted by Grice. Grice used a choice discrimination paradigm in which rats had to enter one of two rooms or chambers. The rooms were different colors (black or white), and the rat was reinforced for entering one of these but not the other. However, there were several groups of rats who differed in terms of how long it took to get the reinforcer after choosing the correct color. All rats were immediately placed in a neutral-color room where the reinforcer was given, but one group received their reinforcer immediately, while others had to wait. The group with the longest wait was reinforced after 10 sec. Essentially, Grice found a very rapid fall-off of learning. After about 1 sec, there was no evidence that the discrimination had been learned.

        Depending on the response and the circumstances (see the next major section below), longer delays in which learning still occurs have been reported. As an example, consider a study by Capaldi, in which two groups of rats were trained to run to a goal box. One group was rewarded as soon as it reached the goal box, but the other group had to wait 10 sec for its reward. Both indeed learned to run to the goal box, but the running speed (and the initial velocity out of the start box) was significantly depressed for the 10-sec delay group. Thus, their learning seems to have been affected by the delay.

        Sometimes, a delay of 1 or 5 sec seems to result in no learning, and at other times (as in Capaldi's experiment), longer delays will be tolerated. Generally, however, the speed of learning as measured by vigor or probability of the response (or number of trials to acquire it) will be influenced by the response-reinforcer delay. Extrapolating from Skinner's claims, we may present one theory for why this is so: Namely, as the delay period increases, the odds increase that the animal will perform some other piece of behavior before the reinforcer is given. The association may then form between that response and the outcome, rather than between the effective response and the reward.
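
        The account extrapolated from Skinner can be made concrete with a small sketch. It assumes, purely for illustration, that the animal emits other behaviors at a constant average rate, so that the chance of at least one intervening response grows with the delay. The function name and the rate of 0.5 responses per second are hypothetical and are not taken from any study discussed here.

```python
import math


def p_intervening_response(delay_sec, responses_per_sec=0.5):
    """Probability that at least one other response occurs during the delay.

    Assumption (not from the chapter): other behaviors occur at a constant
    average rate, so the chance of none occurring in the delay is
    exp(-rate * delay).
    """
    if delay_sec < 0 or responses_per_sec < 0:
        raise ValueError("delay and rate must be non-negative")
    return 1.0 - math.exp(-responses_per_sec * delay_sec)


if __name__ == "__main__":
    # Longer delays leave more room for a competing response to be
    # the one that immediately precedes the reinforcer.
    for delay in (0, 1, 5, 10):
        print(delay, round(p_intervening_response(delay), 3))
```

        Under this assumption the probability starts at zero with no delay and rises toward one as the delay lengthens, which is the qualitative pattern the verbal account needs. The actual shape of the gradient of reinforcement is an empirical matter.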

        According to Skinner, temporal contiguity by itself is all that is needed for the formation of an association. Skinner cites the example of superstitious behavior to demonstrate this. In superstitious behavior, animals are reinforced at random, and need perform no response whatsoever. Yet, Skinner in one of his studies reported that pigeons in this circumstance were displaying apparently learned behaviors such as head tossing. He claimed that the reinforcer by dumb luck must have been presented just after the pigeon had tossed its head, so that head tossing was strengthened as a response in this situation. The increased possibility of acquiring superstitious behavior that interferes with other learning might thus partly explain why temporal contiguity is important.

        From the perspective of a more cognitive, representational-level approach, we may posit a similar idea expressed in very different terms. Given the presence of a reinforcer, the animal's task is to determine which of a number of previous responses might be the one that worked. As the number of responses increases, the task becomes more difficult. Moreover, because causes normally result in relatively immediate effects (excepting, of course, situations such as illness or food poisoning: note the relevance to the taste aversions paradigm), organisms may be genetically predisposed to connect recent behavior with the current outcome (a principle of causal recency).

        A similar principle, of course, applies to aversive situations. Fowler and Trapold in an experiment on escape learning varied how long it took for shock to turn off once the rat had run to a goal box. The best learning/performance occurred for a group of rats whose shock was turned off as soon as they entered the goal box. Animals that had to wait a bit for shock to turn off did worse.

        Finally, Boe and Church found that the effectiveness of punishment decreased with delay. Unless punishment is administered very shortly after an animal's response, it will not prove very effective. Dog owners who come home and punish puppies for earlier 'accidents' are most likely to be associating themselves with the aversive outcome, and training fear of the owner and the spot where the dog was punished. That is certainly not the same thing as housebreaking a pet.
 

B. Outcome Strength

        Two types of outcomes have generally been discussed: reinforcement and punishment. Each, however, may be further subdivided into two sorts, positive and negative. Positive outcomes generally involve the presentation of a stimulus that changes a relatively neutral state into the state specified by the outcome. Thus, positive reinforcement (generally referred to without the use of the word "positive") involves the provision of something desirable that normally results in appetitive behavior, and positive punishment (also typically referred to without the modifying adjective) involves the provision of something undesirable that normally results in aversive behavior.

        The other two types of outcomes are negative reinforcement and negative punishment. It will help you to keep these straight by recalling that anything that is labeled a reinforcer, positive or negative, should operate by the law of reinforcement: It ought to increase the response that it follows. Similarly, anything labeled a punisher, positive or negative, ought to work by the law of punishment: It ought to decrease the response that it follows. That having been said, a negative reinforcer takes on its reinforcing properties because some response the animal makes results in removal of aversive stimulation. Negative reinforcement, of course, is the basis for escape learning. And in similar fashion, a negative punisher acquires its punishing properties by virtue of the fact that the animal makes a response leading to removal of a reward or privilege. Thus, positive outcomes involve the presentation of stimulus events, and negative outcomes involve the removal of certain stimulus events.
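
        The four outcome types can be summarized as a two-by-two classification. The short sketch below is only a memory aid; the function and argument names are hypothetical. "Positive" and "negative" refer to whether a stimulus is presented or removed, whereas "reinforcement" and "punishment" refer to whether the response it follows ought to increase or decrease.

```python
def classify_outcome(stimulus_is_desirable, stimulus_is_presented):
    """Return (label, effect on the response) for one of the four outcome types.

    stimulus_is_desirable: True for a reward, False for aversive stimulation.
    stimulus_is_presented: True if the response produces the stimulus,
                           False if the response removes it.
    """
    sign = "positive" if stimulus_is_presented else "negative"
    # Presenting something desirable, or removing something aversive,
    # follows the law of reinforcement; the other two cases follow
    # the law of punishment.
    if stimulus_is_desirable == stimulus_is_presented:
        return f"{sign} reinforcement", "increases"
    return f"{sign} punishment", "decreases"


if __name__ == "__main__":
    for desirable in (True, False):
        for presented in (True, False):
            print(desirable, presented, classify_outcome(desirable, presented))
```

        For example, escape learning corresponds to an aversive stimulus being removed, which the sketch labels negative reinforcement with an increase in the response.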

        With respect to each, there appears to be a general principle that higher levels of strength result in stronger or faster or more vigorous responding, consistent with a claim that outcome strength influences speed of learning. Concerning positive reinforcement, for example, Kraeling taught three groups of rats to run an alleyway for a drink reinforcement that varied in the amount of sucrose concentration (recall that rats have a sweet tooth, so higher sucrose concentrations act as more effective reinforcers). Each group was given one trial per day for 99 days. At the end, they had each reached asymptote as measured by how fast they ran. However, the asymptotes differed for the three groups: The group with the highest sucrose concentration had the fastest asymptotic running speed whereas the group with the lowest concentration had the slowest speed. Crespi found similar results (see below, Figure 3): Rats given large amounts of reinforcement on each trial (64 pellets) showed faster running than rats given small amounts of reinforcement (4 pellets).

        An experiment by Trapold and Fowler can illustrate the operation of this principle with amount of negative reinforcement. They conducted an experiment in which rats had to run to escape shock. Five groups of animals were given 20 trials of escape learning. The groups differed in the intensity of the shock (varying from 120 volts up to 400 volts). Faster acquisition of the escape response occurred with the larger shocks.

        Finally, a classic experiment by Boe and Church may be used to illustrate the principle with positive punishment. Boe and Church trained four groups of rats to press a bar for a reward, and put each through extinction. Prior to extinction, however, three of these groups were put through punishment training in which, for 15 minutes, a bar press gave the animal a shock. The groups differed in intensity of the shock (35, 75, or 220 volts). Thus, the design was as follows:

            Group    Acquisition Phase       Punishment          Extinction Phase

            1        RF for barpress (bp)    (None)              No RF for bp
            2        RF for barpress         bp --> 35 Volts     No RF for bp
            3        RF for barpress         bp --> 75 Volts     No RF for bp
            4        RF for barpress         bp --> 220 Volts    No RF for bp

The question, of course, was how punishment of bar pressing would help speed up removal of that response. Over 9 sessions of extinction training, the group with the weak shock proved not all that different from the group with no punishment: Each engaged in a substantial number of responses during the course of extinction. However, quite different results occurred for the 75 and 220 volt groups: They showed a much lower level of responding during extinction. Indeed, the 220 volt group hardly responded at all! Thus, effectiveness of punishment in suppressing behavior will depend in part on severity of punishment. As the contrast between the control group and the 35 volt group demonstrates, weak punishers may have little permanent effect compared to extinction.

        Of course, there are other variables that will influence the operation of an outcome. As you know from an earlier discussion, aversive stimulation can have the paradoxical effect of increasing the response it is meant to stamp out (vicious circle behavior). Also, the same amount of an outcome packaged in different ways may effectively act as different amounts. Thus, for example, Campbell, Batsche, and Batsche found that a reinforcer divided into smaller amounts worked better than a reinforcer presented as one large amount. And to remind you, those manipulations that seem to promote higher asymptotic levels during acquisition (in continuous reinforcement) generally also promote the fastest extinction.

        There are also contrasts that may occur when an organism experiences several different levels of a reinforcement. An experiment by Crespi will illustrate these. Crespi trained rats to run to a goal box for food (the apparatus here involved a straight alleyway in which rats are released at one end of a corridor or tunnel, and have to run to the other end). One group was given a large reward, a second group was given a medium reward, and a third group was given a small reward. In each case, their running speed was measured. Then, the large-reward and small-reward groups were shifted to the medium reward. Thus, the design was something like the following:

            Group    Phase 1 (acquisition)    Phase 2 (maintenance)

            1        64-pellet reward         16-pellet reward
            2        16-pellet reward         16-pellet reward
            3        4-pellet reward          16-pellet reward

        You will note that I have labeled the two phases here acquisition and maintenance. The rats in the acquisition phase received 20 learning trials, and their average running speed at the end of training was measured. In a maintenance phase, on the other hand, we look at performance after learning has occurred (that is, presumably after the association has formed). In Crespi's study, there were 8 maintenance trials. Figure 3 presents the results after learning, and on the eighth maintenance trial. As you can see from this figure, Group 2 showed some slight increase; not surprising, since additional reinforced pairings ought to result in a stronger association according to most standard theories of instrumental conditioning. But notice what happened to the other two groups: A shift to a much smaller reward caused a negative contrast by which running speed slowed down considerably, whereas a shift to a much larger reward resulted in a corresponding increase (a positive contrast).

        It is important to note that all groups received the same amount of training in Phase 2 (in terms of number of trials and what the reinforcer was). Thus, we might have expected each group to display the same relative improvement. But, that did not happen. Because these contrasts occurred during a post-acquisition period involving identical additional training, they are generally interpreted as demonstrating an effect not on learning (or acquisition), but rather on performance. That argument is particularly compelling for Group 1: They continued to receive additional reinforced training during Phase 2, yet they apparently got worse! Contrasts of this sort are termed incentive contrasts.

        Such contrasts should suggest that perhaps outcome amount is related more to an animal's motivation to perform a response than whether that response gets learned in the first place.

        Contrast effects may occur under a variety of conditions (see, for example, Flaherty's review). They do tend to be temporary, however. Flaherty suggests, in particular, that negative contrasts may reflect frustration at obtaining the less desired reward. Consistent with this, tranquilized animals generally do not exhibit contrasts.
 

C. Contingency

        We have already briefly alluded to the difficulties inherent in controlling contingency between a response and an outcome in operant conditioning. We have also briefly talked about the fact that Skinner (and Watson and Thorndike) claimed that contiguity was all that was needed for learning. As in classical conditioning, however, there are claims that contingency is also required for learning to occur. We will close out this section by considering the evidence that contingency influences learning and performance.

        To start, let us adapt the notion of a contingency space discussed in the previous chapter. Previously, we had looked at the relative probabilities of the UCS when the CS was present or absent. Now, we look at the relative probability of an outcome when the animal makes a response (Probability 1), or withholds it (i.e., does not make the response: Probability 2). Essentially, analogous to what happens in classical conditioning, many theorists will claim that when Probability 1 exceeds Probability 2, there ought to be excitation: The animal is more likely to make the response because the odds of getting a reward increase. (We are assuming reward outcomes rather than punishers here!) In contrast, when Probability 1 is below Probability 2, then it makes more sense for the animal not to respond: The response ought to be inhibited. Finally, when the two probabilities are equal (so that there is zero contingency), we would expect to find no evidence of acquisition.
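
        The comparison of the two probabilities can be written out as a short sketch. The function name is hypothetical, and summarizing the contingency as the simple difference between Probability 1 and Probability 2 is a convenience adopted here for illustration; the predictions apply to reward outcomes only.

```python
def response_outcome_contingency(p_outcome_given_response,
                                 p_outcome_given_no_response):
    """Compare Probability 1 and Probability 2 for a reward outcome.

    Returns (difference, prediction), where the prediction is the one
    that contingency theorists would make for that region of the
    contingency space.
    """
    for p in (p_outcome_given_response, p_outcome_given_no_response):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie between 0 and 1")
    difference = p_outcome_given_response - p_outcome_given_no_response
    if difference > 0:
        prediction = "excitation"      # responding raises the odds of reward
    elif difference < 0:
        prediction = "inhibition"      # withholding raises the odds of reward
    else:
        prediction = "no acquisition"  # zero contingency
    return difference, prediction


if __name__ == "__main__":
    print(response_outcome_contingency(0.5, 0.1))
    print(response_outcome_contingency(0.1, 0.5))
    print(response_outcome_contingency(0.3, 0.3))
```

        Note that in the zero-contingency case the response is still sometimes followed by the outcome, so a pure contiguity account would predict learning there, while the contingency account does not.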

        How well does this notion of contingency hold up? One interesting study that attempts to assess this notion of an operant contingency space was performed by Hammond. Hammond set up different contingencies between bar pressing and a reinforcer for several groups of rats. Whenever the two probabilities were equal, the rats failed to display any change in bar pressing. In these circumstances, there were always some pairings of the response with the outcome that ought to have resulted in the association forming, if Skinner's claims made in the section on superstitious behavior were correct. Indeed, we might regard such a circumstance as similar to a partial reinforcement schedule. Nevertheless, no evidence of learning occurred. In contrast, rats for whom Probability 1 exceeded Probability 2 did increase their bar pressing.

        Moreover, in a follow-up, Hammond found that rats that had already acquired bar pressing stopped when the two probabilities were made equal. This pattern of findings certainly goes against Skinner's claims that contiguity is sufficient. Instead, it strongly suggests an exquisite sensitivity to contingency on the rats' part.

        Why, then, do we obtain superstitious behavior? According to many people, Skinner has probably failed to consider the fact that reinforcers such as food also act as UCSs that elicit certain responses, and that the contextual cues act as a CS that gets conditioned to the UCS. On such an account, superstitious behavior really isn't: It is fairly straightforward classical conditioning of the sort we have been discussing in the last two chapters.

        An example may serve to drive this point home. One claimed example of superstitious behavior has been autoshaping. In autoshaping with pigeons some food is placed next to (or on) a lit key. After a while, pigeons peck the key, although they need not do so to obtain the food. Autoshaping is a very useful tool for training pigeons, because it seems that the pigeons will train themselves, and save you the work. But that avoids analysis of this situation as classical conditioning in which the lighted key serves as the CS. Staddon and Simmelhag, for example, used either food pellets or drink as the reinforcer for different groups of pigeons in an autoshaping paradigm. Pigeons will peck at solid food with a closed beak, but will open the beak slightly to drink the liquid. Staddon and Simmelhag found that pigeons autoshaped with these different reinforcers pecked at the key with the appropriate conditioned response: a closed beak for solid food, and an open beak for liquid. Thus, what appears to be contiguity without contingency in operant conditioning is really a classical conditioning situation in which there is both contiguity and contingency (but the contingency involves the CS and the UCS, rather than a response and an outcome).

        Another experiment done with pigeons was performed by Killeen. The pigeons faced a horizontal array of 3 keys, the middle one of which was lit. They were trained to peck at this middle key. About 5% of the time, one of their pecks at the key would cause its light to go off, and the lights of the two surrounding keys to come on. Another 5% of the time, a computer would automatically turn off the center key while turning on the others. The question Killeen asked was whether the pigeon was aware that it was responsible for this change.

        How can we assess a pigeon's knowledge of the circumstances? Killeen reasoned that a pigeon aware of whether its action had turned off the light would easily be able to learn another response that would depend on that action. So, the experiment was arranged so that the pigeon would get rewarded for pecking at one of the surrounding lights when it was responsible for turning them on, but would have to peck at the other light when it was the computer that had turned them on. That would be a difficult task to accomplish without sensitivity to contingency, but Killeen's pigeons came through. They did indeed show they could discriminate events caused by their own behavior from events caused by some other external cause.

        Such sensitivity is not reported in all studies, however. In a quite clever study by Thomas, rats obtained random free reinforcements, but could also make a response (bar pressing) that would give them a reinforcement on demand. But the catch was, the number of random reinforcers would drop some after the rat made that response. In other words, more reinforcers were available if the rat did not respond. In contrast to the studies above, the rats in this experiment actually did learn to bar press, which resulted in their getting less food!

        One more study on reinforcement-based contingency may be mentioned. This study, done by Watson (not the same Watson who redefined psychology as behaviorism!), used 3-month-old human infants. Watson set up a contingency between their turning their heads and a reinforcer of a mobile above their cribs turning for several seconds. A second group had the same experience of the mobile reinforcer, but the second group's mobile movements had nothing to do with any of their responses. Although both groups initially displayed a great deal of interest and pleasure in the mobile when it started moving, only the contingent group maintained this reaction. Thus, Watson argued that the contingent infants had some sense of mastery over the mobile that the non-contingent infants did not, some awareness that the mobile's movements were due to their own actions. They were sensitive to contingency.

        Contingency will also prove important in escape and avoidance learning. In particular, Seligman and Maier have studied a phenomenon termed learned helplessness. The experimental set-up for learned helplessness typically involves something like the following design:

            Group           Phase 1                         Phase 2

            Experimental    inescapable, noncontingent P    escape learning
            Control         (Nothing)                       escape learning

They find that unavoidable, non-contingent punishment results in the animals in the Experimental Group not learning to escape, once a contingency is set up between an escape response and avoidance of the shock. The Control Group, in contrast, readily acquires the response. According to Seligman's explanation of these results (the cognitive deficit hypothesis), the animals in the Experimental Group have acquired a mistaken belief. Based on the randomness of the shocks and their inability to escape them in Phase 1, they have mistakenly learned that there is no response that will be effective in avoiding or escaping shock. (You may want to compare this to various explanations for learned irrelevance in classical conditioning.) Thus, they cease trying to discover an effective response, so that learning is no longer attempted in Phase 2. Seligman has argued that some similar mechanism in humans may account for certain episodes of depression.
 
 
 

III. Exceptions & Complex Interactions

        We have seen, at least implicitly, an emphasis on reinforcement theories in which a reinforcer is necessary for appetitive or approach learning (see also Thorndike). Indeed, the notion of reinforcement is so pervasive that, as you will see in the next chapter, some theorists have even claimed that extinction depends on there being a reinforcer present during the animal's extinction training! Although there are a variety of associational or behavior-level theories to account for instrumental and operant conditioning, we will direct the exceptions below to the most conservative of these: the claim that learning requires a reinforcer, needs only temporal contiguity, operates through an automatic process of slowly strengthening an association, and stamps in specific muscle movements that increase in likelihood in the presence of the S+.
 

 A. Long-Delay Learning

        We will start with the issue of temporal contiguity. As was true in classical conditioning, we will also find instances of long-delay learning in operant conditioning. Indeed, we might start by noting that the work of Garcia and Koelling, presented previously as an instance of long-delay classical conditioning, might just as easily have been labeled instrumental conditioning: An animal makes a response of drinking saccharin-flavored water, and experiences an outcome of becoming ill several hours later. As this reinterpretation of the learned taste aversions paradigm illustrates, it may sometimes be difficult to draw sharp, clear boundaries between classical and instrumental conditioning (see also the last section in this chapter).

        Leaving aside the taste aversions work, however, there are other studies suggesting relatively long delays are possible. Lieberman, Davidson, and Thomas, for example, presented a series of experiments in which pigeons had to peck the right or left side of a key. They found that some of their animal subjects were able to learn the response even with delays of 7 sec or longer (an extraordinarily long delay for a pigeon). The animals that were able to learn were the ones whose correct response was followed by an unusual event (a marker). In their experiment, the marker involved the key briefly turning a different color (from white to red on its left half and green on its right half) after it had been pecked. Other work by Lieberman and his colleagues has demonstrated that such marking can result in animals learning a discrimination even when the reinforcement is delayed a full minute. Since the marker involved a non-reinforced stimulus occurring after the relevant response but well before the reinforcement, the existence of a marking effect poses a challenge to the idea that temporal contiguity is always necessary for learning.

        Indeed, this study ought to remind you a bit of some of the work we discussed regarding rehearsal and surprisingness (in particular, the work by Wagner, Rudy, and Whitlow and that of Hall and Pearce). Surprising events are apt to be rehearsed more. So, a distinctive surprising event following a response may result in that response being rehearsed for a longer period of time or becoming more distinct in memory (and thus more likely to be sampled as the cause of the reinforcement). Lieberman et al.'s take on this (the marking hypothesis) combines elements of both of the above (1985, p. 622):

[T]he effect of the marker proved to depend critically on what response preceded it: If a correct response was marked on food trials, then correct responding increased; if an incorrect response was marked, then incorrect responding increased. The most plausible explanation for this result, we believe, is that the marker triggered a memory search that focused attention on the preceding response, thereby increasing the likelihood that it would be remembered. [emphasis added]

        Another example of long-delay learning concerns a study by D'Amato, Sarafin, and Salmon. They delayed reinforcers by at least 30 minutes in training trials with rats. In one experiment, animals were placed in one of the two goal boxes of a T-maze (an apparatus that looks like a T, in which the animal runs from the start box at the base of the T to one of the two arms at the top), then put in the start box and fed 30 minutes later. Despite this delay, the animals exhibited differential running to the arm in which they had been placed. Note, in particular, that no additional events or stimulus cues were present during the wait in the start box that may have become associated with the food: Once the animal was let out of the start box, the stimulus cues around the correct arm may have primed the memory of being in that arm.

        Finally, note too that the work we have already mentioned by Olton using the 8-arm radial maze suggests that rats are quite adept at finding food in the maze without retracing their steps, and without generally revisiting an already-visited arm. As they visit arms at random, they would appear to maintain some information in short-term memory concerning which responses have already been made. Given the length of time it takes to visit all 8 arms, this clearly qualifies as a type of long-delay learning.

        Numerous mechanisms for long-delay learning have been proposed. One that plays off of the notion of secondary reinforcers has been proposed by Spence. This involves the mechanism of an anticipatory fractional goal response (rg). Note that the response in this instance is written with a lower-case r rather than an upper-case R. The reason is that the r is treated as one component or fraction of a more complex response, the goal response (Rg), the animal makes on reaching the goal and obtaining its reward. There will be numerous fractions or component responses such as chewing, swallowing, salivating, etc. These get conditioned to the stimulus cues present shortly before the animal enters the goal box. So, the association involves:

            SGoalCues ----------> rg

But since these components or fractions are associated with food, they also become secondary reinforcers through higher-order conditioning. Thus, the cues present as the animal enters the goal area act as a reinforcer before the animal has actually received any food on that trial.

        In addition, these components also have stimulus properties they are associated with, although these, of course, are unlearned: Chewing, swallowing, etc., all cause certain physical sensations. So, the association ought also to include these, as follows (the dots indicate an unlearned association):

            SGoalCues ----------> rg ..... sg

And as was true of the response fraction, we indicate the stimulus fraction with a lower-case s. The rg ..... sg is termed a mediator because it is a unit that may come between a stimulus and a response in a chain of associations.

        In Spence's theory, these mediators become anticipatory; that is, they start being conditioned to earlier and earlier spots in a sequence. Thus, in a complex maze, the stimulus cues right before the cues that lead to the goal box also take on secondary reinforcing properties. If we regard the goal cues as being at spot X, we will take the cues before these as being at spot X-1. Then, through classical conditioning we have:

            SX-1 ----------> SGoalCues ----------> rg ..... sg

Or, to represent this by a shortcut:

            SX-1 ----------> rg ..... sg

And if at this point the animal needs to make a left turn to get to the area of the goal box, then the associations at this point involve:

            SX-1 ----------> rg ..... sg ----------> RLeft

And of course, we may now carry the procedure through to spot X-2 (the cues present before the X-1 cues). Thus, fractional goal responses are effectively conditioned throughout a complex chain in a process that should remind you of our example of the Time Warp. Consistent with this theory, animals do tend to learn a complex maze backwards (although not all results support the theory: In particular, researchers have not found evidence of anticipatory drooling at the various spots or choice points of a maze).
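
        The backward spread of secondary reinforcement described above can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not Spence's own formalism: it borrows a simple error-correcting update (of the kind used in the Rescorla-Wagner model and in temporal-difference learning), and the function names, the learning rate, and the criterion of 0.5 are all hypothetical choices made for illustration. On each trial the animal passes through the spots in order; the goal cues are paired with food, and the cues at every earlier spot are paired with whatever strength the next set of cues had at the start of that trial.

```python
def run_trials(n_spots=5, n_trials=60, rate=0.3, reward=1.0):
    """Return one snapshot per trial of the cue strength at each spot.

    Index 0 is the start of the maze; the last index holds the goal cues.
    All parameter values are illustrative assumptions, not fitted to data.
    """
    strength = [0.0] * n_spots
    history = []
    for _ in range(n_trials):
        # The animal moves from the start toward the goal, so each spot is
        # updated before the spot that follows it on the same trial.
        for spot in range(n_spots):
            if spot == n_spots - 1:
                target = reward              # goal cues are paired with food
            else:
                target = strength[spot + 1]  # earlier cues are paired with the next cues
            strength[spot] += rate * (target - strength[spot])
        history.append(list(strength))
    return history


def trials_to_criterion(history, criterion=0.5):
    """For each spot, the first trial (counting from 1) at which its strength reaches criterion."""
    n_spots = len(history[0])
    result = []
    for spot in range(n_spots):
        first = None
        for trial, snapshot in enumerate(history, start=1):
            if snapshot[spot] >= criterion:
                first = trial
                break
        result.append(first)
    return result


if __name__ == "__main__":
    # Printed from the start of the maze (first entry) to the goal cues (last entry).
    print(trials_to_criterion(run_trials()))
```

        Under these assumptions, the cues nearest the goal reach criterion first and the cues at the start of the maze reach it last, which is the backward order of learning the theory predicts.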

        Thus, in theory, the presence of secondary reinforcers may help to bridge what appears to be a long delay. That would mean that long delays are really much shorter, since we need to assess the delay in terms of the first reinforcer present after a response. In this case, that first reinforcer will be a short-delay secondary reinforcer.

        While secondary reinforcers and anticipatory fractional goal responses might account for some of the long-delay results, they cannot account for all of them. In particular, the study by D'Amato et al. would seem difficult to explain, since the animal is fed in the start box, so that any secondary reinforcers ought to become associated first with the start box rather than with the arm the animal runs to. Similarly, Olton's results would not fall under this mechanism, because secondary reinforcers ought to become associated with an arm the animal has already visited, making it more likely the animal will revisit the same arm on the next trial. But that generally doesn't happen. And that it doesn't happen makes sense, according to foraging theory: An animal foraging in the wild for food is likely to deplete a food source, so there is adaptive value in searching for food in different locations. But searching for food in this manner also requires a memory system that can keep track of where food was found previously, so as to avoid that spot.

        Other factors that make long-delay learning possible include the presence of something to make the correct response distinct, or the occurrence of little intervening activity between the effective response and the reward. We have already discussed distinctiveness in terms of Lieberman's marking hypothesis: A distinct response is likely to be more salient, rehearsed more, and thus more readily available in memory when the occurrence of a reward triggers a search for events that might have been responsible for it. The notion of little intervening activity similarly depends on a memory mechanism. With little or no intervening activity, the last response will still be the one most likely to be recovered from memory. But as activity increases, so do the possibilities of disruption (recall what Wagner, Rudy, and Whitlow found with post-trial episodes!) and of choosing the wrong response as the cause of the reward (response competition).
 

B. Belongingness

        On a strict associational account, an association ought to form between any response and any effective outcome. However, several studies (in addition to the work we have already discussed on learned taste aversions) suggest that this need not be the case.

        One of these studies, conducted by Shettleworth, involved an experiment with hamsters. Shettleworth identified six high-frequency activities in hamsters: face washing, digging, scent marking, hind leg scratching, rearing, and front paw scraping. When each of these was subsequently paired with a food reinforcer, only digging, rearing, and front paw scraping were affected. Such a restriction on the operation of a reinforcer represents a violation of the requirement that reinforcers be transituational: Here, at least, are three responses that a reinforcer of food will not affect.

        Another study illustrating belongingness comes out of the work of Premack. By allowing kids to play with gumball machines (dispensing candy) and pinball (game) machines, Premack identified kids who were players or eaters (based on the relative proportion of time they spent with each machine). He then set up an experiment using the following design:

            Group       Subjects             Response & Reinforcer

            1A             players                     play to eat
            1B             players                     eat to play
            2A             eaters                       play to eat
            2B             eaters                       eat to play

Thus, there was now a contingency between responding on one machine and responding on the other. In one case (play to eat), kids would have to increase their time on the pinball machine to get an opportunity to use the gumball machine; in the other (eat to play), the reverse was required: Kids would have to increase responding to the gumball machine to get a shot at the pinball machine. Only groups 1B and 2A showed learning. Thus, what counts as an effective reinforcer for one child may be completely ineffective for another. (Similar results hold for animals; see the next chapter.)
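
        The pattern of results in this design can be summarized with a simple rule, sketched below in Python. The rule assumed here is that a contingency produces learning only when the activity offered as the reinforcer is preferred over the activity required as the response; this is one reading of the results above, not Premack's own procedure. The baseline proportions and all names in the code are hypothetical.

```python
# Hypothetical baseline proportions of free time spent on each activity.
PLAYER = {"play": 0.7, "eat": 0.3}
EATER = {"play": 0.3, "eat": 0.7}

# Each group: (baseline preferences, required response, contingent reinforcer).
GROUPS = {
    "1A": (PLAYER, "play", "eat"),   # players, play to eat
    "1B": (PLAYER, "eat", "play"),   # players, eat to play
    "2A": (EATER, "play", "eat"),    # eaters, play to eat
    "2B": (EATER, "eat", "play"),    # eaters, eat to play
}


def predicts_learning(baseline, response, reinforcer):
    """Assumed rule: learning occurs only if the reinforcer activity is the preferred one."""
    return baseline[reinforcer] > baseline[response]


def groups_predicted_to_learn(groups=GROUPS):
    """Names of the groups for which the assumed rule predicts learning."""
    return [
        name
        for name, (baseline, response, reinforcer) in groups.items()
        if predicts_learning(baseline, response, reinforcer)
    ]


if __name__ == "__main__":
    print(groups_predicted_to_learn())
```

        Under this rule, the predicted groups are 1B and 2A, the same two groups that showed learning in the experiment.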

        As a further potential illustration of belongingness, we might mention the fact that certain responses that are easy to acquire with positive reinforcement become very difficult to acquire with negative reinforcement. Pecking a key for pigeons, for example, is difficult to train with negative reinforcement (e.g., MacPhail). The explanation for this latter result may have to do with Bolles's theory of safety and danger signals. Negative reinforcement involves the presence of danger signals that trigger SSDRs (species-specific defense reactions). Such responses may well interfere with the desired response, particularly if that desired response involves approaching the danger signal or aversive stimulus! Such a notion is similar to a more general principle of preparedness posited by Seligman: Responses may be ordered on a continuum ranging from prepared responses at one extreme to contraprepared responses at the other. Prepared responses are responses quite similar to what an animal would naturally do in a given situation, whereas contraprepared responses are the exact opposite of what the animal would normally do (approach danger rather than flee from it, for example). According to Seligman's principle, the closer a response is to the prepared end of the continuum, the more easily it should be learned. So, the exact same response may be acceptable in one circumstance, but not in another.
 

C. Acquisition Without Direct Reinforcement

        A number of studies question whether a reinforcer is necessary for forming or strengthening an association. The classic experiment illustrating this phenomenon was done by Tolman and Honzik. Their design involved having rats learn a maze. The rats were given one trial per day for 17 days. The experiment involved the following design:

            Group             Treatment

            1                     no RF; removed when reach goal box
            2                     RF on each day when reach goal box
            3                     no RF until the 11th day

The question Tolman and Honzik asked was how the third group would perform on days 11 through 17. Since these days represented the first time this group had experienced reinforcement, a reinforcement-based account of learning would suggest that these animals started learning only on Day 11. But in fact, on the 12th day, these animals were performing as well as (in fact, slightly better than) the animals reinforced from the beginning (Group 2): They had learned to navigate the maze in the absence of a reinforcer. This finding, termed latent learning, suggests that reinforcement may be more important for performance (motivating an animal to show its knowledge) than for acquisition.
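
        The distinction between acquisition and performance can be made concrete with a toy model. The sketch below is purely illustrative: the learning rate, the low motivation level of 0.2 before any feeding, and the ceiling of 10 errors are invented values, not Tolman and Honzik's data, and the function name and its parameters are hypothetical. The model assumes that errors on a given day depend on what the animal knows multiplied by how motivated it is, and that motivation rises only once the animal has been fed in the goal box. The flag learns_without_reward switches between a latent-learning account (knowledge grows on every exposure) and a reinforcement-only account (knowledge grows only on rewarded days).

```python
def simulate(first_reward_day, learns_without_reward=True,
             n_days=17, learn_rate=0.25, max_errors=10.0):
    """Return simulated errors per day for one group (toy model, invented numbers).

    first_reward_day: day on which food first appears in the goal box,
                      or None if the group is never rewarded.
    """
    knowledge = 0.0        # how much of the maze has been learned (0 to 1)
    has_been_fed = False   # whether food was found on any earlier day
    errors = []
    for day in range(1, n_days + 1):
        # Performance: knowledge is expressed fully only when motivated.
        motivation = 1.0 if has_been_fed else 0.2
        errors.append(max_errors * (1.0 - knowledge * motivation))

        # Acquisition: happens after the run, at the end of the day's trial.
        fed_today = first_reward_day is not None and day >= first_reward_day
        if fed_today or learns_without_reward:
            knowledge += learn_rate * (1.0 - knowledge)
        has_been_fed = has_been_fed or fed_today
    return errors


if __name__ == "__main__":
    group1 = simulate(None)   # never rewarded
    group2 = simulate(1)      # rewarded from Day 1
    group3 = simulate(11)     # rewarded from Day 11
    for day in (11, 12):
        print(day, group1[day - 1], group2[day - 1], group3[day - 1])
```

        With learns_without_reward left at True, Group 3's simulated errors on Day 12 drop to Group 2's level in a single day, as in the latent learning result. With the flag set to False, Group 3 on Day 12 looks like Group 2 on Day 2, which is what a reinforcement-based account of acquisition would predict. The toy model does not reproduce the finding that Group 3 was slightly better than Group 2.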

        A similar result comes from a study by Butler in which monkeys learned a response whose consequence involved being given access to a window looking out on a parking lot. While curiosity might be called a reinforcer, it seems a bit of a stretch in this case. The problem is that we have no way of independently identifying when learning would be expected to occur in the absence of any other reinforcer such as food, and when it would not. When is the animal curious?

        A third study involves the area of observational learning. In a famous experiment by Bandura, kids watched a tape of a clown playing with toys. Children in the vicarious reinforcement group saw the clown being rewarded, whereas children in the vicarious punishment group saw the clown being punished for the way he played. Later, when these kids were given a chance to play with the same toys, the kids in the vicarious reinforcement group displayed the same behaviors: evidence that they had learned by watching. The kids in the punishment group played in a very different manner. But they had acquired the responses as well: When the experimenter asked them to show what the clown had done, they were able to do so. Thus, we find from this study that reinforcement and punishment may have an effect at a distance: Watching others be reinforced can serve as a reinforcer. Such a notion takes us far afield from the original idea of an appetitive stimulus that follows an emitted response (note that the children had not made the response themselves, and note also that all children had learned the response, though some of them had suppressed it until given permission by the experimenter to play the way the clown had played).

        A final study on observational learning illustrates that the notion is not restricted to humans. Kohn and Dennis taught one group of rats a choice discrimination. A second group that were able to watch the training of the first group learned the choice discrimination faster. Both the Kohn and Dennis and the Bandura studies suggest a point we will explore in the next subsection; namely, that responses do not need to first be emi