Chapter 4: The Basic Findings In Instrumental/Operant Conditioning⁽¹⁾

Overview: This chapter is arranged in four major sections. The first presents the background to Instrumental Conditioning, covering the early work by Watson and Thorndike, the basic findings, what they thought was going on, and what some of the standard paradigms for Instrumental Conditioning are. The second presents many of the basic principles that determine when conditioning will occur, and whether it will be excitatory or inhibitory. The third discusses important exceptions to these principles, and examines the complex interactions that can arise. Finally, the fourth section briefly examines several alternative accounts of the type of association that can form. Several additional accounts of learning are introduced here; most notably, Tolman's Cognitive Expectancy Approach and Skinner's Radical Behaviorism. This section closes by examining some of the interrelationships between Instrumental and Classical Conditioning.

I. Introduction To Instrumental Conditioning

We now shift from the topic of classical conditioning to that of instrumental (or as Skinner terms it, operant) conditioning. This topic will prove a bit more complex in its findings. As you will see, however, many of the ideas that were important in classical conditioning will prove relevant here. Indeed, there has long been a debate over whether classical and operant conditioning ought to be regarded as truly different forms of learning. They appear to differ in the sense that classical conditioning generally involves the presence of reflex actions, whereas instrumental conditioning generally involves modifications of voluntary behavior contingent on the presence of reinforcers or punishers. Whether that is a sufficient reason to distinguish them is arguable, as we will see later. My sense of the field today is that most theorists would like to see similar theories explain the results in both. Thus, it will not surprise you, for example, that a modified version of the Rescorla-Wagner model has also been proposed for instrumental conditioning.

Let's start with some historical background.

A. Background: Two Early Views Of Instrumental Conditioning

We will look at two quite different claims about the nature of instrumental conditioning. One comes from Watson, the author of the 1913 behaviorist manifesto, Psychology as the behaviorist views it, and the second comes from Thorndike, who can probably safely be credited with conducting the first truly sophisticated and careful observations of complex animal learning. Their accounts differ in ways that prefigured an important debate about what was needed for learning to occur.

First, however, let us distinguish instrumental conditioning from classical conditioning. In instrumental conditioning, an animal makes one of a number of possible responses in the presence of some stimulus complex or context. That response may lead to some outcome. We typically define learning in this circumstance as an alteration in some observed characteristic of the response such as its frequency, latency, or amplitude. We will revisit this definition in more detail later, once we have examined several theories of what gets acquired, and why. For now, we can talk about instrumental conditioning as the type of learning involved in navigating a maze, choosing the correct one of several doors to run to, or even performing some response that will be successful in avoiding a future shock. In instrumental conditioning, new responses may be taught that differ from any reflexive response already in the animal's behavioral repertoire.

Watson: Contiguity of S & R

As you know from Chapter 1, Watson attempted to redefine the field of psychology in response to then-current mentalism. We have looked at several of the assumptions he brought along. Basically, he was an extreme environmentalist who believed that most -- if not all -- of our actions were under the learned control of associations. Based on his knowledge of work being done by people like Pavlov, Watson certainly believed that living things were born with a repertoire of reflexes. However, association quickly acquired control of previously reflexive responses, and indeed helped modify those responses to create new responses. Some idea of Watson's radicalism on this point may be gathered from a very famous quote (1926, p. 10):

Give me a dozen healthy infants, well-formed, and my own specified world to bring them up in and I'll guarantee to take any one at random and train him to become any type of specialist I might select -- doctor, lawyer, artist, merchant-chief and, yes, even beggar-man and thief, regardless of his talents, penchants, tendencies, abilities, vocations, and race of his ancestors.

This was radical in at least two related ways. First, from a scientific perspective, it clearly denied the relevance of genetic or inherited influences on current behavior. And second, from a social perspective, it was about as different a position as one could expect to find from the then-prevailing attitudes about race and class.

In any case, Watson's primary idea was that an association could form between a stimulus and a response (in addition to the type of association found in classical conditioning). But he was a strict contiguity theorist on the issue of S-R associations: A response made in the presence of a stimulus might associate with it, and under certain circumstances, would be likely to be seen when that stimulus recurred. Those circumstances were defined by essentially two principles. The first, a principle of frequency, stated that the association strengthened each time the response was made to the stimulus, so that all things being equal, a frequent response was much more likely to be emitted by the animal than a less frequent response. In addition, however, there was a principle of recency: All things being equal, a recent response was more likely to be emitted than a less recent response.

What you should particularly note about the brief description of Watson's system above is the complete lack of any reference to a reinforcer, a perhaps surprising omission to students who have been introduced to the idea that instrumental/operant conditioning is in large part about the effects of rewards and punishments. That wasn't so for Watson, and it has not always been so for later theorists as widely divergent as Guthrie and Tolman (see below and in the next chapter). But on a preliminary and casual analysis of classical conditioning, the notion of a reward or punishment does not seem greatly relevant in discussing whether the association forms. (Nevertheless, some theorists refer to the UCS as a reinforcer on a broad definition that a reinforcer is what makes a response more likely; presence of the UCS, whether in excitatory-appetitive or excitatory-aversive conditioning, certainly accomplishes that!) Why, then, ought we to include it in instrumental conditioning?

And even though Watson talked about associations between stimuli and responses, he also allowed for the possibility of associations between responses themselves. Thus, in the case of animals learning to run a maze, the analysis of what is going on will involve a complex series of muscle movements involving motor responses. (Much of our behavior is complex, rather than the execution of simple responses to individual stimulus triggers.) Rather than talk about external stimuli controlling each succeeding muscle movement that gets the animal from Point A to Point B in the maze, Watson claimed that a chain of responses could be linked together that would be initially set off by an external stimulus. Of course, to the extent that any response also involves internal stimulation, one could still analyze chains in terms of stimulus-response links, so that each muscle movement in the chain serves as the response to the previous movement, and the stimulus for the next.

We have already discussed in Chapter 1 Watson's insistence that thinking could be reduced to subvocal speech. He also conducted experiments in emotional conditioning. In a famous study with Rayner, Watson conditioned a young child, Little Albert, to be afraid of a white rat. Every time Albert played (apparently happily, at first) with the rat, an experimenter would creep up behind Albert and strike a metal bar, making a loud clanging noise that frightened Albert and caused him to cry. After several such occasions (six, in fact), Albert started to cry at the sight of the rat. Note how this could be analyzed from the point of view of classical conditioning: The noise caused the apparent emotional response of fear, whereas the rat served as the CS.

Given what you know of Watson's views on mentalism, you may be somewhat surprised to discover him talking about the topic of emotions. However, for Watson, emotions were not underlying mentalistic events, but rather, the behavioral components (the crying, the whimpering, the shaking, etc.) observed in reaction to certain stimuli. Thus, Watson maintained a perfect consistency with respect to his position that positivism required dealing strictly with a behavioral level. Perhaps that is why he did not talk about reinforcers. Thorndike, a contemporary of Watson's, was developing a theory of learning based on reinforcers, and although he defined them in a sufficiently behavioristic fashion, he was nevertheless attacked by others for apparently sneaking mentalistic terms back into hard-nosed scientific psychology.

To reiterate a point I made earlier, later behaviorists have on occasion adopted a strict contiguity approach to learning. Most notable among these as a successor to Watson was Guthrie, whose principle of conditioning stated (1952, p. 23):

A combination of stimuli which has accompanied a movement will on its recurrence tend to be followed by that movement. Note that nothing is here said about...reinforcement or pleasant effects.

As we will see later, such approaches were in part a reaction to work by Tolman and his colleagues suggesting that learning could occur in the absence of rewards or punishers. The question that faces such theorists then becomes one of explaining how and why rewards and punishers seem to influence the course of learning.

Thorndike & Puzzleboxes: Reinforcement-Based Learning

Rewards and punishers, in contrast, played a pivotal role in the work of Thorndike, who is often credited with founding the field of instrumental conditioning. Thorndike published a monograph in 1898 on his studies with animals such as cats. He set up an experimental apparatus termed a puzzle box: a cage in which the animal was placed, and which could be escaped through the performance of a simple response such as pulling on a rope attached to a door. These studies really involved the first careful, detailed observations of what animals in general learned, as opposed to anecdotal stories collected of amazing things animals did that obviously proved their intelligence. (Television still plays into that sort of approach, needless to say!)

Thorndike asked a very simple question: Would escape from a puzzle box exhibit any signs of intelligence? Would it display evidence of insight, in which the animal would be able to glance about its environment, understand that the rope was attached to the door, and realize that it needed only to pull on the rope to get out? To answer this question, Thorndike repeatedly placed animals in the same puzzle box, and measured how long it took them to escape. And what he found was that the time to escape decreased only gradually. By the end of the experiment, after 20 or so trials, cats would easily leave the box by performing the appropriate response as soon as they were placed in it. But, their history clearly demonstrated that this had to have been a learned response. In particular, Thorndike pointed out that an animal making the correct response on a given trial early in training would not necessarily choose that same response as its first response on the next trial. So, rather than insight, he concluded that learning involved trial-and-error.

Trial-and-error refers to the gradual accumulation of correct responses through a slow process of trying out all sorts of possibilities, and slowly weeding out the ones that do not work. As did Watson, Thorndike thought animals were acquiring associations between stimulus configurations (such as the puzzle box) and certain responses. But unlike Watson, he claimed that an additional factor was important in the acquisition of these associations: They would depend on the outcome of the animal's actions. This involved a principle Thorndike termed the Law of Effect. Put briefly, this law claimed that an association between a stimulus and a response would strengthen if the response were followed by a satisfactory state of affairs, and would weaken if the response were followed by an unsatisfactory state of affairs. Thus, Thorndike deliberately included Bentham's notion of hedonistic value as a principle governing the formation of an association, in contrast to Watson. Rather than being a simple contiguity theory, this was a reinforcement theory: In modern terms, learning of an association will occur when there is a reinforcer following a response.
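As a rough illustration of the Law of Effect, the sketch below treats each S-R association as a number that is nudged up after a satisfying outcome and down after an annoying one. This is not Thorndike's own formalism (his statement was verbal); the function name, the step size, the 0-to-1 strength scale, and the choice to treat unsuccessful scratching as "annoying" are all assumptions made purely for illustration.

```python
def law_of_effect_update(strength, outcome, step=0.1):
    """Return the new S-R strength after one trial.

    outcome: +1 for a satisfying state of affairs, -1 for an annoying one,
             0 for no consequence. (Hypothetical coding, for illustration.)
    """
    if outcome > 0:
        # Strengthen toward an assumed ceiling of 1.0.
        strength += step * (1.0 - strength)
    elif outcome < 0:
        # Weaken toward an assumed floor of 0.0.
        strength -= step * strength
    return strength


# Two responses available in the puzzle box, starting equally strong.
strengths = {"pull_rope": 0.2, "scratch_wall": 0.2}

for trial in range(20):
    # Assumption: pulling the rope opens the door (satisfying),
    # whereas scratching leaves the cat confined (annoying).
    strengths["pull_rope"] = law_of_effect_update(strengths["pull_rope"], +1)
    strengths["scratch_wall"] = law_of_effect_update(strengths["scratch_wall"], -1)

print(strengths)
```

Because each trial changes the strengths only a little, the successful response comes to dominate gradually rather than all at once, which is the qualitative pattern Thorndike reported for escape times.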

There are, of course, a number of interpretations available to account for how a reinforcer might operate according to the law of effect. One of the first to come to most people's minds is a teleological or purposive explanation: The animal performs a response because it desires the outcome. But of course, desiring an outcome is a mental state that involves an object not present at the time the animal is performing the response. That type of explanation would violate the positivist program Watson insisted everyone follow. Thus, as an alternative, we might propose that a positive outcome has an automatic effect of strengthening the association: The animal does not perform the response because it wants the outcome, but rather because the response is strongly associated to the stimulus that is present.

Here is what Thorndike actually said regarding satisfying and unsatisfying states (1913, p. 2):

By a satisfying state of affairs is meant one which the animal does nothing to avoid, often doing things which maintain or renew it. By an annoying state of affairs is meant one which the animal does nothing to preserve, often doing things which put an end to it.

Although he was accused of using hopelessly mentalistic terms in describing learning as depending on satisfactory or unsatisfactory states, his actual definition provided a clear behavioral test for determining when one or the other state was present. In that sense, it ought to have troubled people no more than Watson's use of the term "emotional."

Note too that Thorndike did not include the outcome in the association. As we will see, other theorists have claimed that associations to the outcome may also form, so that we can have S-R associations, R-O associations, and even S-O associations. To anticipate how such a model might differ from Thorndike's, a strong S-R association may exist despite a highly unpleasant or unsatisfying outcome: The presence of an R-O association in that event may serve to inhibit the R excited by presence of an associated stimulus.

Thorndike also proposed another principle, the Law of Exercise (sometimes called the Law of Use). This was essentially a principle of practice, somewhat similar to Watson's notion of frequency: An association would strengthen if practiced. Both laws were revised in his later work: the Law of Effect was essentially restricted to satisfactory outcomes, and the Law of Use was modified to include outcomes rather than simple exercise.

Thorndike also spoke of the value of different satisfactory states, so that strong satisfiers would do a better job of strengthening an association than weak satisfiers. And as an interesting historical footnote, he actually contradicted one of the major principles of strict contiguity by proposing an early version of belongingness by which some things would be more likely to associate together than others.

In some sense, Skinner may be regarded as Thorndike's intellectual successor. Skinner proposed similar ideas involving the law of reinforcement and the law of punishment. According to Skinner, a reinforcer was any event that, following a response, made that response more likely, whereas a punisher was any event that had the opposite effect. To try to identify reinforcers and punishers in a way that wasn't completely circular (and also wasn't mentalistic), Skinner imposed a condition of transituationality: A reinforcer or punisher, once identified in terms of its effects on one response, also has to be shown capable of having a similar effect in other situations, on other responses. Otherwise, we find ourselves defining a response as that which, when followed by a reinforcer, increases in frequency. And that type of definition, of course, reciprocally defines responses and reinforcers in terms of one another in an uninteresting, circular fashion.

With this as background, let us look at some of the basic findings in instrumental conditioning.

B. Some Basic Findings

Generalization, Discrimination, & Contrasts

Many of the basic findings will prove familiar, although there will also be some additional results of interest. But in any case, as was true of classical conditioning, we obtain generalization, discrimination, and contrasts.

The usual procedure for obtaining generalization involves pairing a response with an outcome in the presence of a specific stimulus, and then presenting other stimuli to see whether there is a similar response to them. As outcomes may be of two sorts (reinforcers and punishers), we may obtain two different types of generalization gradients. The gradient associated with use of a reinforcer is termed the gradient of excitation, whereas the gradient associated with use of a punisher is termed the gradient of inhibition. In an excitatory gradient, we look for responding to novel stimuli that is above the background or baseline or operant level; and in a gradient of inhibition, we look for responding below normal.

Typically, when a response has been reinforced in the presence of the stimulus, that stimulus is referred to as S+. Similarly, when the response has been punished, the stimulus is referred to as S-. Watson and Rayner used an S- with Little Albert. In their work, they also reported obtaining generalization: Albert developed fear reactions to other stimuli (such as rabbits and coats) involving the features white and fur. Although they had planned on reversing the fear conditioning, Albert's mother removed him from the daycare where they were doing their experiments.

A good example of a gradient of excitation may be found in the work of Guttman and Kalish. They took four different groups of pigeons and trained them to peck at a colored key. The key differed in color for the four groups (530, 550, 580, or 600 nanometers). Then, in a generalization test, Guttman and Kalish presented a series of 11 colors, one at a time, and simply counted the number of pecks per 6-minute period that each color received. These colors included the original (the S+), 5 colors above the S+ in wavelength, and 5 colors below. Their results appear in Figure 1.

Several features of these results should be noted. First, the stimulus that received the most pecks for each group was S+: That is where the peak of each generalization gradient may be found. (As you will see later, this need not always be the case. Certain experiences such as discrimination training may alter the peak and shape of a generalization gradient.) Second, there was a relatively smooth drop-off of responding as the wavelengths of the stimuli increasingly differed from the S+. And finally, the curves were symmetric: The left-hand side of each curve looked approximately like the right-hand side.

Similar features may be found in a gradient of inhibition. Rather than look for a peak, however, we search for a valley representing the lowest level of responding. Here, as the stimuli increasingly differ, we ought to find increasing recovery of responding. Thus, a gradient of inhibition looks a bit like an upside down gradient of excitation. In each, the idea is that similarity of stimuli maps into similarity of responses.
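The shapes just described can be mimicked with a simple bell curve. The sketch below uses an assumed functional form (a Gaussian over wavelength) and is not a fit to Guttman and Kalish's data; the baseline, height, and width values are made up for illustration.

```python
import math

def gradient(stimulus_nm, trained_nm, baseline, height, width_nm=20.0):
    """Response rate to a test stimulus, assuming a Gaussian falloff.

    height > 0 gives a gradient of excitation (a peak above baseline);
    height < 0 gives a gradient of inhibition (a valley below baseline).
    """
    distance = stimulus_nm - trained_nm
    return baseline + height * math.exp(-(distance ** 2) / (2 * width_nm ** 2))

# 11 hypothetical test stimuli: the trained value (550 nm), 5 below, 5 above.
test_wavelengths = range(500, 601, 10)

excitation = [gradient(w, 550, baseline=10, height=90) for w in test_wavelengths]
inhibition = [gradient(w, 550, baseline=50, height=-40) for w in test_wavelengths]

for w, e, i in zip(test_wavelengths, excitation, inhibition):
    print(f"{w} nm  excitation={e:6.1f}  inhibition={i:6.1f}")
```

In this toy version, the excitatory curve peaks at the trained stimulus, the inhibitory curve bottoms out there, and both are symmetric, so the inhibitory gradient is literally the excitatory one turned upside down around a different baseline.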

As was true of classical conditioning, there will be occasions in which we want to train an animal to treat apparently similar stimuli as if they were different. As a parent, you might think there to be good reason to train a child to fear rats without desiring that such reactions extend also to rabbits or cats. The standard technique for teaching a discrimination in instrumental/operant conditioning will prove similar to that introduced in classical conditioning: We present the outcome whenever the organism makes the response in the presence of one stimulus, but not in the presence of another. To introduce technical terms, the stimulus that signals an effective response (effective in the sense of producing an outcome) is called the discriminative stimulus or discriminative cue (S^D). The stimulus that should come to signal an ineffective response is normally represented with a delta symbol. As I am posting this to the web where delta symbols are a bit tricky to insert into normal text, I will adopt the practice of using S+ and S- in this situation, as well.

There are other techniques to train a discrimination, as you will see in a later chapter. Rather than associate one stimulus with no outcome, we can associate it with the need for a different response. Thus, perhaps the animal will need to turn to the left for food when a red light is present, but will need to turn to the right for food when an orange light appears. Such a technique is referred to as choice discrimination. Alternatively, we might slowly introduce the second stimulus into the animal's environment, presenting it initially at very low levels of intensity. If the intensity is slowly increased, we may find that our animal has never responded to it, thus forgoing generalization. (You should be wondering whether there is something like latent inhibition going on with this procedure!) This technique is referred to as errorless discrimination. Each technique appears to have different effects on the generalization gradient. In particular, the standard technique using S+ and S- seems to cause the peak to move away from S+, and to the side opposite S-, a phenomenon referred to as peak shift. Moreover, peak shift is typically associated with a gradient that is no longer symmetrical: The gradient appears to be 'bunched up' on the S- side.
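One way to see how a shifted peak and a lopsided gradient could arise together is to subtract an inhibitory gradient centered on S- from an excitatory gradient centered on S+. This subtraction idea is an assumption brought in only for illustration (it is one possible account, not something established in the discussion above), and the wavelengths, heights, and widths are hypothetical.

```python
import math

def bell(x, center, height, width):
    """Gaussian-shaped gradient (an assumed form, for illustration)."""
    return height * math.exp(-((x - center) ** 2) / (2 * width ** 2))

S_PLUS, S_MINUS = 550, 560  # hypothetical wavelengths in nm

def net_responding(x):
    # Excitation spreading from S+ minus inhibition spreading from S-.
    excitation = bell(x, S_PLUS, height=100, width=20)
    inhibition = bell(x, S_MINUS, height=60, width=20)
    return excitation - inhibition

wavelengths = range(500, 601, 5)
peak = max(wavelengths, key=net_responding)

print("S+ =", S_PLUS, " S- =", S_MINUS, " peak of net gradient =", peak)
print("10 nm toward S-:   ", round(net_responding(S_PLUS + 10), 1))
print("10 nm away from S-:", round(net_responding(S_PLUS - 10), 1))
```

With these values the maximum of the net curve falls on the side of S+ away from S-, and responding drops off more steeply on the S- side. Note that simple subtraction shifts the peak but does not by itself raise its height.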

Finally, we may also note the existence of contrast effects associated with these phenomena. There are several types of contrasts found in instrumental conditioning. One that accompanies peak shift is termed behavioral contrast. Hanson, for example, compared discrimination learning and non-discrimination learning groups. The discrimination-learning group displayed a peak shift. Their responding to the S+ dropped off considerably. But the responding to the stimulus that was the new peak increased dramatically. This group displayed about twice as many responses to this untrained stimulus compared to the control group that did not have discrimination training. Thus, in a behavioral contrast, responding occurs to a novel stimulus at a greater level than would be expected on the basis of simple generalization.

Inhibition In Extinction & Punishment

Extinction in instrumental conditioning will involve essentially the same process of decoupling that we saw in classical conditioning. That is, we generally remove the outcome, after which we ought to see the response return to normal or baseline levels (sometimes referred to as operant levels). As is the case for acquisition (and for classical conditioning), the learning curve for extinction typically involves diminishing returns. How long it takes a response to return to pre-learning (baseline) levels is referred to as its resistance to extinction: Responses that quickly return to baseline levels have a low resistance to extinction, whereas those that take a long time to return to baseline have a high resistance. Resistance to extinction will depend on a number of factors including the value of the outcome (see below), the energy required to make the response (more physically demanding responses generally have lower resistance), and the past history of training (responses learned under partial reinforcement conditions generally have greater resistance to extinction than those acquired under continuous reinforcement conditions: see below and the later chapter on partial reinforcement and extinction).
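The diminishing-returns shape of an extinction curve, and the notion of resistance to extinction, can be sketched with a simple proportional-decay rule. The rule itself, the decay rates, and the criterion of "within 5% of the original excess over baseline" are all assumptions chosen for illustration, not measurements from any study.

```python
def extinction_curve(start_rate, baseline, decay, trials):
    """Response rate on each extinction trial.

    Each trial removes a fixed fraction (decay) of the remaining excess
    over baseline, so the earliest trials show the largest drops.
    """
    rates = []
    rate = start_rate
    for _ in range(trials):
        rates.append(rate)
        rate = baseline + (1.0 - decay) * (rate - baseline)
    return rates

def trials_to_criterion(rates, baseline, start_rate, fraction=0.05):
    """Trials needed before responding is within `fraction` of the
    original excess over baseline (a hypothetical resistance measure)."""
    threshold = baseline + fraction * (start_rate - baseline)
    for trial, rate in enumerate(rates):
        if rate <= threshold:
            return trial
    return None  # criterion not reached within the simulated trials

# Two hypothetical responses that differ only in how fast they extinguish.
fast = extinction_curve(start_rate=100, baseline=10, decay=0.30, trials=60)
slow = extinction_curve(start_rate=100, baseline=10, decay=0.08, trials=60)

print("low resistance: ", trials_to_criterion(fast, 10, 100), "trials")
print("high resistance:", trials_to_criterion(slow, 10, 100), "trials")
```

In this sketch a single parameter (the decay rate) stands in for everything that affects resistance to extinction; factors such as outcome value, response effort, or a history of partial reinforcement would be modeled as changes to that rate.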

As was true in classical conditioning, extinction in instrumental conditioning is viewed as a type of inhibitory learning. Following extinction, we obtain similar patterns of spontaneous recovery and relearning that we did with classical conditioning: An extinguished response tends to recur after a while (spontaneous recovery), arguing against any claim that the association acquired during the acquisition phase had actually been destroyed or forgotten. Similarly, pairing the extinguished response with the outcome results in much faster acquisition (relearning), another argument suggesting extinction does not destroy the original learning.

Moreover, a stimulus associated with extinction appears to act as an aversive stimulus for the animal, suggesting some degree of inhibition. Daly, for example, found that rats would learn to escape from a place where they had earlier expected a reinforcer. When the reward was no longer available, the contextual cues associated with that location were sufficient to motivate the animal to avoid them by learning some new response getting it out of that situation.

Complicating the picture somewhat is the fact that an outcome may be a punisher. Punishment, of course, often suppresses a response. There has been an argument extending as far back as Thorndike concerning the effectiveness of punishment. Many people have reported that punishers seem to have, at best, temporary suppressive effects on on-going behavior. However, that issue appears to involve the intensity of the punisher. There is now plenty of evidence that highly aversive punishers may have long-lasting effects. According to Bolles, stimuli present when a punisher occurs may become conditioned danger signals that will tend to interfere with on-going behavior by activating the animal's instinctive defenses (SSDRs: species-specific defense reactions). Rats, for example, will run, freeze, or fight. So, in this case, a conditioned suppression-like reaction may occur because one of these responses will be incompatible with other excitatory responses such as pressing a lever for food.

In the case of a punished response, of course, extinction of that response by no longer associating it with an aversive outcome ought to inhibit the stimulus's ability to act as a danger signal triggering an SSDR. Inhibition of aversion in this case means seeing less aversion.

One more point while we are (briefly) on the subject of punishment: One of the difficulties theorists have had with the effects of punishment (and with positing a general principle that punished responses decrease in frequency) may be seen from a study by Brown, Martin, and Morrow. They taught rats to run an alleyway to escape shock. Basically, the alleyway was electrified, so the animals needed to run to the goal box (the only non-electrified, safe portion of the alleyway). When the shock was turned off, there was fairly quick extinction of running.

However, two other groups of rats were also put through an extinction procedure. For one of these groups, the shock was turned off only in the start box while the rest of the alleyway stayed electrified, so that they would actually be punishing themselves for venturing out of the start area. The other group had only the final 2 feet (of a 10-foot alleyway) electrified, so that they would be punished for trying to get to the goal box. Curiously enough, these two groups did not extinguish anywhere near as rapidly: By the 6th day of extinction, they were still running to the goal box, giving themselves needless shocks. Thus, punishment sometimes can actually prolong the response being punished. This effect is called vicious circle behavior.

Within the framework of a model such as Bolles's theory, a finding like that of Brown et al. may be accounted for in terms of shock continuing to trigger the rat's running SSDR. It is also possible that vicious circle behavior may ensue because of multiple mechanisms. Thus, in another experiment, Badia and Culbertson set up a situation in which shock could be signaled or unsignaled. In signaled shock, a stimulus will come on slightly before the shock. In this study, they allowed rats to learn a response whose only reinforcer involved shocks being signaled. Their animals acquired the response. Moreover, Badia, Culbertson, and Harsh found that given a choice between unsignaled mild shocks of short duration and signaled shocks of longer duration and higher intensity, the animals still performed the response, thus apparently subjecting themselves to more punishment than was necessary. This type of vicious circle behavior seems different from that of Brown et al. Rather than involve danger signals triggering SSDRs, it seems to implicate a tradeoff between severity of the shock and predicting when it ought to occur. On the other hand, Bolles also talks about safety signals that indicate a period free from danger. In the unsignaled condition, there are no safety signals. Thus, this type of vicious circle behavior may well result from an organism's search for safety signals. That ought to remind you a bit of the work on compensatory or antagonistic conditioning, and its adaptive value: Being in a highly aroused and tense physiological state because of the continual presence of danger is physiologically stressful; safety signals help moderate the wear and tear.

Mediated Learning & Secondary Reinforcers

In classical conditioning, we discussed several types of mediated learning (higher-order conditioning; sensory preconditioning) involving building chains of associations that would allow distant events to become associated together (recall also the Dwyer et al. study presented at the end of the last chapter). One of the major mechanisms for mediation in instrumental conditioning is secondary reinforcement (although, as we will see, there are certainly aspects of classical conditioning that govern this mechanism). Secondary reinforcement is learned reinforcement: an otherwise neutral stimulus that acquires the ability to motivate new learning or performance. Primary reinforcement, by way of contrast, is assumed to operate reflexively because of an organism's genetic makeup.

Skinner may be credited with first making the distinction between primary and secondary reinforcers. The standard example of secondary reinforcers operating in human societies is the use of money. In our society, money includes round pieces of metal and rectangular pieces of paper that have an extraordinary power to motivate behavior. In other societies, different objects serve a similar function (tooled shells, for instance). These objects are not valuable in themselves (aside from aesthetic considerations of design, etc.), but supposedly take on their value by means of serving as a medium of exchange for intrinsically valuable goods such as food or drink. Presumably, they acquire their reinforcing properties by being associated with primary reinforcers.

Essentially, then, secondary reinforcers are believed to be conditioned through a process of classical conditioning involving the following set-up:

CS (neutral stimulus) & UCS (primary reinforcer such as food)

Once we have established a pairing between the CS and a primary reinforcer, we may then test for its value as a secondary reinforcer. Our experimental design would be as follows:

            Group             Classical Conditioning      Instrumental Acquisition

            experimental      CS & primary RF             R in presence of S+ followed by CS
            control           (Nothing)                   R in presence of S+ followed by CS

If we see an increase in responding in the experimental group compared to the control group, then our CS has acquired reinforcing properties. This example should make clear why this is an instance of mediated learning: The effect of the CS in the experimental group occurs by virtue of its link with the primary reinforcer or UCS. When that link weakens, the value of the CS as a secondary reinforcer ought also to weaken. Thus, in times of inflation when more money is required to buy the same food, the reinforcing properties of a dollar or five dollars weaken.

On a classical conditioning analysis of secondary reinforcement, we would expect to obtain findings like those we've already seen in the previous chapters. Several examples of such findings might be mentioned. One involves a study by Egger and Miller. They trained rats using a design similar to this one:

            Group             Classical Conditioning             Instrumental Acquisition

            1                 CS₁ --> CS₂ --> primary RF         R followed by CS₁ or CS₂
            2                 CS₁ --> CS₂ --> primary RF         R followed by CS₁ or CS₂
                              CS₁ --> ..... --> no RF

As you can see from this design, Group 2 had discrimination training in the sense that presence or absence of CS₂ was relevant to predicting presence or absence of the UCS. Not surprisingly, given its better signal value, CS₂ turned out to be the secondary reinforcer for the instrumental acquisition phase in this group. But what about Group 1? CS₂ certainly has better contiguity with the UCS. However, in terms of signal value, it is not adding anything to what CS₁ already predicts. Thus, it is redundant, and we would predict from models like Kamin's or Mackintosh's that CS₂ would be blocked. Consistent with this prediction, the secondary reinforcer for instrumental acquisition in Group 1 is CS₁, and not CS₂.

Another example is rather cute. It comes from the Brelands, former students of Skinner's who tried to train animals to perform in commercials using the principles they had learned. It is also cute because it involves the notion of money as a secondary reinforcer. In one instance, they attempted to train a pig to roll a (fake) coin into a piggy bank. During the training, the coin was paired with a primary reinforcer, since they wanted to use the coin as a secondary reinforcer for the responses involved in rolling. The procedure worked for a brief while, but then the pig started treating the coin as if it were similar to the food it had been paired with: It tried to root the coin just as it would have rooted real food. This result, termed instinctive drift, is perhaps one of the clearest demonstrations of the involvement of classical conditioning in secondary reinforcers, though it also serves to remind us that Watson's claim about reflexes quickly becoming overwhelmed by learned associations radically overstates the case.

Secondary reinforcers play an important role in certain aspects of therapy and classroom behavior. In clinical and educational settings, behavior modification techniques based on principles of conditioning are used to try to change unacceptable behavior. These techniques typically include a component of secondary reinforcement by which objects such as poker chips may be accumulated for making desired responses (or avoiding undesired responses), and later traded for privileges such as snacks, movies, pencils, etc. Use of such secondary reinforcers involves the construction of what is called a token economy.

Other findings relevant to the involvement of classical conditioning in secondary reinforcement include the intensity of the primary reinforcer (more intense primary reinforcers yield more effective secondary reinforcers), the number of times the putative secondary reinforcer is paired with the primary reinforcer, and the delay between these events. You should be able to figure out why models such as Rescorla-Wagner or Wagner's rehearsal model, for example, would support these findings.
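As a hint for the first two findings, it helps to iterate the Rescorla-Wagner learning rule, ΔV = αβ(λ − V), treating the putative secondary reinforcer as the CS. The sketch below is only an illustration under stated assumptions: the function name, the combined learning rate of 0.2, and the two λ values (standing in for a weak and an intense primary reinforcer) are made up for the example, and this simple rule does not represent the delay between events at all.

```python
def rescorla_wagner(n_pairings, lam, alpha_beta=0.2, v0=0.0):
    """Return the associative strength V after each CS-UCS pairing.

    lam        -- asymptote supported by the UCS (assumed to grow with
                  the intensity of the primary reinforcer)
    alpha_beta -- combined salience/learning-rate term (assumed value)
    """
    v = v0
    history = []
    for _ in range(n_pairings):
        v += alpha_beta * (lam - v)  # delta-V = alpha * beta * (lambda - V)
        history.append(v)
    return history


# Hypothetical comparison: weak versus intense primary reinforcer.
weak = rescorla_wagner(10, lam=0.5)
strong = rescorla_wagner(10, lam=1.0)

print("after 1 pairing:  ", round(weak[0], 3), round(strong[0], 3))
print("after 10 pairings:", round(weak[-1], 3), round(strong[-1], 3))
```

Under these assumptions, strength grows with each pairing, and the more intense reinforcer supports a higher level at every point, which is the pattern we would expect to carry over to the CS's value as a secondary reinforcer.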

One more important phenomenon while we are on the subject of mediated conditioning and secondary reinforcement: Most behavior involves a complex series of responses executed in a certain rapid and relatively smooth order. How is it that each single response can be reinforced? There hardly seems time for that. And how is it that organisms in real environments (rather than the laboratory where a researcher can control reinforcers and stimuli) acquire such complex organizations? The answer to these questions involves the concept of chaining, and will prove to rely heavily on secondary reinforcers.

We briefly introduced the notion of response chains earlier. An example will illustrate this concept. Let's set ourselves the task of teaching pigeons to Time Warp. The Time Warp is the dance from the Rocky Horror Picture Show. It (as is true of all dances) may be regarded as a series of steps in a chain. In the case of the Time Warp, there are 5 steps (The Rocky Horror Show, 1975):

It's just a jump to the left, and then a step to the right. With your hands on your hips, you bring your knees in tight. But it's the pelvic thrust, They really drive you insane. Let's do the Time Warp again.

Normally, we would try to teach a chain backwards. So, we will train the last step first. That involves teaching the pigeon a pelvic thrust. We have our response here, but we need a stimulus and a reinforcer. Let's use a red light for the stimulus (seems appropriate, huh?), and some drink for the reinforcer. Our design then is:

            Phase             Stimulus (CS)              Response                  Reinforcer (UCS)

            1                 red light                  pelvic thrust             drink

Note particularly that I have also labeled the stimulus a CS, and the reinforcer a UCS. This is meant to suggest that classical conditioning will be going on simultaneously with instrumental conditioning: The stimulus is paired not only with the response, but also with the outcome. Thus, as a result of instrumental conditioning, the animal should do a pelvic thrust to the red light. But, as a result of classical conditioning, the red light ought to become a secondary reinforcer. And that should suggest to you the rest of the design. Here it is in full:

            Phase             Stimulus (CS)              Response                  Reinforcer (UCS)

            1                 red light                  pelvic thrust             drink
            2                 blue light                 knees in tight            red light
            3                 green light                wings on hips             blue light
            4                 yellow light               step to right             green light
            5                 white light                jump to left              yellow light

So, if you look for the moment just at Phase 2, notice that we will reward the pigeon for bringing its knees in tight by following that response with the red light. If the red light is a secondary reinforcer, then the animal will acquire the response. And note too that the red light also serves as the signal for the next step after knees in tight: the pelvic thrust. And finally, note that in Phase 2, we ought to obtain second-order conditioning: Two CSs (the blue and red lights) are being paired. If successful, this means that the blue light now also becomes a secondary reinforcer.

At the end of this, the sequence will be that a white light serves as the signal for a jump to the left; that's reinforced by the yellow light (thanks to fourth-order conditioning) which also signals Step 2 (a step to the right); that's reinforced by the green light (thanks to third-order conditioning), which signals Step 3 (wings on hips); that's reinforced by the blue light (thanks to second-order conditioning), which signals Step 4 (knees in tight); and that is reinforced by the red light (thanks to first-order conditioning), which finally signals the last step of the dance.
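The finished chain can be written out as a small lookup table in which each light points to the response it signals and to the event that reinforces that response. The sketch below is only a bookkeeping aid for the design described here: the names CHAIN, perform_chain, and conditioning_order are hypothetical, and nothing in it models how quickly the pigeon would actually learn.

```python
# Each entry: discriminative stimulus -> (response it signals, reinforcer).
# Entries are listed in training order (the last dance step is trained first).
CHAIN = {
    "red light": ("pelvic thrust", "drink"),
    "blue light": ("knees in tight", "red light"),
    "green light": ("wings on hips", "blue light"),
    "yellow light": ("step to right", "green light"),
    "white light": ("jump to left", "yellow light"),
}


def perform_chain(start_stimulus, chain=CHAIN):
    """Follow the chain forward: each reinforcer is the next step's signal."""
    responses = []
    stimulus = start_stimulus
    while stimulus in chain:
        response, reinforcer = chain[stimulus]
        responses.append(response)
        stimulus = reinforcer  # the secondary reinforcer cues the next step
    return responses, stimulus  # final stimulus is the primary reinforcer


def conditioning_order(light, chain=CHAIN):
    """1 if the light is paired with the primary reinforcer, else 1 + the
    order of the light it is paired with."""
    _, reinforcer = chain[light]
    if reinforcer not in chain:
        return 1
    return 1 + conditioning_order(reinforcer, chain)


steps, final_outcome = perform_chain("white light")
print(steps, "->", final_outcome)
for light in ("red light", "blue light", "green light", "yellow light"):
    print(light, "order:", conditioning_order(light))
```

Reading the table this way makes the double role of each light explicit: it appears once as a reinforcer (the value in one row) and once as a signal (the key of the next).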

We haven't talked about a control experiment for this, but our control would be something like the following:

            Phase             Stimulus (CS)              Response                  Reinforcer (UCS)

            1                 red light                  pelvic thrust             drink
            2                 blue light                 knees in tight            green light
            3                 yellow light               wings on hips             white light
            4                 orange light               step to right             purple light
            5                 white light                jump to left              green light

In this control experiment, only the first step ought to be acquired. The secondary reinforcer from the first phase is never used in the later phases, and none of these is ever paired with a primary reinforcer. Indeed, based on work in discrimination training, we might predict that the other colors would become somewhat inhibitory (since they tend to signal absence of UCS).

But in real-world chains, of course, such individual discriminative cues and reinforcers do not always appear to be present (although you could argue that they are present in a dance in terms of the auditory stimuli represented in the music!). And we can solve that mystery by going back and analyzing responses as having stimulus components. Responses are also being associated with a UCS, so that doing a response can act as a secondary reinforcer! Thus, jumping to the left may be reinforced by stepping to the right, eliminating the need for all of these intervening light stimuli. If you thought our pigeon was caught in a very awkward situation, you were right: By considering the stimulus components of a response, we find a way to make the concept of response chains a lot more realistic, and their execution smoother.

Interference

Because of the nature of instrumental conditioning, it is possible to have several different responses associated with the same stimulus, or the same responses and stimuli associated with several different outcomes. Under those conditions, sometimes complex patterns of results may be found. In particular, certain combinations of events appear to result in interference. We will consider two sorts of interference briefly in this section, and then revisit the issue in later chapters. The two involve response competition and approach-avoidance conflicts.

Response competition involves one response interfering with or competing with another. In fact, response competition is one of the theories regarding the process of extinction (see the chapter on partial reinforcement and extinction). The basic idea here is that the animal is being cued to perform incompatible responses.

An excellent example of response competition occurs in an experiment by Fowler and Miller. They trained rats to run to a goal box. During extinction, all of the rats were shocked on entering the goal box, but half of them were shocked on their front paws, and the other half were shocked on their rear paws. The animals shocked on their front paws jerked back, whereas the animals shocked on their rear paws jerked forwards. Moving forward is a response compatible with running into the goal box, but moving backwards is an incompatible response. Despite the fact that both groups received shock or punishment for entering the goal box, the front-paws group extinguished more rapidly. The new response caused by the shock in this case interfered with the old response.

Other examples of response competition come from work with humans in the verbal learning paradigm. Here, subjects are often asked to learn a list of word pairs, and tested on how successful they are at recalling the second word when presented with the first as a retrieval cue. So, if you studied a pair such as SHORT-LAKE, the experimenter might say SHORT, and you would need to reply with LAKE. As you may gather, we can identify the first word of a pair as the stimulus term, and the second as the response term. Numerous studies show interference when we ask people to learn several lists in which the same stimulus words are present, but there are different response words. Response competition will certainly not turn out to be the sole explanation of these findings (see, for example, Melton & Irwin, and Postman's review). But it assuredly handles some of what is going on, as we find intrusions of the earlier responses during learning of the later responses.

As for approach-avoidance conflicts, we may ask what happens when a response is associated with both an aversive and an appetitive outcome. That situation happens more frequently than you might think. In discrimination training, for example, we try to alter the excitatory generalization to the S- by associating it with lack of a reward. But, that means that the inhibition building up for S- may also generalize to the S+, canceling it out, to some extent (one of the explanations for peak shift). Thus, discrimination training involves two stimuli, each of which may be claimed to have some excitatory and some inhibitory components.

What ought to happen should thus reflect, in some sense, the summation of the excitation and inhibition, as was the case in use of the summation test in classical conditioning (see the discussion of algebraic summation theory in the chapter on attention and categorization for more details).

As an interesting footnote, Dollard and Miller tried to combine aspects of Freudian psychoanalytic theory and learning theory to describe some of the conflicts humans might be subject to. They identified several different types of conflicts, but one they termed an approach-avoidance conflict. In this situation, there is a tradeoff between the positive and negative components of making a response. As an example, we might take a rat running down an alleyway to obtain some food. Suppose that the goal box is associated both with food and with a shock. What will the rat do? One analysis of this situation (culled from several different studies) appears in Figure 2.

In this figure, we look at some measure of strength of a response as a function of how far from the goal the animal is. There are actually two opposed tendencies graphed in this figure: The tendency to approach the goal for a reinforcer, and the tendency to go away from the goal due to punishment. The solid line represents a typical, idealized avoidance gradient: The closer an animal is to an aversive or noxious stimulus, the more vigorously it leaves. As it gets further and further away, its response (running, for example) gets weaker and weaker. In contrast, the approach gradient graphed by the dotted line demonstrates the reverse finding: the closer to a desired reinforcement, the faster or more vigorously the animal approaches it.

Several additional features of Figure 2 are important. One is that the avoidance gradient is typically steeper than the gradient of approach. The other is that in this figure, the lines cross. And because they do, we obtain an approach-avoidance conflict, with the spot at which the lines cross representing the conflict point.

If you look to the right of the conflict point, you will see that approach is stronger than avoidance. Thus, right of this spot, the animal should tend to head towards the goal. But once it passes the conflict point and approaches, then avoidance becomes stronger, driving it back. So, the model predicts that an animal will waver around the conflict point, developing large amounts of frustration in the process. There will be some tendency here for the animal to simply escape this situation, if that is at all an option.

Finally, an increase or decrease in the amount of reinforcement or punishment in this model will essentially move the relevant gradient up or down. Increasing the punisher, for example, should move the solid line up, and that will result in the conflict point (the spot at which the lines cross) moving further away from the goal. In like fashion, increasing reinforcement moves the approach gradient up, causing the spot at which the lines cross to occur closer to the goal.
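These predictions can be checked with a little arithmetic if we assume, purely for illustration, that both gradients are straight lines: strength = height − slope × distance from the goal. The two lines then cross at distance (avoidance height − approach height) / (avoidance slope − approach slope). The linear form, the function name, and all of the numbers below are assumptions made for this sketch, not values taken from the studies behind Figure 2.

```python
def conflict_point(approach_height, approach_slope, avoid_height, avoid_slope):
    """Distance from the goal at which two linear gradients cross.

    Each gradient is modeled as: strength = height - slope * distance.
    Returns None when avoidance is never stronger than approach, in which
    case the model predicts no conflict and the animal reaches the goal.
    """
    if avoid_slope <= approach_slope:
        raise ValueError("the avoidance gradient is assumed to be steeper")
    if avoid_height <= approach_height:
        return None  # approach dominates at every distance
    return (avoid_height - approach_height) / (avoid_slope - approach_slope)


# Hypothetical values in arbitrary units of response strength.
baseline = conflict_point(10, 1, 16, 3)
more_punishment = conflict_point(10, 1, 20, 3)  # avoidance gradient raised
more_reward = conflict_point(14, 1, 16, 3)      # approach gradient raised

print(baseline, more_punishment, more_reward)
```

With these made-up numbers, raising the avoidance gradient pushes the crossing point farther from the goal, and raising the approach gradient pulls it closer, as the paragraph above describes.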

This is by no means all that Dollard and Miller have to say about what approach-avoidance conflicts entail. You may be interested in reading their book on personality and psychotherapy for more information.

The Partial Reinforcement Effect

A final basic phenomenon we will discuss in this section involves the partial reinforcement effect. Skinner and his colleagues studied various aspects of reinforcement-based learning under what they claimed were 'real-world' conditions. Specifically, they asked what would happen to learning when reinforcers appeared on only some trials. The findings were quite interesting. Namely, with partial reinforcement, there was greater resistance to extinction.

We will look at partial reinforcement in more detail in a later chapter. For now, let me mention that a number of variables interact with partial reinforcement. In particular, amount of reinforcement will prove to play a pivotal role. In studies such as those conducted by Roberts, we find that animals that have been continuously reinforced will display increased resistance to extinction with small reinforcers. Roberts looked at extinction of alleyway running in rats whose reinforcers ranged from 1 to 25 food pellets. Over 36 extinction trials, there was little evidence of a change in the 1-pellet group, whereas the 25-pellet group was performing at well less than half their rate prior to extinction. However, this effect appears to depend on how much training an animal has had during the acquisition phase; it assumes a fairly substantial amount of acquisition (see D'Amato). In contrast, animals that have been partially reinforced will display increased resistance with large reinforcers (see, for example, Ratliff and Ratliff). The first result, in particular, strikes many people as counterintuitive on first coming across it. After all, shouldn't large reinforcers result in better learning, and shouldn't better learning be longer-lasting learning?

There are in fact a number of explanations for the partial reinforcement effect. For the moment, however, I will mention one to help you remember the results. This is Amsel's Frustration Hypothesis, cited in the first chapter. According to Amsel, continuously reinforced animals will experience more frustration when they lose a large reinforcer. And since frustration acts as an aversive stimulus, these animals will avoid whatever it is that is causing the frustration. So, with more frustration, there is faster learning of avoidance. But in contrast, animals in partial reinforcement are being trained to tolerate frustration. With larger reinforcers, they are trained specifically to handle greater and greater frustration. Thus, when they are placed in the highly frustrating situation of extinction (in which the expected reward fails to materialize), they will be better able to adapt to this situation.

Number of trials during acquisition also has different effects for continuously and partially reinforced groups. For continuously reinforced groups, more training results in less resistance, whereas for partially reinforced groups, more training results in greater resistance. The increased training in partially reinforced groups translates into better training to tolerate frustration, but the increased training in continuously reinforced groups translates into higher expectation of a reward (and thus, a ruder awakening when it is no longer there).

In short, extinction is frustrating, because expected rewards don't occur. How much resistance to extinction you will have will thus depend partly on how much frustration you experience during extinction, and on how much frustration you have been trained to tolerate during acquisition. The amount of frustration experienced during extinction depends on the size of the reinforcer you expected. (Not getting an expected $50 is a lot more frustrating than not getting an expected $1.) In addition, continuously reinforced animals have not been trained to tolerate any frustration whatsoever.

C. Some Basic Paradigms

We have already introduced two general paradigms involving acquisition and extinction. In acquisition, an outcome is typically paired with making a response in the presence of a stimulus; in extinction, that pairing typically ceases. Within this broad framework (particularly with respect to acquisition), we may distinguish several additional paradigms.

In appetitive or approach learning, the animal makes a response that results in a desired reward. This is the type of learning involving reinforcement that we have implicitly and explicitly discussed so far. But it is not the only paradigm based on reinforcement. Another that deserves particular note is omission training, in which an animal has to suppress or withhold a response in order to get its reward. Sheffield, for example, trained dogs to salivate in the presence of a tone associated with food, and then shifted them to omission training. In this latter phase, the dogs had to avoid salivating to the tone for several seconds to get the food. Omission training is typically difficult at first, and displays a relatively slow learning curve. However, there are several studies suggesting that in the long run, it will be as effective as extinction in decreasing the frequency of a response. Omission training is sometimes referred to as negative punishment to indicate that making the response is associated with removal of a reinforcer (which thus acts as a punishment).

Another paradigm based on reinforcement is escape learning. In escape learning, the animal learns a response that gets it away from punishment, either by turning off the punisher, or by allowing the animal to leave the area where the punishment was administered. Escape learning is closely associated with another paradigm, avoidance learning. In avoidance learning, the punishment is intermittent rather than continuous. If the animal makes the proper response before the punishment comes on, it will succeed in canceling that punishment. In avoidance learning, animals typically start out by escaping the aversive stimulation (making a response during the punishment that stops it), and then come to make the response early enough that they subsequently successfully avoid the aversive stimulation.

Punishment training (or aversive learning), of course, involves the administration of an unpleasant, aversive outcome following a response. Thus, punishment training, omission training, and extinction all have in common reducing the level of a given response, whereas appetitive learning, escape learning, and avoidance learning attempt to increase response level. There are some obvious interplays in paradigms here, depending on which response you focus on. Often, aspects of several different paradigms combine: One response may be punished while another is reinforced.

We may also distinguish between signaled and unsignaled learning. A discrete, distinct stimulus is present in signaled learning, but not in unsignaled learning. Thus, for example, in unsignaled avoidance, shocks can occur at regular intervals that could be avoided if the animal responds shortly before the shock's onset. There is no physical stimulus signaling the shock; the animal in this case needs to rely on an internal sense of time. In unsignaled conditions, features such as time or the contextual cues presumably act as stimuli.

Another paradigm, transfer training, will prove important, especially when we focus on discrimination in a later chapter. In transfer training, we look at the effects of learning one task on another. Transfer might be nonexistent (zero), positive (facilitation: the learning is faster), or negative (inhibition: there is interference). In addition, transfer effects might be proactive (in which we look at the effect of an earlier task on the learning or performance of a later task), or retroactive (in which we look at how the later task influences performance on the earlier one).

A final paradigm involves shaping. Normally, approach learning applies to responses that are not especially frequent to start with, since we want to track an increase in frequency as one of our measures of learning. Thus, we find ourselves in the following situation: We sit in the lab, watching our animal subject, waiting for it to make the desired response so that we can administer the reinforcer.

Such a procedure will obviously be inefficient. In some cases (such as a pig rolling a coin), the wait may be very long indeed! Hence, a technology has developed that involves increasing the probability of having the animal emit that response so that we can then train it further through reinforcement. This technology, called shaping, requires reinforcing successive approximations to the desired response.

Shaping works as follows. We start out by identifying a high-frequency component of the response we want, and we reinforce that. So, if we want our rat to press a bar on the left side of an experimental chamber, then a high-frequency component would involve having the rat be in the left half of the chamber. While it is exploring its environment, we reinforce for crossing over to the left. Then, as it increases its time on the left, we drop the reinforcer. That will cause the behavior to become more variable. We await some response yet closer to what we want to train (such as being near the bar), and when that occurs we reintroduce the reinforcer. And then, of course, we cycle the process through again in order to obtain yet a closer approximation (such as touching the bar). Shaping is a very powerful technique, not only because of its ability to 'coax' low frequency responses out of an animal, but also -- and especially -- because of its ability to mold a response that is not normally part of the animal's repertoire! Thus, by combining shaping and chaining, instrumental conditioning allows us to train totally new responses, rather than just transfer stimulus control of an old response to a new stimulus.

D. A Note About Terminology: Operant vs. Instrumental Learning

Finally, we ought to note a distinction that is sometimes made between what is termed instrumental conditioning, and operant conditioning. In instrumental conditioning, the emphasis is on a discrete trial, a situation in which there is a clear starting point and a clear terminus. We may measure how long it takes the animal to make the response during the trial, or we may measure the relative probability of the animal's success. So, to take Thorndike's puzzlebox apparatus, the start of the trial occurs when a cat is placed in the puzzlebox, and it ends when the cat has made the escape response. How long this takes is what we are interested in. Similarly, in maze learning, the trial starts with the animal being placed in the start box, and ends when the animal has found its way to the goal box. (Or alternatively, we can specify the trial as what happens in some amount of time from when we have placed the animal in the start box. Where has it gotten to in, say, an allowed 30 seconds? Learning here will show up as increased probability of having made the correct response within the time frame of the trial.) In a third example, choice discrimination, the trial starts when the animal is exposed to two stimuli, and ends when the animal makes a response relevant to one of them. Our interest in this situation typically involves whether the animal has chosen the correct stimulus.

There are no discrete trials in operant conditioning, on the other hand. A standard apparatus for operant conditioning involves a Skinner Box, a chamber with something that can be manipulated (a key to peck; a bar to press; a lever to move); various discriminative stimuli that may be turned on or off (lights; noises); and means to automatically administer reinforcements or punishments (food or shock dispensers connected to the bar, for example). Particularly with respect to such simple responses as pressing a bar to obtain food, the interest will be more in how rapidly those responses are executed. We don't stop the animal between responses in order to set up another trial. Rather, we typically look at characteristics of response rate over time.

This distinction between discrete and continuous trials might also be expressed in a slightly different manner. On a discrete trial, you can succeed only once (or perhaps 8 times if we use an apparatus like Olton's radial maze, discussed in the previous chapter), whereas on continuous trials you have the opportunity to obtain virtually unlimited reinforcements. So, the difference between instrumental and operant conditioning in part involves whether there is a constraint on how many reinforcing events an animal can seek out. That having been said, I will generally treat these as equivalent.

II. Basic Requirements For Effective Conditioning

Many of the principles for effective conditioning will prove familiar from our discussion of classical conditioning. Thus, number of pairings of a response with an outcome will prove important in characterizing how quickly we see changes in characteristics of the response (its strength, its amplitude, its latency, its probability, etc.) that signal evidence of learning. By the same token, number of times a response fails to be followed by an outcome will be important, not only in describing the course of extinction, but also (as in classical conditioning) in describing the contingency between a response and an outcome. Below, we will briefly consider additional principles having to do with temporal contiguity, outcome characteristics, and contingency.

Before we do, however, we ought to note several features that make the situation a bit more interesting. First, of course, is the issue of partial reinforcement. We will delay fuller discussion of that to a later chapter. Second, there is the fact that in operant conditioning, an animal is effectively in charge of whether to emit the response or withhold it. Obviously, researchers in classical conditioning may easily arrange pairings of the CS and the UCS to achieve any desired contingency. But in operant conditioning, controlling how many times the reinforcer occurs when a response is emitted versus when a response is not emitted is clearly trickier. Third, because of the presence of three events (stimulus, response, outcome), there are three potential associations to worry about (S-R, S-O, and R-O). That means that we can ask about temporal contiguity (or contingency) not just of response and outcome, but also of stimulus and response, and of stimulus and outcome. The situation thus becomes significantly more complex.
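The second point above, how often the reinforcer occurs when a response is emitted versus when it is not, can be made concrete with a simple count. One common way to summarize such counts is the difference between two conditional probabilities, P(outcome | response) − P(outcome | no response). The text does not commit to this particular measure, so treat both the measure and the hypothetical session counts below as assumptions made for illustration.

```python
def response_outcome_contingency(r_with_o, r_without_o, no_r_with_o, no_r_without_o):
    """P(outcome | response) - P(outcome | no response), from raw counts.

    Positive values mean the outcome is more likely after a response,
    zero means no contingency, and negative values mean the outcome is
    less likely after a response.
    """
    p_o_given_r = r_with_o / (r_with_o + r_without_o)
    p_o_given_no_r = no_r_with_o / (no_r_with_o + no_r_without_o)
    return p_o_given_r - p_o_given_no_r


# Hypothetical session: 20 intervals with a response, 20 without.
contingency = response_outcome_contingency(
    r_with_o=18, r_without_o=2,      # reinforcer followed 18 of 20 responses
    no_r_with_o=5, no_r_without_o=15  # reinforcer arrived anyway 5 of 20 times
)
print(round(contingency, 2))
```

The practical difficulty noted above shows up in the denominators: in operant conditioning the animal, not the researcher, decides how many intervals fall into the "response" and "no response" rows.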

Not all theorists believe that all three associations form. Thorndike, to remind you, accepted only an S-R association, as did Watson. But, researchers such as Rescorla have made a very strong case that the other associations are there, as well. Thus, Colwill and Rescorla used the devaluation paradigm on a reinforcer after the response had been acquired. If a reinforcer's only function is to stamp in the association (as claimed by Thorndike), devaluing the reinforcer ought not to influence the response the animal gives to the stimulus. In the abstract, the design for this type of experiment would be similar to the following:

            Group             Acquisition Phase       Phase 2           Test Phase

            experimental      R to S for RF           RF & LiCl         R to S?
            control           R to S for RF           (Nothing)         R to S?

However, Colwill and Rescorla found a much less vigorous response following devaluation. This must have its effect on an R-O or S-O association.

What about an S-O association? From our discussion of chaining and higher-order conditioning, you already know that this association forms. Further evidence of this comes from the Rescorla study mentioned in the previous chapter, in which a stimulus that caused higher levels of responding during extinction became inhibitory, as measured by the summation and retardation tests. We had earlier read about a classical conditioning version of that study, but Rescorla also ran the same study with an instrumental conditioning set-up, and obtained the same results. Because the S-R association in these types of experiments is rapidly relearned following extinction while the S remains inhibitory, Rescorla claims that the inhibition doesn't involve the S-R link! And as a final example, consider a classic study by Seward and Levy on a phenomenon termed latent extinction. In their study, two groups of rats learned to run to a goal box for a reward. Following acquisition, one group had the experience of being placed directly in the goal without the reward. Then, both groups were put through extinction:

Group             Acquisition       Phase 2                   Phase 3

experimental      run for RF        put in goal, no RF        extinction
control           run for RF        (Nothing)                 extinction

In this experiment, the control group extinguished more slowly than the experimental group. Presumably, the stimulus elements of the goal box had now become associated with some inhibition for the experimental group, making their running to it less desirable.

Below, given the theoretical importance of reinforcement in operant conditioning, we will concentrate on principles having to do with its presence relative to the response.

A. Temporal Parameters

A principle of fairly long standing (and which forms a part of many behavior-level theories) has been that there must be temporal contiguity between the response and the outcome. In fact, many studies report what is called a gradient of reinforcement in approach or appetitive learning: The longer the delay between the response and the reinforcer, the weaker the learning.

A well-known experiment demonstrating the gradient of reinforcement was conducted by Grice. Grice used a choice discrimination paradigm in which rats had to enter one of two rooms or chambers. The rooms were different colors (black or white), and the rat was reinforced for entering one of these but not the other. However, there were several groups of rats who differed in terms of how long it took to get the reinforcer after choosing the correct color. All rats were immediately placed in a neutral-color room where the reinforcer was given, but one group received their reinforcer immediately, while others had to wait. The group with the longest wait was reinforced after 10 sec. Essentially, Grice found a very rapid fall-off of learning. After about 1 sec, there was no evidence that the discrimination had been learned.

Depending on the response and the circumstances (see the next major section below), longer delays in which learning still occurs have been reported. As an example, consider a study by Capaldi, in which two groups of rats were trained to run to a goal box. One group was rewarded as soon as it reached the goal box, but the other group had to wait 10 sec for its reward. Both indeed learned to run to the goal box, but the running speed (and the initial velocity out of the start box) was significantly depressed for the 10-sec delay group. Thus, their learning seems to have been affected by the delay.

Sometimes, a delay of 1 or 5 sec seems to result in no learning, and at other times (as in Capaldi's experiment), longer delays will be tolerated. Generally, however, the speed of learning as measured by vigor or probability of the response (or number of trials to acquire it) will be influenced by the response-reinforcer delay. Extrapolating from Skinner's claims, we may present one theory for why this is so: Namely, as the delay period increases, the odds increase that the animal will perform some other piece of behavior before the reinforcer is given. The association may then form between that response and the outcome, rather than between the effective response and the reward.

According to Skinner, temporal contiguity by itself is all that is needed for the formation of an association. Skinner cites the example of superstitious behavior to demonstrate this. In superstitious behavior, animals are reinforced at random, and need perform no response whatsoever. Yet, Skinner in one of his studies reported that pigeons in this circumstance were displaying apparently learned behaviors such as head shaking. He claimed that the reinforcer by dumb luck must have been presented just after the pigeon had tossed its head, so that head tossing was strengthened as a response in this situation. The increased possibility of acquiring superstitious behavior that interferes with other learning might thus partly explain why temporal contiguity is important.

From the perspective of a more cognitive, representational-level approach, we may posit a similar idea expressed in very different terms. Given the presence of a reinforcer, the animal's task is to determine which of a number of previous responses might be the one that worked. As the number of responses increases, the task becomes more difficult. Moreover, because causes normally result in relatively immediate effects (excepting, of course, situations such as illness or food poisoning: note the relevance to the taste aversions paradigm), organisms may be genetically predisposed to connect recent behavior with the current outcome (a principle of causal recency).

A similar principle, of course, applies to aversive situations. Fowler and Trapold in an experiment on escape learning varied how long it took for shock to turn off once the rat had run to a goal box. The best learning/performance occurred for a group of rats whose shock was turned off as soon as they entered the goal box. Animals that had to wait a bit for shock to turn off did worse.

Finally, Boe and Church found that the effectiveness of punishment decreased with delay. Unless punishment is administered very shortly after an animal's response, it will not prove very effective. Dog owners who come home and punish puppies for earlier 'accidents' are most likely to be associating themselves with the aversive outcome, and training fear of the owner and the spot where the dog was punished. That is certainly not the same thing as housebreaking a pet.

B. Outcome Strength

Two types of outcomes have generally been discussed: reinforcement and punishment. Each, however, may be further subdivided into two sorts, positive and negative. Positive outcomes generally involve the presentation of a stimulus that changes a relatively neutral state into the state specified by the outcome. Thus, positive reinforcement (generally referred to without the use of the word "positive") involves the provision of something desirable that normally results in appetitive behavior, and positive punishment (also typically referred to without the modifying adjective) involves the provision of something undesirable that normally results in aversive behavior.

The other two types of outcomes are negative reinforcement and negative punishment. It will help you to keep these straight by recalling that anything that is labeled a reinforcer, positive or negative, should operate by the law of reinforcement: It ought to increase the response that it follows. Similarly, anything labeled a punisher, positive or negative, ought to work by the law of punishment: It ought to decrease the response that it follows. That having been said, a negative reinforcer takes on its reinforcing properties because some response the animal makes results in removal of aversive stimulation. Negative reinforcement, of course, is the basis for escape learning. And in similar fashion, a negative punisher acquires its punishing properties by virtue of the fact that the animal makes a response leading to removal of a reward or privilege. Thus, positive outcomes involve the presentation of stimulus events, and negative outcomes involve the removal of certain stimulus events.
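The four-way classification described above can be summarized as a small lookup table. The following Python sketch is only an illustration: the function name and the string labels "appetitive", "aversive", "presented", and "removed" are assumptions introduced here for convenience, not standard terminology from the studies cited.

```python
def classify_outcome(stimulus, operation):
    """Return (outcome label, expected effect on the response it follows).

    stimulus:  "appetitive" (something desirable) or "aversive" (something undesirable)
    operation: "presented" or "removed" as a consequence of the response
    These argument values are illustrative labels, not terms from the text.
    """
    table = {
        # Reinforcers of either kind should increase the response they follow.
        ("appetitive", "presented"): ("positive reinforcement", "increase"),
        ("aversive", "removed"): ("negative reinforcement", "increase"),
        # Punishers of either kind should decrease the response they follow.
        ("aversive", "presented"): ("positive punishment", "decrease"),
        ("appetitive", "removed"): ("negative punishment", "decrease"),
    }
    return table[(stimulus, operation)]


# Escape learning: the response removes shock.
print(classify_outcome("aversive", "removed"))
```

Read this way, "positive" and "negative" describe only whether a stimulus is presented or removed, while "reinforcement" and "punishment" describe the direction of the change in responding.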

With respect to each, there appears to be a general principle that higher levels of strength result in stronger or faster or more vigorous responding, consistent with a claim that outcome strength influences speed of learning. Concerning positive reinforcement, for example, Kraeling taught three groups of rats to run an alleyway for a drink reinforcement that varied in the amount of sucrose concentration (recall that rats have a sweet tooth, so higher sucrose concentrations act as more effective reinforcers). Each group was given one trial per day for 99 days. At the end, they had each reached asymptote as measured by how fast they ran. However, the asymptotes differed for the three groups: The group with the highest sucrose concentration had the fastest asymptotic running speed whereas the group with the lowest concentration had the slowest speed. Crespi found similar results (see below, Figure 3): Rats given large amounts of reinforcement on each trial (64 pellets) showed faster running than rats given small amounts of reinforcement (4 pellets).

An experiment by Trapold and Fowler can illustrate the operation of this principle with amount of negative reinforcement. They conducted an experiment in which rats had to run to escape shock. Five groups of animals were given 20 trials of escape learning. The groups differed in the intensity of the shock (varying from 120 volts up to 400 volts). Faster acquisition of the escape response occurred with the larger shocks.

Finally, a classic experiment by Boe and Church may be used to illustrate the principle with positive punishment. Boe and Church trained four groups of rats to press a bar for a reward, and put each through extinction. Prior to extinction, however, three of these groups were put through punishment training in which, for 15 minutes, a bar press gave the animal a shock. The groups differed in intensity of the shock (35, 75, or 220 volts). Thus, the design was as follows:

Group     Acquisition Phase         Punishment            Extinction Phase

1         RF for barpress (bp)      (None)                No RF for bp
2         RF for barpress           bp --> 35 Volts       No RF for bp
3         RF for barpress           bp --> 75 Volts       No RF for bp
4         RF for barpress           bp --> 220 Volts      No RF for bp

The question, of course, was how punishment of bar pressing would help speed up removal of that response. Over 9 sessions of extinction training, the group with the weak shock proved not all that different from the group with no punishment: Each engaged in a substantial number of responses during the course of extinction. However, quite different results occurred for the 75 and 220 volt groups: They showed a much lower level of responding during extinction. Indeed, the 220 volt group hardly responded at all! Thus, effectiveness of punishment in suppressing behavior will depend in part on severity of punishment. As the contrast between the control group and the 35 volt group demonstrates, weak punishers may have little permanent effect compared to extinction.

Of course, there are other variables that will influence the operation of an outcome. As you know from an earlier discussion, aversive stimulation can have the paradoxical effect of increasing the response it is meant to stamp out (vicious circle behavior). Also, the same amount of an outcome packaged in different ways may effectively act as different amounts. Thus, for example, Campbell, Batsche, and Batsche found that a reinforcer divided into smaller amounts worked better than a reinforcer presented as one large amount. And to remind you, those manipulations that seem to promote higher asymptotic levels during acquisition (in continuous reinforcement) generally also promote the fastest extinction.

There are also contrasts that may occur when an organism experiences several different levels of a reinforcement. An experiment by Crespi will illustrate these. Crespi trained rats to run to a goal box for food (the apparatus here involved a straight alleyway in which rats are released at one end of a corridor or tunnel, and have to run to the other end). One group was given a large reward, a second group was given a medium reward, and a third group was given a small reward. In each case, their running speed was measured. Then, the large-reward and small-reward groups were shifted to the medium reward. Thus, the design was something like the following:

Group     Phase 1 (acquisition)     Phase 2 (maintenance)

1         64-pellet reward          16-pellet reward
2         16-pellet reward          16-pellet reward
3         4-pellet reward           16-pellet reward

You will note that I have labeled the two phases here acquisition and maintenance. The rats in the acquisition phase received 20 learning trials, and their average running speed at the end of training was measured. In a maintenance phase, on the other hand, we look at performance after learning has occurred (that is, presumably after the association has formed). In Crespi's study, there were 8 maintenance trials. Figure 3 presents the results after learning, and on the eighth maintenance trial. As you can see from this figure, Group 2 showed some slight increase; not surprising, since additional reinforced pairings ought to result in a stronger association according to most standard theories of instrumental conditioning. But notice what happened to the other two groups: A shift to a much smaller reward caused a negative contrast by which running speed slowed down considerably, whereas a shift to a much larger reward resulted in a corresponding increase (a positive contrast).

It is important to note that all groups received the same amount of learning in Phase 2 (in terms of number of trials and what the reinforcer was). Thus, we might have expected each group to display the same relative improvement. But, that did not happen. Because these contrasts occurred during a post-acquisition period involving identical additional training, they are generally interpreted as demonstrating an effect not on learning (or acquisition), but rather on performance. That argument is particularly compelling for Group 1: They continued to receive additional reinforced training during Phase 2, yet they apparently got worse! Contrasts of this sort are termed incentive contrasts.

Such contrasts should suggest that perhaps outcome amount is related more to an animal's motivation to perform a response than whether that response gets learned in the first place.

Contrast effects may occur under a variety of conditions (see, for example, Flaherty's review). They do tend to be temporary, however. Flaherty suggests, in particular, that negative contrasts may reflect frustration at obtaining the less desired reward. Consistent with this, tranquilized animals generally do not exhibit contrasts.

C. Contingency

We have already briefly alluded to the difficulties inherent in controlling contingency between a response and an outcome in operant conditioning. We have also briefly talked about the fact that Skinner (and Watson and Thorndike) claimed that contiguity was all that was needed for learning. As in classical conditioning, however, there are claims that contingency is also required for learning to occur. We will close out this section by considering the evidence that contingency influences learning and performance.

To start, let us adapt the notion of a contingency space discussed in the previous chapter. Previously, we had looked at the relative probabilities of the UCS when the CS was present or absent. Now, we look at the relative probability of an outcome when the animal makes a response (Probability 1), or withholds it (i.e., does not make the response: Probability 2). Essentially, analogous to what happens in classical conditioning, many theorists will claim that when Probability 1 exceeds Probability 2, there ought to be excitation: The animal is more likely to make the response because the odds of getting a reward increase. (We are assuming reward outcomes rather than punishers here!) In contrast, when Probability 1 is below Probability 2, then it makes more sense for the animal not to respond: The response ought to be inhibited. Finally, when the two probabilities are equal (so that there is zero contingency), we would expect to find no evidence of acquisition.
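The three regions of this contingency space can be expressed as a short Python sketch. The function names, and the choice to summarize contingency as the simple difference between the two probabilities, are assumptions made for illustration; the sketch also assumes reward outcomes, as in the paragraph above.

```python
def contingency(p_outcome_given_response, p_outcome_given_no_response):
    """Probability 1 minus Probability 2 (an illustrative summary measure)."""
    return p_outcome_given_response - p_outcome_given_no_response


def predicted_effect(p1, p2, tolerance=1e-9):
    """Predict what a contingency account expects for a rewarding outcome.

    p1: probability of the outcome when the response is made (Probability 1)
    p2: probability of the outcome when the response is withheld (Probability 2)
    """
    delta = contingency(p1, p2)
    if delta > tolerance:
        return "excitation"      # responding raises the odds of reward
    if delta < -tolerance:
        return "inhibition"      # withholding the response raises the odds
    return "no acquisition"      # zero contingency


print(predicted_effect(0.5, 0.1))
print(predicted_effect(0.1, 0.5))
print(predicted_effect(0.3, 0.3))
```

Note that the zero-contingency case still includes some response-outcome pairings whenever both probabilities are above zero, which is exactly the case that separates a contingency account from a pure contiguity account.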

How well does this notion of contingency hold up? One interesting study that attempts to assess this notion of an operant contingency space was performed by Hammond. Hammond set up different contingencies between bar pressing and a reinforcer for several groups of rats. Whenever the two probabilities were equal, the rats failed to display any change in bar pressing. In these circumstances, there were always some pairings of the response with the outcome that ought to have resulted in the association forming, if Skinner's claims made in the section on superstitious behavior were correct. Indeed, we might regard such a circumstance as similar to a partial reinforcement schedule. Nevertheless, no evidence of learning occurred. In contrast, rats for whom Probability 1 exceeded Probability 2 did increase their bar pressing.

Moreover, in a follow-up, Hammond found that rats that had already acquired bar pressing stopped when the two probabilities were made equal. This pattern of findings certainly goes against Skinner's claims that contiguity is sufficient. Instead, it strongly suggests an exquisite sensitivity to contingency on the rats' part.

Why, then, do we obtain superstitious behavior? According to many people, Skinner has probably failed to consider the fact that reinforcers such as food also act as UCSs that elicit certain responses, and that the contextual cues act as a CS that gets conditioned to the UCS. On such an account, superstitious behavior really isn't: It is fairly straightforward classical conditioning of the sort we have been discussing in the last two chapters.

An example may serve to drive this point home. One claimed example of superstitious behavior has been autoshaping. In autoshaping with pigeons, some food is presented after a key lights up (Brown & Jenkins, 1968). After a while, pigeons peck the key, although they need not do so to obtain the food. Autoshaping is a very useful tool for training pigeons, because it seems that the pigeons will train themselves, and save you the work. But that avoids analysis of this situation as classical conditioning in which the lighted key serves as the CS. A nuanced discussion of how classical conditioning can partly explain some of the response characteristics (along with some of the problems for a simple stimulus substitution view of classical conditioning) may be found in Staddon and Simmelhag (1971). Consistent with their discussion, Jenkins and Moore (1973), for example, used either food pellets or drink as the reinforcer for different groups of pigeons in an autoshaping paradigm. Pigeons pecked at solid food with a closed beak, but opened the beak slightly to drink the liquid (video demonstrations of this can be found on YouTube). Thus, what appears to be contiguity without contingency in operant conditioning is sometimes a classical conditioning situation in which there is both contiguity and contingency (but the contingency involves the CS and the UCS, rather than a response and an outcome). On the other hand, Staddon and Simmelhag do point out that such terminal responses differ from the interim responses that can be produced before them, and that do seem to better fit with Skinner's notion of superstitious behavior.

Another experiment done with pigeons was performed by Killeen. The pigeons faced a horizontal array of 3 keys, the middle one of which was lit. They were trained to peck at this middle key. About 5% of the time, one of their pecks at the key would cause its light to go off, and the lights of the two surrounding keys to come on. Another 5% of the time, a computer would automatically turn off the center key while turning on the others. The question Killeen asked was whether the pigeon was aware that it was responsible for this change.

How can we assess a pigeon's knowledge of the circumstances? Killeen reasoned that a pigeon aware of whether its action had turned off the light would easily be able to learn another response that would depend on that action. So, the experiment was arranged so that the pigeon would get rewarded for pecking at one of the surrounding lights when it was responsible for turning them on, but would have to peck at the other light when it was the computer that had turned them on. That would be a difficult task to accomplish without sensitivity to contingency, but Killeen's pigeons came through. They did indeed show they could discriminate events caused by their own behavior from events caused by some other external cause.

Such sensitivity is not reported in all studies, however. In a quite clever study by Thomas, rats obtained random free reinforcements, but could also make a response (bar pressing) that would give them a reinforcement on demand. But the catch was, the number of random reinforcers would drop some after the rat made that response. In other words, more reinforcers were available if the rat did not respond. In contrast to the studies above, the rats in this experiment actually did learn to bar press, which resulted in their getting less food!

One more study on reinforcement-based contingency may be mentioned. This study, done by Watson (not the same Watson who redefined psychology as behaviorism!), used 3-month-old human infants. Watson set up a contingency between their turning their heads and a reinforcer of a mobile above their cribs turning for several seconds. A second group had the same experience of the mobile reinforcer, but the second group's mobile movements had nothing to do with any of their responses. Although both groups initially displayed a great deal of interest and pleasure in the mobile when it started moving, only the contingent group maintained this reaction. Thus, Watson argued that the contingent infants had some sense of mastery over the mobile that the non-contingent infants did not, some awareness that the mobile's movements were due to their own actions. They were sensitive to contingency.

Contingency will also prove important in escape and avoidance learning. In particular, Seligman and Maier have studied a phenomenon termed learned helplessness. The experimental set-up for learned helplessness typically involves something like the following design:

Group            Phase 1                           Phase 2

Experimental     inescapable, noncontingent P      escape learning
Control          (Nothing)                         escape learning

They find that unavoidable, non-contingent punishment results in the animals in the Experimental Group not learning to escape, once a contingency is set up between an escape response and avoidance of the shock. The Control Group, in contrast, readily acquires the response. According to Seligman's explanation of these results (the cognitive deficit hypothesis), the animals in the Experimental Group have acquired a mistaken belief. Based on the randomness of the shocks and their inability to escape them in Phase 1, they have mistakenly learned that there is no response that will be effective in avoiding or escaping shock. (You may want to compare this to various explanations for learned irrelevance in classical conditioning.) Thus, they cease trying to discover an effective response, so that learning is no longer attempted in Phase 2. Seligman has argued that some similar mechanism in humans may account for certain episodes of depression.

III. Exceptions & Complex Interactions

We have seen, at least implicitly, an emphasis on reinforcement theories in which a reinforcer is necessary for appetitive or approach learning (see also Thorndike). Indeed, the notion of reinforcement is so pervasive that, as you will see in the next chapter, some theorists have even claimed that extinction depends on there being a reinforcer present during the animal's extinction training! Although there are a variety of associational or behavior-level theories to account for instrumental and operant conditioning, we will direct the exceptions below to the most conservative of these: the claim that learning requires a reinforcer, needs only temporal contiguity, operates through an automatic process of slowly strengthening an association, and stamps in specific muscle movements that increase in likelihood in the presence of the S+.

A. Long-Delay Learning

We will start with the issue of temporal contiguity. As was true in classical conditioning, we will also find instances of long-delay learning in operant conditioning. Indeed, we might start by noting that the work of Garcia and Koelling, presented previously as an instance of long-delay classical conditioning, might just as easily have been labeled instrumental conditioning: An animal makes a response of drinking saccharin-flavored water, and experiences an outcome of becoming ill several hours later. As this reinterpretation of the learned taste aversions paradigm illustrates, it may sometimes be difficult to draw sharp, clear boundaries between classical and instrumental conditioning (again, see Staddon & Simmelhag, 1971, for a discussion of some of these issues; see also the last section in this chapter).

Leaving aside the taste aversions work, however, there are other studies suggesting relatively long delays are possible. Lieberman, Davidson, and Thomas, for example, presented a series of experiments in which pigeons had to peck the right or left side of a key. They found that some of their animal subjects were able to learn the response even with delays of 7 sec or longer (an extraordinarily long delay for a pigeon). The animals that were able to learn were the ones whose correct response was followed by an unusual event (a marker). In their experiment, the marker involved the key briefly turning a different color (from white to red on its left half and green on its right half) after it had been pecked. Other work by Lieberman and his colleagues has demonstrated that such marking can result in animals learning a discrimination even when the reinforcement is delayed a full minute. Since the marker involved a non-reinforced stimulus occurring after the relevant response but well before the reinforcement, the existence of a marking effect poses a challenge to the idea that temporal contiguity is always necessary for learning.

Indeed, this study ought to remind you a bit of some of the work we discussed regarding rehearsal and surprisingness (in particular, the work by Wagner, Rudy, and Whitlow and that of Hall and Pearce). Surprising events are apt to be rehearsed more. So, a distinctive surprising event following a response may result in that response being rehearsed for a longer period of time or becoming more distinct in memory (and thus more likely to be sampled as the cause of the reinforcement). Lieberman et al.'s take on this (the marking hypothesis) combines elements of both of the above (1985, p. 622):

[T]he effect of the marker proved to depend critically on what response preceded it: If a correct response was marked on food trials, then correct responding increased; if an incorrect response was marked, then incorrect responding increased. The most plausible explanation for this result, we believe, is that the marker triggered a memory search that focused attention on the preceding response, thereby increasing the likelihood that it would be remembered. [emphasis added]

Another example of long-delay learning concerns a study by D'Amato, Sarafin, and Salmon. They delayed reinforcers by at least 30 minutes in training trials with rats. In one experiment, animals were placed in one of the two goal boxes of a T-maze (an apparatus that looks like a T, in which the animal runs from the start box at the base of the T to one of the two arms at the top), then put in the start box and fed 30 minutes later. Despite this delay, the animals exhibited differential running to the arm in which they had been placed. Note, in particular, that no additional events or stimulus cues were present during the wait in the start box that may have become associated with the food: Once the animal was let out of the start box, the stimulus cues around the correct arm may have primed the memory of being in that arm.

Finally, note too that the work we have already mentioned by Olton using the 8-arm radial maze suggests that rats are quite adept at finding food in the maze without retracing their steps, and without generally revisiting an already-visited arm. As they visit arms at random, they would appear to maintain some information in short-term memory concerning which responses have already been made. Given the length of time it takes to visit all 8 arms, this clearly qualifies as a type of long-delay learning.

Numerous mechanisms for long-delay learning have been proposed. One that plays off of the notion of secondary reinforcers has been proposed by Spence. This involves the mechanism of an anticipatory fractional goal response (r_g). Note that the response in this instance is written with a lower-case r rather than an upper-case R. The reason is that the r is treated as one component or fraction of a more complex response, the goal response (R_g), the animal makes on reaching the goal and obtaining its reward. There will be numerous fractions or component responses such as chewing, swallowing, salivating, etc. These get conditioned to the stimulus cues present shortly before the animal enters the goal box. So, the association involves:

S_GoalCues ----------> r_g

But since these components or fractions are associated with food, they also become secondary reinforcers through higher-order conditioning. Thus, the cues present as the animal enters the goal area act as a reinforcer before the animal has actually received any food on that trial.

In addition, these components also have stimulus properties they are associated with, although these, of course, are unlearned: Chewing, swallowing, etc., all cause certain physical sensations. So, the association ought also to include these, as follows (the dots indicate an unlearned association):

S_GoalCues ----------> r_g ..... s_g

And as was true of the response fraction, we indicate the stimulus fraction with a lower-case s. The r_g ..... s_g is termed a mediator because it is a unit that may come between a stimulus and a response in a chain of associations.

In Spence's theory, these mediators become anticipatory; that is, they start being conditioned to earlier and earlier spots in a sequence. Thus, in a complex maze, the stimulus cues right before the cues that led to the goal box also take on secondary reinforcing properties. If we regard the goal cues as being at spot X, we will take the cues before these as being at spot X-1. Then, through classical conditioning we have:

S_X-1 ----------> S_GoalCues ----------> r_g ..... s_g

Or, to represent this by a shortcut:

S_X-1 ----------> r_g ..... s_g

And if at this point the animal needs to make a left turn to get to the area of the goal box, then the associations at this point involve:

S_X-1 ----------> r_g ..... s_g ----------> R_Left

And of course, we may now carry the procedure through to spot X-2 (the cues present before the X-1 cues). Thus, fractional goal responses are effectively conditioned throughout a complex chain in a process that should remind you of our example of the Time Warp. Consistent with this theory, animals do tend to learn a complex maze backwards (although not all results support the theory: In particular, researchers have not found evidence of anticipatory drooling at the various spots or choice points of a maze).
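The backward spread of fractional goal responses through a chain can be illustrated with a toy simulation in Python. This is a deliberately crude sketch under stated assumptions, not Spence's formal model: conditioning is treated as all-or-none, and it is assumed (purely for illustration) that the mediator can move back by only one spot per trial. The function name and parameters are hypothetical.

```python
def simulate_backward_chaining(n_spots, n_trials):
    """Return the trial on which each spot's cues first evoke r_g (None if never).

    Spots are indexed from the start of the maze; the last index plays the
    role of the goal cues (spot X), the one before it is spot X-1, and so on.
    Assumptions for illustration: all-or-none conditioning, one spot per trial.
    """
    conditioned_on = [None] * n_spots
    for trial in range(1, n_trials + 1):
        # Snapshot of what was conditioned before this trial began, so that
        # conditioning spreads back by only one spot per trial.
        already = [t is not None for t in conditioned_on]
        for spot in range(n_spots):
            if already[spot]:
                continue
            is_goal = spot == n_spots - 1
            # A non-goal spot is conditioned once the spot after it already
            # evokes r_g and so can act as a secondary reinforcer.
            next_conditioned = (not is_goal) and already[spot + 1]
            if is_goal or next_conditioned:
                conditioned_on[spot] = trial
    return conditioned_on


print(simulate_backward_chaining(4, 3))  # [None, 3, 2, 1]
print(simulate_backward_chaining(4, 4))  # [4, 3, 2, 1]
```

In this sketch the goal cues are conditioned first and the earliest spot last, which mirrors the observation that animals tend to learn a complex maze backwards. The one-spot-per-trial rate is an arbitrary choice and should not be read as a claim about actual learning rates.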

Thus, in theory, the presence of secondary reinforcers may help to bridge what appears to be a long delay. That would mean that long delays are really much shorter, since we need to assess the delay in terms of the first reinforcer present after a response. In this case, that first reinforcer will be a short-delay secondary reinforcer.

While secondary reinforcers and anticipatory fractional goal responses might account for some of the long-delay results, however, they cannot account for all of them. In particular, the study by D'Amato et al. would seem difficult to explain, since the animal is being fed in the start box, so that any secondary reinforcers ought first to be associated with it, rather than the arm the animal runs to. Similarly, Olton's results would not fall under this mechanism, because secondary reinforcers ought to become associated with an arm the animal has already visited, making it more likely the animal will revisit the same arm on the next trial. But that generally doesn't happen. And that it doesn't happen makes sense, according to foraging theory: An animal foraging in the wild for food is likely to deplete a food source, so there is adaptive value in searching for food in different locations. But, searching for food in this manner also requires a memory system that can keep track of where food was found previously, so as to avoid that spot.

Other factors that make long-delay learning possible include the presence of something to make the correct response distinct, or the occurrence of little intervening activity between the effective response and the reward. We have already discussed distinctiveness in terms of Lieberman's marking hypothesis: A distinct response is likely to be more salient, rehearsed more, and thus more readily available in memory when the occurrence of a reward triggers a search for events that might have been responsible for it. The notion of little intervening activity will similarly play off of a memory mechanism. With little or no intervening activity, the last response will still be the one most likely to be recovered from memory. But as activity increases, so do the possibilities of disruption (recall what Wagner, Rudy, and Whitlow found with post-trial episodes!) and of choosing the wrong response as the cause of the reward (response competition).

B. Belongingness

On a strict associational account, any association ought to form between responses and effective outcomes. However, several studies (in addition to the work we have already discussed in learned taste aversions) suggest that need not be the case.

One of these studies, conducted by Shettleworth, involved an experiment with hamsters. Shettleworth identified six high frequency activities in hamsters that included face washing, digging, scent marking, hind leg scratching, rearing, and front paw scraping. When each of these was subsequently paired with a food reinforcer, only digging, rearing, and front paw scraping were affected. Such restriction of the operation of a reinforcer represents a violation of the requirement that reinforcers be transituational: Here, at least, are three responses that a reinforcer of food will not affect.

Another study illustrating belongingness comes out of the work of Premack. By allowing kids to play with gumball machines (dispensing candy) and pinball (game) machines, Premack identified kids who were players or eaters (based on the relative proportion of time they spent with each machine). He then set up an experiment using the following design:

Group Subjects Response & Reinforcer

            1A             players                     play to eat
            1B             players                     eat to play
            2A             eaters                       play to eat
            2B             eaters                       eat to play

Thus, there was now a contingency between responding on one machine, and responding on the other: In one case (play to eat), kids would have to increase their time on the pinball machine to get an opportunity to use the gumball machine; in the other (eat to play), the reverse was required: kids would have to increase responding to the gumball machine to get a shot at the pinball machine. Only groups 1B and 2A showed learning. Thus, what counts as an effective reinforcer for one child may be completely ineffective for another. (Similar results hold up for animals: see the next chapter).

Also as a potential illustration of belongingness, we might mention the fact that certain responses that are easy to acquire with positive reinforcement become very difficult to acquire with negative reinforcement. Pecking a key for pigeons, for example, is difficult to train with negative reinforcement (e.g., MacPhail). The explanation for this latter result may have to do with Bolles's theory of safety and danger signals. Negative reinforcement involves the presence of danger signals that trigger SSDRs. Such responses may well interfere with the desired response, particularly if that desired response involves approaching the danger signal or aversive stimulus! Such a notion is similar to a more general principle of preparedness posited by Seligman: Responses may be ordered on a continuum ranging from prepared responses at one extreme to contraprepared responses at the other. Prepared responses are responses quite similar to what an animal would naturally do in a given situation, whereas contraprepared responses are those that are the exact opposite of what the animal would normally do (approach danger rather than flee from it, for example). According to Seligman's principle, the closer a response is to the prepared end of the continuum, the more easily it should be learned. So, the exact same response may be acceptable in one circumstance, but not in another.

C. Acquisition Without Direct Reinforcement

A number of studies question whether a reinforcer is necessary for forming or strengthening an association. The classic experiment illustrating this phenomenon was done by Tolman and Honzik. Their design involved having rats learn a maze. The rats were given one trial per day for 17 days. The experiment involved the following design:

Group Treatment

            1                     no RF; removed when reach goal box
            2                     RF on each day when reach goal box
            3                     no RF until the 11th day

The question Tolman and Honzik asked was how the third group would perform on days 11 through 17. Since these days represented the first time this group had experienced reinforcement, a reinforcement-based account of learning would suggest that these animals started learning only on Day 11. But in fact, on the 12th day, these animals were performing as well as (in fact, slightly better than) the animals reinforced from the beginning (Group 2): They had learned to navigate the maze in the absence of a reinforcer. This finding, termed latent learning, suggests that reinforcement may be more important for performance (motivating an animal to show its knowledge) than for acquisition.

Another similar result involves a study by Butler in which monkeys learned a response whose consequence involved being given access to a window looking out on a parking lot. While curiosity might be called a reinforcer, it seems a bit of a stretch in this case. The problem is that we have no way of independently identifying when learning would be expected to occur in the absence of any other reinforcer such as food, and when it would not. When is the animal curious?

A third study involves the area of observational learning. In a famous experiment by Bandura, kids watched a tape of a clown playing with toys. Children in the vicarious reinforcement group saw the clown being rewarded, but children in the vicarious punishment group saw the clown being punished for the way he played. Later, when these kids were given a chance to play with the same toys, the kids in the vicarious reinforcement group displayed the same behaviors: evidence that they had learned by watching. The kids in the punishment group played in a very different manner. But they had acquired the responses as well: When the experimenter asked them to show what the clown had done, they were able to do so. Thus, we find from this study that reinforcement and punishment may have an effect at a distance: Watching others be reinforced can serve as a reinforcement. Such a notion takes us far afield from the original idea of an appetitive stimulus that follows an emitted response (note that the children had not made the response themselves, and note also that all children had learned the response, though some of them had suppressed it until given permission by the experimenter to play the way the clown had played).

A final study on observational learning illustrates that the notion is not restricted to humans. Kohn and Dennis taught one group of rats a choice discrimination. A second group that was able to watch the training of the first group learned the choice discrimination faster. Both the Kohn and Dennis and the Bandura studies suggest a point we will explore in the next subsection; namely, that responses do not need to first be emitted in order for learning to occur.

D. Non-Response-Dependent Acquisition

There are several embarrassments for a theory that claims learning requires making a response. Some are perhaps more severe than others, but we can sort them into two categories: learning prior to responding, and response variations during acquisition or performance.

Learning Prior To Responding

As we saw in the observational learning studies, organisms (human and non-human) can start the process of learning prior to actually performing a physical response. To reiterate a theme that was important in classical conditioning, this seems strongly to suggest that an association forms between mental representations. Let's consider three more examples of this. They are relevant because, unlike the Bandura and Kohn and Dennis studies, they will not involve learning by observing others perform (i.e., learning by imitation). Thus, they cannot easily be accounted for in terms of vicarious reinforcement.

The first is a study by McNamara, Long, and Wike. They looked at how long it took to train two groups of rats to learn a maze. One group, however, was initially placed one-by-one in a 'wagon' and dragged through the maze. This group did not perform any of the running or turning responses, but they did get to observe their environments. And as you have probably guessed, this group learned to run to the goal box (where they had been dragged) faster than the group without that experience. Presumably, while being dragged through the maze, they had opportunities to observe the various stimulus cues and form a representation of their environments (a cognitive map: see the discussion below and in the next chapter on Tolman's theory of learning). Learning to navigate through those environments was thus speeded up for this group.

The second study involved the acquisition of cognitive maps by chimpanzees. In this study, Menzel had animal subjects watch while food was hidden in slightly under 20 different locations in a large field. The animals were brought along as Menzel hid the food over the field. Subsequently, they were released at the starting point, and observed while they collected the food. Two findings are relevant here. First, they knew the locations of the food, despite having made no response themselves. And second, they did not collect the food in the same order as it was hidden. Thus, we would not want to claim that a path through the field (consisting of a chain of locations to visit) was learned (as may have been the case in the McNamara et al. study). They clearly did not imitate Menzel's path or chain in this instance. The results again seem to suggest that animals can acquire representations that are map-like, and that their learning will show up as an enhanced ability to successfully find food (rather than as the performance of a given order of responses).

Our third study was introduced earlier in this chapter. This is the study on latent extinction by Seward and Levy. To remind you, animals placed directly in a goal box in which they have previously been fed more rapidly extinguished running to that goal box than a group without this initial experience. Insofar as extinction is normally defined as requiring the process of making a non-reinforced response, both groups should have extinguished at the same rate: The initial experience of the experimental group did not involve making a non-reinforced running response! But that didn't happen. Presumably, through classical conditioning, the goal box had become associated with food, so that absence of food may have aroused frustration or inhibition. Thus, a change in the value of the outcome (a type of devaluation) altered its value for the experimental group, making learning of extinction easier.

Response Variations

Alternatively, we may ask whether a given response, once it is acquired, is stamped in. On a strict associationist account such as Watson's or Hull's, motor movements are trained. However, the evidence strongly seems to disconfirm this, too.

Consider a study by Macfarlane. In this study, rats were trained to run a T-maze. Once they had acquired this response, Macfarlane flooded the maze and put the rats into the start box. They swam to the goal box that had previously been reinforced. The point of this study, of course, is that swimming and running technically involve different muscle movements. So, if Watson's view of learning were correct, we ought not to find evidence of learning when a different response is executed. But consistent with our discussion of cognitive maps, these animals had learned where to go: How to get there was not all that important.

A quite similar point occurs in studies with the use of the Morris maze. The Morris maze is a pool of opaque, milky-white liquid that has a platform somewhere underneath the water. The platform is close enough to the surface to enable a rat to keep its head above water without having to swim. Morris and his colleagues have found that rats released into the pool from the same spot eventually discover the platform (not having to expend energy on swimming is the reinforcer), and then learn to swim straight towards it. What is important from our perspective, however, is that these animals still head towards the platform when they are released from a new location: They are able to adjust their angle of swimming relative to the landmarks in the room that tell them in what direction the platform ought to be. That technically involves a different response than the one these animals made during acquisition. That they can execute the proper novel response again illustrates the involvement of cognitive maps in learning. What drives performance here is where to get to, and how to get there as soon as possible.

And indeed, people who study acquisition often report that animals will perform a number of different physical responses that appear equally effective in obtaining reinforcement. Rats need not (and will not) always press the bar with the same paw.

Finally, although it takes us slightly off of the focus of this section, we may also mention studies that show animals do not always prefer to make a response that has just been reinforced (and that should therefore be relatively strong). On a T-maze, for example, rats have a tendency to visit alternate arms (e.g., Dember and Fowler). This should remind you of our discussion of foraging theory. Similarly, Harlow has demonstrated that primates can learn a win-shift lose-stay strategy in choice discrimination in which they have to select the non-reinforced stimulus on the next trial. Responding to the previous S- and avoiding a response to the previous S+ ought to be difficult under normal associationist assumptions. Under foraging theory assumptions, it ought not to be that difficult.

E. Multiple Stimuli & 'Compound Conditioning'

Stimuli are complex events that can be broken down into yet simpler stimuli. Moreover, several different stimuli may be present on a given trial, each associated with its own reinforcer. In this section, we consider what happens in these circumstances.

Herrnstein's Matching Law

Consider the following situation: A pigeon is trained to peck at a key for reinforcement. We use a variable interval schedule (see the chapter on extinction and partial reinforcement for more details) in which the first peck after some random interval of time is reinforced. There may be a red key in one block of trials, and the first effective peck at it will yield two reinforcers. In another block of acquisition trials, there may be a blue key that will be rewarded with one reinforcer when pecked. And finally, in yet another block of trials, a green key may be associated with five reinforcers. Following this training, what will happen when the pigeon is put into an operant chamber containing all three keys? Situations such as this were investigated by Herrnstein.

The answer may surprise you. (And it will perhaps startle you to find that the answer doesn't depend on species: College-level humans have displayed the same result!) A first-guess common-sense theory most people come up with is that the animal (or human) will spend all of its time on the key that has the best value -- the green key. But this often does not happen. Instead, the animal (and the human) will distribute its responses in proportion to the reinforcements available. That is, it will peck all of the keys over a period of time, but will peck the green key proportionately more often than the red key, and the red proportionately more often than the blue.

Herrnstein has called this the matching law. To provide a simple formula for this law, let us assume we have a series of possible responses corresponding to a series of stimuli. In that case,

Responses to S₁/Total Responses = RFs from S₁/Total Available RFs

So, to calculate the proportion of times our animal spends with each key in the example above, we would calculate the following:

Proportion of Responses to S_Green = 5/(5+2+1) = 5/8 = .625

Proportion of Responses to S_Red = 2/(5+2+1) = 2/8 = .25

Proportion of Responses to S_Blue = 1/(5+2+1) = 1/8 = .125

That is, it should distribute 62.5% of its responses to the green key, 25% to the red key, and only 12.5% to the blue key.
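The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula as stated in the text; the function name matching_proportions and the dictionary layout are hypothetical choices made for the sketch, not part of Herrnstein's work.

```python
from fractions import Fraction

def matching_proportions(reinforcers):
    """Predicted share of responses per key under the matching law.

    reinforcers maps each key to the number of reinforcers it yields.
    (Hypothetical helper written for this illustration.)
    """
    total = sum(reinforcers.values())
    return {key: Fraction(n, total) for key, n in reinforcers.items()}

# Reinforcers per effective peck, taken from the three-key example in the text.
keys = {"green": 5, "red": 2, "blue": 1}
shares = matching_proportions(keys)

for key, share in shares.items():
    # Fractions print in lowest terms, so 2/8 appears as 1/4.
    print(f"{key}: {share} = {float(share):.3f}")
```

Using exact fractions makes it easy to confirm that the three shares sum to exactly 1, which is the same check on the calculations that applies to any set of matching-law proportions.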

The matching law applies generally whenever there is a difference in value of the reinforcer. We know that temporal contiguity can affect the value of a reinforcer, and a version of the matching law has been formulated for this situation, as well. But in this case, value depends on the reciprocal of the delay. A reinforcer that is given after a short delay has more value than one given after a long delay. So, given the same reinforcer presented at 2 and at 8 sec delays, its value would be 1/2 (.5) and 1/8 (.125), respectively. In this case, the matching law would have the following formula:

Responses to S₁/Total Responses = value of RF from S₁/Total Available values

Let us take another example. We'll again use a pigeon trained to peck at a red, blue, or green key. This time, the pigeon gets the same reinforcer from each, but at different delays: 2 sec for pecking at the red key, 4 sec for pecking at the blue key, and 8 sec for pecking at the green key. Before even doing any of the calculations, you ought to correctly predict that the red key will be pecked most (fastest reward), and the green key least (slowest reward).

But let's do the calculations. First, we need to calculate the values based on the delays. Remember that these are reciprocals. So, the values are:

            red: 1/2 = .5             blue: 1/4 = .25             green: 1/8 = .125

Plugging these into our formula will yield the following results:

Proportion of Responses to S_Green = .125/(.125+.25+.5) = .125/.875 = .143

Proportion of Responses to S_Red = .5/(.125+.25+.5) = .5/.875 = .571

Proportion of Responses to S_Blue = .25/(.125+.25+.5) = .25/.875 = .286

You can see that our predictions turn out to be correct. Specifically, our pigeon ought to distribute 57.1% of its responses to the red key, but 14.3% to green. (As a check on your calculations, by the way, note that the values ought all to add up to 100%!)
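For readers who want to try other delays, the delay version of the calculation can be sketched as follows. The only rule used is the one given in the text (value is the reciprocal of the delay); the function name delay_matching is a hypothetical label for this illustration.

```python
def delay_matching(delays_sec):
    """Predicted response shares when value = 1 / delay (in seconds).

    Hypothetical helper: delays_sec maps each key to its reward delay.
    """
    values = {key: 1.0 / delay for key, delay in delays_sec.items()}
    total_value = sum(values.values())
    return {key: value / total_value for key, value in values.items()}

# Delays from the example in the text: red 2 sec, blue 4 sec, green 8 sec.
delay_shares = delay_matching({"red": 2, "blue": 4, "green": 8})

for key, share in delay_shares.items():
    print(f"{key}: {share:.3f}")
```

Changing a single delay and rerunning the sketch shows how the shares of all three keys shift together, since each share depends on the total of the available values.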

The matching law also applies to aversive outcomes. However, there is some evidence that it does not apply to all instrumental situations (see, for example, Allison). Some discrepancies have been found with fixed interval schedules, for example (see the chapter on extinction and partial reinforcement). Nevertheless, it illustrates complex processes that depend on comparing momentary values from multiple stimuli. Several theories have been proposed to explain the matching law. One of these, the melioration theory, claims that animals are assessing the momentary odds of a payoff. When one stimulus has paid off, then the animal works on the stimulus that is next most likely to pay off. Although the example differs somewhat, it is reminiscent of people playing multiple slot machines who shift to a different machine as soon as the machine they're on has scored.

Compound Conditioning

The phrase compound conditioning is typically discussed in reference to classical conditioning. Nevertheless, it is also appropriate here, as there are many instances in which several stimulus elements are present during learning. The typical (and not always correct) assumption is that a complex stimulus is the basis for generalization: The animal has associated a response with all of the elements of that stimulus, and generalizes its responses to other stimulus complexes as a function of how many elements overlap. However, we sometimes obtain results reminiscent of blocking or overshadowing. We will look at these in more detail in the chapter on attention and categorization, but let us briefly consider several studies that will illustrate the point.

The first is a study by Reynolds. Reynolds taught pigeons to peck at a key that included several different stimulus elements. Specifically, there was a white triangle against a red background on top of the reinforced key (as opposed to a circle against a green background for the non-reinforced key). In a later generalization test, Reynolds checked for how many pecks the pigeons would give to a completely red key, a completely green key, a triangular key, and a circular key. If the pigeons were under the control of the total stimulus complex (triangle & red), then we would expect significant generalization to both red and triangle, since they contain elements of the original training complex. Instead, Reynolds found that one pigeon pecked just to the red key, and the other pecked just to the triangular key. In this case, one element of the complex was the only effective or salient element; it completely overshadowed the other.

A second study that illustrates a similar phenomenon was conducted by Wagner, Logan, Haberlandt, and Price. They set up a design for two groups of animals that was something like the following:

Group Compound Stimulus RF Frequency

            1             Light & Tone₁                  50%
                           Light & Tone₂                  50%
            2             Light & Tone₁                100%
                           Light & Tone₂                    0%

This study, of course, manipulates signal value. Thus, for the first group, the light has better signal value than either of the tones, because the light predicts more reinforcers (each tone by itself only predicts half the reinforcers the light does; you would have to attend to both tones to predict as many reinforcers as the light: two things to track rather than one). In contrast, the light is a worse predictor than Tone₁ in the second group: paying attention to the light will work only half the time (as it does for Group 1), but paying attention to Tone₁ will work all of the time. Wagner et al. found that the light does a more thorough job of controlling responding in the first group, despite the fact that it really predicts the same number of reinforcers in both (note the similarity to how we set up a blocking design in classical conditioning).
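The signal-value comparison in this design can be worked through numerically. The sketch below assumes equal numbers of the two compound trial types in each group (an assumption of the illustration, not a detail given above), and the two measures it computes, reinforcement_rate and reinforcer_coverage, are hypothetical names for "how often the cue pays off" and "what share of all reinforcers the cue precedes."

```python
# Each trial type: (stimuli present, probability of reinforcement).
# Equal numbers of each trial type are assumed for this sketch.
designs = {
    "Group 1": [({"light", "tone1"}, 0.5), ({"light", "tone2"}, 0.5)],
    "Group 2": [({"light", "tone1"}, 1.0), ({"light", "tone2"}, 0.0)],
}

def reinforcement_rate(trial_types, cue):
    """Proportion of the cue's trials that end in reinforcement."""
    rates = [p for stimuli, p in trial_types if cue in stimuli]
    return sum(rates) / len(rates)

def reinforcer_coverage(trial_types, cue):
    """Share of all delivered reinforcers that the cue precedes."""
    total = sum(p for _, p in trial_types)
    covered = sum(p for stimuli, p in trial_types if cue in stimuli)
    return covered / total

cues = ("light", "tone1", "tone2")
rate = {g: {c: reinforcement_rate(t, c) for c in cues} for g, t in designs.items()}
coverage = {g: {c: reinforcer_coverage(t, c) for c in cues} for g, t in designs.items()}

for group in designs:
    print(group, "rate:", rate[group], "coverage:", coverage[group])
```

Under these assumptions the light pays off on half of its trials in both groups, each tone in Group 1 precedes only half of the reinforcers the light does, and Tone₁ in Group 2 pays off every time, which is the pattern described above.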

Can we get blocking in the traditional fashion? We ought to predict a blocking effect if instrumental conditioning operates the same way classical conditioning does. That is, given the following design:

Group Phase 1 Phase 2

            1             RF for R to S₁             RF for R to S₁&S₂
            2             (Nothing)                  RF for R to S₁&S₂

we would predict that the animals in Group 1 should block to S₂, whereas the animals in Group 2 might be expected to show some conditioning to both S₁ and S₂. Thomas, Mariner, and Sherry used this type of design with pigeons, and obtained evidence of blocking.
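To see why a blocking effect is the expected outcome, here is a minimal simulation of the two-group design using the standard Rescorla-Wagner update from classical conditioning, in which every stimulus present on a trial changes by the learning rate times (asymptote minus the summed strength of the stimuli present). This is a sketch under stated assumptions: the learning rate of 0.2, the asymptote of 1.0, and 50 trials per phase are arbitrary illustrative values, and the code is not the modified instrumental version of the model, nor a reconstruction of the Thomas, Mariner, and Sherry analysis.

```python
def rescorla_wagner(phases, rate=0.2, lam=1.0):
    """Return associative strength V for each stimulus after training.

    phases: list of (stimuli present, number of reinforced trials).
    rate:   combined learning-rate parameter (assumed value).
    lam:    asymptote supported by the reinforcer (assumed value).
    """
    strength = {}
    for stimuli, n_trials in phases:
        for _ in range(n_trials):
            # Prediction error is shared by every stimulus on the trial.
            error = lam - sum(strength.get(s, 0.0) for s in stimuli)
            for s in stimuli:
                strength[s] = strength.get(s, 0.0) + rate * error
    return strength

# Group 1: S1 alone in Phase 1, then the S1&S2 compound in Phase 2.
group1 = rescorla_wagner([(("S1",), 50), (("S1", "S2"), 50)])
# Group 2: nothing in Phase 1, then the S1&S2 compound in Phase 2.
group2 = rescorla_wagner([(("S1", "S2"), 50)])

print(f"Group 1: V(S1) = {group1['S1']:.3f}, V(S2) = {group1['S2']:.3f}")
print(f"Group 2: V(S1) = {group2['S1']:.3f}, V(S2) = {group2['S2']:.3f}")
```

In the simulated Group 1, S₁ has already absorbed nearly all of the available strength by the end of Phase 1, so almost no prediction error is left to drive learning about S₂; in Group 2 the two stimuli split the strength between them.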

Finally, consider a study by Lawrence and DeRivera. They used stimuli involving different shades of grey. In this study, animals had to perform one response if the darker shade was on top of a lighter shade, and the opposite response if the lighter shade was on top of the darker. During training, the bottom shade in all cases was a neutral grey.

The question Lawrence and DeRivera asked was whether the animals were under control of just the top color, or were comparing the top color with the bottom color. To make a long story short, they found that animals were essentially comparing the top shade to the bottom shade. When two of the lighter shades were presented, for example, which response the animal performed depended on whether the lighter of these shades was on the top or the bottom, even though both shades had been associated with the same response during acquisition. In this case, responding was apparently based on a comparison rule such as darker on top or lighter on top. Such comparative sensitivity is referred to as relational learning. It is incompatible with the claim that an association forms between independent stimulus elements and the response. That assumption forms part of many associationist models such as Hull's or Spence's theories (see the next chapter), but it also appears inconsistent with a model like the Rescorla-Wagner model, in which each stimulus is treated independently of the others on a conditioning trial. Under certain circumstances, stimulus configurations are not equal to the sum of their component parts.

IV. Several Views Of Instrumental Conditioning

We have implicitly and explicitly discussed several views of what instrumental conditioning involves. On many associationist accounts, there will be an association between a stimulus and a response, at a minimum, and there may or may not be associations with the outcome. Learning involves the formation of these associations, or their strengthening or weakening. In contrast, as many of the studies we have just reviewed on cognitive maps and response variations suggest, the stimulus-response link may not be all that critical. In this section, we briefly introduce two diametrically opposed views of learning. These are the radical behaviorist approach of Skinner and the cognitive expectancy theory of Tolman.

A. Skinner's View

We start with a notion Skinner expresses that makes him fundamentally different from practically everyone else in the field of learning: Skinner's refusal to engage in formal theorizing. For Skinner, theories involve hypothetical constructs that cannot be directly observed, and that therefore ought not to be discussed. And since theories (for most scientists and researchers) generate predictions to be tested, you may correctly guess that Skinner's notion of what experiments are about will also differ. Perhaps the best way of putting it is this: Skinner's system only requires that we be able to predict and control behavior. When we have accomplished that goal, we need go no further. So, we do experiments to see which environmental contingencies control behavior in which ways.

For Skinner, there are essentially two broad categories of behavior: operants and respondents. Respondents cover behaviors that are reflexively coaxed out of the animal by the presence of specific stimuli. The presence of other stimuli at the same time enables them to acquire the ability to elicit these responses. Thus, he includes the work in classical conditioning under the category respondent. But operants are more characteristic of what most of us (and most higher animals) do: They are the pieces of behavior that an animal emits in a given situation.

The distinction between elicited and emitted behavior is an important one for Skinner. Emitted behavior is not triggered by a single stimulus link to that behavior, whatever it is. Operants are presumably influenced by a whole complex of events including the contextual stimuli, genetic constraints, and motivational factors having to do with the animal's state of hunger or thirst at the time. This is such a complex set of determiners that for all practical purposes, Skinner refuses to talk about which stimuli connect with which responses. Thus, contrary to most behaviorist theorists, Skinner does not talk about an association forming between a given stimulus and a given response, nor about whether and when that association strengthens or weakens. Rather, stimuli for Skinner serve something of the same function as occasion setters: They signal times at which a response-outcome contingency occurs.

We ought to take a moment to note that the notion of a response-outcome contingency in Skinner's work does not mean the same thing as an operant contingency space. This notion merely means that the experimenter has set up a condition whereby a given response will occasionally be followed by an outcome. Contiguity of the response and the outcome constitute the relevant learning mechanism, as we have seen in the previous discussion of superstitious behavior. But that is not to say that stimuli are irrelevant. When an outcome follows a response in the presence of one stimulus but not another, the animal learns a discrimination. The response comes under the control of the first stimulus (stimulus control), not in the sense that there is a triggering effect, but in the sense that the first stimulus becomes part of the entire stimulus situation in which responding alters.

We ought also to take a moment to discuss this notion of a response. Although Skinner did use the term, it is in some sense odd, given the de-emphasis on any specific cause of the response. Given his view, it will not surprise you to learn that he did not insist that learning require the same physical response to increase or decrease in frequency (unlike Watson and Hull, who both claimed an association formed with specific muscular movements). So, many of the objections we looked at above to the learning of specific movements do not apply to Skinner. Instead, he adopted a functional definition of a response: any set of responses that achieve the same function qualify as the same response. Thus, if the function is to get to the left side of a T-maze, running to it, swimming to it, backing up to it, and casually crawfishing to it all count as the same response, because they all accomplish the same function.

Moreover, through shaping and chaining, complex operants followed by a single reinforcer are strengthened as a group, so that behavior may be constituted into quite long sequences. The sequence of getting out your car key, inserting it into the lock, unlocking the door, taking the key out, opening the door, getting in, closing the door, putting the key in the ignition, and turning it may all be reinforced by the motor coming on, but will all be punished by an engine that refuses to turn. If this seems a strange example to bring up, it isn't, really. Above all, Skinner was always concerned with the practical aspects of modifying behavior in the real world, and exploring how real-world contingencies affected behavior. Thus, in part as a pragmatist, he was concerned with what worked, and not with elaborate theories of why or how.

Perhaps because he was a pragmatist, he and his followers also tended to avoid experimental designs involving large groups of animals or people. His focus was on the individual, and whatever changes could be observed in the individual. Control that individual's behavior to some extent, and you have demonstrated sufficient explanation for why it occurs (since you are now able reliably to predict the presence or absence of that behavior).

Among his contributions were the study of partial reinforcement schedules (see Ferster and Skinner), behavior modification techniques and token economies, the notion of superstitious behavior, behavioral-level definitions of reinforcers and punishers (as opposed to the theoretical definitions we will see in the next chapter), work on secondary reinforcers and punishers, teaching machines, and the notion of negative reinforcement (which involves a different definition than the one I have used earlier: For Skinner, negative reinforcers are aversive events that operate as reinforcers by being removed. This is a subtle difference, but recall that we have defined negative reinforcement in terms of the removal of the event, not whether the event itself is aversive). Skinner, unlike Watson, was also happy talking about the conditioning of private events. And shortly before his death in 1990, he lambasted the current emphasis in American Psychology on cognitive models and memory systems. Ironically, his approach had become isolated from mainstream research at the same time that his behavior modification procedures had become a normal part of educational and clinical management techniques.

I don't think Skinner would have been disturbed by any research finding whatsoever. Since he refused to build formal theories, no finding could really have been inconsistent with his approach. In some sense, I think of Skinner as engaged in a process of cataloging or categorizing behavior: What are the situations under which this response increases? What are the situations under which it displays resistance to extinction? What mechanisms are effective for altering a response? How can we set our environments up to provide maximal efficiency? These were the issues that occupied him.

As an example of work inspired by Skinner's approach, we may consider a famous series of experiments by people like Verplanck and Greenspoon on verbal conditioning. They used a reinforcer of agreement (e.g., "uh huh" or even a pencil tap) and showed that people increased whatever it was that the "uh huh" followed: plural nouns rather than singular nouns, affective rather than descriptive statements, etc. Following acquisition, Greenspoon put his subjects through extinction, and following that, questioned them to see if they had been aware of what was going on. He claimed they weren't. Thus, several theorists made the very strong claim that humans could easily be conditioned without their awareness (a claim Watson would have loved, of course).

Later theorists such as Dulany and Spielberger and DeNike challenged those studies. They pointed out a number of potential problems. One, for example, was that whatever people might have thought was going on would have been implicitly disconfirmed once extinction started, since they would now be collecting evidence against their hypothesis. Another was that a number of correlated hypotheses could have increased responding, but that these weren't considered by the earlier experimenters. For instance, if you suspect you're being reinforced for mentioning breeds of dogs, then you may say "chihuahuas, collies, dachshunds, terriers, pugs," etc. Note that these are plurals. But when the experimenter asks you what you believed the purpose of the experiment was, you report that you were being stroked for coming out with dogs, which gets you coded (unfairly) as having shown conditioning without awareness. And finally, Dulany demonstrated that the people who showed verbal conditioning were those who at the time had a correlated hypothesis (i.e., were aware that something was going on, and had a theory that would result in increasing responses that the experimenter would count as correct by administering reinforcement).

I think a true Skinnerian wouldn't have been much bothered by Dulany's or Spielberger and DeNike's results. Awareness for them could be defined operationally as a series of answers to questions on a survey (much as Watson defined emotions as nothing more than certain behaviors like crying or shaking). Those answers constitute verbal behavior, as well. All a true Skinnerian need do in this situation is talk about the conditions under which one type of verbal behavior (performance on a survey) accompanies another (conditioning). I include this example because I want to give you a flavor of the extraordinarily different ways in which people interpret scientific research. To go back to the work by the philosopher Thomas Kuhn (mentioned in Chapter 1), Skinner's approach represents a completely different paradigm. And people in different paradigms can only rarely have useful discussions with one another about the foundational and philosophical assumptions that make science and the world meaningful for them. It is a bit like arguing religious beliefs.

B. Tolman's Expectancy Approach

Skinner's first major publication, The behavior of organisms, appeared in 1938. To give you a time frame, Tolman published one of his major works, Purposive behavior in animals and men, in 1932. At the time, behaviorism was the only game in town. Tolman regarded himself as a behaviorist, but he was like no other theorist then around. As you may gather from the title of his book, he believed that behavior was guided by an organism's purposes and goals. Thus, his brand of behaviorism is called purposive behaviorism.

Tolman was a theorist, in contrast to Skinner. He built models around the notion of intervening variables that came between a stimulus and a response. These variables in large part involved cognitions: beliefs, expectancies, desires, and knowledge. The argument that he and others who have used intervening variables make is that theories with such variables are more successful in their predictions than those without. He didn't worry about Watson's dictum that private events were illegitimate in a science of psychology, since the proof of the legitimacy of the concept for Tolman was its track record. Tolman was the precursor of people like Bandura who built models around observational learning, and more generally, of the cognitive revolution that occurred in American psychology in the 1960s.

Several cognitions were particularly important in Tolman's work. One of these we have already met: the notion of a cognitive map. According to Tolman, animals observing, exploring, and experiencing their environments would come to have representations of the layout of those environments. Thus, he performed experiments in which he showed that animals would demonstrate they had learned an environment once properly motivated to do so (e.g., the Tolman and Honzik experiment on latent learning discussed earlier), and that they knew how to get around obstacles and take novel shortcuts when their normal routes were no longer available. We can discuss such learning in terms of stimulus-stimulus (S-S) associations, but in many of Tolman's studies it was clearly observational learning.

Another cognition that was quite important (and that prefigured many modern theories of learning) was the notion of an expectancy or expectation: a belief that some event ought to occur in some situation based on past experiences. In discussing the situations found in instrumental and classical conditioning, Tolman provided examples of two types of expectancies. One may be written as follows:

E_j: S₁ -----> S₂

This may be read as stating the content of expectancy E_j: That expectancy tells us that when S₁ occurs, S₂ may be expected to follow. If we substitute the CS for S₁ and the UCS for S₂, we see that we obtain a situation corresponding to classical conditioning. But this situation extends far beyond classical conditioning. It may also explain how we build cognitive maps: we learn that this part of the route is normally succeeded by this other part.

As for instrumental conditioning, the expectancies may be given as below:

E_k: S₃ R_a -----> S₄

E_l: S₃ R_b -----> S₅

These two expectancies, E_k and E_l (the subscripts are just to keep them separate; we have a huge number of expectancies in which these stimuli are specified rather than indicated through abstract mathematical variables), basically state that in the presence of stimulus S₃, one response (R_a) will lead to stimulus S₄, and the other response (R_b) will result in a different stimulus (S₅). If you view these latter stimuli as rewards or punishers, then you obtain the expectancies that account for approach or avoidance.

We can now add the notions of value and valence to account for what an animal will do. The value of an expectancy has to do with the strength of its terminal stimulus. If we temporarily regress to speaking of reinforcers and punishers, there are strong and weak reinforcers, just as there are strong and weak punishers. As you might expect, the strong ones are of greater motivational value than the weak ones. As for whether something is positive or negative, this involves the notion of its valence or sign.

Given this, we can now state some simple rules for what an animal will do in any given situation. One is that given a choice between two positive valence responses, the animal will choose the stronger. As an example, consider the following expectancies in an experiment with monkeys:

E_k: Tone - Lift White Cup -----> find banana chip

E_l: Tone - Lift Blue Cup -----> find piece of lettuce

Here, we presume the animal is faced with a choice involving two down-turned cups. Each has a reinforcer hidden beneath it, and the animal may choose the reinforcer underneath one of the cups. From past experience, it has learned that the white cup hides a banana chip, and the blue cup hides a lettuce leaf. Banana chips are stronger reinforcers: They are high-value positive-valence outcomes. Thus, our principle states the animal ought to choose the white cup. In common words, choose the better of two goods.

Our second rule will involve negative valences. We have a rat in a chamber that it may leave by one of two doors. It is being shocked in the chamber, so there is every reason to leave. If it goes through the north door, the shock reduces by half, and if it goes through the south door, the shock reduces by a fourth. In this case, the principle is choose the weaker of two negative-valence outcomes. Here, that means the north door, since half of the original shock remains there, versus three fourths behind the south door. Or in plain English, if you have to, go for the lesser of two evils.
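
The two rules above can be folded into one if each expectancy carries a signed value (valence times strength): the animal picks the response whose expected outcome has the highest signed value. What follows is only a rough sketch in Python under that assumption. The names Expectancy and choose_response, the +1/-1 coding of valence, and all of the numbers are made up for illustration; they are not part of Tolman's own formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectancy:
    """In the presence of `stimulus`, `response` is expected to lead to `outcome`."""
    stimulus: str
    response: str
    outcome: str
    valence: int      # assumed coding: +1 = positive outcome, -1 = negative outcome
    strength: float   # assumed non-negative magnitude of the outcome

    @property
    def signed_value(self) -> float:
        return self.valence * self.strength


def choose_response(expectancies):
    """Return the response whose expected outcome has the highest signed value."""
    best = max(expectancies, key=lambda e: e.signed_value)
    return best.response


# Monkey example: the strengths 1.0 and 0.3 are invented numbers that
# simply make the banana chip the stronger reinforcer.
monkey = [
    Expectancy("Tone", "Lift White Cup", "banana chip", +1, 1.0),
    Expectancy("Tone", "Lift Blue Cup", "piece of lettuce", +1, 0.3),
]

# Rat example: strength is the fraction of the original shock that remains
# (half behind the north door, three fourths behind the south door).
rat = [
    Expectancy("Shock chamber", "North door", "shock reduced by half", -1, 0.5),
    Expectancy("Shock chamber", "South door", "shock reduced by a fourth", -1, 0.75),
]

print(choose_response(monkey))  # the better of two goods
print(choose_response(rat))     # the lesser of two evils
```

Treating negative outcomes as negative numbers is a design choice of this sketch: it lets a single "take the maximum" rule cover both the better of two goods and the lesser of two evils.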

That gives you a bit of a taste of Tolman's theory. We will talk about several more relevant studies later. Tolman and Hull (see the next section), in particular, constantly chased one another's experiments and theories, arguing about whether the results suggested the need for cognitive factors or not. Interestingly enough, they both utilized intervening variables, although Hull's system was far more developed and organized than Tolman's. But Tolman had a gift for finding the weakness in one of Hull's claims, and doing an experiment that would seem to demonstrate a result the exact opposite of what Hull predicted. That in part was the case with the latent learning study, since Hull's model allows formation of an association only with a very special type of reinforcing event called a drive reduction. But in latent learning, no such event occurred on the first 10 trials for the group running without a reinforcer.

It's hard to imagine two approaches more different, and yet in some respects similar, than Tolman's and Skinner's. They were both concerned with large-scale behavior rather than the minute muscle movements of Hull's system. But Tolman freely speculated on intervening variables while Skinner loathed them. One placed the cause of behavior squarely within a cognitive or representation-level approach, but the other kept as close to a behavioral-level approach as was possible. And the specific details of each theorist's approach were generally ignored by most people, although each had enormous influence on subsequent work or theoretical approaches. Indeed, one of Tolman's students, Krechevsky, developed the notion that animals during learning test hypotheses about which stimulus element they are supposed to notice. As we will see in a later chapter, this notion evolved into modern-day attentional theories of discrimination learning.

C. A Note On The Interrelationship Between Classical & Operant Conditioning

It should by now be obvious that there are many close interrelationships between classical and instrumental conditioning, Skinner's distinction between the two notwithstanding. In some cases, they are difficult to tell apart (as in the work on learned taste aversions, which may be analyzed from either a classical or an instrumental conditioning perspective). In others, they are clearly intertwined. Secondary reinforcers and aspects of chaining clearly involve classical conditioning, and classical conditioning has sometimes been used to account for what happens in avoidance learning (see the discussion of Mowrer's two-factor theory of avoidance learning in the next chapter). So, how different are they, really?

Several theorists have tried to answer this question from different perspectives. One approach simply attempts to cut the Gordian knot by applying similar models to each. In a later chapter in which we examine theories of discrimination learning, we will come across attentional and rehearsal concepts that will remind you of the corresponding models in classical conditioning. There have been some attempts, for example, to modify the Rescorla-Wagner model to handle the strength of instrumental learning by treating the S and the R as CSs in compound conditioning, and the RF as the UCS (see, for example, Wasserman, Elek, Chatlosh, and Baker). We have already seen some of the predictions regarding blocking and overshadowing that would naturally arise from application of such a model.
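
To make the compound-conditioning idea concrete, here is a minimal sketch in Python of the standard Rescorla-Wagner update, in which the change in strength of each element present on a trial is its salience times a learning-rate parameter times the difference between the maximum strength the outcome supports and the summed strength of all elements present. The S and the R are treated as the two elements and the RF as the outcome. This is not the specific model of Wasserman et al.; the function names, the parameter values, and the idea of a pretraining phase in which S alone is followed by the RF are assumptions chosen only to illustrate why blocking falls out of such a model.

```python
def rescorla_wagner_trial(strengths, present, saliences, beta, lam):
    """Update each element present on one trial: dV = alpha * beta * (lam - sum of V)."""
    prediction = sum(strengths[e] for e in present)  # summed strength of the compound
    error = lam - prediction                         # how surprising the outcome is
    for e in present:
        strengths[e] += saliences[e] * beta * error


def train(strengths, present, n_trials, saliences, beta=0.5, lam=1.0):
    """Run n_trials reinforced trials with the given elements present."""
    for _ in range(n_trials):
        rescorla_wagner_trial(strengths, present, saliences, beta, lam)
    return strengths


# Assumed, equal saliences for the stimulus (S) and the response (R).
saliences = {"S": 0.3, "R": 0.3}

# Control: S and R always occur together and are followed by the reinforcer.
control = train({"S": 0.0, "R": 0.0}, ["S", "R"], 50, saliences)

# Blocking: S alone is followed by the reinforcer first, then the S + R compound.
blocked = train({"S": 0.0, "R": 0.0}, ["S"], 50, saliences)
blocked = train(blocked, ["S", "R"], 50, saliences)

print("control:", control)
print("blocked:", blocked)
```

In the control condition, S and R end up sharing the available strength about equally. After pretraining, S already predicts the reinforcer almost perfectly, so the error term is near zero and R gains very little strength.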

Another approach is more direct: If Skinner is correct in identifying these two types of learning as really involving different types of muscle systems (voluntary versus involuntary muscles), then we ought not to be able to instrumentally condition involuntary reflexes. However, there is now quite a lot of work in the area of biofeedback on different species (including humans) that demonstrates that modifying involuntary responses through the operation of instrumental reinforcers is feasible (see, for example, Miller).

Whether or not the same models ultimately apply to these two areas, do note that some of the other type of learning will always be occurring whenever you train with either an instrumental or a classical conditioning procedure. Reinforcers and punishers are also significant biological events that act as UCSs, so that instrumental conditioning using such outcomes should generally include some aspects of classical conditioning. Similarly, biologically significant UCSs in classical conditioning may act as outcomes influencing the responses the animal makes prior to their presentation.

Learning in the real world doesn't always occur in neat packets that allow one type of association to form, and not another.

Partial Bibliography

Allison, J. (1989). The nature of reinforcement. In S.B. Klein & R.R. Mowrer (Eds.), Contemporary learning theories: Instrumental conditioning theory and the impact of biological constraints on learning (13-39). NJ: Erlbaum.

Amsel, A. (1958). The role of frustrative nonreward in noncontinuous reward situations. Psychological Bulletin, 55, 102-119.

Badia, P., & Culbertson, S. (1972). The relative aversiveness of signalled vs. unsignalled escapable and inescapable shock. Journal of the Experimental Analysis of Behavior, 17, 463-471.

Badia, P., Culbertson, S., & Harsh, J. (1973). Choice of longer or stronger signalled shock vs. shorter or weaker unsignalled shock. Journal of the Experimental Analysis of Behavior, 19, 25-32.

Bandura, A. (1965). Influence of models' reinforcement contingencies on the acquisition of imitative responses. Journal of Personality and Social Psychology, 1, 589-595.

Boe, E.E., & Church, R.M. (1967). Permanent effects of punishment during extinction. Journal of Comparative and Physiological Psychology, 63, 486-492.

Bolles, R.C. (1970). Species-specific defense reactions and avoidance learning. Psychological Review, 77, 32-48.

Breland, K., & Breland, M. (1961). The misbehavior of organisms. American Psychologist, 16, 681-684.

Brown, J.S., Martin, R.C., & Morrow, M.W. (1964). Self-punitive behavior in the rat: Facilitative effects of punishment on resistance to extinction. Journal of Comparative and Physiological Psychology, 57, 127-133.

Brown, P.L., & Jenkins, H.M. (1968). Auto-shaping of the pigeon's key peck. Journal of the Experimental Analysis of Behavior, 11, 1-8.

Butler, R.A. (1953). Discrimination learning by rhesus monkeys to visual exploration motive. Journal of Comparative and Physiological Psychology, 46, 95-98.

Campbell, P.E., Batsche, C.J., & Batsche, G.M. (1972). Spaced-trials reward magnitude effects in the rat: Single versus multiple food pellets. Journal of Comparative and Physiological Psychology, 81, 360-364.

Capaldi, E.J. (1978). Effects of schedule and delay of reinforcement on acquisition speed. Animal Learning and Behavior, 6, 330-334.

Colwill, R.M., & Rescorla, R.A. (1985). Postconditioning devaluation of a reinforcer affects instrumental responding. Journal of Experimental Psychology: Animal Behavior Processes, 11, 120-132.

Crespi, L.P. (1942). Quantitative variation in incentive and performance in the white rat. American Journal of Psychology, 55, 467-517.

Daly, H.B. (1974). Reinforcing properties of escape from frustration aroused in various learning situations. In G.H. Bower (Ed.), The psychology of learning and motivation (Vol 8, 187-231). NY: Academic.

D'Amato, M.R. (1970). Experimental psychology: Methodology, psychophysics, and learning. NY: McGraw-Hill.

D'Amato, M.R., Sarafin, W.R., & Salmon, D. Long-delay conditioning and instrumental learning: Some new findings. In N.E. Spear and R.R. Miller (Eds.), Information processing in animals: Memory mechanisms (113-142). NJ: Erlbaum.

Dember, W.N., & Fowler, H. (1958). Spontaneous alternation behavior. Psychological Bulletin, 55, 412-428.

Dollard, J.C., & Miller, N.E. (1950). Personality and psychotherapy. NY: McGraw-Hill.

Dulany, D. E. (1968). Awareness, rules, and propositional control: A confrontation with S-R behavior theory. In T.R. Dixon & D.C. Horton (Eds.), Verbal behavior and general behavior theory. NJ: Prentice-Hall.

Dwyer, D.M., Mackintosh, N.J., & Boakes, R.A. (1998). Simultaneous activation of the representations of absent cues results in the formation of an excitatory association between them. Journal of Experimental Psychology: Animal Behavior Processes, 24, 163-171.

Egger, M.D., & Miller, N.E. (1963). When is a reward reinforcing? An experimental study of the information hypothesis. Journal of Comparative and Physiological Psychology, 56, 132-137.

Ferster, C.B., & Skinner, B.F. (1957). Schedules of reinforcement. NY: Appleton-Century-Crofts.

Flaherty, C.F. (1982). Incentive contrast: A review of behavioral changes following shifts in reward. Animal Learning and Behavior, 10, 409-440.

Fowler, H., & Miller, N.E. (1963). Facilitation and inhibition of runway performance by hind- and forepaw shock of various intensities. Journal of Comparative and Physiological Psychology, 56, 801-806.

Fowler, H., & Trapold, M.A. (1962). Escape performance as a function of delay of reinforcement. Journal of Experimental Psychology, 63, 464-467.

Garcia, J., Ervin, F.R., & Koelling, R.A. (1966). Learning with prolonged delay of reinforcement. Psychonomic Science, 5, 121-122.

Greenspoon, J. (1955). The reinforcing effect of two spoken sounds on the frequency of two responses. American Journal of Psychology, 68, 409-416.

Grice, G.R. (1948). The relation of secondary reinforcement to delayed reward in visual discrimination learning. Journal of Experimental Psychology, 38, 1-16.

Guthrie, E.R. (1952). The psychology of learning (revised edition). NY: Harper & Row.

Guttman, N., & Kalish, H.I. (1956). Discriminability and stimulus generalization. Journal of Experimental Psychology, 51, 79-88.

Hammond, L.J. (1980). The effect of contingency upon the appetitive conditioning of free operant behavior. Journal of the Experimental Analysis of Behavior, 34, 297-304.

Hanson, H.M. (1959). Effects of discrimination training on stimulus generalization. Journal of Experimental Psychology, 58, 321-334.

Herrnstein, R.J. (1970). On the law of effect. Journal of the Experimental Analysis of Behavior, 13, 243-266.

Jenkins, H.M., & Moore, B.R. (1973). The form of the autoshaped response with food or water reinforcers. Journal of the Experimental Analysis of Behavior, 20, 163-181.

Killeen, P. (1978). Superstition: A matter of bias, not detectability. Science, 199, 88-90.

Kohn, B., & Dennis, M. (1972). Observation and discrimination learning in the rat: Specific and nonspecific effects. Journal of Comparative and Physiological Psychology, 78, 292-296.

Kraeling, D. (1961). Analysis of amount of reward as a variable in learning. Journal of Comparative and Physiological Psychology, 54, 560-565.

Krechevsky, I. (1932). "Hypotheses" in rats. Psychological Review, 39, 516-532.

Kuhn, T.S. (1970). The structure of scientific revolutions (second edition). Chicago: University of Chicago Press.

Lawrence, D.H., & DeRivera, J. (1954). Evidence for relational transposition. Journal of Comparative and Physiological Psychology, 47, 465-471.

Lieberman, D.A., Davidson, F.H., & Thomas, G.V. (1985). Marking in pigeons: The role of memory in delayed reinforcement. Journal of Experimental Psychology: Animal Behavior Processes, 11, 611-624.

Macfarlane, D.A. (1930). The role of kinesthesis in maze learning. University of California Publications in Psychology, 4, 277-305.

MacPhail, E.M. (1968). Avoidance responding in pigeons. Journal of the Experimental Analysis of Behavior, 11, 625-632.

McNamara, H.J., Long, J.B., & Wike, F.L. (1956). Learning without response under two conditions of external cues. Journal of Comparative and Physiological Psychology, 49, .

Melton, A.W., & Irwin, J.M. (1940). The influence of degree of interpolated learning on retroactive inhibition and overt transfer of specific responses. American Journal of Psychology, 53, 173-203.

Menzel, E.W. (1978). Cognitive mapping in chimpanzees. In S.H. Hulse, H. Fowler, & W.K. Honig (Eds.), Cognitive processes in animal behavior (375-422). NJ: Erlbaum.

Miller, N. E. (1978). Biofeedback and visceral learning. Annual Review of Psychology, 29, 373-404.

Morris, R.G.M., Garrud, P., Rawlins, J.N.P., & O'Keefe, W. (1982).

Olton, D.S. (1978). Characteristics of spatial memory. In S.H. Hulse, H. Fowler, & W.K. Honig (Eds.), Cognitive processes in animal behavior (341-373). NJ: Erlbaum.

Postman, L. (1974). Transfer, interference, and forgetting. In J.W. Kling and L.A. Riggs (Eds.), Experimental psychology. NY: Holt, Rinehart, & Winston.

Premack, D. (1959). Toward empirical behavior laws: I. Positive reinforcement. Psychological Review, 66, 219-233.

Ratliff, R.G., & Ratliff, A.R. (1971). Runway acquisition and extinction as a joint function of magnitude of reward and percentage of rewarded acquisition trials. Learning and Motivation, 2, 289-295.

Rescorla, R.A. (1997). Response-inhibition in extinction. Quarterly Journal of Experimental Psychology. B. Comparative and Physiological Psychology, 50B, 238-252.

Reynolds, G.S. (1961). Attention in the pigeon. Journal of the Experimental Analysis of Behavior, 4, 57-71.

Roberts, W.A. (1969). Resistance to extinction following partial and consistent reinforcement with varying magnitudes of reward. Journal of Comparative and Physiological Psychology, 67, 395-400.

Seligman, M.E.P. (1970). On the generality of laws of learning. Psychological Review, 77, 406-418.

Seligman, M.E.P., & Maier, S.F. (1967). Failure to escape traumatic shock. Journal of Experimental Psychology, 74, 1-9.

Seward, J.P., & Levy, N. (1949). Sign learning as a factor in extinction. Journal of Experimental Psychology, 39, 660-668.

Sheffield, F.D. (1965). Relation between classical conditioning and instrumental learning. In W.F. Prokasy (Ed.), Classical conditioning: A symposium (302-322). NY: Appleton-Century-Crofts.

Shettleworth, S.J. (1975). Reinforcement and the organization of behavior in golden hamsters: Hunger, environment, and food reinforcement. Journal of Experimental Psychology: Animal Behavior Processes, 1, 56-87.

Skinner, B.F. (1938). The behavior of organisms: An experimental analysis. NY: Appleton-Century-Crofts.

Skinner, B.F. (1964). Behaviorism at fifty. In T.W. Wann (Ed.), Behaviorism and phenomenology: Contrasting bases for modern psychology (79-108). Chicago: U. Chicago Press.

Spence, K.W. (1947). The role of secondary reinforcement in delayed reward learning. Psychological Review, 54, 1-8.

Spielberger, C.D., & DeNike, L.D. (1966). Descriptive behaviorism versus cognitive theory in verbal operant conditioning. Psychological Review, 73, 306-326.

Staddon, J.E.R., & Simmelhag, V.L. (1971). The "superstition" experiment: A reexamination of its implications for the principles of adaptive behavior. Psychological Review, 78, 3-43.

The Rocky Horror Show. (1975). Original Australian Cast Album. Elephant Records. (Yep: It's better Rock & Roll than the later American album based on the movie version, in my opinion...)

Thomas, G. (1981). Contiguity, reinforcement rate, and the law of effect. Quarterly Journal of Experimental Psychology, 33B, 33-43.

Thomas, D.R., Mariner, R.W., & Sherry, G. (1969). Role of pre-experimental experience in the development of stimulus control. Journal of Experimental Psychology, 79, 375-376.

Thorndike, E.L. (1898). Animal intelligence: An experimental study of the associative processes in animals. Psychological Review Monograph Supplements 2, no. 8.

Thorndike, E.L. (1913). The psychology of learning. NY: Columbia University.

Tolman, E. C. (1932). Purposive behavior in animals and men. NY: Century.

Tolman, E.C., & Honzik, C.H. (1930). Introduction and removal of reward and maze performance in rats. University of California Publications in Psychology, 4, 257-275.

Trapold, M.A., & Fowler, H. (1960). Instrumental escape performance as a function of the intensity of noxious stimulation. Journal of Experimental Psychology, 60, 323-326.

Verplanck, W.S. (1955). The operant, from rat to man. Transactions of the New York Academy of Sciences, Series 11, 17 (8), 594-601.

Wagner, A.R., Logan, F.A., Haberlandt, K., & Price, T. (1968). Stimulus selection in animal discrimination learning. Journal of Experimental Psychology, 76, 171-180.

Wasserman, E.A., Elek, S.M., Chatlosh, D.C., & Baker, A.G. (1993). Rating causal relations: Role of probability in judgments of response-outcome contingency. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19, 174-188.

Watson, J.B. (1913). Psychology as the behaviorist views it. Psychological Review, 20, 158-177.

Watson, J.B. (1926). Excerpts from "What the nursery has to say about instincts." In C. Murchison (Ed.), Psychologies of 1925. NY: Clark U. Press.

Watson, J.B. (1930). Behaviorism. NY: Norton.

Watson, J.B., & Rayner, R. (1920). Conditioned emotional reactions. Journal of Experimental Psychology, 3, 1-14.

Watson, J.S. (1967). Memory and "contingency analysis" in infant learning. Merrill-Palmer Quarterly, 12, 55-76.

Some Relevant Internet Sites (but there are many more out there!):

Animal Cognition Page: (http://www.pigeon.psy.tufts.edu/psych26/history.htm)

(note particularly the links to Thorndike, Tolman, and Skinner's stuff. There's a graph of Thorndike's results with puzzle-box learning that you should look at.)

The Behaviorist Manifesto: (http://www.yorku.ca/dept/psych/classics/Watson/views.htm)

(Watson's classic paper)

Emotional Conditioning: (http://www.yorku.ca/dept/psych/classics/Watson/emotion.htm)

(The Watson & Rayner paper reporting on their study with Little Albert)

Cognitive Maps: (http://www.yorku.ca/dept/psych/classics/Tolman/Maps/maps.htm)

(A paper by Tolman reviewing some of the cognitive map studies)

(Note: The three papers above are from the Classics in the History of Psychology webpage; you have a link for it in Chapter 1)
