What the #$*%! is a Subsymbol?

USL P.O. Box 43770,

Lafayette

LA. 70504

USA

Abstract

In 1988, Smolensky proposed that connectionist processing systems should be understood as operating at what he termed the 'subsymbolic' level. Subsymbolic systems should be understood by comparing them to symbolic systems, in Smolensky's view. Up until recently, there have been real problems with analyzing and interpreting the operation of connectionist systems which have undergone training. However, recently published work on a network trained on a set of logic problems originally studied by Bechtel and Abrahamsen (1991) seems to offer the potential to provide a detailed, empirically based answer to questions about the nature of subsymbols. In this paper, I discuss the network analysis procedure and the results obtained using it. This provides the basis for a number of insights into the nature of subsymbols, which are perhaps surprising.

Introduction

In the mid-1980s, the world of cognitive science underwent a major change due to the introduction of a class of systems which have become known as 'connectionist networks'. Many extravagant claims were made about systems from this class, including claims to the effect that they represented a 'Kuhnian' paradigm shift in the study of cognition (see Schneider, 1987). A significant class of the claims made about connectionist systems focused upon the alleged differences between them and systems which were constructed in accordance with Newell and Simon's (1976) Physical Symbol System Hypothesis.

At first blush, there certainly do appear to be big differences between connectionist systems and traditional computational architectures, such as those of Turing and von Neumann machines. Whereas traditional computational architectures perform operations upon strings of symbols, connectionist networks compute using continuous functions which operate upon multi-dimensional vectors of activation. These differences inspired Smolensky (1988), in a well-known paper, to suggest that connectionist systems should be understood as operating at what he termed the 'subsymbolic' level, in contrast to traditional devices which operate at the 'symbolic' level. Smolensky's proposal naturally leads to the question, "What, exactly, is a subsymbol?" This is the question which will be addressed in this paper.

Smolensky on Subsymbols

Smolensky (1988) characterizes subsymbols as having both a semantic and a syntactic aspect. In Smolensky's view, subsymbols are constituents of traditional symbols. He (1988: p. 3) describes the semantic aspect of subsymbols as being,

…the activities of individual processing units in connectionist networks. Entities that are typically represented in the symbolic paradigm are typically represented in the subsymbolic paradigm by a large number of subsymbols.

He also contrasts the subsymbolic level with the symbolic level when he describes the syntactic role of subsymbols. According to Smolensky (1988: p. 3) subsymbols

…participate in numerical - not symbolic - computation. Operations in the symbolic paradigm that consist of a single discrete operation (e.g. a memory fetch) are often achieved in the subsymbolic paradigm as the result of a large number of much finer-grained (numerical) operations.

Given the fact that subsymbols are specified by Smolensky in relation to symbols, the crucial issue for determining the exact nature of subsymbols would consequently appear to be understanding the relationship between symbols and subsymbols. On this point too, Smolensky has a suggestion. He (1988: p. 3) remarks that,

…for the purpose of relating these two paradigms [the symbolic and the subsymbolic], it is often important to analyze connectionist models at a higher level; to amalgamate, so to speak, the subsymbols into symbols.

The correct means, then, of determining the exact nature of subsymbols would be to take a connectionist network which deals with an unambiguously symbolic problem and then perform the required analysis of the network. This, though, is much easier said than done.

The Methodological Problem of Network Analysis

One well known problem with connectionist networks which have undergone training is that they are extremely difficult to interpret and analyze. Robinson (1992: p. 655) has argued that "We may have to accept the inexplicable nature of mature networks". Mozer and Smolensky (1989: p. 3) make the same point somewhat colorfully when they note that,

...one thing that connectionist networks have in common with brains is that if you open them up and peer inside, all you can see is a big pile of goo.

This is, in fact, a serious problem for the connectionist research program in Cognitive Science as a whole. McCloskey (1991: p. 387), for example, has argued that if trained networks cannot be interpreted or analyzed in detail, then

...connectionist networks should not be viewed as theories of human cognitive functions, or as simulations of theories or even as demonstrations of specific theoretical points.

Although the general problem of analyzing the entire class of trained networks has yet to be solved, there have been a number of recent proposals which enable analysis to be performed on particular subclasses of learning networks (see Clark (1993) and Hanson and Burr (1990) for general discussions of the issues involved in, and techniques which have been applied to, trained network analysis). Some recently published work by Berkeley et al. (1995) is especially germane to the issue of the nature of subsymbols, inasmuch as the technique was applied to a network which was trained upon a set of logic problems originally studied by Bechtel and Abrahamsen (1991). Logic problems are paradigm cases of problems which are 'symbolic' (in the sense intended by Smolensky) in nature. Thus, a detailed consideration of the results of Berkeley et al.'s (1995) network analysis should serve to provide good insight into the nature of subsymbols. In addition, Berkeley et al.'s (1995) analytic technique served to provide detailed evidence about the role of individual processing units in the overall functional economy of (specific) network systems. Thus, there are reasonable grounds to believe that this analytic technique can provide sufficiently detailed information about the nature of network function to make the relationship between the symbolic and subsymbolic levels explicit.

I shall begin by briefly describing Berkeley et al.'s (1995) analytic methodology. Then, I will introduce Bechtel and Abrahamsen's (1991) logic problem set and the network Berkeley et al. (1995) trained upon it. This will set the scene appropriately for a discussion of the results.

Network Analysis

The basic recipe for analyzing networks using the method described by Berkeley et al. (1995) consists of five main steps. The first of these is fairly obvious. It is to,

1) Train a network on the problem at hand.

It is important to note, though, that the trained network should employ Dawson and Schopflocher's (1992) value units, rather than standard connectionist units. Value units have a Gaussian activation function, rather than the commonly used sigmoidal one. The reason that value units should be employed is that the analytic methodology is designed specifically to work with this kind of unit. However, it is also worth noting that there is some evidence that networks of value units have a number of advantages over networks of traditional processing units, under certain conditions. These advantages include faster learning and better generalization (see Dawson and Schopflocher, 1992).
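The contrast between the two kinds of unit can be sketched in a few lines of Python. This is only an illustration: the text above says that value units are Gaussian, but the particular form exp(-pi * (net - mu)^2) used here is an assumed parametrization, and the names `sigmoid_unit`, `value_unit` and `mu` are hypothetical.

```python
import math


def sigmoid_unit(net):
    """Standard logistic activation: rises monotonically with net input."""
    return 1.0 / (1.0 + math.exp(-net))


def value_unit(net, mu=0.0):
    """Gaussian activation: peaks at 1.0 when net == mu and falls off on
    both sides. The pi scaling is an assumed parametrization."""
    return math.exp(-math.pi * (net - mu) ** 2)


if __name__ == "__main__":
    for net in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"net={net:+.1f}  sigmoid={sigmoid_unit(net):.3f}  "
              f"value={value_unit(net):.3f}")
```

The point of the sketch is that a sigmoidal unit responds more strongly the larger its net input, whereas a Gaussian unit responds strongly only to a narrow range of net input, which is plausibly why its activations tend to cluster.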

The next two steps in the analytic method are,

2) Once the network has reached convergence, re-present all the patterns in the training set to the network, with the weights frozen and record the activation level of each hidden unit for each pattern.

3) Plot jittered density plots for each hidden unit, representing the level of activity for each pattern in the training set.

The second step can be thought of as being analogous to single cell recording techniques employed in studies of natural neural systems. The third step, though, requires a little further discussion, as jittered density plots may be unfamiliar to many. A jittered density plot has a horizontal scale which represents the range of possible activation states of a particular hidden unit. An example of a jittered density plot is illustrated in Figure 1.

Insert Figure 1 about here.

The actual level of activation which a particular hidden unit assumes for each input pattern in a training set can then be represented on this scale by a single dot. However, in order to prevent dots overlapping one another, it is necessary to add a random vertical 'jittering'. Thus, the vertical component of a jittered density plot is without significance, as it is purely heuristic. The purpose of producing such plots is to get an indication of the distribution of the levels of activity that the training set produces in particular hidden units. Step 4 of the analytic technique is,

4) Use the jittered density plot to identify bands of activation for each hidden unit.

It turns out that it is often the case (with value units) that the levels of activity of a particular hidden unit form distinct 'bands', when represented on a jittered density plot.
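Steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not Berkeley et al.'s (1995) procedure: bands are read off the plots, and the gap threshold `min_gap` used here is a hypothetical stand-in for that visual judgment. The function names and the sample activations are likewise invented for the example.

```python
import random


def jitter_points(activations, height=1.0, seed=0):
    """Return (x, y) pairs for a jittered density plot.
    x is the recorded activation; y is random and carries no meaning."""
    rng = random.Random(seed)
    return [(a, rng.uniform(0.0, height)) for a in activations]


def find_bands(activations, min_gap=0.05):
    """Group sorted activations into bands, starting a new band whenever
    the gap to the previous value exceeds min_gap (an assumed threshold)."""
    bands = []
    for a in sorted(activations):
        if bands and a - bands[-1][-1] <= min_gap:
            bands[-1].append(a)
        else:
            bands.append([a])
    return bands


if __name__ == "__main__":
    # Made-up activations for one hidden unit, one value per pattern.
    acts = [0.02, 0.83, 0.11, 0.03, 0.10, 0.82, 0.04, 0.12, 0.81]
    for label, band in zip("ABC", find_bands(acts)):
        print(label, band)
```

Only the horizontal coordinate is used to find bands, which reflects the point made above that the vertical jitter is there purely to keep the dots apart.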

The identification of the number and spread of each of the bands in a set of hidden unit density plots is arguably the most crucial step in the analytic method under discussion. Only once bands have been identified (assuming that they are present) can the analysis proceed to step 5, which is,

5) Identify the 'definite features', with respect to the input patterns for each band.

The identification of bands makes it possible to consider what input properties, if any, the patterns which fall into a particular band have in common. These common input properties are known as 'definite features'. It is possible to provide operational definitions for definite features, as follows:

(i) A unary definite feature (for a particular band) is defined as an input bit which has a constant value for all the patterns within the band, and

(ii) A binary definite feature is defined as a perfect negative or perfect positive correlation between pairs of binary (input) features, the former representing the fact that two bits were always opposite in value, the latter representing the fact that two bits were always equal in value.
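The two operational definitions translate fairly directly into code. The sketch below assumes that each pattern in a band is a tuple of 0/1 input bits. The function names are hypothetical, and the choice to skip constant bits when looking for correlated pairs is an assumption made here (a constant bit is already covered by definition (i)).

```python
from itertools import combinations


def unary_definite_features(patterns):
    """Map bit index -> constant value, for bits that never vary in the band."""
    first = patterns[0]
    return {i: first[i] for i in range(len(first))
            if all(p[i] == first[i] for p in patterns)}


def binary_definite_features(patterns):
    """Map (i, j) -> 'equal' or 'opposite' for pairs of varying bits that
    are perfectly correlated across every pattern in the band."""
    constant = unary_definite_features(patterns)
    varying = [i for i in range(len(patterns[0])) if i not in constant]
    found = {}
    for i, j in combinations(varying, 2):
        if all(p[i] == p[j] for p in patterns):
            found[(i, j)] = "equal"      # perfect positive correlation
        elif all(p[i] != p[j] for p in patterns):
            found[(i, j)] = "opposite"   # perfect negative correlation
    return found


if __name__ == "__main__":
    # A made-up band of four 4-bit patterns.
    band = [(1, 0, 0, 1), (1, 1, 1, 0), (1, 0, 0, 1), (1, 1, 1, 0)]
    print(unary_definite_features(band))
    print(binary_definite_features(band))
```

In the made-up band, bit 0 is a unary definite feature, bits 1 and 2 are always equal, and bit 3 is always the opposite of both.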

The identification of definite features is the final step in the analytic process. It provides a means by which the functional role of particular activation levels in particular hidden processing units can be discovered. It consequently permits the link between the symbolic and subsymbolic levels to be established. However, before discussing the results obtained by Berkeley et al. (1995) using this technique, the set of logic problems upon which their network was trained should be introduced.

Bechtel and Abrahamsen's Logic Problem

Bechtel and Abrahamsen (1991) describe a logic task in which a connectionist network was presented with simple logical inferences and asked to determine the validity and problem type of those inferences. The training set was made up of 576 patterns. The inferences were all of the kinds illustrated in Table I, and four possible variables or their negations filled each of the argument places.

Insert Table I about here.

For ease of description, a concise descriptive notation was developed by Berkeley et al. (1995). In this notation, S1(V1) stands for the first variable of the first premise of the argument, S1(V2) stands for the second variable of the first premise, S2 stands for the variable in the second premise and C stands for the conclusion variable. The term 'sign' is used in the notation to refer to the relationship between the negated or unnegated status of pairs of letters. Two variables are said to be of the same sign when they are both negated or both unnegated, and of different sign when one variable is negated and the other is not.

Berkeley et al. (1995) trained a network of value units on Bechtel and Abrahamsen's logic task, with the problems represented on fourteen input units and the network's responses represented by three output units. The network had ten hidden units and is illustrated in Figure 2.

Insert Figure 2 about here.
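To make the architecture and step 2 of the analytic method concrete, here is a rough sketch of a 14-10-3 network whose hidden activations are recorded with the weights held fixed. Only the layer sizes come from the text. The weights are random placeholders rather than the trained weights, biases are omitted, the Gaussian form is an assumed parametrization, and all function names are hypothetical.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 14, 10, 3  # layer sizes given in the text


def gaussian(net, mu=0.0):
    # Assumed value-unit activation (see the hedging in the lead-in).
    return math.exp(-math.pi * (net - mu) ** 2)


def make_layer(n_in, n_out, rng):
    # Placeholder random weights; a real analysis would load trained ones.
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def forward(pattern, w_hidden, w_output):
    """One forward pass; returns (hidden activations, output activations)."""
    hidden = [gaussian(sum(w * x for w, x in zip(row, pattern))) for row in w_hidden]
    output = [gaussian(sum(w * h for w, h in zip(row, hidden))) for row in w_output]
    return hidden, output


def record_hidden(patterns, w_hidden, w_output):
    """Step 2: present every pattern with frozen weights (nothing is updated
    here) and collect one list of activations per hidden unit."""
    per_unit = [[] for _ in w_hidden]
    for pattern in patterns:
        hidden, _ = forward(pattern, w_hidden, w_output)
        for unit, activation in enumerate(hidden):
            per_unit[unit].append(activation)
    return per_unit


if __name__ == "__main__":
    rng = random.Random(0)
    w_h = make_layer(N_INPUT, N_HIDDEN, rng)
    w_o = make_layer(N_HIDDEN, N_OUTPUT, rng)
    patterns = [[rng.randint(0, 1) for _ in range(N_INPUT)] for _ in range(5)]
    recorded = record_hidden(patterns, w_h, w_o)
    print(len(recorded), "hidden units,", len(recorded[0]), "recordings each")
```

Each inner list returned by `record_hidden` is the raw material for one jittered density plot.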

The jittered density plots of the hidden units of this network are illustrated below in Figure 3.

Insert Figure 3 about here.

It turned out that most of the bands in the network had definite features and, as such, could be interpreted. This provided evidence which revealed the role that particular hidden units played within the network when solving the logic problems. This is just the information which is required in order to determine the nature of the subsymbols within the system.

As a matter of fact, there is a great deal of information encoded within the hidden units of this network. For current purposes though, it will be sufficient to focus upon just a few of the hidden units. I shall begin by looking at hidden unit 8. This hidden unit had three distinct and clearly separate bands in its jittered density plot, as is illustrated in Figure 4.

Insert Figure 4 about here.

Moreover, each of the bands had a number of unary definite features associated with it. The definite features and their interpretation are illustrated in Table II.

Insert Table II about here.

The detailed analysis and interpretation of the bands of hidden unit 8 of the network makes it very clear that the function of this unit within the network was to act as a connective detector. The unit had just three subsymbolic states, each of which corresponded to one of the three possible connectives in the training set. This must then be what Smolensky had in mind when he talked of subsymbols.

As a matter of fact, hidden unit 8 was not particularly typical of the hidden units of Berkeley et al.'s (1995) network. For many of the units, it was far less straightforward to find labels which captured the role of the unit as a whole within the functional economy of the network. This was because bands often had complex sets of input properties associated with them and the properties differed quite markedly between individual bands. This is illustrated nicely by the interpretation Berkeley et al. (1995) discovered for hidden unit 3. The jittered density plot for this unit is illustrated in Figure 5.

Insert Figure 5 about here.

The details of the interpretation of the bands are presented in Table III (N.B. Space constraints do not permit providing the same level of detail as was possible in Table II, above).

Insert Table III about here.

A first point to note is that a large number of input patterns fall into the A band, which is assigned the interpretation "No definite features". Berkeley et al. (1995) found that bands without definite features commonly arose close to the zero level of unit activation in a number of the network's hidden units. They concluded that such bands contained all the input patterns which the other bands in the network were not sensitive to. Other unit bands though usually did have definite features associated with them. For example, the C band of this unit is sensitive to the relationship between the first variable in the first premise and the conclusion variable. It is also sensitive to the relationship between the second variable in the first premise and the variable in the second premise. The B band is sensitive to these input properties, as well as a number of other ones, such as the main connective of the first premise. Thus, the interpretation enables us to determine precisely the subsymbols developed by the network in order to solve the Bechtel and Abrahamsen (1991) problem set.

'Rules'

One of the more interesting things which Berkeley et al. (1995) discovered about their network was that it developed what might be termed 'rules of inference'. For example, they discovered that all valid Modus Ponens problems produced a single pattern of bands [0-A, 1-A, 2-B, 3-A, 4-A, 5-A, 6-A, 7-B, 8-B, 9-A] within the hidden units of the network. That is to say, every valid Modus Ponens problem produced activity of hidden unit 0 such that it fell into band A, produced activity of hidden unit 1 such that it fell into band A, and so on. This was of interest, as it showed that the network had found a means of using the subsymbolic properties represented by each of the bands as a way of making generalizations about the class of problems that it had to solve. Perhaps more intriguing though was the fact that, when the properties associated with these generalizations were studied, it turned out that the network had become sensitive to almost exactly the same set of properties as would be predicted from studying traditional inference rules. This point can be nicely illustrated by comparing the traditional rule for Modus Ponens with the network's 'rule', as presented in Table IV.
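The idea of a single pattern of bands per problem type can be sketched as a signature computed from the recorded hidden activations. The sketch assumes that each unit's bands are delimited by ascending activation cut-offs and are labelled A, B, C from the lowest activation upward. The cut-offs, activations and function names below are made up for illustration and are not values from Berkeley et al. (1995).

```python
def band_label(activation, boundaries):
    """Return 'A', 'B', ... given ascending cut-offs between adjacent bands."""
    index = sum(activation > edge for edge in boundaries)
    return chr(ord("A") + index)


def signature(hidden_activations, boundaries_per_unit):
    """Band pattern for one input pattern, e.g. ('0-A', '1-B', '2-B')."""
    return tuple(
        f"{unit}-{band_label(a, edges)}"
        for unit, (a, edges) in enumerate(zip(hidden_activations, boundaries_per_unit))
    )


def shared_signature(rows, boundaries_per_unit):
    """Return the one signature shared by every row, or None if they differ."""
    signatures = {signature(row, boundaries_per_unit) for row in rows}
    return signatures.pop() if len(signatures) == 1 else None


if __name__ == "__main__":
    # Hypothetical three-unit example: cut-offs per unit, then two patterns.
    cutoffs = [[0.5], [0.05, 0.5], [0.5]]
    same_type = [[0.01, 0.11, 0.90], [0.02, 0.12, 0.95]]
    print(shared_signature(same_type, cutoffs))
```

If every pattern of a given problem type yields the same signature, that signature plays the role of the band pattern quoted above for valid Modus Ponens.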

Insert Table IV about here.

The only substantial difference between the two rules (once the traditional rule is re-expressed in Berkeley et al.'s descriptive notation) is that the network is not sensitive to whether the sign (i.e. negated or unnegated status) of the consequent letter is the same as the sign of the conclusion. The reason that the network did not have to be sensitive to this property was the restricted nature of the training set. Apart from this one minor difference though, the network's rule and the traditional rule are close to identical to one another!
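The comparison can be made explicit by writing both rules as predicates over the descriptive notation. This is an illustrative encoding only: the `Term` and `Argument` classes and the function names are hypothetical, and the probe argument at the end is constructed for the example, with no claim about whether such an item appears in the training set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    letter: str
    negated: bool = False


@dataclass(frozen=True)
class Argument:
    connective: str  # e.g. "IF...THEN"
    s1_v1: Term      # S1(V1)
    s1_v2: Term      # S1(V2)
    s2: Term         # S2
    c: Term          # C


def same_sign(a, b):
    return a.negated == b.negated


def traditional_mp(arg):
    """Traditional Modus Ponens rule in the descriptive notation."""
    return (arg.connective == "IF...THEN"
            and arg.s1_v1.letter == arg.s2.letter
            and arg.s1_v2.letter == arg.c.letter
            and same_sign(arg.s1_v1, arg.s2)
            and same_sign(arg.s1_v2, arg.c))


def network_mp(arg):
    """The network's 'rule': identical, minus the SIGN S1(V2) = SIGN C check."""
    return (arg.connective == "IF...THEN"
            and arg.s1_v1.letter == arg.s2.letter
            and arg.s1_v2.letter == arg.c.letter
            and same_sign(arg.s1_v1, arg.s2))


if __name__ == "__main__":
    valid = Argument("IF...THEN", Term("A"), Term("B"), Term("A"), Term("B"))
    probe = Argument("IF...THEN", Term("A"), Term("B"), Term("A"), Term("B", negated=True))
    print(traditional_mp(valid), network_mp(valid))
    print(traditional_mp(probe), network_mp(probe))
```

The two predicates agree on the valid example and come apart only on the probe, where the conclusion's sign differs from that of the consequent.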

One reason this is an interesting result with respect to subsymbols is that it shows that a link between the symbolic level and the subsymbolic level can genuinely be found. More importantly though, it also shows that subsymbols can serve as a basis for interesting and useful generalizations about network performance. From what Smolensky (1988) claims about subsymbols, this is just the kind of thing one would hope to find.

Conclusion

Now that an empirical 'look' at subsymbols has been achieved, a comparison between symbols and subsymbols is in order. Unfortunately, this presents something of a problem because there is some disagreement in the literature over the nature of symbols. For example, Vera and Simon (1993: p. 9) remark that "We call patterns symbols when they can designate or denote." However, Touretzky and Pomerleau (1993) take issue with this view. Whilst Touretzky and Pomerleau (1993) embrace the requirement that symbols must designate or denote (i.e. be 'representations' in their preferred terminology), they also add the condition that representations must have "combinatorial power" in order to count as true symbols.

It is clear though from the evidence discussed above that the individual activations of hidden units, as represented by the individual dots in the jittered density plots, are subsymbols in Smolensky's (1988) sense of the term. Moreover, these dots also fairly unambiguously satisfy Vera and Simon's (1993) condition of symbolhood. Individual dots in the jittered density plot of hidden unit 8 (the unit described earlier as a 'connective detector') 'designate or denote' particular main connectives in the problem set. This being the case, subsymbols fall within the class of 'representations', as Touretzky and Pomerleau (1993) use the term, but fail to satisfy Touretzky and Pomerleau's further condition for symbolhood.

This is not the appropriate place to attempt to adjudicate between the alternative definitions of 'symbol' proposed by Vera and Simon (1993) and by Touretzky and Pomerleau (1993). However, if a broader definition of 'symbol', closer to Vera and Simon's (1993) conception, is ultimately judged to be the most appropriate, then the evidence discussed above suggests that the difference between symbols and subsymbols may not be as great as has previously been supposed. Hopefully, future analyses of trained connectionist networks will provide evidence which clarifies the issues in the debate over the appropriate definition of symbols. In the meantime, it appears that the Berkeley et al. (1995) logic network analysis provides suggestive evidence for a prima facie case that connectionist networks are, in fact, symbolic systems.

Bibliography

Bechtel, W. and Abrahamsen, A. (1991) Connectionism and the Mind, Basil Blackwell (Cambridge, Mass.).

Berkeley, I., Dawson, M., Medler, D., Schopflocher, D., and Hornsby, L. (1995), "Density Plots of Hidden Unit Activations Reveal Interpretable Bands", in Connection Science, Vol. 7, No. 2, pp. 167-186.

Clark, A. (1993), Associative Engines: Connectionism, Concepts, and Representational Change, MIT Press (Cambridge, Mass.).

Dawson, M. and Schopflocher, D. (1992), "Modifying the Generalized Delta Rule to Train Networks of Non-monotonic Processors for Pattern Classification", in Connection Science, 4/1, pp. 19-31.

Hanson, S. and Burr, D. (1990), "What connectionist models learn: Learning and representation in connectionist networks", in Behavioral and Brain Sciences, 13, pp. 471-518.

McCloskey, M. (1991), "Networks and Theories: The Place of Connectionism in Cognitive Science" in Psychological Science, Vol. 2, No. 6, pp. 387-395.

Mozer, M. and Smolensky, P. (1989), "Using Relevance to Reduce Network Size Automatically" in Connection Science, 1, pp. 3-16.

Newell, A. and Simon, H. (1976), "Computer science as empirical inquiry: Symbols and search", in Communications of the ACM, 19/3, pp. 113-126.

Robinson, D. (1992), "Implications of Neural Networks for How We Think about Brain Function", in Behavioral and Brain Sciences, 15, pp. 644-655.

Schneider, W. (1987), "Connectionism: Is it a paradigm shift for psychology?" in Behavior Research Methods, Instruments, & Computers, 19/2, pp. 73-83.

Smolensky, P. (1988), "On the Proper Treatment of Connectionism", in Behavioral and Brain Sciences, 11, pp. 1-74.

Van Gelder, T. (1990), "Compositionality: A Connectionist Variation on a Classical Theme", in Cognitive Science, 14, pp. 355-384.

Problem Type	Problem Example	Descriptive Notation
Modus Ponens (MP)	If A Then B; A; Therefore B	Connective: If...Then...; S1(V1): A; S1(V2): B; S2: A; C: B
Modus Tollens (MT)	If A Then C; Not C; Therefore Not A	Connective: If...Then...; S1(V1): A; S1(V2): C; S2: C; S2 is negated; C: A; C is negated
Alternative Syllogism (AS) Type 1	D Or A; Not D; Therefore A	Connective: ...Or...; S1(V1): D; S1(V2): A; S2: D; S2 is negated; C: A
Alternative Syllogism (AS) Type 2	B Or C; Not C; Therefore B	S1(V1): B; S1(V2): C; S2: C; S2 is negated; C: B
Disjunctive Syllogism (DS) Type 1	Not Both C and D; C; Therefore Not D	S1(V1): C; S1(V2): D; S2: C; C: D; C is negated
Disjunctive Syllogism (DS) Type 2	Not Both A and D; D; Therefore Not A	S1(V1): A; S1(V2): D; S2: D; C: A; C is negated

Table I

Examples of valid inferences from Bechtel and Abrahamsen's (1991) logic problem set. Notation: S1(V1) - The first argument place (V1) of the first premise (S1) of an argument. S1(V2) - The second argument place (V2) of the first premise (S1) of an argument. S2 - The second premise of an argument. C - The conclusion of an argument.

Band	Number of Patterns	Mean Activation	Input Units	Value for all patterns	Interpretation of Definite Features
A	192	0.03	3 & 4	0 1	The connective is OR
B	192	0.11	3 & 4	1 1	The connective is IF...THEN
C	192	0.82	3 & 4	1 0	The connective is NOT BOTH...AND

Table II

The definite features and interpretation of the bands of hidden unit 8 of the logic network.

Band	Number Of Patterns	Median Activity	Interpretation Of Definite Features
A	456	0.00	No definite features
B	12	0.81	S1(V1) is negated; S1(V1) is the same letter as C; S2 and C are not negated; S1(V2) is the same letter as S2; The connective is OR
C	86	0.99	S1(V1) is the same letter as C; S1(V2) is the same letter as S2

Table III

The interpretation of the bands of hidden unit 3 of the logic network.

Problem Type	Formal Definition Of Rule For Valid Problem Type	Network 'Rule' Identified By The Interpretation
Valid Modus Ponens (MP)	S1(V1) = S2; S1(V2) = C; SIGN S1(V1) = SIGN S2; SIGN S1(V2) = SIGN C; CONNECTIVE: IF...THEN	S1(V1) = S2; S1(V2) = C; SIGN S1(V1) = SIGN S2; CONNECTIVE: IF...THEN

Table IV

The logic network's 'rule' for Modus Ponens, as compared to the traditional rule.

Figure 1

An illustrative example of a jittered density plot.

Figure 2

The pattern of interconnections between processing units and the input and output encoding scheme used with the network trained by Berkeley et al. (1995).

Figure 3

The bands recovered for the ten hidden units of the network trained upon the logic problem by Berkeley et al. (1995).

Figure 4

The jittered density plot for hidden unit 8 of the logic network.

Figure 5

The jittered density plot for hidden unit 3 of the logic network.