In 1988, Smolensky proposed that connectionist processing systems
should be understood as operating at what he termed the 'subsymbolic'
level. In Smolensky's view, subsymbolic systems are best understood
by comparing them to symbolic systems. Until recently, there
have been real problems with analyzing and interpreting the operation
of connectionist systems which have undergone training. However,
recently published work on a network trained on a set of logic
problems originally studied by Bechtel and Abrahamsen (1991) seems
to offer the potential to provide a detailed, empirically based
answer to questions about the nature of subsymbols. In this paper,
I discuss the network analysis procedure and the results obtained
using it. This provides the basis for a number of insights into
the nature of subsymbols, which are perhaps surprising.
Introduction
In the mid-1980s, the world of cognitive science underwent
a major change due to the introduction of a class of systems which
have become known as 'connectionist networks'. Many extravagant
claims were made about systems from this class, including claims
to the effect that they represented a 'Kuhnian' paradigm shift
in the study of cognition (see Schneider, 1987). A significant
class of the claims made about connectionist systems focused upon
the alleged differences between them and systems which were constructed
in accordance with Newell and Simon's (1976) Physical Symbol System
Hypothesis.
At first blush, there certainly do appear to be big differences between
connectionist systems and traditional computational architectures,
such as those of Turing and von Neumann machines. Whereas traditional
computational architectures perform operations upon strings of
symbols, connectionist networks compute using continuous functions
which operate upon multidimensional vectors of activation. These
differences inspired Smolensky (1988), in a well-known paper,
to suggest that connectionist systems should be understood as
operating at what he termed the 'subsymbolic' level, in contrast
to traditional devices which operate at the 'symbolic' level.
Smolensky's proposal naturally leads to the question of "What,
exactly, is a subsymbol?" This is the question which
will be addressed in this paper.
Smolensky on Subsymbols
Smolensky (1988) characterizes subsymbols as having both a semantic
and a syntactic aspect. In Smolensky's view, subsymbols are constituents
of traditional symbols. He (1988: p. 3) describes the semantic
aspect of subsymbols as being,
the activities of individual processing units in connectionist
networks. Entities that are typically represented in the symbolic
paradigm are typically represented in the subsymbolic paradigm
by a large number of subsymbols.
He also contrasts the subsymbolic level with the symbolic level
when he describes the syntactic role of subsymbols. According
to Smolensky (1988: p. 3) subsymbols
participate in numerical - not symbolic - computation. Operations
in the symbolic paradigm that consist of a single discrete operation
(e.g. a memory fetch) are often achieved in the subsymbolic paradigm
as the result of a large number of much finer-grained (numerical)
operations.
Given the fact that subsymbols are specified by Smolensky in relation
to symbols, the crucial issue for determining the exact nature
of subsymbols would consequently appear to be to understand
the relationship between symbols and subsymbols. On this point
too, Smolensky has a suggestion. He (1988: p. 3) remarks that,
for the purpose of relating these two paradigms [the symbolic
and the subsymbolic], it is often important to analyze connectionist
models at a higher level; to amalgamate, so to speak, the subsymbols
into symbols.
The correct means, then, of determining the exact nature of subsymbols
would be to take a connectionist network which deals with an unambiguously
symbolic problem and then perform the required analysis of the
network. This, though, is much easier said than done.
The Methodological Problem of Network Analysis
One well known problem with connectionist networks which have
undergone training is that they are extremely difficult to interpret
and analyze. Robinson (1992: p. 655) has argued that "We
may have to accept the inexplicable nature of mature networks".
Mozer and Smolensky (1989: p. 3) make the same point somewhat
colorfully when they note that,
...one thing that connectionist networks have in common with brains
is that if you open them up and peer inside, all you can see is
a big pile of goo.
This is, in fact, a serious problem for the connectionist research
program in Cognitive Science as a whole. McCloskey (1991: p. 387),
for example, has argued that if trained networks cannot be interpreted
or analyzed in detail, then
...connectionist networks should not be viewed as theories of
human cognitive functions, or as simulations of theories or even
as demonstrations of specific theoretical points.
Although the general problem of analyzing the entire class of trained
networks has yet to be solved, there have been a number of recent
proposals which enable analysis to be performed on particular
subclasses of learning networks (see Clark
(1993) and Hanson and Burr (1990) for general discussions of the
issues involved in, and techniques which have been applied to,
trained network analysis). Some recently published work by Berkeley
et al. (1995) is especially germane to the issue of the
nature of subsymbols, inasmuch as the technique was applied
to a network which was trained upon a set of logic problems, originally
studied by Bechtel and Abrahamsen (1991). Logic problems are paradigm
cases of problems which are 'symbolic' (in the sense intended
by Smolensky) in nature. Thus, a detailed consideration of the
results of Berkeley et al.'s (1995) network analysis should
serve to provide good insight into the nature of subsymbols. In
addition, Berkeley et al.'s (1995) analytic technique served
to provide detailed evidence about the role of individual processing
units in the overall functional economy of (specific) network
systems. Thus, there are reasonable grounds to believe that this
analytic technique can provide sufficiently detailed information
about the nature of network function to make the relationship
between the symbolic and subsymbolic levels explicit.
I shall begin by briefly describing Berkeley et al.'s (1995)
analytic methodology. Then, I will introduce Bechtel and Abrahamsen's
(1991) logic problem set and the network Berkeley et al.
(1995) trained upon it. This will set the scene appropriately
for a discussion of the results.
Network Analysis
The basic recipe for analyzing networks using the method described
by Berkeley et al. (1995) consists of five main steps. The first
of these is fairly obvious. It is to,
1) Train a network on the problem at hand.
It is important to note though that the trained network should
employ Dawson and Schopflocher's (1992) value units, rather than
standard connectionist units. Value units have a Gaussian activation
function, rather than the commonly used sigmoidal one. The reason
that value units should be employed is that the analytic methodology
is designed specifically to work with this kind of unit. However,
it is also worth noting that there is some evidence that networks
of value units have a number of advantages over traditional processing
units, under certain conditions. These advantages include having
faster learning and better generalization (see Dawson and Schopflocher,
1992).
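The contrast between the two activation functions can be sketched as follows. This is only an illustration: the text above says that value units are Gaussian, but the exact form used here, exp(-pi * (net - mu)^2), and the names `value_unit`, `sigmoid` and `mu` are assumptions made for the example.

```python
import math

def sigmoid(net: float) -> float:
    """Standard logistic activation: rises monotonically with net input."""
    return 1.0 / (1.0 + math.exp(-net))

def value_unit(net: float, mu: float = 0.0) -> float:
    """Gaussian activation (assumed form): peaks at 1.0 when net == mu
    and falls off symmetrically on either side."""
    return math.exp(-math.pi * (net - mu) ** 2)

# Compare the two functions over a few net input values.
for net in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"net={net:+.1f}  sigmoid={sigmoid(net):.3f}  value_unit={value_unit(net):.3f}")
```

The point of the comparison is that a sigmoid unit is strongly active for all sufficiently large net inputs, whereas a Gaussian unit is strongly active only for a narrow range of net inputs around `mu`.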
The next two steps in the analytic method are,
2) Once the network has reached convergence, re-present all the
patterns in the training set to the network, with the weights
frozen and record the activation level of each hidden unit for
each pattern.
3) Plot jittered density plots for each hidden unit, representing
the level of activity for each pattern in the training set.
The second step can be thought of as analogous to single cell
recording techniques, employed in studies of natural neural systems.
The third step though requires a little further discussion, as
jittered density plots may be unfamiliar to many. A jittered density
plot has a horizontal scale which represents the range of possible
activation states of a particular hidden unit. An example of a
jittered density plot is illustrated in Figure 1.
The actual level of activation which a particular hidden unit
assumes for each input pattern in a training set can then be represented
on this scale by a single dot. However, in order to prevent dots
overlapping one another, it is necessary to add a random vertical
'jittering'. Thus, the vertical component of a jittered density
plot is without significance, as it is purely heuristic. The
purpose of producing such plots is to get an indication of the
distribution of the levels of activity that the training set produces
in particular hidden units. Step 4 of the analytic technique is,
4) Use the jittered density plot to identify bands of activation
for each hidden unit.
It turns out that it is often the case (with value units) that
the levels of activity of a particular hidden unit form distinct
'bands', when represented on a jittered density plot.
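Steps 3 and 4 might be sketched roughly as follows. The jitter is simply a random vertical coordinate attached to each activation value. The text does not say how the edges of a band are decided beyond inspecting the plot, so the gap threshold `min_gap` used here is an assumption introduced for the example, as are the function names and the toy activation values.

```python
import random

def jittered_points(activations, seed=0):
    """Pair each activation (x) with a random vertical offset (y).
    The y value carries no information; it only keeps dots from overlapping."""
    rng = random.Random(seed)
    return [(a, rng.random()) for a in activations]

def find_bands(activations, min_gap=0.05):
    """Group sorted activations into bands, starting a new band whenever
    the gap to the previous value exceeds min_gap (an assumed criterion).
    Returns (lowest, highest, count) for each band."""
    bands = []
    for a in sorted(activations):
        if bands and a - bands[-1][-1] <= min_gap:
            bands[-1].append(a)
        else:
            bands.append([a])
    return [(b[0], b[-1], len(b)) for b in bands]

# Hypothetical activations of one hidden unit over seven training patterns.
example = [0.0, 0.01, 0.02, 0.5, 0.51, 0.98, 1.0]
print(jittered_points(example)[:2])
print(find_bands(example))
```

The points returned by `jittered_points` could be handed to any scatter-plot routine; only the horizontal coordinate matters for identifying bands.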
The identification of the number and spread of each of the bands
in a set of hidden unit density plots is arguably the most crucial
step in the analytic method under discussion. Only once bands
have been identified (assuming that they are present) can the
analysis proceed to step 5, which is,
5) Identify the 'definite features', with respect to the input
patterns for each band.
The identification of bands makes it possible to consider what
input properties, if any, each of the patterns which fall into
a particular band have in common. These common input properties
are known as 'definite features'. It is possible to provide operational
definitions for definite features, as follows:
(i) A unary definite feature (for a particular band) is
defined as an input bit which has a constant value for all the
patterns within the band, and
(ii) A binary definite feature is defined as a perfect
negative or perfect positive correlation between pairs of binary
(input) features, the former representing the fact that two bits
were always opposite in value, the latter representing the fact
that two bits were always equal in value.
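The two operational definitions translate fairly directly into code. The sketch below assumes that each input pattern in a band is given as a tuple of 0/1 bits. The function names are hypothetical, and the decision to skip constant bits when looking for binary features is an assumption of the example (a correlation is not defined for a bit that never varies).

```python
from itertools import combinations

def unary_definite_features(patterns):
    """Return {bit_index: value} for every input bit that has the same
    value in all patterns of the band."""
    first = patterns[0]
    return {i: first[i] for i in range(len(first))
            if all(p[i] == first[i] for p in patterns)}

def binary_definite_features(patterns):
    """Return {(i, j): 'equal' | 'opposite'} for every pair of non-constant
    bits that are always equal, or always opposite, within the band."""
    constant = unary_definite_features(patterns)
    features = {}
    for i, j in combinations(range(len(patterns[0])), 2):
        if i in constant or j in constant:
            continue  # assumption: constant bits are reported as unary only
        if all(p[i] == p[j] for p in patterns):
            features[(i, j)] = "equal"
        elif all(p[i] != p[j] for p in patterns):
            features[(i, j)] = "opposite"
    return features

# Hypothetical band of three four-bit patterns.
band = [(1, 0, 1, 0), (1, 1, 0, 0), (1, 0, 1, 0)]
print(unary_definite_features(band))
print(binary_definite_features(band))
```

In this toy band, bit 0 is always 1 and bit 3 is always 0 (unary features), while bits 1 and 2 are always opposite in value (a binary feature).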
The identification of definite features is the final step in the
analytic process. It provides a means by which the functional
role of particular activation levels in particular hidden processing
units can be discovered. It consequently permits the link between
the symbolic and subsymbolic levels to be established. However,
before discussing the results obtained by Berkeley et al.
(1995), using this technique, the set of logic problems, upon
which their network was trained should be introduced.
Bechtel and Abrahamsen's Logic Problem
Bechtel and Abrahamsen (1991) describe a logic task in which
a connectionist network was presented with simple logical inferences
and asked to determine the validity and problem type of those
inferences. The training set was made up of 576 patterns. The
inferences were all of the kinds illustrated in Table I, and four
possible variables or their negations filled each of the argument
places.
For ease of description, a concise descriptive notation was developed
by Berkeley et al. (1995). In this notation S1(V1) stands
for the first variable of the first sentence of the argument,
S1(V2) stands for the second variable of the first premise, S2
stands for the variable in the second sentence and C stands for
the conclusion variable. The term 'sign' is used in the notation
to refer to the relationship between the negated or unnegated
status of pairs of letters. Two variables are said to be of the
same sign when they are both either negated or unnegated and said
to be of different sign when one variable is negated when the
other is not.
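The notation can be made concrete with a small data structure. This is a sketch only: the class and field names are hypothetical, and the example problem is the argument "If A Then C; Not C; Therefore Not A" from the table of inference kinds.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Variable:
    letter: str
    negated: bool = False

@dataclass(frozen=True)
class Problem:
    connective: str   # e.g. "IF...THEN", "OR", "NOT BOTH...AND"
    s1_v1: Variable   # S1(V1): first variable of the first sentence
    s1_v2: Variable   # S1(V2): second variable of the first sentence
    s2: Variable      # S2: variable in the second sentence
    c: Variable       # C: conclusion variable

def same_sign(x: Variable, y: Variable) -> bool:
    """Two variables are of the same sign when both are negated
    or both are unnegated."""
    return x.negated == y.negated

# If A Then C; Not C; Therefore Not A
problem = Problem(
    connective="IF...THEN",
    s1_v1=Variable("A"),
    s1_v2=Variable("C"),
    s2=Variable("C", negated=True),
    c=Variable("A", negated=True),
)
print(same_sign(problem.s2, problem.c))      # both negated
print(same_sign(problem.s1_v1, problem.c))   # one negated, one not
```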
Berkeley et al. (1995) trained a network of value units on Bechtel
and Abrahamsen's logic task, with the problems being represented
on fourteen input units and the network's responses being
represented by three output units. The network had ten hidden
units and is illustrated in Figure 2.
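The shape of such a network, and the recording of hidden unit activations with the weights frozen, can be sketched as below. Only the layer sizes (fourteen, ten and three) come from the description above. The random weights stand in for trained ones, and the Gaussian form, the per-unit `mu` parameter and all function names are assumptions made for the example.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 14, 10, 3

def value_unit(net, mu=0.0):
    """Gaussian activation (assumed form), maximal when net == mu."""
    return math.exp(-math.pi * (net - mu) ** 2)

def make_layer(n_in, n_out, rng):
    """Random weights and mu values, standing in for trained parameters."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    mus = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weights, mus

def layer_forward(inputs, layer):
    """Activation of every unit in a layer for one input vector."""
    weights, mus = layer
    return [value_unit(sum(w * x for w, x in zip(row, inputs)), mu)
            for row, mu in zip(weights, mus)]

def forward(pattern, hidden_layer, output_layer):
    """Return (hidden activations, output activations) for one pattern.
    The hidden activations are what the analysis records."""
    hidden = layer_forward(pattern, hidden_layer)
    return hidden, layer_forward(hidden, output_layer)

rng = random.Random(0)
hidden_layer = make_layer(N_INPUT, N_HIDDEN, rng)
output_layer = make_layer(N_HIDDEN, N_OUTPUT, rng)
pattern = [rng.randint(0, 1) for _ in range(N_INPUT)]  # hypothetical input
hidden, output = forward(pattern, hidden_layer, output_layer)
print(len(hidden), len(output))
```

Running `forward` over every pattern in a training set and collecting the `hidden` lists, one column per hidden unit, would give the data from which the density plots are drawn.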
The jittered density plots of the hidden units of this network are illustrated below in Figure 3.
It turned out that most of the bands in the network had definite
features and, as such, could be interpreted. This provided evidence
which revealed the role that particular hidden units played within
the network when solving the logic problems. This is just the
information which is required in order to determine the nature
of the subsymbols within the system.
As a matter of fact, there is a great deal of information encoded
within the hidden units of this network. For current purposes
though, it will be sufficient to focus in upon just a few of the
hidden units. I shall begin by looking at hidden unit 8. This hidden
unit had three distinct and clearly separate bands in its jittered
density plot, as is illustrated in Figure 4.
Moreover, each of the bands had a number of unary definite features
associated with it. The definite features and their interpretation
are illustrated in Table II.
The detailed analysis and interpretation of the bands of hidden
unit 8 of the network makes it very clear that the function of
this unit within the network was to act as a connective detector.
The unit had just three subsymbolic states each of which corresponded
to one of the three possible connectives in the training set.
This must then be what Smolensky had in mind when he talked of
subsymbols.
As a matter of fact, hidden unit 8 was not particularly typical
of the hidden units of Berkeley et al.'s (1995) network.
For many of the units, it was far less straightforward to find
labels which captured the role of the unit as a whole within the
functional economy of the network. This was because bands often
had complex sets of input properties associated with them and
the properties differed quite markedly between individual bands.
This is illustrated nicely by the interpretation Berkeley et
al. (1995) discovered for hidden unit 3. The jittered density
plot for this unit is illustrated in Figure 5.
The details of the interpretation of the bands are presented in
Table III (N.B. Space constraints do not permit providing the
same level of detail as was possible in Table II, above).
A first point to note is that a large number of input patterns
fall into the A band, which is assigned the interpretation "No
definite features". Berkeley et al. (1995) found bands
without definite features commonly arose close to the zero level
of unit activation in a number of the network's hidden units. They
concluded that such bands contained all the input patterns which
the other bands in the network were not sensitive to. Other unit
bands though usually did have definite features associated with
them. For example, the C band of this unit is sensitive to the
relationship between the first variable in the first premise
and the conclusion variable. It is also sensitive to the relationship
between the second variable in the first premise and the variable
in the second premise. The B band is sensitive to these input
properties, as well as a number of other ones, such as the main
connective of the first premise. Thus, the interpretation enables
us to determine precisely the subsymbols developed by the network
in order to solve the Bechtel and Abrahamsen (1991) problem set.
'Rules'
One of the more interesting things which Berkeley et al.
(1995) discovered about their network was that it developed what
might be termed 'rules of inference'. For example, they discovered
that all valid Modus Ponens problems produced a single pattern
of bands [0-A, 1-A, 2-B, 3-A, 4-A, 5-A, 6-A, 7-B, 8-B, 9-A] within
the hidden units of the network. That is to say, every valid Modus
Ponens problem produced activity of hidden unit 0 such that it
fell into band A, produced activity of hidden unit 1 such that
it fell into band A, and so on. This was of interest, as it showed
that the network had found a means of using the subsymbolic properties
represented by each of the bands as a way of making generalizations
about the class of problems that it had to solve. Perhaps more
intriguing though was the fact that when the properties associated
with these generalizations were studied, it turned out that the
network had become sensitive to almost exactly the same set of
properties as would be predicted from studying traditional inference
rules. This point can be nicely illustrated by comparing the traditional
rule for Modus Ponens with the network's 'rule', as presented
in Table IV.
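The idea that a class of problems maps onto a single pattern of bands can be illustrated with a short sketch. The Modus Ponens pattern is the one listed above. Everything else, including the function names, the "other" class and its band patterns, is hypothetical toy data introduced for the example.

```python
from collections import defaultdict

# Band pattern reported for valid Modus Ponens problems (hidden units 0 to 9).
MP_PATTERN = ("A", "A", "B", "A", "A", "A", "A", "B", "B", "A")

def patterns_by_class(records):
    """Collect the distinct band patterns produced by each problem class.
    records is an iterable of (problem_class, band_pattern) pairs."""
    grouped = defaultdict(set)
    for problem_class, band_pattern in records:
        grouped[problem_class].add(tuple(band_pattern))
    return dict(grouped)

def classes_with_single_pattern(records):
    """Return the classes whose members all produce the same band pattern."""
    return {cls: next(iter(pats))
            for cls, pats in patterns_by_class(records).items()
            if len(pats) == 1}

# Toy records: two valid Modus Ponens problems and two made-up others.
records = [
    ("valid Modus Ponens", MP_PATTERN),
    ("valid Modus Ponens", MP_PATTERN),
    ("other", ("A", "B", "A", "A", "A", "A", "A", "A", "C", "A")),
    ("other", ("B", "A", "A", "A", "A", "A", "A", "A", "A", "A")),
]
print(classes_with_single_pattern(records))
```

A class that appears in the result is one for which the pattern of bands behaves like a 'rule' in the sense discussed above.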
The only substantial difference between the two rules (once the
traditional rule is re-expressed in Berkeley et al.'s descriptive
notation) is that the network is not sensitive to whether the
sign (i.e. negated or unnegated status) of the consequent letter
is the same as the sign of the conclusion. The reason that the
network did not have to be sensitive to this property was because
of the restricted nature of the training set. Apart from this
one minor difference though, the network and the traditional rules
are close to identical to one another!
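The difference between the two rules can be made explicit by writing each as a predicate over the conditions listed in Table IV. The types and function names below are hypothetical, and the second example problem is invented purely to expose the one condition on which the rules differ. No claim is made about whether such a problem occurs in the training set.

```python
from typing import NamedTuple

class Variable(NamedTuple):
    letter: str
    negated: bool = False

class Problem(NamedTuple):
    connective: str
    s1_v1: Variable
    s1_v2: Variable
    s2: Variable
    c: Variable

def traditional_rule(p: Problem) -> bool:
    """All five conditions of the traditional Modus Ponens rule."""
    return (p.connective == "IF...THEN"
            and p.s1_v1.letter == p.s2.letter
            and p.s1_v2.letter == p.c.letter
            and p.s1_v1.negated == p.s2.negated
            and p.s1_v2.negated == p.c.negated)

def network_rule(p: Problem) -> bool:
    """The network's 'rule': the same conditions, except that the sign of
    S1(V2) is not compared with the sign of C."""
    return (p.connective == "IF...THEN"
            and p.s1_v1.letter == p.s2.letter
            and p.s1_v2.letter == p.c.letter
            and p.s1_v1.negated == p.s2.negated)

# If A Then B; A; Therefore B
valid_mp = Problem("IF...THEN", Variable("A"), Variable("B"),
                   Variable("A"), Variable("B"))
# If A Then B; A; Therefore Not B (hypothetical problem)
sign_mismatch = Problem("IF...THEN", Variable("A"), Variable("B"),
                        Variable("A"), Variable("B", negated=True))

for p in (valid_mp, sign_mismatch):
    print(traditional_rule(p), network_rule(p))
```

The two predicates agree on the first problem and disagree only on the second, where the conclusion differs in sign from the consequent.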
One reason this is an interesting result with respect to subsymbols,
is that it shows that it is genuinely the case that a link between
the symbolic level and the subsymbolic level can be found. More
importantly though, it also shows that subsymbols can be used
as a basis upon which interesting and useful generalizations about
network performance can be based. From what Smolensky (1988) claims
about subsymbols, this is just the kind of thing one would hope to
find.
Conclusion
Now that an empirical 'look' at subsymbols has been achieved, a comparison
between symbols and subsymbols is in order. Unfortunately, this
presents something of a problem because there is some disagreement
in the literature over the nature of symbols. For example, Vera
and Simon (1993: p. 9) remark that "We call patterns symbols
when they can designate or denote." However, Touretzky and
Pomerleau (1993) take issue with this view. Whilst Touretzky and
Pomerleau (1993) embrace the requirement that symbols must designate
or denote (i.e. be 'representations' in their preferred terminology),
they also add the condition that representations must have "combinatorial
power" in order to count as true symbols.
It is clear though from the evidence discussed above that the
individual activations of hidden units, as represented by the
individual dots in the jittered density plots are subsymbols in
Smolensky's (1988) sense of the term. Moreover, these dots also
fairly unambiguously satisfy Vera and Simon's (1993) condition
of symbolhood. Individual dots in the jittered density plot of
hidden unit 8 (the unit described earlier as a 'connective detector')
'designate or denote' particular main connectives in the problem
set. This being the case, subsymbols fall within the class of
'representations', as Touretzky and Pomerleau (1993) use the term,
but they fail to satisfy their conditions for symbolhood.
This is not the appropriate place to attempt to adjudicate between
the alternative definitions of 'symbol' proposed by Vera and Simon
(1993), and Touretzky and Pomerleau (1993). However, if a broader
definition of 'symbol', closer to Vera and Simon's (1993) conception
is ultimately judged to be the most appropriate, then the evidence
discussed above suggests that the difference between symbols and
subsymbols may not be as great as has previously been supposed.
Hopefully, future evidence from the analysis of trained connectionist
networks will serve to provide evidence which will clarify the
issues in the debate over the appropriate definition for symbols.
In the meantime, it appears that the Berkeley et al. (1995)
logic network analysis provides suggestive evidence for a prima
facie case to be made that connectionist networks are, in
fact, symbolic systems.
References
Bechtel, W. and Abrahamsen, A. (1991), Connectionism and the
Mind, Basil Blackwell (Cambridge, Mass.).
Berkeley, I., Dawson, M., Medler, D., Schopflocher, D., and Hornsby,
L. (1995), "Density Plots of Hidden Unit Activations Reveal
Interpretable Bands", in Connection Science, Vol.
7, No. 2, pp. 167-186.
Clark, A. (1993), Associative Engines: Connectionism, Concepts,
and Representational Change, MIT Press (Cambridge, Mass.).
Dawson, M. and Schopflocher, D. (1992), "Modifying the Generalized
Delta Rule to Train Networks of Non-monotonic Processors for Pattern
Classification", in Connection Science, 4/1, pp. 19-31.
Hanson, S. and Burr, D. (1990), "What connectionist models
learn: Learning and representation in connectionist networks",
in Behavioral and Brain Sciences, 13, pp. 471-518.
McCloskey, M. (1991), "Networks and Theories: The Place of
Connectionism in Cognitive Science" in Psychological Science,
Vol. 2, No. 6, pp. 387-395.
Mozer, M. and Smolensky, P. (1989), "Using Relevance to Reduce
Network Size Automatically" in Connection Science,
1, pp. 3-16.
Newell, A. and Simon, H. (1976), "Computer science as empirical
inquiry: Symbols and search", in Communications of the
ACM, 19/3, pp. 113-126.
Robinson, D. (1992), "Implications of Neural Networks for
How We Think about Brain Function", in Behavioral and
Brain Sciences, 15, pp. 644-655.
Schneider, W. (1987), "Connectionism: Is it a paradigm shift
for psychology?" in Behaviour Research Methods, Instruments
& Computers, 19/2, pp. 73-83.
Smolensky, P. (1988), "On the Proper Treatment of Connectionism",
in Behavioral and Brain Sciences, 11, pp. 1-74.
Van Gelder, T. (1990), "Compositionality: A Connectionist
Variation on a Classical Theme", in Cognitive Science,
14, pp. 355-384.
Table I
Argument | Connective | Description in notation
If A Then B; A; Therefore B | If...Then... | S1(V1): A; S1(V2): B; S2: A; C: B
If A Then C; Not C; Therefore Not A | If...Then... | S1(V1): A; S1(V2): C; S2: C, S2 is negated; C: A, C is negated
D Or A; Not D; Therefore A | ...Or... | S1(V1): D; S1(V2): A; S2: D, S2 is negated; C: A
B Or C; Not C; Therefore B | ...Or... | S1(V1): B; S1(V2): C; S2: C, S2 is negated; C: B
Not Both C and D; C; Therefore Not D | Not Both...And... | S1(V1): C; S1(V2): D; S2: C; C: D, C is negated
Not Both A and D; D; Therefore Not A | Not Both...And... | S1(V1): A; S1(V2): D; S2: D; C: A, C is negated
Table II (hidden unit 8)
3 & 4 | The connective is OR
      | The connective is IF...THEN
      | The connective is NOT BOTH...AND
Table III (hidden unit 3)
Band | Definite features
A | No definite features
B | S1(V1) is negated; S1(V1) is the same letter as C; S2 and C are not negated; S1(V2) is the same letter as S2; The connective is OR
C | S1(V1) is the same letter as C; S1(V2) is the same letter as S2
Table IV (Modus Ponens)
Traditional rule | Network 'rule'
S1(V1) = S2 | S1(V1) = S2
S1(V2) = C | S1(V2) = C
SIGN S1(V1) = SIGN S2 | SIGN S1(V1) = SIGN S2
SIGN S1(V2) = SIGN C | (none)
CONNECTIVE: IF...THEN | CONNECTIVE: IF...THEN