Squeak Language Definition (semi-formally)
The following more or less formal
specification of the Squeak Language is derived from an
EBNF definition of Squeak 2.7alpha by Dwight
Hughes. It comes from the Swiki, and should
be considered only a snapshot of Squeak at this version.
By "EBNF" is meant
"Extended Backus-Naur Form". This is a notation for writing
formal grammars, widely used in computer science (and elsewhere), and is named
after John Backus and Peter Naur.
The EBNF used here is defined as:
[ ... ] apply zero or one times;
[ ... ]* apply zero or more times;
[ ... ]+ apply one or more times;
... | ... choose one of the alternatives;
"..." use the literal characters enclosed;
( ... ) used for grouping.
In what follows the defined terms are
set in bold-face red, and I divide the specification into some
convenient groups, namely:
1. Letters and digits
2. Interpunctions
3.1: Terms: Variable identifiers of various kinds
3.2: Terms: Arrays
3.3: Terms: Numerical terms
3.4: Terms: Logical Constants
3.5: Terms: Logical Variables
4. Statements
This sequence works from the smallest
and simplest expressions in Squeak (letters and digits) to the
longest and most complex expressions in Squeak (methods, messages,
blocks).
1. Letters and digits
Since Squeak is a formal language like ordinary algebra (which I
suppose you to be somewhat familiar with) and is made up of
strings of characters that a computer can read, it
makes sense to start by saying from what letters and other squiggles
the well-formed strings of the Squeak Language (SL) are composed:
letter =
uppercase |
lowercase.
Letters in SL are either uppercase
or lowercase, and specifically these are:
uppercase =
"A" | "B" | "C" | "D" |
"E" | "F" | "G" | "H" |
"I" | "J" | "K" | "L" |
"M" | "N" | "O" | "P" |
"Q" | "R" | "S" | "T" |
"U" | "V" | "W" | "X" |
"Y" | "Z".
lowercase =
"a" | "b" | "c" | "d" |
"e" | "f" | "g" | "h" |
"i" | "j" | "k" | "l" |
"m" | "n" | "o" | "p" |
"q" | "r" | "s" | "t" |
"u" | "v" | "w" | "x" |
"y" | "z".
So the "letters" of Squeak
are just what an English-speaking person would expect and what can be
found on a standard English qwerty keyboard. Here, and
in what follows, the vertical stroke "|" without
quotation marks means "or" and separates alternatives. Thus in uppercase the
intended meaning is: any one of "A" ... "Z"
(without the surrounding quotation marks) is an uppercase letter, and
nothing else is.
There are more letter-like squiggles
in English and on a qwerty keyboard, and for Squeak these are named a bit
differently from standard English:
character =
("[" | "]" | "{" | "}" |
"(" | ")" | "_" | "^" |
";" | "$" | "#" | ":" |
"-" | "|" | ".") | decimal_digit |
letter | special_character.
special_character
= ("+" | "*" | "/" | "\" |
"~" | "<" | ">" | "=" |
"@" | "%" | "&" | "?" |
"!" | "`" | "," ).
The punctuation marks listed explicitly under character are mostly
used in Squeak for grouping and delimiting
terms, and the special_characters indeed have special roles, mostly in
arithmetical or logical statements.
The numerical symbols in Squeak are
again what English speakers would expect:
decimal_digit =
"0" | "1" | "2" | "3" |
"4" | "5" | "6" | "7" |
"8" | "9".
A decimal_digit in Squeak is precisely
what one would expect it to be, and more generally
digits = [digit]+.
are sequences of one or more digits. A terminological
oddity (probably due to the use of letters in number systems
that have more than 10 basic digits, for which capital letters are
normally used) is that a digit is not just a decimal_digit, but may also
be an uppercase letter:
digit = decimal_digit |
uppercase.
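The character classes defined so far can be tried out with a few lines of Python. This is only a sketch of the productions letter, digit and digits as written above, not code taken from Squeak; the names UPPERCASE, DIGITS and is_digits are made up for this illustration.

```python
import string

# Character classes copied from the productions above.
UPPERCASE = set(string.ascii_uppercase)
LOWERCASE = set(string.ascii_lowercase)
DECIMAL_DIGITS = set(string.digits)
LETTERS = UPPERCASE | LOWERCASE

# digit = decimal_digit | uppercase.
DIGITS = DECIMAL_DIGITS | UPPERCASE


def is_digits(text):
    """Return True if text matches digits = [digit]+ (one or more digits)."""
    return len(text) > 0 and all(ch in DIGITS for ch in text)
```

With these definitions is_digits("1F") holds, while is_digits("1f") does not, because a lowercase letter is a letter but not a digit.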
2. Interpunctions
If you look closely at any normal
English written text of a page or more in length, you'll find that
something like a third of it may consist of interpunction, like blanks,
dots, commas and the like. Interpunction serves as a means of grouping
characters and terms and helps the human reader. Much of it is referred
to as "whitespace", precisely because in printing practice so
much of it is indeed white space without any character.
In SL there is interpunction as
well, and indeed its purpose is to help the human readers of the Squeak
Language:
whitespace =
[space | tab | newline]+.
This is the whitespace in
SL, in fact mostly defined by reference to a standard keyboard, with a
space and a tab key (both of which are represented in the computer by
specific numbers: computers have no use for whitespace).
Note that whitespace consists of one or more of space
| tab | newline, as indeed conforms to human writing and typing
practice.
There is a tricky bit involved in
newlines, which correspond to the Enter key on the keyboard:
newline =
cr | lf | crlf.
The tricky bit arises from the desire
to cater to many operating systems: on a
Mac newline is cr; on Unix newline is lf; and on DOS newline is crlf (so
a sequence of the previous two).
In Squeak the standard newline is cr
on all platforms, but this concerns only text written inside Squeak, and
not text written on other systems and filed into Squeak, for which
reason the lf and crlf exist in Squeak.
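The three newline conventions can be illustrated by a small Python sketch that rewrites foreign text to use cr everywhere. It only illustrates the idea and makes no claim about how Squeak itself handles filed-in text; the function name to_squeak_newlines is invented here.

```python
CR = "\r"  # carriage return, the newline used inside Squeak
LF = "\n"  # line feed, the Unix newline


def to_squeak_newlines(text):
    """Replace crlf and lf by cr, leaving existing cr untouched."""
    # crlf must be handled first, otherwise it would become two newlines.
    return text.replace(CR + LF, CR).replace(LF, CR)
```

The order of the two replacements matters: replacing lf first would turn every crlf into cr cr.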
Finally, the general function of
interpunction is to separate a term from surrounding terms:
separator =
whitespace | comment.
This only adds comment, defined below
as anything occurring between two double quote marks. Comments
occur in SL to help the user. When Squeak parses a user's input
it skips all comments, effectively treating them as whitespace, with which
it also does nothing (except permit its use).
3.1 Terms 1: Variable Identifiers of various kinds
So far we have in fact considered the
smallest well-formed expressions in Squeak: characters, digits and
interpunction. Terms of
a language are well-formed expressions that are intermediate between
characters and statements, and that have some sort of meaning on their
own. In English what are called "terms" here are often also
called "words" or "phrases".
In Squeak, most of the terms are
called "identifiers", which is a fancy name for
"name". There are several kinds of them. In this section I
deal with the various kinds of variable identifiers in Squeak, used for
different purposes in different contexts, which
the user may introduce for his own ends:
identifier =
letter [letter | decimal_digit]*.
This defines the general set of
identifiers: any string that starts with a letter and otherwise consists
only of letters or decimal_digits
(and so without whitespace etc.). Likewise there is in Squeak:
capital_identifier =
uppercase [letter | decimal_digit]*.
This is just like an identifier,
except that it starts definitely with an uppercase letter. Squeak has
both of them because of its convention (chosen but not imposed by the
Squeak parser, in most cases) to have capital_identifiers for common
names (names for possibly many things) and other identifiers for
individual names (names for precisely one thing).
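The two productions translate directly into regular expressions. The following Python sketch follows the grammar exactly as given above; the function names are chosen for this illustration only.

```python
import re

# identifier = letter [letter | decimal_digit]*.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# capital_identifier = uppercase [letter | decimal_digit]*.
CAPITAL_IDENTIFIER = re.compile(r"[A-Z][A-Za-z0-9]*")


def is_identifier(text):
    """Return True if the whole of text is an identifier."""
    return IDENTIFIER.fullmatch(text) is not None


def is_capital_identifier(text):
    """Return True if the whole of text is a capital_identifier."""
    return CAPITAL_IDENTIFIER.fullmatch(text) is not None
```

Note that "my_array" is not an identifier under this grammar, since the underscore is not a letter, and that every capital_identifier is also an identifier.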
Next, there are in Squeak two explicit
ways to mark special terms:
character_constant
= "$"(character | "'" | """).
symbol_constant
= "#" symbol.
The character-constants exist, among
other things, to be able to deal with terms like "+" without
making Squeak regard them as instructions to add.
The symbol_constants exist, among
other things, to make sure that Squeak treats the symbol that
follows "#" as a unique name in the system; they are used
everywhere in its code to name parts of it. What a "symbol"
is in Squeak is defined below:
symbol =
identifier | binary_selector | [keyword]+ | string.
"Binary_selector" and
"keyword" are defined further down, and "string"
immediately below. The general point of symbols in Squeak (as in many
other computer-languages) is to have unique names for things in the
system.
One important point to notice (that
may differ from conventions in other languages) is that in this grammar a
string may serve as a symbol, so that "#" followed by a string is also a
symbol_constant.
Strings are explicitly defined as follows:
string =
"'" [character | "'" "'" |
"""]* "'".
The point of this definition is that a
string is a sequence of characters enclosed in single quotes. A single
quote inside the string is written twice, and a double quote (which
elsewhere delimits comments) may occur in it as an ordinary character.
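To see what the string production accepts, here is a minimal Python sketch that recognizes a string literal and recovers its contents. It assumes, beyond the grammar as written, that any character other than a single quote (including a blank) may occur inside the quotes; the name string_value is hypothetical.

```python
import re

# A single-quoted literal whose body consists of non-quote characters
# and doubled single quotes.
STRING = re.compile(r"'((?:[^']|'')*)'")


def string_value(literal):
    """Return the contents of a string literal, undoubling '' to '."""
    match = STRING.fullmatch(literal)
    if match is None:
        raise ValueError("not a string literal: %r" % literal)
    return match.group(1).replace("''", "'")
```

So the literal 'it''s' stands for the four characters it's, and a double quote inside the single quotes is kept as an ordinary character.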
3.2: Terms 2: Arrays
At this point we have defined the
letters, digits, interpunctions and variable identifiers of the Squeak
Language, but in fact have not yet introduced any wherewithal to do
anything useful. This we start now, with arrays.
An array is a sequence of distinct
components that can be stored and recalled as a unit by a computer. It
occurs in most computer-languages, since it provides a basic way of
storing and retrieving information.
array =
"(" [number | symbol | string | character_constant | array]*
")".
Thus an array is written as a
parenthesized sequence of items, that may in general be about
anything, including arrays. One main limitation on arrays, also in
Squeak, is that they must be pre-declared and have a fixed length. A
nice thing about the notation for arrays in Squeak is that the separator
used is not the comma, as in most languages, but whitespace.
This is easier to read, especially in long arrays.
Next, often the most convenient way
to store something in computer memory is in an array of constants.
In Squeak, this is defined with the help of the following term:
literal =
(number | symbol_constant | character_constant | string |
array_constant).
The literals consist of those items
that are constants for Squeak. These are used in the next definition:
array_constant =
"#" array of literals
Here we see at work a convention from
the previous section, namely the use of the prefix "#"
to indicate that what follows it is to be read as a
literal constant.
3.3 Terms 3: Numerical Terms
In Section (1) of this
specification, digits were defined, which enables Squeak to deal
with terms for simple natural numbers. But there are many more kinds
of numbers in mathematics, and Squeak provides for these as
follows, in a way differing from other computer languages:
number =
["-"][radix"r"]["-"]digits["."digits]["e"["-"]exponent].
The first character is an optional
"-" for negative numbers. The radix (or base, in
standard mathematics in English) is specified thus:
radix =
decimal_digits.
It is the number base of the numerical
expression following it. (The term decimal_digits is not defined
separately here; presumably it stands for one or more decimal_digit.) NOTE:
In fact radix is between 2 and 36 inclusive (actually, Squeak checks
only the lower bound - you may make the upper bound as large as you
wish, but you can represent only the first 36 digits of the larger base;
numbers entered using them are interpreted correctly however).
Also, the set of digits
allowed in a number of radix N is the first N characters of the string
'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'. The default radix is 10, as
in ordinary mathematics, and need NOT be written explicitly if this
is intended. In case you
want to calculate in another base, you must explicitly specify
the radix, and then Squeak will use it. Note that the radix, written in decimal digits,
is followed by the letter "r", marking its end and the
beginning of the actual digits of the number.
Having written the radix and the
number, one may follow this by the letter "e" for
"exponent", followed by more digits:
exponent =
decimal_digits.
This is also as in elementary
mathematics.
Note that the given definition of
"number" in Squeak allows Squeak to represent and indeed
calculate with very many different kinds of numbers, including binary
(radix = 2), octal (radix = 8) and hexadecimal (radix = 16). If you are
neither a mathematician nor a computer scientist much of this will
probably be useless, but since computers in fact store everything in
binary, these may be very handy.
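The radix notation can be made concrete with a short Python sketch that reads whole numbers written as [radix "r"] digits. It is an illustration only and covers less than the full production: the fraction and exponent parts are left out, the radix is limited to 2 through 36, and a "-" is accepted in either of the two positions the grammar allows but not in both (that last restriction is an assumption of this sketch, not something the grammar says). The name parse_integer is invented here.

```python
import re

# The first N characters of this string are the digits of radix N.
DIGIT_CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# ["-"] [radix "r"] ["-"] digits, without fraction or exponent.
INTEGER = re.compile(r"(-?)(?:([0-9]+)r)?(-?)([0-9A-Z]+)")


def parse_integer(text):
    """Return the value of a whole number in Squeak's radix notation."""
    match = INTEGER.fullmatch(text)
    if match is None:
        raise ValueError("not an integer literal: %r" % text)
    sign_before, radix_text, sign_after, digits = match.groups()
    if sign_before and sign_after:
        raise ValueError("two minus signs are not handled in this sketch")
    radix = int(radix_text) if radix_text else 10  # default radix is 10
    if radix < 2 or radix > 36:
        raise ValueError("radix must be between 2 and 36")
    value = 0
    for ch in digits:
        digit = DIGIT_CHARS.index(ch)
        if digit >= radix:
            raise ValueError("digit %r not allowed in radix %d" % (ch, radix))
        value = value * radix + digit
    return -value if (sign_before or sign_after) else value
```

For instance, parse_integer("2r101") gives 5 and parse_integer("16rFF") gives 255, while "2r2" is rejected because 2 is not a digit of radix 2.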
3.4. Terms 4: Logical Constants
Now we have finally arrived at the
beginning of Squeak's processing. First, there are the all-important:
pseudo_variable =
"true" | "false" | "nil" |
"self" | "super" | "thisContext" |
"homeContext".
These terms are called
"pseudo-variables" because they have on the one hand a
variable meaning in variable contexts (rather like pronouns in
English) but on the other hand cannot be assigned a different content,
as all variables can be in Squeak.
It need only be noted here that
"true", "false" and "nil" are used for
logical processing, while "nil" is also used for assignments
(see below). The "true" and "false" are
Squeak's Booleans, while "nil" is a very convenient
addition to these.
Next, "self"
and "super" are used to deal with Squeak's so-called
inheritance, and basically instruct Squeak where to find certain methods
(see below for methods).
Finally, "thisContext"
and "homeContext" are used mostly internally by Squeak in the
processing of blocks and other code, to keep track of what is where and
belongs to what.
I noted above that
"true" and "false" are Squeak's Booleans. These are
implemented in Squeak in another way than in other languages, and these
come with the following:
special_keyword =
"ifTrue:" | "ifTrue:ifFalse:" |
"ifFalse:" | "ifFalse:ifTrue:" |
"whileTrue" | "whileTrue:" |
"whileFalse" | "whileFalse:" | "and:" |
"or:" | "to:do:" | "to:by:do:".
Except for the last two, all of these
are used for dealing with logical alternatives and possibilities, while
the last two are used for counted loops over a range of numbers.
Three remarks should be made:
First, in Squeak, the colon ending a
term is used to indicate that a parameter follows.
In the case of terms dealing with Booleans, these parameters often are blocks (more
about which below).
Second, there are some more terms in
the category special_keyword, namely the Booleans involving nil, such as
"ifNil:". In general, it seems not completely clear at present
what is counted as belonging to the very basis of Squeak, and what
doesn't, and one reason is that in fact the logical processing in Squeak
is somewhat differently implemented from other languages.
Third, in fact the operations indicated by these keywords are normally
optimized/inlined by Squeak into tests and jumps and are not sent as
actual messages. This happens for speed reasons, and because these tests
and jumps are quite simple and very universal. However, you can have the
effect of ordinary messages (useful for debugging) by using
#perform:, #perform:with:,
#perform:with:with:, and #perform:with:with:with:.
This is another series of
special_keyword. (Not recommended, unless you are debugging.) The
"with:" etc. is a general way to pass 1, 2 or 3 parameters in
Squeak.
3.5. Terms 5: Logical Variables
The parts in the previous section are
concerned with processing by Squeak and are constants. This section
treats variables, and it should first be noted that in Squeak
variables are rather special and different from other languages.
In Squeak, variables are as it were
named slots used for arbitrary storage. That is, for Squeak a variable
is processed as made up of a name, which is a Squeak identifier, and a
contents, which has been assigned by Squeak or the user. Such a pair of
a name and the contents it refers to is called a
"variable" because the contents can be changed, while the name
remains the same.
Here there are two points of
importance. First, Squeak initializes anything it recognizes as a variable to
nil, until something else is assigned. (So all variables refer to some contents, if
only nil.)
Second, and most important for Squeak:
The contents of a variable can be anything that can be represented by a
well-formed Squeak expression. This gives Squeak very great power, and
it also liberates the user from having to add type declarations for
variables, as is usual in other computing languages, where
variables are normally of a specific kind that needs explicit
declaration, like "integer", "float" or
"string".
So the expressions for variables that
follow are in fact handles for storage spaces for the user of Squeak.
They come in several kinds, depending on the purpose they serve in
Squeak:
variable_name =
identifier.
This concerns the names for storing
the contents of the "variable" of that name, by assignment
(see below). By convention
variable names begin with lowercase, but this is not enforced by the
system, though when one uses an initial uppercase it may ask whether the
variable is to be stored as Global, i.e. accessible to the whole system,
and not just to the part in which it is declared.
temporaries
= "|"
[variable_name]* "|".
Temporaries are variables that are
only accessible and maintained by Squeak when processing the code in
which they occur. They are declared by the user by means of writing them
between two bars, separated by whitespace. (As before, variables are
initialized to nil when declared: As soon as you've written "| blab
blub |" in a Workspace Squeak has somewhere stored "blab"
and "blub" pointing to nil as long as nothing else is assigned
to them.)
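The effect of a temporaries declaration can be mimicked in Python, with None standing in for nil. This is merely a model of the description above, not of Squeak's compiler; the function declare_temporaries is hypothetical.

```python
import re

# variable_name = identifier.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def declare_temporaries(declaration):
    """Map each name in a declaration like '| blab blub |' to None (nil)."""
    text = declaration.strip()
    if len(text) < 2 or not (text.startswith("|") and text.endswith("|")):
        raise ValueError("temporaries must be written between two bars")
    names = text[1:-1].split()  # names are separated by whitespace
    for name in names:
        if IDENTIFIER.fullmatch(name) is None:
            raise ValueError("not a variable_name: %r" % name)
    return {name: None for name in names}
```

Calling declare_temporaries("| blab blub |") yields a table in which both blab and blub refer to None, just as the two temporaries refer to nil until something else is assigned.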
class_name =
capital_identifier.
In fact Squeak's classes are Squeak's
programs, and one must refer to these by identifiers starting with an
uppercase letter. (Note new class_names are usually declared and added
in a browser.)
There are several types of variables,
all named by identifiers, which I list here without explanation:
class_variable_name =
capital_identifier.
instance_variable_name =
identifier.
class_instance_variable_name
= identifier.
By convention both instance variables
and class instance variables begin with lowercase, but this is not
enforced by the system.
So far, we have dealt with names for
the programs in Squeak, and now we turn to parts of programs of
Squeak:
argument_name
= identifier.
An argument_name is the name of a parameter
of a method or block
(see below: What are called
"methods" in Squeak are in fact Squeak's programs, that are
collected in classes, where a class is a collection of programs for a
specific purpose).
There is a somewhat important NOTE:
argument_names cannot be assigned to (at least it should be disallowed).
By convention argument names begin with lowercase, but this is not
enforced by the system (though the parser may complain when one begins
with upper case). Thus, an argument_name in Squeak is not a variable in the
full sense, because nothing can be assigned to it.
Now the general approach of Squeak
towards getting things done is to have written or gotten somehow a class
of behaviors, named by methods, which may be executed by naming an
object (the receiver) and a method and sending this as a message.
The method is named by a selector, possibly followed by some parameters. In Squeak, there
are three basic kinds of messages: Those with no parameters, those with
one parameter after an operator symbol, and those with one or more parameters
each introduced by a keyword. These are
distinguished by the following terms:
unary_selector =
identifier.
By convention unary message names
begin with lowercase, but this is not enforced by the system. A simple
instance is: "2 sin" that when sent to Squeak will be
calculated as the sine of the number 2. Here "2" names
the receiver (the number 2 in this case) while "sin" is
a unary_selector.
There are quite a few unary_selectors
in Squeak, for quite a few different purposes. It is a bit different
with the next kind of message:
binary_selector =
(special_character [special_character]) | ("-"
[special_character]) | "|".
The difference is that
binary_selectors are mostly used in mathematical contexts, and are
mostly the standard mathematical arithmetical terms like +, -, * etc.
Here it should be remarked once more
(without explanation) that in Squeak numbers are represented in a
somewhat different way than in other programming languages. (This needs
some getting used to, but Squeak is remarkably powerful with numbers as
well.)
keyword =
identifier ":".
This is used to define keyword
messages (below), that correspond to the methods with one or more
parameters, each introduced by a keyword - of which there are very many in Squeak. By convention
keywords begin with lowercase, but this is not enforced by the system.
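The three kinds of selector can be told apart mechanically. The following Python sketch classifies a selector according to the productions unary_selector, binary_selector and keyword given above; it is an illustration built from this grammar alone, and the function name selector_kind is made up.

```python
import re

# special_character, written as a regular-expression character class.
SPECIAL = r"[+*/\\~<>=@%&?!`,]"

# unary_selector = identifier.
UNARY = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# binary_selector = (special_character [special_character])
#                 | ("-" [special_character]) | "|".
BINARY = re.compile(rf"(?:{SPECIAL}{SPECIAL}?|-{SPECIAL}?|\|)")

# A keyword selector is one or more keywords, each an identifier plus ":".
KEYWORD = re.compile(r"(?:[A-Za-z][A-Za-z0-9]*:)+")


def selector_kind(selector):
    """Return 'unary', 'binary' or 'keyword', or None if it is no selector."""
    if UNARY.fullmatch(selector):
        return "unary"
    if BINARY.fullmatch(selector):
        return "binary"
    if KEYWORD.fullmatch(selector):
        return "keyword"
    return None
```

Thus "sin" is classified as unary, "+" and "<=" as binary, and "at:put:" as keyword, while "at:put" is rejected because its last part lacks the colon.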
4. Statements
We arrive finally at the statements of
Squeak, that the user needs to make Squeak do anything. I start with the
very basic one:
assignment_op =
":=" | "_".
This is a binary term, used e.g.
thus: myArray := #(5 'a' #(5 'a')). This declares the variable
"myArray" and assigns it the constant array "#(5 'a' #(5
'a'))" (showing a reflexive feature possible in Smalltalk that may
interest logicians).
The term ":=" is the
classical Smalltalk operator of assignment. In Squeak one can also write
instead an underscore: "_"
which is displayed as left-arrow (but at present not in all fonts of
Squeak).
message_expression = unary_expression | binary_expression |
keyword_expression.
These are the three basic kinds of
messages described in the previous section. The basis for these
expressions is
primary = variable_name |
argument_name | literal | block | brace_expression | "("
expression ")".
The first three of these are names and
literal constants in Squeak; the last three are expressions that
Squeak can calculate a value for.
unary_object_description
= primary | unary_expression.
unary_expression =
unary_object_description unary_selector.
These two define the first of the
three kinds of message_expression
in Squeak. For the second kind
there are the following definitions:
binary_object_description = unary_object_description |
binary_expression.
binary_expression = binary_object_description binary_selector
unary_object_description.
Next, there is the last kind of
message_expression, for which we need the following:
keyword_expression
= binary_object_description [keyword binary_object_description]+.
At this point the three kinds of
messages of Squeak are defined.
The next point is to define sequences
of messages and relate them to methods. The first is done as
follows:
message_pattern = unary_selector | binary_selector argument_name |
[keyword argument_name]+.
The second thus:
method = message_pattern [temporaries] [primitive_declaration]
[statements].
The extra in method compared to
message_pattern consists of the wherewithal to make more complicated
calculations and logical decisions, and is defined as follows, insofar
as the necessary definitions have not been given yet:
primitive_declaration = "<" "primitive:"
decimal_digits ">".
Squeak's Virtual Machine comes with a
considerable number of basic operations implemented by primitives, all
of which have a unique identifying number. (There are efforts to add
names to these, so that users have a better idea what they do, but so far
this has not been done. To change primitives one has to change and
recompile Squeak's Virtual Machine, and indeed manage some programming
in C or C++).
To define statements
we need to define the following:
expression = [variable_name
assignment_op]* (primary | message_expression |
cascaded_message_expression)
which is the somewhat misleading term
used for a single statement of Squeak. The only undefined term in it is
defined thus:
cascaded_message_expression = message_expression [";" (
unary_selector | binary_selector unary_object_description | [keyword
binary_object_description]+ ) ]+.
Note this is much like message_expression.
The basic difference is related to the ";" which is a
way to send each message that follows it to the same receiver as the
message before it. (This is explained elsewhere in more detail.)
We arrive at statements, which in fact
are sequences of expressions:
statements =
[expression "."]* ["^"] expression
["."].
The "^" is a constant of
Squeak that in fact assures that Squeak returns the value it has
calculated. This is always the last statement in a block or method
(but because of logical alternatives need not be the last line in the
block or method).
Note also that expressions are
separated by dots, and that it is a convention not to write a dot behind
the last expression in statements.
Finally, we come to a powerful
implementation of a basic method in mathematical logic, called
lambda-conversion. In Squeak this is implemented by so-called blocks,
defined thus:
block = "[" [[":" argument_name]+ "|"]
[temporaries] [statements] "]".
It should be noted that as of version 2.6, Squeak has block local temporaries in a
somewhat limited form. Squeak does not yet handle blocks as full
closures -- block arguments are actually compiled as "hidden"
temporaries and block local temps have the same name scope as the method
temps.
To finish this semi-formal
specification of the Squeak Language, it remains to mention
comment = """ [character | """
""" | "'"]* """.
A comment may appear anywhere in
Squeak code, and simply acts the same as
whitespace as far as Squeak is concerned.
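How a comment acts as whitespace can be shown with a small Python sketch that replaces each comment by a single blank. It has to step over strings and character constants, since a double quote inside 'a "b" c' or after "$" does not start a comment. This is an illustration of the grammar above, not Squeak's own scanner; the names strip_comments and _end_of_quoted are invented.

```python
def _end_of_quoted(source, start, quote):
    """Return the index just past the quoted run that begins at start.

    A doubled quote character inside the run counts as part of it.
    """
    i = start + 1
    while i < len(source):
        if source[i] == quote:
            if i + 1 < len(source) and source[i + 1] == quote:
                i += 2  # doubled quote: still inside the run
                continue
            return i + 1
        i += 1
    raise ValueError("unterminated %s...%s" % (quote, quote))


def strip_comments(source):
    """Replace every comment in source by one blank."""
    out = []
    i = 0
    while i < len(source):
        ch = source[i]
        if ch == "$" and i + 1 < len(source):
            out.append(source[i:i + 2])  # character_constant, e.g. $"
            i += 2
        elif ch == "'":
            end = _end_of_quoted(source, i, "'")
            out.append(source[i:end])    # string: copied unchanged
            i = end
        elif ch == '"':
            i = _end_of_quoted(source, i, '"')
            out.append(" ")              # comment: acts as whitespace
        else:
            out.append(ch)
            i += 1
    return "".join(out)
```

Replacing a comment by a blank rather than by nothing keeps the terms on either side of it separated, which is what the production separator = whitespace | comment requires.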