CPSC 328: Program Translators

Offered:     Winter 1995
Instructor:  David J. Eck
Room:        Napier 202
Time:        MWF 2:40 to 3:50 PM

Information available on:

About the course
Testing and Grading
Programming Assigments: Summary and Details
Projects
The Stack Machine

About the course

As you know, computers execute machine language programs. That's all they do. Programs written in other languages must be translated into machine language before they can be executed. This translation is done either all at once by a compiler or statement by statement, as needed, by an interpreter. The study of program translators is really the study of so-called "formal," or artificial, languages. Although translators are important and pretty programs in their own right, the theory behind them has application in other areas, such as human-computer interface design, network protocols and artificial intelligence.

There is no textbook for the course, but I have put the following books on reserve in the library: Compilers, by Aho, Sethi and Ullman; Program Translation Fundamentals, by Peter Calingaert; and Compiler Design in C, by Allen Hub. My lectures will be based primarily on the first of these.

Testing and Grading

There will be three tests, covering material from lecture and class presentations. The tests will be given on January 30, February 22, and March 15. (The third test is during the regularly scheduled final exam period.) The three tests will count equally in the grade and will together contribute 40% of the course grade. Programming assignments will count for 45% of the course grade. The final 15% will be based on the project/presentation.

Programming Assignments

In this course, you will write a compiler for a large subset of Pascal. The compiler will translate Pascal programs into the assembly language of an "abstract stack machine" that I have defined specifically for use in this course. All of the assigments, except assigment number 0, will be part of this compiler. Here is a list of the assignments and their due dates:

0. Simple expressions (due January 9)
1. Lexical analysis and symbol table (due January 23)
2. Variables and assignment statements (due February 6)
3. I/O statements (due February 13)
4. Control statements (due February 20)
5. Subroutines (due March 3)
6. Arrays and Records (due March 13)

Assignments are due by 5:00 PM on the date listed. There will be a 15% penalty on assignments that are up to five days late. After that, no credit will be given. Since each assignment builds on previous assignments, it is important for you to get them done on time.

Each person in the class should work individually on his or her compiler. Please do not talk with each other about how you are writing your own compiler, except in general terms. If you have questions or problems, please come to me with them. In an upper level course, I do not expect that you will need help with details of coding. But you will almost certainly need help with some of the techniques of compilation.

The compiler can be written in any language you want to work in, provided that it has full support for recursion. I wrote my compiler in Pascal. C, C++, Lisp and Scheme would also be appropriate. The compiler will output a file containing the translation of the input program into the language of the abstract stack machine. You can then run the stack machine program using the interpreter I have provided.

Project and Presentation

The compiler you write for this course will use only a few of many available techniques. I will cover others in class on a theoretical level. A few others will be covered in individual projects and class presentations based on them. The project will consist of a short paper and/or a program. The presentation will be about half a class period. We will negotiate the details for each individual project. Here are some possible topics:

Lex and Yacc: These are programs used as aids in compiler writing. They are a standard part of UNIX. I will cover them in a general way in class. The project would be to look at the specifics and present some examples of how they can actually be used. This would be a project for two people, one doing Lex and one Yacc.
Optimization: The code produced by a compiler can often be improved, to make it smaller or faster. This is called optimization. We will not have time to cover it in detail (and in fact, I have no practical experience with it). There are two basic categories of optimization that are used in practice, and this would be a project for two people.
Error recovery: The compiler you write will stop as soon as it finds an error. Many compilers will try to "recover" from the error and then continue the process of compilation. (Simple example: ignoring a missing semicolon error in Pascal.) The project would be to look at methods for doing this.; Triples and quadruples. Many compilers translate programs into an intermediate language, rather than directly into machine language. The stack machine language that you will use in your compiler is an example of this. Other intermediate languages use so-called "triples" or "quadruples." The project would be to report on what these are and on how they are used.
Nested subroutines: Subroutines in Pascal can be nested inside other subroutines, which causes a number of complications. The subset of Pascal handled by your compiler will not allow this feature. The project would be to report on techniques for handling nested subroutines.
Separate compilation and linking: Long programs are usually broken up into "units" or "modules" that can be separately or independently compiled. The results of the separate compilations must be "linked" into an executable file. The project would be to report on the issues involved in separate comilation and linking.

Detailed Programming Assignments

Over the course of assignments 0, 1, 2, 3, 4, 5, and 6, you will be writing a recursive descent compiler for a large subset of Pascal. exactly what this means and how it is done will be covered in class.

Most of the assignments are specified useing "extended BNF". This is a convenient way of specifying the syntax of a language, especially for writing recursive descent compilers. The following notation is used:

    <name>  A "name", that is an item of type "name".
    :==     "Is defined as."   Example:  <sign> :==  + | - 
    |       Choice of options. For example:   + | - 
    ( )     Grouping.
    [ ]     Optional item.  For example:     [ + ]
    [ ]...  Item to be repeated zero or more times.
    "("     A literal (, rather than one used for grouping.
               Other special characters are quoted in the same way.

Assignment 0.

Simple expressions (due January 9)

Assignment number 0, due January 9, is to write a recursive descent evaluator for integer expressions as defined by the following extended Backus-Naur Form productions:

       <expression>  :==  [ + | - ]  <term>  [ ( + | - ) <term> ]...
       <term>        :==  <factor>  [ ( * | / ) <factor> ]...
       <factor>      :==  INT  |  "(" <expression> ")"

where INT can stand for any integer, that is, any sequence of digits 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9. Of course, INT could be defined in BNF notation, but it is convenient to think of all integers as being the same, as far as the syntax of the language goes. The difference is only in the semantics, that is, the meaning. "INT" is said to be a token. The actual integer, such as 1729, that the token represents in a given particular instance is said to be an attribute of the token. The token is what is needed for syntactic analysis; its attribute is needed for semantic anlysis.

Assignment 1.

Lexical analysis and symbol table (due January 23)

Assignment number 1, due January 23, is to write a lexical analyzer for your compiler, and to begin work on the symbol table. Lexical analysis refers to analyzing a program into tokens. The tokens are the minimal meaningful elements of the language. Tokens in Pascal include reserved words such as begin and procedure, operators such as := and +, identifiers, constants of various types, and punctuation. Here is a list of the tokens your lexical analyzer must recognize:

  Integer constants    Real constants    Char constants
  String constants    Identifiers   and   array
  begin    const    div    do
  downto    else    end    for
  function    if    mod    not
  of    or    procedure    program
  record    repeat    then    to
  type    until    var    while
  :=   +   -   *   /   =   >  <  >=   <=   <>
  .   ..   (   )   [   ]   ,   ;   :

I would also suggest a token to represent end-of-input and another to represent an error (such as an illegal character found in the input). Note that all constants of a given type are represented by the same token. The actual value of the constant is considered to be an "attribute" of the token. Identifiers are similar, except that the attribute for an identifier should probably be some sort of reference to a symbol-table entry for that identifier. The symbol table would contain information about the identifier, such as whether it is a global variable, parameter, constant, etc. Alternatively, you might want to have different tokens to represent different types of identifiers. Also, you have to consider how to handle identifiers that have not yet been defined. Your lexical analyzer should be pretty much complete after this assignment. Your symbol table should be able to store at least global variable names. You will need to store the type of the variable, along with its name. You can expect to do more work on your symbol table as you do the remaining assignments. I will give you more information and more suggestions in class.

Assignment 2.

Variables and assignment statements (due February 6)

For assignment number 2, you will write a compiler for a simple subset of Pascal in which the only possible statement is an assignment statement. This compiler will use the lexical anlyzer and symbol table that you wrote for assignment 1. You might find that you need to add some features to the symbol table. Your compiler should output a stack machine program equivalent to the Pascal input file.

Following is a description of the syntax of the language you should compile for this assignment. The description is written in extended BNF. The symbols IDENT, INT, FLOAT and CHAR are special tokens representing, respectively, an identifier, an integer constant, a floating point constant and a character constant. (Note that a character constant is a single character; it is not the same as a string constant.)

You should write a procedure corresponding to each of the BNF productions in the following syntax description. In the remaining assignments, you will build on what you write for this assignment. Most of the procedures you write will be in final form. The productions marked with "(*)" correspond to procedures that will need further work in later assignments. For example, the definition of <program> will be modified in a later assignment to include a type section and procedure and function definitions. You should write a procedure now for each production, even if the production is very simple like the one for <type>, since the definition might become much more complicated later.

(*)              <program>  :==  program IDENT ;
                                     [ <const_section> ]
                                     [ <var_section> ]
                                     begin <statement_list> end .

           <const_section>  :==  const IDENT = <constant> ;
                                     [ IDENT = <constant> ; ]...

             <var_section>  :== var <variable_declaration> ;
                                     [ <variable_declaration> ; ]...

    <variable_declaration>  :==  IDENT [ , IDENT ]... : <type>
    
(*)                 <type>  :==  IDENT

                <constant>  :== INT | FLOAT | CHAR | IDENT

          <statement_list>  :==  <statement> [ ; <statement> ]...

(*)            <statement>  :== <assignment_statement> | EMPTY

    <assignment_statement>  :==  <L-value> := <R-value>

                 <L-value>  :==  <variable_ref>

                 <R-value>  :==  <basic_expression> 
                                     [ ( = | < | > | <> | <= | >= ) 
                                                  <basic_Expression> ]...

        <basic_expression>  :==  [ + | - | not] <term> 
                                     [ ( + | - | or ) <term> ]...

                    <term>  :==  <factor> 
                                     [ ( * | / | div | mod | and ) <factor> ]...

(*)               <factor>  :==  <constant> | <variable_ref> |
                                     "(" <R-value> ")"

(*)         <variable_ref>  :==  IDENT

Of course, you will also have to deal with the semantics of the language. This is much harder than the syntax, which you can handle more-or-less mechanically just by following the BNF productions.

What is meant by semantics here? Well, for example, you have to know that when a variable is declared, it should be entered into the symbol table, along with any relevant information such as its type and its location in memory. When a variable is used in an expression, you have to check whether it was defined and what its type is. You have to check that the types of the left-hand side and the right hand side of an assignment statement match. There are many other things to worry about.

Here are a few hints about how to handle the semantics.

The purpose of your <L-value> procedure should be to put an address on the stack. This might be used later, for example, by a sm_Store instruction.
The purpose of your <R-value> procedure should be to put a value on the stack. The same is true for <basic_expression>, <term>, and <factor>.
You will need to know the type of each <L-value>, <basic_expression>, <term>, and <factor> that is processed. Therefore, the corresponding procedures should return (in a var parameter) an indication of the type.
A <var_section> processes variable declarations. You will need to know the number of memory locations required by the variables, so the corresponding procedure should return this number.
When an IDENT token is encountered, the semantics will determine the exact type of identifier that is acceptable and what is done with it. For example, in the defintion of a <type>, the IDENT must be the name of a type, such as "real", "boolean" or (later in the course) some name that was defined as a type in the program that is being compiled.

This is by no means everything you need to know about semantics. If you have any questions as you think about or work on your program, please ask them!

I found it extrememly useful, by the way, to write two procedures to help me in processing an <R-value>:

     procedure CheckOpType (var op: TokenKind;
                            opType: TypeIndicator);
     { if the given operator op (plusTOK, geTOK, etc.) is not }
     { legal with an operand of type opType, then generate an }
     { error; otherwise, do nothing }

     procedure Emit (var binop: TokenKind;
                     var op1Type, op2Type: TypeIndicator);
     { Assuming that two operands of types op1type and op2type }
     { have already been processed, generated the stack machine }
     { instruction or instructions needed to apply the binary }
     { operation binop to those operands.  (More than one }
     { instruction is needed if one of the operands has to be }
     { converted from integer to real, and in that case, op1Type
     { or op2Type will be changed accordingly.) }

You will probably find that you have to do some more work on your symbol table for this assignment. You should initialize the symbol table with the predefined constant identifiers "true" and "false" and with the predefined type identifiers "real", "integer", "char" and "boolean".

Assignment 3.

I/O statements (due February 13)

In this assignment, you will implement the standard input/output statements: read, readln, write and writeln. These identifiers should be added to the symbol table when it is initialized. You might declare them to be symbols of type "standardProc", for example. Here is the syntax for these statements, including a change in the definition of <>statement> that was used in the previous assignment:

      <statement>  :==  <assignment_statement> |
                           <I/O_statement>  |  EMPTY

  <I/O_statement>  :==  "read" <input_items>  |
                        "readln" [ <input_items> ]  |
                        "write" <output_items>  |
                        "writeln" [ <output_items> ]

    <input_items>  :==  "(" <L-value> [ , <L-value> ]... ")"
    
   <output_items>  :==  "(" <output_item> [ , <output_item> ]... ")"
   
    <output_item>  :==  STRING  |
                           <R-value> [ : <R-value> [ : <R-value> ] ]

Here, STRING refers to a string constant. The only types of <L-value> that are acceptable in input statements are integer, real and char. The only types of <R-value> that are acceptable in output statements are integer, real, char and boolean.

Note that the definition of <I/O_statement> is misleading. It is not really the word "read" that you would see. What I mean is an IDENT that represents the standard READ procedure. The same is true for the other words, "readln", "write" and "writeln". This is important since the standard identifiers can legally be redefined, and then they will no longer have the same meaning. I handled this in my compiler by defining a new symbol type for StandardProcedureNames, and by inserting the identifiers "read", "readln", "write" and "writeln" into the symbol table during its initialization. If the identifier is redefined, the new definition will hide the old one. There are, of course, many other ways to handle this.

Note the second and third <R-value> in the definition of <output_item>. These represent, respectively, a field width and a number of decimal places. They must be of type integer. You should allow a field width only if the value being output is of type real or integer. You should allow a number of decimal places only if the value being output is of type real. (Read the descriptions of the instructions sm_WriteInt, sm_WriteFloat and sm_WriteDecimal carefully. The field width and number of decimal places are optional in Pascal, but they are required in the stack machine language.)

Assignment 4.

Control statements (due February 20)

Control statements include things like if statements and while statements. These are relatively easy to handle (except the for statement), but they do represent the first use you will encounter of labels and jump statements in the stack machine language. Here are the new and revised syntax definitions you must implement:

        <statement>  :==  <assignment_statement> |
                              <I/O_statement>  |  
                              <if_statement>  |
                              <while_statement>  |
                              <repeat_statement>  |
                              <for_statement>  |
                              begin  <statement_list>  end  |
                              EMPTY

     <if_statement>  :==  if <R-value> then <statement>
                              [ else <statement> ]

  <while_statement>  :==  while <R-value> do <statement>
 
 <repeat_statement>  :==  repeat <statement_list> until <R-value>
 
    <for_statement>  :==  for IDENT := <R-value> ( to | downto )
                              <R-value> do <statement>

I suggest that you get everything else working before you tackle the for statement.

Assignment 5.

Subroutines (due March 3)

For this assigment, you will implement procedures and functions. (You might also want to implement the standard, predefined functions such as sin and eoln.) There are a lot of subtleties in getting user-defined subroutines to work. Here is the new and extended syntax you will have to handle:

          <program>  :==  program IDENT ;
                              [ <const_section> ]
                              [ <var_section> ]
                              [ <subroutine> ]...
                              begin <statement_list> end .

       <subroutine>  :==  ( procedure IDENT <formal_param_list> ;  |
                            function IDENT <formal_param_list> : IDENT ; )
                              [ <const_section> ]
                              [ <var_section> ]
                              begin <statement_list> end ;
                              
<formal_param_list>  :==  [ "(" <formal_param> [ ; <formal_param> ]... ")" ]

     <formal_param>  :==  [ var ] IDENT [ , IDENT ]... : IDENT
 
        <statement>  :==  <assignment_statement> |
                              <I/O_statement>  |  
                              <if_statement>  |
                              <while_statement>  |
                              <repeat_statement>  |
                              <for_statement>  |
                              <procedure_call>  |
                              begin  <statement_list>  end  |
                              EMPTY

   <procedure_call>  :==  IDENT <actual_param_list>

<actual_param_list>  :==  [ "(" <actual_param> [ , <actual_param> ]... ")" ]

     <actual_param>  :==  <L-value> | <R-value>

           <factor>  :==  <constant> | <variable_ref>  |
                              "(" <R-value> ")"  |
                              <function_call>

    <function_call>  :==  IDENT <actual_param_list>

Some notes on this:

The name of a procedure or function is in "global scope", that is, is visible in the rest of the program. The parameter and local variable names constitute a separate scope, and they are visible only while the subroutine is being compiled.
The symbol table entry for a subroutine name should include at least:
1. The location (instruction number) in the compiled code where the subroutine begins.
2. The number of parameters.
3. The type of each parameter and an indication of whether it is a var or value parameter. (There are various ways to handle this information; more on this in class.)
4. For functions, the type of the returned value.
While a function is being compiled, an assignment to the function name is legal. You will probably have to play some tricks to implement this.
An <actual_param> must be an <L-value> if it corresponds to a var parameter; otherwise, it must be an <R-value>.
When a procedure or function call is executed, the actual parameters are evaluated and left on the stack, where they will be part of the activation record. Then an sm_Subroutine is executed to transfer control to the beginning of the subroutine's code. Upon return from the subroutine, the first thing that should happen is that the block of memory occupied by the parameters is released.
For a function, you have to worry about the return value, which has to be left on the stack. I suggest reserving space for the return value on the stack before processing the actual parameters.
The only legal function return types are: integer, real, char and boolean.
The compiled code for the subroutine itself should begin with an sm_ReserveBlock instruction to reserve space on the stack for the local variables. Then, there should be an sm_SetBase command to set the base register to the beginning of the activation record. (Don't forget to allow for the return address put on the stack by the sm_Subroutine instruction that will be used to call the subroutine.)
The subroutine should end with an sm_RestoreBase instruction, followed by an sm_FreeBlock instruction, followed by an sm_Return instruction.

Assignment 6.

Arrays and Records (due March 13)

The final assignment is to add array and record types to the language handled by your compiler. This is not so difficult as it might sound. However, you will have to add to your symbol table again to account for type definitions. Also, you will have to account for the existence of variables that occupy more than one memory location, which will require changes in some of the existing procedures in your program. Here is the syntax specification:

        <type>  :=  IDENT | <array_type> | <record_type>

  <array_type>  :==  array "[" <constant> ".." <constant> "]"
                              of <type>

 <record_type>  :==  record <variable_declaration> 
                              [ ; <variable_declaration> ]... [ ; ] end

<variable_ref>  :==  IDENT  [ <var_modifier> ]...

<var_modifier>  :==  ( "[" <R-value> "]" )  |  ( "." IDENT )

     <program>  :==  program IDENT ;
                              [ <const_section> ]
                              [ <type_section> ]
                              [ <var_section> ]
                              [ <subroutine> ]...
                              begin <statement_list> end .

  <subroutine>  :==  ( procedure IDENT <formal_param_list> ;  |
                            function IDENT <formal_param_list> : IDENT ; )
                              [ <const_section> ]
                              [ <type_section> ]
                              [ <var_section> ]
                              begin <statement_list> end ;

<type_section>  :==  type <type_def> ; [ <type_def> ]...

    <type_def>  :==  IDENT = <type>

I suggest they you keep an array of "TypeDescriptions" to store information about user defined types. Types in your program can be specified as integers: A positive integer would refer to a user-defined type while negative integers could refer to the standard types integer, real, char and boolean. (You could also store the type descriptions directly in the symbol table, but keep in mind that not all types are named. For example, in the declaration

      var X: array[1..10] of integer;

the type of the variable X is anonymous. If type descriptions are stored in the symbol table, types in your program can be specifed as pointers or indices of symbol table entries.)

The above syntax does not allow for multi-dimensional arrays, but you can still have them for all intents and purposes. For example,

      type matrix5by3 = array[1..5] of array[1..3] of integer;

Note that the field names of a record are essentially in a scope of their own. Thus, they can duplicate names used outside the record. You might or might not want to handle them in the same way as other symbols, since they can only be used in a limited way, that is in a <variable_ref> involving a record variable.