Ssnotes
SIC Machine Architecture- SIC/XE Machine Architecture- Traditional (CISC) Machines- VAX Architecture-
Pentium Pro Architecture – RISC Machines – UltraSPARC Architecture- PowerPC Architecture- Cray T3E
Architecture.
1.0 Introduction
Software is a set of instructions or programs written to carry out a task on a digital computer. It is
classified into system software and application software. System software consists of a variety of programs
that support the operation and use of a computer. Application software focuses on a particular application
or problem to be solved.
Examples of system software are the operating system, compiler, assembler, macro processor, loader or
linker, debugger, text editor, (some) database management systems, and software engineering tools. These
programs make it possible for the user to focus on an application or other problem to be solved, without
needing to know the details of how the machine works internally.
System software is closely tied to the machine it runs on. An assembler translates mnemonic
instructions into machine code, so the instruction formats, addressing modes, etc., are of direct concern in
assembler design. Similarly, compilers must generate machine language code, taking into account such
hardware characteristics as the number and type of registers and the machine instructions available.
Operating systems are directly concerned with the management of nearly all of the resources of a
computing system.
Some aspects of system software do not directly depend on the type of computing system: for
example, the general design and logic of an assembler, the general design and logic of a compiler, and
code optimization techniques that are independent of the target machine. Likewise, the process of linking
together independently assembled subprograms does not usually depend on the computer being used.
1.2 The Simplified Instructional Computer (SIC)
Simplified Instructional Computer (SIC) is a hypothetical computer that includes the hardware
features most often found on real machines. There are two versions of SIC: the standard model (SIC)
and an extended version, SIC/XE (XE stands for "extra equipment" or "extra expensive").
Memory
There are 2^15 bytes in the computer memory, that is, 32,768 bytes. Each memory location contains
an 8-bit byte, and any 3 consecutive bytes form a word (24 bits). It uses little-endian format to store
numbers.
Registers
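There are five registers, each 24 bits in length. Their mnemonics, numbers, and uses are given in the
following table.
Mnemonic   Number   Use
A          0        Accumulator; used for arithmetic operations
X          1        Index register; used for addressing
L          2        Linkage register; JSUB stores the return address here
PC         8        Program counter
SW         9        Status word; includes the condition code (CC)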
Data Formats
Integers are stored as 24-bit binary numbers, with 2's complement representation used for negative
values. Characters are stored using their 8-bit ASCII codes. There is no floating-point hardware on the
standard version of SIC.
Instruction Formats
All machine instructions on the standard version of SIC have the same 24-bit format: an 8-bit opcode,
a 1-bit flag x that indicates indexed addressing, and a 15-bit address.
Addressing Modes
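There are two addressing modes available, selected by the x bit of the instruction:
Mode      Indication   Target address calculation
Direct    x = 0        TA = address
Indexed   x = 1        TA = address + (X)
Parentheses are used to indicate the contents of a register or a memory location.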
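The target address calculation for the two SIC addressing modes can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes the 24-bit layout of an 8-bit opcode, the x bit, and a 15-bit address.

```python
def sic_opcode(instruction):
    """Return the 8-bit opcode of a 24-bit standard SIC instruction."""
    return (instruction >> 16) & 0xFF


def sic_target_address(instruction, x_register=0):
    """Return the target address (TA) of a 24-bit standard SIC instruction."""
    index_flag = (instruction >> 15) & 0x1   # the x bit
    address = instruction & 0x7FFF           # the 15-bit address field
    if index_flag:
        # Indexed addressing: TA = address + (X)
        return address + x_register
    # Direct addressing: TA = address
    return address
```

For instance, `sic_target_address(0x549039, x_register=5)` treats the instruction as indexed, because the x bit is set, and adds the contents of X to the address field 1039.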
Instruction Set
SIC provides load and store instructions (LDA, LDX, STA, STX, etc.) and integer arithmetic
operations (ADD, SUB, MUL, DIV). All arithmetic operations involve register A and a word in
memory, with the result being left in the register. COMP compares the value in register A with a word
in memory and sets a condition code CC to indicate the result. The conditional jump instructions (JLT,
JEQ, JGT) test the setting of CC and jump accordingly. Two instructions are provided for subroutine
linkage: JSUB jumps to the subroutine, placing the return address in register L, and RSUB returns by
jumping to the address contained in register L.
Input and output are performed by transferring 1 byte at a time to or from the rightmost 8 bits of
register A (accumulator). The Test Device (TD) instruction tests whether the addressed device is ready
to send or receive a byte of data. Read Data (RD) and Write Data (WD) are used for reading and
writing the data.
LDA, STA, LDL, STL, LDX, and STX (A: accumulator, L: linkage register, X: index register) all
operate on a 3-byte word. LDCH and STCH operate on a single byte (character). There are no
memory-to-memory move instructions.
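Storage definitions are made with the assembler directives WORD and RESW (one-word constant,
reserved words) and BYTE and RESB (byte constant, reserved bytes).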
          JLT    MOVECH
          .
          .
          .
STR1      BYTE   C'HELLO WORLD'
STR2      RESB   11
ZERO      WORD   0
ELEVEN    WORD   11
Example 5 (To transfer two hundred bytes of data from input device to memory)
          LDX    ZERO
CLOOP     TD     INDEV
          JEQ    CLOOP
          RD     INDEV
          STCH   RECORD,X
          TIX    B200
          JLT    CLOOP
          .
          .
INDEV     BYTE   X'F5'
RECORD    RESB   200
ZERO      WORD   0
B200      WORD   200
Example 6 (Subroutine to transfer two hundred bytes of data from input device to memory)
          JSUB   READ
          ………….
          ………….
READ      LDX    ZERO
CLOOP     TD     INDEV
          JEQ    CLOOP
          RD     INDEV
          STCH   RECORD,X
          TIX    B200         : add 1 to index, compare with 200 (B200)
          JLT    CLOOP
          RSUB
          ……..
          ……..
INDEV     BYTE   X'F5'
RECORD    RESB   200
ZERO      WORD   0
B200      WORD   200
Registers
[Figure: SIC/XE floating-point data format; fields: s, exponent, fraction]
Instruction Formats
The new set of instruction formats for the SIC/XE machine architecture is as follows.
Format 1 (1 byte): contains only the operation code (taken directly from the opcode table).
Format 2 (2 bytes): the first eight bits hold the operation code, the next four register 1, and the last
four register 2. Registers are identified by the numbers given in the registers section (e.g., register T is
encoded as hex 5 and F as hex 6).
Format 3 (3 bytes): the first 6 bits contain the operation code, the next 6 bits contain flags, and the last
12 bits contain the displacement used to compute the address of the operand. Because the operation
code uses only 6 bits, the second hex digit of the instruction is affected by the values of the first two
flags (n and i). The flags, in order, are n, i, x, b, p, and e; their function is explained in the next
section. The last flag, e, indicates the instruction format (0 for format 3 and 1 for format 4).
Format 4 (4 bytes): the same as format 3, but with a 20-bit address field in place of the 12-bit
displacement (2 extra hex digits, i.e., 8 bits), for addresses that cannot be represented in 12 bits.
Format 1 (1 byte)
  Bits:    8
  Field:   op
Format 2 (2 bytes)
  Bits:    8    4    4
  Field:   op   r1   r2
Format 3 (3 bytes)
  Bits:    6    1   1   1   1   1   1   12
  Field:   op   n   i   x   b   p   e   disp
Format 4 (4 bytes)
  Bits:    6    1   1   1   1   1   1   20
  Field:   op   n   i   x   b   p   e   address
Direct (x, b, and p all set to 0): the disp or address field is used as the target address as it is. Bits
n and i are both set to the same value, either 0 or 1. In general that value is 1; if both are 0 in a
format 3 instruction, bits b, p, and e are treated as part of the address field rather than as flags, which
makes the format compatible with the standard SIC format.
Relative (either b or p equal to 1 and the other equal to 0): the displacement is added to the current
value of the B register (if b = 1, base relative) or of the PC register (if p = 1, program-counter
relative) to obtain the target address. These two relative addressing modes are new in SIC/XE and are
available only for instructions assembled using format 3.
Immediate (i = 1, n = 0): the operand value is contained in the instruction itself (it lies in the last
12 or 20 bits of the instruction), so no memory reference is made.
Indirect (i = 0, n = 1): the word at the target address holds the address of the operand value.
Simple (i = 0, n = 0 or i = 1, n = 1): the target address (TA) is taken as the address of the operand
value.
Indexed (x = 1): the value in register X is added when computing the target address. Indexing can be
combined with the direct and relative modes, but not with immediate or indirect addressing.
The flag bits used in the above formats have the following meanings:
e: e = 0 means format 3, e = 1 means format 4.
x, b, p: used to calculate the target address using the relative, direct, and indexed addressing modes.
If b and p are both 0, the disp field of a format 3 instruction is taken as the target address. In a
format 4 instruction, b and p are normally 0 and the 20-bit address field is the target address.
n, i: select immediate, indirect, or simple addressing, as described above.
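The way the flag bits drive the target address calculation can be sketched as follows. This is a minimal Python illustration, not a full simulator: the function name and parameters are hypothetical, `pc` is assumed to be the value of the PC register at the time the address is calculated, and error checking is left out.

```python
def sicxe_target_address(instr, nbytes, pc=0, base=0, x_reg=0):
    """Return (mode, TA) for a format 3 (nbytes=3) or format 4 (nbytes=4) instruction."""
    if nbytes not in (3, 4):
        raise ValueError("only formats 3 and 4 carry an address")
    width = 12 if nbytes == 3 else 20            # disp or address field width
    field = instr & ((1 << width) - 1)
    flags = (instr >> width) & 0x3F              # n i x b p e
    n, i, x, b, p, _e = [(flags >> k) & 1 for k in (5, 4, 3, 2, 1, 0)]

    if n == 0 and i == 0 and nbytes == 3:
        # SIC-compatible form: b, p, e are part of a 15-bit address field
        ta = instr & 0x7FFF
    elif p:
        # PC-relative: disp is a signed 12-bit value
        disp = field - 0x1000 if field >= 0x800 else field
        ta = pc + disp
    elif b:
        # Base-relative: disp is an unsigned 12-bit value
        ta = base + field
    else:
        # Direct: the field is the target address
        ta = field
    if x:
        ta += x_reg                              # indexed addressing

    if i and not n:
        mode = "immediate"    # TA itself is the operand value
    elif n and not i:
        mode = "indirect"     # word at TA holds the operand address
    else:
        mode = "simple"       # TA is the operand address
    return mode, ta
```

For example, the format 3 instruction 032600 with (PC) = 3000 has n = i = 1 and p = 1, so the sketch adds disp 600 to the PC value.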
Instruction Set
SIC/XE provides all of the instructions that are available on the standard version. In addition, there
are instructions to load and store the new registers (LDB, STB, etc.), floating-point arithmetic
operations (ADDF, SUBF, MULF, DIVF), a register move instruction (RMO), register-to-register
arithmetic operations (ADDR, SUBR, MULR, DIVR), and a supervisor call instruction (SVC).
There are I/O channels that can be used to perform input and output while the CPU is
executing other instructions. This allows overlap of computing and I/O, resulting in more efficient
system operation. The instructions SIO, TIO, and HIO are used to start, test, and halt the operation of
I/O channels.
          LDA    #5
          STA    ALPHA
          LDA    #90
          STCH   C1
          .
          .
ALPHA     RESW   1
C1        RESB   1
Example 2 (Arithmetic operations)
          LDT    #11
          LDX    #0           : X = 0
MOVECH    LDCH   STR1,X       : LOAD A FROM STR1
          STCH   STR2,X       : STORE A TO STR2
          TIXR   T            : ADD 1 TO X, TEST (T)
          JLT    MOVECH
          ……….
          ……….
          ………
STR1      BYTE   C'HELLO WORLD'
STR2      RESB   11
Example 4 (To transfer two hundred bytes of data from input device to memory)
          LDT    #200
          LDX    #0
CLOOP     TD     INDEV
          JEQ    CLOOP
          RD     INDEV
          STCH   RECORD,X
          TIXR   T
          JLT    CLOOP
          .
          .
INDEV     BYTE   X'F5'
RECORD    RESB   200
Example 5 (Subroutine to transfer two hundred bytes of data from input device to memory)
          JSUB   READ
          ……….
          ……….
READ      LDT    #200
          LDX    #0
CLOOP     TD     INDEV
          JEQ    CLOOP
          RD     INDEV
          STCH   RECORD,X
          TIXR   T            : add 1 to index, compare with T
          JLT    CLOOP
          RSUB
          ……..
          ……..
INDEV     BYTE   X'F5'
RECORD    RESB   200
Registers – There are 16 general-purpose registers (GPRs), 32 bits each, named R0 to R15. Four have
special uses: PC (R15), SP (R14), the frame pointer FP (R13), and the argument pointer AP (R12); the
others are available for general use. There is also a processor status longword (PSL), which holds flags.
Data Formats – Integers are stored as binary numbers in byte, word, longword, quadword, or octaword
format. 2's complement notation is used for negative numbers. Characters are stored as 8-bit ASCII
codes. Four different floating-point data formats are also available.
Instruction Formats – VAX uses variable-length instruction formats: an opcode of 1 or 2 bytes followed
by up to 6 operand specifiers, depending on the type of instruction. Tabak, Advanced Microprocessors
(2nd edition), McGraw-Hill, 1995, gives more information.
Addressing Modes – VAX provides a large number of addressing modes: register mode, register
deferred mode, autoincrement, autodecrement, base relative, program-counter relative, indexed,
indirect, and immediate.
Instruction Set – Instructions are symmetric with respect to data type. Mnemonics are built from a
prefix that gives the type of operation, a suffix that gives the type of the operands, and a modifier that
gives the number of operands. For example, ADDW2 is an add with word-length operands and 2
operands; MULL3 is a multiply with longword operands and 3 operands; CVTWL is a conversion from
word to longword. VAX also provides instructions to load and store multiple registers.
Input and Output – Input and output are performed by I/O device controllers. The device control
registers are mapped into a separate part of the address space called I/O space. Software routines and
memory management routines are used for input/output operations.
1.3.1.2 Pentium Pro Architecture
Introduced by Intel in 1995.
Memory – Memory consists of 8-bit bytes, and all addresses are byte addresses. Two consecutive bytes
form a word, and four bytes form a double word (dword). Memory is viewed as a collection of
segments, with an address given as a segment number plus an offset. There are code, data, stack, and
extra segments.
Registers – There are eight 32-bit GPRs: EAX, EBX, ECX, EDX, ESI, EDI, EBP, and ESP. EAX, EBX,
ECX, and EDX are used for data manipulation; the other four are used to hold addresses. EIP is a
32-bit register that contains a pointer to the next instruction to be executed. FLAGS is a 32-bit flag
register. CS, SS, DS, ES, FS, and GS are the six 16-bit segment registers.
Data Formats – Integers are stored as 8-, 16-, or 32-bit binary numbers, with 2's complement used for
negative numbers. BCD is also supported, in unpacked and packed forms. There are three floating-point
data formats: single, double, and extended precision. Characters are stored one per byte as ASCII
codes.
Instruction Formats – An instruction begins with optional prefixes (specifying, for example, a
repetition count or a segment register), followed by an opcode (1 or 2 bytes) and then a number of
bytes that specify the operands and addressing modes. Instruction length varies from 1 byte to 10 bytes
or more. The opcode is the only part that is present in every instruction.
Addressing Modes – A large number of addressing modes are available, including immediate mode,
register mode, direct mode, and relative mode. A base register and an index register can also be
combined with a displacement.
Instruction Set – This architecture has a large and complex instruction set of approximately 400
different machine instructions. An instruction may have one, two, or three operands. There are, for
example, register-to-register, register-to-memory, and memory-to-memory instructions, as well as
string manipulation instructions.
Input and Output – Input is performed from an I/O port into register EAX; output is performed from
EAX to an I/O port.
Registers – There is a large register file with more than 100 GPRs, each 64 bits long. There are 64
double-precision floating-point registers in a special floating-point unit (FPU). In addition to these,
there are a PC, condition code registers, and control registers.
Data Formats – Integers are stored as 8-, 16-, 32-, or 64-bit binary numbers. Both signed and unsigned
integers are supported, with 2's complement used for negative numbers. Both big-endian and
little-endian byte orderings are supported. Single-, double-, and quad-precision floating-point data
formats are available. Characters are stored as 8-bit ASCII values.
Instruction Formats – All instructions are 32 bits long. There are three basic instruction formats, and
the first two bits identify the format. Format 1 is used for the call instruction, format 2 for branch
instructions, and format 3 for loads, stores, and arithmetic operations.
Addressing Modes – This architecture supports immediate mode, register direct mode, PC-relative,
register indirect with displacement, and register indirect indexed.
Instruction Set – It has fewer than 100 machine instructions. The only instructions that access memory
are loads and stores; all other instructions are register-to-register operations. Instruction execution is
pipelined, which results in faster execution.
Input and Output – Communication with I/O devices is accomplished through memory. A range of
memory locations is logically replaced by device registers. When a load or store instruction refers to
this device register area of memory, the corresponding device is activated. There are no special I/O
instructions.
1.3.2.2 Cray T3E Architecture
The Cray T3E was announced by Cray Research Inc. at the end of 1995. It is a massively parallel
processing (MPP) system that contains a large number of processing elements (PEs) arranged in a
three-dimensional network. Each PE consists of a DEC Alpha EV5 RISC processor and local memory.
Memory – Each PE in the T3E has its own local memory, with a capacity of 64 megabytes to 2
gigabytes. Memory consists of 8-bit bytes, and all addresses are byte addresses. Two consecutive bytes
form a word, four bytes form a longword, and eight bytes form a quadword.
Registers – There are 32 general-purpose registers (GPRs), each 64 bits long, called R0 through R31;
R31 always contains the value zero. In addition, there are 32 floating-point registers, each 64 bits long,
and a 64-bit PC, status registers, and control registers.
Data Formats – Integers are stored as longword and quadword binary numbers, with 2's complement
notation used for negative numbers. Only little-endian byte ordering is supported. There are two
different kinds of floating-point data formats: VAX and IEEE standard. Characters are stored as 8-bit
ASCII values.
Instruction Formats – All instructions are 32 bits long. There are five basic instruction formats, and the
first six bits always identify the opcode.
Addressing Modes – This architecture supports immediate mode, register direct mode, PC-relative, and
register indirect with displacement.
Instruction Set – It has approximately 130 machine instructions. There are no byte or word load and
store instructions. Smith and Weiss, "PowerPC 601 and Alpha 21064: A Tale of Two RISCs", gives
more information.
Input and Output – Communication with I/O devices is accomplished through multiple ports and I/O
channels. The channels are integrated into the network that interconnects the processing elements. All
channels are accessible and controllable from all PEs.
Unit II:
Assemblers- Basic Assembler Functions- A simple SIC Assembler- Assembler Algorithm and Data
Structures. Machine-Dependent Assembler features-Instruction formats and addressing modes. Machine-
Independent Assembler features-Literals- Expressions-Program blocks. Assembler Design options--One pass
Assemblers- Multi-pass Assemblers.
Consider the following assembly language program for SIC. It contains a main routine
that calls two subroutines:
• RDREC – reads a record from an input device (device code F1) into a buffer.
• WRREC – writes the record from the buffer to an output device (device code 05).
The end of each record is marked with a null character (hexadecimal 00). At the end of the
file, the program writes EOF on the output device.
The line numbers are for reference only. Indexed addressing is indicated by adding
the modifier "X" following the operand. Lines beginning with "." contain comments only.
Figure 2.1 –Example of a SIC assembler language program
Data transfer (RD, WD)
A buffer (BUFFER) is used to store a record. The buffer length is 4096 bytes. The end of
each record is marked with a null character (hexadecimal 00). The end of the file is indicated
by a zero-length record. When the end of file is detected, the program writes EOF on the
output device and terminates by executing RSUB.
• Header record
Col. 1       H
Col. 2-7     Program name
Col. 8-13    Starting address of object program (hex)
Col. 14-19   Length of object program in bytes (hex)
• Text record
Col. 1       T
Col. 2-7     Starting address for object code in this record (hex)
Col. 8-9     Length of object code in this record in bytes (hex)
Col. 10-69   Object code, represented in hex (2 columns per byte). A maximum of
             30 bytes can therefore be stored in each text record.
• End record
Col. 1       E
Col. 2-7     Address of first executable instruction in object program (hex)
(The "^" character is used only to separate fields.)
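As an illustration of the three record layouts, the following Python sketch builds each record as a fixed-column string. The function names are hypothetical, and the "^" separators used in printed listings are left out.

```python
def header_record(name, start, length):
    """H, program name in columns 2-7, start address and length in hex."""
    return f"H{name:<6.6}{start:06X}{length:06X}"


def text_record(start, object_codes):
    """T, start address, byte count, then the object code in hex."""
    body = "".join(object_codes)
    nbytes = len(body) // 2              # two hex columns per byte
    if nbytes > 30:
        raise ValueError("a text record holds at most 30 bytes")
    return f"T{start:06X}{nbytes:02X}{body}"


def end_record(first_instruction):
    """E, then the address of the first executable instruction."""
    return f"E{first_instruction:06X}"
```

A text record built from three 3-byte instructions, for example, gets the length field 09.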
There are two columns per byte of object code. Each machine instruction is 3 bytes long,
so it occupies 6 columns. The first text record holds 10 machine instructions of 3 bytes
each, for a total of 30 bytes (60 columns); 30 is 1E in hexadecimal, which is the length
field of that record (1E is circled in the example given).
Forward reference: a reference to a label that is defined later in the program.
In the above example, the instruction STL at location 1000 stores the contents of the
linkage register in RETADR. But when this instruction is processed, the value of this
symbol is not yet known, because it is defined later, at location 1033.
To generate the object code for the instruction at 1000 we need the opcode for STL
and the value of the symbol RETADR, but the address of RETADR is not available until
location 1033 is reached. Such a reference to RETADR before it is defined is called a
forward reference.
So generating the object code by scanning the entire program only once becomes
difficult. For this reason, the assembler is usually designed in two passes. A two-pass
assembler resolves forward references with the help of a symbol table (SYMTAB) and then
converts the program into object code.
the pass 1. During Pass 2, symbols used as operands are looked up in the symbol table to
obtain the address values to be inserted in the assembled instructions. SYMTAB is usually
organized as a hash table for efficiency of insertion and retrieval. A sample SYMTAB is
shown below.
(Both pass 1 and pass 2 need to read the source program. Pass 1 therefore creates an
intermediate file that contains each source statement together with its assigned address,
error indicators, etc. This file is used as the input to pass 2. This working copy of the
source program can also retain the results of certain operations performed during pass 1
(such as scanning the operand field for symbols and addressing flags), so that these need
not be performed again during pass 2.)
2.4.3 The Algorithm for Pass 1
Explanation of Pass 1 Algorithm:
The algorithm reads the first statement. If it is START, the operand field (the
address) is saved as the starting address of the program, the LOCCTR is initialized to this
address, and the line is written to the intermediate file. If no starting address is given,
the LOCCTR is initialized to zero.
If a statement has a label, the symbol is entered in the symbol table along with its
associated address value (the current LOCCTR). If the symbol is already in the table, an
error flag is set to indicate a duplicate symbol.
Next, the mnemonic in the operation field is searched for in OPTAB. If it is found,
the length of the instruction is added to the LOCCTR so that it points to the next
instruction.
If the opcode is the assembler directive WORD, 3 is added to the LOCCTR. If it is
RESW, 3 times the number of words is added (each word is 3 bytes). If it is BYTE, the
length of the constant in bytes is added, and if it is RESB, the number of bytes reserved is
added. Each processed line is written to the intermediate file. When the END directive is
reached, the program is complete, and its length is computed as the current LOCCTR minus
the starting address saved from the START statement.
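The steps described above can be sketched in Python as follows. This is a simplified illustration rather than the algorithm figure itself: the OPTAB shown is a small hypothetical subset, statements are assumed to be already split into (label, opcode, operand) tuples, and comment lines are assumed to have been removed.

```python
# Hypothetical subset of OPTAB: mnemonic -> opcode (every SIC instruction is 3 bytes)
OPTAB = {"LDA": 0x00, "STA": 0x0C, "STL": 0x14, "JSUB": 0x48, "RSUB": 0x4C}


def constant_length(operand):
    """Length in bytes of a BYTE operand such as C'EOF' or X'F1'."""
    kind, text = operand[0], operand[2:-1]
    if kind == "C":
        return len(text)
    if kind == "X":
        return len(text) // 2
    raise ValueError(f"bad BYTE operand: {operand}")


def pass1(statements):
    """statements: (label, opcode, operand) tuples, with '' for an empty field."""
    symtab, intermediate, errors = {}, [], []
    start = locctr = 0
    for label, opcode, operand in statements:
        if opcode == "START":
            # Save the starting address and initialise LOCCTR to it
            start = locctr = int(operand, 16) if operand else 0
            intermediate.append((locctr, label, opcode, operand))
            continue
        intermediate.append((locctr, label, opcode, operand))
        if opcode == "END":
            break
        if label:
            if label in symtab:
                errors.append(f"duplicate symbol {label}")
            else:
                symtab[label] = locctr
        if opcode in OPTAB or opcode == "WORD":
            locctr += 3
        elif opcode == "RESW":
            locctr += 3 * int(operand)
        elif opcode == "RESB":
            locctr += int(operand)
        elif opcode == "BYTE":
            locctr += constant_length(operand)
        else:
            errors.append(f"invalid operation code {opcode}")
    # Program length = final LOCCTR minus the saved starting address
    return symtab, intermediate, locctr - start, errors
```

The intermediate "file" is kept here as a list of tuples that carry the assigned address of each statement, which is what pass 2 would read.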
Explanation of Pass 2 Algorithm:
The first input line is read from the intermediate file. If the opcode is START,
the line is written directly to the listing file (output file). A header record is then
written to the object program, giving the starting address and the length of the program
(which was calculated during pass 1).
Then the first text record is initialized. Comment lines are ignored. For each
instruction, OPTAB is searched to find the machine code of the opcode. If there is a symbol
in the operand field, the symbol table is searched for its address, which is combined with
the opcode to form the object code. If the symbol is not found, zero is stored as the
operand address and an error flag is set to indicate an undefined symbol; the instruction is
still assembled.
If the opcode is BYTE or WORD, the constant is converted to its equivalent object
code (for example, the character constant EOF is stored as hexadecimal 454F46). If the
object code does not fit into the current text record, that record is written to the object
program and a new text record is started. Once the whole program has been assembled and
the END directive is encountered, the last text record is written, followed by the End
record.
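The object code generation step of pass 2 can be illustrated with a short Python sketch for standard SIC instructions. The names are hypothetical and OPTAB is only a small subset; the sketch assumes the 24-bit format with an 8-bit opcode, the x bit, and a 15-bit address.

```python
# Hypothetical subset of OPTAB: mnemonic -> opcode
OPTAB = {"LDA": 0x00, "STL": 0x14, "JSUB": 0x48, "RSUB": 0x4C, "STCH": 0x54}


def assemble_instruction(opcode, operand, symtab):
    """Return (object code as 6 hex digits, error message or None)."""
    code = OPTAB[opcode] << 16
    error = None
    if operand:
        indexed = operand.endswith(",X")
        symbol = operand[:-2] if indexed else operand
        address = symtab.get(symbol)
        if address is None:
            # Undefined symbol: store 0 as the operand address and flag it
            address, error = 0, f"undefined symbol {symbol}"
        code |= address & 0x7FFF
        if indexed:
            code |= 0x8000               # set the x bit
    return f"{code:06X}", error


def assemble_constant(opcode, operand):
    """Object code for a WORD or BYTE directive."""
    if opcode == "WORD":
        return f"{int(operand) & 0xFFFFFF:06X}"
    if operand.startswith("C'"):
        return operand[2:-1].encode("ascii").hex().upper()
    if operand.startswith("X'"):
        return operand[2:-1].upper()
    raise ValueError(f"bad constant: {operand}")
```

With RETADR at address 1033 in the symbol table, `assemble_instruction("STL", "RETADR", symtab)` combines opcode 14 with that address.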
Study the instruction formats and addressing modes of SIC/XE from first module.
Program Relocation
Sometimes it is necessary to load and run several programs at the same time. The
system must be able to load these programs wherever there is room in memory, so the exact
starting address is not known until load time.
In an absolute program, the starting address at which the program must be loaded is
given in the program itself using the START directive. The addresses of all instructions
and labels are therefore known at assembly time. This is called absolute addressing.
Consider, for example, the statement LDA THREE, where THREE is at address 102D.
This statement says that register A is loaded with the value stored at location 102D.
Suppose we need to load and execute the program at location 3000 instead of location
1000. Address 102D then no longer contains the value that should be loaded into register
A, because the addresses of all symbols are displaced along with the program. Hence the
address portion of the instruction must be changed so that the program can be loaded and
executed at location 3000.
Since the assembler does not know the actual location where the program will be
loaded, it cannot make the necessary changes to the addresses used in the program.
However, the assembler can identify for the loader those parts of the program that need
modification. An object program that contains the information necessary to perform this
kind of modification is called a relocatable program.
The above diagram shows the concept of relocation. In the first figure the program is
loaded at location 0000. The JSUB instruction is at location 0006, and its address field
contains 01036, which is the address of the instruction labeled RDREC. The second figure
shows the program loaded at location 5000: the address field of the JSUB instruction must
be changed to 06036. Likewise, the third figure shows that if the program is loaded at
location 7420, the JSUB instruction must be changed to 4B108456, which corresponds to the
new address of RDREC.
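The arithmetic behind these figures can be checked with a small Python sketch. The function name is hypothetical; it assumes a 32-bit format 4 instruction whose low 20 bits are the address field.

```python
def relocate_format4(instr, load_address):
    """Add load_address to the 20-bit address field of a format 4 instruction."""
    address = instr & 0xFFFFF                        # low 20 bits
    new_address = (address + load_address) & 0xFFFFF
    # Keep the opcode and flag bits, replace only the address field
    return (instr & ~0xFFFFF) | new_address
```

Starting from the JSUB instruction 4B101036 assembled for location 0000, loading at 5000 gives address field 06036 and loading at 7420 gives 08456, matching the values in the text.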
The only parts of the program that require modification at load time are those that
specify direct addresses. The rest of the instructions need not be modified: their operands
are either not memory addresses (immediate addressing) or are specified using PC-relative
or base-relative addressing.
The loader cannot distinguish addresses from constants in the object program, so
the assembler must tell the loader which parts of the object program need to be modified.
For this purpose, modification records are used.
A modification record is a type of record that is added to the object program. One
modification record is created for each address to be modified. It gives the starting
location and the length of the address field to be modified.
The lines at the end of the object program that start with M are the
modification records for the instructions that must be changed if the
program is relocated.
M00000705 is the modification record for the address field that starts at
location 0007 and is 5 half-bytes long.
The remaining modification records are interpreted in the same way.
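How a loader might act on such a record can be sketched in Python. This is an illustrative assumption, not a description of a particular loader: the function name is hypothetical, the record is assumed to have the form M, six hex digits of starting location, and two hex digits of length in half-bytes, and `memory` is assumed to hold the program image indexed from the start of the program.

```python
def apply_modification(memory, record, load_address):
    """Add load_address to the address field named by one modification record."""
    if not record.startswith("M") or len(record) != 9:
        raise ValueError(f"bad modification record: {record}")
    start = int(record[1:7], 16)          # location of the field, relative to program start
    half_bytes = int(record[7:9], 16)     # field length in half-bytes
    nbytes = (half_bytes + 1) // 2
    field = int.from_bytes(memory[start:start + nbytes], "big")
    mask = (1 << (4 * half_bytes)) - 1
    value = ((field & mask) + load_address) & mask
    # An odd length leaves the high half-byte of the first byte untouched
    field = (field & ~mask) | value
    memory[start:start + nbytes] = field.to_bytes(nbytes, "big")
```

For M00000705, the three bytes starting at location 0007 are read, their low 20 bits are adjusted, and the high half-byte (part of the flag bits) is left unchanged.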
MACHINE INDEPENDENT ASSEMBLER FEATURES
The features that are not closely tied to the machine architecture are called
machine-independent assembler features. They include:
1. Literals
2. Symbol Defining Statements
3. Expressions
4. Program Blocks
5. Control Sections and Program Linking
LITERALS
It is convenient for the programmer to be able to write the value of a constant operand as
part of the instruction that uses it.
This avoids having to define the constant elsewhere in the program and make up a label for it.
Such an operand is called a literal because the value is stated literally in the instruction.
A literal is written with the prefix '=' followed by a specification of the literal value.
Consider the following example:
.
.
LDA FIVE
.
.
FIVE WORD 5
Using the concept of literal we can rewrite the above code as:
.
.
LDA =X'000005'
Difference between literal operands and immediate operands
For literals the prefix is =, and for immediate addressing the prefix is #.
In immediate addressing, the operand value is assembled as part of the machine instruction,
i.e., there is no memory reference.
Line no   Location counter   Source statement   Object code
55        0020               LDA #03            010003
In the above example, the last 12 bits of the machine code are 003, which is the immediate
value.
With a literal, the assembler generates the specified value as a constant at some other
memory location. The address of this generated constant is used as the target address (TA)
for the machine instruction (a memory reference using PC-relative or base-relative
addressing).
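The object code 010003 can be reproduced with a small Python sketch of format 3 encoding. The function name is hypothetical; the layout assumed is a 6-bit opcode followed by the flags n, i, x, b, p, e and a 12-bit disp field.

```python
def encode_format3(opcode, n, i, x, b, p, disp):
    """Return the 6 hex digits of a format 3 instruction (e is always 0)."""
    if not 0 <= disp <= 0xFFF:
        raise ValueError("disp must fit in 12 bits")
    first_byte = (opcode & 0xFC) | (n << 1) | i      # opcode plus the n and i bits
    flags = (x << 3) | (b << 2) | (p << 1)           # x, b, p, and e = 0
    return f"{first_byte:02X}{(flags << 12) | disp:04X}"
```

For the immediate operand in LDA #03 (opcode 00, n = 0, i = 1), the value 3 goes directly into disp. For a literal, the same instruction would instead be assembled with n = i = 1 and a disp that leads to the address of the constant in the literal pool, for example through PC-relative addressing with p = 1.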
Literal Pool
All the literal operands used in a program are gathered together into one or more literal
pools. Normally the pool is placed at the end of the program.
In some cases, it is desirable to place literals in a pool at some other location in the
object program. The assembler directive LTORG allows this.
When the assembler encounters an LTORG statement, it generates a literal pool containing
all literal operands used since the previous LTORG (or the beginning of the program).
Literals placed in a pool by LTORG are not repeated in the pool at the end of the
program.
The reason for using LTORG is to keep a literal operand close to the instruction that uses
it; otherwise the literal may be too far away for PC-relative addressing.
Literal Table (LITTAB)
A literal table (LITTAB) is created to store the literals used in the program.
For each literal, the table contains the literal name, the operand value and length, and the
address assigned to the operand.
The literal table is usually organized as a hash table on the literal name.
Duplicate literals
If the same literal is used more than once in the program, it is a duplicate literal.
In such cases, only one copy of the specified value needs to be stored.
There are two methods for recognizing duplicate literals:
1. Compare the character strings defining them
Easier to implement, e.g. =X'05' matches =X'05'. However, it cannot recognize that
literals such as =C'EOF' and =X'454F46' are duplicates, even though both have the same
data value.
2. Compare the generated data values
This recognizes literals such as =C'EOF' and =X'454F46' as duplicates, because both
generate the same data value. However, it is more difficult to implement than
the first method.
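The second method can be sketched in Python as follows. The helper names are hypothetical, and only the C'...' and X'...' literal forms used in this section are handled.

```python
def literal_value(literal):
    """Return the bytes generated by a literal such as =C'EOF' or =X'454F46'."""
    if len(literal) < 4 or literal[0] != "=" or literal[2] != "'" or literal[-1] != "'":
        raise ValueError(f"bad literal: {literal}")
    kind, text = literal[1], literal[3:-1]
    if kind == "C":
        return text.encode("ascii")      # one byte per character
    if kind == "X":
        return bytes.fromhex(text)       # two hex digits per byte
    raise ValueError(f"bad literal: {literal}")


def same_literal(a, b):
    """Method 2: two literals are duplicates if their generated data is equal."""
    return literal_value(a) == literal_value(b)
```

Comparing the strings directly (method 1) would treat =C'EOF' and =X'454F46' as different literals, while `same_literal` treats them as duplicates.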
Implementation of Literals
During Pass-1:
Each literal encountered is searched for in the literal table. If the literal already
exists, no action is taken; if it is not present, the literal is added to LITTAB without an
address, which is assigned only when an LTORG or END statement is encountered.
When Pass 1 encounters an LTORG statement or the end of the program, the
assembler scans the literal table. At this time each literal currently in the table that has
no address yet is assigned one. As addresses are assigned, the location counter is updated to
reflect the number of bytes occupied by each literal.
During Pass-2:
For each literal operand encountered in an instruction, the assembler searches LITTAB
and uses the literal's address as the operand address. The data values in each literal pool
are inserted into the object program exactly as if they had been generated by BYTE or WORD
statements.
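The Pass 1 handling of LITTAB can be sketched in Python. The function names and the dictionary layout are assumptions made for illustration; the literal's data value is passed in as bytes, and literals are matched by name only.

```python
def add_literal(littab, name, value):
    """Enter a literal in LITTAB if it is not already there (address still unknown)."""
    if name not in littab:
        littab[name] = {"value": value, "length": len(value), "address": None}


def place_literal_pool(littab, locctr):
    """At LTORG or END: assign addresses to unplaced literals, return the new LOCCTR."""
    for entry in littab.values():
        if entry["address"] is None:
            entry["address"] = locctr
            locctr += entry["length"]    # LOCCTR advances by the literal's size
    return locctr
```

Because only entries without an address are placed, literals that went into an earlier pool are not repeated when the next LTORG or the END statement is processed.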
The following figure shows the difference between the SYMTAB and LITTAB
When the assembler encounters the EQU statement, it enters the symbol MAXLEN along
with its value in the symbol table. When it assembles the LDA instruction, the assembler
searches SYMTAB for this entry and uses its value as the operand in the instruction.
The object code generated is the same for both of the options discussed, but the version
that uses the symbol is easier to understand.
If the maximum length is changed from 100 to 500, the change is difficult to make when
the value is written as an immediate operand wherever it is required: we have to scan
the whole program and change every place where 100 is used.
If the value is instead referred to through the symbol defined by EQU, we need not
search the whole program; we change only the value of MAXLEN in the EQU statement.
ORG Statement:
This directive can be used to indirectly assign values to symbols. It is usually called
ORG (for origin).
Its general format is:
ORG value
where value is a constant or an expression involving constants and previously defined
symbols.
When this statement is encountered during assembly of a program, the assembler resets its
location counter (LOCCTR) to the specified value.
Since the values of symbols used as labels are taken from LOCCTR, the ORG statement
affects the values of all labels defined until the next ORG is encountered.
Eg: ORG ALPHA
Here the assembler resets LOCCTR to the value of ALPHA.
EXPRESSIONS
Assemblers allow the use of expressions as operands.
The assembler evaluates each expression and produces a single operand address or value.
Assemblers generally allow arithmetic expressions as operands, formed according to the
normal rules using the arithmetic operators +, -, *, /. (Division is usually defined to produce an
integer result.)
Individual terms may be constants, user-defined symbols, or special terms.
The only special term used is * ( the current value of location counter) which indicates the
value of the next unassigned memory location.
Thus the statement
BUFFEND EQU *
assigns the value of LOCCTR to BUFFEND, which is the address of the next byte
following the buffer area.
Some values in the object program are relative to the beginning of the program and some
are absolute (independent of the program location, like constants). Hence, expressions are
classified as either absolute expression or relative expressions depending on the type of
value they produce.
Absolute Expressions:
An expression that uses only absolute terms is an absolute expression. An absolute
expression may also contain relative terms, provided the relative terms occur in pairs and
the terms in each pair have opposite signs.
Example:
MAXLEN EQU BUFEND-BUFFER
In the above statement, the difference BUFEND-BUFFER gives a
value that does not depend on the location of the program, and hence the expression is
absolute.
Relative Expressions:
An expression whose value is relative to the beginning of the program is called a relative
expression.
In a relative expression, all of the relative terms except one can be paired with opposite
signs; the remaining unpaired relative term must have a positive sign.
Example:
MAXLEN EQU ALPHA + BUFEND-BUFFER
In the above instruction the difference in the expression BUFEND-BUFFER gives
a value that does not depend on the location of the program but it is added to the
value of ALPHA which is program relative. Hence this expression is relative.
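The pairing rule for absolute and relative expressions can be sketched as a simple count of relative terms. This is only an illustration under stated assumptions: the representation of an expression as a list of (sign, symbol) pairs, the 'R'/'A' flags and the function name are all hypothetical, and only + and - are handled (relative terms may not take part in multiplication or division).

```python
def classify_expression(terms, symbol_types):
    """Classify an expression built from + and - only.

    terms        -- list of (sign, name) pairs, sign is +1 or -1
    symbol_types -- maps each name to 'R' (relative) or 'A' (absolute)
    """
    # Relative terms that pair up with opposite signs cancel out.
    net_relative = sum(sign for sign, name in terms if symbol_types[name] == "R")
    if net_relative == 0:
        return "absolute"
    if net_relative == 1:   # exactly one unpaired, positive relative term
        return "relative"
    return "illegal"


types = {"BUFEND": "R", "BUFFER": "R", "ALPHA": "R"}
kind_1 = classify_expression([(+1, "BUFEND"), (-1, "BUFFER")], types)
kind_2 = classify_expression([(+1, "ALPHA"), (+1, "BUFEND"), (-1, "BUFFER")], types)
```

Here `kind_1` corresponds to BUFEND-BUFFER and `kind_2` to ALPHA + BUFEND-BUFFER from the examples above.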
COPY     START   0
         LDA     LENGTH
         ………
         ………
         USE     CDATA
MAX      RESW    1
LENGTH   RESW    1
         USE     CBLOCKS
BUFFER   RESB    00
         ………
//Subroutine to read record into buffer
         USE
RDREC    CLEAR   X
         LDA     INPUT
         ………..
         …………
         USE     CDATA
INPUT    BYTE    X’F1’
         …………
//Subroutine to write record from buffer
         USE
         ………
         USE     CDATA
MIN      RESW    1
BUFEND   RESW    1
In the example given above, three program blocks are used:
DEFAULT: executable instructions.
CDATA: all data areas that are a few words or less in length.
CBLOCKS: all data areas that consist of larger blocks of memory.
When multiple control sections are used, the beginning of each control section is
indicated by an assembler directive: CSECT
The syntax is:
secname CSECT
The assembler maintains a separate LOCCTR, beginning at 0, for each control section.
Control sections differ from program blocks in that they are handled separately by the
assembler.
Handling of External References
Instructions in one control section may need to refer to instructions or data located in
another section. Such references are called external references. External references are
indicated by two assembler directives: EXTDEF and EXTREF.
The assembler must include information in the object program that will cause the loader to
handle external references properly. For this, three types of records are used in the object
program: Define, Refer and Modification records.
The format of the Modification record that we studied in Module 2 is revised to support
the handling of external references.
COPY     START   0
         EXTDEF  BUFFER, BUFFEND, LENGTH
         EXTREF  A, B
         LDA     ALPHA
         ………………
         ………………
         ………………
BUFFER   WORD    3
BUFFEND  EQU     *
LENGTH   EQU     BUFFEND-BUFFER
RDREC    CSECT
         EXTREF  BUFFER, BUFFEND, LENGTH
         ………………
         ………………
         LDA     BUFFER
         ………………
         ………………
         END
The object program generated for the above code segment is:
H^COPY ^000000^001033
D^BUFFER^000033^BUFEND^001033^LENGTH^00002D
R^A ^B
T^…………………………
T^……………………………
……………………………
M^000004^05^+RDREC
……………………………
E^000000
There are two main types of one-pass assembler:
1. One that produces object code directly in memory for immediate execution (load-
and-go assembler).
2. One that generates an object program for later execution.
1. Load-and-Go Assembler
A load-and-go assembler generates its object code in memory for immediate
execution. Since no object program is written out, no loader is needed. It is useful in a system
with frequent program development and testing. Since the object program is produced in
memory, the handling of forward references becomes less difficult.
Working of One-Pass Assembler (Load-and-Go Assembler)
In a load-and-go assembler, when a forward reference is encountered, the assembler:
Omits the operand address if the symbol has not yet been defined (places 000 at the
operand address position).
Enters this undefined symbol into SYMTAB and indicates that it is undefined.
Adds the location at which the operand is referenced to a list of forward references
associated with the SYMTAB entry.
When the definition for the symbol is encountered, scans the reference list and inserts
the address.
At the end of the program, reports an error if there are still SYMTAB entries
marked as undefined symbols (* indicates undefined).
When the END statement is encountered, searches SYMTAB for the symbol named in
the END statement and jumps to this location to begin execution if there is no error.
In short, whenever an undefined symbol is encountered, it is inserted into SYMTAB as a
new entry marked as undefined, and the location at which the operand is referenced is added
to a linked list associated with that SYMTAB entry. When the definition for the
symbol is encountered, the assembler scans the reference list and inserts the address in the
proper locations.
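The forward-reference handling described above can be sketched in a few lines. This is an illustrative sketch only: memory is modelled as a dictionary from operand location to operand address, `None` plays the role of the * (undefined) flag, and the function names and the address 203D used for RDREC are assumptions made for the example.

```python
def use_symbol(symtab, memory, name, operand_loc):
    """Assemble an operand that refers to `name` at location `operand_loc`."""
    entry = symtab.setdefault(name, {"value": None, "refs": []})
    if entry["value"] is None:              # forward reference
        memory[operand_loc] = 0             # placeholder operand address
        entry["refs"].append(operand_loc)   # remember where to patch later
    else:
        memory[operand_loc] = entry["value"]


def resolve_symbol(symtab, memory, name, address):
    """Process the definition of `name` and patch every pending reference."""
    entry = symtab.setdefault(name, {"value": None, "refs": []})
    entry["value"] = address
    for loc in entry["refs"]:
        memory[loc] = address
    entry["refs"].clear()


def undefined_symbols(symtab):
    """Symbols still undefined at the end of the program (an error if non-empty)."""
    return [name for name, entry in symtab.items() if entry["value"] is None]


symtab, memory = {}, {}
use_symbol(symtab, memory, "RDREC", 0x2013)        # reference before definition
resolve_symbol(symtab, memory, "RDREC", 0x203D)    # assumed address of RDREC
```

After the call to `resolve_symbol`, the placeholder at location 2013 has been replaced by the address of RDREC and `undefined_symbols(symtab)` is empty.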
………
if found
else
   insert (LABEL, LOCCTR) into SYMTAB
else
else
   add 3 to LOCCTR
else if OPCODE = ’WORD’
   add 3 to LOCCTR
else if OPCODE = ’RESW’
   add 3 * #[OPERAND] to LOCCTR
else if OPCODE = ’RESB’
{ find length of constant in bytes
  add length to LOCCTR
  convert constant to object code
}
if object code will not fit into current text record
………
Example:
The following figure shows the status up to this point. The symbol RDREC is referred to once,
at location 2013, ENDFIL at 201C and WRREC at location 201F. None of these symbols is yet
defined. The figure shows how the pending definitions, along with their addresses, are
included in the symbol table.
When the definitions for the symbols RDREC and ENDFIL are encountered, the
reference list associated with each symbol is scanned and the address is inserted at the proper
locations. This is shown in the following figure:
In a one-pass assembler that produces an object program for later execution, when the
definition of a symbol is encountered, the assembler generates another Text
record with the correct operand address for each entry in the reference list.
When the program is loaded, the incorrect address 0 is replaced by the later Text record
generated at the symbol definition.
Example:
For a forward reference in a symbol definition, we store in SYMTAB:
o The symbol name
o The defining expression
o The number of undefined symbols in the defining expression
Each undefined symbol (marked with *) is associated with a list of the symbols that depend
on it.
When a symbol is defined, we can recursively evaluate the symbol expressions
that depend on the newly defined symbol.
The portions of the program that involve forward references in symbol definitions are
saved during Pass 1. Additional passes through these stored definitions are made as the
assembly progresses. This process is followed by a normal Pass 2.
Example:
Consider the symbol table entries from Pass 1 processing of the statement.
HALFS2 EQU MAXLEN/2
Since MAXLEN has not yet been defined, no value for HALFS2 can be computed.
The defining expression for HALFS2 is stored in the symbol table in place of its
value.
The entry &1 indicates that 1 symbol in the defining expression is undefined.
SYMTAB may simply contain a pointer to the defining expression.
The symbol MAXLEN is also entered in the symbol table, with the flag * identifying
it as undefined. Associated with this entry is a list of the symbols whose values
depend on MAXLEN.
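The bookkeeping for HALFS2 and MAXLEN can be sketched as below. This is a hedged illustration rather than the assembler's real data structure: the dictionary fields ('expr', 'missing', 'dependents'), the function names, the use of a Python function for the defining expression, and the value 1000 (hex) given to MAXLEN are all assumptions made for the example.

```python
def define_symbol(symtab, name, value):
    """Give `name` a value, then re-evaluate every symbol that was waiting on it."""
    waiting = symtab.get(name, {}).get("dependents", [])
    symtab[name] = {"value": value}
    for dependent in waiting:
        entry = symtab[dependent]
        entry["missing"] -= 1                # one fewer undefined operand (&n -> &n-1)
        if entry["missing"] == 0:
            define_symbol(symtab, dependent, entry["expr"](symtab))


def define_by_expression(symtab, name, expr, operands):
    """Handle `name EQU expr`, where `operands` lists the symbols used in expr."""
    missing = [s for s in operands if "value" not in symtab.get(s, {})]
    if not missing:
        define_symbol(symtab, name, expr(symtab))
        return
    waiting = symtab.get(name, {}).get("dependents", [])
    symtab[name] = {"expr": expr, "missing": len(missing), "dependents": waiting}
    for s in missing:                        # enter undefined operands, flagged by absence of 'value'
        symtab.setdefault(s, {"dependents": []})["dependents"].append(name)


symtab = {}
define_by_expression(symtab, "HALFS2", lambda t: t["MAXLEN"]["value"] // 2, ["MAXLEN"])
define_symbol(symtab, "MAXLEN", 0x1000)      # HALFS2 is evaluated at this point
```

Because `define_symbol` calls itself for each dependent whose count reaches zero, chains of definitions that depend on one another are resolved recursively, as the text describes.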
Unit III: Loaders & Linkers: Basic Loader Functions- Design of Absolute Loader- Simple Bootstrap
Loader-Machine Dependent Loader features-Relocation-Program linking-Algorithm and Data structures for
a Linking loader. Loader Design options.
To execute an object program, we need:
Relocation - which modifies the object program so that it can be loaded at an address
different from the location originally specified
Linking - which combines two or more separate object programs and supplies the
information needed to allow references between them
Loading and Allocation - which allocates memory location and brings the object
program into memory for execution
The system software which performs linking operation is called linker. The system software
which loads the object program into memory and starts its execution is called loader.
Linkers and loaders perform several related but conceptually separate actions.
read next object program record
end
jump to address specified in End record
end
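The loop that these last steps belong to can be sketched as follows. This is a minimal sketch of an absolute loader under stated assumptions: records are given as strings in the fixed-column SIC layout (Text record: 'T', 6 hex digits of start address, 2 hex digits of length, then the object code), memory is a Python bytearray, and the sample records are made up for the example.

```python
def absolute_load(records, memory):
    """Copy every Text record into memory; return the address from the End record."""
    for record in records:
        kind = record[0]
        if kind == "T":
            start = int(record[1:7], 16)
            length = int(record[7:9], 16)               # length in bytes
            code = bytes.fromhex(record[9:9 + 2 * length])
            memory[start:start + length] = code         # move object code into place
        elif kind == "E":
            return int(record[1:7], 16)                 # address to jump to
        # Header records are skipped here; a real loader would verify name and length.
    raise ValueError("object program has no End record")


memory = bytearray(0x2000)
program = [
    "HCOPY  001000000006",        # hypothetical header
    "T00100006141033482039",      # 6 bytes of object code at address 1000
    "E001000",
]
entry_point = absolute_load(program, memory)
```

A real loader would then transfer control to `entry_point`; here the value is simply returned.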
When a computer is first turned on or restarted, a special type of absolute loader, called
a bootstrap loader, is executed. This bootstrap loader loads the first program to be run by the
computer – usually an operating system.
Much of the work of the bootstrap loader is performed by the subroutine GETC.
GETC is used to read and convert a pair of characters from device F1 representing
1 byte of object code to be loaded. For example, the two characters “D8” (ASCII codes
‘4438’H) are converted to the single byte ‘D8’H.
The resulting byte is stored at the address currently in register X, using STCH
instruction that refers to location 0 using indexed addressing.
The TIXR instruction is then used to add 1 to the value in X.
The main routine of the bootstrap reads object code from device F1 and enters it into memory starting at address 80
(hexadecimal). After all of the code from device F1 has been entered into memory, the bootstrap executes a
jump to address 80 to begin execution of the program just loaded. Register X contains the next address to be
loaded.
BOOT     START   0
         CLEAR   A         CLEAR REGISTER A TO ZERO
         LDX     #128      INITIALIZE REGISTER X TO HEX 80
LOOP     JSUB    GETC      READ HEX DIGIT FROM PROGRAM BEING LOADED
         RMO     A,S       SAVE IN REGISTER S
         SHIFTL  S,4       MOVE TO HIGH-ORDER 4 BITS OF BYTE
         JSUB    GETC      GET NEXT HEX DIGIT
         ADDR    S,A       COMBINE DIGITS TO FORM ONE BYTE
         STCH    0,X       STORE AT ADDRESS IN REGISTER X
         TIXR    X         ADD 1 TO MEMORY ADDRESS BEING LOADED
         J       LOOP      LOOP UNTIL END OF INPUT IS REACHED
The GETC subroutine reads one character from the input device and converts it from ASCII code to a hexadecimal digit value.
The converted digit value is returned in register A. When an end-of-file is read, control is transferred to the starting
address (hex 80).
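The behaviour of the bootstrap loop and GETC can be imitated in a few lines of Python. This is a sketch under assumptions, not a translation of the SIC/XE code: the input device is modelled as a string, the end-of-file marker is assumed to be the character with code 04 (hex), and characters with codes below 30 (hex), such as line breaks, are assumed to be skipped.

```python
def getc(stream):
    """Return the value (0..15) of the next hex digit, or None at end of input."""
    for ch in stream:
        code = ord(ch)
        if code == 0x04:            # assumed end-of-file marker
            return None
        if code < 0x30:             # skip control characters such as newlines
            continue
        digit = code - 0x30         # '0'..'9' -> 0..9
        return digit - 7 if digit >= 10 else digit   # 'A'..'F' -> 10..15
    return None


def bootstrap(text, memory, start=0x80):
    """Pack pairs of hex characters into bytes stored from `start` upward."""
    stream = iter(text)
    x = start                        # plays the role of register X
    while True:
        high = getc(stream)
        low = getc(stream) if high is not None else None
        if high is None or low is None:
            return start             # address to jump to after loading
        memory[x] = (high << 4) + low
        x += 1


memory = bytearray(256)
entry_point = bootstrap("D8\n0A", memory)
```

After the call, the byte at address 80 (hex) holds D8 and the byte at 81 holds 0A.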
Loaders that have the capability to perform relocation are called relocating loaders or
relative loaders.
Modification Record
A Modification record is used to describe each part of the object code that must be
changed when the program is relocated.
The Modification record has the following format (it is explained in detail in Module 2):
Each Modification record specifies the starting address and length of the field whose
value is to be altered. It then describes the modification to be performed.
Consider the following object program, in which the records starting with M are the
Modification records. In this example, the record M 000007 05 + COPY describes the
modification for the field at location 000007. The field to be modified is 5 half-bytes
long, and the modification to be performed is to add the value of the symbol
COPY, which represents the starting address of the program (that is, add the starting
address of the program to the field at 000007). The other records are interpreted similarly.
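How a loader might apply such a record can be sketched as follows. This is an illustrative sketch with assumed names: memory is a bytearray, the field contents 101036 (hex) and the load address 5000 (hex) are made up, and a field with an odd number of half-bytes is taken to begin in the low-order half of its first byte.

```python
def apply_modification(memory, address, half_bytes, value):
    """Add `value` to the field of `half_bytes` hex digits that starts at `address`."""
    nbytes = (half_bytes + 1) // 2
    field = int.from_bytes(memory[address:address + nbytes], "big")
    mask = (1 << (4 * half_bytes)) - 1              # selects the low-order half-bytes
    patched = (field & ~mask) | ((field + value) & mask)
    memory[address:address + nbytes] = patched.to_bytes(nbytes, "big")


memory = bytearray(16)
memory[7:10] = bytes.fromhex("101036")              # hypothetical 20-bit address field 01036
apply_modification(memory, 7, 5, 0x5000)            # M 000007 05 + COPY, COPY loaded at 5000
```

Only the 20 address bits change: the three bytes become 106036, and the leading half-byte (here 1) is left untouched.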
The Modification record scheme is not well suited to certain cases. In some programs the
addresses in the majority of instructions need to be modified when the program is relocated. This
would require a large number of Modification records, which could result in an object program more
than twice as large as normal. In such cases, a second method, called the relocation bit method, is used.
Relocation Bit
To overcome the disadvantage of modification record, relocation bit is used.
The Text records are the same as before except that there is a relocation bit associated
with each word of object code.
Since all SIC instructions occupy one word, this means that there is one relocation bit
for each possible instruction.
The relocation bits are gathered together into a bit mask following the length indicator
in each Text record.
Text record format
If the relocation bit corresponding to a word of object code is set to 1, the program's
starting address is to be added to this word when the program is relocated.
A bit value of 0 indicates that no modification is necessary.
If a Text record contains fewer than 12 words of object code, the bits corresponding to
unused words are set to 0.
In the following object code, the bit mask FFC (representing the bit string
111111111100) in the first Text record specifies that all 10 words of object code are to
be modified during relocation.
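The use of the bit mask can be sketched as follows. This is a minimal illustration with assumed names and made-up object code: each word is held as a 24-bit integer, the mask is a 12-bit integer whose leftmost bit belongs to the first word of the Text record, and the program is assumed to be loaded at address 1000 (hex).

```python
def relocate_words(words, mask, start, nbits=12):
    """Add `start` to every word whose relocation bit is 1."""
    relocated = []
    for i, word in enumerate(words):
        bit = (mask >> (nbits - 1 - i)) & 1      # leftmost bit goes with word 0
        relocated.append((word + start) & 0xFFFFFF if bit else word)
    return relocated


# Mask C00 = 110000000000: only the first two words are relocated.
words = [0x141033, 0x481039, 0x000003]
loaded = relocate_words(words, 0xC00, 0x1000)
```

With the mask FFC from the text, the same function would add the starting address to each of the first ten words of the record.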
4.3.2 Program Linking (Linking Loader)
Many programming languages allow us to write different pieces of code, called modules,
separately. This simplifies the programming task because we can break a large program
into small, more manageable pieces. Eventually, though, we need to put all the modules
together. Apart from this, user code often makes references to code and data
defined in some "libraries".
Linking is the process in which references to "externally" defined symbols are processed
so as to make them operational.
A linker or link editor is a program that combines object modules to form an executable
program.
A Linking Loader is a program that has the capability to perform relocation, linking and
loading. Linking and relocation are performed at load time.
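A linking loader usually makes two passes over its input:
o Pass 1 assigns addresses to all external symbols.
o Pass 2 performs the actual loading, relocation, and linking.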
The main data structure needed for our linking loader is an external symbol table
ESTAB. This table, which is analogous to SYMTAB in our assembler algorithm, is
used to store the name and address of each external symbol in the set of control sections
being loaded.
Two other important variables are PROGADDR (program load address) and
CSADDR (control section address).
(1) PROGADDR is the beginning address in memory where the linked program is to be
loaded. Its value is supplied to the loader by the OS.
(2) CSADDR contains the starting address assigned to the control section currently
being scanned by the loader. This value is added to all relative addresses within the
control section to convert them to actual addresses.
Explanation of Pass 1 algorithm
The beginning load address for the linked program (PROGADDR) is obtained from the
OS. This becomes the starting address (CSADDR) for the first control section in the
input sequence.
The control section name from Header record is entered into ESTAB, with value given
by CSADDR.
All external symbols appearing in the Define record for the control section are also
entered into ESTAB. Their addresses are obtained by adding the value specified in the
Define record to CSADDR.
When the End record is read, the control section length CSLTH (which was saved from
the Header record) is added to CSADDR. This calculation gives the starting address for the
next control section in sequence.
At the end of Pass 1, ESTAB contains all external symbols defined in the set of control
sections together with the address assigned to each.
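Pass 1 as explained above can be sketched as follows. This is a hedged sketch, not the loader's actual algorithm text: each control section is represented by a dictionary holding what the Header and Define records would supply, and the section names, lengths and offsets are illustrative values only.

```python
def pass1(control_sections, progaddr):
    """Build ESTAB: every control section name and external symbol with its address."""
    estab = {}
    csaddr = progaddr                               # first section starts at PROGADDR
    for section in control_sections:
        names = [(section["name"], 0)] + list(section["defines"].items())
        for symbol, offset in names:
            if symbol in estab:
                raise ValueError("duplicate external symbol: " + symbol)
            estab[symbol] = csaddr + offset         # relative address + CSADDR
        csaddr += section["length"]                 # CSLTH from the Header record
    return estab


sections = [
    {"name": "PROGA", "length": 0x63, "defines": {"LISTA": 0x40, "ENDA": 0x54}},
    {"name": "PROGB", "length": 0x7F, "defines": {"LISTB": 0x60, "ENDB": 0x70}},
]
estab = pass1(sections, 0x4000)
```

With these inputs PROGB is assigned address 4063 (hex), immediately after PROGA, and LISTB becomes 40C3.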
The library search process may be repeated, since the subroutines fetched from a library
may themselves contain external references. Programmer-defined subroutines have higher
priority, so the programmer can override the standard subroutines in the library by supplying
his or her own routines. The library could be searched by scanning the Define records of
all the object programs in the library, but this method is quite inefficient, so a directory
structure is used instead. The assembled or compiled versions of the subroutines in a library are
organized using a directory that gives the name of each routine and a pointer to its address
within the library. Thus the library search involves only a search of the directory, followed by
reading the object programs indicated by this search.
Because the library contains this internal directory, in which each routine is stored along
with its address, linking library functions becomes easier: whenever a library
function is needed, its address can be obtained directly from the internal directory.
INCLUDE READ(UTLIB)
INCLUDE WRITE(UTLIB)
DELETE RDREC, WRREC
CHANGE RDREC, READ
CHANGE WRREC, WRITE
These commands ask the loader to include the control sections READ and
WRITE from the library UTLIB and to delete the control sections RDREC and
WRREC. The first CHANGE command causes all external references to the
symbol RDREC to be changed to refer to READ, and the second CHANGE causes
references to WRREC to be changed to WRITE.
• A linking loader performs all linking and relocation operations, including automatic
library search if specified, and loads the linked program directly into memory for execution.
• A linkage editor produces a linked version of the program (load module or executable
image), which is written to a file or library for later execution.
When the user is ready to run the linked program, a simple relocating loader can be used to
load the program into memory. The only object code modification necessary is the addition of
an actual load address to relative values within the program.
Figure: Processing of an object program using a) linking loader and b)linkage editor
The linkage editor (LE) performs relocation of all control sections relative to the start of
the linked program. Thus, all items that need to be modified at load time have values that are
relative to the start of the linked program. This means that the loading can be accomplished
in one pass with no external symbol table required.
If a program is to be executed many times without being reassembled, the use of a linkage
editor substantially reduces the overhead required, because resolution of external
references and library searching are performed only once. Linkage editors can also perform
many useful functions besides simply preparing an object program for execution.
If a program is under development or is used infrequently, a linking loader is usually the
better choice, since the intermediate step of producing and storing a linked version is avoided.
Consider a program PLANNER with a number of subroutines. You want to improve a
subroutine (PROJECT) of the program (PLANNER) without going back to the original
versions of all of the other subroutines. For that you can use linkage editor commands as
follows:
Linkage editors perform linking operations before the program is loaded for
execution.
Linking loaders perform these same operations at load time.
Dynamic linking, dynamic loading, or load on call postpones the linking
function until execution time. That is a subroutine is loaded and linked to the
rest of the program when it is first called.
Dynamic linking is often used to allow several executing programs to share one copy of
a subroutine or library (e.g. run-time support routines for a high-level language like C).
With a program that allows its user to interactively call any of the subroutines of a large
mathematical and statistical library, all of the library subroutines could potentially be
needed, but only a few will actually be used in any one execution. Dynamic
linking avoids the necessity of loading the entire library for each execution; only
the subroutines that are actually called are loaded.
For example, suppose that a program contains subroutines that correct or clearly diagnose errors
in the input data during execution. If such errors are rare, the correction and diagnostic
routines may not be used at all during most executions of the program. However, if the
program were completely linked before execution, these subroutines would need to be loaded
and linked every time.
Fig 3.14 illustrates a method in which routines that are to be dynamically loaded must
be called via an OS service request.
Figure: Loading and calling of a subroutine using dynamic linking
Fig (a): Whenever the user program needs a subroutine for its execution, the program
makes a load-and-call service request to the OS (instead of executing a JSUB instruction
referring to an external symbol). The parameter of this request is the symbolic
name (ERRHANDL) of the routine to be called.
Fig (b): The OS examines its internal tables to determine whether or not the routine is
already loaded. If necessary, the routine is loaded from the specified user or system
libraries.
Fig (c): Control is then passed from the OS to the routine being called.
Fig (d): When the called subroutine completes its processing, it returns to its caller (i.e.,
the OS). The OS then returns control to the program that issued the request.
Fig (e): If a subroutine is still in memory, a second call to it may not require another
load operation. Control may simply be passed from the dynamic loader to the called
routine.
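The load-and-call sequence in steps (a) to (e) can be imitated with a small class. This is only a sketch under assumptions: the library is modelled as a dictionary of Python functions standing in for object code on disk, and the class name, the `load_count` counter and the sample ERRHANDL routine are all hypothetical.

```python
class DynamicLoader:
    """Stands in for the OS service that handles load-and-call requests."""

    def __init__(self, library):
        self.library = library      # routine name -> callable (the "library on disk")
        self.loaded = {}            # internal table of routines already in memory
        self.load_count = 0

    def load_and_call(self, name, *args):
        if name not in self.loaded:             # step (b): load only if necessary
            self.loaded[name] = self.library[name]
            self.load_count += 1
        return self.loaded[name](*args)         # steps (c) and (d): call, then return


loader = DynamicLoader({"ERRHANDL": lambda code: "handled error %d" % code})
first = loader.load_and_call("ERRHANDL", 4)     # triggers the load
second = loader.load_and_call("ERRHANDL", 7)    # step (e): already in memory
```

After both calls `loader.load_count` is 1, which mirrors the point that a second call need not repeat the load operation.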
Unit IV:Compilers - Basic compiler Functions – Grammars - Lexical Analysis – Syntactic Analysis-
Code Generation-Compiler Design options.
Compilers
For the purposes of compiler construction, a high-level programming language is usually
described in terms of a grammar.
This grammar specifies the form, or syntax, of legal statements in the language.
The problem of compilation then becomes one of matching statements written by the
programmer to structures defined by the grammar, and generating the appropriate object
code for each statement.
A source program statement can be regarded as a sequence of tokens rather than simply as a
string of characters.
Tokens may be thought of as the fundamental building blocks of the language. For example, a
token might be a keyword, a variable name, an integer, an arithmetic operator, etc.
The task of scanning the source statement, recognizing and classifying the various tokens, is
known as lexical analysis. The part of the compiler that performs this analytic function is
commonly called the scanner.
After the token scan, each statement in the program must be recognized as some language
construct, such as a declaration or an assignment statement, described by the grammar.
This process, called syntactic analysis or parsing, is performed by a part of the compiler that is
usually called the parser.
The last step in the basic translation process is the generation of object code. Most compilers
create machine-language programs directly instead of producing a symbolic program for later
translation by an assembler.
Although we have mentioned three steps in the compilation process – scanning, parsing, and
code generation – it is important to realize that a compiler does not necessarily make three
passes over the program being translated.
For some languages, it is quite possible to compile a program in a single pass.
5.1.1 Grammars
A grammar for a programming language is a formal description of the syntax, or form, of
programs and individual statements written in the language.
The grammar does not describe the semantics, or meaning, of the various statements; such
knowledge must be supplied in the code-generation routines.
Example: for the difference between syntax and semantics, consider the two statements (I := J
+ K) and (X := Y + I), where X and Y are REAL variables and I, J, K are INTEGER variables.
These two statements have identical syntax. However, the semantics of the two statements are
quite different. The first statement specifies that the variables in the expression are to be added
using integer arithmetic operations. The second statement specifies a floating-point
addition, with the integer operand I being converted to floating point before adding.
Obviously, these two statements would be compiled into very different sequences of machine
instructions. However, they would be described in the same way by the grammar.
The differences between the statements would be recognized during code generation.
A number of different notations can be used for writing grammars. The one we describe is
called BNF (for Backus-Naur Form). Fig 5.2 gives one possible BNF grammar for a highly
restricted subset of Pascal.
A BNF grammar consists of a set of rules, each of which defines the syntax of some construct
in the programming language.
For example, Rule 13 in Fig 5.2 is:
<read> ::= READ ( <id-list> )
This is a definition of the syntax of a Pascal READ statement that is denoted in the grammar as
<read>.
The symbol ::= can be read “is defined to be”. On the left of this symbol is the language
construct being defined, <read>, and on the right is a description of the syntax being
defined for it.
Character strings enclosed between the angle brackets < and > are called nonterminal symbols
(such as ‘<read>’ and ‘<id-list>’). These are the names of constructs defined in the grammar.
Entries not enclosed in angle brackets are terminal symbols of the grammar (i.e., tokens, such
as ‘READ’, ‘(‘, and ‘)’).
The blank spaces in the grammar rules are not significant.
They have been included only to improve readability.
To recognize a <read> (to resolve all nonterminal symbols), we also need the definition of
<id-list>. This is provided by Rule 6 in Fig 5.2:
<id-list> ::= id | <id-list> , id
This rule offers two possibilities, separated by the | symbol, for the syntax of an <id-list>.
The first alternative specifies that an <id-list> may consist simply of a token id (the notation id
denotes an identifier that is recognized by the scanner).
The second alternative is an <id-list>, followed by the token “,” (comma), followed by a token
id.
Example: ALPHA is an <id-list> that consists of a single id ALPHA; ALPHA , BETA is an
<id-list> that consists of another <id-list> ALPHA, followed by a comma, followed by an id
BETA, and so forth.
It is often convenient to display the analysis of a source statement in terms of a grammar as a
tree. This tree is usually called the parse tree, or syntax tree, for the statement. Fig 5.3(a)
shows the parse tree for the statement READ ( VALUE ).
Rule 9 of the grammar in Fig 5.2 provides a definition of the syntax of an assignment
statement:
<assign> ::= id := <exp>
That is, an <assign> consists of an id, followed by the token :=, followed by an expression
<exp>.
Rule 10 gives a definition of an <exp>:
<exp> ::= <term> | <exp> + <term> | <exp> - <term>
Similarly, Rule 11 defines a <term> to be any sequence of <factor>s connected by * and
DIV.
Rule 12 specifies that a <factor> may consist of an identifier id or an integer int (which
is also recognized by the scanner) or an <exp> enclosed in parentheses.
Fig 5.3(b) shows the parse tree for statement 14 from Fig 5.1 in terms of the rules just
described.
Note that the parse tree in Fig 5.3(b) implies that multiplication and division are done before
addition and subtraction (that is, multiplication and division have higher precedence than
addition and subtraction). The terms SUMSQ DIV 100 and MEAN * MEAN must be
calculated first since these intermediate results are the operands (left and right subtrees) for the
– operation.
The parse trees shown in Fig 5.3 represent the only possible ways to analyze these two
statements in terms of the grammar of Fig 5.2. If there is more than one possible parse tree for
a given statement, the grammar is said to be ambiguous.
Fig 5.4 shows the parse tree for the entire program in Fig 5.1.
5.1.2 Lexical Analysis
Lexical analysis involves scanning the program to be compiled and recognizing the tokens that
make up the source statements. Scanners are usually designed to recognize keywords,
operators, and identifiers, as well as integers, floating-point numbers, character strings, and
other similar items.
Items such as identifiers and integers are usually recognized directly as single tokens and might
be defined as a part of the grammar. For example,
<ident> ::= <letter> | <ident> <letter> | <ident> <digit>
<letter> ::= A | B | C | … | Z
<digit> ::= 0 | 1 | …| 9
The output of the scanner consists of a sequence of tokens. For efficiency of later use, each
token is usually represented by some fixed-length code, such as an integer, rather than as a
variable-length character string.
In such a token coding scheme for the grammar of Fig 5.2 (shown in Fig 5.5), the token
PROGRAM would be represented by the integer value 1, an identifier id would be represented
by the value 22, and so on.
When the token being scanned is a keyword or an operator, such a coding scheme gives
sufficient information. However, in the case of identifier, it is also necessary to specify the
particular identifier name that was scanned.
The same is true for integers, floating-point values, character-string constants, etc.
This can be accomplished by associating a token specifier with the type code for such tokens.
The specifier gives the identifier name, integer value, etc., that was found by the scanner.
Fig 5.6 shows the output from a scanner for the program in Fig 5.1, using the token coding
scheme in Fig 5.5.
For token type 22 (identifier), the token specifier is a pointer to a symbol-table entry
(denoted by ^SUM, ^SUMSQ, etc.).
For token type 23 (integer), the specifier is the value of the integer (denoted by #0, #100, etc.).
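A scanner that produces such (type code, specifier) pairs can be sketched as follows. This is a simplified illustration: only the codes 1 (PROGRAM), 22 (id) and 23 (int) come from the text, while the codes used here for READ, the comma and the parentheses are assumed for the example. The identifier specifier is the name itself rather than a symbol-table pointer.

```python
import re

# Type codes: 1, 22 and 23 are from the text; the others are assumed values.
CODES = {"PROGRAM": 1, "READ": 8, ",": 14, "(": 20, ")": 21}
ID, INT = 22, 23
TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9]*|[0-9]+|[(),])")


def scan(line):
    """Return the tokens of `line` as (type code, specifier) pairs."""
    tokens, pos = [], 0
    line = line.rstrip()
    while pos < len(line):
        match = TOKEN.match(line, pos)
        if match is None:
            raise ValueError("unrecognized character at column %d" % pos)
        text, pos = match.group(1), match.end()
        if text.upper() in CODES:
            tokens.append((CODES[text.upper()], None))   # keyword or operator
        elif text.isdigit():
            tokens.append((INT, int(text)))              # specifier: integer value
        else:
            tokens.append((ID, text))                    # specifier: identifier name
    return tokens


tokens = scan("READ ( VALUE )")
```

Keywords and operators need no specifier, so `None` is stored for them; identifiers and integers carry the extra information the later phases need.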
The scanner usually is responsible for reading the lines of the source program as needed, and
possibly for printing the source listing. Comments are ignored by the scanner, except for
printing on the output listing.
The process of lexical scanning is quite simple. However, many languages have special
characteristics that must be considered when programming a scanner.
For example, in FORTRAN, a number in columns 1-5 of a source statement should be
interpreted as a statement number, not as an integer.
Languages that do not have reserved words create even more difficulties for the scanner.
For example, in FORTRAN, any keyword may also be used as an identifier (See the case in
the lower part of page 237).
In such a case, the scanner might interact with the parser so that it could tell the proper
interpretation of each word, or it might simply place identifiers and keywords in the same
class, leaving the task of distinguishing between them to the parser.
Modeling Scanners as Finite Automata
The tokens of most programming languages can be recognized by a finite automaton. Finite
automata are often represented graphically, as illustrated in Fig 5.7(a).
States are represented by circles, and transitions by arrows from one state to another. Each
arrow is labeled with a character or a set of characters that cause the specified transition to
occur.
Consider, for example, the finite automaton shown in Fig 5.7(a) and the first input string in Fig
5.7(b).
The automaton starts in State 1 and examines the first character of the input string.
The finite automaton in Fig 5.8(a) recognizes identifiers and keywords that begin with a letter
and may continue with any sequence of letters and digits.
Some languages allow identifiers such as NEXT_LINE, which contains the underscore
character (_). Fig 5.8(b) shows a finite automaton that recognizes identifiers of this type.
The finite automaton in Fig 5.8(c) recognizes integers that consist of a string of digits, including
those that contain leading zeroes, such as 000025.
Fig 5.8(d) shows an automaton that does not allow leading zeroes, except in the case of the
integer 0.
Each of the finite automata we have seen so far was designed to recognize one particular
type of token. Fig 5.9 shows a finite automaton that can recognize all of the tokens listed
in Fig 5.5.
In Fig 5.9, a special case occurs in State 3. Suppose that the scanner encounters an erroneous
token such as “VAR.”.
When the automaton stops in State 3, the scanner should perform a check to see whether the
string being recognized is “END.”.
If it is not, the scanner could back up to State 2 (recognizing the “VAR”). The period
would then be
rescanned as part of the following token the next time the scanner is called.
Finite automata provide an easy way to visualize the operation of a scanner, and they are also
easy to implement. Fig 5.10(a) shows a typical algorithm to recognize a token such as an
identifier that may contain underscores.
Fig 5.10(b) shows the finite automaton from Fig 5.8(b) represented in a tabular form.
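The tabular form lends itself to a table-driven scanner routine. The following is a minimal Python sketch of that idea, not a copy of Fig 5.10. The state numbers, the names TABLE, FINAL_STATES, char_class and is_identifier, and the rule that an underscore must be followed by a letter or digit are assumptions made for illustration.

```python
# Transition table: state -> {character class -> next state}.
# State numbering and the "no trailing underscore" rule are assumptions
# made for this sketch; the notes' own table is in Fig 5.10(b).
TABLE = {
    1: {"letter": 2},
    2: {"letter": 2, "digit": 2, "underscore": 3},
    3: {"letter": 2, "digit": 2},
}
FINAL_STATES = {2}


def char_class(ch):
    """Map one input character to the class used to index the table."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch == "_":
        return "underscore"
    return "other"


def is_identifier(text):
    """Run the automaton over text; accept only if it ends in a final state."""
    state = 1
    for ch in text:
        state = TABLE[state].get(char_class(ch))
        if state is None:  # no transition defined: the string is rejected
            return False
    return state in FINAL_STATES
```

With this table, is_identifier("NEXT_LINE") is accepted, while "1X" and "NEXT_" are rejected. Changing the language accepted by the scanner only requires changing the table, not the loop.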
5.1.3 Syntactic Analysis
During syntactic analysis, the source statements written by the programmer are recognized as
language constructs described by the grammar being used.
We may think of this process as building the parse tree for the statements. Parsing techniques
are divided into two general classes – bottom-up and top-down – according to the way in which
the parse tree is constructed.
Top-down methods (e.g., recursive-descent parsing) begin with the rule of the grammar that
specifies the goal of the analysis (i.e., the root of the tree), and attempt to construct the tree so
that the terminal nodes match the statements being analyzed.
Bottom-up methods (e.g., operator-precedence parsing) begin with the terminal nodes of the
tree (the statements being analyzed), and attempt to combine these into successively higher-
level nodes until the root is reached.
A large number of different parsing techniques have been devised, most of which are
applicable only to grammars that satisfy certain conditions.
Operator-Precedence Parsing
The bottom-up parsing technique we consider is called the operator precedence method. This
method is based on examining pairs of consecutive operators in the source program, and
making decisions about which operation should be performed first.
For example, consider the arithmetic expression “A + B * C – D”. According to the usual rules of
arithmetic, * and / have higher precedence than + and –. If we examine the first two operators,
+ and *, we find that + has lower precedence than *. This is often written as “+ < *”.
Similarly, for the next pair of operators, * and –, we would find that * has higher
precedence than –. We may write this as “* > –”.
A + B * C – D
    <   >
This implies that the subexpression B*C is to be computed before either of the other operations
in the expression is performed.
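The pairwise comparison can be sketched in a few lines of Python. This is only an illustration: the PRECEDENCE table and the function names are hypothetical, and treating operators of equal precedence as “>” (so that they are applied left to right) is an assumption of the sketch.

```python
# Hypothetical numeric precedence levels for the four arithmetic operators.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "/": 2}


def relations(operators):
    """Return the relation ('<' or '>') between each pair of consecutive operators."""
    result = []
    for left, right in zip(operators, operators[1:]):
        # Equal precedence is treated as '>' (assumption: left-to-right evaluation).
        result.append("<" if PRECEDENCE[left] < PRECEDENCE[right] else ">")
    return result


def first_to_compute(operators):
    """Index of the first operator that has '>' on its right.

    Everything to its left is related by '<', so this operator is the one
    delimited by < and > and must be performed first.
    """
    for i, rel in enumerate(relations(operators)):
        if rel == ">":
            return i
    return len(operators) - 1
```

For the operators of A + B * C - D, relations(["+", "*", "-"]) gives ["<", ">"], and first_to_compute selects index 1, the * operator.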
The first step in constructing an operator-precedence parser is to determine the precedence
relations between the operators of the grammar. In this context, operator is taken to mean any
terminal symbol (i.e., any token), so we also have precedence relations involving tokens such
as BEGIN, READ, id, etc.
The matrix in Fig 5.11 shows these precedence relations for the grammar in Fig 5.2.
The relation ≐ indicates that the two tokens involved have equal precedence and should be
recognized by the parser as part of the same language construct.
Note that the precedence relations do not follow the ordinary rules for comparisons.
For example, we have “; > END” but “END > ;”.
That is, when ; is followed by END, the ; has higher precedence.
But when END is followed by ;, the END has higher
precedence.
Also note that in many cases, there is no precedence relation between a pair of tokens. This
means that these two tokens cannot appear together in any legal statement. If such a
combination occurs during parsing, it should be recognized as a syntax error.
There are algorithmic methods for constructing a precedence matrix like Fig 5.11 from a
grammar [see, for example, Aho et al. (1998)]. For the operator-precedence parsing method to
be applied, it is necessary that all the precedence relations be unique.
Fig 5.12 shows the application of the operator-precedence parsing method to
the READ statement from line 9 of the program in Fig 5.1.
The statement is scanned from left to right, one token at a time. For each pair of operators, the
precedence relation between them is determined.
Part (ii) of Fig 5.12 shows the statement being analyzed
with id replaced by <N1>.
Part (ii) of Fig 5.12 also shows the precedence relations that hold in the new version of the
statement. An operator-precedence parser generally uses a stack to save tokens that have been
scanned but not yet parsed, so it can reexamine them in this way.
Precedence relations hold only between terminal symbols, so <N1> is not involved in this
process, and a relationship is determined between ( and ).
Fig 5.13 shows a similar step-by-step parsing of the assignment statement from line 14 of the
program in Fig 5.1.
Note that the left-to-right scan is continued in each step only far enough to determine the next
portion of the statement to be recognized, which is the first portion delimited by < and >.
Once this portion has been determined, it is interpreted as a nonterminal according to some rule
of the grammar.
This process continues until the complete statement is recognized. Note that (see Fig 5.13)
each portion of the parse tree is constructed from the terminal nodes up toward the root, hence
the term bottom-up parsing.
Although we have illustrated operator-precedence
parsing only on single statements, the same techniques can be applied to an entire program.
The operator-precedence technique later developed into a more general method known as
shift-reduce parsing.
Shift-reduce parsers make use of a stack to store tokens that have not yet been recognized in
terms of the grammar.
The actions of the parser are controlled by entries in a table, somewhat similar to the
precedence matrix discussed before.
The two main actions are shift (push the current token onto the stack) and reduce (recognize
symbols on top of the stack according to a rule of the grammar).
Fig 5.14 illustrates this shift-reduce process, using the same READ statement considered in Fig
5.12. The token currently being examined by the parser is indicated by ↑.
In Fig 5.14(a), the parser shifts (pushing the current token onto the stack) when it encounters
the token BEGIN.
In Fig 5.14(b-d), the parser again shifts, as in Fig 5.14(a).
In Fig 5.14(e), when the parser examines the token ), the reduce action is invoked. A set of
tokens from the top of the stack (in this case, the single token id) is reduced to a nonterminal
symbol from the grammar (in this case, <id-list>).
In Fig 5.14(f), the token ) is considered again. This time, it will be pushed onto the stack, to be
reduced later as part of the READ statement.
For this simple type of grammar, shift roughly corresponds to the action taken by
an operator-precedence parser when it encounters the
relations < and ≐. Reduce roughly corresponds to the action taken when an operator-
precedence parser encounters the relation >.
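The correspondence between precedence relations and the shift and reduce actions can be sketched for arithmetic expressions. The code below is a simplified illustration, not the table-driven parser of Fig 5.14: it handles only the four arithmetic operators, keeps operands and operators on two separate stacks, and uses a hypothetical PREC table in place of a precedence matrix.

```python
# Hypothetical precedence levels standing in for a precedence matrix.
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}


def relation(on_stack, incoming):
    """Relation between the operator on top of the stack and the incoming one."""
    # Equal precedence yields '>' so that operators associate left to right.
    return "<" if PREC[on_stack] < PREC[incoming] else ">"


def parse(tokens):
    """Return (tree, trace) for a flat list of operand and operator tokens."""
    operands = []   # recognized subexpressions (play the role of nonterminals)
    operators = []  # operators shifted but not yet reduced
    trace = []

    def reduce():
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))
        trace.append("reduce " + op)

    for tok in tokens:
        if tok in PREC:
            # Relation '>' means the operator on the stack must be reduced first.
            while operators and relation(operators[-1], tok) == ">":
                reduce()
            operators.append(tok)  # relation '<' (or empty stack): shift
        else:
            operands.append(tok)
        trace.append("shift " + tok)

    while operators:  # end of input: reduce whatever remains
        reduce()
    return operands[0], trace
```

For the tokens of A + B * C - D, the trace shows shifts up to C, then "reduce *" and "reduce +" when the - operator arrives, and a final "reduce -" at the end of the input. The resulting tree is built from the operands upward, which is the bottom-up order described above.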
Recursive-Descent Parsing
The other parsing technique is a top-down method known as recursive descent. A recursive
descent parser is made up of a procedure for each nonterminal symbol.
As an example for illustrating the parsing process of a recursive descent parser, consider Rule
13 of the grammar in Fig 5.2.
The procedure for <read> in a recursive-descent parser first examines the next two input tokens,
looking for READ and (.
If these are found, the procedure for <read> then calls the procedure for <id-list>.
If that procedure (for <id-list>) succeeds, the <read> procedure examines the next input
token, looking for ).
If all these tests are successful, the <read> procedure
returns an indication of success to its caller and advances to the next token following ).
Otherwise, the <read> procedure returns an indication of
failure.
When there are several alternatives defined by the grammar for a nonterminal, the procedure is
only slightly more complicated. For the recursive-descent technique, it must be possible to
decide which alternative to use by examining the next input token.
For example, the procedure for <stmt> looks at the next token to decide which of its four
alternatives to try.
If the token is READ, it calls the procedure for <read>;
if the token is id, it calls the procedure for <assign> because this is the only alternative that can
begin with the token id, and so on.
This approach runs into a problem with some rules. For example, the procedure for
<id-list>, corresponding to Rule 6, would be unable to decide between its two alternatives
since both id and <id-list> can begin with id.
If the procedure decided to try the second alternative (<id-list>, id), it would immediately
call itself recursively to find an <id-list>. This could result in another immediate recursive
call, which leads to an unending chain.
The reason for this is that one of the alternatives for
<id-list> begins with <id-list>.
Therefore, top-down parsers cannot be directly used with a grammar that contains this kind of
immediate left recursion.
Fig 5.15 shows the grammar from Fig 5.2 with left recursion eliminated.
Top-down parsing using the new grammar: consider Rule 6a in Fig 5.15.
The brace notation in this rule specifies that the terms between { and } may be omitted, or
repeated one or more times.
Thus, Rule 6a defines <id-list> as being composed of an
id followed by zero or more occurrences of “, id”. This is clearly equivalent to Rule 6
of Fig 5.2.
Fig 5.16 illustrates a recursive-descent parse of the READ statement on line 9 of Fig 5.1, using
the grammar in Fig 5.15.
Fig 5.16(a) shows the procedures for the nonterminals
<read> and <id-list>.
Assume that the variable TOKEN contains the type of the next input token, using the coding
scheme shown in Fig 5.5.
Fig 5.16(b) (corresponding to the algorithms in Fig 5.16(a)) gives a graphic representation of
the recursive-descent parsing process for the statement being analyzed.
In part (i), the READ procedure has been invoked and has examined the tokens READ and (
from the input stream (indicated by the dashed lines).
In part (ii), READ has called IDLIST (indicated by the solid line), which has examined the
token id.
In part (iii), IDLIST has returned to READ, indicating success; READ has then examined the
input token ).
This completes the analysis of the source statement. The procedure READ will now return to
its caller, indicating that a <read> was successfully found.
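The two procedures can be sketched in Python as follows. This is an illustration under stated assumptions rather than a transcription of Fig 5.16(a): token types are represented as strings instead of the numeric codes of Fig 5.5, and the class and method names (ReadParser, token, advance) are hypothetical.

```python
# Token types as strings (assumption; the notes use the coding scheme of Fig 5.5).
READ, ID, COMMA, LPAREN, RPAREN = "READ", "id", ",", "(", ")"


class ReadParser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    @property
    def token(self):
        """Type of the next input token, or None at end of input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        self.pos += 1

    def read(self):
        """<read> ::= READ ( <id-list> )"""
        if self.token != READ:
            return False
        self.advance()
        if self.token != LPAREN:
            return False
        self.advance()
        if not self.idlist():
            return False
        if self.token != RPAREN:
            return False
        self.advance()  # move past ')' before reporting success
        return True

    def idlist(self):
        """<id-list> ::= id { , id }  (the repetition replaces the left recursion)"""
        if self.token != ID:
            return False
        self.advance()
        while self.token == COMMA:
            self.advance()
            if self.token != ID:
                return False
            self.advance()
        return True
```

For example, ReadParser(["READ", "(", "id", ")"]).read() succeeds, whereas the token list ["READ", "(", ")"] fails inside idlist because no id follows the parenthesis. The while loop in idlist decides what to do by looking only at the next token, which is the property recursive descent requires.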
Fig 5.17 illustrates a recursive-descent parse of the
assignment statement on line 14 of Fig 5.1.
Fig 5.17(a) shows the procedures (ASSIGN, EXP, TERM, FACTOR) for the
nonterminal symbols that are involved in parsing this statement. You should
carefully compare these procedures to the corresponding rules of the grammar.
Fig 5.17(b) is a step-by-step representation of the procedure calls and token
examinations similar to that shown in Fig 5.16(b).
Note that the same technique can be applied to an entire program.
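A compact Python sketch of the four procedures is given below. The grammar rules quoted in the comments are assumed for illustration, since Fig 5.15 is not reproduced in these notes; the class name, the string-based tokens, and the tuple-shaped parse tree are likewise hypothetical.

```python
# Assumed rules (illustrative; compare with Fig 5.15):
#   <assign> ::= id := <exp>
#   <exp>    ::= <term>   { + <term>   | - <term> }
#   <term>   ::= <factor> { * <factor> | DIV <factor> }
#   <factor> ::= id | int | ( <exp> )
OPERATORS = {":=", "+", "-", "*", "DIV", "(", ")"}


class AssignParser:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, wanted):
        if self.peek() != wanted:
            raise SyntaxError(f"expected {wanted!r}, found {self.peek()!r}")
        self.pos += 1

    def operand(self):
        """Accept an id or int token (anything that is not an operator)."""
        tok = self.peek()
        if tok is None or tok in OPERATORS:
            raise SyntaxError(f"expected id or int, found {tok!r}")
        self.pos += 1
        return tok

    def assign(self):
        target = self.operand()
        if target.isdigit():
            raise SyntaxError("left side of := must be an id")
        self.expect(":=")
        return (":=", target, self.exp())

    def exp(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "DIV"):
            op = self.peek()
            self.pos += 1
            node = (op, node, self.factor())
        return node

    def factor(self):
        if self.peek() == "(":
            self.pos += 1
            node = self.exp()  # recursion: a factor may contain a whole <exp>
            self.expect(")")
            return node
        return self.operand()
```

Parsing the tokens of a statement such as VARIANCE := SUMSQ DIV 100 - MEAN * MEAN (identifiers chosen for illustration) yields a tree in which DIV and * are grouped before -, because term is called from exp and therefore binds more tightly.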
Unit V: Other System Software: Text Editors- Interactive Debugging Systems.
4.0 Introduction
An interactive text editor has become an important part of almost any computing
environment. A text editor acts as a primary interface to the computer for all types of
“knowledge workers” as they compose, organize, study, and manipulate computer-based
information.
A text editor allows you to edit a text file (create it, modify it, and so on). Examples of
interactive editors on Windows are Notepad, WordPad, and Microsoft Word; examples on
UNIX are vi, emacs, jed, and pico.
The common editing features associated with text editors are moving the cursor,
deleting, replacing, pasting, searching, searching and replacing, saving and loading, and
miscellaneous operations (e.g., quitting).
An interactive editor is a computer program that allows a user to create and revise
a target document. The term document includes objects such as computer programs, text,
equations, tables, diagrams, line art, and photographs. Here we restrict ourselves to text
editors, where character strings are the primary elements of the target text.
The document-editing process in an interactive user-computer dialogue has four tasks.
These tasks involve traveling, filtering, and formatting. The editing phase involves
operations such as insert, delete, replace, move, copy, cut, and paste.
There are two types of editors: manuscript-oriented editors and program-oriented
editors. A manuscript-oriented editor operates on characters, words, lines, sentences,
and paragraphs. A program-oriented editor operates on identifiers, keywords, and
statements. The user specifies what he or she wants formatted.
The user interface is concerned with the input devices, the output devices, and
the interaction language. The input devices are used to enter elements of the text being
edited and to enter commands. The output devices let the user view the elements being
edited and the results of the editing operations. The interaction language provides
communication with the editor.
Input devices are divided into three categories: text devices, button devices, and
locator devices. Text devices are typically keyboards. Button devices are special function
keys or symbols on the screen. Locator devices include the mouse and the data tablet. There
are also voice input devices, which translate spoken words into their textual equivalents.
The interaction language may be typing oriented (text command oriented) or
menu oriented. The typing-oriented or text-command-oriented interface was used by the
oldest editors, in the form of commands, function keys, control keys, and so on.
A menu-oriented user interface presents a menu with a multiple-choice set of text strings
or icons. Because the display area for text is limited, menus can be turned on or off.
[Figure: Typical Editor Structure. Labeled elements: input, command language processor,
editing component, traveling component, viewing component, display component, editing
buffer, editing filter, viewing buffer, viewing filter, main memory, paging routines, file
system, output devices. Legend: control, data.]
Editing operations are specified explicitly by the user and display operations are
specified implicitly by the editor. Traveling and viewing operations may be invoked
either explicitly by the user or implicitly by the editing operations.
When an editing command is issued, the editing component invokes the editing filter,
which generates a new editing buffer containing the part of the document to be edited,
starting from the current editing pointer. Filtering and editing may be interleaved, with
no explicit editing buffer being created.
When the display needs to be updated, the viewing component invokes the viewing filter,
which generates a new viewing buffer containing the part of the document to be viewed,
starting from the current viewing pointer. In a line editor, the viewing buffer may contain
only the current line; in a screen editor, it contains a rectangular cutout of the quarter
plane of the text. The viewing buffer is then passed to the display component of the editor,
which produces a display by mapping the buffer to a rectangular subset of the screen, called
a window.
The editing and viewing buffers may be identical or completely disjoint. They are identical
when the user edits the text directly on the screen. They are disjoint in an operation such
as Find and Replace: for example, in a text of 150 lines, a user positioned at line 100
decides to change all occurrences of ‘text editor’ to ‘editor’. The editing and viewing
buffers can also partially overlap, or one may be completely contained in the other.
Windows typically cover the entire screen or a rectangular portion of it. Several windows
may show different portions of the same file or portions of different files, which makes
inter-file editing operations possible.
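The rectangular cutout produced for a screen editor can be sketched as a small function. This is only an illustration of the idea: the name viewing_filter, its parameters, and the choice to pad short lines with spaces are assumptions, not part of any particular editor.

```python
def viewing_filter(lines, top, left, height, width):
    """Return the rectangular cutout of the text starting at row top, column left.

    lines  -- the document as a list of strings, one per line
    top    -- index of the first line to show (the viewing pointer, assumed line-based)
    left   -- index of the first column to show
    height -- number of screen rows in the window
    width  -- number of screen columns in the window
    """
    window = []
    for line in lines[top:top + height]:
        # Slice out the visible columns and pad so every row has equal width.
        window.append(line[left:left + width].ljust(width))
    return window
```

For instance, with lines = ["alpha beta", "gamma", "delta epsilon", "zeta"], the call viewing_filter(lines, 1, 2, 2, 4) returns the two rows "mma " and "lta ". The display component would then map these rows onto the window.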
The components of the editor deal with a user document on two levels: in main
memory and in the disk file system. Loading an entire document into main memory may
be infeasible, so only part of it is loaded; demand paging is used, implemented by the
editor paging routines. Documents may not be stored sequentially as a string of characters.
Instead, the editor uses a separate data structure that allows addition, deletion, and
modification with a minimum of I/O and character movement.
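The notes do not name a particular data structure. One common choice in practice is a gap buffer, sketched here in Python as an assumption for illustration: text on each side of the cursor is kept in its own list, so insertions and deletions at the cursor move no other characters. The class and method names are hypothetical.

```python
class GapBuffer:
    """Two-list variant of a gap buffer (illustrative sketch)."""

    def __init__(self, text=""):
        self.before = list(text)  # characters left of the cursor
        self.after = []           # characters right of the cursor, stored reversed

    def move_to(self, pos):
        """Move the cursor to character position pos, shifting only what lies between."""
        while len(self.before) > pos:
            self.after.append(self.before.pop())
        while len(self.before) < pos and self.after:
            self.before.append(self.after.pop())

    def insert(self, s):
        """Insert s at the cursor; characters after the cursor are not touched."""
        self.before.extend(s)

    def delete(self, count):
        """Delete up to count characters to the right of the cursor."""
        for _ in range(min(count, len(self.after))):
            self.after.pop()

    def text(self):
        return "".join(self.before) + "".join(reversed(self.after))
```

Starting from GapBuffer("text editor"), moving to position 0 and deleting 5 characters leaves "editor"; moving to position 6 and inserting "s" gives "editors". Only cursor movement copies characters, and the cost is proportional to the distance moved.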
In a time-sharing environment, the editor must function swiftly within the context of
the load on the computer’s processor, memory, and I/O devices. In a stand-alone
environment, the editor is built with all the functions needed to carry out editing and
viewing operations, although the help of the OS may also be taken for some tasks such as
demand paging. In a distributed environment, the editor must both run independently on
each user’s machine, like a stand-alone editor, and contend for shared resources such as
files, like a time-sharing editor.
4.2 Interactive Debugging Systems
Here we discuss
One important requirement of any interactive debugging system (IDS) is the observation
and control of the flow of program execution. The user sets breakpoints at which execution
is suspended, uses debugging commands to analyze the progress of the program, and then
resumes execution. The user may also set conditional expressions that are evaluated during
the debugging session; program execution is suspended when the conditions are met, analysis
is made, and execution is later resumed.
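A conditional breakpoint can be imitated in Python with the standard sys.settrace hook. The sketch below is an assumption-laden illustration, not a real IDS: instead of suspending the program and accepting commands, it simply records the local variables at each line where the condition holds. The names run_with_breakpoint and sum_to_four are hypothetical.

```python
import sys


def run_with_breakpoint(func, condition):
    """Run func() and record the local variables at every line where condition holds."""
    hits = []

    def tracer(frame, event, arg):
        if frame.f_code is not func.__code__:
            return None  # do not trace the lines of other functions
        if event == "line":
            snapshot = dict(frame.f_locals)
            if condition(snapshot):
                # A real debugger would suspend execution here and accept commands.
                hits.append(snapshot)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func()
    finally:
        sys.settrace(previous)  # always restore the previous trace hook
    return result, hits


def sum_to_four():
    """Small program to be observed: adds the integers 0 through 4."""
    total = 0
    for i in range(5):
        total += i
    return total
```

Calling run_with_breakpoint(sum_to_four, lambda v: v.get("total", 0) > 3) returns the normal result of the function together with the snapshots taken once total exceeds 3. The observed program is not modified, which is the point of letting the debugger control execution from outside.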
A debugging system should also provide functions such as tracing and traceback.
Tracing can be used to track the flow of execution logic and data modifications. The
control flow can be traced at different levels of detail: procedure, branch, individual
instruction, and so on. Traceback can show the path by which the current statement in
the program was reached. It can also show which statements have modified a given
variable or parameter. The statements are displayed in source form rather than as
hexadecimal displacements.
To provide these functions, a debugger should consider the language in which the
program being debugged is written. A single debugger that handles many programming
languages is language independent; a debugger built for one specific programming language
is language dependent. In either case, the debugger must be sensitive to the specific
language being debugged.
The context being used has many different effects on the debugging interaction.
For example, the same assignment is written differently depending on the language:
Cobol - MOVE 6.5 TO X
Fortran - X = 6.5
C - X = 6.5
The notation used to specify certain debugging functions varies according to the
language of the program being debugged. Sometimes the language translator itself has
debugger interface modules that can respond to the request for debugging by the user.
The source code may be displayed by the debugger in the standard form or as specified
by the user or translator.
It is also important that a debugging system be able to deal with optimized code.
Many optimizations like
4.2.4 User-Interface Criteria
A debugging system should be simple in its organization and familiar in its language, and
should closely reflect common user tasks. A simple organization contributes greatly to ease
of training and ease of use. The user interaction should make use of full-screen displays
and windowing systems as much as possible. With menus and full-screen editors, the user has
far less information to enter and remember. There should be complete functional equivalence
between commands and menus, so that users who are unable to use a full-screen IDS may use
commands instead. The command language should have a clear, logical, and simple syntax;
command formats should be as flexible as possible. Any good IDS should have an on-line HELP
facility. HELP should be accessible from any state of the debugging session.