
Introduction to C: Module 9

Weekly Reading: Beej Guide, Chapters 19.1-8, Makefile Tutorial (Optional)

The Preprocessor

Statements beginning with # are preprocessor directives, which are applied before
compilation. An #include directive, as we've already discussed in Module 8, directly includes
the contents of a header (.h) file into your own file, as if you had written it there yourself.

We've also seen how #define can be used to create single-symbol macros: if expressions are
used, you have to be careful. You might think these are identical:

#define TEN 8 + 2
#define TEN (8 + 2)

but, because it is the text, not the value, that gets spliced in during later use, the former will
result in the expression:

int a = TEN * TEN;

expanding to:

int a = 8 + 2 * 8 + 2; // 26 -- probably not what you want

For this reason, it's often best to put parentheses around any complex expression when using it
as a macro definition. Note that there is no guarantee that macroexpansion will generate usable
code—it's very easy to write a macro that "looks right" but generates garbage that won't
compile.

Clang and GCC both support the -E flag, which can be used to display preprocessed code to
the console.
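
For example, assuming the source file is named macro.c:

$ gcc -E macro.c

This prints the translation unit after preprocessing—with all #includes spliced in and all
macros expanded—to standard output.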

Before the bool type was added, it was common for programmers to include:

#define TRUE 1
#define FALSE 0

... but these days, you should be using the bool type from stdbool.h instead.

There is, in fact, a familiar macro you use all the time—NULL is defined, on almost all modern
systems, like so:

#define NULL ((void *) 0)


This provides flexibility. If targeting a system where 0 is a legal address, NULL can be defined in
some other way.

Macros can be parameterized. A common use case is this:

#define MAX(a, b) ((a) > (b) ? (a) : (b))

This is done to avoid the overhead of a function call; it also leverages the fact that > works
across numeric types, giving a kind of polymorphism. Personally, I don't like it; if a and b are
simple expressions, it's fine, but if they are the results of function calls—say, we're taking
MAX(f(5), g(7))—and these functions have side effects, you will, if you use this macro,
generate code that invokes those side effects twice.
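
To see why, look at the expansion (f and g here are hypothetical functions with side effects,
e.g., printing):

int x = MAX(f(5), g(7));

expands to:

int x = ((f(5)) > (g(7)) ? (f(5)) : (g(7)));

Both functions are called once for the comparison, and then whichever one "won" is called a
second time to produce the result—so its side effect occurs twice.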

I prefer to avoid macros for the most part, instead being explicit—write the code out, so it's clear
what is going on. Code is allowed to be "boring" if people can understand it. C doesn't have
hygienic macros like a modern Lisp, and it's a tricky enough language without creating whole
hosts of implicit behaviors. I am including this topic not to condone the use of complex
macros—you almost never should—but because other programmers do so, and you should
know that the capability exists.

The # operator inside a macro converts a macro argument into a string literal, and the ##
operator concatenates tokens—these are very rarely useful, and will not be covered in depth.
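
As a quick illustration (these macro names are just examples):

#define STRINGIFY(x) #x
#define CONCAT(a, b) a##b

// STRINGIFY(1 + 2) expands to the string literal "1 + 2"
// CONCAT(counter_, 3) expands to the single token counter_3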

The #undef directive allows you to undefine a macro.

You will sometimes see people use a do { ... } while (0) idiom to create a block of
statements. This is necessary for the same reason that parentheses are necessary around
expressions—we want to keep the macroexpansion results together.

Let me give an incorrect example as motivation for this pattern. Consider the following macro
and program:

#include <stdio.h>

#define PUTCHAR_TWICE(n) \
    putchar(n); \
    putchar(n)

int main () {
    PUTCHAR_TWICE('x');
    return 0;
}
If you compile and run this, it does what you'd expect—prints xx to the console. But, consider
this program instead:

#include <stdio.h>

#define PUTCHAR_TWICE(n) \
    putchar(n); \
    putchar(n)

int main () {
    if (0)
        PUTCHAR_TWICE('x');
    return 0;
}

It shouldn't do anything, right? And yet, it prints a single x to the console. Why? The compiler
option -E shows us how the macro expands:

int main () {
    if (0)
        putchar('x'); putchar('x');
    return 0;
}

That is, the if only guards the first statement—we wanted PUTCHAR_TWICE to behave as if the
statements were conceptually coupled—all or nothing. The do { ... } while (0) idiom
allows us to correct this. Redefining the macro as:

#define PUTCHAR_TWICE(n) \
    do { \
        putchar(n); \
        putchar(n); \
    } while (0)

fixes this. Note that the macro itself ends without a semicolon—the caller's semicolon
completes the statement.
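
That last detail matters. If the macro supplied its own trailing semicolon, code like the
following (with c some char variable) would fail to compile, because the extra empty statement
would detach the else from the if:

if (c == 'x')
    PUTCHAR_TWICE('x');
else
    PUTCHAR_TWICE('y');

With the semicolon-free definition above, each branch expands to exactly one statement and
this compiles cleanly.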

It's conventional to use ALL_CAPS for macro names, so they aren't confused with functions. As
you've seen, they're not functions—they're inline expansions—occasionally useful, but a mess
to debug.

There are some predefined, standard macros you can use without worry—these are often
helpful in debugging. They are: __LINE__, __FILE__, __TIME__, and __DATE__. An example
using all of them is:
#include <stdio.h>
#include <stdlib.h>

int main () {
    size_t n = ((size_t) 1) << 47;
    int* a = malloc(n);
    if (!a) {
        printf("Catching allocation failure at %s:%d (%s %s)...\n", __FILE__,
               __LINE__ - 2, __DATE__, __TIME__);
        printf("Exiting immediately...\n");
        exit(2);
    }
    return 0;
}

$ ./macro
Catching allocation failure at macro.c:62 (Sep 15 2024 10:05:47)...
Exiting immediately...

The program ends up in an error condition because of the failed allocation (my system doesn't
have 128 TB of RAM to spare), and so we get a helpful error message with a file, line number,
and timestamp—note that __DATE__ and __TIME__ record when the program was compiled, not
when it ran—before the program exits.

Conditional inclusion and compilation are achieved using #if, #ifdef ("if defined"), #ifndef
("if not defined"), #else, #elif, and #endif. This can also be used to achieve portability—for
example, code to be executed only on certain systems can be #ifdef-guarded:

#ifdef WIN32
// do WIN32 specific stuff...
#endif

Or #if-guarded:

#if INT_MAX < (1 << 30)
// int is probably 16 bits, so do one thing...
#else
// int is probably 32 bits, so do some other thing...
#endif

You hopefully won't have to do much of this stuff—it can make for very confusing software, but
it's sometimes necessary when making code portable.

Programming in the Large

It is rare that a real-world program exists only in one file. We will explore the process of creating
a multi-file project. Although a variety of build systems exist, we’ll focus on an old and proven
offering you’re sure to encounter as a C programmer: make.
There are several reasons to split your code up into multiple files. The first is to factor the
program into reusable and modular parts, which is good practice in general, especially in the
context of working on a shared codebase. A second is to leverage separate
compilation—compiling large codebases can take a long time, so most build systems only
recompile what has changed.

Here, we will go through an example that uses multiple files. We create a simple counter object
and place the code in counter.c:

// counter.c
#include <stdlib.h>
#include <string.h>
#include "counter.h"

typedef struct counter {
    char* name;
    int contents;
} counter;

counter* counter_new(const char* name) {
    counter* res = calloc(1, sizeof(counter));
    if (res) {
        res->name = malloc(strlen(name) + 1);
        if (res->name) {
            strcpy(res->name, name);
            return res;
        } else { // partial allocation
            free(res);
            return NULL;
        }
    } else {
        return NULL;
    }
}

void counter_inc(counter* ctr) {
    ctr->contents += 1;
}

int counter_get(counter* ctr) {
    return ctr->contents;
}

void counter_delete(counter* ctr) {
    free(ctr->name);
    free(ctr);
}

Note the lack of a main—this is not an executable program, and will therefore never be “run.”
Instead, it is a library to be used by other programs. As such, we must assume some of our
clients will require high reliability, so we must handle partial allocation in counter_new—we
must detect that case and release the uncompleted counter, lest we create a memory leak in
user programs, whereas we might be more lax if we were writing an executable of limited scope.

We do not, in general, want implementation details to leak into, or to slow down, compilation of
client code that uses the module, so we also create a header file:

// counter.h
#ifndef COUNTER_H
#define COUNTER_H

typedef struct counter counter;

counter* counter_new(const char* name);

void counter_inc(counter* ctr);

int counter_get(counter* ctr);

void counter_delete(counter* ctr);

#endif

This gives users of the library enough information to compile, but leaves the implementation
details in the .c file opaque. Because counters are used on the heap through pointers, users
don’t need to know anything but the interface—if the counter struct were given an additional
field—say, for instrumentation—we could change the .c file, but not the .h, and clients would
not need to change.
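
For instance, a later revision of counter.c might look like this (a hypothetical sketch—
counter.h would be unchanged):

// counter.c, hypothetical later revision
typedef struct counter {
    char* name;
    int contents;
    long total_increments; // new instrumentation field
} counter;

Client code written against the same counter.h keeps working; only the library's own .c file
needs to change and be recompiled.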

Although it is common (and probably more correct) to use angle-brackets when including
standard library files, e.g. #include <stdio.h>, you must use quotes when doing this for code
in your own projects, as seen below in a client executable that uses our counter library.

// counter_client.c
#include <stdio.h>
#include "counter.h"

int main() {
    counter* c = counter_new("mycounter");
    counter_inc(c);
    counter_inc(c);
    counter_inc(c);
    printf("counter value is %d\n", counter_get(c));
    counter_delete(c);
    return 0;
}

The compiler only needs counter.h to compile counter_client.c separately; the compiled
version of counter.c—counter.o—is not necessary until linking time, when the final
executable is created. Before discussing make, we’ll go through this process manually. Note that
the gcc command may or may not invoke the GCC compiler; on macOS systems, it invokes Clang
instead.

Step 1: Compile counter.c into an object file, counter.o . This does not create an executable
program—the functions are all compiled, but have not been linked yet.

$ gcc -c counter.c -o counter.o

Step 2: Compile counter_client.c into the counter_client.o object file, which will also not
be executable—the functions in that file are compiled, but depend on counter.c functions that,
until linking time, have not been specified.

$ gcc -c counter_client.c -o counter_client.o

Step 3: Link them. That is, create an executable in which all data and functions are given
unique (pointer) addresses, so the compiled counter_client.o functions correctly invoke
counter.o ones. The most manual way to do this is using ld, but this is system-dependent and
requires you to explicitly link in the C standard library. An easier way to do it is:

$ gcc counter.o counter_client.o -o counter_client

The object files are listed; the -o flag is used to specify the destination of the compiled
executable, which you can now run.

$ ./counter_client
counter value is 3

GCC and Clang are smart enough that you can do this on one line:

$ gcc counter.c counter_client.c -o counter_client


Still, for complicated programs with dozens of files and complicated dependency graphs, this
sort of work gets tedious and error-prone, and risks redundant compilation. Unix-based
operating systems provide a utility called make that handles dependency resolution.

To use make, you'll want to declare your dependencies in a Makefile, like so:

# Makefile
clean:
	rm *.o
	rm main

counter.o: counter.c counter.h
	gcc -c counter.c

counter_client.o: counter_client.c counter.h
	gcc -c counter_client.c

main: counter.o counter_client.o
	gcc counter_client.o counter.o -o main

Makefiles have semantic whitespace, like Python, so each of those indentations must be a tab,
not spaces. Each entry has the following format:

<target>: <dependency>*
	<command>
	<command>
	<command>

The commands must each be on their own line; there can be one or many of them. Note that
clean has no dependencies, while counter.o and counter_client.o depend on source files,
not make targets. However, main depends on other targets, so if you make that target, the
dependencies will be built first, like so:

$ make main
gcc -c counter.c
gcc -c counter_client.c
gcc counter_client.o counter.o -o main
$ ./main
counter value is 3

The make utility only recompiles what has changed; therefore, if we touch some files and not
others, we'll only see re-execution of some commands.
$ touch counter.c
$ make main
gcc -c counter.c
gcc counter_client.o counter.o -o main

The console output tells us that make recompiled counter.o, but not counter_client.o. There
is no need to, because counter_client.c hasn’t changed; therefore the existing object code is
assumed to be current.

If nothing needs to be done, make will tell us that, too.

$ make main
make: `main' is up to date.

There is no magic here. Your Makefile must list dependencies for all targets, or make may
erroneously conclude that no recompilation is necessary, leading to build failures or worse. It
does not infer, for example, that the command gcc counter_client.o counter.o -o main
depends on the targets counter_client.o and counter.o. You have to tell it that.

There is a target called clean that has no dependencies but that clears out our *.o files and our
executable. Offering one is conventional: there are times when a user wants a fresh build, and
this gives developers the ability to start from scratch.

$ make clean
rm *.o
rm main

$ ls -tlr
total 40
-rw-r--r-- 1 michaelchurch staff 235 Sep 15 11:07 counter_client.c
-rw-r--r-- 1 michaelchurch staff 219 Sep 15 12:13 counter.h
-rw-r--r-- 1 michaelchurch staff 219 Sep 15 12:19 Makefile
-rw-r--r-- 1 michaelchurch staff 642 Sep 15 12:28 counter.c

As you can see, the object files and the executable have been deleted.
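
One caveat worth knowing (though not required here): make decides whether a target needs
rebuilding by looking for a file of that name, so if a file named clean ever existed in the
directory, make clean would report it as up to date and do nothing. Many Makefiles therefore
declare such targets phony, like so:

.PHONY: clean
clean:
	rm *.o
	rm main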

Often, users will create an all target that generates all the libraries and executables that a
project includes, but for this small one, we don’t need to do so.

This is only the beginning of what can be achieved with make—for more depth on the topic, go
to the Makefile Tutorial listed above.

There is no assignment due for this module. Instead, start early on Phase 2 of your PSI
interpreter, which is due November 7.
Module 9–11 Project

You will expand the type system and functionality of your PSI interpreter.

● Integers (from Phase 1)—no change.
● Booleans (from Phase 1)—no change.
● Symbols (from Phase 1)—no change to the type—but evaluation semantics will change.
● Errors (from Phase 1)—a wrapper type around a string, used to give a helpful message to
  the user, e.g., "division by zero".
● Lists—dynamically typed (i.e., possibly heterogeneous) collections that hold zero or
  more elements. You should be able to handle any reasonable length, but you do not
  need to worry about cyclic or infinite lists.
● Functions—the builtin functions are of function type. In Phase 3, you will address (and
  create) user-written functions, but for now, you just need to handle an expanded set of
  builtins.

You will expand your arithmetic functions as follows (a sketch of one way to implement a
variadic builtin appears after the list):

● + shall take all arities zero and above.
  ○ (+) returns 0.
  ○ (+ x) returns x.
  ○ (+ x y z) returns x + y + z, and so on.
● - shall take all arities one and above.
  ○ (- x) returns -x.
  ○ (- x y z) returns (x - y) - z, and so on.
● * shall take all arities zero and above.
  ○ (*) shall return 1.
  ○ (* x y z) returns x * y * z.
● / shall take all arities two and above.
  ○ (/ x y z) returns (x / y) / z.
● = shall take all arities zero and above.
  ○ (=) and (= x) are vacuously #t.
  ○ (= x y z) is #t if and only if x = y and y = z.
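
Here is a minimal sketch of how a variadic builtin like + might be implemented in C. Every
name in it—Value, VALUE_INT, as_int, make_int, make_error—is a placeholder for whatever
your own Phase 1 implementation uses; nothing about this shape is prescribed:

// Assumes evaluated arguments arrive as an array of Value pointers.
Value* builtin_add(Value** args, int argc) {
    long sum = 0;                       // (+) with no arguments returns 0
    for (int i = 0; i < argc; i++) {
        if (args[i]->type != VALUE_INT) // hypothetical type tag
            return make_error("+ : arguments must be integers");
        sum += args[i]->as_int;         // hypothetical integer field
    }
    return make_int(sum);
}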

Remember that cross-type equality is always false (#f). Functions compare as identical only if
they are the same object/pointer—function equivalence is undecidable! Lists compare as equal
if they have the same length and compare as equal at each index. For example:

psi> (= '(#t () 1) '(#t () 1))
#t
psi> (= '(#t () 1) '(#t () 2))
#f

The quote syntax above will be explained below; it allows us to refer to lists without evaluating
them. Using (#t () 1) without a quote would (as in Phase 1) result in an error, because #t is a
non-function in function position.

You’ll also be adding a few new functions:

● != returns #t unless all arguments are equal.
  ○ (!= x y z) is equal to (not (= x y z)) for all x, y, z and all arities.
● <, <=, >, and >=, which take only numeric arguments, and all arities two and up.
  ○ (> x y z) returns #t if and only if x > y and y > z; the rest are similar.
● not, which takes one argument of any type and returns #f unless its argument is #f.
  ○ (not x) is therefore equivalent to (= x #f).

In addition, you’ll add the following list-processing functions: cons, head, tail, and atom.

The first of these, cons, has arity 2, and prepends its first argument to its second, which must be
a list. This gives us the ability to build up lists.

psi> (cons 1 ())
(1)
psi> (cons 2 (cons 3 (cons 4 ())))
(2 3 4)
psi> (cons 5 6)
ERROR (cons : second arg must be a list)
psi> (cons 7 (cons () (cons #t ())))
(7 () #t)

The head function has arity one, and returns the first element of a list—it is an error if the
argument is anything else, or if it is an empty list.

psi> (head (cons 5 (cons 6 ())))
5
psi> (head ())
ERROR (head : empty list)

The tail function also takes one argument, which must be a nonempty list, but returns
everything but the head.

psi> (tail (cons 8 (cons 9 (cons 10 ()))))
(9 10)

Note that (cons (head l) (tail l)) = l for all non-empty lists l.

The atom function returns #t if given a non-list—that is, an atomic value—and #f if given any
list.

psi> (atom (cons 1 ()))
#f
psi> (atom #f)
#t

Last, but certainly not least, you will implement three special forms that aren’t functions: if,
quote, and def.

The first of these, if, enables conditional evaluation and execution—if is not a function, and
behaves differently from functions because the subforms may not be evaluated. A legal if-form
always has two or three subforms (arguments) and its semantics are as follows:

● (if cond then) is equivalent to (if cond then #f).
● (if cond then else) evaluates cond, and...
  ○ if cond ⟶ #t, it evaluates and returns then.
  ○ if cond ⟶ #f, it evaluates and returns else.
  ○ in PSI, any non-#f value is considered truthy, e.g., (if 0 1 2) ⟶ 1.

Consider the session below:

psi> (* 0 (/ 1 0))
ERROR (/ : division by zero)
psi> (if (= 3 4) (/ 1 0) (+ 5 7))
12
psi> (if (= 3 3) (/ 1 0) (+ 5 7))
ERROR (/ : division by zero)

The first example, because * is a function, must evaluate (/ 1 0), even though it is
mathematically irrelevant. The second example, because (= 3 4) ⟶ #f, ignores the (/ 1 0)
form entirely and safely evaluates (+ 5 7) ⟶ 12.

This functionality is something a user could not implement by defining functions, if PSI did
not grant it, because ordinary functions always evaluate all of their arguments. Thus, while this
if has an S-expression form and looks like a function, it is not one at all, because it chooses
which subforms (or arguments) it evaluates.
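
As a rough sketch, the if dispatch inside your eval might look like the following—every name
here (is_special_form, subform, subform_count, is_false, make_false, and so on) is a
placeholder for your own implementation, not required code:

// Inside eval, before ordinary function application:
if (is_special_form(expr, "if")) {
    Value* cond = eval(subform(expr, 1), env); // evaluate only the condition
    if (is_error(cond)) return cond;
    if (!is_false(cond))
        return eval(subform(expr, 2), env);    // then-branch
    if (subform_count(expr) == 4)
        return eval(subform(expr, 3), env);    // else-branch
    return make_false();                       // missing else defaults to #f
}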

The def special form creates a variable in PSI. The first argument is always a symbol; the
second is an S-expression that will be evaluated—if no error occurs in the evaluation, the
resulting value is stored in that variable, and it is also returned.
It is legal to def a variable twice; a new binding replaces the old one. Warning: this makes the
old data structure(s) unreachable, and they must be deleted to avoid a memory leak.

This is observed in the session below:

psi> (def b (+ 7 8))
15
psi> (def c 9)
9
psi> (def d (* b c))
135
psi> d
135
psi> (+ d 2)
137
psi> e
ERROR (undefined symbol)

If the second form triggers an error in evaluation, the def-form returns that error but the
definition (or update) does not happen—if the variable was unbound, it remains unbound; if it
held a value, it retains that value.

psi> (def x (/ 1 0))
ERROR (/ : division by zero)
psi> x
ERROR (undefined symbol)
psi> (def y 17)
17
psi> (def y (/ 1 0))
ERROR (/ : division by zero)
psi> y
17

Please note that we have changed the evaluation semantics for symbols. In Phase 1, they were
self-evaluating; that is, e ⟶ e. In Phase 2, and going forward, evaluating a symbol looks up its
binding. For example, after the sessions above, we will have {b: 15, c: 9, d: 135, y: 17}, so d ⟶
135 in this environment. There is no binding for x, because that def was not successful, and so
x will trigger an error.

Notably, Phase 1 had environment-independent semantics. In Phase 2, evaluation depends on
an environment, in which bindings are added with def.
How you implement the environment is up to you, but you are under no obligation to be efficient.
That is, you can use strcmp for string equality and a linear (or full) search in an ordinary list or
array—performance is not important for this exercise, and I would not advise using a more
sophisticated data structure (e.g., hash table, binary search tree) unless all other features have
been built and tested.
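
For concreteness, here is a minimal sketch of such an environment as a linked list. Value and
value_delete are placeholders for your own value type and destructor; nothing about this
shape is required:

#include <stdlib.h>
#include <string.h>

typedef struct binding {
    char* name;
    Value* value;           // Value is your interpreter's value type
    struct binding* next;
} binding;

static binding* global_env = NULL;

// Look up a name; returns NULL if unbound (the caller reports the error).
Value* env_lookup(const char* name) {
    for (binding* b = global_env; b; b = b->next)
        if (strcmp(b->name, name) == 0)
            return b->value;
    return NULL;
}

// Bind (or rebind) a name; call this only after the second subform of def
// evaluated without error. Rebinding deletes the old value, since it would
// otherwise become unreachable—exactly the leak warned about above.
void env_define(const char* name, Value* value) {
    for (binding* b = global_env; b; b = b->next) {
        if (strcmp(b->name, name) == 0) {
            value_delete(b->value); // placeholder destructor
            b->value = value;
            return;
        }
    }
    binding* b = malloc(sizeof(binding));
    b->name = malloc(strlen(name) + 1);
    strcpy(b->name, name);
    b->value = value;
    b->next = global_env;
    global_env = b;
}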

The global environment comes with some symbols predefined—namely, the builtin functions. In
a fresh REPL (nothing def’d yet) we have the following session:

$ ./psi-repl
psi> +
<builtin function +>
psi> (atom -)
#t
psi> (if * / %)
<builtin function />
psi> ((if (= 5 6) + *) 7 (+ 4 4))
56

We should step through this in the context of the read-eval-print cycle of the REPL. In the first
example:

● the read phase takes "+" (a string) and returns +, a symbol.
● when we eval +, we look up the binding and find a function object, which is what eval
  returns.
● we print it; this is type-dependent, but the builtin function is hard-coded to print as
  "<builtin function +>".

Functions are first-class values, so we can use them as arguments to other functions. In the
second example, we call atom on the - function—not the symbol, but the function to which -
evaluates. Since all nonlists are atoms, it returns #t. Functions can also participate in special
forms like if—since everything but #f is “truthy”, the function object to which * evaluates is
truthy and so eval(Symbol("/")), the function for division, is returned.

The fourth form does something you haven’t seen before. It's not a special form, but the
subexpression in function position is not the name of a function. However, since it's not a special
form, ordinary semantics do apply, and we must first evaluate the subforms:

● (if (= 5 6) + *) ⟶ *
● 7 ⟶ 7
● (+ 4 4) ⟶ 8

The function application continues as usual—* is applied with arguments {7, 8}, resulting in a
return of 56. In other words, the first subform of a (non-special-form) S-expression need not be
a symbol naming a function—as was required in Phase 1.

Finally, let's talk about quote, a third special form, which takes exactly one argument and "stops
evaluation" by returning the argument subform, unevaluated. That is, if the program
environment contains {a: 31},

● a ⟶ 31
  ○ That is, Symbol("a") ⟶ Number(31)
● (quote a) ⟶ a
  ○ That is, List([Symbol("quote"), Symbol("a")]) ⟶ Symbol("a")
● (quote (quote a)) ⟶ (quote a)

We observe this in the session below:

psi> (def a (+ 16 15))
31
psi> a
31
psi> (quote a)
a
psi> (quote (+ 3 5))
(+ 3 5)
psi> (quote (quote a))
'a

In the first example, we use def to create a binding; in the second, we look up the bound value.
The third example evaluates a quote-d form by removing one level of quoting, but doing nothing
more. The fourth example constructs, but does not evaluate, the list (+ 3 5)—that is:

List([Symbol("+"), Number(3), Number(5)])

as it will likely be represented in your implementation.

The fifth example might seem strange. The evaluation has been explained, but the REPL
returns 'a as shorthand for (quote a); similarly, '(+ 3 5) is short for (quote (+ 3 5)).

It is up to you whether your REPL uses the shorthand when printing, but your reader must
support both forms.
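
In the reader, supporting the shorthand is a small amount of code. A sketch, with hypothetical
helper names (peek, advance, read_form, make_symbol, make_list2):

// When the reader encounters a single quote, it wraps the next form:
if (peek(input) == '\'') {
    advance(input);                  // consume the '
    Value* inner = read_form(input); // read the form that follows
    return make_list2(make_symbol("quote"), inner); // (quote <inner>)
}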

Here are some example reads using this shorthand.

psi> 'a
a
psi> '(+ 6 7)
(+ 6 7)
psi> '(1 2 3)
(1 2 3)

The read-eval-print transit of each of these looks as follows:

read("'a") = List([Symbol("quote"), Symbol("a")])


eval(List([Symbol("quote"), Symbol("a")])) = Symbol("a")
print(Symbol("a")) = "a"

read("'(+ 6 7)") = List([Symbol("quote"), List([Symbol("+"), Number(6), Number(7)])])


eval(List([Symbol("quote"), List([Symbol("+"), Number(6), Number(7)])])
= List([Symbol("+"), Number(6), Number(7)])
print(List([Symbol("+"), Number(6), Number(7)])) = "(+ 6 7)"

read("'(1 2 3)") = List([Symbol("quote"), List([Number(1), Number(2), Number(3)])])


eval(List([Symbol("quote"), List([Number(1), Number(2), Number(3)])])
= List([Number(1), Number(2), Number(3)])
print(List([Number(1), Number(2), Number(3)])) = "(1 2 3)"

If the pattern isn't clear, reread this or ask me. This can be a bit confusing, especially if you're
new to it, but you need to understand what you're implementing before you build it.

Explanation of Why This Is Interesting (Optional)

This brings us to the general philosophy of Lisp—"Code is Data."

C is a compiled language—it's static, in the sense that most programs are built to do specific
things, and only complete the tasks they have been programmed to do. The source code, in
fact, disappears at runtime, and only the (possibly optimized beyond recognition) executable
code remains. Lisp has a completely opposite philosophy: total dynamism. Code can be
ingested, transformed, and executed at runtime, on the fly.

Consider the (admittedly ugly) PSI session below:

psi> (cons '* (cons (cons '+ (cons 5 (cons 6 ()))) (cons (cons '- (cons 7 (cons 4 ()))) ())))
(* (+ 5 6) (- 7 4))

We have laboriously created the form (* (+ 5 6) (- 7 4)). It is code—a specific program
that, when executed ("evaluated") returns 33. It is also data; it is a list with three elements, two
of which are also lists.

The good news is that quoting—and the single-quote shorthand—allows us to construct the
above in a nicer way:

psi> '(* (+ 5 6) (- 7 4))
(* (+ 5 6) (- 7 4))

You are not responsible for implementing this, but most Lisps contain an eval function that can
evaluate such code "lists" at runtime:

psi> (eval '(* (+ 5 6) (- 7 4)))
33

or:

psi> (def my-prog '(* (+ 5 6) (- 7 4)))
(* (+ 5 6) (- 7 4))
psi> (eval my-prog)
33

Lisp users write code that generates code fairly often. One technique is to develop macros,
which (unlike in C) are crucial to understanding Lisp code. For example, short-circuiting and
is often implemented in standard libraries with a definition like so:

(defmacro and (x y)
  (list 'if x y #f))

This will instruct the interpreter to expand, for example,

(and (= 3 4) (/ 1 0)) ⟶ (if (= 3 4) (/ 1 0) #f),

which safely short-circuits, returning #f.

In Phase 3, we'll explore how to make some of these capabilities useful.

Start early.
