Intro To C - Module 9
The Preprocessor
Lines beginning with # are preprocessor directives, which are applied before compilation. An
#include directive, as we've already discussed in Module 8, directly includes the contents of a
header (.h) file into your own file, as if you had written it there yourself.
We've also seen how #define can be used to create single-symbol macros; if the replacement
text is an expression, you have to be careful. You might think these are identical:
#define TEN 8 + 2
#define TEN (8 + 2)
but, because it is the text, not the value, that gets spliced in during later use, the former can
expand to something you did not intend once the macro appears inside a larger expression.
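For instance (the multiplication here is purely illustrative), with the first definition:

int x = TEN * 3;   // expands to: int x = 8 + 2 * 3;  so x is 14

With the second definition, the same line expands to int x = (8 + 2) * 3; and x is 30, as you
probably intended.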
For this reason, it's often best to put parentheses around any complex expression when using it
as a macro definition. Note that there is no guarantee that macroexpansion will generate usable
code—it's very easy to write a macro that "looks right" but generates garbage that won't
compile.
Clang and GCC both support the -E flag, which can be used to display preprocessed code to
the console.
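For example, for a source file named prog.c (the name here is just a placeholder):

$ gcc -E prog.c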
Before the bool type was added, it was common for programmers to include:
#define TRUE 1
#define FALSE 0
... but these days, you should be using the bool type from stdbool.h instead.
There is, in fact, a familiar macro you use all the time—NULL is defined, on almost all modern
systems, like so:
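#define NULL ((void *) 0)

(That is the typical definition; some systems use plain 0.) Macros can also take arguments, which
makes them look even more like functions. The discussion below assumes a conventional MAX
macro, defined something like:

#define MAX(a, b) ((a) > (b) ? (a) : (b))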
Writing MAX as a macro rather than a function avoids the overhead of a function call; it also
leverages the fact that > works on any arithmetic type, which gives a kind of polymorphism.
Personally, I don't like it; if a and b are simple expressions, it's fine, but if they are the results of
function calls—say, we're taking MAX(f(5), g(7))—and these functions have side effects, you
will, if you use this macro, generate code that invokes the side effects twice, because the winning
argument is evaluated once for the comparison and once again to produce the result.
I prefer to avoid macros for the most part, instead being explicit—write the code out, so it's clear
what is going on. Code is allowed to be "boring" if people can understand it. C doesn't have
hygienic macros like a modern Lisp, and it's a tricky enough language without creating whole
hosts of implicit behaviors. I am including this topic not to condone the use of complex
macros—you almost never should—but because other programmers do so, and you should
know that the capability exists.
The # operator inside a macro converts a macro argument into a string literal, and the ## operator
pastes tokens together—these are very rarely useful, and will not be covered in depth.
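For the curious, a minimal illustration (SHOW and GLUE are made-up names used only for
demonstration):

#define SHOW(x) printf(#x " = %d\n", x)   // SHOW(total) becomes printf("total" " = %d\n", total)
#define GLUE(a, b) a##b                   // GLUE(count, 1) becomes the single token count1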
You will sometimes see people use a do { ... } while (0) idiom to create a block of
statements. This is necessary for the same reason that parentheses are necessary around
expressions—we want the results of macroexpansion to stay together as a single unit.
Let me give an incorrect example as motivation for this pattern. Consider the following macro
and program:
#include <stdio.h>

#define PUTCHAR_TWICE(n) \
    putchar(n);          \
    putchar(n)

int main () {
    PUTCHAR_TWICE('x');
    return 0;
}
If you compile and run this, it does what you'd expect—prints xx to the console. But, consider
this program instead:
#include <stdio.h>

#define PUTCHAR_TWICE(n) \
    putchar(n);          \
    putchar(n)

int main () {
    if (0)
        PUTCHAR_TWICE('x');
    return 0;
}
It shouldn't do anything, right? And yet, it prints a single x to the console. Why? The compiler
option -E shows us how the macro expands:
int main () {
    if (0)
        putchar('x'); putchar('x');
    return 0;
}
That is, the if only guards the first statement—but we wanted PUTCHAR_TWICE to behave as if its
statements were conceptually coupled: all or nothing. The do { ... } while (0) idiom
allows us to correct this. Redefining the macro as:
#define PUTCHAR_TWICE(n) \
    do {                 \
        putchar(n);      \
        putchar(n);      \
    } while (0)
fixes this. Note that the macro itself omits the trailing semicolon: the semicolon you write after
PUTCHAR_TWICE('x') completes the statement, so an invocation behaves like a single ordinary
statement—even when it sits between an if and an else.
It's conventional to use ALL_CAPS for macro names, so they aren't confused with functions. As
you've seen, they're not functions—they're inline expansions—occasionally useful, but a mess
to debug.
There are some predefined, standard macros you can use without worry—these are often
helpful in debugging. Among them are __LINE__, __FILE__, __TIME__, and __DATE__. An example
using all of them is:
#include <stdio.h>
#include <stdlib.h>

int main () {
    size_t n = ((long long) 1) << 47;
    int* a = malloc(n);
    if (!a) {
        printf("Allocation failure at %s:%d (%s %s)...\n", __FILE__,
               __LINE__ - 2, __DATE__, __TIME__);
        printf("Exiting immediately...\n");
        exit(2);
    }
    return 0;
}
$ ./macro
Allocation failure at macro.c:62 (Sep 15 2024 10:05:47)...
Exiting immediately...
The program ends up in an error condition because of the failed allocation (my system doesn't
have 128 TB of RAM to spare), and so we get a helpful error message with a file, a line number,
and the date and time at which the program was compiled—__DATE__ and __TIME__ expand to
the compilation date and time, not the time of execution—before the program exits.
Conditional inclusion and compilation are achieved using #if, #ifdef ("if defined"), #ifndef
("if not defined"), #else, #elif, and #endif. These can also be used to achieve portability—for
example, code to be executed only on certain systems can be #ifdef-guarded:
#ifdef _WIN32
    // do Windows-specific stuff...
#endif
Or #if-guarded:
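For example, using macros that compilers commonly predefine for their target platforms (_WIN32
on Windows, __APPLE__ on Apple systems):

#if defined(_WIN32)
    // Windows-specific code...
#elif defined(__APPLE__)
    // macOS-specific code...
#else
    // code for everything else...
#endif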
You hopefully won't have to do much of this stuff—it can make for very confusing software, but
it's sometimes necessary when making code portable.
It is rare that a real-world program exists only in one file. We will explore the process of creating
a multi-file project. Although a variety of build systems exist, we’ll focus on an old and proven
offering you’re sure to encounter as a C programmer: make.
There are several reasons to split your code up into multiple files. The first is to factor the
program into reusable and modular parts, which is good practice in general, especially in the
context of working on a shared codebase. A second is to leverage separate
compilation—compiling large codebases can take a long time, so most build systems only
recompile what has changed.
Here, we will go through an example that uses multiple files. We create a simple counter object
and place the code in counter.c:
// counter.c
#include <stdlib.h>
#include <string.h>
#include "counter.h"
Note the lack of a main—this is not an executable program, and will therefore never be “run” on
its own. Instead, it is a library to be used by other programs. As such, we must assume some of
our clients will require high reliability, so we handle partial allocation in counter_new: if only
part of the counter could be allocated, we detect that case and release whatever we did allocate,
lest we create a memory leak in user programs. We might be more lax if we were writing an
executable of limited scope.
We do not, in general, want implementation details to leak into, or to slow down, compilation of
client code that uses the module, so we also create a header file:
// counter.h
#ifndef COUNTER_H
#define COUNTER_H
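
// The declarations themselves go here, between the guard lines—something
// like the following, inferred from how counter_client.c uses the library:

typedef struct counter counter;   // opaque type; the fields live in counter.c

counter* counter_new(const char* name);
void counter_inc(counter* c);
int counter_get(const counter* c);
void counter_delete(counter* c);
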
#endif
This gives users of the library enough information to compile, but leaves the implementation
details in the .c file opaque. Because counters are used on the heap through pointers, users
don’t need to know anything but the interface—if the counter struct were given an additional
field—say, for instrumentation—we could change the .c file, but not the .h, and clients would
not need to change.
Although it is common (and probably more correct) to use angle-brackets when including
standard library files, e.g. #include <stdio.h>, you must use quotes when doing this for code
in your own projects, as seen below in a client executable that uses our counter library.
// counter_client.c
#include <stdio.h>
#include "counter.h"

int main() {
    counter* c = counter_new("mycounter");
    counter_inc(c);
    counter_inc(c);
    counter_inc(c);
    printf("counter value is %d\n", counter_get(c));
    counter_delete(c);
    return 0;
}
The compiler only needs counter.h to compile counter_client.c separately; the compiled
version of counter.c—counter.o—is not necessary until linking time, when the final
executable is created. Before discussing make, we’ll go through this process manually. Note that
the gcc command may or may not invoke the GCC compiler; on macOS systems, it uses Clang
instead.
Step 1: Compile counter.c into an object file, counter.o. This does not create an executable
program—the functions are all compiled, but have not been linked yet.
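With gcc, this step is:

$ gcc -c counter.c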
Step 2: Compile counter_client.c into the counter_client.o object file, which will also not
be executable—the functions in that file are compiled, but depend on counter.c functions that,
until linking time, have not been resolved.
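Likewise:

$ gcc -c counter_client.c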
Step 3: Link them. That is, create an executable in which all data and functions are given
unique (pointer) addresses, so the compiled counter_client.o functions correctly invoke
counter.o ones. The most manual way to do this is using ld, but this is system-dependent and
requires you to explicitly link in the C standard library. An easier way to do it is:
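$ gcc counter_client.o counter.o -o counter_client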
The object files are listed; the -o flag is used to specify the destination of the compiled
executable, which you can now run.
$ ./counter_client
counter value is 3
GCC and Clang are smart enough that you can do this on one line:
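$ gcc counter.c counter_client.c -o counter_client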
To use make, you'll want to declare your dependencies in a Makefile, like so:
// Makefile
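# Build rules—a sketch whose commands match the make output shown below;
# the dependency lists follow the description in the text.
counter.o: counter.c counter.h
	gcc -c counter.c

counter_client.o: counter_client.c counter.h
	gcc -c counter_client.c

main: counter.o counter_client.o
	gcc counter_client.o counter.o -o main
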
clean:
	rm *.o
	rm main
Makefiles have semantic whitespace, like Python, so each of those indentations must be a tab,
not spaces. Each entry has the following format:
<target>: <dependency>*
	<command>
	<command>
	<command>
The commands must each be on their own line; there can be one or many of them. Note that
clean has no dependencies, while counter.o and counter_client.o depend on source files,
not make targets. However, main depends on other targets, so if you make that target, the
dependencies will be built first, like so:
$ make main
gcc -c counter.c
gcc -c counter_client.c
gcc counter_client.o counter.o -o main
$ ./main
counter value is 3
The make utility only recompiles what has changed; therefore, if we touch some files and not
others, we'll only see re-execution of some commands.
$ touch counter.c
$ make main
gcc -c counter.c
gcc counter_client.o counter.o -o main
The console output tells us that make recompiled counter.o, but not counter_client.o. There
is no need to, because counter_client.c hasn’t changed; therefore the existing object code is
assumed to be current.
$ make main
make: `main' is up to date.
There is no magic here. Your Makefile must list dependencies for all targets, or make may
erroneously conclude that no recompilation is necessary, leading to build failures or worse. It
does not infer, for example, that the command gcc counter_client.o counter.o -o main
depends on the targets counter_client.o and counter.o. You have to tell it that.
There is a target called clean that has no dependencies but that clears out our *.o files and our
executable—this is offered by convention, because there are times when a user wants a
completely fresh build, and clean gives them a way to start from scratch.
$ make clean
rm *.o
rm main
$ ls -tlr
total 40
-rw-r--r-- 1 michaelchurch staff 235 Sep 15 11:07 counter_client.c
-rw-r--r-- 1 michaelchurch staff 219 Sep 15 12:13 counter.h
-rw-r--r-- 1 michaelchurch staff 219 Sep 15 12:19 Makefile
-rw-r--r-- 1 michaelchurch staff 642 Sep 15 12:28 counter.c
As you can see, the object files and the executable have been deleted.
Often, users will create an all target that generates all the libraries and executables that a
project includes, but for this small one, we don’t need to do so.
This is only the beginning of what can be achieved with make—for more depth on the topic, go
to the Makefile Tutorial listed above.
There is no assignment due for this module. Instead, start early on Phase 2 of your PSI
interpreter, which is due November 7.
Module 9–11 Project
You will expand the type system and functionality of your PSI interpreter.
Remember that cross-type equality is always false (#f). Functions compare as identical only if
they are the same object/pointer—function equivalence is undecidable! Lists compare as equal
if they have the same length and their elements compare as equal at each index.
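For example, assuming these rules describe PSI's = function and building lists with cons
(introduced just below), a session might look like:

psi> (= 3 (cons 3 ()))
#f
psi> (= (cons 1 (cons 2 ())) (cons 1 (cons 2 ())))
#t
psi> (= (cons 1 (cons 2 ())) (cons 1 (cons 3 ())))
#f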
In addition, you’ll add the following list-processing functions: cons, head, tail, and atom.
The first of these, cons, has arity 2, and prepends its first argument to its second, which must be
a list. This gives us the ability to build up lists.
The head function takes one argument and returns the first element of a list—it is an error if the
argument is anything other than a list, or if the list is empty.
The tail function also takes one argument, which must be a nonempty list, but returns
everything but the head.
Note that (cons (head l) (tail l)) = l for all non-empty lists l.
The atom function returns #t if given a non-list—that is, an atomic value—and #f if given any
list.
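Putting these together, a session might look like the following, writing () for the empty list:

psi> (cons 1 (cons 2 (cons 3 ())))
(1 2 3)
psi> (head (cons 1 (cons 2 (cons 3 ()))))
1
psi> (tail (cons 1 (cons 2 (cons 3 ()))))
(2 3)
psi> (atom 7)
#t
psi> (atom (cons 7 ()))
#f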
Last, but certainly not least, you will implement three special forms that aren’t functions: if,
quote, and def.
The first of these, if, enables conditional evaluation and execution—if is not a function, and
behaves differently from functions because the subforms may not be evaluated. A legal if-form
always has two or three subforms (arguments) and its semantics are as follows:
psi> (* 0 (/ 1 0))
ERROR (/ : division by zero)
psi> (if (= 3 4) (/ 1 0) (+ 5 7))
12
psi> (if (= 3 3) (/ 1 0) (+ 5 7))
ERROR (/ : division by zero)
The first example, because * is a function, must evaluate (/ 1 0), even though it is
mathematically irrelevant. The second example, because (= 3 4) ⟶ #f, ignores the (/ 1 0)
form entirely and safely evaluates (+ 5 7) ⟶ 12.
This functionality is something a user could not implement by defining functions if PSI did not
provide it, because ordinary functions always evaluate all of their arguments. Thus, while this
if has an S-expression form and looks like a function call, it is not a function at all, because it
chooses which of its subforms (arguments) it evaluates.
The def special form creates a variable in PSI. The first argument is always a symbol; the
second is an S-expression that will be evaluated—if no error occurs in the evaluation, the
resulting value is stored in that variable, and it is also returned.
It is legal to def a variable twice; a new binding replaces the old one. Warning: this makes the
old data structure(s) unreachable, and they must be deleted to avoid a memory leak.
If the second form triggers an error in evaluation, the def-form returns that error but the
definition (or update) does not happen—if the variable was unbound, it remains unbound; if it
held a value, it retains that value.
Please note that we have changed the evaluation semantics for symbols. In Phase 1, they were
self-evaluating; that is, e ⟶ e. In Phase 2, and going forward, evaluating a symbol looks up its
binding. For example, after a session like the one shown below, we will have {b: 15, c: 9, d: 135,
y: 17}, so d ⟶ 135 in this environment. There is no binding for x, because that def was not
successful, and so evaluating x will trigger an error.
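psi> (def b 15)
15
psi> (def c 9)
9
psi> (def d (* b c))
135
psi> (def x (/ 1 0))
ERROR (/ : division by zero)
psi> (def y 17)
17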
The global environment comes with some symbols predefined—namely, the builtin functions. In
a fresh REPL (nothing def’d yet) we have the following session:
$ ./psi-repl
psi> +
<builtin function +>
psi> (atom -)
#t
psi> (if * / %)
<builtin function />
psi> ((if (= 5 6) + *) 7 (+ 4 4))
56
We should step through this in the context of the read-eval-print cycle of the REPL. In the first
example, the cycle goes roughly like this:
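● Read: the text + is parsed into Symbol("+").
● Eval: evaluating a symbol looks up its binding; in the global environment, Symbol("+") is bound to the builtin addition function.
● Print: that function value is displayed as <builtin function +>.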
Functions are first-class values, so we can use them as arguments to other functions. In the
second example, we call atom on the - function—not the symbol, but the function to which -
evaluates. Since all nonlists are atoms, it returns #t. Functions can also participate in special
forms like if—since everything but #f is “truthy”, the function object to which * evaluates is
truthy and so eval(Symbol("/")), the function for division, is returned.
The fourth form does something you haven’t seen before: the subexpression in function position
is not the name of a function. It is not a special form, though, so ordinary semantics apply, and we
must first evaluate the subforms:
● (if (= 5 6) + *) ⟶ *
● 7 ⟶ 7
● (+ 4 4) ⟶ 8
The function application continues as usual—* is applied with arguments {7, 8}, resulting in a
return of 56. In other words, the first argument of a (non-special form) S-expression need not be
a symbol naming a function—as was required in Phase 1.
Finally, let's talk about quote, a third special form, which takes exactly one argument and "stops
evaluation" by returning the argument subform, unevaluated. That is, if the program
environment contains {a: 31},
● a ⟶ 31
○ That is, Symbol("a") ⟶ Number(31)
● (quote a) ⟶ a
○ That is, List([Symbol("quote"), Symbol("a")]) ⟶ Symbol("a")
● (quote (quote a)) ⟶ (quote a)
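In a fresh environment, a session exercising these rules might look like the following (the next
paragraph steps through it):

psi> (def a 31)
31
psi> a
31
psi> (quote a)
a
psi> (quote (+ 3 5))
(+ 3 5)
psi> (quote (quote a))
'a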
In the first example, we use def to create a binding; in the second, we look up the bound value.
The third example evaluates a quoted form by removing one level of quoting, but doing nothing
more. The fourth example constructs, but does not evaluate, the list (+ 3 5)—that is,
List([Symbol("+"), Number(3), Number(5)]).
The fifth example might seem strange. The evaluation has been explained, but the REPL
returns 'a as shorthand for (quote a); similarly, '(+ 3 5) is short for (quote (+ 3 5)).
It is up to you whether your REPL uses the shorthand when printing, but your reader must
support both forms.
If the pattern isn't clear, reread this or ask me. This can be a bit confusing, especially if you're
new to it, but you need to understand what you're implementing before you build it.
C is a compiled language—it's static, in the sense that most programs are built to do specific
things, and only complete the tasks they have been programmed to do. The source code, in
fact, disappears at runtime and only the (possibly optimized beyond recognition) executable
code remains. Lisp has a completely opposite philosophy: total dynamism. Code can be
ingested, transformed, and executed at runtime, on the fly.
psi> (cons '* (cons (cons '+ (cons 5 (cons 6 ()))) (cons (cons '- (cons 7 (cons 4 ()))) ())))
(* (+ 5 6) (- 7 4))
We have laboriously created the form (* (+ 5 6) (- 7 4)). It is code—a specific program
that, when executed ("evaluated") returns 33. It is also data; it is a list with three elements, two
of which are also lists.
The good news is that quoting—and the single-quote shorthand—allows us to construct the
above in a nicer way:
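psi> '(* (+ 5 6) (- 7 4))
(* (+ 5 6) (- 7 4))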
You are not responsible for implementing this, but most Lisps contain an eval function that can
evaluate such code "lists" at runtime—something like (eval (quote (* (+ 5 6) (- 7 4)))),
or, with the shorthand, (eval '(* (+ 5 6) (- 7 4))), either of which would return 33.
Lisp users write code that generates code fairly often. One technique is to develop macros,
which (unlike C's) are crucial to understanding Lisp code. For example, a short-circuiting and
is often implemented in standard libraries with a definition like so:
(defmacro and (x y)
  (list 'if x y #f))
Start early.