Intro To C - Module 6
Buffer Overruns
We've now seen a number of things one can do with C—how to create interesting compound
data types, read and write files, and control memory allocation and release. We've also seen a
number of things one should never do—divide by zero, dereference null pointers, and access
arrays out of bounds.
Let us, in the context of out-of-bounds access, discuss things that can be done to a C program.
#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>
/* The session type (fields name, password, ok) and hash are defined elsewhere. */
int main() {
    session s = {0};
    printf("What is your name? ");
    gets(s.name);
    printf("What is your password? ");
    gets(s.password);
    if (hash(8, s.password) == 0x16c720e3) {
        s.ok = 1;
    }
    if (s.ok) {
        printf("You get ALL the MacGuffins.\n");
    } else {
        printf("Sorry %s, but your MacGuffins are in another castle.\n",
               s.name);
    }
}
The hash function above is actually not hard to crack—but let's pretend otherwise, for this
exercise; more complex hash functions do exist that are believed to be computationally
infeasible to invert, even if one knows the entire mechanism. The intended operation is that only
a user who knows the password—"getcider"—can access the MacGuffins. Let's see how it
works.
$ ./buffer_overrun
What is your name? Mike
What is your password? dunno
Sorry Mike, but your MacGuffins are in another castle.
$ ./buffer_overrun
What is your name? Mike
What is your password? getcider
You get ALL the MacGuffins.
So far, so good. Notice that the name field seems irrelevant. Is it? Let's find out. We'll imagine
that our adversary knows the structure of the program above, but neither the password nor how
to invert the hash function.
$ ./buffer_overrun
What is your name? Badguy12345678901
What is your password? dunno
You get ALL the MacGuffins.
How did he exfiltrate the MacGuffins? Well, he gave a 17-character input for s.name, a 16-byte
buffer that happened to sit next to s.ok on the stack. The 17th character spilled past the end of
the buffer and wrote '1' (value 49) into s.ok, circumventing security by "poking" memory he
shouldn't have been able to access.
Much more sophisticated (and devastating) attacks build on this principle, although they are
beyond the scope of this course. The culprit here is the insecure function gets, which allows a
user to give a "string" of arbitrary length. Although this hasn't been discussed at length, stack
frames contain (in addition to local variables) return addresses used to transfer control, at the
end of each function, to the caller. Through trial and error, a hacker can discover a return
address's location, overwrite it, and thus transfer control elsewhere: say, to malicious code that
can also be injected via user input.
As a systems programmer, your job is to defend against things like this. User input should
never be trusted. This is one reason why gets is deprecated (it was removed from the standard
library entirely in C11), and scanf should be treated with care and suspicion as well. Always limit
the ability of untrusted input to crash or subvert your program. Replacing the gets calls with
appropriate fgets calls, e.g. fgets(s.name, 16, stdin) instead of gets(s.name), fixes this
particular problem, because it allows you to set a limit on how many input characters are
consumed, something you almost always want to do when processing user input.
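Format strings are another route for hostile input. Printing user input as data is fine:
...
char user_str[256] = {0};
fgets(user_str, sizeof user_str, stdin); // get user input
printf("%s", user_str);
...
Passing that same input to printf as the format string is not:
...
char user_str[256] = {0};
fgets(user_str, sizeof user_str, stdin); // get user input
printf(user_str); // BAD!!
...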
If the user provides a "normal string" without format directives, it will work as expected, but a
malevolent user might include format directives. This can cause the printf call above to violate
its own (unenforced) type discipline, executing something like:
printf("Hello %x %x %x %x %x %x %x %x")
In higher-level languages, this would be a type error, but you've probably learned, through
frustrating experience, that printf is weakly typed and will happily accept garbage. In the
case above, the program will print "Hello" followed by eight hexadecimal values that have
nothing to do with the program's intended output.
The caller supplies no arguments after the format string, but printf consumes 8, so what
happens? Formally, this is undefined behavior. In practice, printf reads from wherever those 8
values would have been had they been provided (argument registers and then upstack, depending
on the calling convention), exposing unrelated stack contents and potentially giving a hacker
information that she could use for further exploits. Worse, there is an obscure %n directive that
can be used to write to memory.
So, don't ever let Bobby Tables get his name into your format strings.
A Warning About Pointer Syntax
There are two common ways to write a pointer declaration:
int* a = NULL;
int *a = NULL;
These days, we tend to prefer the former, because it correlates to our understanding of what we
are doing: declaring a variable of type int* and setting it to NULL. However, it has a
disadvantage, which is that it can be misleading when multiple declarations exist on one line.
One might expect that
int* a, b;
would declare both a and b as int*. It doesn't. The compiler treats this line as equivalent to:
int* a;
int b;
If you intend for both a and b to be pointers, you need to declare them like so:
int *a, *b;
You can think of this line as saying "*a is an int and *b is an int," which is equivalent to what
we want.
In general, of course, you shouldn't declare more than one variable on a line—C is verbose and
vertical and we're all mostly comfortable with it being that way—and, in any case, you should
always initialize variables.
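There are two traditional ways to give names to a small set of integer constants. One is:
#define Diamond 0
#define Heart 1
#define Spade 2
#define Club 3
The other is:
const int Diamond = 0;
const int Heart = 1;
const int Spade = 2;
const int Club = 3;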
They both work. They're both legal. They're subtly different, though. The first one uses the
preprocessor—we'll discuss it later—to replace, for example, the string "Club" with 3 wherever
it appears—this is technically a macro—while the second allocates four ints in the executable,
and creates global variables that refer to them. Historically, the first was faster because the
second required a load from static memory. However, a modern optimizing compiler knows what
to do—Club is const, and so the value can be inlined, skipping the load—and this is no longer
an issue.
The preferred modern way to do this sort of thing is, however, to use an enum.
C's enum types allow you to manage small sets of possible labels; underneath, they are
integers, but you'll get more useful debugging information.
For example, if you were coding a card game, you'd likely use:
typedef enum { Diamond, Heart, Spade, Club } suit;
This gives you a suit type and the constants Diamond, Heart, Spade, and Club as desired.
Function Pointers
As functions are realized as blocks of assembly code, each one has an address in the
executable's code segment. Function pointers, available in C, are pointers like any other.
As you've seen, function types are "backward" compared to our modern expectations. While
we'd write f: int -> float in a language like OCaml or Haskell, C would have us declare:
float f(int x);
If you want a name for a function type, you can use a construction like so:
typedef double(*i2d)(int x);
The type i2d then admits:
double f(int x) {
    return x * 3;
}
but also:
double g(int y) {
    return y * y;
}
because it doesn't matter what the parameters are called. In fact, these typedefs can be written
without naming parameters, so long as types are given:
typedef int(*i2i)(int);
Functions with a given type signature are values in the corresponding type and can be treated
as such. For example, one can write:
i2d f = square;
and later:
double a = f(17);
You can use function pointers to implement callbacks, as in JavaScript, or to write, for example,
the functional programming combinators of map, filter, reduce, et cetera—note that, while
Python is not usually considered an FP language, its list comprehensions are a syntactic variant
on the concept. Below is a C version of map that works in place:
typedef int(*i2i)(int);
int main() {
    int a[7] = {1, 2, 3, 4, 5, 6, 7};
    i2i b[3] = {double_it, square, negate};
    for (int i = 0; i < 3; i++)
        map_in_place(7, a, b[i]);
}
It applies b[0], then b[1], then b[2]—all functions—to each element of the array a.
Above, we have rigid functionality; the map_in_place we've defined only works on int arrays,
so if we wanted to work on double arrays, we'd have to make a d2d function type and write a
function accordingly. We'd like to be able to write generic functions. C++ solves this problem
with templates, but in C, if we want genericity, we need to achieve it with void pointers.
We examine qsort, an in-place generic sorting function available in stdlib.h. Its type
signature is:
void qsort(void *base, size_t nmemb, size_t size,
           int (*compar)(const void *, const void *));
The to-be-sorted array base can have elements of any type, so qsort takes a void* that points
to the head of it, without committing to a specific type. It's just a pointer. Therefore, in order to
find the items it must sort, qsort must know how big they are (size) in addition to how many
they are (nmemb).
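Furthermore, qsort doesn't know how to compare objects of arbitrary types; the programmer
might want to sort structs on a specific field, for example, or sort strings in a case-insensitive
way. It is, therefore, the user's responsibility to write the comparison function so that
compar(p1, p2) returns a negative number if the object at p1 should come before the object at
p2, zero if the two are equivalent in the ordering, and a positive number if the object at p1
should come after the object at p2. For strings, a comparison function might look like this:
int compare_strings(const void* p1, const void* p2) { // function name is illustrative
    const char* s1 = *(char**) p1;
    const char* s2 = *(char**) p2;
    int idx = 0;
    while (s1[idx]) {
        int d = s1[idx] - s2[idx];
        if (d != 0)
            return d;
        else
            idx += 1;
    }
    // if here, s1[idx] = 0;
    return -s2[idx];
}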
The comparison function must operate on pointers to the things we want to sort, since it
operates on places in memory (i.e., on objects of type unknown to qsort) rather than on the
things themselves. Above, we cast each const void* to char**, then get the char* (string) it
points to, then compute a lexicographic comparison.
Given a char* array words, a qsort call that passes this comparison function, with
sizeof(char*) as the element size, will sort the strings into alphabetical order; e.g., if the
array contains "" and "apple" and nothing that sorts before them, words[0] will be "" and
words[1] will be "apple".
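If we want regular ascending integer sorting, we can achieve it with this function:
int compare_ints(const void* p1, const void* p2) { // function name is illustrative
    int x1 = *(int*) p1;
    int x2 = *(int*) p2;
    if (x1 < x2)
        return -1;
    if (x1 > x2)
        return 1;
    return 0;
}
6.1. Does replacing gets with fgets as above fix all security issues in the main function of the
buffer_overrun program? Why or why not?
6.2. Consider this alternative integer comparison function:
int compare_ints_alt(const void* p1, const void* p2) { // function name is illustrative
    int x1 = *(int*) p1;
    int x2 = *(int*) p2;
    return x1 - x2;
}
It seems like this might be better, insofar as it uses one operation and no branching, as opposed
to two comparisons. Why isn't it?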
6.3. Pick any high-level language you've used for any project of appreciable size. Tell me
something you think the runtime uses function pointers for, and why. What do you believe is
gained, and what is lost, due to this decision? You don't have to be right; just give me your best
guess.
6.4. Predict what this program will do. (You will not be graded on the correctness of your
prediction.) Then run it. Was your prediction correct? Why or why not?
#include <inttypes.h>
#include <stdio.h>
int main () {
    uint8_t a = 37;
    uint8_t b = 38;
    int c = (int) (uint8_t) (a - b);
    int d = (int) (a - b);
    if (c == d) {
        printf("equal\n");
    } else {
        printf("not equal\n");
    }
    return 0;
}
This first phase of your interpreter is due October 17. For Phase 1, you will implement a basic
calculator with Lisp syntax—it will implement a small subset of PSI, a language designed for this
course. An example session is below:
$ ./psi
psi> 1
1
psi> (+ 2 3)
5
psi> (* 4 (+ 5 6))
44
psi> (/ 137 9)
15
psi> (= 18 (* 3 6))
#t
psi> (+ 3 #t)
ERROR (+ : requires numeric args)
psi> (+ 2 (/ 1 0))
ERROR (division by zero)
psi> (quit)
$
The program you will write is called a REPL (read-eval-print loop): as in the session above, it
reads an expression, evaluates it, prints the result, and repeats.
You can implement this however you choose. However, we should discuss some common
ideas. The first thing you have to do is read input. You will parse a user-supplied string as what
will likely be an S-expression tree structure wherein each item is either an atom (e.g., 5, #t, +)
or a list of elements that are also atoms or lists. For example, (), (1 2), (+ 3 #t), and (+ 4
(* 5 (+ 6 (* 7 8)))) are all lists.
It is up to you how you want to implement lists—you can use arrays, linked lists, or tree
structures. You are not expected to handle cyclic or infinite lists; they will not be used for this
project.
The evaluator—for now, as we will make adjustments in later phases—works like so:
We'll tackle evaluation of symbols in Phase 2; for now, treat them as "self-evaluating" like atoms
and (), except when they occur in function (0th) position and are used to look up appropriate
functions.
You may assume that each line of user input will never exceed 4096 characters.
Your REPL will, by necessity, allocate memory for intermediate computations. You are
responsible for freeing everything you allocate. You should use a dynamic analysis tool like
valgrind or Clang's AddressSanitizer to be sure you are not leaking memory.
Start early.
Module 6 Writeup
Please submit your answers to questions 6.1–6.4 in PDF form, answers on one page, by
October 3. There is nothing due with regard to the code project—due October 17—but you
should get started on it now.