Intro To C - Module 6

Uploaded by

Andrew Fu
Copyright
© © All Rights Reserved
Introduction to C: Module 6

Weekly Reading: Beej Guide, Chapters 15-16, 20.5, 21-23

Buffer Overruns

We've now seen a number of things one can do with C—how to create interesting compound
data types, read and write files, and control memory allocation and release. We've also seen a
number of things one should never do—divide by zero, dereference null pointers, and access
arrays out of bounds.

Let us, in the context of out-of-bounds access, discuss things that can be done to a C program.

#include <inttypes.h>
#include <stdbool.h>
#include <stdio.h>

// Pretend this function is actually one-way.
uint32_t hash(int len, const char *str) {
    uint32_t acc = 0;
    for (int i = len - 1; i >= 0; i--) {
        acc *= 179;
        acc += str[i];
    }
    return acc;
}

typedef struct session {
    char name[16];
    char ok;
    char password[9];
} session;

int main() {
    session s = {0};
    printf("What is your name? ");
    gets(s.name);      // unsafe on purpose: gets does no bounds checking
    printf("What is your password? ");
    gets(s.password);  // ditto
    if (hash(8, s.password) == 0x16c720e3) {
        s.ok = 1;
    }
    if (s.ok) {
        printf("You get ALL the MacGuffins.\n");
    } else {
        printf("Sorry %s, but your MacGuffins are in another castle.\n",
               s.name);
    }
}

The hash function above is actually not hard to crack—but let's pretend otherwise, for this
exercise; more complex hash functions do exist that are believed to be computationally
infeasible to invert, even if one knows the entire mechanism. The intended operation is that only
a user who knows the password—"getcider"—can access the MacGuffins. Let's see how it
works.

$ ./buffer_overrun
What is your name? Mike
What is your password? dunno
Sorry Mike, but your MacGuffins are in another castle.
$ ./buffer_overrun
What is your name? Mike
What is your password? getcider
You get ALL the MacGuffins.

So far, so good. Notice that the name field seems irrelevant. Is it? Let's find out. We'll imagine
that our adversary knows the structure of the program above, but neither the password nor how
to invert the hash function.

$ ./buffer_overrun
What is your name? Badguy12345678901
What is your password? dunno
You get ALL the MacGuffins.

How did he exfiltrate the MacGuffins? Well, he gave a 17-character input for s.name, which
holds only 16 bytes and sits immediately before s.ok in memory. The 17th character, '1'
(value 49), was therefore written into s.ok, circumventing security by "poking" memory he
shouldn't have been able to access.

Much more sophisticated—and devastating—attacks, although beyond the scope of this course,
exist on this principle. The culprit here is the insecure function gets, which allows a user to give
a "string" of arbitrary length. Although this hasn't been discussed at length, stack frames contain
(in addition to local variables) return addresses used to transfer control, at the end of each
function, to the caller. Through trial-and-error, a hacker can discover a return address's location,
overwrite it, and thus transfer control elsewhere—say, to malicious code that can also be
injected via user input.

As a systems programmer, your job is to defend against things like this. User input should
never be trusted, and this is one reason why gets is deprecated (C11 removed it from the
standard library entirely); scanf should be treated with care and suspicion as well. Always
limit the ability of untrusted input to crash or subvert your program. Replacing the gets calls
with appropriate fgets calls, e.g. fgets(s.name, 16, stdin) instead of gets(s.name), fixes this
particular problem, because it allows you to set a limit on how many input characters are
consumed (here, at most 15 plus the terminating null byte), something you almost always
want to do when processing user input.
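
As a rough sketch of what a bounded read might look like, the helper below wraps fgets. The name read_line and its exact behavior (stripping the newline, discarding the overflow of an overlong line) are assumptions made for illustration, not part of the course material.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper (not from the course): reads one line of at most
// cap - 1 characters into buf, strips the trailing newline, and discards
// whatever part of an overlong line did not fit.
// Returns false on end-of-file or read error.
bool read_line(FILE *in, char *buf, size_t cap) {
    if (cap == 0 || fgets(buf, (int) cap, in) == NULL)
        return false;
    size_t len = strcspn(buf, "\n");
    if (buf[len] == '\n') {
        buf[len] = '\0';   // complete line: drop the newline
    } else {
        int c;             // overlong line: drain the rest of it
        while ((c = fgetc(in)) != EOF && c != '\n')
            ;
    }
    return true;
}
```

A call such as read_line(stdin, s.name, sizeof s.name) would then never write past the end of s.name, however long the input is.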

Format String Hygiene

Similarly, this is OK:

...
char user_str[256] = {0};
fread(user_str, 1, 255, stdin); // get user input (at most 255 bytes)
printf("%s", user_str);
...

but this is not:

...
char user_str[256] = {0};
fread(user_str, 1, 255, stdin); // get user input (at most 255 bytes)
printf(user_str); // BAD!!
...

If the user provides a "normal string" without format directives, it will work as expected, but a
malevolent user might include format directives. This can cause the printf call above to violate
its own (unenforced) type discipline, executing something like:

printf("Hello %x %x %x %x %x %x %x %x");

In higher-level languages, this would be a type error, but you've probably learned—through
frustrating experience—that printf is weakly typed, and will happily accept garbage. In the
case above, the program will print something like:

Hello 120a8 0 127a56ac 4e972af8 0 10 10 bbb774e0

The caller passes no arguments beyond the format string, but printf tries to consume 8, so
what happens? This call causes printf to read from wherever those 8 values would have been
had they been provided (registers or further up the stack, depending on the calling
convention), exposing unrelated contents and potentially giving a hacker information that she
could use for further exploits. Worse, there is an obscure %n directive that can be used to
write to memory.
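
A minimal sketch of the safe pattern follows. The helper name format_untrusted is hypothetical; the point is only that the format string is a fixed literal and the untrusted text is passed as data.

```c
#include <stdio.h>

// Hypothetical helper: copies untrusted text into out as plain data.
// The format string is the fixed literal "%s", so any '%' characters in
// untrusted are copied verbatim instead of being interpreted as directives.
// Returns what snprintf returns: the length the full output would have.
int format_untrusted(char *out, size_t cap, const char *untrusted) {
    return snprintf(out, cap, "%s", untrusted);
}
```

When the goal is simply to print the text, fputs(user_str, stdout) is another option, since fputs does not interpret format directives at all.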

So, don't ever let Bobby Tables get his name into your format strings.

A Warning About Pointer Syntax

The following are both legal and identical:

int* a = NULL;

int *a = NULL;

These days, we tend to prefer the former, because it correlates to our understanding of what we
are doing: declaring a variable of type int* and setting it to NULL. However, it has a
disadvantage, which is that it can be misleading when multiple declarations exist on one line.

You might think:

int* a, b;

would declare both a and b as int*. It doesn't. The compiler treats this line as equivalent to:

int* a;
int b;

If you intend for both a and b to be pointers, you need to declare them like so:

int *a, *b;

You can think of this line as saying "*a is an int and *b is an int," which is equivalent to what
we want.

In general, of course, you shouldn't declare more than one variable on a line—C is verbose and
vertical and we're all mostly comfortable with it being that way—and, in any case, you should
always initialize variables.
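
One way to check this for yourself, assuming a C11 compiler, is to ask the compiler what type it assigned. The TYPE_NAME macro and the two function names below are made up for this sketch.

```c
#include <stddef.h>

// C11 _Generic selects a string according to the static type of its argument.
#define TYPE_NAME(x) _Generic((x), int: "int", int *: "int *", default: "other")

const char *second_in_mixed_decl(void) {
    int* a = NULL, b = 0;      // only a is a pointer; b is a plain int
    (void) a;
    return TYPE_NAME(b);       // selects "int"
}

const char *second_in_star_decl(void) {
    int *a = NULL, *b = NULL;  // both a and b are pointers
    (void) a;
    return TYPE_NAME(b);       // selects "int *"
}
```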

Two Different Ways of Defining Label Constants...

There are two historically common ways of defining label constants.

One is to use the preprocessor:

#define Diamond 0
#define Heart 1
#define Spade 2
#define Club 3

The other is to use file-scope const ints:

const int Diamond = 0;
const int Heart = 1;
const int Spade = 2;
const int Club = 3;

They both work. They're both legal. They're subtly different, though. The first one uses the
preprocessor (we'll discuss it later) to replace, for example, the token Club with 3 wherever
it appears (this is technically a macro), while the second allocates four ints in the executable
and creates global variables that refer to them. Historically, the first was faster because the
second required a load from static memory. However, a modern optimizing compiler knows what
to do: Club is const, so the value can be inlined, skipping the load, and this is no longer
an issue.

The preferred modern way to do this sort of thing is, however, to use an enum.

... and a Third, Creating enum Types

C's enum types allow you to manage small sets of possible labels; underneath, they are
integers, but you'll get more useful debugging information.

For example, if you were coding a card game, you'd likely use:

typedef enum suit {
    Diamond,
    Heart,
    Spade,
    Club
} suit;

This gives you a suit type and the constants Diamond, Heart, Spade, and Club as desired.
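
A short sketch of how such a type might be used follows. The function suit_name is hypothetical; it relies on the facts that enumerators are integer constants (numbered from 0 unless stated otherwise) and can therefore appear as case labels.

```c
typedef enum suit {
    Diamond,   // 0
    Heart,     // 1
    Spade,     // 2
    Club       // 3
} suit;

// Hypothetical helper: maps a suit to a printable name.
const char *suit_name(suit s) {
    switch (s) {
        case Diamond: return "Diamond";
        case Heart:   return "Heart";
        case Spade:   return "Spade";
        case Club:    return "Club";
    }
    return "unknown";  // reached only if s holds a value outside the enum
}
```

Many compilers can warn when a switch over an enum type misses one of its enumerators, which is one practical advantage over plain ints.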

Function Pointers

As functions are realized as blocks of assembly code, each one has an address in the
executable's code segment. Function pointers, available in C, are pointers like any other.

As you've seen, function types are "backward" compared to our modern expectations. While
we'd write f: int -> float in a language like OCaml or Haskell, C would have us declare:

float f(int x) {...}

If you want a name for a function type, you can use a construction like so:

typedef double(*i2d)(int x);


This creates a type called i2d for functions that take one int and return a double. This would
be compatible with:

double f(int x) {
return x * 3;
}

but also:

double g(int y) {
return y * y;
}

because it doesn't matter what the parameters are called. In fact, these typedefs can be written
without naming parameters, so long as types are given:

typedef int(*i2i)(int);

Functions with a given type signature are values in the corresponding type and can be treated
as such. For example, one can write:

i2d h = g;

and later:

double a = h(17);

You can use function pointers to implement callbacks, as in JavaScript, or to write, for example,
the functional programming combinators of map, filter, reduce, et cetera. (While
Python is not usually considered an FP language, its list comprehensions are a syntactic variant
on the concept.) Below is a C version of map that works in place:

typedef int(*i2i)(int);

int double_it(int x) { // is i2i and can be used as such
    return x * 2;
}

int square(int y) { // ditto
    return y * y;
}

int negate(int z) { // ditto
    return -z;
}

void map_in_place(int len, int* data, i2i f) {
    for (int i = 0; i < len; i++)
        data[i] = f(data[i]);
}

The function parameter is f, and its application is the call f(data[i]).

It is used below:

int main() {
    int a[7] = {1, 2, 3, 4, 5, 6, 7};
    i2i b[3] = {double_it, square, negate};
    for (int i = 0; i < 3; i++)
        map_in_place(7, a, b[i]);
}

It applies b[0], then b[1], then b[2]—all functions—to each element of the array a.
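
To make the effect concrete, here is the same pipeline packaged as a function that can be checked. The wrapper apply_pipeline is a hypothetical addition; the other definitions are repeated from above so the sketch stands alone.

```c
typedef int (*i2i)(int);

int double_it(int x) { return x * 2; }
int square(int y) { return y * y; }
int negate(int z) { return -z; }

void map_in_place(int len, int *data, i2i f) {
    for (int i = 0; i < len; i++)
        data[i] = f(data[i]);
}

// Hypothetical wrapper around the loop in main: runs every function in
// fs over the whole array, one full pass per function, in order.
void apply_pipeline(int len, int *data, int n_fs, const i2i *fs) {
    for (int i = 0; i < n_fs; i++)
        map_in_place(len, data, fs[i]);
}
```

Each element x ends up as -(2x)^2, so starting from {1, 2, 3, 4, 5, 6, 7} the array finishes as {-4, -16, -36, -64, -100, -144, -196}.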

Above, we have rigid functionality; the map_in_place we've defined only works on int arrays,
so if we wanted to work on double arrays, we'd have to make a d2d function type and write a
function accordingly. We'd like to be able to write generic functions. C++ solves this problem with
templates, but in C, if we want genericity, we need to achieve it with void pointers.
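
One possible shape for a generic version is sketched below. The names updater, generic_map_in_place, halve_double, and negate_int are all invented for illustration; the design (the caller supplies the element size, and the callback receives a pointer to each element) is an assumption modeled on how the standard library handles the same problem.

```c
#include <stddef.h>

// Hypothetical callback type: updates one element in place through a
// void* that points at it.
typedef void (*updater)(void *elem);

// Hypothetical generic counterpart of map_in_place. Because the element
// type is unknown here, the caller must pass the element size.
void generic_map_in_place(void *base, size_t n_items, size_t size, updater f) {
    char *p = base;  // char* so that pointer arithmetic counts bytes
    for (size_t i = 0; i < n_items; i++)
        f(p + i * size);
}

void halve_double(void *elem) {
    double *d = elem;  // the callback knows the real element type
    *d /= 2.0;
}

void negate_int(void *elem) {
    int *x = elem;
    *x = -*x;
}
```

The price of this flexibility is that the compiler can no longer check that the callback matches the array's element type; passing halve_double for an int array would compile and then misbehave.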

We examine qsort, an in-place generic sorting function available in stdlib.h. Its type
signature is:

void qsort(void *base, size_t n_items, size_t size,
           int (*compar)(const void *, const void *));

The to-be-sorted array base can have elements of any type, so qsort takes a void* that points to the
head of it, without committing to a specific type. It's just a pointer. Therefore, in order to find the
items it must sort, qsort must know how big they are in addition to how many there are.
Furthermore, qsort doesn't know how to compare objects of arbitrary types: the programmer
might want to sort structs on a specific field, for example, or sort strings in a case-insensitive
way. It is, therefore, the user's responsibility to write the comparison function so that
compar(p1, p2) returns:

● a negative result when *p1 is less than *p2,
● 0 when *p1 and *p2 compare as equal, and
● a positive result when *p1 is greater than *p2.

Here's an example that sorts C-strings (char *) by alphabetical order:

int alpha_compar(const void *p1, const void *p2) {
    // p1 and p2 point at array elements, which are themselves char* strings.
    char* s1 = *((char **) p1);
    char* s2 = *((char **) p2);

    int idx = 0;
    while (s1[idx]) {
        int d = s1[idx] - s2[idx];
        if (d != 0)
            return d;
        else
            idx += 1;
    }
    // if here, s1[idx] == 0
    return -s2[idx];
}

The comparison function must operate on pointers to the things we want to sort, since qsort
operates on places in memory (i.e., on objects of type unknown to qsort) rather than on the things
themselves. Above, we cast the const void* to char**, then get the char* (strings) they point
to, then compute a lexicographic comparison.

If, later, we have:

char* words[10] = {"orange", "grapefruit", "apple", "mangosteen", "mango",
                   "grape", "", "watermelon", "banana", "durian"};
for (int i = 0; i < 10; i++)
    printf("%s\n", words[i]);

qsort(words, 10, sizeof(char *), alpha_compar);

the qsort call will sort the char* array words so as to put the strings in alphabetical order; i.e.
words[0] will be "" and words[1] will be "apple".

If we want regular ascending integer sorting, we can achieve it with this function:

int int_compar(const void *p1, const void *p2) {
    int x1 = *((int *) p1);
    int x2 = *((int *) p2);

    if (x1 < x2) return -1;
    else if (x1 == x2) return 0;
    else return 1;
}
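
The same pattern extends to the case mentioned earlier of sorting structs on a specific field. In the sketch below, the player struct and the names by_score and sort_players are hypothetical, chosen only for illustration.

```c
#include <stdlib.h>

// Hypothetical record type used only for this example.
typedef struct player {
    const char *name;
    int score;
} player;

// Orders players by ascending score. Each void* points at one whole
// player struct inside the array being sorted.
int by_score(const void *p1, const void *p2) {
    const player *a = p1;
    const player *b = p2;
    // Evaluates to -1, 0, or 1 using comparisons only.
    return (a->score > b->score) - (a->score < b->score);
}

void sort_players(player *ps, size_t n) {
    qsort(ps, n, sizeof ps[0], by_score);
}
```

Note that qsort is handed sizeof ps[0], the size of a whole struct, so entire records move together even though only the score field is compared.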

Module 6 Questions (Answers due October 3)

6.1. Does replacing gets with fgets as above fix all security issues in the main function of the
buffer-overrun example? Why or why not?

6.2. An often-seen formulation of int_compar above is:

int int_compar(const void *p1, const void *p2) {
    int x1 = *((int *) p1);
    int x2 = *((int *) p2);

    return x1 - x2;
}

It seems like this might be better, insofar as it uses one operation and no branching, as opposed
to two comparisons. Why isn't it?

6.3. Pick any high-level language you've used for any project of appreciable size. Tell me
something you think the runtime uses function pointers for, and why. What do you believe is
gained, and what is lost, due to this decision? You don't have to be right; just give me your best
guess.

6.4. Predict what this program will do. (You will not be graded on the correctness of your
prediction.) Then run it. Was your prediction correct? Why or why not?

#include <inttypes.h>
#include <stdio.h>

int main () {
    uint8_t a = 37;
    uint8_t b = 38;
    int c = (int) (uint8_t) (a - b);
    int d = (int) (a - b);

    if (c == d) {
        printf("equal\n");
    } else {
        printf("not equal\n");
    }
    return 0;
}

Module 6–8 Project

This first phase of your interpreter is due October 17. For Phase 1, you will implement a basic
calculator with Lisp syntax—it will implement a small subset of PSI, a language designed for this
course. An example session is below:

$ ./psi
psi> 1
1
psi> (+ 2 3)
5
psi> (* 4 (+ 5 6))
44
psi> (/ 137 9)
15
psi> (= 18 (* 3 6))
#t
psi> (+ 3 #t)
ERROR (+ : requires numeric args)
psi> (+ 2 (/ 1 0))
ERROR (division by zero)
psi> (quit)
$

For this phase, you are responsible for:

● reading input from the user as an S-expression, e.g. "(+ (- 4 5) (* 6 7))", and parsing it;
● evaluating the S-expression fully, e.g.
    ○ (+ (- 4 5) (* 6 7)) ⟶ (+ -1 (* 6 7))
    ○ (+ -1 (* 6 7)) ⟶ (+ -1 42)
    ○ (+ -1 42) ⟶ 41
● printing the final result to the console;
● and looping, that is, doing it again unless the quit function was called.

The program you will write is called a REPL (read-eval-print loop), for the reasons above.

For this phase, you will implement four base types:

● Integers: you don't need to worry about floats or doubles for this project, although you
  may. Use 64-bit integers. For this assignment, you may ignore overflow completely.
● Booleans: values include #t and #f.
● Symbols: for now, you only need to concern yourself with the symbols for the functions
  you'll be implementing, e.g. +, =, quit.
● Errors: a wrapper type around a string to give a helpful message to the user, e.g.,
  "division by zero".

You will include six functions:

● +, -, *, / (integer operations), arity of two.
    ○ You're welcome to handle other arities, e.g., (* 3 4 5) ⟶ 60 and (+) ⟶ 0.
      For this phase, you don't have to.
    ○ If any argument is not an integer, it's an Error.
● = (equality check), arity of two.
    ○ You'll handle other arities later; for this phase, you can handle the 2-ary case
      only.
    ○ Accepts integer and boolean arguments.
    ○ Always false across types, e.g. (= 1 #t) ⟶ #f.
● quit, arity zero; exits the REPL.
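
As one illustration of how a built-in might report failure, here is a sketch of integer division that returns either a value or an error message. The int_result type and the name psi_divide are assumptions made for this example; your interpreter may well represent errors differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical result type: either an integer or an error message.
typedef struct int_result {
    bool ok;
    int64_t value;      // meaningful only when ok is true
    const char *error;  // meaningful only when ok is false
} int_result;

// Sketch of the "/" built-in on two already-evaluated integer arguments.
// C's integer division truncates toward zero, so 137 / 9 gives 15.
// INT64_MIN / -1 overflows and is not handled here, since the assignment
// allows overflow to be ignored.
int_result psi_divide(int64_t a, int64_t b) {
    if (b == 0)
        return (int_result){ .ok = false, .value = 0, .error = "division by zero" };
    return (int_result){ .ok = true, .value = a / b, .error = NULL };
}
```

Checking for a zero divisor before dividing matters because integer division by zero is undefined behavior in C, not a recoverable error.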

You can implement this however you choose. However, we should discuss some common
ideas. The first thing you have to do is read input. You will parse a user-supplied string as what
will likely be an S-expression tree structure wherein each item is either an atom (e.g., 5, #t, +)
or a list of elements that are also atoms or lists. For example, (), (1 2), (+ 3 #t), and (+ 4
(* 5 (+ 6 (* 7 8)))) are all lists.

It is up to you how you want to implement lists—you can use arrays, linked lists, or tree
structures. You are not expected to handle cyclic or infinite lists; they will not be used for this
project.
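
For orientation only, here is one possible representation among many: a tagged union in which lists are arrays of child pointers. Every name below (value, value_kind, make_int, make_list, free_value) is hypothetical, and nothing in the assignment requires this layout.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

typedef enum value_kind { V_INT, V_BOOL, V_SYMBOL, V_ERROR, V_LIST } value_kind;

typedef struct value value;
struct value {
    value_kind kind;  // tag saying which union member is valid
    union {
        int64_t integer;  // V_INT
        bool boolean;     // V_BOOL
        char *text;       // V_SYMBOL name or V_ERROR message (heap-allocated)
        struct {
            size_t len;
            value **items;  // len child pointers, each owned by this list
        } list;             // V_LIST
    } as;
};

value *make_int(int64_t n) {
    value *v = malloc(sizeof *v);
    if (v != NULL) {
        v->kind = V_INT;
        v->as.integer = n;
    }
    return v;
}

// Creates a list with len slots, all initially NULL. Returns NULL on failure.
value *make_list(size_t len) {
    value *v = malloc(sizeof *v);
    if (v == NULL)
        return NULL;
    v->kind = V_LIST;
    v->as.list.len = len;
    v->as.list.items = len ? calloc(len, sizeof *v->as.list.items) : NULL;
    if (len != 0 && v->as.list.items == NULL) {
        free(v);
        return NULL;
    }
    return v;
}

// Frees a value and, recursively, everything it owns.
void free_value(value *v) {
    if (v == NULL)
        return;
    if (v->kind == V_SYMBOL || v->kind == V_ERROR) {
        free(v->as.text);
    } else if (v->kind == V_LIST) {
        for (size_t i = 0; i < v->as.list.len; i++)
            free_value(v->as.list.items[i]);
        free(v->as.list.items);
    }
    free(v);
}
```

Giving every list sole ownership of its children keeps free_value simple; a design that shares children between lists would need reference counts or some other scheme.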

The evaluator—for now, as we will make adjustments in later phases—works like so:

● atoms evaluate to themselves; that is, 1 ⟶ 1, #t ⟶ #t.
● the empty list also evaluates to itself; () ⟶ ().
● a general list (<expr-0> <expr-1> ... <expr-N>) is evaluated recursively like so:
    ○ the subexpressions are evaluated in order, left to right;
        ■ if any evaluate to errors, the first error is the whole evaluation's result
    ○ otherwise, naming the results of expressions, (<val-0> <val-1> ... <val-N>)...
        ■ if <val-0> is a symbol that names a function, and <val-1> ... <val-N>
          match in type and number, call it with those values as arguments.
        ■ otherwise, return an error.

We'll tackle evaluation of symbols in Phase 2; for now, treat them as "self-evaluating" like atoms
and (), except when they occur in function (0th) position and are used to look up appropriate
functions.

For example, the lists above evaluate like so:


psi> ()
()
psi> (1 2)
ERROR (non-function in function position)
psi> (+ 3 #t)
ERROR (+ : requires numeric args)
psi> (+ 4 (* 5 (+ 6 (* 7 8))))
314

You may assume that each line of user input will never exceed 4096 characters.

Your REPL will, by necessity, allocate memory for intermediate computations. You are
responsible for free-ing everything you allocate. You should use a dynamic analysis tool like
valgrind or Clang's AddressSanitizer to be sure you are not leaking memory.

Start early.

Module 6 Writeup

Please submit your answers to questions 6.1–6.4 in PDF form, answers on one page, by
October 3. There is nothing due with regard to the code project—due October 17—but you
should get started on it now.
