Intro To C - Module 3
Some Syntax
So that there are no surprises, we should cover some syntax you will encounter when you work with C.
a + b // adds a and b
a - b // subtracts b from a
a * b // multiplies a by b
a / b // divides a by b -- may be integer division (fraction discarded)
a % b // a modulo b (or: remainder when a is divided by b)
The behavior of / is type dependent—and this can be frustrating and confusing. Consider this
program:
#include <stdio.h>

int main() {
    int c = 9;
    int d = 2;
    int e = c / d;
    double f = c / d;
    printf("%d %f\n", e, f);
    return 0;
}
The output is: 4 4.000000. Why? Because the program performs integer division before
assigning the result to the double, f. A fix for this is to typecast—in this case, promote—the
variable c into the more inclusive type:
double f = (double) c / d;
The compiler will then also promote d, and use floating-point division.
When you go the other way, the compiler rounds toward zero—thus, (int) 3.999999 is 3, and
(int) -4.5 is -4. Note that integer division also rounds toward zero—thus, -7 / 3 is -2, not -3.
The comparison operators were invented before the bool type existed, and therefore return int
values—1 if true, 0 if false.
int g = 3 < 5; // g will be 1 (true)
int h = 3 == (1 + 2); // h will be 1 (true)
int i = 3 > 5; // i will be 0 (false)
int j = (0 < 1) + (0 > 1) + (0 < 3); // j will be 1 + 0 + 1 = 2
You may be familiar with the bitwise operators, which operate on numbers as bit-strings:
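a & b // bitwise AND
a | b // bitwise OR
a ^ b // bitwise XOR (exclusive OR)
~a // bitwise NOT (flips every bit)
a << n // shifts a left by n bits
a >> n // shifts a right by n bits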
You probably won't be doing this low-level bit-twiddling often, but it's important to recognize
these operators when they are used. For example, you'll sometimes see people write a >> 1
instead of a / 2; historically, the former was faster, and therefore preferred. Likewise, & (AND)
and a bitmask were used in lieu of modular arithmetic when the modulus is a power of two—for
example, a & 255 (or a & 0xFF) is equivalent to a % 256 for non-negative values.
The bitwise NOT (~) and the logical NOT (!) are not interchangeable. The former flips every bit
of an integer; the latter maps 0 to 1 and all nonzero values to 0. If you're using an expression as
a condition, the logical version is almost always what you want.
C also offers short-circuiting logical AND and OR—&& and ||—which behave as you're used to
from languages like Python: the right operand is evaluated only if the left one doesn't already
decide the result. This matters for safety. With int variables n and d in scope, the first
if-statement below is safe; the second is not:
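if (d != 0 && n / d > 100) { /* safe: when d is 0, && short-circuits and the division never runs */ }
if (n / d > 100 && d != 0) { /* NOT safe: the division executes before d is checked */ }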
Division By Zero
We know what happens in Python or Java when you do this highly illegal thing—an exception is
thrown, which you can either catch or let crash your program. C doesn't have native exceptions.
What do you think happens in C? What does the following program, when compiled and run,
do?
#include <stdio.h>

int main() {
    int d = 1 / 0;
    printf("%d\n", d);
    return 0;
}
I got 317964304. Is this a substantial mathematical finding—that 1/0 is, in fact, 317964304? Of
course not. This is a case of undefined behavior (UB).
The compiler is cooperative. It wants to make your program better, so it applies all sorts of
optimizations to the code. We've already discussed one (strength reduction) by which, if it's
going to be faster to compute a & 0xFF than a % 256, it will issue instructions for the
(equivalent) former expression.
Similarly, if it sees:

int d = 1001 / 7;

it will perform the division at compile time (constant folding) and emit the equivalent of:

int d = 143;
That's the good news. You can write code for humans and not have your program's
performance suffer. Compilers do a lot of nice things for you that you don't have to think
about—you don't even know about them.
The bad news is that, whether you know it or not, you are in a contract with your compiler that
you will not do certain prohibited things—integer division by zero is one of them; signed integer
overflow (as discussed in the prior module) is another; dereferencing a null pointer (which we'll
talk about soon) is a third. The compiler assumes that undefined behavior will never occur. Its
job is to compile correct code—and if a transformation makes correct code better, but changes
the results in the case of undefined behavior, it will do so. This means you can assume nothing
about what may occur after UB is triggered.
In my case, the compiler decided that, since integer division by zero "never happens," per
contract, this nonsensical division instruction can be deleted—I agreed never to do it, then I did
it, but the compiler has the right to pretend I didn't. In fact, the compiler was nice about it; when
it spotted the suspicious division, it issued a warning. (Always heed compiler warnings!) In this
particular case, the compiler allocated space for d, but did not write anything there—the
"constant" expression I gave it, 1 / 0, was trash that it threw away. Thus, the program executed
with d uninitialized, containing a nondeterministic "garbage" value. This led to buggy output and
a normal exit—no indication of anything wrong with the program!
So, you can't assume that dividing by zero always summons the nasal demons. It may or may
not summon the nasal demons. It may summon some of the nasal demons but not summon
other nasal demons. You can't depend on a specific failure mode—anything is possible! And the
bugs introduced by undefined behavior, at high optimization levels (e.g., -O3), get weirder and
harder to debug.
Python and Java automatically check these things for you—that you're not dividing by zero, or
dereferencing a bad pointer—at a small performance cost that is tolerable for most applications.
C, designed for heavy optimization, does not. If you skip the check, C assumes you have
proved it safe to do so.
Undefined behavior is a justifiably feared and often hated feature of C. In general, we prefer that
buggy programs tell us what went wrong, then crash. Why would anyone design a language
where, instead, you might get utter nonsense?
The answer is that C compilers optimize ruthlessly; generating extremely fast executables is C's
purpose. Now, an optimizing compiler might do dozens or hundreds of passes, each
transforming the code into a faster version of itself—inlining and specializing function calls;
replacing x % 256 with x & 0xFF; replacing x + 0 with just x, and so on. Some optimization
passes exist solely to expose other optimizations—for example, x + 0 is rarely seen in
human-written source code, but may occur after dozens of previous passes, exposing an
opportunity to delete an addition.
Let's say that, during the course of optimization, a series of passes produces this pattern:
if ((x + 1) < x)
    f();
Even better, let's say this is the only place where f is called. Since x + 1 > x always, the
condition will always be false, and we can delete the whole expression—and f, which becomes
dead code we can get rid of. Right?
If x is unsigned, the answer is no. Overflow behavior for unsigned integers, recall, is defined—all
arithmetic is modulo 2^N. So, there is exactly one value for which (x + 1) < x, and that's 2^N - 1,
because x + 1 is then 0. The compiler cannot legally remove this statement.
On the other hand, with signed integers, overflow behavior is undefined—the compiler, thus, can
assume signed-integer overflow never occurs, and therefore safely rewrite (x + 1) < x as
false, which enables the optimization to go through.
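To see the defined, unsigned case in action, here's a short program you can compile yourself:

#include <stdio.h>
#include <limits.h>

int main() {
    unsigned int x = UINT_MAX;    // the one value for which (x + 1) < x
    printf("%d\n", (x + 1) < x);  // prints 1: x + 1 wraps around to 0
    // With signed int, the analogous test (y + 1) < y at y == INT_MAX is
    // undefined behavior, so the compiler may rewrite it as false.
    return 0;
}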
The compiler always wants to improve your code—undefined behavior is not a case of it
deliberately being vicious when you (unintentionally!) break the rules, but one of it ignoring
corner cases that, if it were to service them, would make correct code significantly slower.
If this seems pedantic, well... it is, but we have to understand undefined behavior in order to
tackle C's most distinct and interesting language feature: pointers.
Pointers—A Motivation
It's probably obvious, by now, that =, the assignment operator, is not mathematical equality. You
can write x = 3, but the mathematically equivalent 3 = x is not legal code, because you cannot
assign a new value to the constant 3. The left operand is interpreted as a place; the right one as
a value. So, x = x + 1 means, "Take the value in x, add 1 to it, and store it in x's place."
C functions use the pass-by-value convention, which means that they operate on their
arguments' values—not their places. This can be confusing, and frustrating. Let's say that you'd
like to create a counter and pass it around to (imagine this) count events as they occur during
the course of a program.
void use_counter_doesnt_work() {
    int my_counter = 0;
    increment(my_counter);
    increment(my_counter);
    printf("The thing happened %d times.\n", my_counter); // prints 0, not 2
}
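For concreteness, a sketch of increment consistent with the description below:

int increment(int counter) {
    counter = counter + 1; // increments increment's own local copy
    return counter;        // the calls above discard this return value
}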
There's nothing wrong with the increment function—it takes x and returns x + 1—but it
doesn't do what we want. Pass-by-value means that arguments are copied into local
variables—counter in increment diverges from my_counter in use_counter_doesnt_work.
So, we correctly increment the copy, but not the original my_counter.
Often, pass-by-value semantics are desirable—we don't usually want work performed to have
"side effects" on the original source. When we do, however, we need a different type—a type
that allows us to manipulate the place where the counter's int is stored.
void use_counter() {
    int counter_val = 0;
    int* counter = &counter_val; // @B
    increment_ptr(counter);
    increment_ptr(counter);
    printf("The thing happened %d times.\n", counter_val); // prints 2
}
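And a sketch of increment_ptr, with the (@A) dereference discussed below:

void increment_ptr(int* counter) {
    *counter = *counter + 1; // @A
}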
Pointers are still passed by value—that doesn't change—but they give us direct access to the
places in memory where the things we care about—in this case, counter_val—are stored. In
use_counter, we use (@B) the address-of operator, &, to get that pointer—an address in (virtual)
memory. One way to remember this syntax is to note that the & looks like a handle—it gives us
"a handle on" the object so we can manipulate it. The type of counter is int*, or a pointer to an
int—a piece of data whose sole purpose is to tell us where in memory we can find that int..
In increment_ptr, we (@A) use the * (dereference) operator to access the value that exists in
counter_val—the value that counter points to—then store the incremented value back into
the (unchanged) place where the pointer points. This distinction is crucial. We would get
incorrect results if either of the dereferencings were omitted:
● counter = *counter + 1—sets the pointer (not what it points to!) to *counter + 1,
which probably doesn't point anywhere.
● *counter = counter + 1—stores in counter_val not 1 plus its prior value, but 1 plus
its address—an irrelevant and probably large integer.
● counter = counter + 1—legal, but only alters a local int*, having no observable
effect.
If this isn't 100% clear, ask questions on Ed, because this is one of the topics where people
struggle the most.
An int* points to a value of type int, but it also has a value of its own. For example, we might
add two lines like the following to the end of use_counter:
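printf("counter value: %d\n", counter_val);
printf("pointer value: %p\n", (void*) counter); // %p expects a void*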
We use the %d format directive in the first case, because counter_val is a regular int; since
counter is a pointer, we have to use the %p directive instead—bad things happen in printf if
the types don't align with the format string.
counter value: 2
pointer value: 0x7ffd6a7efc8c
Numeric types have a natural default value, regardless of their size—zero. Pointer types have
also been given such a value—the problem is that there might be nothing to point to! You can
create a widget* even if no widgets exist yet—the constant NULL exists as the value of a
pointer that doesn't point to anything. Internally, its value is 0, an illegal address on modern
machines.
You can neither write to, nor read from, a null pointer—both kinds of access are undefined
behavior. (Usually, you'll get a seg-fault—the operating system recognizing an illegal read, and
killing the program.) Thus, if there's any danger of a pointer being NULL, you must check it.
Since NULL tests as false, you can use if (ptr) as shorthand for if (ptr != NULL).
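For example (find_widget here is a hypothetical function that may return NULL):

int* ptr = find_widget(); // hypothetical: returns a valid pointer, or NULL
if (ptr) {                // shorthand for: if (ptr != NULL)
    *ptr = 42;            // dereference only inside the check
}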
Arrays
C's most native collection type is the array, as used below.
int main() {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}; // @D
    square_array(10, a); // @E
    print_array(10, a);
    return 0;
}
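Here is a sketch of the two helpers, consistent with the discussion below—note the int*
parameters (@C). They'd be defined above main, with #include <stdio.h> at the top of the file:

void square_array(int n, int* a) { // @C
    for (int i = 0; i < n; i++) {
        a[i] = a[i] * a[i];
    }
}

void print_array(int n, int* a) { // @C
    for (int i = 0; i < n; i++) {
        printf("%d ", a[i]);
    }
    printf("\n");
}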
Compile and run this on your machine. It should print the first ten positive square numbers.
The first thing this program does (@D) is create an array of length 10 and initialize it with the
values {1...10}. C does not require you to initialize arrays, but—as with all variables—you must
do so before you read from them. If nothing else, you can always zero-initialize your array;
something like int b[100] = {0} will give each slot a consistent starting value of zero.
Our functions print_array and square_array are designed to operate on, well, arrays. So
why do they take pointers (@C) as arguments? Confusingly, C does allow you the syntax of
passing an array as an argument—but you're not actually allowed to do that. I repeat: You
cannot pass an array into a function. What you are passing, even when you use array syntax, is
a pointer. This is probably what you want—you'd usually prefer not to have pass-by-value
semantics for a 100,000,000-element array, as the copying would be expensive.
Instead, when you "pass an array" you are, in reality, passing a pointer to it—the syntax used
above, which I prefer, makes it explicit that this is happening—to tell the function where the
array is. Of course, these functions also need to know where their arrays end, so we pass (@E)
the size (10) as well, which we must do, because the pointer a "doesn't know" that it's the head
of an array. It's literally just an address—to be specific, the address of the head (0th) element.
When you have an int* called a, you can refer to a + 1 and C will do something
useful—instead of adding one byte to a, it will add sizeof(int), giving us the address of the
next int over. In other words, if a is 1000 and sizeof(int) is—as on most modern systems—4,
then a+1 will be 1004 and a+2 will be 1008 and so on... allowing us to address a contiguous
block of data as if it were a real array.
This is what C gives you. It gives you mechanisms to set aside blocks of memory, so they won't be
used for anything else. When you declare int a[10], you get 10 * sizeof(int) bytes that no
other variables should interfere with. It gives you pointer arithmetic and allows you to use
brackets as "syntactic sugar"—a[i] is valid shorthand for *(a + i). Similarly, if you need to
refer to the location of the i-th array element, you can use either &(a[i])—the address of (&)
the i-th element in a—or a + i. Both work.
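A quick check of these equivalences:

int a[4] = {10, 20, 30, 40};
printf("%d %d\n", a[2], *(a + 2));                   // both print 30
printf("%p %p\n", (void*) &(a[2]), (void*) (a + 2)); // the same address, twice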
In the example above, the compiler allocates enough space to store 10 ints. Addresses for
a[-1] and a[10] are mathematically valid, of course, but they point at data other than the array.
Reading them might "peek" at unrelated data, or touch an unmapped address, in which case the
OS will kill the program (seg fault). Writing to them is at least as dangerous. In the bad old
days before virtual memory and process isolation, this could take down a system—and it is still
triggered intentionally by bad actors today.
Any time you work with arrays, you need to know where they end. Java—and other high-level
languages—achieve this by including a .length field in the data structure. C doesn't. It's up to
you to set and follow your policy.
C's own string functions follow a convention instead: a string is an array of chars terminated by
the null character, '\0' (whose value is zero). It might look like out-of-bounds access is taking
place below, since word[5] will be read, but this is legal and will produce valid output:
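char word[6] = "hello"; // 'h','e','l','l','o', plus '\0' in word[5]
printf("%s\n", word);   // reads characters until it finds the '\0'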
Since ASCII text strings never contain the null character—it's a "meta" character used in no
human language—this behavior is benign for English text strings, but it also means that you
should never use C's string functions on general byte arrays.
We can now explain the type signature of main that you often see—
int main(int argc, char** argv). Java uses String[] args, and C's argv is, conceptually, an
array of (char*) strings, but since C doesn't have real arrays, we need to know how
long—hence, argc—it is. It is correct but not useful to think of argv as "a pointer to a pointer to
a char"; it is a pointer to the head of an array of char*—and each of those points to the head of
a string.
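For example, this program prints each of its own arguments:

#include <stdio.h>

int main(int argc, char** argv) {
    for (int i = 0; i < argc; i++) {
        printf("argv[%d] = %s\n", i, argv[i]); // each argv[i] is a char*
    }
    return 0;
}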
Although arrays are implemented using pointers, you cannot always treat the two as
interchangeable, as we'll see below. This program is legal:
#include <stdio.h>
#include <string.h>

int main() {
    char word[6];
    strcpy(word, "hello");
    word[0] = 'j';
    printf("%s\n", word);
    return 0;
}
So is this one:
#include <stdio.h>

int main() {
    char word[6] = "hello";
    word[0] = 'j';
    printf("%s\n", word);
    return 0;
}
This one, however, crashes (on my system):
#include <stdio.h>

int main() {
    char* word = "hello";
    word[0] = 'j'; // undefined behavior: writes to a string literal
    printf("%s\n", word);
    return 0;
}
The first one allocates a char[6] that main owns and, at runtime, uses the strcpy function to
initialize it. The second argument to strcpy is a pointer to a string literal—a constant stored
somewhere in the program for its use. As main owns word—its own copy of the string—it can
modify it at will. No issue here.
The second program creates a char[6] and initializes it on creation; the initializer syntax used
above is equivalent to:
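char word[6] = {'h', 'e', 'l', 'l', 'o', '\0'};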
That's fine; word is a char[6] owned by main and it can do what it wants with it.
In the third program, we are initializing the char* called word to point NOT to a block of storage
that main owns, but to the place where the string literal for "hello" is stored. This... could be a
problem. Historically, strings were a substantial fraction of an executable's storage—for
example, half of a 1980s RPG's ROM might be dialogue text—and so it was desired to store
them in only one place. As the compiler has the right to use the same destination for a dozen
distinct uses of "hello", it will often prefer read-only memory—it would be undesirable if one
user of a string "constant" could modify it and cause breaking changes for other users. Thus,
you can get a seg-fault from something that may look, at first glance, like it should be legal.
You should think of string literals, therefore, as const char*, and never mutate them. If you
must mutate one, first use strcpy to make a copy you own.
Module 3 Questions
3.1: Read the documentation on strcpy and strncpy. What is the main difference between
these functions? Why might you use one and not the other?
3.2: Is this legal code? Why or why not? What does it do?
void mystery(char* x) {
    while (*x) {
        if (*x > 96) *x -= 32;
        x++;
    }
}
3.3: What does the following program do? Why does it give different sizeof values in different
places?
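One such program—a sketch, assuming the intended listing passes an array into a function and
prints sizeof in both places:

#include <stdio.h>

void report(char word[6]) {
    printf("in report: %zu\n", sizeof(word));
}

int main() {
    char word[6] = "hello";
    printf("in main: %zu\n", sizeof(word));
    report(word);
    return 0;
}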
Week 3 Project
Create a command-line utility that tabulates the letters in all words (which do not have to be
valid English words) supplied as arguments, and prints out the count of each letter across the
whole list. All words will be in lower-case ASCII letters, of length 1 to 128, and there will be no
more than 255 of them. Example usages:
$ ./tabulate
Please submit your answers to 3.1-3.4 as a PDF on Canvas. Your answers to the first three
questions should fit within one page. Include 3.4 on a separate page (or pages).