Coding for Economists
Organized by
Harvard Economics Professional Development
Spring 2019
The first part of the presentation focuses on general computer science concepts,
guidelines, and programming tips with an aim to:
The second part of the presentation with Frank Pinter will introduce version
control via Git.
2. I have not personally used all of the below tricks in my work, but am
providing them here for your reference.
Ljubica “LJ” Ristovska Coding for Economists February 21, 2019 4 / 41
General Principles
The two main principles for coding and managing data are [1]:
Specifically:
[1] Like the agents we study, we too as programmers can be time inconsistent - the goal is to go from a naif
to a sophisticate.
Outline
1 Organization
2 Automation
3 Abstraction
4 Testing
5 Documentation
6 Efficiency
7 Miscellaneous
Organization
Adopt an organizational system and adhere to it. This system should include
consistency in:
• Directory structure
• Code and comment style
• File naming conventions
• Variable naming conventions
• Output data structure
• Data
    • Raw
    • Analytic
    • Temp
• Code
• Output
    • Tables
    • Figures
• Documentation
"Wow, I know where everything is!"
Why is this useful? Preserves organization over time and makes automating
certain tasks (e.g., file input and output) easier.
Why is this useful? Makes code readable for people (and future you), not just
machines.
More resources on code style: Google's style guides for Python and R, which apply broadly.
• Use distinctive and informative file and variable names (e.g., temp vs.
gender).
• Be generous and consistent with use of mixed case and underscores (e.g.,
laborElasticity vs. laborelasticity vs.
labor_elasticity).
• Note: not all languages will distinguish between upper and lower case
characters in variable names – make sure to check!
• Know that there are different types of input data sets (e.g., .csv, .xls, .dat)
and that each programming language has its own output data set format (e.g.,
.mat, .dta).
• Know which variables uniquely identify an observation for each data set
(called keys).
• Values should not have any internal structure, i.e., you should not need to process
variables to use them.
• The data should not contain any redundant information, e.g., do not save a
panel data set as a set of dummies year_1980, year_1981...
Why is this useful? Clarifies relationships between different data sets. Makes for
easy merges of data sets.
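A minimal sketch of checking a key before a merge, assuming a hypothetical panel stored as a list of records (the variable names and values are made up for illustration):

```python
from collections import Counter

# Hypothetical long-format panel: one row per (state, year) observation,
# with year stored as a single variable rather than as a set of dummies.
panel = [
    {"state": "MA", "year": 1980, "income": 10.0},
    {"state": "MA", "year": 1981, "income": 11.0},
    {"state": "NY", "year": 1980, "income": 12.0},
    {"state": "NY", "year": 1981, "income": 13.0},
]


def duplicate_keys(rows, key_vars):
    """Return the key values that appear in more than one row."""
    counts = Counter(tuple(row[var] for var in key_vars) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# (state, year) uniquely identifies an observation; state alone does not.
print(duplicate_keys(panel, ["state", "year"]))  # empty list: a valid key
print(duplicate_keys(panel, ["state"]))          # non-empty: not a key
```

Running a check like this on both data sets before merging makes it clear whether the merge is one-to-one or many-to-one.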
Automation
• While every language has the aforementioned primitive data types, some
languages have more advanced data structures (e.g., vectors, matrices, lists,
data frames, stacks, queues, dictionaries...).
• Different data structures are stored/linked differently and are more conducive
to certain types of operations than others.
Why is this useful? Using the appropriate data structure for the task at hand
can improve code readability and efficiency.
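As an illustration in Python (the identifiers and values below are hypothetical), the same information can be held in structures that suit different operations:

```python
# Hypothetical data: county identifiers in a sample.
county_ids = [25017, 25025, 36061, 6037]

# A list preserves order, but a membership test scans elements one by one.
in_list = 36061 in county_ids

# A set stores unique values and uses hashing for membership tests.
county_set = set(county_ids)
in_set = 36061 in county_set

# A dictionary maps each key to a value, so lookups by key need no loop.
county_names = {
    25017: "Middlesex",
    25025: "Suffolk",
    36061: "New York",
    6037: "Los Angeles",
}
name = county_names[25025]
print(in_list, in_set, name)
```

For a handful of items the difference is negligible; it starts to matter when the same lookup is repeated many times over a large collection.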
• Whatever you don’t want to keep, you can comment out, flag out, or save in
a separate file (Is it really a good idea to delete it? More with version control
in part two of this presentation.)
Why is this useful? Maintains a clear record of the steps that were taken on
the project.
• A frequent mistake: tabulate the data, read the min/max/mean values of
a variable from that tabulation, and hard code them for use later on.
• Instead, consider tabulating the data and storing the min/max/mean values of
the variable you want to iterate through in a local variable.
• Then, whenever the data changes, you do not need to change these values;
they will update automatically.
• Consider using a Makefile: a file of rules that specify which output files to
build, from which inputs, and with which shell commands.
Why is this useful? Automation saves future-you time, reduces errors, and
makes it easy to replicate results.
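A short sketch of storing computed values instead of hard coding them, assuming a hypothetical list of survey years:

```python
# Hypothetical data: the survey years observed in the current extract.
years = [1980, 1981, 1983, 1985]

# Fragile: hard-coded bounds must be edited by hand when the data changes.
# first_year, last_year = 1980, 1985

# More robust: compute the bounds from the data and store them in variables.
first_year = min(years)
last_year = max(years)
mean_year = sum(years) / len(years)

# Anything that depends on the bounds now updates with the data.
missing_years = [y for y in range(first_year, last_year + 1) if y not in years]
print(first_year, last_year, mean_year, missing_years)
```

If a new extract adds 1986, nothing in the script needs to be edited.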
Generally it is not a good idea to use globals: it is difficult to keep track of which
functions have access to them and update them.
Why is this useful? Locals are very useful for defining fixed parameters (e.g., a
CRRA of 3), doing text substitution, or using them for automating file
input/output.
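A minimal Python sketch of the same idea, where a fixed parameter is defined once as a local and passed explicitly rather than read from a global. The function names and the output file name are hypothetical:

```python
import math


def crra_utility(consumption, gamma):
    """CRRA utility; gamma is the coefficient of relative risk aversion."""
    if gamma == 1.0:
        return math.log(consumption)
    return consumption ** (1.0 - gamma) / (1.0 - gamma)


def main():
    gamma = 3.0  # local, fixed parameter defined in one place
    # Text substitution: the parameter value also names the output file.
    output_name = f"utility_gamma_{gamma:g}.csv"
    return crra_utility(2.0, gamma), output_name


print(main())
```

Because `gamma` is an argument, it is always clear which functions depend on it, and changing it in one place changes both the computation and the file name.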
• With a consistent folder structure, you can refer to each of the subdirectories
of your project without having to re-type the entire directory path again.
Why is this useful? Avoids manual file input/output, allows you to re-use code
across projects by only changing the directory, avoids the need for
re-inputting/outputting data when the data changes.
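One way to sketch this in Python uses `pathlib`. The root folder name and the `table_path` helper are hypothetical; the subfolders assume a data/code/output layout:

```python
from pathlib import Path

# Hypothetical project root: the only line to change when the project moves.
PROJECT_ROOT = Path("my_project")

# Subdirectories follow the same layout in every project.
RAW_DIR = PROJECT_ROOT / "data" / "raw"
ANALYTIC_DIR = PROJECT_ROOT / "data" / "analytic"
TABLES_DIR = PROJECT_ROOT / "output" / "tables"


def table_path(name):
    """Build the path of an output table from its name alone."""
    return TABLES_DIR / f"{name}.csv"


print(table_path("summary_stats").as_posix())
```

Re-using the script in another project then only requires changing `PROJECT_ROOT`.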
Abstraction
Suppose you perform the same repetitive set of commands in multiple sections of
one script, or across many scripts, but you just change the inputs to this set of
commands.
Many programming languages allow you to write your own functions and
modularize code.
Abstraction is very powerful. It is clean, less error-prone, and allows for re-use.
Not everything should be abstracted. The key theme is that things should be
separated based on function. Random pieces of code should not belong in a
function that has a specific role and/or encapsulates a certain behavior.
Why is this useful? Abstraction is one of the most powerful concepts in good
computer programming. Because it describes behavior, it allows code to be re-used
across many different projects and applications. It saves time and increases the
readability of code.
A perfect example of the power of abstraction: for the past 5 years, I have been using the same function I wrote to output summary statistics from Stata into Excel,
in the format that I like, for many problem sets and projects.
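A minimal Python sketch of the same idea (not the Stata function mentioned above): one hypothetical `summarize` function, written once and re-used with different inputs.

```python
import statistics


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# The same function is re-used for different variables (hypothetical data).
wages = [10.0, 12.0, 14.0]
hours = [35, 40, 45]
wage_stats = summarize(wages)
hours_stats = summarize(hours)
print(wage_stats)
print(hours_stats)
```

The function has a single role, computing summary statistics; formatting or exporting the results would belong in a separate function.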
Testing
Testing is trickier, since it depends on the code’s function. The key questions to
ask while testing are:
What does this mean? Test code piece by piece, ensuring that each piece works
correctly before moving on to the next.
• This becomes crucial when dealing with abstraction – generally good practice
to ensure a function works correctly on a simple example, before checking
that the code that calls that function works.
• Modular testing is made easy by splitting up your code into multiple scripts
based on function.
Why is this useful? Testing code all at once takes longer and makes it easy to
miss errors/mistakes. Testing it modularly allows you to verify accuracy of code
piece by piece and narrow down the areas where there could be a mistake.
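A small sketch of modular testing, assuming two hypothetical functions where one calls the other. The inner function is checked on simple cases with known answers before the caller is checked:

```python
def real_wage(nominal_wage, price_index, base=100.0):
    """Deflate a nominal wage by a price index normalized to `base`."""
    return nominal_wage * base / price_index


def test_real_wage():
    # Simple cases whose answers are known by hand.
    assert real_wage(20.0, 100.0) == 20.0  # base-year prices: no change
    assert real_wage(20.0, 200.0) == 10.0  # prices doubled: real wage halves


def mean_real_wage(nominal_wages, price_index):
    """Caller that relies on real_wage; tested only after real_wage passes."""
    deflated = [real_wage(w, price_index) for w in nominal_wages]
    return sum(deflated) / len(deflated)


# Step 1: verify the building block. Step 2: verify the code that calls it.
test_real_wage()
assert mean_real_wage([20.0, 40.0], 200.0) == 15.0
```

If the second check fails while the first passes, the mistake is in `mean_real_wage`, not in `real_wage`.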
Other testing tips
• Do not test in the interpreter; test in a script (and keep those scripts).
• Check values of key variables and data sets.
• Use the built-in debugger: adding break-points is very helpful for checking
values of key variables/data sets while the code runs.
• With large data, you might not want to test the code on the first pass with
the entire data set. Many languages allow you to run a piece of code on a
sub-sample of 100 observations. Make use of this feature to save some time
when testing!
• If you have code that runs slowly, print notes to yourself to delineate what
pieces of code have finished running. That way, if the code stops due to an
error, you’ll know where the error was.
• Don’t trust your future self: instead of adding comments in the code that
say “check for XYZ when updated data arrives” or “be careful: data must be
in numeric format”, add your own assertions and error checks.
• Read your results! Think through them! You’ll be surprised how many
special cases and errors you can catch that way.
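Several of these tips can be sketched together in Python: assertions instead of warning comments, a first pass on a sub-sample, and progress notes. The data and the `clean_incomes` function are hypothetical:

```python
def clean_incomes(rows):
    """Validate and convert hypothetical income records."""
    # Explicit checks replace a comment like "data must be in numeric format".
    for row in rows:
        assert isinstance(row["income"], (int, float)), f"non-numeric: {row!r}"
        assert row["income"] >= 0, f"negative income: {row!r}"
    return [float(row["income"]) for row in rows]


# Hypothetical large data set.
rows = [{"id": i, "income": 1000 + i} for i in range(10_000)]

# First pass: test on a sub-sample of 100 observations.
sample = rows[:100]
print("Step 1: cleaning sub-sample...")
cleaned = clean_incomes(sample)
print(f"Step 1 done: {len(cleaned)} rows cleaned")
```

If updated data arrives with income stored as text, the script stops at the assertion with a message naming the offending row instead of silently producing wrong results.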
Documentation
Most people document in-code, which means that they add comments
documenting variables and functions where appropriate. This is the easiest to
maintain, but it requires you to remember to update the comments when you
change something.
Efficiency
• Do not run things in the graphical user interface – learn how to run code
from the terminal.
• Use flags in your master script to turn on/off scripts to run.
• Parallelize: make your computer execute many operations simultaneously
(e.g., use multiple cores of your processor, or use multiple processors on
different machines, i.e., clusters).
• When working with small data sets, parallelizing won’t save you much time;
you are better off optimizing your sequential code.
• Very useful to know how to parallelize if you are working with large data sets.
• Vectorize: a form of parallelization that does not require separate
cores/clusters. Involves using data structures that are conducive to applying
a procedure to multiple items at once.
• Know your data structures and know when vectors/matrices and matrix
algebra are faster.
• Matrix algebra can in some instances replace loops and provide efficiency.
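The tip about flags in a master script could be sketched as follows. The step functions are hypothetical stand-ins for separate scripts:

```python
# Flags: set to False to skip a step that does not need to be re-run.
RUN_CLEAN = False
RUN_ANALYSIS = True
RUN_TABLES = True


def clean_data():
    return "clean"


def run_analysis():
    return "analysis"


def make_tables():
    return "tables"


# Each entry pairs a flag with the (hypothetical) step it controls.
STEPS = [
    (RUN_CLEAN, clean_data),
    (RUN_ANALYSIS, run_analysis),
    (RUN_TABLES, make_tables),
]

# Run only the steps whose flag is switched on, in order.
completed = [step() for flag, step in STEPS if flag]
print(completed)
```

Skipping a slow cleaning step that has not changed is often the cheapest efficiency gain available.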
Miscellaneous
Regular expressions are a way to describe patterns in variable, value, file, and even
directory names.
Why is this useful? Makes searching for values/files that satisfy a particular
pattern easy.
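For example, using Python's built-in `re` module on a hypothetical list of file names:

```python
import re

# Hypothetical file names in a raw-data folder.
files = ["cps_1980.csv", "cps_1981.csv", "cps_notes.txt", "acs_2010.csv"]

# "cps_", then exactly four digits (captured), then ".csv".
pattern = re.compile(r"cps_(\d{4})\.csv")

cps_years = []
for name in files:
    match = pattern.fullmatch(name)
    if match:
        cps_years.append(int(match.group(1)))

print(cps_years)
```

The same pattern both selects the matching files and extracts the year from each name, which pairs well with consistent file naming conventions.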