The secrets of inverse brogramming

Brogramming is the art of looking good while you code. This talk is about the opposite: the art of writing good looking code (in R).

1. The secrets of inverse brogramming By Richie Cotton July 2013 © TDX Group | Private and confidential

2. bit.ly/18IllEo

4. hsl.gov.uk

5. commons.wikimedia.org

6. cc2e.com

7. oreilly.com

9. Idea 1: Use a style guide commons.wikimedia.org

10. getCRANmirrors (utils) ARMAtoMA (stats) setMomentsFMstable (FMStable)

11. yaml.load_file (yaml) tk_choose.files (tcltk)

12. lowerCamelCase, UpperCamelCase: Java, C#; lower_under_case: python, ruby

13. add/remove increment/decrement open/close begin/end insert/delete show/hide create/destroy lock/unlock source/target first/last min/max start/stop get/put next/previous up/down get/set old/new From Code Complete, 2nd Edition, p172

14. get/assign (base) readRDS/saveRDS (base)

15. y<-sin(2*pi+x)^2

16. y<-sin(2*pi+x)^2 y <- sin(2 * pi + x) ^ 2

17. bwplot(decrease ~ treatment, OrchardSprays, groups = rowpos, panel = "panel.superpose", panel.groups = "panel.linejoin", xlab = "treatment", key = list(lines = Rows(trellis.par.get("superpose.line"), c(1:7, 1)), text = list(lab = as.character(unique(OrchardSprays$rowpos))), columns = 4, title = "Row position"))

18. bwplot(
  decrease ~ treatment,
  OrchardSprays,
  groups       = rowpos,
  panel        = "panel.superpose",
  panel.groups = "panel.linejoin",
  xlab         = "treatment",
  key          = list(
    lines   = Rows(
      trellis.par.get("superpose.line"),
      c(1:7, 1)
    ),
    text    = list(
      lab = as.character(
        unique(OrchardSprays$rowpos)
      )
    ),
    columns = 4,
    title   = "Row position"
  )
)

19. library(sig)
library(assertive)
setwd("workpace/my project")
source("backend.R")

20. Idea 2: Functions are a black box commons.wikimedia.org

21. library(sig)
sig(read.csv)
read.csv <- function(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)

22. list_sigs(pkg2env(tools))
add_datalist <- function(pkgpath, force = FALSE)
bibstyle <- function(style, envir, ..., .init = FALSE, .default = TRUE)
buildVignettes <- function(package, dir, lib.loc = NULL, quiet = TRUE, clean = TRUE, tangle = FALSE)
check_packages_in_dir <- function(dir, check_args = character(), check_args_db = list(), reverse = NULL, check_env = character(), xvfb = FALSE, Ncpus = getOption("Ncpus", 1), clean = TRUE, ...)

23. list_sigs(baseenv(), pattern = "apply")
apply <- function(X, MARGIN, FUN, ...)
eapply <- function(env, FUN, ..., all.names = FALSE, USE.NAMES = TRUE)
lapply <- function(X, FUN, ...)
mapply <- function(FUN, ..., MoreArgs = NULL, SIMPLIFY = TRUE, USE.NAMES = TRUE)
rapply <- function(object, f, classes = "ANY", deflt = NULL, how = c("unlist", "replace", "list"), ...)
sapply <- function(X, FUN, ..., simplify = TRUE, USE.NAMES = TRUE)
tapply <- function(X, INDEX, FUN = NULL, ..., simplify = TRUE)
vapply <- function(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)

26. write_sigs( pkg2env(tools), "tools sigs.R" )

27. sig_report( pkg2env(Hmisc), too_many_args = 25, too_many_lines = 200 )

28. The environment contains 509 variables of which 504 are functions.
Distribution of the number of input arguments to the functions:
   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
   4  62 117  90  41  41  22  17  14  17  15  10   8   4   6   3   1
  17  18  19  20  21  22  23  24  27  28  30  33  35  48  66
   4   5   3   2   4   2   2   2   2   1   1   1   1   1   1
These functions have more than 25 input args:
[1] dotchart2                     event.chart
[3] labcurve                      latex.default
[5] latex.summary.formula.reverse panel.xYplot
[7] rlegend                       transcan
Distribution of the number of lines of the functions:
          1          2      [3,4]      [5,8]     [9,16]    [17,32]
          1         47         15         57         98        108
    [33,64]   [65,128]  [129,256]  [257,512] [513,1024]
         81         58         30          8          1
These functions have more than 200 lines:
 [1] areg            aregImpute
 [3] event.chart     event.history
 [5] format.df       labcurve
 [7] latex.default   panel.xYplot
 [9] plot.curveRep   plot.summary.formula.reverse
[11] print.char.list rcspline.plot
[13] redun           rlegend
[15] rm.boot         sas.get
[17] summary.formula transcan

29. use a style guide; try the sig package; show your code to other people; read Code Complete; buy my book

30. The information contained herein is strictly confidential and may not be reproduced, distributed or published without the written permission of TDX Group. The information contained within this document is believed to be reliable but no warranty, expressed or implied, is given by TDX Group that the information is complete or accurate nor that it is fit for a particular purpose. All such warranties are expressly disclaimed and excluded. TDX Group shall not be liable (including in negligence) for direct, indirect or consequential losses, damages, costs or expenses arising out of or in connection with the use of or reliance on the information contained in this document. Thank you

Editor's Notes

Hi there. I realise that the title is rather cryptic so I suppose I'd better explain myself. Did any of you see Rasmus Bååth's talk earlier? Good, coz this is the same talk, more or less. But it's important so some of you need telling twice. Brogramming is the art of looking good while you code. It's a reaction against the idea that in order to write software, you have to be a shy retiring nerd. I really like the idea that coding is now mainstream enough that it has gone beyond a niche group of people. If jocks can code, then anyone can. That's jocks in the American sense of sporty people, not jocks as in Scottish people. I'm sure there are lots of talented Scottish coders. This is important because with R, you can't just be a statistician, you have to be a programmer as well. And so today, I'm going to talk about writing code. It's the opposite idea to looking good while you code; it's the idea that you should write good-looking code.
I've collected together a few links related to the presentation at this bitly address.
Just to introduce myself. I work in the debt collection industry at TDX Group in Nottingham in the UK, and I also have my own stats consultancy called The Damned Liars. Debt collection is a booming industry right now, so TDX is hiring. If everything I say to you in the next quarter of an hour sounds really obvious, then come chat with me afterwards if you are interested.
My first statistics job was at the Health and Safety Laboratory. This is part of the UK civil service, and we worked in very multidisciplinary teams. So I spent a lot of time working with chemists and engineers and also proper software developers.
One guy in particular taught me a great deal. His name is Alex Hogg. I don't have a photo of him, but Uncle Fester is a surprisingly close approximation. Every now and then, I'd get stuck on a problem, and I'd ask Alex for help. He'd come over, look at my screen, and - every single time, I kid you not - he'd wince when he saw my code. After a while, this started to get to me, and I realised that I had to do something about it. So I spent about two years reading a load of books on software development practice.
The good news is that you don't need to spend two years studying good coding practice. Pretty much everything that you need to know and more is contained in this one book. You should absolutely read this. Has anyone read it? The bad news is that this book is 850 pages, so it's still a bit of effort. It's very readable, but it still took me a few months to get through. There is one more piece of good news. You can get most of the benefit from just two simple ideas. In a quarter of an hour's time, you'll understand them both. I'm hoping that everything I say is going to sound familiar to some of you. Most of the big problems with writing code have been solved by software developers. In my experience, however, statisticians don't get software development training, so even quite senior statisticians who should know better can end up doing things the hard way.
While I'm recommending things, I'd also like to recommend my own book. It's called Learning R, and it's out in September, but you can preorder it now. It's a gentle introduction to the language. The first half teaches you R vocabulary - all the data types and the common functions that you should know - and the second half walks you through a data analysis workflow.
This is Guy, one of my colleagues. He is a stereotypical brogrammer. He's got the shades, the weights, the protein drink. And naturally, he's working on two laptops and a tablet simultaneously. I don't think this photo was even posed; it's just the way Guy is. So that's brogramming. Now I'm sure you are all itching to get to the point of this. What are the two ideas for inverse brogramming?
This is Lady Gaga demonstrating the importance of style. Actually, I think that's one of her more conservative outfits. In programming terms, a style guide is just a set of rules describing how your code should look. It has three components: it describes rules for how you name your variables, how you lay out your code, and in what order it should occur. I have a style guide online; there's a link to it from the bitly address, so you don't need to write your own.
One thing that bugs me when you name variables is weird casing. There are lots of examples of this throughout R. In cases like these, remembering which bits of the function name are upper case and which bits are lower case takes extra mental effort, which is silly, because I want to focus my brain power on things like “what is this data telling me”, rather than silly syntax issues. The last one is particularly ridiculous because the “s” in FMStable is capitalised in the package name, but not in the function name. Consistency works wonders.
One other thing that bugs me is separating words using a mixture of dots and underscores. For S3 methods it's necessary, but otherwise it just looks odd. There are many more naming crimes, but I think that you are starting to get the idea. Fortunately, the naming problem has been solved, and there are only two sensible naming conventions around today.
Java and C# and most of the descendants of C use lower camel case for variable names and upper camel case for classes (and, in C#, for methods too). Python and Ruby on the other hand use lower under case for variables and functions. Both these styles are used in R, so you can pick either one. Since functions in R behave the same way as any other variable type, the convention seems to be to use lower camel case for them. I'm also a big fan of C# so having lower camel case functions looks odd to me, so I tend to prefer under case.
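As a rough illustration of the conventions above, here is a sketch that sorts function names by casing style. The helper classify_name and its regular expressions are assumptions made up for this example; they are deliberately strict, so anything with dots or runs of capitals lands in "mixed".

```r
# Hypothetical helper: label each name with the first naming pattern it matches.
classify_name <- function(x) {
  patterns <- c(
    lowercase        = "^[a-z][a-z0-9]*$",
    lower_under_case = "^[a-z][a-z0-9]*(_[a-z0-9]+)+$",
    lowerCamelCase   = "^[a-z][a-z0-9]*([A-Z][a-z0-9]+)+$",
    UpperCamelCase   = "^([A-Z][a-z0-9]+)+$"
  )
  vapply(
    x,
    function(name) {
      # perl = TRUE keeps the [a-z] and [A-Z] ranges independent of the locale
      matched <- vapply(patterns, function(p) grepl(p, name, perl = TRUE), logical(1))
      if (any(matched)) names(patterns)[matched][1] else "mixed"
    },
    character(1)
  )
}

# Names taken from the slides
slide_names <- c(
  "add_datalist", "buildVignettes", "bibstyle",
  "getCRANmirrors", "yaml.load_file", "tk_choose.files"
)
classify_name(slide_names)
```

Running this over your own package's exports is a quick way to see whether you have drifted between styles.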
Functions often come in opposite pairs. This table is taken from Code Complete, and it lists some common actions that you might want to take. The idea is that if a user knows one function, then it is easy to guess what the other function is going to be. You see this a lot in R: read.csv has write.csv, save has load, and so on. When it doesn't happen, you notice it more.
Here are two examples from the base package that annoy me. If you hunt around, I'm sure you can find lots more. The second one in particular is horrendous. Having readRDS and writeRDS is fine. Having saveRDS and loadRDS is fine. But when you mix the two together, it has a really jarring effect.
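If the mismatch bothers you in your own code, one option is to wrap the base functions in a matched pair. The names read_rds and write_rds below are hypothetical wrappers defined only for this sketch, not base R functions.

```r
# Hypothetical wrappers giving base R's RDS functions a matched read/write pair.
read_rds  <- function(file) readRDS(file)
write_rds <- function(x, file) saveRDS(x, file = file)

# Round trip a built-in dataset through a temporary file
tmp <- tempfile(fileext = ".rds")
write_rds(mtcars, tmp)
round_trip <- read_rds(tmp)
identical(round_trip, mtcars)
```

Once a user knows read_rds exists, write_rds is the obvious guess, which is the whole point of opposite pairs.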
The second style point is how you lay out your code. Here we have a simple assignment. To me, it looks a bit squashed. If I see a long mathematical expression crammed together like this I find it hard to read.
Adding whitespace makes it much clearer to me. Your opinion may be different, and that's OK. Some people don't like spaces around the power operator, for example. The important thing is that it is best to have rules in place so that you do the same thing every time. If you are consistent then you'll develop a habit, and that means that you don't need to think about the problem any more; it just becomes automatic. The other benefit is that if you are working in a team, and you all write code in the same way, then it becomes easier to read each other's code.
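A small sketch of the point about spacing, with variable names invented for the example. The two layouts from the slide compute the same thing, so the choice is purely about readability; the second half shows one case where spacing around the assignment arrow does change the meaning.

```r
x <- seq(0, 1, by = 0.25)

# The squashed and spaced versions parse to the same expression.
y_squashed <- sin(2*pi+x)^2
y_spaced   <- sin(2 * pi + x) ^ 2
identical(y_squashed, y_spaced)

# Spacing is not always cosmetic, though.
threshold <- 5
is_below <- threshold < -3   # a comparison: is threshold less than minus three?
threshold<-3                 # an assignment: threshold now holds 3
```

A rule such as "always put spaces around <-" removes that ambiguity for the reader as well as the parser.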
It gets more interesting with a more complicated example. This is one of the examples from the xyplot help page. lattice function calls make a lot of use of nested lists, which makes them complicated, so they make good test cases for your style guide.
This is a rewrite of the same code with a different layout. I tend to have code on one side of my screen and an execution console on the other so I prefer my code to be taller rather than wider. Again, your preference may be different; you can choose a style that suits you. I like having equals signs aligned for named arguments so that you can quickly see which arguments are at which level. Other people think that is too fussy. Again, choose what suits you, but make a rule and stick to it.
The final style point is that it helps to keep your code in the same order in each script. I find that for any new analysis, I need to load a few packages, change the working directory and maybe source in some unpackaged functions. By keeping these commands in the same order at the top of the script file, I can easily see what I need and where I'm working from.
The second big idea of this talk is that functions should be a black box. That is, you shouldn't need to look at their insides in order to understand them. Every function has what is called a signature: that's its name, the arguments it takes and the value it returns. Based on these three things, you should be able to have a reasonable idea of what the function does. Of course, sometimes you need to read documentation on the algorithm, but in most cases a good function should be understandable from its signature.
This is where the sig package comes in useful. It takes a function, and prints the name and arguments of that function. The return value isn't easily available in R, of course, so sig doesn't worry about it. On its own, that isn't that useful; you can already do much the same thing with the args function or the formals function. The sig package has a couple of extra tricks up its sleeve.
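For comparison, here is what the base R route looks like. The helper signature_text is a hypothetical function written for this sketch; it only imitates the general shape of a printed signature and is not how the sig package is implemented.

```r
# args() displays a function's signature; formals() returns its arguments as a list.
args(utils::read.csv)
arg_names <- names(formals(utils::read.csv))
arg_names

# Hypothetical helper: render a signature as a single string.
signature_text <- function(fn_name) {
  fn <- match.fun(fn_name)
  header <- head(deparse(args(fn)), -1)   # drop the trailing "NULL" body line
  paste(fn_name, "<-", trimws(paste(header, collapse = " ")))
}
signature_text("sapply")
```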
The list_sigs function, as the name suggests, lists the sigs of all the functions in an environment or a file. Here you can see the first few results from the tools package. Already after four functions we can see that there is no consistent naming strategy. add_datalist and check_packages_in_dir are under cased, buildVignettes is camel cased and bibstyle is plain lowercased. My point isn't to pick on the tools package. It's been written by a lot of people, and some of the names were chosen to match names in S, so there is a lot of legacy naming to deal with. My point is that if you have the choice, don't name your functions like this.
The list_sigs function also has a pattern argument that lets you limit the output using a regular expression, in the way that ls does. Here you can see all the functions in the base package containing the word apply.
When you're naming arguments, consistency is just as important as when you are naming functions. I've highlighted the first argument of each function, and you can see that in most cases the formal argument is called 'X'. For no good reason, in rapply the name has been changed to 'object', and in eapply it has been changed to 'env'. mapply swaps the argument ordering so the FUN argument comes first. All this adds up to a considerable mental overhead, and these are the exact reasons that the plyr package was created. All that package does is create a consistent interface to the functions to stop you having to remember arguments. The caret package does the same for modelling functions.
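The same check can be sketched in base R without the highlighting. The variable names here are made up for the example, and the exact set of functions returned depends on your version of R.

```r
# Find the functions in base whose names contain "apply"
apply_names <- ls(baseenv(), pattern = "apply")

# Extract the name of the first formal argument of each one
first_args <- vapply(
  apply_names,
  function(fn_name) {
    fn <- get(fn_name, envir = baseenv())
    arg_names <- names(formals(args(fn)))
    if (length(arg_names) == 0) NA_character_ else arg_names[1]
  },
  character(1)
)
first_args

# Which functions break the "first argument is X" pattern?
odd_ones_out <- names(first_args)[which(first_args != "X")]
odd_ones_out
```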
write_sigs lets you write those lists of sigs to a file. The idea behind this is a game I call black box. You take a package that you've written, write all the function signatures to a file. Then you print it out, give it to a friend and have them guess what your functions do. It isn't quite Settlers of Catan, but it's a useful test to see if you've been naming things clearly.
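If you want to try the black box game without installing anything, a base R approximation might look like the sketch below. write_signatures and demo_env are hypothetical names invented for this example; this is not the implementation of write_sigs.

```r
# Hypothetical helper: write one "name <- function(args)" line per function in env.
write_signatures <- function(env, file) {
  object_names <- ls(env)
  is_fn <- vapply(
    object_names,
    function(nm) is.function(get(nm, envir = env)),
    logical(1)
  )
  sig_lines <- vapply(
    object_names[is_fn],
    function(nm) {
      header <- head(deparse(args(get(nm, envir = env))), -1)  # drop "NULL" body
      paste(nm, "<-", trimws(paste(header, collapse = " ")))
    },
    character(1)
  )
  writeLines(sig_lines, file)
  invisible(sig_lines)
}

# A toy environment standing in for your package
demo_env <- new.env()
demo_env$trimmed_mean <- function(x, trim = 0.1, na_rm = FALSE) {
  mean(x, trim = trim, na.rm = na_rm)
}
demo_env$threshold <- 3   # not a function, so it is skipped

out_file <- tempfile(fileext = ".R")
write_signatures(demo_env, out_file)
readLines(out_file)
```

The file contains names and arguments only, which is exactly what your friend gets to see.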
There are some other things that we can learn from the function signatures. Functions that have a lot of input arguments are generally more complicated to use. If your function takes a lot of inputs, then it's a sign that you might be trying to do too much, or that the function has several different purposes. Similarly, if the function is really long, then it's a good indicator that it should be broken down into smaller chunks. The sig_report function takes an environment or a file, and gives you a report telling you which functions have too many input arguments and which ones are too long. The defaults are 10 arguments and 50 lines.
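The idea behind the report can be approximated in a few lines of base R. report_big_functions and demo_env are hypothetical names for this sketch, and counting deparsed lines is an assumption about how to measure length; sig_report may measure it differently.

```r
# Hypothetical helper: flag functions with many arguments or many lines.
report_big_functions <- function(env, too_many_args = 10, too_many_lines = 50) {
  fns <- Filter(is.function, mget(ls(env), envir = env))
  n_args  <- vapply(fns, function(fn) length(formals(args(fn))), integer(1))
  n_lines <- vapply(fns, function(fn) length(deparse(fn)), integer(1))
  list(
    too_many_args  = names(fns)[n_args > too_many_args],
    too_many_lines = names(fns)[n_lines > too_many_lines]
  )
}

# A toy environment, with a low threshold so that something gets flagged
demo_env <- new.env()
demo_env$small_fn <- function(x) x + 1
demo_env$wide_fn  <- function(a, b, c, d) a + b + c + d

report <- report_big_functions(demo_env, too_many_args = 3)
report
```

The thresholds default to the same 10 arguments and 50 lines mentioned above, so you can tighten or loosen them to suit your own style guide.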
Here's the output of a report. I picked Hmisc because it's a really bad offender. There are loads of epic functions that are ripe for refactoring into smaller pieces.
For those of you that fell asleep half way through, here's my list of good ideas for you to try.
Thanks very much for listening. One last thing is that I want to set up an R user group for Nottingham. If you live anywhere near the city, or you know anyone from the East Midlands in the UK that might be interested, then let me know. Thanks again.
