
Introduction to Data Technologies by Paul Murrell


Australian & New Zealand Journal of Statistics, 52(4), 2010, 469–470. doi: 10.1111/j.1467-842X.2010.00592.x

BOOK REVIEW

Introduction to Data Technologies. By Paul Murrell. CRC Press, Boca Raton, Florida, USA. 2009. xxvi+418 pp. US$69.95. ISBN 978-1-4200-6517-6.

This is a very gentle book. It enables students and statisticians, particularly those just entering the profession, to begin to familiarize themselves with important concepts and tools from the world of databases, these tools and concepts having been chosen in a coherent, if somewhat eclectic, way. The book grew out of teaching materials for a second-year course in statistics, and so has to be fairly gentle. To me it is encouraging that such topics are finding their way into statistics courses at all.

The book is organized into four main parts, namely Writing Computer Code, Data Storage, Data Queries and Data Processing. Each part consists of several chapters. The first deals with the substantive topic, and the others provide reference material for either the languages or the tools introduced in the lead chapter.

In the first part, on Writing Computer Code (a skill every real statistician must, in my view, eventually acquire), no detail, it seems, is too small. Aspects such as choosing an appropriate supporting editor for the language, indenting code to clarify the structure of the program, choosing variable names that are at least partially self-documenting, such as using 'dateOfBirth' rather than 'x', and the like, are all covered. These are habits that experienced and productive programmers eventually adopt, but for some people it can take a long time. It is great to see such down-to-earth advice in print for those entering the profession. The language chosen for this part is HTML, which is of course useful in itself, but the principles espoused are general.

The part on Data Storage is concerned with how data are actually stored in the computer and how languages such as XML work to provide a text-based representation of complex data structures. The language XPath for querying XML is also briefly treated. The Data Queries part provides an introduction to relational database tools and SQL, at least up to the point at which researchers can start to do useful things with them.

The final part, on Data Processing, is where R (R Development Core Team 2009) makes its entrance. The discussion is fairly closely confined to using R as a data processing language, which is often the first place that those learning R get stuck. I did, however, find this chapter slightly disappointing: data processing in R is possible, but it is not its main strength. In my view it would have been better, if more adventurous, to have introduced a scripting language such as Perl. Murrell does indeed raise this possibility (p. 362) but decides in favour of the higher-level language, R. I agree that there are pros and cons.

I found the style of the book very engaging (or at least those parts of it that are intended to be read: the reference chapters are intended to be consulted rather than read). It has the Paul Murrell light touch, first evident to me in his eminently readable and comprehensive book on R graphics (Murrell 2005). Like that one, the present book has interesting, occasionally slightly unusual examples and an easy and elegant writing style. The book does not hesitate to offer plain, direct advice in contexts in which other authors might simply let readers exercise their personal preferences.
For students, particularly, I think this is a good thing. Students will make up their own minds anyway, but having definite suggestions in front of them certainly helps things along.

Opinion is divided over to what extent data technologies should be of interest to statisticians. In the words of Lamia Gurdleneck, 'It's not the figures themselves ... it's what you do with them that matters'; data technologies can seem like a fascination with the figures themselves. My view changed when I left the university scene and joined CSIRO. At university, most of my statistical work had been in consultancy, enabling one more-or-less to dictate to clients the form in which their data should be presented. At CSIRO, the work was largely as part of a research team, with members ranging from scientists in the application area and field-work specialists through to analysts, modellers and statisticians. There was usually a database specialist charged with designing, implementing and managing the data collection and recording aspects of the project. It is essential that a statistician can talk to the database specialist, and, as a team member, the statistician, along with most others, will be expected to be able to use the database facilities for most purposes by themselves, and of course advise on aspects of the design. There is always much preliminary 'data cleaning' to do before an analysis can begin, almost regardless of how good a job is done by the database specialist.

A little incident may serve to illustrate this need for at least some data technology insight. Not long ago, a research project on which I was working needed to use some length–weight relationship data on prawns. A PhD student had collected precisely what we needed some years before, but the only record we had of it was a set of four scatterplots. A rough copy appeared in the thesis, but we also had an electronic version in Microsoft Excel. We were just about to start digitizing the graphics when the database manager pointed out that, as these were Excel graphics, they could be saved in XML, which he proceeded to do. From the XML file the original measurements were easy to pick out with an editor, to full accuracy. Had I known more about XPath, though, the job of extracting the numerics from the XML might have been even easier. I also learned from this that, while Excel hoards and tucks away all the information it receives, sometimes you need to be rather ingenious to persuade it to divulge it again.

BILL VENABLES
CSIRO Mathematics, Informatics and Statistics
Cleveland, Queensland, Australia

References

Murrell, P. (2005). R Graphics. Boca Raton, FL: Chapman & Hall/CRC.
R Development Core Team (2009). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.