Posts

Showing posts from 2015

Replacing Sawzall — a case study in domain-specific language migration

by AARON BECKER In a previous post, we described how data scientists at Google used Sawzall to perform powerful, scalable analysis. However, over the last three years we’ve eliminated almost all our Sawzall code, and now the niche that Sawzall occupied in our software ecosystem is mostly filled by Go. In this post, we’ll describe Sawzall’s role in Google’s analysis ecosystem, explain some of the problems we encountered as Sawzall use increased which motivated our migration, and detail the techniques we applied to achieve language-agnostic analysis while maintaining strong access controls and the ability to write fast, scalable analyses. Any successful programming language has its own evolutionary niche, a set of problems that it solves unusually well. Sometimes this niche is created by language features. For example, Erlang has strong tools for constructing distributed systems built into the language. In other cases, features such as standard libraries and a language’s communit

How to get a job at Google — as a data scientist

by SEAN GERRISH If you are a regular at this blog, thanks for reading. We will continue to bring you posts from the range of data science activities at Google. This post is different. It is for those who are interested enough in our activities to consider joining us. We briefly highlight some of the things we look for in data scientists we hire at Google and give tips on ways to prepare. At Google we’re always looking for talented people, and we’re interested in hiring great data scientists. It’s not easy to find people with enough passion and talent. In this short post, I’ll talk about how to get a job at Google as a data scientist. As you may have heard, the interviews at Google can be pretty tough. We do set our hiring bar high, but this post will give you guidance on what you can do to prepare. Know your stats. Math such as linear algebra and calculus is more or less expected of anyone we’d hire as a data scientist, and we look for people who live and breathe probab

Using Empirical Bayes to approximate posteriors for large "black box" estimators

by OMKAR MURALIDHARAN Many machine learning applications have some kind of regression at their core, so understanding large-scale regression systems is important. But doing this can be hard, for reasons not typically encountered in problems with smaller or less critical regression systems. In this post, we describe the challenges posed by one problem — how to get approximate posteriors — and an approach that we have found useful. Suppose we want to estimate the number of times an ad will be clicked, or whether a user is looking for images, or the time a user will spend watching a video. All these problems can be phrased as large-scale regressions. We have a collection of items with covariates (i.e. predictive features) and responses (i.e. observed labels), and for each item, we want to estimate a parameter that governs the response. This problem is usually solved by training a big regression system, like a penalized GLM, neural net, or random forest. We often use large regr

Data scientist as scientist

by NIALL CARDIN, OMKAR MURALIDHARAN, and AMIR NAJMI When working with complex systems or phenomena, the data scientist must often operate with incomplete and provisional understanding, even as she works to advance the state of knowledge. This is very much what scientists do. Our post describes how we arrived at recent changes to design principles for the Google search page, and thus highlights aspects of a data scientist’s role which involve practicing the scientific method. There has been debate as to whether the term “data science” is necessary. Some don’t see the point. Others argue that attaching the “science” is a clear indication of a “wannabe” (think physics, chemistry, biology as opposed to computer science, social science and even creation science). We’re not going to engage in this debate, but in this blog post we do focus on science. Not science pertaining to a presumed discipline of data science but rather science of the domain within which a data scientist operates.

Experiment design and modeling for long-term studies in ads

by HENNING HOHNHOLD, DEIRDRE O'BRIEN, and DIANE TANG In this post we discuss the challenges in measuring and modeling the long-term effect of ads on user behavior. We describe experiment designs which have proven effective for us and discuss the subtleties of trying to generalize the results via modeling. A/B testing is used widely in information technology companies to guide product development and improvements. For questions as disparate as website design and UI, prediction algorithms, or user flows within apps, live traffic tests help developers understand what works well for users and the business, and what doesn’t. Nevertheless, A/B testing has challenges and blind spots, such as: (1) the difficulty of identifying suitable metrics that give "works well" a measurable meaning, which is essentially the same as finding a truly useful objective to optimize; (2) capturing long-term user behavior changes that develop over time periods exceeding the typical durat

Causal attribution in an era of big time-series data

by KAY BRODERSEN For the first time in the history of statistics, recent innovations in big data might allow us to estimate fine-grained causal effects, automatically and at scale. But the analytical challenges are substantial. Every idea at Google begins with a simple question. How can we predict the benefits the idea's realization would create for users, publishers, developers, or advertisers? How can we establish that there is a causal link between our idea and the outcome metric we care about? Revealing one causal law can be more powerful than describing a thousand correlations — since only a causal relationship enables us to understand the true consequences of our actions. It's why estimating causal effects has been at the heart of data science. Analyses might begin by exploring, visualizing, correlating. But ultimately, we'll often want to identify the drivers that determine why things are the way they are. The golden path to estimating a causal effect is throu