Posts

Data scientist as scientist

by NIALL CARDIN, OMKAR MURALIDHARAN, and AMIR NAJMI

When working with complex systems or phenomena, the data scientist must often operate with incomplete and provisional understanding, even as she works to advance the state of knowledge. This is very much what scientists do. Our post describes how we arrived at recent changes to design principles for the Google search page, and thus highlights aspects of a data scientist's role that involve practicing the scientific method. There has been debate as to whether the term "data science" is necessary. Some don't see the point. Others argue that attaching the "science" is a clear indication of a "wannabe" (think physics, chemistry, and biology as opposed to computer science, social science, and even creation science). We're not going to engage in this debate, but in this blog post we do focus on science: not science pertaining to a presumed discipline of data science, but rather the science of the domain within which a data scientist operates.

Experiment design and modeling for long-term studies in ads

by HENNING HOHNHOLD, DEIRDRE O'BRIEN, and DIANE TANG

In this post we discuss the challenges in measuring and modeling the long-term effect of ads on user behavior. We describe experiment designs which have proven effective for us and discuss the subtleties of trying to generalize the results via modeling. A/B testing is used widely in information technology companies to guide product development and improvements. For questions as disparate as website design and UI, prediction algorithms, or user flows within apps, live traffic tests help developers understand what works well for users and the business, and what doesn't. Nevertheless, A/B testing has challenges and blind spots, such as the difficulty of identifying suitable metrics that give "works well" a measurable meaning (essentially the same as finding a truly useful objective to optimize), and capturing long-term user behavior changes that develop over time periods exceeding the typical duration …
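To make the short-term side of such a test concrete, here is a minimal sketch of comparing a per-user metric between a control and a treatment arm. Everything in it is hypothetical: the metric, the effect size, the sample sizes, and the simple z-test are illustrative choices, not the designs described in the post.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical per-user metric (e.g. clicks per user) from a live-traffic A/B test.
control = rng.poisson(lam=2.00, size=50_000)
treatment = rng.poisson(lam=2.04, size=50_000)  # simulate a small +2% treatment effect

# Two-sample z-test on the difference in means (normal approximation).
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
z = diff / se
p_value = 2 * stats.norm.sf(abs(z))

print(f"relative change: {diff / control.mean():+.2%}, z = {z:.2f}, p = {p_value:.3f}")
```

A snapshot comparison like this is precisely what can miss user-behavior changes that unfold over longer horizons, which is the gap the experiment designs discussed in the post aim to close.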

Causal attribution in an era of big time-series data

by KAY BRODERSEN

For the first time in the history of statistics, recent innovations in big data might allow us to estimate fine-grained causal effects, automatically and at scale. But the analytical challenges are substantial. Every idea at Google begins with a simple question: how can we predict the benefits the idea's realization would create for users, publishers, developers, or advertisers? How can we establish that there is a causal link between our idea and the outcome metric we care about? Revealing one causal law can be more powerful than describing a thousand correlations, since only a causal relationship enables us to understand the true consequences of our actions. It's why estimating causal effects has been at the heart of data science. Analyses might begin by exploring, visualizing, and correlating. But ultimately, we'll often want to identify the drivers that determine why things are the way they are. The golden path to estimating a causal effect is through …
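As a toy illustration of the counterfactual reasoning involved (not the specific methodology of the post), the sketch below fits the pre-intervention relationship between an affected series and an unaffected control series, forecasts what the affected series would have looked like without the intervention, and reads off the estimated effect. All of the data, and the constant true_lift, are simulated for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: metric y tracks an unaffected control series x until an
# intervention at time t0 adds a constant lift.
n, t0, true_lift = 200, 150, 5.0
x = 50 + np.cumsum(rng.normal(0, 1, n))   # control series, never treated
y = 0.8 * x + 10 + rng.normal(0, 1, n)    # affected series
y[t0:] += true_lift                       # causal effect of the intervention

# Learn the pre-period relationship, then predict the post-period counterfactual:
# "what y would have been had the intervention not happened".
slope, intercept = np.polyfit(x[:t0], y[:t0], deg=1)
counterfactual = slope * x[t0:] + intercept

estimated_lift = np.mean(y[t0:] - counterfactual)
print(f"estimated lift: {estimated_lift:.2f} (simulated true lift: {true_lift})")
```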

On procedural and declarative programming in MapReduce

by SEAN GERRISH and AMIR NAJMI

To deliver the services our users have come to rely upon, Googlers have to process a lot of data, often at web scale. For doing analyses quickly, it helps to abstract away as much of the repeated work as possible. In this post, we'll describe some things we have learned about mixing declarative and procedural programming paradigms to simplify MapReduce as used by data scientists. One goal of a data scientist's software stack is to eliminate as much routine work as possible so she can spend more time on her comparative advantage: reasoning about data. Because a data scientist's work is more abstract than a typical software engineer's, the languages she uses often include declarative patterns: constructs by which the analyst specifies what she wants rather than how to get it, with the framework doing magic under the hood to get her the results. There are many examples of declarative programming constructs out there for data gathering, SQL …
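As a toy contrast between the two paradigms (using pandas in place of any Google-internal stack), the snippet below computes the same per-group aggregate twice: once declaratively, stating only what is wanted, and once procedurally, spelling out an explicit map, shuffle, and reduce.

```python
from collections import defaultdict

import pandas as pd

records = [
    {"country": "US", "latency_ms": 120},
    {"country": "US", "latency_ms": 95},
    {"country": "DE", "latency_ms": 140},
    {"country": "DE", "latency_ms": 110},
]

# Declarative: say *what* you want (mean latency per country); the framework
# decides how to execute it.
declarative = pd.DataFrame(records).groupby("country")["latency_ms"].mean().to_dict()

# Procedural: say *how* -- emit key/value pairs (map), group them by key
# (shuffle), and aggregate each group (reduce), as in a hand-written MapReduce.
groups = defaultdict(list)
for rec in records:                      # map + shuffle
    groups[rec["country"]].append(rec["latency_ms"])
procedural = {k: sum(v) / len(v) for k, v in groups.items()}   # reduce

print(declarative)   # {'DE': 125.0, 'US': 107.5}
print(procedural)    # {'US': 107.5, 'DE': 125.0}
```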

An introduction to the Poisson bootstrap

by AMIR NAJMI

The bootstrap is a powerful resampling procedure which makes it easy to compute the distribution of any statistical estimator. However, doing the standard bootstrap on big data (i.e. data which won't fit in the memory of a single computer) can be computationally prohibitive. In this post I describe a simple "statistical fix" to the standard bootstrap procedure allowing us to compute bootstrap estimates of standard error in a single pass or in parallel. At Google, data scientists are simply in too much demand. Thus, any time we can replace data scientist thinking with machine thinking, we consider it a win. Anticipating the ubiquity of cheap computing, Efron introduced the bootstrap back in 1979 [1]. What makes the bootstrap so attractive is that it doesn't require any parametric assumptions about the data, or any math at all, and can be applied generically to a wide variety of statistical estimators. As simple as the bootstrap procedure is, its theory is far from trivial and …
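A minimal sketch of the idea in NumPy, assuming the estimator of interest is the sample mean: each replicate gives every record an independent Poisson(1) count, which approximates ordinary bootstrap resampling without needing to know the total sample size up front, the property that makes single-pass and sharded implementations possible. The function name and constants here are ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_bootstrap_se(x, statistic=np.mean, n_replicates=500):
    """Poisson-bootstrap estimate of the standard error of `statistic`."""
    x = np.asarray(x)
    replicates = []
    for _ in range(n_replicates):
        counts = rng.poisson(1.0, size=len(x))  # independent Poisson(1) weight per record
        resample = np.repeat(x, counts)         # materialized for clarity; a streaming job
                                                # would keep only weighted running aggregates
        if resample.size:
            replicates.append(statistic(resample))
    return np.std(replicates, ddof=1)

data = rng.exponential(scale=2.0, size=10_000)
print("Poisson bootstrap SE of the mean:", poisson_bootstrap_se(data))
print("Analytic SE of the mean:         ", data.std(ddof=1) / np.sqrt(len(data)))
```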

Welcome to the unofficial Google data science blog

Despite Google's technical achievements with big data, it may come as a surprise that there is no official Google blog for data science. True, Google Research puts out many academic papers and has a blog describing matters of interest to researchers. But what has been missing to date is a conversation about the nuts and bolts, the day-to-day of the large-scale analytical systems Google builds to serve its users. We'd like to change that. We are a group of individuals from across several engineering teams at Google whose job it is to design and build the analytics used in Google's products and services. While most of us have PhDs in statistics, machine learning, or a related field, ours is not a blog aimed at academia. We'll provide academic references if necessary, but we mean for this to be a practitioners' blog. At the same time, the problems we face are often complex enough to require highly technical solutions in statistics and computation. Thus many of our posts might not be …