This document discusses what makes an effective data team. It begins with introductions from Alex Dean, CEO of Snowplow Analytics. It then discusses how Snowplow helps companies collect and analyze customer event data. The document outlines a hierarchy of needs for a data team, beginning with ensuring data is available and ending with data scientists doing industry-leading work. It provides advice on each level of the hierarchy to help data teams become more effective.
3. Introducing me, Alex Dean
• CEO and co-founder at Snowplow Analytics[1], the company
behind Snowplow, the open source event data pipeline[2]
• Our mission at Snowplow is to help companies make better
decisions
• I have been at different stages of my working life a data
engineer and a business analyst, but never a data scientist!
• Weekend writer of Event Streams in Action (Manning)[3]
[1] https://snowplowanalytics.com
[2] https://github.com/snowplow/snowplow
[3] https://www.manning.com/books/event-streams-in-action
@datasciencefest
@SnowplowData
4. Snowplow is a real-time event data pipeline designed for
the data team
• My co-founder Yali and I created Snowplow so that
companies could own their customer event data without a
huge data engineering effort
• When we started Snowplow, we thought we would spend
6 months building a data pipeline and then get back to
data analytics...
• ... 7 years later, we are still building event pipelines :)
• Customer base of 150 worldwide and large open-source
community of enterprises and high-growth startups
• Snowplow is designed from the ground up for data teams
(data scientists, data engineers, business analysts)
[Diagram labels: Data team, CDO, Data lifecycle]
5. Framing: how we think about the software landscape*
[Diagram: four types of system: systems of engagement, systems of intelligence, systems of observation, systems of record]
• Control end-user interactions – e.g. ad tech,
support desks, marketing automation
• AI and machine learning platforms
• Web analytics, mobile analytics, product analytics
• Event data pipelines
• Cloud data integration providers
• IoT platforms
• CRM, HR/HCM, ERP/Financials
• Powers a critical business function
* The “four types of system” taxonomy comes from Satya Nadella, Microsoft CEO,
and is in turn an unbundling of Jerry Chen’s “three types of system” taxonomy
6. Framing: how we think about data maturity
8. Let’s apply Maslow’s Hierarchy of Needs to the data team
Data scientists are doing industry-leading work
Leadership believes it is running a “data company”
Company is structured for data success
Data is high quality
Data is available
10. “Data is available” sounds obvious – but if it’s not true, then
you will not be doing much data science
Data collection is the foundation
of the data value chain
• Before you can drive value from data,
you need an accurate, comprehensive
data set to work with, otherwise:
• Collecting data is hard:
• Multiple sources of data e.g. web,
mobile, email, social
• Sources are often evolving /
breaking their data “contracts”
There are two “types” of data
companies need to collect
1. “Real-time event data”
• Describes what is happening as it is
happening
• Includes web site, mobile data, call
center, email, in-store etc.
2. Slowly-evolving data
• Found in operational databases or
behind APIs
• Includes product catalogues, CRM,
content databases
Data collection is a solved
problem in 2019
• A wide variety of commercial and
open-source systems of observation exist to
capture real-time event data and/or
slowly-evolving data
• And remember that you don’t have to
capture “all the things”: there is plenty of
duplicated signal across a company. You
just need the original “signal”
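To make the two types of data described on this slide concrete, here is a minimal Python sketch. It is an illustration only: the `Event` class, the `CATALOGUE` snapshot, the `enrich` function and all field names are hypothetical and do not come from Snowplow or any other product. It shows a real-time event being joined with slowly-evolving catalogue data.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class Event:
    """Real-time event data: one thing that happened, at one moment (hypothetical shape)."""
    event_type: str
    user_id: str
    product_id: str
    occurred_at: datetime


# Slowly-evolving data: a hypothetical product catalogue snapshot,
# as might be loaded from an operational database or from behind an API.
CATALOGUE: Dict[str, Dict[str, str]] = {
    "sku-1": {"name": "Kettle", "category": "kitchen"},
}


def enrich(event: Event, catalogue: Dict[str, Dict[str, str]]) -> Dict[str, Optional[str]]:
    """Attach catalogue attributes to an event; unknown products get None."""
    product = catalogue.get(event.product_id, {})
    return {
        "event_type": event.event_type,
        "user_id": event.user_id,
        "product_id": event.product_id,
        "occurred_at": event.occurred_at.isoformat(),
        "product_name": product.get("name"),
        "product_category": product.get("category"),
    }


view = Event("product_view", "user-42", "sku-1",
             datetime(2019, 4, 1, 12, 0, tzinfo=timezone.utc))
enriched = enrich(view, CATALOGUE)
```

The event carries its own timestamp, while the catalogue is just "the current state"; in practice you would also want to decide which version of the slowly-evolving data an old event should be joined against.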
11. Solved problem 1: collecting your real-time event data
from websites, mobile apps, ESPs etc
Tag managers / Customer Data Platforms (CDPs)
• Integrate one SDK in
web / mobile app ->
send data to many
destinations e.g.
marketing providers
• Data warehouse one of
many categories of
destination
Real-time event data pipelines
• Available open-source
or running in your
cloud as a managed
service
• Focus on data quality,
data richness and
flexibility
• Events are available in
real-time
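The "integrate one SDK, send data to many destinations" pattern used by tag managers and CDPs can be sketched as a simple fan-out. This is an assumption-laden illustration, not the API of any real tag manager or CDP: the `Tracker` class, its methods and the destination names are all hypothetical.

```python
from typing import Callable, Dict, List

# A destination is any callable that accepts one event (hypothetical interface).
Destination = Callable[[dict], None]


class Tracker:
    """A single integration point that forwards each event to every destination."""

    def __init__(self) -> None:
        self._destinations: Dict[str, Destination] = {}

    def add_destination(self, name: str, send: Destination) -> None:
        self._destinations[name] = send

    def track(self, event: dict) -> List[str]:
        """Send a copy of the event to each destination; return the names delivered to."""
        delivered = []
        for name, send in self._destinations.items():
            send(dict(event))  # shallow copy, so destinations do not share one dict
            delivered.append(name)
        return delivered


# In-memory stand-ins for a data warehouse and a marketing provider.
warehouse_rows: List[dict] = []
marketing_calls: List[dict] = []

tracker = Tracker()
tracker.add_destination("warehouse", warehouse_rows.append)
tracker.add_destination("marketing", marketing_calls.append)
delivered = tracker.track({"event_type": "page_view", "user_id": "user-42"})
```

Here the data warehouse is just one of several destinations, which matches the slide's point; a real system would also need retries and error handling per destination.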
12. Solved problem 2: collecting your slowly-evolving data
from operational databases, SaaS platforms etc
Next gen ETL/ELT-as-a-service
• SaaS ETL providers aka cloud data
integrators aka iPaaS
• Specialize in warehousing data
from third party APIs and
databases
• Amortize the cost of maintaining
hundreds of connectors to
unreliable source systems across
large customer bases
13. A warning about your data engineers building this themselves
• Sometimes data engineers like to break out the tools and build
data pipelines from scratch:
• “The pre-built solution is too expensive, I can build this in a week”
• “The pre-built solution doesn’t understand all the specifics of our
business, which is not like any other business on the planet”
• “If we use a pre-built solution, we will be locked into that vendor”
• Try to dissuade your data engineers from wasting their time on
this – you need to keep your data engineers available for the
much harder problems coming in the next section :)
15. Data quality is a really tough problem throughout the data lifecycle –
you will be leaning on your data engineers here
[Diagram: data lifecycle with stages Create / collect, Store, Socialize, Activate, Expire / delete]
16. By data quality we are not just talking about data completeness and
cleanliness – it’s much broader
Making sure the data is
complete and correct
Making sure the data is
semantically understood
Making sure the data is
compliant with regulations
• Identify, report and recover
data which doesn’t conform to
its schema
• Report on anomalies in the
data
• Deliver and maintain a single
source of truth for the data
• Create a unified semantic
layer explaining the data,
making clear what data and
derivations are available
• Clarify what assumptions
have been made in
processing the data –
tracking data lineage
• Ensure usage of data is
consistent with basis of
collection and changing data
subject preferences
• Make it easy to demonstrate
compliance
Some of the problems your data team will be grappling with:
Build a common language between the data team
and the rest of the organisation
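One way to picture "identify, report and recover" for data that fails its schema is to split incoming events into a good stream and a bad stream, keeping the failed events together with the reasons they failed so they can be fixed and replayed later. The sketch below is a simplified assumption, not how any particular pipeline implements validation: the `SCHEMA` dict, the field names and both functions are hypothetical, and the check only covers required fields and their Python types.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical schema: required field name -> expected Python type.
SCHEMA: Dict[str, type] = {
    "event_type": str,
    "user_id": str,
    "occurred_at": str,
}


def violations(event: Dict[str, Any], schema: Dict[str, type]) -> List[str]:
    """Return a human-readable list of the ways an event fails the schema."""
    problems = []
    for field, expected in schema.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            actual = type(event[field]).__name__
            problems.append(f"wrong type for {field}: expected {expected.__name__}, got {actual}")
    return problems


def partition(events: List[Dict[str, Any]],
              schema: Dict[str, type]) -> Tuple[List[dict], List[dict]]:
    """Split events into (good, bad); bad rows keep the original event for recovery."""
    good, bad = [], []
    for event in events:
        problems = violations(event, schema)
        if problems:
            bad.append({"event": event, "problems": problems})
        else:
            good.append(event)
    return good, bad


good, bad = partition(
    [
        {"event_type": "page_view", "user_id": "user-42", "occurred_at": "2019-04-01T12:00:00Z"},
        {"event_type": "page_view", "occurred_at": "2019-04-01T12:00:05Z"},
    ],
    SCHEMA,
)
```

Keeping the original event alongside its problems is what makes recovery possible: nothing is silently dropped, and the size of the bad stream over time is itself a useful data-quality report.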
18. When I speak to Heads of Data and CDOs, this is the biggest
problem they are grappling with
DATA TEAM
Operationalizing the work
• Moving out of the “lab environment”
and getting results in actual
operational systems
• Dealing with differences in data
sources, data processing
• Handling conflicts with existing
operational rulesets
“Selling” the work to other teams
• Convincing the control-freak CEO that
the algorithm is working
• Convincing a team that the outcomes of
data science will make their lives easier –
it’s not just an integration chore
• Helping other teams understand that
their work will change as automated
decisioning comes in
Learning to let go
• Build, Operate, Transfer
• Understanding that business users
have insights and enhancements
that they can make to your work
Handling dependencies on other
teams
• Moving out of the MacGyver mode,
having to rely on other teams to get
things done (e.g. event tracking
instrumentation)
• Fitting into those teams’
agile/scrum/etc processes
Data insights and science
Working to discover insights, build and
test models and then make sure that
that work has impact in the business
CDO
19. You don’t need one culture for successful data science – you need
an evolving culture with different practices
• I like the analogy of starting with a data “MacGyver” and
then migrating to a data “A-Team”
• This is similar to Simon Wardley’s analogy of needing
Pioneers, Settlers and Town Planners[1] at various stages
of your product development process
• You will need different structures at different times - stay
flexible and keep adapting to the needs of your company
[1] https://blog.gardeviance.org/2015/03/on-pioneers-settlers-town-planners-and.html
If the interface between your data team(s) and the rest
of the company starts breaking down, change it!
21. Advances in data technology are transforming the ways companies
do business
The world is changing…
• Digital platforms make it
possible to collect much more
data than ever before about
how companies engage with
individual users, and provide
users with a personalised
experience
• AI and other advances in
analytic technology mean
more insight can be driven
from that data
• Real-time data processing
means the data can be
“activated” in real-time
...creating opportunities
• Data enables companies to
compete: those companies
that use data as a strategic
asset to best understand their
users are able to drive
sustained competitive
advantage
...and challenges
• Executing on these
opportunities is hard: it requires
strategic and operational
(process, culture, people,
technology) aspects
• Data-enabled competitors are
a threat – every industry has
these challengers now
• Data poses a significant
liability as well as opportunity
(e.g. GDPR)
Companies have to adapt
• Rise of the Chief Data Officer
(CDO): identify and execute on
opportunities to use data to
drive strategic value across
the business
• Rise of the data team under
the CDO. Responsible for:
• Systematically growing a
company’s data set and
capability to use that data
to drive value
• Empowering other teams
with data
• Managing data liability /
compliance
22. How do you tell if leadership is truly betting on data, or is
just playing Big Data Bingo?
Investing
• The hardest to fake! Is the company investing in growing the team and technology?
• Is the company investing in training and upskilling the existing team members?
Committing to change
• Is data science quarantined in an innovation lab, or is it in a disruptively central position?
• How many hops/gatekeepers from the Head of Data to the CEO and Board?
Publicising concrete wins
• Is the company talking to the market in generalities about “becoming a data company”, or is it
calling out specific victories which were driven by the data team?
Engaging with the ethics / CSR
• Is the leadership team seriously grappling with the ethical dimension of your data work?
• The implosion of Google’s AI Ethics Board is a cautionary tale – the perils of a PR-driven approach
24. If all the lower “floors” are in place, hopefully you are now
empowered to do industry-leading work
What does an industry-leading environment look like? Some suggestions:
Publishing, writing and
training
• Publishing in data science
journals
• Technical blog posts,
tutorials and technical
reports
• Educating the wider
company on what the
data team is doing and
how they can help
Beating the competition
• Gaining significant
competitive advantage in
your market
• Winning industry awards
for innovative or
breakthrough use of data
• Tension with publishing –
suddenly your employer
sees your team’s work as
“secret sauce”!
Investing and scaling
• Investing in growing the
team
• Investing in the best tools
and processes to support
the team
• Investing in you – making
sure your personal
development keeps you at
the company and
maintains your edge
26. Thank you! Questions?
• I always love talking to data scientists and the
rest of the data team – you can reach me on:
alex@snowplowanalytics.com
@alexcrdean
• And huge thanks to Data Science Festival:
#DataScienceFest
@datasciencefest
@SnowplowData