What makes an effective data team?
Alex Dean
@alexcrdean
Introducing me, Alex Dean
• CEO and co-founder at Snowplow Analytics[1], the company
behind Snowplow, the open source event data pipeline[2]
• Our mission at Snowplow is to help companies make better
decisions
• At different stages of my working life I have been a data
engineer and a business analyst, but never a data scientist!
• Weekend writer of Event Streams in Action (Manning)[3]
[1] https://snowplowanalytics.com
[2] https://github.com/snowplow/snowplow
[3] https://www.manning.com/books/event-streams-in-action
@datasciencefest
@SnowplowData
Snowplow is a real-time event data pipeline designed for
the data team
• My co-founder Yali and I created Snowplow so that
companies could own their customer event data without a
huge data engineering effort
• When we started Snowplow, we thought we would spend
6 months building a data pipeline and then get back to
data analytics...
• ... 7 years later, we are still building event pipelines :)
• Customer base of 150 worldwide and large open-source
community of enterprises and high-growth startups
• Snowplow is designed from the ground up for data teams
(data scientists, data engineers, business analysts)
[Diagram labels: data team, CDO, data lifecycle]
Framing: how we think about the software landscape*
Systems of engagement
• Control end-user interactions – e.g. ad tech, support desks, marketing automation
Systems of intelligence
• AI and machine learning platforms
Systems of observation
• Web analytics, mobile analytics, product analytics
• Event data pipelines
• Cloud data integration providers
• IoT platforms
Systems of record
• CRM, HR/HCM, ERP/Financials
• Powers a critical business function
* “Four types of system” taxonomy comes from Satya Nadella, Microsoft CEO (source),
which is in turn an unbundling of Jerry Chen’s “three types of system” taxonomy (source)
Framing: how we think about data maturity
A data team’s hierarchy of needs
Random forest
Let’s apply Maslow’s Hierarchy of Needs to the data team (top level first):
• Data scientists are doing industry-leading work
• Leadership believes it is running a “data company”
• Company is structured for data success
• Data is high quality
• Data is available
Data is available
“Data is available” sounds obvious – but if it’s not true, then
you will not be doing much data science
Data collection is the foundation of the data value chain
• Before you can drive value from data, you need an accurate, comprehensive data set to work with, otherwise:
• Collecting data is hard:
  • Multiple sources of data e.g. web, mobile, email, social
  • Sources are often evolving / breaking their data “contracts”
There are two “types” of data companies need to collect
1. “Real-time event data”
  • Describes what is happening as it is happening
  • Includes web site, mobile data, call center, email, in-store etc.
2. Slowly-evolving data
  • Found in operational databases or behind APIs
  • Includes product catalogues, CRM, content databases
Data collection is a solved problem in 2019
• A wide variety of commercial and open-source systems of observation exist to capture real-time event data and/or slowly-evolving data
• And remember that you don’t have to capture “all the things”: there is plenty of duplicated signal across a company. You just need the original “signal”
Solved problem 1: collecting your real-time event data from websites, mobile apps, ESPs etc
Tag managers / Customer Data Platforms (CDPs)
• Integrate one SDK in web / mobile app -> send data to many destinations e.g. marketing providers
• Data warehouse one of many categories of destination
Real-time event data pipelines
• Available open-source or running in your cloud as a managed service
• Focus on data quality, data richness and flexibility
• Events are available in real-time
Solved problem 2: collecting your slowly-evolving data from operational databases, SaaS platforms etc
Next gen ETL/ELT-as-a-service
• SaaS ETL providers aka cloud data integrators aka iPaaS
• Specialize in warehousing data from third-party APIs and databases
• Amortize the cost of maintaining hundreds of connectors to unreliable source systems across large customer bases
A warning about your data engineers building this themselves
• Sometimes data engineers like to break out the tools and build
data pipelines from scratch:
• “The pre-built solution is too expensive, I can build this in a week”
• “The pre-built solution doesn’t understand all the specifics of our
business, which is not like any other business on the planet”
• “If we use a pre-built solution, we will be locked into that vendor”
• Try to dissuade your data engineers from wasting their time on
this – you need to keep your data engineers available for the
much harder problems coming in the next section :)
Data is high quality
Data quality is a really tough problem throughout the data lifecycle –
you will be leaning on your data engineers here
[Diagram: data lifecycle with stages Create / collect, Store, Socialize, Activate, Expire / delete]
By data quality we are not just talking about data completeness and
cleanliness – it’s much broader:
• Making sure the data is complete and correct
• Making sure the data is semantically understood
• Making sure the data is compliant with regulation
Some of the problems your data team will be grappling with:
• Identify, report and recover data which doesn’t comply with its schema
• Report on anomalies in the data
• Deliver and maintain a single source of truth for the data
• Create a unified semantic layer explaining the data, making clear what data and derivations are available
• Clarify what assumptions have been made in processing the data – tracking data lineage
• Ensure usage of data is consistent with basis of collection and changing data subject preferences
• Make it easy to demonstrate compliance
• Build a common language between the data team and the rest of the organisation
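One of the problems listed above, identifying, reporting and recovering data that does not comply with its schema, can be sketched in a few lines of Python. This is a minimal illustration only, not how Snowplow or any particular pipeline implements validation: the PAGE_VIEW_SCHEMA fields, the ValidationResult type and the schema_errors / validate functions are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical schema, invented for this example: field name -> required type.
PAGE_VIEW_SCHEMA = {
    "event_id": str,
    "user_id": str,
    "page_url": str,
    "timestamp_ms": int,
}


@dataclass
class ValidationResult:
    """Events split into those that passed and those that failed the schema."""
    good: list = field(default_factory=list)
    failed: list = field(default_factory=list)


def schema_errors(event: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a readable list of schema violations (empty if the event is valid)."""
    errors = []
    for name, expected in schema.items():
        if name not in event:
            errors.append(f"missing field: {name}")
        elif type(event[name]) is not expected:
            errors.append(
                f"wrong type for {name}: expected {expected.__name__}, "
                f"got {type(event[name]).__name__}"
            )
    for name in event:
        if name not in schema:
            errors.append(f"unexpected field: {name}")
    return errors


def validate(events: list[dict[str, Any]], schema: dict[str, type]) -> ValidationResult:
    """Identify failing events and keep them, with reasons, for later recovery."""
    result = ValidationResult()
    for event in events:
        errors = schema_errors(event, schema)
        if errors:
            # Keep the original payload so it can be fixed and replayed later.
            result.failed.append({"event": event, "errors": errors})
        else:
            result.good.append(event)
    return result


events = [
    {"event_id": "e1", "user_id": "u1", "page_url": "/home", "timestamp_ms": 1},
    {"event_id": "e2", "user_id": "u2", "page_url": "/pricing"},
    {"event_id": "e3", "user_id": "u3", "page_url": "/docs", "timestamp_ms": "yesterday"},
]
result = validate(events, PAGE_VIEW_SCHEMA)
print(f"{len(result.good)} good, {len(result.failed)} failed")
for failure in result.failed:
    print(failure["event"]["event_id"], failure["errors"])
```

Storing failed events together with the reasons they failed, rather than silently dropping them, is what makes the "report" and "recover" steps possible: the failures can be counted for reporting and replayed once the source or the schema has been fixed.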
Company is structured for data success
When I speak to Heads of Data and CDOs, this is the biggest
problem they are grappling with
Data team (CDO): data insights and science
Working to discover insights, build and test models and then make sure that work has impact in the business
Operationalizing the work
• Moving out of the “lab environment” and getting results in actual operational systems
• Dealing with differences in data sources, data processing
• Handling conflicts with existing operational rulesets
“Selling” the work to other teams
• Convincing the control-freak CEO that the algorithm is working
• Convincing a team that the outcomes of data science will make their lives easier – it’s not just an integration chore
• Helping other teams understand that their work will change as automated decisioning comes in
Handling dependencies on other teams
• Moving out of the MacGyver mode, having to rely on other teams to get things done (e.g. event tracking instrumentation)
• Fitting into those teams’ agile/scrum/etc processes
Learning to let go
• Build, Operate, Transfer
• Understanding that business users have insights and enhancements that they can make to your work
You don’t need one culture for successful data science – you need
an evolving culture with different practices
• I like the analogy of starting with a data “MacGyver” and
then migrating to a data “A-Team”
• This is similar to Simon Wardley’s analogy of needing
Pioneers, Settlers and Town Planners[1] at various stages
of your product development process
• You will need different structures at different times - stay
flexible and keep adapting to the needs of your company
[1] https://blog.gardeviance.org/2015/03/on-pioneers-settlers-town-planners-and.html
If the interface between your data team(s) and the rest
of the company starts breaking down, change it!
Leadership believes it is running a “data company”
Advances in data technology are transforming the ways companies do business
The world is changing…
• Digital platforms make it possible to collect much more data than ever before about how companies engage with individual users, and provide users with a personalised experience
• AI and other advances in analytic technology mean more insight can be driven from that data
• Real-time data processing means the data can be “activated” in real-time
...creating opportunities
• Data enables companies to compete: those companies that use data as a strategic asset to best understand their users are able to drive sustained competitive advantage
...and challenges
• Executing on these opportunities is hard: it requires addressing strategic and operational (process, culture, people, technology) aspects
• Data-enabled competitors are a threat – every industry has these challengers now
• Data poses a significant liability as well as opportunity (e.g. GDPR)
Companies have to adapt
• Rise of the Chief Data Officer (CDO): identify and execute on opportunities to use data to drive strategic value across the business
• Rise of the data team under the CDO. Responsible for:
  • Systematically growing a company’s data set and capability to use that data to drive value
  • Empowering other teams with data
  • Managing data liability / compliance
How do you tell if leadership is truly betting on data, or is
just playing Big Data Bingo?
Investing
• The hardest to fake! Is the company investing in growing team and technology?
• Is the company investing in training and upskilling the existing team members?
Committing to change
• Is data science quarantined in an innovation lab, or is it in a disruptively central position?
• How many hops/gatekeepers from the Head of Data to the CEO and Board?
Publicising concrete wins
• Is the company talking to the market in generalities about “becoming a data company”, or is it calling out specific victories which were driven by the data team?
Engaging with the ethics / CSR
• Is the leadership team seriously grappling with the ethical dimension of your data work?
• The implosion of Google’s AI Ethics Board is a cautionary tale about the perils of a PR-driven approach
Data scientists are doing industry-leading work
If all the lower “floors” are in place, hopefully you are now
empowered to do industry-leading work
What does an industry-leading environment look like? Some suggestions:
Publishing, writing and training
• Publishing in data science journals
• Technical blog posts, tutorials and technical reports
• Educating the wider company on what the data team is doing and how they can help
Beating the competition
• Gaining significant competitive advantage in your market
• Winning industry awards for innovative or breakthrough use of data
• Tension with publishing – suddenly your employer sees your team’s work as “secret sauce”!
Investing and scaling
• Investing in growing the team
• Investing in the best tools and processes to support the team
• Investing in you – making sure your personal development keeps you at the company and maintains your edge
Conclusion
Thank you! Questions?
• I always love talking to data scientists and the
rest of the data team – you can reach me on:
alex@snowplowanalytics.com
@alexcrdean
• And huge thanks to Data Science Festival:
#DataScienceFest
@datasciencefest
@SnowplowData
snowplowanalytics.com
© 2018 Snowplow Analytics Ltd.
