Q&A:	New	Tools	Mean	Big	Data	Dives	Can	Yield	Fast	Results	
Users	can	explore	and	analyze	even	massive	data	sets	quickly	with	new	tools	and	platforms	now	
available	
By	Linda	L.	Briggs	
4.15.2014	
Analytic	tools	that	can	tap	into	raw,	unstructured	machine	data	are	becoming	increasingly	important	
and	valuable,	enabling	organizations	to	explore	unstructured	data	on	the	fly	without	fixed	schemas	
or	specialized	skill	sets.	In	this	interview,	Brett	Sheppard,	director	of	big	data	product	marketing	at	
Splunk,	explains	how	new	tools	and	analytic	platforms	enable	even	non-technical	users	to	explore	
massive	data	sets,	completing	analysis	of	unstructured	data	in	Hadoop	in	hours	instead	of	weeks	and	
months.	Based	in	San	Francisco,	Sheppard	has	been	a	data	analyst	both	at	Gartner	and	the	U.S.	
Department	of	Defense	and	is	a	certified	Hadoop	system	administrator.	
BI	This	Week:	Let's	start	with	a	fairly	basic	question.	With	all	this	talk	of	more	and	more	data,	what	
are	some	of	the	newer	sources	for	all	this	raw	data?	Where's	it	coming	from?	
Brett Sheppard: Social, cloud, and sensors all contribute to big data. In social media and e-commerce, every click on a Web site generates clickstream data, as does every interaction on social media and every tweet. There's significant value for organizations in having an ongoing dialog with their customers, prospects, and communities.
Likewise, every application in the cloud generates data, whether it comes from the hardware, the applications themselves, the security systems, or IT operations. Finally, we're seeing a wealth of very actionable data generated by sensors, ranging from automobiles and airplanes to building sensors.
For	example,	Eglin	Air	Force	Base	in	Florida	is	reducing	their	base-wide	energy	costs	by	10	percent	or	
more	by	taking	data	from	the	heating,	ventilation,	and	air	conditioning	systems,	the	hardware	
systems	--	basically	anything	in	a	building	that	generates	data	--	and	using	that	sensor	data	to	find	
where	inefficiencies	are,	whether	it's	lights	on	in	the	middle	of	the	night	or	a	center	that's	running	air
conditioners	too	much.	They	are	thus	able	to	uncover	energy	waste	and	to	find	the	right	balance	of	
heating	and	cooling.	
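A minimal sketch of that kind of after-hours check, assuming sensor readings arrive as timestamped JSON records, one per line; the field names and threshold below are illustrative, not Eglin's actual system:

```python
# Minimal sketch: flag after-hours energy draw in building sensor logs.
# Assumed record shape (illustrative):
# {"building": "B12", "sensor": "lighting_kw", "value": 4.2, "timestamp": "2014-03-01T02:14:00"}
import json
from datetime import datetime

OFF_HOURS = range(0, 6)    # midnight to 6 a.m.; adjust per site
THRESHOLD_KW = 1.0         # illustrative cutoff for "still drawing power"

def after_hours_waste(path):
    flagged = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            ts = datetime.fromisoformat(rec["timestamp"])
            if ts.hour in OFF_HOURS and rec["value"] > THRESHOLD_KW:
                flagged.append((rec["building"], rec["sensor"], ts, rec["value"]))
    return flagged

for building, sensor, ts, kw in after_hours_waste("sensor_log.jsonl"):
    print(f"{building} {sensor} drawing {kw} kW at {ts}")
```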
That's	just	one	example.	Basically,	those	three	sources	of	social,	cloud,	and	sensor	data	are	
contributing	to	a	significant	increase	in	the	volume	and	variety	of	data	as	well	as	the	need	to	address	
it	in	as	real-time	a	manner	as	possible.	
We're also seeing a lot of variability in data formats. For example, we work with Ford Motor Company on data from cars. Ford has proposed a standard for automobile sensor data, OpenXC, but it isn't shared by the rest of the industry yet. When we work with Ford, we're able to use their open data standard; the other auto makers haven't standardized on OpenXC. Accordingly, there's a lot of variability in working with car sensor data. That's something we see in a lot of industries right now.
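To give a rough sense of what that variability means in practice: OpenXC emits simple JSON messages, and other vendors use their own field names, so a small normalization layer is one common workaround. The second message format below is invented for illustration, not a real vendor feed:

```python
import json

# OpenXC messages are close to this shape: {"name": "vehicle_speed", "value": 45}
# The "signal"/"val" format below is a hypothetical non-OpenXC vendor feed.
CANONICAL = {"vehicle_speed": "vehicle_speed", "SPEED": "vehicle_speed"}

def normalize(raw):
    msg = json.loads(raw)
    if "name" in msg:                # OpenXC-style message
        key, value = msg["name"], msg["value"]
    elif "signal" in msg:            # hypothetical vendor format
        key, value = msg["signal"], msg["val"]
    else:
        return None                  # unknown format; set aside for review
    return {"signal": CANONICAL.get(key, key), "value": value}

print(normalize('{"name": "vehicle_speed", "value": 45}'))
print(normalize('{"signal": "SPEED", "val": 45, "unit": "km/h"}'))
```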
How	many	companies	out	there	are	successfully	capturing	big	data	yet?	How	many	have	actually	
deployed	an	HDFS	cluster,	for	example,	and	populated	it	with	big	data?	
Well,	I	would	distinguish	between	big	data	as	a	whole	and	specific	technologies.	There	is	a	great	deal	
of	interest	in	Hadoop	and	HDFS,	and	it's	focused	on	three	areas	right	now.	We	see	a	lot	of	Hadoop	
use	in	the	federal	government,	within	Internet	companies,	and	in	Fortune	500	enterprises.	Beyond	
Hadoop,	there	are	a	variety	of	NoSQL	data	stores	...	and	some	organizations	have	big	data	stored	in	
relational	databases.	
What's happened is [that] organizations are now able to store these data types so inexpensively ... using a variety of storage methods such as Hadoop, that the opportunity cost of throwing the data away is actually greater than the cost of storing it.
How	many	companies	are	performing	meaningful	analytics	against	all	that	big	data?	
The challenge is getting actionable insights from that data because the data is in so many different formats. Typically it's raw or unstructured data, and it doesn't fit very well in either a relational database or a business intelligence tool without an extreme amount of ETL processing, so it can start to look like a Rube Goldberg project in a way. There are all these steps to go from raw data to business insight.
That's what organizations are struggling with and why the failure rate of big data projects is actually quite high. Companies spend six months or more with five or 10 people working on a big data project, then find at the end that they just aren't able to get the actionable insights they wanted from that raw, unstructured big data.
What	are	some	of	the	challenges	for	companies	of	working	directly	with	Hadoop	and	MapReduce?	
Why	is	it	so	hard	to	get	value	from	data	in	Hadoop?	
Beyond Splunk-specific offerings, there are three approaches today to extracting value from data in Hadoop. All three can generate value, but they also have significant disadvantages.
First is MapReduce and Apache Pig, which is how most organizations get started. You can run searches of the data, but it's very slow. Unlike a relational database, where you can have deterministic query response times, Hadoop does not: a job can run indefinitely -- it could take minutes, it could take hours -- and it consumes a lot of resources in the cluster.
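Even a trivial count over raw logs becomes a full MapReduce job with this first approach. The sketch below uses Hadoop Streaming with Python; the log layout and paths are assumptions for illustration, and even this simple job queues on the cluster, scans the entire input, and has no guaranteed response time.

```python
#!/usr/bin/env python
# mapper.py -- emit one count per log line, keyed on the HTTP status code
# (field 9 of a combined access log; the field position is an assumption
# about the log format)
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 8:
        print(f"{fields[8]}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- sum the counts for each key; Hadoop Streaming delivers keys
# sorted, so equal keys arrive together
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")
```

These would be submitted through the Hadoop Streaming jar, roughly: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper "python mapper.py" -reducer "python reducer.py" -input /logs -output /out (jar name and paths vary by distribution). The job still has to scan every block of input, which is where the minutes-to-hours run times come from.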
Because of that, most organizations also try one of the other two options. The second approach is Apache Hive, or SQL on Hadoop. That works very well if you have a narrow set of questions to ask because it requires fixed schemas. If an organization wants to replace an existing ETL framework with something in Hadoop, and it's for, say, static reporting on a small number of data sources, the SQL-on-Hadoop or Apache Hive approach can work quite well.
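A hedged sketch of what that fixed-schema commitment looks like: the table layout and HDFS path below are invented for illustration, and the statements are simply passed to the standard hive -e command line.

```python
# Sketch: the fixed-schema side of Hive. Columns must be declared before you
# can query, which is exactly the constraint described above.
import subprocess

HIVEQL = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
  ts        STRING,
  client_ip STRING,
  url       STRING,
  status    INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/data/web_logs';

SELECT status, COUNT(*) AS hits
FROM web_logs
GROUP BY status;
"""

# hive -e runs the quoted statements from the shell; anything in the files
# that falls outside the declared columns is simply not visible to the query.
subprocess.run(["hive", "-e", HIVEQL], check=True)
```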
Where	that	approach	runs	into	challenges	is	with	exploratory	analytics	across	an	entire	Hadoop	
cluster,	where	it's	impossible	to	define	fixed	schemas.	A	knowledge	worker	in	that	organization	may	
want	to	iterate,	ask	questions,	see	the	results,	and	ask	follow-up	questions.	They	need	to	be	able	to	
look	at	all	the	data	that	they	have	access	to	within	their	role-based	access	controls	without	having	to	
pre-define	schemas	and	be	limited	to	the	data	returned	from	those	schemas.	
Finally, the third approach is to extract data out of Hadoop and into an in-memory store. This could be Tableau Software, or SAP HANA, or a variety of in-memory data stores. That approach works really well if Hadoop is basically doing batch ETL -- where you're taking raw data, creating a set of results that can be represented as rows and columns in a relational database, and exporting it out of Hadoop.
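One hedged example of that third pattern: once a batch job has reduced the raw data to rows and columns, the results are copied out of HDFS and loaded into an in-memory tool. The paths and column names below are placeholders; pandas stands in here for whatever in-memory store is actually used.

```python
# Sketch: export a batch result set from HDFS and load it into memory.
import subprocess
import pandas as pd

# Copy the job output out of HDFS (placeholder paths).
subprocess.run(
    ["hadoop", "fs", "-getmerge", "/out/daily_report", "daily_report.tsv"],
    check=True,
)

# The batch job already produced fixed rows and columns, so a plain tabular
# load is all that's left.
df = pd.read_csv("daily_report.tsv", sep="\t", names=["status", "hits"])
print(df.sort_values("hits", ascending=False).head())
```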
Customers	come	to	Splunk	when	they	don't	want	to	move	the	data	out	of	Hadoop.	They	essentially	
want	Hadoop	to	be	a	data	lake	where	they	keep	the	data	at	rest.	Organizations	may	have	security	
concerns	about	moving	data	around	too	much	or	they	don't	want	to	have	to	set	up	data	marts.	In	
those	cases,	organizations	are	using	Hadoop	as	the	data	lake,	where	they	persist	that	data	for	many	
months	or	years,	and	use	software	such	as	Hunk	to	ask	and	answer	questions	about	that	data.	
Working	with	big	data	can	also	introduce	skills	challenges,	correct?	There	just	aren't	enough	
people	around	who	understand	the	technologies.	
Absolutely.	In	fact,	that's	the	single	biggest	limitation	today	for	Hadoop	adoption.	Hadoop	is	
maturing	as	a	technology	for	storing	data	in	a	variety	of	formats	and	from	many	sources,	but	what's	
limiting	organizations	today	is	the	need	for	rare,	specialized	skill	sets	to	do	that.	...	
There's also a need to mask Hadoop's complexity so that non-specialists can ask questions of the data. At the same time, data scientists who are fluent in the dozen-plus projects and sub-projects in the Hadoop ecosystem can focus their skills on the advanced statistics and advanced algorithms that really benefit from their knowledge.
Unfortunately, today many data scientists end up wasting their time as "data butlers": they have colleagues in the line of business or in corporate departments who have analytics tasks but lack the advanced skill sets needed to ask and get answers to questions in Hadoop. Accordingly, data scientists end up setting up access for their non-specialist colleagues rather than spending their time doing what they're really there for, which is advanced statistics and algorithms that really do require a custom, personalized approach.
You	mentioned	role-based	access	controls.	Why	are	they	so	important?	
That's	one	of	the	challenges	to	address	with	big	data.	Along	with	rare,	specialized	skill	sets,	role-
based	access	controls	are	needed	to	protect	non-public	information	that	may	be	stored	in	that	
Hadoop	data	lake.	
That's a weakness of Hadoop, which grew up around shared-use clusters. Organizations such as Yahoo that were storing clickstream data in Hadoop had relatively few security concerns. The data didn't contain non-public information -- it was simply rankings of public Web sites. Accordingly, Hadoop was not designed with the role-based access controls that anyone familiar with relational databases would be conversant in.
For	working	with	big	data	the	way	most	companies	need	to,	though,	it's	important	to	have	a	
technology	that	can	mask	some	data	from	some	users,	offering	role-based	access	to	select	data	in	
Hadoop.	
In	fact,	that	issue	is	part	of	what's	held	the	average	size	of	a	cluster	in	Hadoop	to	40	nodes	rather	
than,	say,	400	or	4,000	nodes.	Organizations	have	to	restrict	the	number	of	users	who	have	access	
because	in	traditional	HDFS	and	MapReduce,	once	someone	has	access	to	the	cluster,	they	can	do	
anything	they	want	within	the	cluster.	They	can	see	all	the	data,	they	can	delete	the	data,	there	is	
limited auditability, and at best you're able to see what someone has done after they've done it. If you're able to define access by role, you can prevent people from either maliciously or inadvertently accessing data that is beyond their role.
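As a purely conceptual sketch (not how HDFS or Splunk implement it), role-based access control amounts to checking a user's role against the data sets that role is allowed to see before any query runs. The role names and paths here are invented for illustration:

```python
# Conceptual sketch of role-based access control over data sets in a data lake.
ROLE_PERMISSIONS = {
    "marketing_analyst": {"/data/clickstream"},
    "security_team": {"/data/clickstream", "/data/auth_logs"},
}

def authorized_paths(role, requested_paths):
    """Return only the paths this role may query; everything else is masked."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    return [p for p in requested_paths if p in allowed]

print(authorized_paths("marketing_analyst", ["/data/clickstream", "/data/auth_logs"]))
# ['/data/clickstream']  -- the auth logs stay invisible to this role
```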
Splunk's	analytics	solution	for	Hadoop	is	Hunk,	which	you	introduced	last	year.	Can	you	talk	about	
what	Hunk	brings	to	the	big	data	equation?	
We're very excited about Hunk as part of the Splunk portfolio for big data. Splunk has 7,000 customers today who analyze data in real time with historical context. Many organizations, though, want to persist big data for many months or years, and although Hadoop is a convenient and inexpensive data lake for long-term historical storage, organizations still want to be able to ask and answer questions of that data in Hadoop.
The benefit of Hunk is that an organization can explore, analyze, and visualize data at rest in Hadoop without the need for specialized skill sets. Many organizations that have tested Hunk have found that they've been able to go from the free trial to searching data in Hadoop within an hour or less. That's possible because of the Splunk architecture, which applies schema on the fly. There's no need to define fixed schemas, and there's no need to migrate data out of Hadoop into a fixed store, so the time to value is significantly faster. You can also expose the data through ODBC drivers to existing business intelligence dashboards.
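A hedged sketch of that last step, assuming an ODBC data source has already been configured for the analytics layer; the DSN name, table, and columns below are placeholders rather than a documented Splunk example, and the exact query syntax depends on the driver.

```python
# Sketch: pull results over ODBC into an existing BI or reporting workflow.
# "BigDataDSN" is a placeholder data source name.
import pyodbc

conn = pyodbc.connect("DSN=BigDataDSN", autocommit=True)
cursor = conn.cursor()

# The ODBC layer presents results as ordinary rows and columns, so any
# dashboard or BI tool that speaks ODBC can consume them.
cursor.execute("SELECT host, error_count FROM error_summary")
for row in cursor.fetchall():
    print(row.host, row.error_count)

conn.close()
```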
	About	the	Author	
	
Linda	L.	Briggs	writes	about	technology	in	corporate,	education,	and	government	markets.	She	is	
based	in	San	Diego.		
	
lbriggs@lindabriggs.com
