Observations and Recommendations On Splunk Performance
About Me
• Member of Splunk Tech Services
• >5 years at Splunk
• Large-scale and Cloud deployments
• 6th .conf
Agenda
Performance & Bottlenecks
Understanding the fundamentals:
– Indexing:
  - Index-time pipelines
  - Index testing
– Searching:
  - Searching in isolation & under indexing load
  - Types of searches
  - Mixed workload impact on resources
Testing Disclaimers
• Testing on arbitrary datasets in a "closed course" (lab) environment
• Do not take out of context
Typical "my Splunk is not performing well" conversation
Identifying Performance Bottlenecks
Understand how data flows – Splunk operations pipelines: Ingest (Indexing) and Consume (Search)
Instrument – capture metrics for the relevant operations
Run tests
Draw conclusions – chart and table the metrics, look for emerging patterns
Make recommendations
Put That In Your Pipeline And Process It
[Diagram: a pipeline takes data from an input queue, runs it through a chain of processors (UTF-8 converter, header extraction, line breaker), and hands it to an output queue]
Splunk data flows through several such pipelines before it gets indexed.
Lots Of Pipelines
Index-time Processing
Testing: Dataset A
10M syslog-like events:
  . . .
  08-24-2016 15:55:39.534 <syslog message>
  08-24-2016 15:55:40.921 <syslog message>
  08-24-2016 15:55:41.210 <syslog message>
  . . .
Index-time Pipeline Results
Indexing time (s) by configuration, from most flexible to most performance-tuned:
  Default   9.5
  MLA       8.6
  LM+TF     6.3
  LM+DC     5.8
• All pre-indexing pipelines are expensive at default settings
  – Price of flexibility
• If you're looking for performance, minimize generality; the relevant props.conf attributes (see the sketch below) are:
  - LINE_BREAKER
  - SHOULD_LINEMERGE
  - MAX_TIMESTAMP_LOOKAHEAD
  - TIME_PREFIX
  - TIME_FORMAT
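For reference, a minimal props.conf sketch in that spirit for the syslog-like dataset above. This is an illustration rather than the exact test configuration: the stanza name and the line-breaking regex are assumptions made here to match the sample events.

  # props.conf - event breaking and timestamping tuned for Dataset A (illustrative)
  [syslog_like]
  # Events are single lines: don't merge, break on newlines followed by the leading timestamp
  SHOULD_LINEMERGE = false
  LINE_BREAKER = ([\r\n]+)\d{2}-\d{2}-\d{4}\s\d{2}:\d{2}:\d{2}\.\d{3}
  # The timestamp sits at the start of each event, e.g. 08-24-2016 15:55:39.534 (23 characters)
  TIME_PREFIX = ^
  MAX_TIMESTAMP_LOOKAHEAD = 23
  TIME_FORMAT = %m-%d-%Y %H:%M:%S.%3N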
Next: Let's Index Dataset B
Generate a much larger dataset (1TB)
– High cardinality, ~380 bytes/event, 2.9B events
Forward to the indexer as fast as possible
– Indexer:
  - Linux 2.6.32 (CentOS)
  - 2x12 Xeon 2.30 GHz (HT enabled)
  - 8x300GB 15k RPM drives in RAID-0
– No other load on the box
Measure
Indexing: CPU And IO
[Charts: CPU utilization and IO during the indexing test]
Indexing Test Findings
CPU utilization
– ~17.6%; in this case, 4-5 real CPU cores
IO utilization
– Characterized by both reads and writes, but not as demanding as search. Note the splunk-optimize process
Ingestion rate
– 30MB/s – "speed of light", with no search load present on the server
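The slides don't show how the 30MB/s figure was measured; one common way to observe ingestion rate on the indexer itself is the per-index throughput that splunkd records in metrics.log. A hedged sketch (the 30-second span matches the default metrics interval):

  index=_internal source=*metrics.log* group=per_index_thruput
  | timechart span=30s sum(kb) AS indexed_kb
  | eval indexed_MBps = round(indexed_kb / 30 / 1024, 2)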
Index Pipeline Parallelization
Splunk 6.3+ can maintain multiple independent pipeline sets
– i.e. the same as if each set were running on its own indexer
If the machine is under-utilized (CPU and I/O), you can configure the indexer to run 2 such sets (see the sketch below)
– Achieves roughly double the indexing throughput capacity
Try not to set it over 2
Be mindful of the associated resource consumption
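A minimal sketch of that configuration, using the standard server.conf attribute for pipeline sets; treat the value as something to validate against your own CPU and I/O headroom rather than a blanket recommendation:

  # server.conf on the indexer (restart required)
  [general]
  # Run two independent index pipeline sets (default is 1)
  parallelIngestionPipelines = 2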
Indexing Test Conclusions
Distribute as much as you can – Splunk scales horizontally
– Enable more pipelines, but be aware of the compute tradeoff
Tune event-breaking and timestamping attributes in props.conf whenever possible
Next: Searching
Real-life search workloads are too complex and varied to be profiled exactly
But we can generate arbitrary workloads covering a wide spectrum of resource utilization (from IO-bound to CPU-bound) and profile those instead
– An actual workload's profile will fall somewhere in between
Search Pipeline (High Level)
[Diagram: the splunkd search pipeline – some preparatory steps, followed by the main processing stages, with progress returned to the search head (SH)]
Search Pipeline Boundedness
[Diagram: the same pipeline annotated by boundedness – the stages that read data off disk are IO bound, while the later processing stages are CPU + memory bound]
Search Types
Dense
– Characterized predominantly by returning many events per bucket
  index=web | stats count by clientip
Sparse
– Characterized predominantly by returning some events per bucket
  index=web some_term | stats count by clientip
Rare
– Characterized predominantly by returning only a few events per index
  index=web url=onedomain* | stats count by clientip
Okay, Let's Test Some Searches
Use our already-indexed data
– It contains many unique terms with predictable term density
Search under several term densities and concurrencies
– Term density: 1/100, 1/1M, 1/100M
– Search concurrency: 4 – 60
– Searches:
  - Rare: over the entire 1TB dataset
  - Dense: over a preselected time range
Repeat all of the above while under an indexing workload
Measure
Dense Searches
[Charts: CPU Utilization (%) and IO Wait (%) vs. search concurrency – CPU hits 100% once concurrency reaches the core count]
Indexing With Dense Searches
[Chart: CPU Utilization (%) vs. concurrency, with an "Indexing Only" baseline – CPU hits 100% earlier than without indexing]
Dense Searches Summary
Dense workloads are CPU bound
Dense workload completion times and indexing throughput are both negatively affected when running simultaneously
Faster disk won't necessarily help as much here
– The majority of time in dense searches is spent on CPU: decompressing rawdata plus other SPL processing
Faster and more CPUs would have improved overall performance
Rare Searches
[Charts: CPU Utilization (%) and IO Wait (%) vs. search concurrency]
Indexing With Rare Searches
[Charts: CPU Utilization (%) and IO Wait (%) vs. search concurrency, while indexing]
More Numbers
[Chart: Indexing Throughput (KB/s) under rare-search load vs. an "Indexing Only" baseline]
Rare Searches Summary
Rare workloads (investigative, ad hoc) are IO bound
Rare workload completion times and indexing throughput are both negatively affected when running simultaneously
1/100M searches have a lesser impact on IO than 1/1M searches
When indexing is on, search duration in the 1/1M case increases substantially more than in the 1/100M case; search and indexing are both contending for IO
In the 1/100M case, bloom filters help improve search performance
– Bloom filters are special data structures that indicate with 100% certainty that a term does not exist in a bucket (telling the search process to skip that bucket)
Faster disks would definitely have helped here
More CPUs would not have improved performance by much
Is My Search CPU Or IO Bound?
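The original slide doesn't spell out a method, but one hedged way to check is to look at how much CPU the search process actually consumes while it runs, using the per-process resource-usage events splunkd writes to the _introspection index (field names below are the standard ones for those events; verify them in your environment). A long-running search that stays well below one core's worth of CPU is typically waiting on IO; one that pegs its cores is CPU bound:

  index=_introspection sourcetype=splunk_resource_usage component=PerProcess
      data.search_props.sid=*
  | eval sid='data.search_props.sid', cpu_pct='data.pct_cpu'
  | timechart span=10s avg(cpu_pct) by sid

The Job Inspector's execution costs for a single search give a similar breakdown per processing phase.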
Top Takeaways / Re-Cap
• Indexing
  – Distribute – Splunk scales horizontally
  – Tune event breaking and timestamp extraction
  – Faster CPUs will help with indexing performance
• Searching
  – Distribute – Splunk scales horizontally
  – Dense search workloads
    - CPU bound; mix with indexing better than rare workloads do
    - Faster and more CPUs will help
  – Rare search workloads
    - IO bound; not that great when mixed with indexing
    - Bloom filters help significantly
    - Faster disks will help over the long term
• Performance
  – Avoid generality; optimize for the expected case and add hardware whenever you can

[Diagram: search boundedness by term density – low-density (rare) searches are IO bound, high-density (dense) searches are CPU bound]

Use case                           What helps?
Trending, reporting, etc.          More distribution; faster, more CPUs
Ad-hoc analysis, investigative     More distribution; faster disks, SSDs
Testing Disclaimer Reminder
Q & A
Feedback: dritan@splunk.com
You May Also Like:
– Search: Under the Hood
– Worst Practices... and How to Fix Them
– Splunk Performance Reloaded
THANK YOU