Data Onboarding From Scratch
Where Do I Begin?
The forward-looking statements made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, this presentation may not contain current or accurate
information. We do not assume any obligation to update any forward looking statements we may make. In
addition, any information about our roadmap outlines our general product direction and is subject to change
at any time without notice. It is for informational purposes only and shall not be incorporated into any contract
or other commitment. Splunk undertakes no obligation either to develop the features or functionality
described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Listen to Your Data, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in
the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2017 Splunk Inc. All rights reserved.
Who Are You?
▶ Importance
▶ Splunk Terms/Components
▶ Pre-onboarding/Data Discovery
▶ Splunkbase
▶ Creating your own sourcetype
▶ Onboarding – inputs.conf
▶ Optimizing for performance
▶ Normalizing
Why Is This Important?
[Figure: data sources and collection methods. Sources shown: Unix, Linux and Windows hosts (event logs, Active Directory, OS stats, perf); syslog hosts and network devices (TCP/UDP); mainframes; wire data; DevOps, IoT, and containers. Collection methods shown: Universal Forwarder, HTTP Event Collector (agentless), Splunk Stream, syslog.]
Default Fields
Six Things to Get Right at Index Time
▶ Host
▶ Source
▶ Sourcetype
▶ Index
▶ Event Boundary / Line Breaking
▶ Date / Timestamp
Host
▶ A default field that contains the hostname or IP address of the network device
that generated the event
▶ Use the host field in searches to narrow the search results to events that
originate from a specific device
▶ Allows you to locate the originating device
Source
▶ A default field that identifies the source of an event, that is, where the event
originated
▶ For data monitored from files and directories, the source consists of the full
pathname of the file or directory
• /var/log/messages
• /var/log/messages.1
• /var/log/secure
▶ For network-based sources, the source field consists of the protocol and port
• UDP:514
• TCP:1514
Sourcetype
▶ Events with the same sourcetype can come from different sources
• /var/log/messages
• /var/log/messages.1
• udp:514
▶ sourcetype=linux_messages_syslog may retrieve events from all of those
sources
What Happens With Bad Sourcetypes
Same Regex, Same Sourcetype
[Figure: Cisco, Squid, and Bluecoat data]
Index
Event Boundary / Line Breaking
▶ Line breaking allows Splunk to break the incoming stream of bytes into separate events
▶ Supports single-line and multi-line events
▶ Splunk can usually do this automatically
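As a rough illustration of event breaking, the Python sketch below splits a raw stream into events in two ways: once at every run of newline characters (single-line events), and once only where the next line starts with a date (multi-line events such as stack traces). The breaker patterns and the sample text are assumptions chosen for illustration; this is not how Splunk implements line breaking internally.

```python
import re

# Assumed breaker for single-line events: any run of CR/LF characters.
SINGLE_LINE_BREAKER = re.compile(r"[\r\n]+")

# Assumed breaker for multi-line events: newlines followed by a YYYY-MM-DD date.
MULTI_LINE_BREAKER = re.compile(r"[\r\n]+(?=\d{4}-\d{2}-\d{2} )")


def break_events(stream, breaker):
    """Split a raw text stream into events, dropping empty pieces."""
    pieces = (piece.strip("\r\n") for piece in breaker.split(stream))
    return [piece for piece in pieces if piece]


# Hypothetical sample data.
simple = "event one\r\nevent two\n\nevent three\n"
trace = (
    "2017-09-26 10:00:00 ERROR boom\n"
    "  at frame1\n"
    "  at frame2\n"
    "2017-09-26 10:00:01 INFO ok\n"
)

single_events = break_events(simple, SINGLE_LINE_BREAKER)
multi_events = break_events(trace, MULTI_LINE_BREAKER)
print(single_events)
print(len(multi_events), "multi-line events")
```

In the multi-line case the continuation lines stay attached to the event that precedes them, which is the behavior you want for stack traces.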
Data Discovery
Find the Data
[stanza]
host =
index =
sourcetype =
source =

Example stanza headers:
[monitor:///var/log/messages]
[tcp://:1514]
[udp://:1514]
[WinEventLog://Security]
Example: With An App
[monitor:///data/syslog/cisco_asa]
index = thenetworkindex
sourcetype = cisco:asa
host_segment = 4
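To see what host_segment = 4 selects, the Python sketch below picks the fourth "/"-separated segment of a file path. The file path and the host directory name fw01 are hypothetical; the sketch assumes the syslog server writes one directory per sending host beneath the monitored path.

```python
def host_from_segment(path, segment):
    """Return the 1-based '/'-separated segment of a path."""
    parts = [part for part in path.split("/") if part]
    return parts[segment - 1]


# Hypothetical file beneath the monitored directory; "fw01" is an assumed host directory.
path = "/data/syslog/cisco_asa/fw01/asa.log"
host = host_from_segment(path, 4)
print(host)
```

Segments 1 to 3 are data, syslog, and cisco_asa, so segment 4 is the per-host directory.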
A Special Note On Syslog
▶ Same rule as syslog – these are data formats and carry multiple sourcetypes
Wait…What If There Is No App?
Example: With No App
[monitor:///opt/saa/incredible/useractivity.log]
index = theindex
sourcetype = ????
▶ /opt/saa/incredible/useractivity.log
• sourcetype = saa:incredible:useractivity
▶ /opt/saa/incredible/dbactivity.log
• sourcetype = saa:incredible:dbactivity
▶ /opt/saa/incredible/webui.log
• sourcetype = saa:incredible:webui
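The naming pattern above (vendor:product:log name) can be sketched in Python as follows. The helper name suggest_sourcetype is hypothetical, and it assumes every path has the shape /opt/<vendor>/<product>/<name>.log, which holds for the three examples but not in general.

```python
from pathlib import PurePosixPath


def suggest_sourcetype(path):
    """Build vendor:product:logname from a path shaped like /opt/<vendor>/<product>/<name>.log."""
    p = PurePosixPath(path)
    vendor, product = p.parts[-3], p.parts[-2]
    return f"{vendor}:{product}:{p.stem}"


paths = [
    "/opt/saa/incredible/useractivity.log",
    "/opt/saa/incredible/dbactivity.log",
    "/opt/saa/incredible/webui.log",
]
suggested = [suggest_sourcetype(p) for p in paths]
print(suggested)
```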
With Correct Sourcetypes
Same Regex, Different Sourcetype
[Figure: Cisco, Squid, and Bluecoat data]
For Performance
Configure these six settings for each sourcetype in props.conf on your indexers.
[saa:incredible:useractivity]
TIME_PREFIX = ^
SHOULD_LINEMERGE = false
LINE_BREAKER = ([\r\n]+)
MAX_TIMESTAMP_LOOKAHEAD = 30
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%f%z
TRUNCATE = 10000
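The Python sketch below approximates what three of these settings ask for: truncate the event, look only at the first 30 characters from the start of the line (TIME_PREFIX = ^), and parse a timestamp in the configured format. It is an illustration under assumptions, not Splunk's parser: Python's strptime stands in for Splunk's timestamp processor, the sample event is hypothetical, and the truncation here counts characters whereas Splunk's TRUNCATE counts bytes.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f%z"
MAX_TIMESTAMP_LOOKAHEAD = 30
TRUNCATE = 10000


def parse_event(line):
    """Return (timestamp or None, truncated event text)."""
    line = line[:TRUNCATE]
    # Only the first MAX_TIMESTAMP_LOOKAHEAD characters are searched.
    window = line[:MAX_TIMESTAMP_LOOKAHEAD]
    # Try the longest prefix first, shrinking until the format matches exactly.
    for end in range(len(window), 0, -1):
        try:
            return datetime.strptime(window[:end], TIME_FORMAT), line
        except ValueError:
            continue
    return None, line


# Hypothetical event with a millisecond timestamp and a UTC offset.
sample = "2017-09-26 14:03:07.123-0400 user=alice action=login"
ts, event = parse_event(sample)
print(ts.isoformat())
```

A timestamp that starts later in the line, or that is longer than the lookahead window, would not be found. That is the point of setting these values explicitly: the parser does less guessing per event.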
Data Usability
Common Information Model (CIM)
http://docs.splunk.com/Documentation/CIM/latest/User/Overview
▶ A way of normalizing your data for maximum efficiency at search time
▶ Splunk Certified TAs typically include the necessary normalizations
▶ Allows end-users to search using common fields such as “user” across many
sourcetypes
▶ Extract your fields, normalize, then tag your data
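As a minimal sketch of what normalizing to a common field means, the Python example below renames sourcetype-specific field names to the shared name user. The original field names uid and login_name are hypothetical, and the dictionary-based renaming is only an illustration of the idea, not how Splunk add-ons implement it.

```python
# Hypothetical per-sourcetype field names that all mean "user".
FIELD_ALIASES = {
    "saa:incredible:useractivity": {"uid": "user"},
    "saa:incredible:webui": {"login_name": "user"},
}


def normalize(event):
    """Rename sourcetype-specific fields to their common names."""
    aliases = FIELD_ALIASES.get(event.get("sourcetype"), {})
    return {aliases.get(key, key): value for key, value in event.items()}


events = [
    {"sourcetype": "saa:incredible:useractivity", "uid": "alice"},
    {"sourcetype": "saa:incredible:webui", "login_name": "alice"},
]
normalized = [normalize(e) for e in events]
users = [e["user"] for e in normalized]
print(users)
```

Once both sourcetypes expose user, a single search on that field covers both.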
CIM Data Models
▶ index=* (sourcetype=*-* OR sourcetype=*too_small) | stats count by sourcetype
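The search above appears intended to surface sourcetype names that were assigned automatically rather than chosen deliberately, such as names ending in too_small or carrying a numeric suffix; that reading is an assumption, since the slide does not state it. The Python sketch below applies the same two wildcard patterns to a hypothetical list of sourcetype names to show which ones would be counted.

```python
from fnmatch import fnmatchcase

# The two wildcard patterns used in the search.
PATTERNS = ["*-*", "*too_small"]


def is_flagged(sourcetype):
    """True if the sourcetype name matches either wildcard pattern."""
    return any(fnmatchcase(sourcetype, pattern) for pattern in PATTERNS)


# Hypothetical sourcetype names.
samples = ["cisco:asa", "useractivity-too_small", "webui-2", "saa:incredible:webui"]
flagged = [name for name in samples if is_flagged(name)]
print(flagged)
```

The *-* pattern also matches any deliberately chosen name that contains a hyphen, so treat the results as candidates to review rather than a definitive list.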
▶ https://docs.splunk.com/Documentation/Splunk/latest/Data/Whysourcetypesmatter
▶ https://docs.splunk.com/Documentation/AddOns/released/Overview/Sourcetypes
▶ https://www.splunk.com/blog/2012/08/10/sourcetypes-whats-in-name.html
▶ https://www.splunk.com/blog/2010/02/11/sourcetypes-gone-wild.html