SEPTun
Chapter 1
Introduction
This guide comprises our findings from the high performance tuning effort we undertook to do 20Gbps of IDS inspection with Suricata on commodity/COTS hardware.
Over the years there have been quite a few questions, speculations and ideas about how to get the best performance out of Suricata, under what conditions that is possible and with what hardware.
This guide takes a ground-up approach and describes the first necessary steps for high performance tuning of Suricata IDS.
Suricata’s performance has 4 major variables:
• Suricata version
• Traffic type
• Rules used
• HW capability
Not to be underestimated, though, are the OS type, kernel version, NUMA/NIC locality and CPU L3 cache when we want to take performance to the next level. These are described in more detail later in this paper. The most important thing to understand and tackle is that a packet has a number of hops to make before it gets to Suricata for further processing.
Our setup
• Up to 20Gbit/sec of traffic in long and sustained peaks. Around 2 000 000 pkt/sec.
• Suricata 3.2dev (effectively 3.2)
• HP ProLiant DL360 Gen9
• 2x dual port X710 and X520 - one port on each card used. Cards installed into
separate NUMA nodes.
• ixgbe-4.4.6 and i40e drivers from Intel, not from the kernel.
We ended up not needing all the resources of such a big system for 20Gbps inspec-
tion.
Chapter 2
Our grand plan can be summarized like this - use the CPU's L3 cache as a data bus. Hence a major requirement for reproducing the results of this paper is to have CPUs that have an L3 cache (E5s, for example).
What we mean by that is: let your card push packets directly into the L3 on each socket, so that when the kernel and Suricata later want to process a packet, it is already in the L3 - fresh, warm and ready to be used. As we move along in this paper you will see that the majority of our effort has been spent on making sure the above is true and packets are not evicted from a cache before being used/processed.
Some relevant memory access latencies:
• Remote L3 - 80ns
Data located on the L3 cache on the remote NUMA node in relation to the CPU
processing it.
• Local RAM - 96ns
Data located in the RAM residing on the same NUMA node as the CPU processing
it.
• Remote RAM - 140ns
Data located in the RAM residing on the remote NUMA node in relation to the
CPU processing it.
Interesting observations
It takes almost 5 times longer to fetch data from RAM than from a local L3 cache; not thrashing the cache is crucial. It takes almost as long to fetch data from a remote L3 cache as it does from local RAM; NUMA locality is important.
You have between 65 and a couple of hundred nanoseconds to process a single packet when dealing with tens-of-gigabits-per-second networks.
With the smallest possible number of cache misses we can achieve smooth packet processing from end to end, without read/fetch delays that can result in packet loss and thus impact Suricata IDS/IPS inspection.
How we did it
• Dual CPU server with all DIMMs per socket of the same size. Needless to say, if you have four memory channels per socket, you must use 4 DIMMs.
• Two network interfaces, on separate cards, one card for each NUMA node. NIC RSS, MQ, FDIR and ATR stay disabled to help avoid packet reordering by FDIR (Flow Director).
• Isolate all cores from any tasks and use core 0 thread 0 as a housekeeping core. Yes, correct - isolate cores so that no tasks run on them except the work done by Suricata.
• Isolate hardware IRQ processing, software IRQ processing and AF_Packet and pin it all to a dedicated core - one per card/NUMA node. All of the above tasks can usually be handled by a single core. Nothing else runs on this core.
• Finally we run Suricata pinned to the "leftover" isolated cores. Card 0 sends
packets to node 0 where they are treated from start to end, card 1 sends packets
to node 1. We used Suricata in workers mode to have one flow treated on the
same core and thread from start to end.
• Every buffer, ring, descriptor structure is kept as small as it can be to avoid buffer
bloat and make sure it stays in a CPU cache (our lifeline).
Life of a packet
In an ideal world, the life of a packet looks like this:
1. Packet is on the wire. Card fetches it into a FIFO queue.
2. The card sends both the packet descriptor and the packet itself to the L3 cache of the CPU the card is connected to.
3. At some point a write-back of that data to host memory on the same NUMA node happens. At this point the packet is in one of the free buffers - and this is what people call "hardware buffers". There is no hardware buffer on the card, other than the FIFO queue. We will also call the area of memory that the card writes packets to the "DMA area".
4. The card sends, or does not send, an interrupt to let Linux know it should start polling packets with NAPI. Linux acknowledges that interrupt, if it was raised, and goes back to NAPI polling. Everything happens on a core dedicated to interrupts.
5. NAPI fetches the packet from the DMA area (a ring buffer, see 2 and 3 above), which results in an L3 hit, and into an SKB by doing magic tricks with the underlying memory pages. The driver either reuses half of the same page or allocates a new one to replace the used one. NAPI delivers the packet to subscribers, like AF_Packet. We are still on the "interrupts core".
6. AF_Packet calculates a hash and puts the packet in a buffer for the corresponding worker thread on a different core on the same CPU, resulting in an L3 hit. TPACKET_V3 (AF_Packet V3 in Suricata) works hard to avoid data copying.
7. Every now and then, the Suricata worker switches to processing the next block full of packets in the ring, with simple pointer arithmetic, without issuing a single system call or doing copies. The mechanism has been designed to amortize the amount of work done per packet - that is why packets are accessed by Suricata in batches. After Suricata is done with a packet, it marks it so AF_Packet can reuse the underlying page. It does that without transitions between user space and kernel space. This is hugely important for performance. Suricata does not have its own buffer for packets - the kernel buffer is shared with userspace.
Countless hours were spent during this research to understand a packet's path from the wire to Suricata. We can say that the Linux networking stack is impressive (to say the least). It avoids copying packets unless doing a copy is cheaper than moving pointers around, for example when a packet is smaller than an SKB (socket buffer) header.
Chapter 3
Tuning steps
In our setup we decided to hardcode cores to certain duties, to avoid cache thrashing and to make sure our assumptions are correct most of the time. Here is how it works:
• Cores 0 and 14 - housekeeping cores. One per NUMA node. They are nothing
special, just the first thread from the first core of the first CPU and the first thread
of the first core from the second CPU.
These cores do nothing but generic OS duties not directly related to our workload. Things like ssh, audit and various kernel threads run here. Timekeeping is also done on this core. Suricata's flow manager threads run here as well.
• Cores 1 and 15 - the hardware and software parts of IRQ processing run here, as well as most of AF_Packet.
These cores will switch frequently between user space and kernel space. Since this is the most loaded pair of cores, we advise against running anything on their HT peers.
• Leftover cores.
They run Suricata workers.
Hardware
Let's start with the basics. Use one network card per NUMA node. A single CPU is OK, as is dual CPU. Do not go beyond two CPUs - those platforms scale poorly. Buy 2 or 4 servers instead of one with 4 CPUs. It is also cheaper.
Use Intel E5 processors, even for a single CPU configuration, ideally Haswell or later. They have an L3 cache that E3s do not, and that cache is a critical piece of the performance. Do not buy E7s, there is no reason to: while they have more cores, due to the ring architecture they do not scale that well. Plus they are expensive (and so are servers that accept them) - buy two E5 servers instead.
Use the fastest memory you can buy. Have one DIMM per channel, avoid 2 or more DPC (DIMMs per channel) and make sure you use all memory channels. Yes, it is more expensive to buy 8x 16GB than 16x 8GB, but with the latter, memory access latency goes up and the frequency and throughput go down.
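A quick way to verify how the DIMM slots are actually populated (needs root; field names can vary slightly between BIOS vendors):
dmidecode -t memory | grep -E 'Locator|Size|Speed'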
Use either an Intel X710 (recommended) or an Intel X520 card. Mellanox cards look interesting as well, but we have not tested them.
Make sure each card goes into a separate NUMA node and is connected to a PCIe
slot that has at least x8 Gen 3.0 to root. Avoid installing anything else in any extra
slots.
Where is my NUMA
Install the hwloc package and create some pretty graphs. Make sure ACPI SLIT is enabled in BIOS/EFI (see below).
On Debian/Ubuntu like systems:
apt-get install hwloc
Give it a try:
lstopo --logical --output-format txt
The above gives you ASCII art; if hwloc was built with libcairo support you can do:
lstopo --logical --output-format png > `hostname`.png
To see which NUMA node your card is connected to:
cat /sys/class/net/<INTERFACE>/device/numa_node
To see a list of cores and where they belong:
cat /sys/devices/system/node/node[X]/cpulist
To see per interface or PCI function:
cat /sys/class/net/[interface]/device/numa_node
cat /sys/devices/[PCI root]/[PCIe function]/numa_node
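A small helper that does the same for every interface at once (virtual interfaces without a PCI device are skipped by the glob; -1 means the platform reports no NUMA information):
# Print the NUMA node of every interface that has a PCI device behind it.
for i in /sys/class/net/*/device/numa_node; do
    echo "$(basename $(dirname $(dirname $i))): $(cat $i)"
done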
Firmware. EFI. BIOS.
• Disable IOMMU (input–output memory management unit)
It provides little security, some functionality that is not of any use in our case and
creates a huge latency problem for all memory operations.
• Disable ASPM (Active State Power Management) and QPI (QuickPath Interconnect) power management and similar PCIe power saving states.
We are talking tens of gigabits per second here, millions of packets. Your cards will be busy. ASPM (contrary to C-states) has a bad reputation and likes to make the PCIe bus "go away" at the worst possible moment. Your card wants to transmit a packet from the FIFO to the ring buffer, as the FIFO is getting full? Oops, there is no PCIe for a moment. Packets are overwritten. The same goes for QPI power saving - little benefit, and we need that QPI to be active all the time.
• Disable VT-x, VT-d and SR-IOV.
You will not use them, it is sometimes hard to understand what they silently enable and what kind of workarounds the kernel will apply when it sees them - and they increase the attack surface.
• Disable Node Interleaving.
Leave channel interleaving enabled. You want your locally allocated memory to
stay local and not be allocated on two NUMA nodes at the same time.
• Enable IOAT - Intel I/O Acceleration Technology (DDIO - Data Direct I/O Technology, DCA - Direct Cache Access).
Technically, DDIO is part of a greater IOAT "features package"; the two parts in question here are DDIO and DCA.
• Enable the prefetchers:
- HW prefetcher
- Adjacent sector prefetch
- DCU stream prefetcher
The kernel will likely override your decision anyway. You will see later why we need them.
On HP hardware/servers this is done through the BIOS setup screens.
AF-Packet symmetric hash
For some kernel versions there is a bug with regard to symmetric hashing, as described in the commit that fixes it.
We strongly advise you to first verify that your kernel is free of the bug, using the excellent verification tool from Justin Azoff (of the Bro IDS project).
Kernels <= 4.2, 4.4.16+, 4.6.5+ or 4.7+ are safe. To be sure the issue does not exist in your kernel version, you can use the verification tool mentioned above.
Application Targeted Receive (ATR) was never designed for IDS/IPS, but rather for scaling workloads like a web or file server. In those circumstances it excels. For IDS/IPS it must stay disabled.
Disable everything that can be disabled. The ITR number below worked well for us but might have to be tuned by you. Each parameter has 4 values since there were 4 ports on the system.
• MQ - multiqueue. We use a single queue only, so packets will be processed by a single core.
• DDIO - the card will pre-warm the lowest level cache with the packet descriptor and packet data. This means that by the time the kernel wants to fetch the packet, it is already in L3 and does not have to be copied later.
• VMDQ - Virtual Machine Device Queues. Clouds just do not mix with a high performance environment.
• InterruptThrottleRate - keep hardware interrupts under control. We tuned it later from the OS, so this is mostly to show one of many ways (see the sketch after this list).
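For illustration only (Intel out-of-tree ixgbe driver; one comma-separated value per port, four ports on this system; the InterruptThrottleRate value here is just an example, not our exact setting):
# Disable multiqueue, VMDQ and keep a single RSS queue; throttle interrupts.
modprobe ixgbe MQ=0,0,0,0 RSS=1,1,1,1 VMDQ=0,0,0,0 InterruptThrottleRate=12500,12500,12500,12500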
Managing interrupts
We do not want the card to send us too many hardware interrupts. Remember, the only thing they do is kick the NAPI processing loop so it starts processing packets. Since we are aiming at high speeds, NAPI will be running most of the time anyway, subject to internal mechanisms that avoid starvation - not that important once we have isolated processing to a set of cores.
Lower the NIC ring descriptor size
ethtool -G p3p1 rx 512
ethtool -G p1p1 rx 512
Why? Lowering it from 4096 made DDIO work - the L3 cache miss ratio went down from 16% to 0.2% on the core handling IRQs (soft+hard) and to 0.46% on the Suricata worker cores. This means packets were effectively delivered through the L3 cache. There was no impact on the L1 and L2 caches.
Smaller buffers mean interrupts should run more frequently, so we lowered the threshold time during which the card cannot issue an interrupt from 80 to 20 usecs. It did not lead to increased CPU usage, since the majority of time is spent in softirq with hardware interrupts disabled.
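One way to do that from the OS, assuming the driver exposes interrupt coalescing through ethtool:
# Allow the card to raise an interrupt at most every 20 microseconds.
ethtool -C p1p1 rx-usecs 20
ethtool -C p3p1 rx-usecs 20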
Cache misses
Effects of the ring descriptor size on DDIO and L3 hits. Measured on the core handling software and hardware interrupts, with a ring descriptor size of 4096 buffers, each buffer 2048 bytes in size:
perf stat \
-e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches -C 1 sleep 60
Measured on the core handling Suricata worker thread:
perf stat \
-e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches -C 10 sleep 60
Disable NIC offloading
Make sure everything that can be turned off (with ethtool -K) is off. On our system:
ethtool -k p3p1
Features for p3p1:
rx-checksumming: off
tx-checksumming: off
tx-checksum-ipv4: off
tx-checksum-ip-generic: off [fixed]
tx-checksum-ipv6: off
tx-checksum-fcoe-crc: off [fixed]
tx-checksum-sctp: off
scatter-gather: off
tx-scatter-gather: off
tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: off
tx-tcp-segmentation: off
tx-tcp-ecn-segmentation: off
tx-tcp6-segmentation: off
udp-fragmentation-offload: off [fixed]
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off [fixed]
rx-vlan-offload: off
tx-vlan-offload: off
ntuple-filters: off
receive-hashing: off
highdma: on
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-ipip-segmentation: off [fixed]
tx-sit-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off [fixed]
rx-all: off [fixed]
tx-vlan-stag-hw-insert: off [fixed]
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: off [fixed]
l2-fwd-offload: off [fixed]
busy-poll: off [fixed]
hw-tc-offload: off [fixed]
Why? All offloading functions will mangle packets in some way, usually merging packets and losing header information. The IDS then cannot reassemble flows correctly, or even at all.
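A sketch for turning them off in one pass (short feature names as accepted by ethtool -K; anything marked [fixed] above will simply refuse to change):
# Disable the most common offloads on one interface; repeat for the other port.
for feat in rx tx sg tso gso gro lro rxvlan txvlan ntuple rxhash; do
    ethtool -K p1p1 $feat off 2>/dev/null
done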
Pin interrupts
Where they need to be:
p1p1 -> NUMA 0, CPU 0, core ID 1
p3p1 -> NUMA 1, CPU 1, first core, so ID 14
Here is how:
set_irq_affinity 1 p1p1
set_irq_affinity 14 p3p1
(the set_irq_affinity script is available in your NIC source drivers folder)
Verify with:
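One way to check, assuming the IRQ vectors are named after the interface (as the Intel drivers do):
# Which CPUs are allowed to handle each of the interface's IRQs?
for irq in $(grep p1p1 /proc/interrupts | awk -F: '{print $1}'); do
    echo "IRQ $irq -> CPUs $(cat /proc/irq/$irq/smp_affinity_list)"
done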
Common misconceptions - part 1
• Use RSS and multiple queues to spread the load across cores.
Nope, do not use more than one queue - the X520 hash is asymmetric by default (although it can be made symmetric) and FlowDirector reorders packets. This is caused by an interaction between software IRQ processing and the Linux scheduler. Future work might show how to still use RSS and multiqueue, with careful process pinning, after changing the hash. We believe this and briefly tested it with Bro IDS - a symmetric hash (can be set with ethtool) and pinned processes gave us no visible signs of packet reordering. Performance of such a setup is questionable, though - every worker core would make numerous transitions between user space and kernel space. If anything, multiple "interrupt handling cores" should be used, isolated from any other processing, while the rest of this setup stays as it is. It might be an interesting experiment! See also the next point.
• Does pinning everything and enabling multiqueue help with packet reordering?
We do not know, repeatable tests are necessary. In theory it should work. In practice it might perform worse, and the only configuration that would make sense here is a small number of queues, each with a dedicated core, instead of the usual division of each core's time between Suricata and softirq processing.
• An X520 with no RSS can handle as much as with RSS.
It cannot. A single queue can handle up to 2Mpps, and only on a really fast CPU (3.4-3.6GHz per core).
• Modern Linux servers use IOAT to accelerate data copies.
They can, but if anything there is an advantage only for large copies, like 256KB. Ixgbe does not use it.
• IOAT is used automatically and transparently for all data copies.
It is not. The driver has to use it explicitly and there is an API for that: "Hey IOAT, copy this much data from here to there, will you" - and IOAT does, while Linux does something else. IOAT then returns and signals that the copy is done (or not).
• There are hardware queues on the card. Multiqueue cards have multiple buffers on them.
Receive and transmit queues exist in your host memory, not on the card. They are simply ring buffers (descriptors with buffers attached, which in turn have pages attached). Your hardware queue is in fact a descriptor ring that points to a number of buffers, one such ring per queue. The only "buffer" a card has is a FIFO, which is there only to buy enough time to turn incoming frames into PCIe packets. With FlowDirector disabled the X520 has a 512KB FIFO.
A single core can handle:
• sending frames from the DMA region to an SKB
• attaching frames and the SKB to AF_Packet
How to do that right? Use the same NUMA node, so packets will be in L3. This creates a very efficient communication channel. Tested on a 2.0GHz CPU with netsniff-ng.
• DO NOT split hard+soft IRQ or RPS processing among multiple CPUs/cores, or packet reordering will occur.
You need to echo bitmasks representing the core that will be processing softirq. For example, to process data from p1p1 on core 2 and from p3p1 on core 16:
cd /sys/class/net
echo 4 > ./p1p1/queues/rx-0/rps_cpus
echo 10000 > ./p3p1/queues/rx-0/rps_cpus
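If you prefer not to work out the hex mask by hand, a tiny helper (the core number is the only input):
core=16
printf '%x\n' $((1 << core))    # prints 10000, the rps_cpus mask for core 16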
NOTE: with RSS one can configure a symmetric hash; with RPS there is no control over the hash and packet reordering might happen. Always use a single core (thread) per NIC for RPS.
Core isolation
Add the isolation parameters to grub (on Debian/Ubuntu like systems). For example:
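A minimal sketch of what that can look like (the CPU lists match the example command line shown below; adjust them to your core layout):
# /etc/default/grub - append the isolation parameters to the kernel command line
GRUB_CMDLINE_LINUX_DEFAULT="... isolcpus=1-13,15-55 nohz_full=1-13,15-55 rcu_nocbs=1-13,15-55"
# then regenerate the grub configuration and reboot
update-grub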
• nohz_full, rcu_nocbs - omit scheduling clock ticks on CPUs where only one task runs. This depends on your kernel version and on whether it was enabled during the kernel build; it is not supported on Ubuntu kernels.
Verify with:
cat /proc/cmdline
dmesg | head
Examples from our system:
BOOT_IMAGE=/vmlinuz-4.4.0-45-generic.efi.signed \
root=UUID=dedcba7d-1909-4797-bd57-663a423a6a2f \
ro processor.max_cstate=3 intel_idle.max_cstate=3 selinux=0 apparmor=0 \
mce=ignore_ce nohz_full=1-13,15-55 \
isolcpus=1-13,15-55 rcu_nocbs=1-13,15-55
Pin IRQs
The so-called housekeeping. Pin all IRQs to core 0, thread 0:
for D in $(ls /proc/irq)
do
  # Skip non-IRQ entries and leave IRQ 0 alone,
  # pin everything else to core 0 (affinity bitmask 0x1).
  if [[ -x "/proc/irq/$D" && $D != "0" ]]
  then
    echo $D
    echo 1 > /proc/irq/$D/smp_affinity
  fi
done
Pin rcu tasks, if any, to core 0, thread 0, NUMA 0:
for i in `pgrep rcu` ; do taskset -pc 0 $i ; done
More kernel threads pinning to core 0, thread 0:
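A sketch of what this can look like (the workqueue cpumask knob exists only on kernels that expose workqueues in sysfs, and per-CPU kernel threads such as ksoftirqd cannot be moved at all):
# Keep unbound kernel workqueues on CPU 0 (bitmask 0x1), if the knob exists.
[ -w /sys/devices/virtual/workqueue/cpumask ] && echo 1 > /sys/devices/virtual/workqueue/cpumask
# Pin other movable kernel threads (example: kswapd, khugepaged) to CPU 0.
for i in `pgrep 'kswapd|khugepaged'` ; do taskset -pc 0 $i ; done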
The downside of C-states is that it takes time for the CPU to go into a C-state and back. That is why we found (by trial and error) that you want your CPU to reach the C3 state (and not as far as C7). Limiting the CPU to C1 did not allow it to reach maximum speeds in Turbo Boost. It is important to note that the CPU will only go into a C-state when it is idle for long enough; a busy core will stay in C0.
On our system (output of cpupower frequency-info):
analyzing CPU 0:
driver: intel_pstate
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 0.97 ms.
hardware limits: 1.20 GHz - 3.60 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 1.20 GHz and 3.60 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 3.10 GHz (asserted by call to hardware).
boost state support:
Supported: yes
Active: yes
For Turbo mode, you must pay attention to how many cores can run at what frequency. For our CPUs:
1-2 cores - 3.6GHz
3 cores - 3.4GHz
4 cores - 3.4GHz
5 cores - 3.2GHz
all cores - 3.1GHz
Max observed: 3.3GHz.
If executing AVX (intensive vector tasks) it is 3.3/3.1/3.0/2.9. On a system like this, AVX will be used by Hyperscan, so the Suricata cores might run at a lower frequency than the cores doing hardware and software IRQ.
Common misconceptions - part 2
• You should disable all C-states, otherwise your CPU will not be running at full power.
Nope, C-states are idle states - activated when a core has nothing to do. The CPU does not force itself into a C-state (granted there are no temperature issues). The only state that changes during normal operation is the P-state - which changes the frequency the CPU runs at.
• Limit C-states (or disable them) because switching between them costs a lot of time.
There is some cost associated with switching between states, but they allow cores to cool down, gather thermal headroom and switch into turbo mode if necessary. We limited C-states to C3, which allowed the full turbo mode - 3.0GHz on all cores, or 3.3GHz on a few. Use the minimal number of C-states that allows reaching full Turbo Boost - give up on C6, for example.
• Ark.intel.com will tell you everything about CPU Turbo.
Suricata configuration
• pin the Suricata worker threads to the isolated CPUs (see Core isolation above) - CPU affinity
• enable the new (in 3.2dev) local bypass feature - if the corresponding flow is locally bypassed then Suricata simply skips all streaming, detection and output, and the packet goes directly out in IDS mode and to verdict in IPS mode.
Local bypass
What is local bypass
"The concept of local bypass is simple: Suricata reads a packet, decodes it, checks
it in the flow table. If the corresponding flow is local bypassed then it simply skips all
streaming, detection and output and the packet goes directly out in IDS mode and to
verdict in IPS mode.
Once a flow has been local bypassed it is applied a specific timeout strategy. Idea
is that we can’t handle cleanly the end of the flow as we are not doing the streaming
reassembly anymore. So Suricata can just timeout the flow when seeing no packets.
As the flow is supposed to be really alive we can set a timeout which is shorter than the
established timeout. That’s why the default value is equal to the emergency established
timeout value."
Local bypass conf:
stream:
  memcap: 12gb
  prealloc-sessions: 200000
  checksum-validation: no   # reject wrong csums
  inline: no                # no inline mode
  bypass: yes
  reassembly:
    memcap: 24gb
    depth: 1mb
AF-Packet
AF-Packet conf:
# Linux high speed capture support
af-packet:
  - interface: p1p1
    # Number of receive threads. "auto" uses the number of cores
    threads: 11
    # Default clusterid. AF_PACKET will load balance packets based on flow.
    cluster-id: 99
    # Default AF_PACKET cluster type. AF_PACKET can load balance per flow or per CPU.
    # This is only supported for Linux kernel > 3.1
    # possible values are:
    #  * cluster_round_robin: round robin load balancing
    #  * cluster_flow: all packets of a given flow are sent to the same socket
    #  * cluster_cpu: all packets treated in kernel by a CPU are sent to the same socket
    #  * cluster_qm: all packets linked by network card to a RSS queue are sent to the same
    #    socket. Requires at least Linux 3.14.
    #  * cluster_random: packets are sent randomly to sockets but with an equal distribution.
    #    Requires at least Linux 3.14.
    #  * cluster_rollover: kernel rotates between sockets filling each socket before moving
    #    to the next. Requires at least Linux 3.10.
    # Recommended modes are cluster_flow on most boxes and cluster_cpu or cluster_qm
    # with capture cards using RSS (requires cpu affinity tuning and system irq tuning)
    cluster-type: cluster_flow
    # In some fragmentation cases, the hash can not be computed. If "defrag" is set
    # to yes, the kernel will do the needed defragmentation before sending the packets.
    defrag: yes
    # After Linux kernel 3.10 it is possible to activate the rollover option: if a socket is
    # full then the kernel will send the packet on the next socket with room available. This
    # can minimize packet drop and increase the treated bandwidth on single intensive flows.
    #rollover: yes
    # To use the ring feature of AF_PACKET, set 'use-mmap' to yes
    use-mmap: yes
    # Lock memory map to avoid it going to swap. Be careful that over subscribing could lock
    # your system
    #mmap-locked: yes
    # Use tpacket_v3 capture mode, only active if use-mmap is true
    tpacket-v3: yes
    # Ring size will be computed with respect to max_pending_packets and number
    # of threads. You can set manually the ring size in number of packets by setting
    # the following value. If you are using flow cluster-type and have really
    # intensive single-flows you could want to set the ring-size independently of the number
    # of threads:
    ring-size: 400000
    # Block size is used by tpacket_v3 only. It should be set to a value high enough to contain
    # a decent number of packets. Size is in bytes so please consider your MTU. It should be
    # a power of 2 and it must be a multiple of page size (usually 4096).
    block-size: 393216
    # tpacket_v3 block timeout: an open block is passed to userspace if it is not
    # filled after block-timeout milliseconds.
    #block-timeout: 10
    # On a busy system, setting this to yes could help to recover from a packet drop
    # phase. This will result in some packets (at max a ring flush) being non treated.
    #use-emergency-flush: yes
    # recv buffer size, increasing the value could improve performance
    #buffer-size: 1048576
    ##buffer-size: 262144
    # Set to yes to disable promiscuous mode
    # disable-promisc: no
    # Choose checksum verification mode for the interface. At the moment
    # of the capture, some packets may be with an invalid checksum due to
    # the checksum computation being offloaded to the network card.
    # Possible values are:
    #  - kernel: use indication sent by kernel for each packet (default)
    #  - yes: checksum validation is forced
    #  - no: checksum validation is disabled
    #  - auto: Suricata uses a statistical approach to detect when
    #    checksum off-loading is used.
    # Warning: 'checksum-validation' must be set to yes to have any validation
    #checksum-checks: kernel
    # BPF filter to apply to this interface. The pcap filter syntax applies here.
    #bpf-filter: port 80 or udp
    # You can use the following variables to activate AF_PACKET tap or IPS mode.
    # If copy-mode is set to ips or tap, the traffic coming to the current
    # interface will be copied to the copy-iface interface. If 'tap' is set, the
    # copy is complete. If 'ips' is set, the packets matching a 'drop' action
    # will not be copied.
    #copy-mode: ips
    #copy-iface: eth1
  # Put default values here. These will be used for an interface that is not
  # in the list above.
  - interface: p3p1
    threads: 11
    cluster-id: 98
    use-mmap: yes
    tpacket-v3: yes
    ring-size: 400000
    block-size: 393216
    #buffer-size: 1048576
    ##buffer-size: 262144
    # 128KB before
    cluster-type: cluster_flow
  - interface: p1p1
    threads: 11
    cluster-id: 99
    use-mmap: yes
    tpacket-v3: yes
    ring-size: 400000
    block-size: 393216
    #buffer-size: 1048576
    ##buffer-size: 262144
    cluster-type: cluster_flow
  - interface: p3p1
    threads: 11
    cluster-id: 98
    use-mmap: yes
    tpacket-v3: yes
    ring-size: 400000
    block-size: 393216
Threading conf:
threading:
  set-cpu-affinity: yes
  cpu-affinity:
    - management-cpu-set:
        cpu: [ 0,28,14,42 ]
        mode: "balanced"
        prio:
          default: "low"
    - worker-cpu-set:
        cpu: [ "2-13","16-27","30-41","44-55" ]
        mode: "exclusive"
        prio:
          default: "high"
NOTE: the NIC, the CPU and the Suricata worker threads must reside on the same NUMA node. In the case above we have p1p1 and p3p1 on different NUMA nodes, hence the worker threads are spread accordingly with CPU affinity.
NOTE: in the config above the cluster-id also rotates with the interfaces - 99/98/99/98.
Estimating Suricata's total memory consumption:
<defrag.memcap>
+ <host.memcap>
+ <ippair.memcap>
+ <flow.memcap>
+ <number_of_threads> * <216 (TcpSession structure is 192 bytes, PoolBucket is 24 bytes)> * <prealloc-sessions>
+ <stream.memcap>
+ <stream.reassembly.memcap>
+ <app-layer.protocols.dns.global-memcap>
+ <app-layer.protocols.http.memcap>
=
Estimated total memory consumption by Suricata (+/- 1-2%)
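A back-of-the-envelope illustration with the values from the configuration above (44 AF_Packet worker threads in total, 200000 prealloc-sessions; the remaining terms come straight from the memcaps in suricata.yaml):
# Per-thread preallocated TCP session term only:
echo $(( 44 * 216 * 200000 )) bytes    # 1900800000 bytes, roughly 1.8 GB
# add stream.memcap (12gb), stream.reassembly.memcap (24gb) and the other memcaps on top.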
• Given a pair 0-28 (two threads on the same core), 0 is a real core and 28 is a hyperthread, so you should avoid scheduling anything on 28 because it will only run when 0 is idle.
Nope, both 0 and 28 are hyperthreads from the start, and which thread gets access to which resources is a function of the processor's internal state - neither hyperthread is favored. The general idea behind it is: "if hyperthread 0 does not use execution unit A, then maybe hyperthread 28 could?"
• Hyperthreading divides core resources in two, yielding lower performance.
Actually, there are some duplicated resources on each core that go unused if you do not use Hyperthreading.
• NUMA crosstalk is bad because there is limited bandwidth between CPUs.
There is plenty of bandwidth on the QPI (QuickPath Interconnect). The problem is somewhere else - in frequent cache misses: the NIC has already pre-warmed the L3 on CPU 0 with data, but the L3 on CPU 1 will need to fetch the same packet over QPI, leading to a stall.
Packet Drops
Packets can be dropped in multiple places. It is important to monitor all of them, otherwise you do not have an accurate picture of packet loss, and that leads to missed events.
• If a SPAN port is used, a large packet loss can happen, especially on Junipers and older Ciscos - frequently 10-20%. New Cisco switches keep the packet loss low, in the single digits and even below 1%.
• If a packet broker is used, such as Arista, Netoptics / Ixia xStream / xDirector, Gigamon and similar, then at its input or output buffers.
• At the card's MAC layer, if packets are malformed (ethtool -S).
• At the card's FIFO buffer (rx_missed and no buffer count) if there is no room in the RX buffer, which leads to the FIFO content being overwritten.
• At the softirq layer - the kernel cannot pull packets fast enough from the RX buffers into SKBs and AF_Packet - /proc/net/softnet_stat.
• At the AF_Packet level, if packets cannot be moved to the mmapped ring quickly enough - getsockopt(ptv->socket, SOL_PACKET, PACKET_STATISTICS, &kstats, &len).
• Finally, Suricata can drop a packet that went through all that torture to get here, if the packet has an invalid CRC.
Suricata shows only the last parts, and in most cases any drops there are most likely an indirect effect of drops that occurred elsewhere.
You can (and should) monitor those drops in multiple places.
Card’s FIFO buffer drops/losses:
ethtool -S p1p1 | egrep ’rx_dropped|rx_missed|rx_packets|errors’
And watch the rx_dropped or rx_missed counter growing or not (depending on the
card).
SoftIRQ:
cat /proc/net/softnet_stat
Unfortunately the values are in hex and documented only in the kernel source. Read the kernel function softnet_seq_show() in net/core/net-procfs.c for the column descriptions. At the time we wrote this paper, the columns were as follows:
1. Total number of frames processed
2. Number of frames dropped
3. Number of times softirq had more frames to process but ran out of
pre-configured time or the number of frames was too large. If this keeps
growing, then increasing the netdev_budget might help (shown below).
4. Zero
5. Zero
6. Zero
7. Zero
8. Zero (seriously)
9. Number of times a collision occurred when the transmit path tried to
obtain a device lock. Unimportant here.
10. Number of inter-CPU interrupts, for kicking off processing of
a backlog queue on a remote CPU, used only by RPS and RFS and
not enabled by default
11. Used only by RFS, not here
There is also a script, softnet_stats.pl, that shows and translates these statistics in real time, which we recommend.
To increase the budget (if necessary):
sysctl -w net.core.netdev_budget=3000
(default is 300)
It is impossible to tell exactly what that means without looking at the driver code, but by default the softirq loop exits after processing 300 / 64 ≈ 5 runs of ixgbe_poll() (or a similar function), each of which by default tries to dequeue 64 packets at once - so around 320 packets. There is also another limit of 2 jiffies, i.e. 2ms on a 1000Hz system (which is the default).
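At the AF_Packet level, the PACKET_STATISTICS counters mentioned earlier are also exported by Suricata itself as capture.kernel_packets and capture.kernel_drops; a quick way to watch them (assuming the default stats.log location):
grep -E 'capture\.kernel_(packets|drops)' /var/log/suricata/stats.log | tail -n 20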
But the ifconfig drops counter is growing:
Ignore ifconfig - it reads counters that are hard to interpret and are increased in the most surprising places. Here are the most common reasons (we believe all of them, but making a full list would mean reading the entire network stack):
• The softnet backlog is full (but we have already described detailed troubleshooting for that).
• Bad VLAN tags (are you sure you run with promisc on?).
• Packets received with unknown or unregistered protocols (promisc again, or strange frames on the wire - ask your network admin).
• IPv6 packets when the server is configured only for IPv4 - this should not be the case here, as we run promisc and just sniff everything.
Chapter 4
Conclusion points
• There are plenty of places where a packet drop/loss or overwrite in a buffer can
occur before it is Suricata’s turn to process it.
• Suricata tuning for high performance is a process, as opposed to a config copy/pasted from somewhere.
• The tuning itself is done for the whole system - not just Suricata.
In mob we trust.
Chapter 5
Authors
• Stamus Networks
• Mobster evangelist
Chapter 6
Thank you
People and organizations without whom this guide would not have been possible: