- Firmware Slap is a tool that automates the discovery of exploitable vulnerabilities in firmware using concolic analysis and function clustering. It recovers function prototypes from firmware binaries, runs automated analysis on the functions in parallel to find bugs, and visualizes the results in JSON and Elasticsearch/Kibana.
- The document discusses challenges with concolic analysis like memory usage and underconstraining symbolic values. It proposes techniques like starting analysis after initialization, modeling functions individually, and tracking memory more precisely.
- Function clustering is used to find similar functions that may contain similar bugs. Features are extracted from functions and k-means clustering is applied to group similar functions.
11. CONCOLIC ANALYSIS
• Symbolic Analysis + Concrete Analysis
• Lots of talks already on this subject.
• Really good at finding specific inputs to trigger code paths
• For my work in Firmware Slap I used angr!
• Concolic analysis
• CFG analysis
• Used by the team that took 3rd place in the Cyber Grand Challenge!
12. BUILDING REAL INPUTS FROM SYMBOLIC DATA
• Source level protections
• LLVM’s Clang static analyzers
• Compile time protections
• Non-executable stack
• Stack canaries
• RELRO
• _FORTIFY_SOURCE
• Operating system protections
• ASLR
• Symbolic Variable Here
• get_user_input()
• To get our “You did it” output, angr will create several program states
• One has the constraints:
• x >= 200
• x < 250
• angr sends these constraints to its theorem prover to give:
• x = 231 or x = 217 or x = 249…
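The constraint set above is small enough to enumerate by hand. As a conceptual illustration only, here is a pure-Python stand-in for what the theorem prover is asked to do (angr actually hands the constraints to claripy/Z3, which solves them without brute force):

```python
# Constraints recovered along the "You did it" path:
#   x >= 200 and x < 250
constraints = [lambda x: x >= 200, lambda x: x < 250]

def concretize(constraints, bits=8):
    """Brute-force stand-in for the solver: enumerate every value
    of an unsigned `bits`-wide variable satisfying all constraints."""
    return [v for v in range(2 ** bits) if all(c(v) for c in constraints)]

solutions = concretize(constraints)
print(solutions[:3])   # [200, 201, 202]
print(len(solutions))  # 50 concrete inputs trigger this path
```

Any one of the 50 concrete values is a real input that drives the program down the same path the symbolic state describes.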
13. • Symbolically represent more of the program state.
• Registers, Call Stack, Files
• Query the analysis for more interesting conditions
• Does a network read influence or corrupt the program counter?
• Does network data get fed into sensitive system calls?
• Can we track all reads and writes required to trigger a vulnerability?
14. WHERE DOES CONCOLIC ANALYSIS FAIL?
Memory Usage
• Big code bases
• angr tries to map out every single potential path through a program. Programs of non-trivial size will eat all your resources.
• A compiled lighttpd binary might be ~200KB
• angr will run your computer out of memory before it can examine every potential program state in a webserver
• Embedded systems’ firmware can be a lot larger…
15. • Challenge:
• Model complicated binaries with limited resources
• Model unknown input
• Identify vulnerabilities in binaries
• Find binaries and functions that are similar to one another
17. • Underconstraining concolic analysis:
• Values from hardware peripherals and NVRAM are UNKNOWN
• Spin up and initialization consumes valuable time and resources
• Configs can be set up any number of ways
• Skip the hard stuff
• Make hardware peripherals and NVRAM return symbolic variables
• Start concolic analysis after the initialization steps
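In angr this “skip the hard stuff” step is done by hooking the peripheral/NVRAM accessors so every read returns a fresh unconstrained symbolic value. Below is a pure-Python mock of the idea only; the class and function names are illustrative, not Firmware Slap’s actual code (the real version uses angr SimProcedures and claripy bitvectors):

```python
import itertools

class SymbolicValue:
    """Stand-in for a symbolic bitvector (claripy.BVS in angr)."""
    _counter = itertools.count()

    def __init__(self, label, bits):
        self.name = f"{label}_{next(SymbolicValue._counter)}"
        self.bits = bits

def nvram_get(key):
    # Firmware would normally block here waiting on real hardware.
    raise RuntimeError("no hardware attached")

def hooked_nvram_get(key):
    # "Hook": never touch hardware; every read yields a fresh
    # unconstrained symbolic value the solver is free to pick.
    return SymbolicValue(f"nvram_{key}", bits=32)

nvram_get = hooked_nvram_get
val = nvram_get("wan_ip")
print(val.name)  # nvram_wan_ip_0
```

Because the returned value is unconstrained, the analysis explores every behavior the config value could cause, without spinning up the device.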
19. • angr can analyze code at this level, but it needs to know where to start.
• Ghidra can produce a function prototype that angr can use to analyze a function…
MODELING FUNCTIONS
20. • Finding bugs in binaries
• Recover every function prototype using Ghidra
• Build an angr program state with symbolic arguments taken from the recovered prototype
• Run each analysis job in parallel
22. • With less code to analyze we can introduce more heavy-weight analysis
• Tracking memory constraints imposed by all instructions
• Memory regions tainted by user supplied arguments
• Mapping memory loading actions to values in memory.
• Every step through a program
• Store any new constraints to user input
• Does user input influence a system() call or corrupt the program counter?
• Does user input taint a stack or heap variable?
28. FUNCTION SIMILARITY
• BinDiff and Diaphora are the standard for binary diffing.
• They help us find what code was actually patched when a CVE and a patch are published.
• They use a set of heuristics to build a signature for every function in a binary:
• Basic block count
• Basic block edges
• Function references
29. • Both of these tools are tied to IDA
• The workflow is built around one-off comparisons
30. CLUSTERING
• Helps us understand how similar two things are
• Extract features from each thing
• For dots on a grid the features can be:
• X location
• Y location
31. K-MEANS CLUSTERING
Extract features
Pick two random points as cluster centers
Categorize each point to one of those random points
• Use Euclidean or cosine distance to find which is closest
Pick a new cluster center by averaging each category by feature and using the closest point.
Recategorize all the points into categories.
• Rinse and repeat until the points don’t move!
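The loop above is what scikit-learn’s `KMeans` implements (it uses the per-feature mean directly as the new center). A minimal sketch on dots on a grid, assuming scikit-learn is available:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of dots on a grid (features: x location, y location).
points = np.array([[1, 1], [1, 2], [2, 1],     # group A
                   [8, 8], [8, 9], [9, 8]])    # group B

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)           # each point's category, e.g. [1 1 1 0 0 0]
print(km.cluster_centers_)  # the averaged center of each group
```

With well-separated groups the iteration converges in a step or two; the hard part, as the next slide notes, is knowing how many clusters to ask for.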
32. CLUSTERING – WHY THIS WORKS
• Features don’t have to be numbers…
• They can be the existence (0 or 1) of:
• String references
• Data references
• Function arguments
• Basic block count
• All of these features can be extracted from reverse engineering tools like…
• Ghidra, Radare2, or Binary Ninja
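Turning a function’s references into a fixed-length 0/1 vector might look like the sketch below; the vocabulary and reference names are made up for illustration, not taken from the tool:

```python
# Vocabulary of every string/data/call reference seen across the binary
# (hypothetical entries).
VOCAB = ["strcpy", "system", "nvram_get", "Login failed", "index.html"]

def to_feature_vector(func_refs, vocab=VOCAB):
    """Existence feature: 1 if the function references the item, else 0."""
    return [1 if ref in func_refs else 0 for ref in vocab]

web_handler = {"system", "index.html", "nvram_get"}
print(to_feature_vector(web_handler))  # [0, 1, 1, 0, 1]
```

Every function in the binary then maps to a vector of the same length, which is exactly what k-means needs.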
33. IT ONLY WORKS IF YOU GUESS THE RIGHT NUMBER OF CLUSTERS
34. SUPERVISED CLUSTERING
• Supervised machine learning (of any kind) uses KNOWN values to cluster data
• We also know how many clusters there should be
• Our functions inside our binaries could be supervised if every function was known to be vulnerable or benign
• Embedded systems programming gives us no assurances.
35. SEMI-SUPERVISED CLUSTERING
• Semi-supervised clustering uses SOME KNOWN values to cluster data
• If we use public CVE information to find which functions in a binary are KNOWN vulnerable, we can guess that really similar functions might also be vulnerable.
• We can set our cluster count to the number of known vulnerable functions in a binary
36. • Finding features in binaries to cluster
• Wrote a Ghidra headless plugin to dump all function information
• Data/String/Call references are changed to binary (0/1): it exists or it doesn’t
• All numbers are normalized
• Being at offset 0x80000000 shouldn’t matter more than having 2 function arguments.
• Throw away useless information
• A chi-squared test is used to see how much a feature defines an item.
• If every function has the same calling convention, the chi-squared test will throw it away.
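With labels available (e.g. known vulnerable vs. benign), scikit-learn’s chi-squared scorer shows exactly this behavior: a feature that is identical for every function scores 0 and is dropped first. A sketch assuming scikit-learn, with an invented feature matrix:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Columns: [calls_system, calls_strcpy, same_calling_convention]
# The last column is identical for every function.
X = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 0, 1],
              [0, 0, 1]])
y = np.array([1, 1, 0, 0])  # 1 = known vulnerable, 0 = benign

selector = SelectKBest(chi2, k=2).fit(X, y)
print(selector.scores_)        # constant column scores exactly 0.0
print(selector.get_support())  # [ True  True False] -- constant dropped
```

Features that never vary carry no information about which cluster (or label) a function belongs to, so discarding them shrinks the vectors without losing signal.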
37. • Taking it further…
• Selecting a better number of clusters through cluster scoring
• The silhouette score ranks how similar the functions within each cluster are
• This separates functions into clusters of similar tasks
• String operation functions
• Destructors/Constructors
• File manipulation
• Web request handling
• etc..
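Picking the cluster count by silhouette score can be sketched as follows, assuming scikit-learn and using synthetic blobs as a stand-in for function feature vectors:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for function feature vectors: 3 well-separated groups.
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=0)

# Score each candidate cluster count; higher silhouette = tighter,
# better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # recovers the true number of groups
```

On real function features the peak is less sharp, but the same loop replaces guessing k with a measured choice.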
45. FIRMWARE SLAP
Extract firmware
Locate system root
Recover function prototypes from every binary
Build and run angr analysis jobs
Extract best function features using sklearn
Cluster functions according to best feature set
Export data to JSON and send into Elasticsearch
47. MITIGATIONS
• Use compile time protections
• Enable your operating system’s ASLR
• Buy a better router
48. • It’s time to bring more automation into checking our embedded systems
• Don’t blindly trust third-party embedded systems
• I’m giving you the tools to find the bugs yourself
49. RELEASING
• Firmware Slap – The tool behind the demos
• The Ghidra function dumping plugin
• The cleaned-up PoCs
• CVE-2019-13087 - CVE-2019-13092