As you’ve reverse engineered software, you’ve likely asked the following questions:
BSim is intended to help with these questions (and others) by providing a way to search collections of binaries for similar, but not necessarily identical, functions.
The idea behind BSim is to generate a feature vector for each function in a binary. The vectors are generated by Ghidra’s decompiler. Each feature represents a small piece of data flow and/or control flow of the associated function. The decompiler normalizes the feature vector representation so that different, but functionally equivalent, pieces of code often produce the same features. Certain attributes, such as values of constants, names of registers, and data types, are intentionally not incorporated into the features.
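To make the normalization idea concrete, here is a toy sketch in Python. It is not how Ghidra's decompiler builds features: the instruction tuples, `operand_kind`, and `toy_features` are hypothetical names invented for illustration, and real BSim features capture data flow and control flow rather than single instructions. The sketch only shows how discarding constant values and register names lets different but equivalent code produce the same features.

```python
from collections import Counter

# Hypothetical toy instruction form: (opcode, operand, operand, ...).
# Register names are strings; constants are ints.

def operand_kind(operand):
    """Reduce an operand to its kind, discarding its name or value."""
    return "const" if isinstance(operand, int) else "reg"

def toy_features(instructions):
    """Count normalized (opcode, operand kinds...) tuples for one function."""
    features = Counter()
    for opcode, *operands in instructions:
        features[(opcode, *map(operand_kind, operands))] += 1
    return features

# foo and bar differ only in register names and constant values.
foo = [("load", "r1", 8), ("add", "r1", "r2"), ("cmp", "r1", 0)]
bar = [("load", "r7", 64), ("add", "r7", "r3"), ("cmp", "r7", 255)]
# baz replaces the add with a mul, so its behavior differs.
baz = [("load", "r1", 8), ("mul", "r1", "r2"), ("cmp", "r1", 0)]

same = toy_features(foo) == toy_features(bar)       # True
different = toy_features(foo) != toy_features(baz)  # True
```

In this toy model, foo and bar map to identical feature counts even though no instruction matches byte for byte, while baz differs in exactly one feature.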
BSim vectors are compared using cosine similarity.
Differences in compilers, target architectures, and/or small changes to the source code typically produce vectors for two functions foo and bar that are close but not identical.
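As a rough illustration of the comparison, the following Python sketch computes cosine similarity between two sparse vectors. The feature ids and weights are made up for the example, and the dictionary representation is an assumption; BSim's own feature hashing and weighting are not shown here.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {feature_id: weight}."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical vectors: foo and bar share two features and differ in one.
foo_vec = {0x1A2B: 1.0, 0x3C4D: 2.0, 0x5E6F: 1.0}
bar_vec = {0x1A2B: 1.0, 0x3C4D: 2.0, 0x7A8B: 1.0}

similarity = cosine_similarity(foo_vec, bar_vec)  # 5/6, about 0.83
```

A value of 1.0 means the vectors point in the same direction, and values near 1.0 correspond to the "close but not identical" case described above.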
BSim vectors can be stored in a dedicated database. BSim databases intended to hold large [1] numbers of vectors maintain an index based on locality-sensitive hashing. The index drastically reduces the number of vector comparisons needed and allows for rapid retrieval of results.
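The text does not specify which locality-sensitive hashing scheme the BSim index uses, so the Python sketch below shows one generic technique for cosine similarity, random hyperplane hashing, purely as an illustration. All function names and the dense-list vector format are assumptions made for this example.

```python
import random

def make_hyperplanes(n_bits, dim, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0.0 else 0
        for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Group function names into buckets keyed by their LSH key."""
    index = {}
    for name, vec in vectors.items():
        index.setdefault(lsh_key(vec, hyperplanes), []).append(name)
    return index

def candidates(query, index, hyperplanes):
    """Return only the functions that share a bucket with the query."""
    return index.get(lsh_key(query, hyperplanes), [])

planes = make_hyperplanes(n_bits=8, dim=4)
stored = {"foo": [1.0, 2.0, 0.0, 1.0], "bar": [-3.0, 0.0, 1.0, -2.0]}
index = build_index(stored, planes)
hits = candidates([2.0, 4.0, 0.0, 2.0], index, planes)  # same direction as foo
```

Vectors separated by a small angle tend to fall on the same side of most hyperplanes and therefore tend to share a bucket, so a query only needs full comparisons against the vectors in its own bucket rather than against the whole database. Practical schemes usually use several such tables to reduce missed matches.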
Querying foo against a BSim database typically yields a number of potential matches. Each individual match for foo can be compared to foo in a side-by-side view, and certain information (such as function name) can be quickly copied from a match to foo.
We frequently call a function's BSim vector the BSim signature of the function, or just the signature when the context is clear.
We can think of each feature as representing a small piece of the behavior of a function, analogous to a snippet of source code. Functions whose BSim vectors are close typically have many features in common; that is, they have similar behavior. Hence the name “BSim”: Behavioral Similarity.
Using BSim involves the following components:
There are three supported database backends for BSim:
PostgreSQL
Elasticsearch: The BSimElasticPlugin extension contains an Elasticsearch plugin for BSim.
H2
[1] Creating a database requires a database template, which determines the specifics of the index. Currently, Ghidra provides a medium template, intended for databases holding up to 10 million unique vectors, and a large template, intended for databases holding up to 100 million unique vectors.