Welcome to Ghidra's BSim (Behavioral Similarity) Database. This database technology is designed to allow reverse engineers to ingest metadata about previously analyzed binary executables into a central server or local database, which can then be queried while analyzing new, unknown executables to quickly discover previously seen functions and libraries.
The primary record ingested into the database describes a single function. The most novel aspects of the database are that:
The primary feature set used for indexing a function is extracted from a concise description of the data-flow of the function, not the explicit encoding of the machine instructions. The data-flow description is a graph-based (abstract syntax tree) representation, based on Ghidra's intermediate representation language, p-code, and is generated by the Ghidra decompiler. The resulting function descriptions are normalized to minimize the impact of variations due to:
Records are indexed using current Text Retrieval strategies, which allow "nearest neighbor" queries. The feature set of an unknown function being queried does not have to exactly match the features of a "hit" in the database, but only a configurable percentage of them. This supplies an additional level of tolerance of "functional difference" on top of the tolerance of "functionally equivalent" variations provided by the decompiler. In other words, there can be some amount of true change in the underlying source code, and the query may still be able to find a match.
Queries are quick: For a single function, results typically come back in a fraction of a second, even for a database containing millions of functions.
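The "nearest neighbor" idea described above can be illustrated with a toy sketch. This is not BSim's implementation: the feature values, the function names (`similarity`, `query`), the database contents, and the 0.7 threshold are all hypothetical, and real BSim features are derived from decompiler data-flow rather than hand-picked integers. The sketch only shows how a query can return a hit when some, but not all, of its features match a stored record.

```python
from math import sqrt


def similarity(a: frozenset, b: frozenset) -> float:
    """Cosine similarity between two unweighted feature sets (toy model)."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def query(db: dict, features: frozenset, threshold: float = 0.7) -> list:
    """Return (name, score) pairs scoring at least `threshold`, best first."""
    scored = [(name, similarity(features, feats)) for name, feats in db.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical database: function name -> set of hashed features.
KNOWN = {
    "memcpy_like": frozenset({0x11, 0x22, 0x33, 0x44}),
    "strlen_like": frozenset({0x55, 0x66, 0x77}),
}

# An unknown function sharing three of four features with "memcpy_like".
UNKNOWN = frozenset({0x11, 0x22, 0x33, 0x99})

RESULT = query(KNOWN, UNKNOWN)
print(RESULT)
```

Raising the threshold in this sketch makes matching stricter, while lowering it tolerates more real change between the queried function and the stored one, at the cost of more false hits.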
A BSim Database is built on top of one of three technologies: PostgreSQL, a local H2 database, or Elasticsearch. PostgreSQL is a robust, production-capable server that supports multiple simultaneous connections and is extremely fault-tolerant. Elasticsearch is a scalable search engine that allows a database to be distributed across an entire cluster of machines. The local H2 database support is provided for convenience and for use with small personal collections. For any of these options, this distribution includes specific reverse engineering extensions and clients that provide the following capabilities.
bsim command script
The PostgreSQL server software is currently supported only on the Linux and macOS platforms. Elasticsearch server software must be obtained separately. Small local file-based databases are supported on all platforms via an embedded H2 database engine. The BSim client software is supported on all platforms and can connect to servers running on a different platform.