BSim Database

Overview

Welcome to Ghidra's BSim (Behavioral Similarity) Database. This database technology lets reverse engineers ingest metadata about previously analyzed binary executables into a central server or local database. The database can then be queried while analyzing new, unknown executables to quickly identify previously seen functions and libraries.

The primary record ingested into the database describes a single function. The most novel aspects of the database are that:

  • Queries are tolerant of variations in the compilation of the function.
  • All records are indexed for quick queries, even for very large collections.

The primary feature set used for indexing a function is extracted from a concise description of the function's data-flow, not from the explicit encoding of its machine instructions. The data-flow description is a graph-based (abstract syntax tree) representation built on p-code, Ghidra's intermediate representation language, and is generated by the Ghidra decompiler. The resulting function descriptions are normalized to minimize the impact of variations due to:

  • Equivalent machine instructions
  • Storage location (registers, stack, memory)
  • Instruction order
  • Many forms of compiler transformation
  • Even some forms of deliberate obfuscation
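
As a rough illustration of why data-flow features tolerate differences in storage location and instruction order, the following toy sketch expands each assigned value into its expression tree and hashes the tree. It is not the decompiler's actual feature extraction, which this section does not specify. The tuple format, the `COMMUTATIVE` set, and the names `expr` and `features` are all hypothetical, and each destination is assumed to be assigned exactly once.

```python
import hashlib

# Assumption: operators whose operand order does not change the result.
COMMUTATIVE = {"add", "mul", "and", "or", "xor"}

def expr(defs, name):
    """Expand a storage location into the expression tree that computes it."""
    if name not in defs:
        return name  # function input or constant: a leaf of the tree
    op, left, right = defs[name]
    operands = [expr(defs, left), expr(defs, right)]
    if op in COMMUTATIVE:
        operands.sort(key=repr)  # canonical operand order
    return (op, operands[0], operands[1])

def features(instructions):
    """Hash the expression tree of every assigned value.

    instructions: list of (dest, op, src1, src2) tuples, each dest
    assigned exactly once. Register names vanish because they are
    replaced by the trees that define them.
    """
    defs = {dest: (op, a, b) for dest, op, a, b in instructions}
    feats = set()
    for dest in defs:
        tree = expr(defs, dest)
        feats.add(hashlib.sha1(repr(tree).encode()).hexdigest()[:8])
    return feats

# Two hypothetical compilations of (arg0 * 4) + (arg1 + 1):
# different registers, different instruction order, swapped operands.
variant_a = [("eax", "mul", "arg0", "4"),
             ("ebx", "add", "arg1", "1"),
             ("ecx", "add", "eax", "ebx")]
variant_b = [("r9", "add", "1", "arg1"),
             ("r8", "mul", "arg0", "4"),
             ("r10", "add", "r9", "r8")]
unrelated = [("eax", "sub", "arg0", "arg1")]
```

Under these assumptions, `features(variant_a)` and `features(variant_b)` are the same set, while `features(unrelated)` differs, because the features depend only on what is computed and not on where intermediate values are stored.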

Records are indexed using current text-retrieval strategies, which allow "nearest neighbor" queries. The feature set of an unknown function being queried does not have to exactly match the features of a "hit" in the database; only a configurable percentage of the features must match. This adds a level of tolerance for "functional difference" on top of the tolerance for "functionally equivalent" variations provided by the decompiler. In other words, the underlying source code can undergo some amount of true change, and the query may still find a match.
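
The idea of a thresholded nearest-neighbor query can be sketched as follows. This is a minimal illustration only: it scores feature sets with cosine similarity and scans every record linearly, whereas the database described here uses an index, and this section does not specify its scoring formula. The function names, the feature values, and the 0.7 threshold are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two feature multisets (feature -> count)."""
    dot = sum(count * b[feat] for feat, count in a.items() if feat in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def nearest(query, database, threshold=0.7):
    """Return (name, score) for records scoring at or above the threshold,
    best match first. Linear scan for clarity; a real index avoids this."""
    hits = []
    for name, feats in database.items():
        score = cosine_similarity(query, feats)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical records: each function is a multiset of feature hashes.
database = {
    "known_copy_routine": Counter({0xA1: 1, 0xB2: 1, 0xC3: 1, 0xD4: 1}),
    "known_length_routine": Counter({0xE5: 1, 0xF6: 1}),
}

# The query shares three of its four features with the first record.
query = Counter({0xA1: 1, 0xB2: 1, 0xC3: 1, 0x99: 1})
hits = nearest(query, database, threshold=0.7)
```

Here the query matches `known_copy_routine` with a score of 0.75 even though one feature differs, and raising the threshold above that score would drop the hit.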

Queries are quick: for a single function, results typically come back in a fraction of a second, even for a database containing millions of functions.

Overview of Tools

A BSim Database is built on top of one of three technologies: PostgreSQL, a local H2 database, or Elasticsearch. PostgreSQL is a robust, production-capable server that supports multiple simultaneous connections and is extremely fault tolerant. Elasticsearch is a scalable search engine that allows a database to be distributed across an entire cluster of machines. Local H2 database support is provided for convenience and for use with small personal collections. For any of these options, this distribution includes specific reverse engineering extensions and clients that provide the following capabilities:

  • Integration with a Ghidra Server or local project:
    • Executables can be ingested from a Ghidra repository, whether it belongs to a Ghidra Server or a local project.
    • Query results can refer to executables within a repository.
    • Easy command-line ingests using the bsim command script.
  • Client as a Ghidra Plug-in:
    • Ghidra includes a plug-in client that integrates a query dialog and results windows directly into the main code browser.
  • Query API:
    • Ghidra includes a Java API to the BSim server so that queries (and potentially ingest) can be incorporated into analyst scripts. The API marshals queries and results between an active Ghidra session and a BSim server.

Note

The PostgreSQL server software is currently supported only on the Linux and macOS platforms. Elasticsearch server software must be obtained separately. Small local file-based databases are supported on all platforms via an embedded H2 database engine. The BSim client software is supported on all platforms and can connect to servers running on a different platform.