This plugin trains models used to find function starts within a program. Essentially, the training set consists of addresses in a program where Ghidra's analysis was able to find functions. The models are then applied to the rest of the program. Models can also be applied to other programs.
In the motivating use case, you either don't know the toolchain which produced a program or do not have a large number of sample programs to train other types of models.
Note: in general, this plugin ensures that addresses used for training, testing, or searching for function starts are aligned relative to the processor's instruction alignment. Defined data within an executable block is an exception - all such bytes are added to the test set as examples of non-starts.
The script FindFunctionsRFExampleScript.java shows how to access the functionality of this plugin programmatically.
This table is the main interface for training and applying models.
The values in this panel control the number of models trained and the data used to train them. The first three fields: Number of Pre-Bytes (CSV), Number of Initial Bytes (CSV), and Start to Non-start Sampling Factors (CSV) accept CSVs of positive integers as input (a single integer with no comma is allowed). Models corresponding to all possible choices of the three values will be trained and evaluated. That is, if you enter two values for the Pre-bytes field, three values for the Initial Bytes field, and four values for the Sampling Factors field, a total of 2*3*4 = 24 models will be trained and evaluated.
Values in this list control how many bytes before an address are used to construct its feature vector.
Values in this list control how many bytes are used to construct the feature vector of an address, starting at the address.
Values in this list control how many non-starts (i.e., addresses in the interiors of functions) are added to the training set for each function start in the training set.
This field controls the maximum number of function starts that are added to the training set.
This field allows you to specify values of context registers. Addresses will only be added to the training/test sets if they agree with these values, and the disassembly action on the Potential Functions Table will apply the context register values first. This field accepts CSVs of the form "creg1=x,creg2=y,...". For example, to restrict Thumb mode in an ARM program, you would enter "TMode=1" in this field.
If this is selected, for every function entry in the training set, the code units immediately before it and after it are added to the training set as negative examples (and similarly for the test set).
If this is selected, a binary feature is added to the feature vector for each bit in the recorded bytes.
This value is the minimum size a function must be for its entry and interior to be included in the training and test sets.
This panel displays information about the functions in the program.
This field displays the number of functions meeting the size bound in the Minimum Function Size field. You can use this to ensure that the value in Maximum Number of Starts field doesn't cause all starts to be used for training (leaving none for testing).
If this is checked, only addresses which are zero modulo the value in the Alignment Modulus combo box are searched for function starts. This does not affect training or testing, but can be a useful optimization when applying models, for instance when the Function Alignment Table shows that all (known) functions in the program are aligned on 16-byte boundaries.
The value in this combo box determines the modulus used when computing the values in the Function Alignment Table.
The rows in this table display the number of (known) functions in the program whose address has the given remainder modulo the alignment modulus.
This panel displays the statistics about the trained models as rows in a table. Actions on these rows allow you to apply the models or see the test set failures.
This action will apply the model to the program used to train it. The addresses searched consist of all addresses which are loaded, initialized, marked as executable, and not already in a function body (this set can be modified by the user via the Restrict Search to Aligned Addresses and Minimum Length of Undefined Ranges to Search options). The results are displayed in a Function Start Table.
This action will open a dialog to select another program in the current project and then apply the model to it. Note that the only check that the model is compatible with the selected program is that any context registers specified when training must be present in the selected program.
This action will display a Debug Model Table, which shows all of the errors encountered when applying the model to its test set.
This table displays all addresses in the search set which the model thinks are function starts with probability at least .5. The table also shows the current "Interpretation" (e.g., undefined, instruction at start of basic block, etc) of the address along with the numbers of certain types of references to the address.
The following actions are defined on this table:
This action is enabled when at least one of the selected rows corresponds to an address with an interpretation of "Undefined". It begins disassembly at each "Undefined" address corresponding to a row in the selection.
This action is similar to the Disassemble Action, except it sets the context register values specified before training the model at the addresses and then disassembles.
This action is enabled whenever the selection contains at least one row whose corresponding address is the start of a basic block. This action creates functions at all such addresses.
This action is enabled when the selection contains exactly one row. It displays a table of the function starts in the training set which are most similar to the bytes at the address of the row.
This table displays the function starts in the training set which are most similar to a potential function start "from the model's point of view". Formally, similarity is measured using random forest proximity. Given a potential start p and a known start s, the similarity of p and s is the proportion of trees which end up in the same leaf node when processing p and s.
For convenience, the potential start is displayed in a table with a single row directly above the similar starts table.
This table has the same format as the Potential Functions Table but does not have the disassembly or function-creating actions (it does have the action to display similar function starts). It displays all addresses in the test set where the classifier made an error. Note that some in some cases, it might be the classifier which is correct and the original analysis which was wrong. A common example is a tail call which was optimized to a jump during compilation. If there is only one jump to this address, then analysis may (reasonably) think that the function is just part of the function containing the jump even though the classifier thinks the jump target is a function start.
This plugin has the following options. They can be set in the Tool Options menu.
This option controls the maximum size of the test sets (the test set of function starts and the test set of known non-starts which together form the model's "test set"). Each set that is larger than the maximum will be replaced with a random subset of the maximum size.
This option controls the minimum length a run of undefined bytes must be in order to be searched for function starts. This is an optimization which allows you to skip the (often quite numerous) small runs of undefined bytes between adjacent functions. Note that this option has no effect on model training or evaluation.
Provided By: RandomForestFunctionFinderPlugin