Developing new Turbinia Tasks and Evidence types

It’s easy!

Creating new Tasks for Turbinia is fairly easy. If your Task is simple (for example, just executing an external command), it should only take a few lines of real code, a bit of boilerplate, and a few extra lines to connect things together.

Before you start

Task code

Task Setup

The Worker that runs the Task handles the following things before execution even reaches the run() method, where most of your code will go:

  • Running any pre- or post-processors that need to run to prepare the Evidence.
  • If the Evidence is file-based, the pre-processor will add the path to the processed evidence in evidence.local_path.
  • Setting up temporary directories (available as self.output_dir and self.tmp_dir).
  • Preparing a TurbiniaTaskResult object to save results into.
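The overall flow can be sketched as follows. This is an illustrative skeleton only: FakeEvidence, FakeResult, and SketchTask are stand-in names used so the example runs standalone; a real Task subclasses turbinia.workers.TurbiniaTask, and the Worker prepares the directories and result object for you.

```python
import os
import tempfile


class FakeEvidence:
  """Stand-in for an Evidence object after pre-processing."""

  def __init__(self, local_path):
    self.local_path = local_path  # Set by the pre-processor.
    self.source_path = None       # Set by Tasks that create new Evidence.


class FakeResult:
  """Stand-in for the result object the Worker prepares for the Task."""

  def __init__(self):
    self.log_lines = []
    self.closed = False
    self.status = None

  def log(self, message):
    self.log_lines.append(message)

  def close(self, task, success, status=None):
    self.closed = True
    self.success = success
    self.status = status


class SketchTask:
  """Skeleton showing the pieces the Worker sets up before run()."""

  def __init__(self, output_dir, tmp_dir):
    self.output_dir = output_dir  # Normally prepared by the Worker.
    self.tmp_dir = tmp_dir

  def run(self, evidence, result):
    # Build an output path under the Task output directory.
    base_name = os.path.basename(evidence.local_path)
    output_path = os.path.join(self.output_dir, '{0:s}.out'.format(base_name))
    result.log('Would process {0:s} into {1:s}'.format(
        evidence.local_path, output_path))
    # A real Task would do its processing here, then close the result.
    result.close(self, success=True, status='Processed 1 piece of Evidence')
    return result


out_dir = tempfile.mkdtemp()
tmp_dir = tempfile.mkdtemp()
task = SketchTask(out_dir, tmp_dir)
result = task.run(FakeEvidence('/tmp/disk.raw'), FakeResult())
```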

Task execution

To see a relatively simple example of the code required for a new Task, see this pull request, which simply executes the strings binary on disk-based Evidence types.

Here is the bulk of the actual Task code for the Ascii Strings Task:

    # Create the new Evidence object that will be generated by this Task.
    output_evidence = TextFile()
    # Create a path that we can write the new file to.
    base_name = os.path.basename(evidence.local_path)
    output_file_path = os.path.join(
        self.output_dir, '{0:s}.ascii'.format(base_name))
    # Add the output path to the evidence so we can automatically save it
    # later.
    output_evidence.source_path = output_file_path

    # Generate the command we want to run.
    cmd = 'strings -a -t d {0:s} > {1:s}'.format(
        evidence.local_path, output_file_path)
    # Add a log line to the result that will be returned.
    result.log('Running strings as [{0:s}]'.format(cmd))
    # Actually execute the binary
    self.execute(
        cmd, result, new_evidence=[output_evidence], close=True, shell=True)

This is mostly self-explanatory from the comments, but one line deserves a little more explanation:

    self.execute(
        cmd, result, new_evidence=[output_evidence], close=True, shell=True)

This will:

  • Run the command as specified
  • Set the output evidence to be saved
  • Save the stdout and stderr in the results object specified
  • Close the Result in preparation for Task completion

Task Finalization and Saving Results

Before a Task completes and returns, the Result object must be “closed”, which finalizes the results in preparation for returning them to the server. Closing a Result does a few things: it sets Task stats, saves all of the output files, and runs the post-processor to free up the Evidence (e.g. unmount disks). In the above example of self.execute(), close=True is set, which tells the method to handle closing the results. If you have other external commands to run whose output you also want to save, do not close the results until all of them are complete (i.e. do not set close=True in self.execute() in that case). If you are not calling the execute() method and letting it close the results that way, you’ll need to close them yourself, similar to this:

    result.close(self, success=True, status='My message about the Task status')

One important parameter not set in this example call to result.close() is save_files. It takes a list of file paths that you want to save (there is no need to add files you have already linked to the Evidence; those are saved automatically). This is used for non-Evidence files that you want to keep, such as log files.
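For example, a close() call that also saves a log file might look like this (a sketch; 'mytool.log' is a hypothetical log file written by the Task):

```python
# Hypothetical log file produced by the Task; save_files keeps
# non-Evidence output like this alongside the Task results.
log_path = os.path.join(self.output_dir, 'mytool.log')
result.close(
    self, success=True, status='Processed 2 files, 1 finding',
    save_files=[log_path])
```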

If your Task writes output files, write them relative to self.output_dir. Temporary files can be written to self.tmp_dir. Both directories are unique to the given Task execution.

The run() method should return the Result object, which will be serialized and returned to the server along with any Evidence that was created. New Evidence included in the results is checked by the Task Manager to see whether other Jobs/Tasks should be scheduled to process it.

Pre/Post-Processing

Each Task can declare the Evidence state it requires (e.g. mounted, attached, etc.) prior to execution by setting the state in Task.REQUIRED_STATES. Each Evidence object specifies which states it supports in its Evidence.POSSIBLE_STATES list attribute (e.g. see the GoogleCloudDisk possible states here). These states are set up by the pre-processors, and after the Task completes, the post-processor tears this state down (e.g. unmounts or detaches). For more details on the different states and how the pre/post-processors set them up at runtime, see the Evidence.preprocess() docstrings.
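As a sketch, a Task that needs a disk attached and mounted before run() executes might declare the following (MyDiskTask is a hypothetical name; check the Evidence module for the exact EvidenceState values your Evidence types support):

```python
from turbinia.evidence import EvidenceState
from turbinia.workers import TurbiniaTask


class MyDiskTask(TurbiniaTask):
  # Pre-processors will attach and mount the Evidence before run();
  # post-processors tear this state down after the Task completes.
  REQUIRED_STATES = [EvidenceState.ATTACHED, EvidenceState.MOUNTED]
```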

Evidence Paths

As mentioned above, the pre-processor that runs before the Task is executed will set the path evidence.local_path to point to the local Evidence. If the Task generates any new Evidence objects, you must set the .source_path attribute for that object before you add it to the results. The .source_path is the original path the Evidence is created with and the .local_path is the path to access the Evidence after any pre-processors are run (e.g. the path it was mounted on if it was mounted, etc). See the docstrings for these attributes in the Evidence object for more details on the differences, but in summary, Tasks should use .local_path to process the incoming Evidence and .source_path for newly created Evidence.

Recipe configuration

Tasks can also specify a set of variables that can be configured and set through recipes. This allows users to pre-define set configurations for the runtime behavior of Tasks along with which Jobs/Tasks should be run. Each Task has a TASK_CONFIG dictionary set at the object level that defines each of the variables that can be used along with the default values that the Task will use when the recipe does not specify that variable, or there is no recipe used. See the Plaso Task TASK_CONFIG as an example. Tasks can access these variables by referencing the dictionary at self.task_config.
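As a sketch, a TASK_CONFIG and its use at runtime might look like this (the variable names here are hypothetical, not the real Plaso ones; see the linked Plaso Task for a real example):

```python
class MyTask(TurbiniaTask):
  # Defaults used when no recipe is supplied, or the recipe does not
  # set these variables.
  TASK_CONFIG = {
      'max_file_size_mb': 100,
      'extra_args': '',
  }

  def run(self, evidence, result):
    # Recipe values (or the defaults above) are available here.
    max_size = self.task_config.get('max_file_size_mb')
    result.log('Using max file size: {0:d}MB'.format(max_size))
```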

Boilerplate and Glue

The only two interesting bits of the Job definition in turbinia/jobs/strings.py are this one, which sets the allowable input and output Evidence types for the Job (so the Task Manager knows which Tasks to schedule):

  evidence_input = [RawDisk, GoogleCloudDisk, GoogleCloudDiskRawEmbedded]
  evidence_output = [TextFile]

And this one, which just sets up the Tasks for both Task types (Ascii and Unicode):

    tasks = [StringsAsciiTask() for _ in evidence]
    tasks.extend([StringsUniTask() for _ in evidence])
    return tasks

In this case we have two separate Tasks that we are executing for the Job, but there could be more or fewer depending on how much you want to split the work up. Then you just need to add a reference to the new Job in turbinia/jobs/__init__.py.

Logging

Using result.log() is recommended for logging within a Task instead of the standard Python logger. result.log() logs to the standard logger, records the message in the Task result, and writes a Task-specific log file named worker.txt in the Task output directory, which makes Task-specific logs easier to find and debug. result.log() also has a level parameter that takes a log level (e.g. logging.INFO) to control severity.

Reporting

Tasks can return report data in Markdown format by adding it as a string to result.report_data. If there are high-priority findings, you can change result.report_priority. Priorities range from 0 to 100, with 0 being the highest priority. This affects the ordering in the report, and if the priority value is lower than the value set with --priority_filter (i.e. a higher priority), the full report data will be printed out when --full_report is specified.

Tasks can use the helper methods in turbinia.lib.text_formatter to format text (e.g. bold and code). Note that when setting headings in a Task report, do not use heading1 through heading3, as those are used in other sections of the report; heading4 and smaller are fine.
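A sketch of building report data as described above. The helpers in turbinia.lib.text_formatter can produce this kind of formatting; here the Markdown is built by hand so the example is self-contained, and the findings are made-up placeholders. In a real Task you would assign these values to result.report_data and result.report_priority.

```python
# Hypothetical findings produced by the Task's analysis.
findings = ['weak password found', 'suspicious cron entry']

report = []
# Use heading4 or smaller; heading1-heading3 are reserved for other
# sections of the overall report.
report.append('#### My Analysis Task findings')
for finding in findings:
  report.append('* **{0:s}**'.format(finding))
report_data = '\n'.join(report)

# Lower numbers are higher priority (0 is highest, 100 is lowest).
report_priority = 10 if findings else 80
```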

Tips

  • If possible, set a meaningful status message that summarizes the Task execution or output, either by setting the status parameter when calling result.close() or by explicitly setting the result.status attribute. This should be a single line; it is the output shown for each Task when running turbiniactl status. If a Task has a low report_priority, the full report data will not show up in turbiniactl status, so the status message may be the only place Task info surfaces in the output by default, making it important to set it to something useful.
  • If your Task executes an external command that can generate a log file, it’s helpful to pass the appropriate flags to generate it and then save it automatically by setting the log_files parameter when calling self.execute(). Additionally, if there are flags that control the verbosity of this log file, check the config.DEBUG_TASKS config parameter and log accordingly; that way all Tasks generate debug output when it is configured.
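Putting the two tips above together, an execute() call might look like this sketch (mytool and its flags are hypothetical; only the log_files parameter and config.DEBUG_TASKS come from the text above):

```python
# Hypothetical external tool with a log file and a verbosity flag.
log_path = os.path.join(self.output_dir, 'mytool.log')
verbose_flag = '--verbose' if config.DEBUG_TASKS else ''
cmd = 'mytool {0:s} --logfile {1:s} {2:s}'.format(
    verbose_flag, log_path, evidence.local_path)
# log_files tells execute() to save the tool's log with the results.
self.execute(
    cmd, result, log_files=[log_path], new_evidence=[output_evidence],
    close=True, shell=True)
```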

Testing

There is a TestTurbiniaTaskBase object that Task tests can subclass for relatively easy testing of the basic run() method. See the photorec test for a simple example. For a Task test with reporting output, see the sshd test as an example.

Notes

  • The reason the strings processing is split into two separate Tasks is so they can run in parallel and save wall time.
  • One caveat about Task development: it is possible to create a cycle in the Task Manager by generating Evidence types that your Task (or any of its parent Tasks) also listens for. Check out the Job and Evidence graph generator if you want to verify that there are no cycles in the graph.