# Developing new Turbinia Tasks and Evidence types

## It's easy!

Creating new Tasks for Turbinia is fairly easy. If your Task is simple (like
just executing an external command), it should only take a few lines of real code
along with a bit of boilerplate code and a few extra lines to connect things
together.

## Before you start

*   Check out the [How it Works](../user/how-it-works.md) page to see how the
    different components work within Turbinia.
*   Make sure to follow the Turbinia
    [developer contribution guide](contributing.md).

## Task code
### Task Setup

The Worker that runs the Tasks handles the following things before you even get
to the `run()` method, where most of your code will go:

*   Running any pre- or post-processors that need to run to prepare the
    Evidence.
*   Updating the `evidence.local_path` to be the local path of the evidence on
    the worker machine the Task is currently running on.
*   Setting up temporary directories (available as `self.output_dir` and
    `self.tmp_dir`).
*   Preparing a TurbiniaResult object to save results into.


### Task execution

To see a relatively simple example of the code required for a new Task, see this
[pull request](https://github.com/google/turbinia/pull/207). This simply
executes the strings binary on Disk-based Evidence types.

Here is the bulk of the actual Task code for the Ascii Strings Task:

```python
    # Create the new Evidence object that will be generated by this Task.
    output_evidence = TextFile()
    # Create a path that we can write the new file to.
    base_name = os.path.basename(evidence.local_path)
    output_file_path = os.path.join(
        self.output_dir, '{0:s}.ascii'.format(base_name))
    # Add the output path to the evidence so we can automatically save it
    # later.
    output_evidence.source_path = output_file_path

    # Generate the command we want to run.
    cmd = 'strings -a -t d {0:s} > {1:s}'.format(
        evidence.local_path, output_file_path)
    # Add a log line to the result that will be returned.
    result.log('Running strings as [{0:s}]'.format(cmd))
    # Actually execute the binary
    self.execute(
        cmd, result, new_evidence=[output_evidence], close=True, shell=True)
```

This is mostly self-explanatory from the comments, but the line that needs a
little more explaining is this one:

```python
    self.execute(
        cmd, result, new_evidence=[output_evidence], close=True, shell=True)
```

This will:

*   Run the command as specified
*   Set the output evidence to be saved
*   Save the stdout and stderr in the results object specified
*   Close the Result in preparation for Task completion
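
Because the command above is run with `shell=True`, an Evidence path containing spaces or shell metacharacters would break it. A minimal sketch of building the same command with quoting, using only the standard library, might look like this. The helper name `build_strings_command` and the example paths are hypothetical and not part of Turbinia:

```python
import os
import shlex


def build_strings_command(local_path, output_dir):
  """Returns (cmd, output_file_path) for the ASCII strings run.

  Hypothetical helper, not part of Turbinia. It mirrors the example above
  but quotes both paths because the command is executed through a shell.
  """
  base_name = os.path.basename(local_path)
  output_file_path = os.path.join(output_dir, '{0:s}.ascii'.format(base_name))
  cmd = 'strings -a -t d {0:s} > {1:s}'.format(
      shlex.quote(local_path), shlex.quote(output_file_path))
  return cmd, output_file_path


# Example paths are made up for illustration.
cmd, output_file_path = build_strings_command(
    '/evidence/my disk.raw', '/tmp/task-output')
print(cmd)
```

The unquoted `output_file_path` is what you would assign to `output_evidence.source_path`; only the command string needs the quoted form.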


### Task Finalization and Saving Results

Before a Task completes and returns, the Result object must be "closed", which
finalizes the results in preparation for them to be returned to the server.
Closing a Result sets Task stats, saves all of the output files, and runs the
post-processor to free up the Evidence (e.g. unmount disks). In the above
example of `self.execute()`, `close=True` is set, which tells the method to
handle closing the results. If you have other external commands that you want
to run and save the output from, you should not close the results until after
these are all complete (i.e. don't set `close=True` in `self.execute()` in this
case). If you are not calling the `execute()` method and implicitly closing the
results that way, you'll need to close them similar to this:

```python
result.close(self, success=True, status='My message about the Task status')
```

One important parameter that was not set in this example call of
`result.close()` is `save_files`. It takes a list of file paths that you want
to save. There is no need to add the files you linked to the Evidence earlier,
since those are saved automatically. This is used for non-Evidence files that
you want to save (for example log files).

If you want to write files from your Task, you should do this relative to the
`self.output_dir`. If you have temporary files you want to write, you can write
these to `self.tmp_dir`. These directories are unique for the given Task
execution.
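
As a rough illustration of the directory and `save_files` conventions described above, the sketch below uses a stand-in task object. `FakeTask` and the file names are hypothetical; in a real Task the Worker sets `self.output_dir` and `self.tmp_dir` for you.

```python
import os
import tempfile


class FakeTask:
  """Stand-in for a Task; the real Worker creates these directories."""

  def __init__(self):
    self.output_dir = tempfile.mkdtemp(prefix='output-')
    self.tmp_dir = tempfile.mkdtemp(prefix='tmp-')


def write_task_files(task):
  """Writes one file to keep and one scratch file; returns paths to save."""
  # Files you want to keep go under output_dir.
  log_path = os.path.join(task.output_dir, 'tool.log')
  with open(log_path, 'w') as log_file:
    log_file.write('tool finished\n')

  # Scratch data goes under tmp_dir and is not listed for saving.
  scratch_path = os.path.join(task.tmp_dir, 'scratch.bin')
  with open(scratch_path, 'wb') as scratch_file:
    scratch_file.write(b'\x00' * 16)

  # Only non-Evidence files need to be listed explicitly.
  return [log_path]


task = FakeTask()
save_files = write_task_files(task)
print(save_files)
```

The returned list is the kind of value the `save_files` parameter described above expects.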

The `run()` method should return the `result` object which will be serialized
and returned to the server along with the associated Evidence that may have
been created. The new Evidence created and included in the results will be
checked by the Task Manager to see if there are other Jobs/Tasks that should be
scheduled to process it.

### Pre/Post-Processing

Each Task can set the Evidence state that is required (e.g. mounted, attached,
etc) prior to execution by setting the state in `Task.REQUIRED_STATES`.  Each
Evidence type specifies which states it supports in its
`Evidence.POSSIBLE_STATES` list attribute (e.g. see the
[`GoogleCloudDisk` possible states
here - Line 717](https://github.com/google/turbinia/blob/master/turbinia/evidence.py)).
These states are set up by the pre-processors and then after the Task is
completed, the post-processor will tear down this state (e.g. unmount or
detach, etc).  For more details on the different states and how the
pre/post-processors will set these up at runtime, see the
[`Evidence.preprocess()`
docstrings - Line 405](https://github.com/google/turbinia/blob/master/turbinia/evidence.py).
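
To make the relationship between the two attributes concrete, here is a minimal sketch using stand-in classes. The `EvidenceState` enum, the class bodies, and the `states_to_set_up` helper are illustrative assumptions, not copies of Turbinia's definitions, and the sketch assumes that only states that are both required by the Task and supported by the Evidence get set up. See `turbinia/evidence.py` for the real definitions.

```python
import enum


class EvidenceState(enum.Enum):
  """Stand-in for the Evidence states (illustrative only)."""
  MOUNTED = 1
  ATTACHED = 2


class FakeCloudDisk:
  """Stand-in Evidence type that can be attached and mounted."""
  POSSIBLE_STATES = [EvidenceState.ATTACHED, EvidenceState.MOUNTED]


class FakeTextFile:
  """Stand-in Evidence type that needs no state set up."""
  POSSIBLE_STATES = []


class FakeStringsTask:
  """Stand-in Task that needs a device attached, but not mounted."""
  REQUIRED_STATES = [EvidenceState.ATTACHED]


def states_to_set_up(task, evidence):
  """Returns the required states that this Evidence type supports."""
  return [
      state for state in task.REQUIRED_STATES
      if state in evidence.POSSIBLE_STATES]


print(states_to_set_up(FakeStringsTask, FakeCloudDisk))
print(states_to_set_up(FakeStringsTask, FakeTextFile))
```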

### Evidence Paths

As mentioned above, the pre-processor that runs before the Task is executed
will set the path `evidence.local_path` to point to the local Evidence. If the
Task generates any new output Evidence objects, you must set the `.source_path`
attribute for that object before you add it to the results.  The `.source_path`
is the original path the Evidence is created with and the `.local_path` is the
path to access the Evidence after any pre-processors are run (e.g. the path it
was mounted on if it was mounted, etc). In most cases, Tasks should use
`.local_path` when processing or operating on the input Evidence and
`.source_path` for the output Evidence that is created by the Task (and will
be processed by other Tasks).

Not all Tasks can process all types/states of Evidence (e.g. device files
and mounted directories), so Tasks can also reference other, more specific
paths for the input Evidence if needed (e.g. `device_path` or `mount_path`).
Generally this should not be needed as long as you set
`TurbiniaTask.REQUIRED_STATES` for the Task to match your actual requirements,
since the `local_path` should always be set by the pre-processors.
See the [docstrings for these attributes in the Evidence
object - Line 203](https://github.com/google/turbinia/blob/master/turbinia/evidence.py)
for more details.

### Recipe configuration
Tasks can also specify a set of variables that can be configured and set
through [recipes](../user/recipes.md).  This allows users to pre-define set
configurations for the runtime behavior of Tasks along with which Jobs/Tasks
should be run.  Each Task has a `TASK_CONFIG` dictionary set at the object
level that defines each of the variables that can be used along with the
default values that the Task will use when the recipe does not specify that
variable, or there is no recipe used.  See the [Plaso
Task - Line 29](https://github.com/google/turbinia/blob/master/turbinia/workers/plaso.py)
`TASK_CONFIG` as an example. Tasks can access these variables by referencing
the dictionary at `self.task_config`.
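
The sketch below illustrates the idea of defaults in `TASK_CONFIG` being overridden by recipe values. `FakeToolTask`, the variable names, and the merge in `__init__` are assumptions made for illustration; in Turbinia the framework populates `self.task_config` for you.

```python
class FakeToolTask:
  """Stand-in Task; the variable names below are made up for illustration."""

  # Defaults used when no recipe sets the variable.
  TASK_CONFIG = {
      'verbosity': 'normal',
      'max_depth': 3,
  }

  def __init__(self, recipe_values=None):
    # Assumption: recipe values override the defaults key by key.
    self.task_config = dict(self.TASK_CONFIG)
    self.task_config.update(recipe_values or {})

  def build_args(self):
    """Builds hypothetical command line arguments from the config."""
    return [
        '--verbosity', self.task_config['verbosity'],
        '--max_depth', str(self.task_config['max_depth']),
    ]


print(FakeToolTask().build_args())
print(FakeToolTask({'max_depth': 1}).build_args())
```

Reading every value through `self.task_config` rather than hard-coding it keeps the Task's behavior configurable without code changes.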

## Boilerplate and Glue

The only two interesting bits for the Job definition in
`turbinia/jobs/strings.py` are this one that sets the allowable input and
output Evidence types for the Task (so the Task Manager knows what kinds of
Tasks to schedule):

```python
  evidence_input = [RawDisk, GoogleCloudDisk, GoogleCloudDiskRawEmbedded]
  evidence_output = [TextFile]
```

And this one, which just sets up the Tasks for both Task types (Ascii and
Unicode):

```python
    tasks = [StringsAsciiTask() for _ in evidence]
    tasks.extend([StringsUniTask() for _ in evidence])
    return tasks
```

In this case we have two separate Tasks that we are executing for the Job, but
there could be more or fewer depending on how much you want to split the work
up. Then you just need to add a reference to the new Job in
`turbinia/jobs/__init__.py`.
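
To show how the two pieces above fit together, here is a self-contained sketch with stand-in classes. None of these classes are Turbinia's own, and the `create_tasks` method name and the `jobs_for` matching helper are assumptions used to illustrate how input Evidence types can drive scheduling:

```python
class RawDisk:
  """Stand-in Evidence type."""


class TextFile:
  """Stand-in Evidence type."""


class FakeAsciiTask:
  """Stand-in Task."""


class FakeUniTask:
  """Stand-in Task."""


class FakeStringsJob:
  """Stand-in Job with the two pieces discussed above."""
  evidence_input = [RawDisk]
  evidence_output = [TextFile]

  def create_tasks(self, evidence):
    """Creates one Task of each type per Evidence object."""
    tasks = [FakeAsciiTask() for _ in evidence]
    tasks.extend([FakeUniTask() for _ in evidence])
    return tasks


def jobs_for(evidence, jobs):
  """Returns the Jobs whose evidence_input includes this Evidence's type."""
  return [job for job in jobs if type(evidence) in job.evidence_input]


jobs = [FakeStringsJob()]
disk_jobs = jobs_for(RawDisk(), jobs)
text_jobs = jobs_for(TextFile(), jobs)
tasks = disk_jobs[0].create_tasks([RawDisk()])
print(len(disk_jobs), len(text_jobs), len(tasks))
```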

## Logging

Using `result.log()` is recommended for logging within a Task instead of the
normal Python logger. `result.log()` logs to the standard logger, records the
message in the Task result, and writes it to a Task-specific log file named
`worker.txt` in the Task output directory. This makes it easier to find and
debug Task-specific logs. `result.log()` also has a `level` parameter that
takes a log level (e.g. `logging.INFO`) for control.
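
The call shape can be sketched with a stand-in result object. `FakeResult` below is hypothetical and only mimics logging to the standard logger while keeping a copy of each message; it does not write `worker.txt`:

```python
import logging


class FakeResult:
  """Stand-in for the result object; only mimics the log() call shape."""

  def __init__(self):
    self.messages = []

  def log(self, message, level=logging.INFO):
    # Send the message to the standard logger and keep a copy on the result.
    logging.getLogger('example').log(level, message)
    self.messages.append((level, message))


result = FakeResult()
result.log('Starting analysis')
result.log('Parsed 3 config files', level=logging.DEBUG)
result.log('Could not read input file', level=logging.WARNING)
```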

## Reporting

Tasks can return report data in Markdown format by adding it as a string to
`result.report_data`.  If the Task produces high priority findings, you can
change `result.report_priority`.  Priorities range from 0 to 100, and the
highest priority is 0. The priority affects the ordering in the report. If the
priority is a value less than what is set with `--priority_filter` (i.e. a
higher priority), then the full report data will be printed out when
`--full_report` is specified.

Tasks can use the helper methods in `turbinia.lib.text_formatter` to format
the text with formatting like bold and code.  Note that when setting headings in
a Task report, do not use `heading1` through `heading3` because these are used
in other sections of the report, but you can use `heading4` and above.
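
A minimal sketch of assembling report data follows. The `heading4`, `bold`, and `code` functions here are local stand-ins written for this example rather than imports from `turbinia.lib.text_formatter`, and `FakeResult`, the priority values, and the finding text are hypothetical:

```python
def heading4(text):
  """Local stand-in for a level 4 Markdown heading helper."""
  return '#### {0:s}'.format(text)


def bold(text):
  """Local stand-in for a bold helper."""
  return '**{0:s}**'.format(text)


def code(text):
  """Local stand-in for an inline code helper."""
  return '`{0:s}`'.format(text)


class FakeResult:
  """Stand-in result with the two reporting attributes described above."""

  def __init__(self):
    self.report_data = None
    self.report_priority = 80  # Arbitrary low priority for illustration.


def add_report(result, findings):
  """Fills in report_data and raises the priority if there are findings."""
  lines = [heading4('Example config analysis')]
  if findings:
    lines.append(bold('{0:d} finding(s)'.format(len(findings))))
    lines.extend('* {0:s}'.format(code(finding)) for finding in findings)
    result.report_priority = 20  # Lower number means higher priority.
  else:
    lines.append('No findings')
  result.report_data = '\n'.join(lines)


result = FakeResult()
add_report(result, ['PermitRootLogin yes'])
print(result.report_data)
```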

## Tips
*   If possible, set a meaningful `status` message that summarizes the Task
    execution or output.  This can be done either by setting the `status`
    parameter when calling `result.close()`, or by explicitly setting the
    `result.status` attribute.  This should be a single line and is the output
    that shows up for each Task when running `turbiniactl status`.  If a Task
    has a low `report_priority`, then the full report data will not show up in
    the `turbiniactl status` output, so the status may be the only place that
    Task info will bubble up by default. Setting it to something useful can
    therefore be important.
*   If your Task executes an external command that can generate a log file,
    it's helpful to specify the appropriate flags to generate this and then
    automatically save it by setting the `log_files` parameter when calling
    `self.execute()`.  Additionally, if there are flags that control the
    verbosity of this log file, it's helpful to check the `config.DEBUG_TASKS`
    config parameter and log accordingly. This way all Tasks can generate
    debug output when this is configured.
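
The second tip can be sketched as follows. `FakeConfig` stands in for the Turbinia config module, and the tool name `exampletool` and its `--logfile` and `--debug` flags are invented for illustration:

```python
class FakeConfig:
  """Stand-in for the Turbinia config module."""
  DEBUG_TASKS = False


def build_tool_command(config, log_path):
  """Builds a command for a hypothetical tool with an optional debug flag."""
  cmd = ['exampletool', '--logfile', log_path]
  if config.DEBUG_TASKS:
    cmd.append('--debug')
  return cmd


log_path = '/tmp/task-output/exampletool.log'
normal_cmd = build_tool_command(FakeConfig, log_path)
FakeConfig.DEBUG_TASKS = True
debug_cmd = build_tool_command(FakeConfig, log_path)
print(normal_cmd)
print(debug_cmd)
```

In a real Task, `log_path` would then be listed in the `log_files` parameter of `self.execute()` so that the log is saved with the Task output.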

## Testing
There is a `TestTurbiniaTaskBase` class that Task tests can subclass for
relatively easy testing of the basic run method.  See the [photorec
test](https://github.com/google/turbinia/blob/master/turbinia/workers/photorec_test.py)
for a simple example. For a task test with reporting output see the [sshd
test](https://github.com/google/turbinia/blob/master/turbinia/workers/analysis/sshd_test.py) as an example.

## Notes

*   The reason we separate out the strings processing into two separate Tasks is
    so we can do them in parallel and save on wall-time.
*   One caveat about Task development is that it is possible to create a cycle
    in the Task Manager by generating Evidence types that your Task (or any of
    its parent Tasks) also listens to. Check out the
    [Job and Evidence graph generator](https://github.com/google/turbinia/blob/master/tools/turbinia_job_graph.py)
    if you want to verify that there aren't any cycles in the graph.
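
To illustrate what such a cycle looks like, here is a small, generic depth-first search over a Job and Evidence graph. This is not the graph generator tool linked above; the graph layout (Evidence types point at the Jobs that consume them, Jobs point at the Evidence types they output) and the `CarveJob` name are assumptions made for the example.

```python
def find_cycle(graph):
  """Returns a list of nodes forming a cycle, or None if there is none.

  Args:
    graph: dict mapping a node name to the list of nodes it points at.
  """
  unvisited, in_progress, done = 0, 1, 2
  state = {}
  stack = []

  def visit(node):
    state[node] = in_progress
    stack.append(node)
    for neighbour in graph.get(node, []):
      neighbour_state = state.get(neighbour, unvisited)
      if neighbour_state == in_progress:
        # We reached a node that is still on the current path: a cycle.
        return stack[stack.index(neighbour):] + [neighbour]
      if neighbour_state == unvisited:
        cycle = visit(neighbour)
        if cycle:
          return cycle
    stack.pop()
    state[node] = done
    return None

  for node in graph:
    if state.get(node, unvisited) == unvisited:
      cycle = visit(node)
      if cycle:
        return cycle
  return None


safe_graph = {
    'RawDisk': ['StringsJob'],
    'StringsJob': ['TextFile'],
    'TextFile': [],
}
# Hypothetical Job that outputs the same Evidence type it consumes.
cyclic_graph = {
    'RawDisk': ['CarveJob'],
    'CarveJob': ['RawDisk'],
}
print(find_cycle(safe_graph))
print(find_cycle(cyclic_graph))
```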