# Developing new Turbinia Tasks and Evidence types

## It's easy!

Creating new Tasks for Turbinia is fairly easy, and if your Task is simple (like just executing an external command) it should only take a few lines of real code along with a bit of boilerplate code, and a few extra lines to connect things together.

## Before you start

* Check out the [How it Works](../user/how-it-works.md) page to see how the different components work within Turbinia.
* Make sure to follow the Turbinia [developer contribution guide](contributing.md).

## Task code

### Task Setup

The Worker which runs the Tasks handles the following things before you even get to the `run()` method where most of your code will go:

* Running any pre- or post-processors that need to run to prepare the Evidence.
* Updating the `evidence.local_path` to be the local path of the evidence on the worker machine the Task is currently running on.
* Setting up temporary directories (available as `self.output_dir` and `self.tmp_dir`).
* Preparing a TurbiniaResult object to save results into.

### Task execution

To see a relatively simple example of the code required for a new Task, see this [pull request](https://github.com/google/turbinia/pull/207). This simply executes the strings binary on Disk-based Evidence types. Here is the bulk of the actual Task code for the Ascii Strings Task:

```python
# Create the new Evidence object that will be generated by this Task.
output_evidence = TextFile()
# Create a path that we can write the new file to.
base_name = os.path.basename(evidence.local_path)
output_file_path = os.path.join(
    self.output_dir, '{0:s}.ascii'.format(base_name))
# Add the output path to the evidence so we can automatically save it
# later.
output_evidence.source_path = output_file_path

# Generate the command we want to run.
cmd = 'strings -a -t d {0:s} > {1:s}'.format(
    evidence.local_path, output_file_path)
# Add a log line to the result that will be returned.
result.log('Running strings as [{0:s}]'.format(cmd))
# Actually execute the binary
self.execute(
    cmd, result, new_evidence=[output_evidence], close=True, shell=True)
```

This is mostly self-explanatory from the comments, but the line that needs a little more explaining is this one:

```python
self.execute(
    cmd, result, new_evidence=[output_evidence], close=True, shell=True)
```

This will:

* Run the command as specified
* Set the output evidence to be saved
* Save the stdout and stderr in the results object specified
* Close the Result in preparation for Task completion

### Task Finalization and Saving Results

Before a Task completes and returns, the Result object must be "closed", which finalizes the results in preparation for them to be returned to the server. Closing a Result does a few things, such as setting Task stats, saving all of the output files, and running the post-processor to free up the Evidence (e.g. unmount disks, etc).

In the above example of `self.execute()`, `close=True` is set, which tells the method to handle closing the results. If you have other external commands that you want to run and save the output from, you should not close the results until after these are all complete (i.e. don't set `close=True` in `self.execute()` in this case). If you are not calling the `execute()` method and implicitly closing the results that way, you'll need to close them like this:

```python
result.close(self, success=True, status='My message about the Task status')
```

One important parameter that was not set in this example call of `result.close()` is `save_files`. It takes a list of file paths that you want to save (there is no need to add the files you linked to the Evidence earlier; those are saved automatically). This is used for non-Evidence files that you want to save (for example log files).

If you want to write files from your Task, you should do this relative to the `self.output_dir`.
If you have temporary files you want to write, you can write these to `self.tmp_dir`. These directories are unique for the given Task execution.

The `run()` method should return the `result` object which will be serialized and returned to the server along with the associated Evidence that may have been created. The new Evidence created and included in the results will be checked by the Task Manager to see if there are other Jobs/Tasks that should be scheduled to process it.

### Pre/Post-Processing

Each Task can set the Evidence state that is required (e.g. mounted, attached, etc) prior to execution by setting the state in `Task.REQUIRED_STATES`. Each Evidence object specifies which states it supports in the `Evidence.POSSIBLE_STATES` list attribute for that Task (e.g. see the [`GoogleCloudDisk` possible states here - Line 717](https://github.com/google/turbinia/blob/master/turbinia/evidence.py)). These states are set up by the pre-processors and then after the Task is completed, the post-processor will tear down this state (e.g. unmount or detach, etc). For more details on the different states and how the pre/post-processors will set these up at runtime, see the [`Evidence.preprocess()` docstrings - Line 405](https://github.com/google/turbinia/blob/master/turbinia/evidence.py).

### Evidence Paths

As mentioned above, the pre-processor that runs before the Task is executed will set the path `evidence.local_path` to point to the local Evidence. If the Task generates any new output Evidence objects, you must set the `.source_path` attribute for that object before you add it to the results. The `.source_path` is the original path the Evidence is created with and the `.local_path` is the path to access the Evidence after any pre-processors are run (e.g. the path it was mounted on if it was mounted, etc).
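
The difference between the two paths can be sketched with a small stand-alone example. This is only an illustration under stated assumptions: `SketchEvidence`, `sketch_preprocess()` and `build_output_evidence()` are hypothetical stand-ins written for this page, not Turbinia classes or functions, and the real pre-processors do considerably more work.

```python
import os
import tempfile


class SketchEvidence:
    """Hypothetical stand-in for an Evidence object; not Turbinia's class."""

    def __init__(self, source_path=None):
        # Path the Evidence was originally created with.
        self.source_path = source_path
        # Path to use after pre-processing; unset until a pre-processor runs.
        self.local_path = None


def sketch_preprocess(evidence):
    """Mimics a pre-processor for file-like Evidence that needs no mounting."""
    evidence.local_path = evidence.source_path
    return evidence


def build_output_evidence(input_evidence, output_dir):
    """Reads the input via local_path and registers output via source_path."""
    base_name = os.path.basename(input_evidence.local_path)
    output_evidence = SketchEvidence()
    output_evidence.source_path = os.path.join(output_dir, base_name + '.ascii')
    return output_evidence


output_dir = tempfile.mkdtemp()
disk = sketch_preprocess(SketchEvidence(source_path='/evidence/disk.raw'))
new_evidence = build_output_evidence(disk, output_dir)
```

In this sketch the new Evidence only has its `source_path` set. Its `local_path` stays empty until a later pre-processor prepares it for whichever Task consumes it next.
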
In most cases, Tasks should use `.local_path` when processing or operating on the input Evidence and `.source_path` for the output Evidence that is created by the Task (and will be processed by other Tasks). Since not all Tasks can process all types/states of Evidence (e.g. device files and mounted directories), they can also reference other more specific paths for the input evidence if needed (e.g. `device_path` or `mount_path`), but generally this should not be needed as long as you set the `TurbiniaTask.REQUIRED_STATES` for the Task to match your actual requirements since the `local_path` should always be created by the pre-processors. See the [docstrings for these attributes in the Evidence object - Line 203](https://github.com/google/turbinia/blob/master/turbinia/evidence.py) for more details.

### Recipe configuration

Tasks can also specify a set of variables that can be configured and set through [recipes](../user/recipes.md). This allows users to pre-define set configurations for the runtime behavior of Tasks along with which Jobs/Tasks should be run. Each Task has a `TASK_CONFIG` dictionary set at the object level that defines each of the variables that can be used along with the default values that the Task will use when the recipe does not specify that variable, or there is no recipe used. See the [Plaso Task - Line 29](https://github.com/google/turbinia/blob/master/turbinia/workers/plaso.py) `TASK_CONFIG` as an example. Tasks can access these variables by referencing the dictionary at `self.task_config`.

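
The relationship between `TASK_CONFIG` defaults and recipe values can be sketched as follows. This is a minimal illustration, not Turbinia's implementation: the variable names `min_length` and `extra_args` and the helper `resolve_task_config()` are hypothetical, and rejecting unknown variables is a design choice of this sketch rather than documented behavior.

```python
import copy

# Hypothetical defaults, shaped like a Task-level TASK_CONFIG dictionary.
# The variable names are illustrative only.
TASK_CONFIG = {
    'min_length': 4,
    'extra_args': None,
}


def resolve_task_config(defaults, recipe_variables=None):
    """Returns the defaults overridden by any recipe-supplied variables.

    Raises:
      ValueError: if the recipe sets a variable the Task does not define.
    """
    resolved = copy.deepcopy(defaults)
    for name, value in (recipe_variables or {}).items():
        if name not in defaults:
            raise ValueError('Unknown Task variable: {0:s}'.format(name))
        resolved[name] = value
    return resolved


# A recipe that overrides one variable, and the no-recipe case.
task_config = resolve_task_config(TASK_CONFIG, {'min_length': 8})
no_recipe_config = resolve_task_config(TASK_CONFIG)
```

Copying the defaults before applying overrides keeps the class-level dictionary unchanged, so one Task run cannot leak its recipe values into the next.
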
## Boilerplate and Glue

The only two interesting bits for the Job definition in `turbinia/jobs/strings.py` are this one that sets the allowable input and output Evidence types for the Task (so the Task Manager knows what kinds of Tasks to schedule):

```python
evidence_input = [RawDisk, GoogleCloudDisk, GoogleCloudDiskRawEmbedded]
evidence_output = [TextFile]
```

And this one, which just sets up the Tasks for both Task types (Ascii and Unicode):

```python
tasks = [StringsAsciiTask() for _ in evidence]
tasks.extend([StringsUniTask() for _ in evidence])
return tasks
```

In this case we have two separate Tasks that we are executing for the Job, but there could be more or fewer depending on how much you want to split it up.

Then you just need to add a reference to the new Job in `turbinia/jobs/__init__.py`.

## Logging

Using `result.log()` is recommended for logging within a Task instead of the normal Python logger. `result.log()` logs to the standard logger, records the data in the Task result, and creates a Task-specific log file named `worker.txt` in the Task output directory. This makes it easier to debug and find Task-specific logs. `result.log()` also has a `level` parameter that takes a log level (e.g. `logging.INFO`) for control.

## Reporting

Tasks can return report data in Markdown format by adding it as a string to `result.report_data`. If high priority findings are found, you can change `result.report_priority`. Priorities range from 0 to 100, with 0 being the highest priority. This will affect the ordering in the report, and if the priority is a value less than what is set with `--priority_filter` (i.e. a higher priority), then the full report data will be printed out when `--full_report` is specified. Tasks can use the helper methods in `turbinia.lib.text_formatter` to help format the text with formatting like bold and code.
Note that when setting headings in a Task report, do not use heading1 through heading3 because these are used in other sections of the report, but you can use heading4 and above.

## Tips

* If possible, set a meaningful `status` message that summarizes the Task execution or output. This can be done either by setting the `status` parameter when calling `result.close()`, or by explicitly setting the `result.status` attribute. This should be a single line and is the output that shows up for each Task when running `turbiniactl status`. If a Task has a low `report_priority`, then the full report data will not show up in the `turbiniactl status` output, so the status may be the only place that Task info will bubble up in the output by default. Setting it to something useful can therefore be important.
* If your Task executes an external command that can generate a log file, it's helpful to specify the appropriate flags to generate this and then automatically save it by setting the `log_files` parameter when calling `self.execute()`. Additionally, if there are flags that control the verbosity of this log file, it's helpful to check the `config.DEBUG_TASKS` config parameter and log accordingly. This way all Tasks can generate debug output when this is configured.

## Testing

There is a `TestTurbiniaTaskBase` object that Task tests can sub-class for relatively easy testing of the basic run method. See the [photorec test](https://github.com/google/turbinia/blob/master/turbinia/workers/photorec_test.py) for a simple example. For a Task test with reporting output see the [sshd test](https://github.com/google/turbinia/blob/master/turbinia/workers/analysis/sshd_test.py) as an example.

## Notes

* The reason we separate out the strings processing into two separate Tasks is so we can do them in parallel and save on wall-time.
* One caveat about Task development is that it is possible to create a cycle in the Task Manager by generating Evidence types that your Task (or any of its parent Tasks) also listens to. Check out the [Job and Evidence graph generator](https://github.com/google/turbinia/blob/master/tools/turbinia_job_graph.py) if you want to verify that there aren't any cycles in the graph.
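
The cycle caveat above can be illustrated with a small stand-alone check. This is a sketch under stated assumptions and is not how the graph generator tool works: the `JOBS` mapping, the `GrepJob` and `LoopJob` names, and the `FilteredTextFile` Evidence type are hypothetical, and Evidence types are represented as plain strings.

```python
# Hypothetical Job definitions: name -> (input Evidence types, output types).
JOBS = {
    'StringsJob': (['RawDisk'], ['TextFile']),
    'GrepJob': (['TextFile'], ['FilteredTextFile']),
}


def build_evidence_graph(jobs):
    """Maps each input Evidence type to the set of types produced from it."""
    graph = {}
    for inputs, outputs in jobs.values():
        for input_type in inputs:
            graph.setdefault(input_type, set()).update(outputs)
    return graph


def has_cycle(graph):
    """Depth-first search that reports whether any type can reach itself."""
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            # We came back to a type that is still on the current path.
            return True
        visiting.add(node)
        found = any(visit(child) for child in graph.get(node, ()))
        visiting.discard(node)
        done.add(node)
        return found

    return any(visit(node) for node in list(graph))


acyclic = has_cycle(build_evidence_graph(JOBS))

# A Job that turns FilteredTextFile back into TextFile closes a loop.
looping_jobs = dict(JOBS, LoopJob=(['FilteredTextFile'], ['TextFile']))
cyclic = has_cycle(build_evidence_graph(looping_jobs))
```

Here `acyclic` is `False` and `cyclic` is `True`, because the extra Job makes `TextFile` reachable from itself.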