Eoulsan – Workflow file

Last Published: 2022-12-13 14:46 | Version: 2.6.1

GenomiqueENS genomics core facility | École normale supérieure

The workflow file

The workflow file (usually named workflow.xml) is where all the steps to execute and their parameters are set. This file uses XML syntax and is divided into 4 sections:

The description section, which contains the name, the description and the author of the workflow file.
The constants section, which contains constants that can be used in parameter values.
The steps section, which contains the list of the steps to execute and their parameters. The parameters of the built-in steps are described in the Built-in steps section.
The global section, which contains global parameters (overriding configuration settings) that can be used in all the steps of the analysis.

In all parameter values you can use variables (e.g. ${variable}) that take their values from:

Built-in variables (${eoulsan.version}, ${eoulsan.build.number}, ${eoulsan.build.date}, ${design.file.path}, ${workflow.file.path}, ${output.path}, ${job.path}, ${job.id}, ${job.uuid} and ${available.processors})
Design header entries (e.g. ${design.header.Project}, ${design.header.GenomeFile}...)
Java properties (e.g. ${java.version})
System environment variables (e.g. ${PATH}, ${PWD})
User-defined constants
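
As a minimal sketch of variable substitution, the parameter below combines two built-in variables in a single value. The parameter name my.hypothetical.parameter and the file name are invented for illustration and do not belong to any built-in step:

    <parameter>
        <!-- Hypothetical parameter name, used only to show the variable syntax -->
        <name>my.hypothetical.parameter</name>
        <value>${output.path}/report-${job.id}.txt</value>
    </parameter>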

You can also insert the output of a shell command in parameter or attribute values by enclosing the command in backquotes ("`"):

	 <value>`cat /proc/cpuinfo | grep processor | wc -l`</value>
	 <value>`pwd`/tmp</value>
	 <value>`basedir ${user.home}`/tmp</value>
	 <step skip="false" discardoutput="true" requiredprocs="`nproc`"></step>

All the tags must be in lower case. The following source shows the structure of a typical workflow.xml file:

<analysis>
    <formatversion>1.0</formatversion>
    <name>my analysis</name>
    <description>Demo analysis</description>
    <author>Laurent Jourdren</author>

    <constants>
        <parameter>
	        <name>my.constant</name>
	        <value>myconstantvalue</value>
        </parameter>
    </constants>


    <steps>

        <!-- Filter reads -->
        <step id="filterreads" skip="false">
                <module>filterreads</module>
                <parameters>
                        <parameter>
                                <name>trim.length.threshold</name>
                                <value>11</value>
                        </parameter>
                        <parameter>
                                <name>quality.threshold</name>
                                <value>12</value>
                        </parameter>
                </parameters>
        </step>

        <!-- Map reads -->
        <step id="mapreads" skip="false">
                <module>mapreads</module>
                <parameters>
                        <parameter>
                                <name>mapper</name>
                                <value>bowtie</value>
                        </parameter>
                        <parameter>
                                <name>mapper.arguments</name>
                                <value>--best -k 2</value>
                        </parameter>
                </parameters>
        </step>

        <!-- SAM filter -->
        <step id="filtersam"  skip="false">
                <module>filtersam</module>
                <parameters>
                        <parameter>
                                <name>removeunmapped</name>
                                <value></value>
                        </parameter>
                        <parameter>
                                <name>removemultimatches</name>
                                <value></value>
                        </parameter>
                </parameters>
        </step>

        <!-- Expression -->
        <step id="expression" skip="false">
                <module>expression</module>
                <parameters>
                        <parameter>
                                <name>counter</name>
                                <value>htseq-count</value>
                        </parameter>
                        <parameter>
                                <name>genomictype</name>
                                <value>gene</value>
                        </parameter>
                        <parameter>
                                <name>attributeid</name>
                                <value>ID</value>
                        </parameter>
                        <parameter>
                                <name>stranded</name>
                                <value>no</value>
                        </parameter>
                        <parameter>
                                <name>overlapmode</name>
                                <value>union</value>
                        </parameter>
                        <parameter>
                                <name>removeambiguouscases</name>
                                <value>true</value>
                        </parameter>
                </parameters>
        </step>

        <!-- Normalization -->
        <step id="normalization" skip="false">
                <module>normalization</module>
                <parameters/>
        </step>

        <!-- Diffana -->
        <step id="diffana" skip="false">
                <module>diffana</module>

                <parameters>
                        <parameter>
                                <name>disp.est.method</name>
                                <value>pooled</value>
                        </parameter>
                        <parameter>
                                <name>disp.est.sharing.mode</name>
                                <value>maximum</value>
                        </parameter>
                        <parameter>
                                <name>disp.est.fit.type</name>
                                <value>local</value>
                        </parameter>
                </parameters>
        </step>


    </steps>

    <globals>
        <parameter>
	        <name>main.tmp.dir</name>
	        <value>/tmp</value>
        </parameter>
    </globals>

</analysis>

Description section

The first tags of the workflow file provide information about the file:

formatversion: The version of the format of this workflow file.
name: The name of this workflow file.
description: The description of this workflow file.
author: The author of this workflow file.

Constants section

The constants section defines additional variables that can be used in parameter values with the ${variable} syntax. Previously defined constants (and other variables) can be used in a new constant.

Note that the constants section is optional.

    <constants>
        <parameter>
	        <name>my.constant1</name>
	        <value>foo</value>
        </parameter>
         <parameter>
	        <name>my.constant2</name>
	        <value>${my.constant1}-bar</value>
        </parameter>
    </constants>
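
A constant defined this way can then be referenced from a step parameter. The following sketch assumes a constant named my.quality.threshold (a name chosen for illustration) and reuses it in the quality.threshold parameter of the filterreads step from the typical workflow above:

    <constants>
        <parameter>
            <!-- Illustrative constant name, not a built-in variable -->
            <name>my.quality.threshold</name>
            <value>12</value>
        </parameter>
    </constants>

    <steps>
        <step id="filterreads" skip="false">
            <module>filterreads</module>
            <parameters>
                <parameter>
                    <name>quality.threshold</name>
                    <value>${my.quality.threshold}</value>
                </parameter>
            </parameters>
        </step>
    </steps>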

Steps section

The steps section contains the list of all the steps to execute. Each step names the module to execute and can optionally define a version, inputs and parameters:

Tag	Type	Optional	Description
module	string	False	The name of the module to execute by the step
version	string	True	The version of the step to use
inputs	XML tags	True	Manually define the data sources to use by the step
parameters	XML tags	True	The parameters of the step

The step tag can have 6 optional attributes:

Attribute	Type	Default value	Description
id	string	The name of the module to execute	This value defines the identifier of the step. The id value must be unique in a workflow. The identifier is used to name the output files of the step.
discardoutput	string	no	When this attribute is set to success, the output files of the step are saved in the working directory instead of the output directory of the workflow and are removed at the end of the workflow if it succeeds. If you use asap instead of success, the output files of the step are removed as soon as all the steps that require them have completed.
skip	boolean	false	The `skip` attribute allows a step to be skipped when its value is set to true.
requiredprocs	integer	-1	The `requiredprocs` attribute sets the number of processors to be used by the step. By default one processor is used to process each task of a step (except for steps that handle their own parallelization in local mode, like the mapping step).
requiredmemory	integer	-1	The `requiredmemory` attribute sets the amount of memory, in megabytes, required by the step. This value is only used in clusterexec mode. If not set, Eoulsan requests from the cluster scheduler the same amount of memory as allocated to the Eoulsan JVM. Unit suffixes like MB, M, GB, G can be used for the required memory value (e.g. 8GB).
dataproduct	string	cross	The `dataproduct` attribute sets the method used to combine input data before executing a step. By default a `cross` product is used. If the input data must share the same name and be processed together, use the `match` method instead.

If not set by the user, the Eoulsan workflow engine uses as the data source for each input port the most recent previous step that generates data in the format requested by that port. If you do not want to use this default source, you can define it manually using input tags with their port, fromstep and fromport subtags.


    <steps>

        <!-- Filter reads -->
        <step id="myfilterreadstep" discardoutput="true"  skip="false" dataproduct="cross">
                <module>filterreads</module>
                <version>2.6.1</version>
                <parameters>
                        <parameter>
                                <name>trim.length.threshold</name>
                                <value>11</value>
                        </parameter>
                        <parameter>
                                <name>quality.threshold</name>
                                <value>12</value>
                        </parameter>
                </parameters>
        </step>

        <!-- Map reads -->
        <step id="mapping" skip="false" requiredprocs="4" requiredmemory="8GB" dataproduct="cross">
                <module>mapreads</module>
                <version>2.6.1</version>

                <inputs>
                        <input>
                                <port>reads</port>
                                <fromstep>myfilterreadstep</fromstep>
                                <fromport>output</fromport>
                        </input>
                </inputs>

                <parameters>
                        <parameter>
                                <name>mapper</name>
                                <value>soap</value>
                        </parameter>
                </parameters>
        </step>

        ...

    </steps>
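
The discardoutput attribute also accepts the asap value listed in the attributes table. As a sketch only, the step below assumes that the output of the filtersam module is an intermediate result that can be deleted once every step depending on it has completed; the id value myfiltersamstep is arbitrary, and the parameter is copied from the typical workflow shown earlier:

    <step id="myfiltersamstep" discardoutput="asap" skip="false">
            <module>filtersam</module>
            <parameters>
                    <parameter>
                            <name>removeunmapped</name>
                            <value></value>
                    </parameter>
            </parameters>
    </step>

With success instead of asap, the same files would be kept in the working directory until the whole workflow finishes successfully.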

Global parameter section

The global parameter section contains parameters that are shared by all the steps. The syntax of the global parameters is the same as in the steps.

    <globals>
        <parameter>
	        <name>main.tmp.dir</name>
	        <value>/home/jourdren/tmp</value>
        </parameter>
    </globals>

The global parameters override the values of the configuration file. For more information about the configuration file see the configuration file page.