Introduction
Splitting file data into small chunks is a common technique for scaling data processing when you run an analysis on a cluster. When you use Eoulsan in distributed mode,
Eoulsan automatically splits and merges common biological data (FASTQ and SAM files) using the Hadoop framework with low overhead. You can also use this
strategy to parallelize computation on a non-Hadoop cluster, provided that you manually declare in the workflow file where data must be split and merged.
Splitter module
This module allows you to split data into small chunks.
- Internal name: splitter
- Available: Both local and distributed mode
- Input port:
- input: data to split (format defined in the parameters)
- Output port:
- output: split data (format defined in the parameters)
- Mandatory parameter:
| Parameter | Type | Description |
|---|---|---|
| format | format | Name of the format of the data to split. See below for the list of formats that can be split |
- Optional parameters: Splitters can take optional parameters that set the splitting method, depending on the data format
- Configuration example:
<!-- Split reads step (1,000,000 max entries per file) -->
<step id="mysplitterstep" skip="false" discardoutput="false">
  <module>splitter</module>
  <parameters>
    <parameter>
      <name>format</name>
      <value>fastq</value>
    </parameter>
    <parameter>
      <name>max.entries</name>
      <value>1000000</value>
    </parameter>
  </parameters>
</step>
Merger module
This module allows you to merge small chunks of data into a single large file.
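A merger step might be declared in the workflow file along the following lines. This is only a sketch modeled on the splitter configuration example above, not a confirmed configuration: the internal module name merger, the step id mymergerstep, and the use of a format parameter are all assumptions that should be checked against your Eoulsan version.

```xml
<!-- Merge step (hypothetical: the module name and the "format" parameter are assumed) -->
<step id="mymergerstep" skip="false" discardoutput="false">
  <module>merger</module>
  <parameters>
    <parameter>
      <!-- Assumed to mirror the splitter's mandatory "format" parameter -->
      <name>format</name>
      <value>fastq</value>
    </parameter>
  </parameters>
</step>
```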
Technical merger module
This module merges all the data belonging to the same technical replicates. It uses the RepTechGroup column of the design to determine which data to merge.
Supported formats
fastq
- Format name: reads_fastq or fastq
- Description: FASTQ format
- Splitter optional parameters:
| Parameter | Type | Default value | Description |
|---|---|---|---|
| max.entries | integer | 1000000 | The maximal number of entries in splitter output files |
- Merger optional parameters: None
sam
- Format name: mapper_results_sam or sam
- Description: SAM format
- Splitter optional parameters:
| Parameter | Type | Default value | Description |
|---|---|---|---|
| max.entries | integer | 1000000 | The maximal number of entries in splitter output files |
| chromosomes | boolean | false | Split the original SAM file into files that each contain only entries mapped to the same chromosome. This option cannot be used with the max.line option |
- Merger optional parameters: None
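As an illustration of the chromosomes option, a splitter step for SAM data could be written as follows. This is a sketch that reuses the structure of the splitter configuration example; the step id mysamsplitterstep is a made-up name, and no entry-count limit is set alongside the chromosomes option.

```xml
<!-- Split SAM data by chromosome (the step id is a hypothetical name) -->
<step id="mysamsplitterstep" skip="false" discardoutput="false">
  <module>splitter</module>
  <parameters>
    <parameter>
      <name>format</name>
      <value>sam</value>
    </parameter>
    <parameter>
      <!-- One output file per chromosome instead of fixed-size chunks -->
      <name>chromosomes</name>
      <value>true</value>
    </parameter>
  </parameters>
</step>
```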
bam
- Format name: mapper_results_bam or bam
- Description: BAM format
- Splitter optional parameters:
| Parameter | Type | Default value | Description |
|---|---|---|---|
| max.entries | integer | 1000000 | The maximal number of entries in splitter output files |
| chromosomes | boolean | false | Split the original BAM file into files that each contain only entries mapped to the same chromosome. This option cannot be used with the max.line option |
- Merger optional parameters: None
expression
- Format name: expression_results_tsv or expression
- Description: Expression format
- Splitter optional parameters:
| Parameter | Type | Default value | Description |
|---|---|---|---|
| max.entries | integer | 10000 | The maximal number of entries in splitter output files |
- Merger optional parameters: None