Eoulsan – Splitter and Merger modules

Last Published: 2022-12-13 14:46 | Version: 2.6.1

GenomiqueENS genomics core facility | École normale supérieure

Introduction

Splitting data files into small chunks is a common way to scale data processing when you run an analysis on a cluster. In distributed mode, Eoulsan automatically splits and merges common biological data (FASTQ and SAM files) using the Hadoop framework, with low overhead. You can use the same strategy to parallelize computation on a non-Hadoop cluster, provided you declare manually in the workflow file where data must be split and merged.
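
As a minimal sketch of that manual declaration, the fragment below places a splitter step before the processing steps and a merger step after them. It only uses the splitter and merger modules and the format parameter documented on this page. The step ids are hypothetical, and the processing steps in between are left as a placeholder comment because they depend on your workflow.

<!-- Split the FASTQ reads into chunks (step id is illustrative) -->
<step id="splitreads" skip="false" discardoutput="false">
	<module>splitter</module>
	<parameters>
		<parameter>
			<name>format</name>
			<value>fastq</value>
		</parameter>
	</parameters>
</step>

<!-- Processing steps that consume the chunks and produce SAM files go here -->

<!-- Merge the SAM chunks produced by the processing steps (step id is illustrative) -->
<step id="mergesam" skip="false" discardoutput="false">
	<module>merger</module>
	<parameters>
		<parameter>
			<name>format</name>
			<value>sam</value>
		</parameter>
	</parameters>
</step>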

Splitter module

This module splits data into small chunks.

Internal name: splitter
Available: Both local and distributed mode

Input port:
- input: data to split (format defined in the parameters)

Output port:
- output: split data (format defined in the parameters)

Mandatory parameter:

Parameter	Type	Description
format	format	Name of the format of the data to split. See below for the list of formats that can be split

Optional parameters: Splitters can take optional arguments that set the splitting method according to the data format
Configuration example:

<!-- Split reads step (1,000,000 max entries per file) -->
<step id="mysplitterstep" skip="false" discardoutput="false">
	<module>splitter</module>
	<parameters>
		<parameter>
			<name>format</name>
			<value>fastq</value>
		</parameter>
		<parameter>
			<name>max.entries</name>
			<value>1000000</value>
		</parameter>
	</parameters>
</step>

Merger module

This module merges small chunks of data into one large file.

Internal name: merger
Available: Both local and distributed mode
Multithreaded in local mode: N/A
Input: Data in the format defined in the parameters
Output: Data merged in the same format as the input

Mandatory parameter:

Parameter	Type	Description
format	format	Name of the format of the data to merge. See below for the list of formats that can be merged

Optional parameters: Mergers can take optional arguments that set the merge method according to the data format
Configuration example:

<!-- Merge Sam files step -->
<step id="mymergerstep" skip="false" discardoutput="false">
	<module>merger</module>
	<parameters>
		<parameter>
			<name>format</name>
			<value>sam</value>
		</parameter>
	</parameters>
</step>

Technical replicate merger module

This module merges all the data belonging to the same technical replicates. It uses the RepTechGroup column of the design to determine which data to merge.

Internal name: technicalreplicatemerger
Available: Both local and distributed mode
Multithreaded in local mode: N/A
Input: Data in the format defined in the parameters
Output: Data merged in the same format as the input

Mandatory parameter:

Parameter	Type	Description
format	format	Name of the format of the data to merge. See below for the list of formats that can be merged

Optional parameters: Mergers can take optional arguments that set the merge method according to the data format
Configuration example:

<!-- Merge technical replicates step -->
<step id="mytechrepmerger" skip="false" discardoutput="false">
	<module>technicalreplicatemerger</module>
	<parameters>
		<parameter>
			<name>format</name>
			<value>fastq</value>
		</parameter>
	</parameters>
</step>

Supported formats

fastq

Format name: reads_fastq or fastq
Description: FASTQ format

Splitter optional parameters:

Parameter	Type	Default value	Description
max.entries	integer	1000000	The maximum number of entries in each splitter output file

Merger optional parameters: None

sam

Format name: mapper_results_sam or sam
Description: SAM format

Splitter optional parameters:

Parameter	Type	Default value	Description
max.entries	integer	1000000	The maximum number of entries in each splitter output file
chromosomes	boolean	false	Split the original SAM file into files that each contain only the entries mapped to the same chromosome. This option cannot be used with the max.entries option

Merger optional parameters: None
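
For instance, a splitter step that writes one SAM file per chromosome could be sketched as follows. The step id is hypothetical. No entry limit is set alongside the chromosomes parameter, since the table above notes that the two splitting methods cannot be combined.

<!-- Split SAM files by chromosome (step id is illustrative) -->
<step id="splitsambychr" skip="false" discardoutput="false">
	<module>splitter</module>
	<parameters>
		<parameter>
			<name>format</name>
			<value>sam</value>
		</parameter>
		<parameter>
			<name>chromosomes</name>
			<value>true</value>
		</parameter>
	</parameters>
</step>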

bam

Format name: mapper_results_bam or bam
Description: BAM format

Splitter optional parameters:

Parameter	Type	Default value	Description
max.entries	integer	1000000	The maximum number of entries in each splitter output file
chromosomes	boolean	false	Split the original BAM file into files that each contain only the entries mapped to the same chromosome. This option cannot be used with the max.entries option

Merger optional parameters: None
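
Two format names are listed above for BAM data. Assuming either one is accepted as the value of the format parameter, a merger step using the long name might look like the sketch below. The step id is hypothetical.

<!-- Merge BAM files step (step id is illustrative) -->
<step id="mergebam" skip="false" discardoutput="false">
	<module>merger</module>
	<parameters>
		<parameter>
			<name>format</name>
			<value>mapper_results_bam</value>
		</parameter>
	</parameters>
</step>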

expression

Format name: expression_results_tsv or expression
Description: Expression format

Splitter optional parameters:

Parameter	Type	Default value	Description
max.entries	integer	10000	The maximum number of entries in each splitter output file

Merger optional parameters: None
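
A sketch of a splitter step for expression files that overrides the default chunk size. The step id and the value 5000 are arbitrary illustrations, not recommended settings.

<!-- Split expression files step (5,000 max entries per file, step id is illustrative) -->
<step id="splitexpression" skip="false" discardoutput="false">
	<module>splitter</module>
	<parameters>
		<parameter>
			<name>format</name>
			<value>expression</value>
		</parameter>
		<parameter>
			<name>max.entries</name>
			<value>5000</value>
		</parameter>
	</parameters>
</step>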