To avoid duplicating genome, annotation, additional annotation, and genome index files, Eoulsan supports data repositories. They are especially useful for the genome indexes used in the mapping step, which take a long time to compute for large genomes. The genome index repository stores the result of an index computation so that later analyses using the same genome can reuse it.
These repositories are all configured in much the same way. Define the root path of each repository by setting the following global parameters (in the configuration file or in the globals section of the workflow file):
Parameter | Type | Description |
---|---|---|
main.genome.storage.path | string | Path to the genomes repository |
main.gff.storage.path | string | Path to the GFF annotations repository |
main.gtf.storage.path | string | Path to the GTF annotations repository |
main.additional.annotation.storage.path | string | Path to the additional annotations repository |
A repository path can be a URL (e.g. on a web server or an FTP server).
The following example shows the content of a genome repository. Symbolic links make it possible to define several aliases for the same genome.
    -rw-r--r-- 1 nobody nobody   4123941 2010-02-15 15:45 mouse-37.fasta.bz2
    lrwxrwxrwx 1 nobody nobody        16 2011-12-25 17:42 mouse -> mouse-37.fasta.bz2
    -rw-r--r-- 1 nobody nobody   4123327 2010-02-15 15:45 mouse-36.fasta.bz2
    lrwxrwxrwx 1 nobody nobody 513422555 2012-01-09 17:04 hg19.fasta.bz2
To access the repositories from a design file, use the dedicated protocols:
Repository type | Protocol | Protocol usage |
---|---|---|
genome | genome | genome://<genome name> (e.g. genome://mouse-37) |
GFF annotation | gff | gff://<annotation name> (e.g. gff://mouse-37) |
GTF annotation | gtf | gtf://<annotation name> (e.g. gtf://mouse-37) |
additional annotation | additionalannotation | additionalannotation://<annotation name> (e.g. additionalannotation://mouse-37) |
Do not include the file extension (e.g. .fasta, .gff) or any compression extension in genome and annotation URLs. Eoulsan automatically adds the file extension and checks whether a compressed file exists in the repository.
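The lookup described above can be pictured with a minimal Python sketch. It is not Eoulsan's implementation: the `resolve` function, the `EXTENSIONS` and `COMPRESSIONS` tables, and the order in which candidates are tried are all assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical tables: the extensions Eoulsan actually tries may differ.
EXTENSIONS = {"genome": [".fasta"], "gff": [".gff"], "gtf": [".gtf"]}
COMPRESSIONS = ["", ".gz", ".bz2"]


def resolve(url, roots):
    """Map a URL such as genome://mouse-37 to an existing repository file.

    `roots` maps a protocol name to the root directory of its repository.
    """
    protocol, sep, name = url.partition("://")
    if not sep or protocol not in roots or protocol not in EXTENSIONS:
        raise ValueError(f"unsupported repository URL: {url}")
    for ext in EXTENSIONS[protocol]:
        for comp in COMPRESSIONS:
            candidate = Path(roots[protocol]) / f"{name}{ext}{comp}"
            # exists() follows symbolic links, so aliases resolve as well
            if candidate.exists():
                return candidate
    raise FileNotFoundError(f"no file found for {url}")
```

With a repository containing `mouse-37.fasta.bz2`, calling `resolve("genome://mouse-37", {"genome": repository_dir})` would return the path of that compressed file, which is why the extensions are left out of the URL.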
Unlike the previous repositories, the genome index repository has no dedicated protocol. Its only user is the genome index creation step. When a genome index must be computed, this step checks whether an index has already been computed for this genome and mapper. If so, the previously computed index is used; if not, the index is computed and then stored for later use.
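The check-then-compute behaviour of this step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the `.idx` file name pattern, and the use of a SHA-256 checksum to identify the genome are hypothetical and do not describe how Eoulsan actually names or stores its indexes.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Checksum a file in 1 MiB chunks so large genomes are not loaded whole."""
    h = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def get_index(repo, genome_file, mapper, build):
    """Return the stored index for (genome, mapper), computing it if missing.

    `build` is a caller-supplied function that takes the genome path and
    returns the index content as bytes (hypothetical interface).
    """
    repo = Path(repo)
    # Hypothetical key: mapper name plus a checksum of the genome content
    target = repo / f"{mapper}-{file_digest(genome_file)}.idx"
    if target.exists():
        return target  # already computed in a previous analysis
    repo.mkdir(parents=True, exist_ok=True)
    target.write_bytes(build(genome_file))  # compute once, then store
    return target
```

Calling `get_index` twice with the same genome and mapper runs `build` only once; changing the mapper produces a separate index entry.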
To use the genome index repository, you only need to define the following global parameter (in the configuration file or in the globals section of the workflow file):
Parameter | Type | Description |
---|---|---|
main.genome.mapper.index.storage.path | string | Path to the genome indexes repository |
Note: The path to the genome index repository cannot be a URL. The path must be writable by the user so that Eoulsan can store genome indexes.
The genome description file contains basic information about the genome, such as the chromosome lengths. This file is needed to create valid SAM/BAM files and to use the genome index repository. It is created from the genome file at each new analysis. However, creating it takes a long time for large genomes (such as the mouse or human genome), particularly when the genome file is compressed. The genome description repository avoids parsing the genome sequence again once it has already been parsed in a previous analysis.
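To give an idea of why this parsing is costly, here is a minimal Python sketch that computes chromosome lengths from a FASTA file, reading every sequence line and decompressing on the fly. The function names are hypothetical, and the actual content and format of Eoulsan's genome description file are not shown here.

```python
import bz2
import gzip
from pathlib import Path


def open_text(path):
    """Open a plain, gzip-compressed or bzip2-compressed file as text."""
    path = Path(path)
    if path.suffix == ".bz2":
        return bz2.open(path, "rt")
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return path.open("rt")


def sequence_lengths(path):
    """Return {sequence name: length} for every record of a FASTA file."""
    lengths = {}
    current = None
    with open_text(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # The name is the first word of the header line
                fields = line[1:].split()
                current = fields[0] if fields else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths
```

Because the whole file must be read to obtain these lengths, storing the result once and reusing it saves a full pass over the genome at each analysis.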
To use the genome description repository, you only need to define the following global parameter (in the configuration file or in the globals section of the workflow file):
Parameter | Type | Description |
---|---|---|
main.genome.desc.storage.path | string | Path to the genome descriptions repository |
Note: The path to the genome description repository cannot be a URL. The path must be writable by the user so that Eoulsan can store genome descriptions.