Design file

The design file is the main element of the pipeline. It contains all the information and descriptions of the experiment. It is a simple tab-delimited plain text file inspired by the limma design file. The design file is usually named design.txt. Here is a sample design file:

SampleNumber	Name	Reads	Genome	Annotation	FastqFormat	Condition	RepTechGroup	Experiment	Reference
1	s1	s1.fq	mouse_build37.fasta	mouse_build37.gff	fastq-illumina	c1	repT1	Experiment1	true
2	s2	s2.fq	mouse_build37.fasta	mouse_build37.gff	fastq-illumina	c2	repT2	Experiment1	false
3	s3	s3.fq	mouse_build37.fasta	mouse_build37.gff	fastq-illumina	c1	repT3	Experiment2	false
4	s4	s4.fq	mouse_build37.fasta	mouse_build37.gff	fastq-illumina	c2	repT3	Experiment2	false

In a design file, three fields are mandatory:

  • SampleNumber: Numeric, must be unique and >0
  • Name: Name of the sample. Must be unique
  • Reads: Path to the reads data file. For backward compatibility with Eoulsan 1.0.x, this field can also be named FileName.
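
The rules above can be checked with a few lines of Python. This is only an illustrative sketch, not Eoulsan's own validation code: the read_design function and its error messages are hypothetical names introduced here, and the checks are limited to what is stated above (presence of the three fields, a unique SampleNumber greater than 0, a unique Name, and FileName accepted as an alias of Reads).

```python
import csv
import io

MANDATORY_FIELDS = ("SampleNumber", "Name", "Reads")


def read_design(text):
    """Parse tab-separated design text and check the mandatory fields.

    Hypothetical helper for illustration; returns one dict per sample.
    """
    samples = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row = dict(row)
        # Eoulsan 1.0.x designs may name the Reads field "FileName".
        if "Reads" not in row and "FileName" in row:
            row["Reads"] = row.pop("FileName")
        samples.append(row)

    fields = set(samples[0]) if samples else set()
    missing = [f for f in MANDATORY_FIELDS if f not in fields]
    if missing:
        raise ValueError("missing mandatory field(s): " + ", ".join(missing))

    seen_numbers, seen_names = set(), set()
    for sample in samples:
        number = int(sample["SampleNumber"])  # ValueError if not numeric
        if number <= 0:
            raise ValueError("SampleNumber must be > 0: %d" % number)
        if number in seen_numbers:
            raise ValueError("duplicate SampleNumber: %d" % number)
        if sample["Name"] in seen_names:
            raise ValueError("duplicate Name: " + sample["Name"])
        seen_numbers.add(number)
        seen_names.add(sample["Name"])
    return samples
```

Any additional column is kept as-is in the returned dictionaries, so optional fields remain available to later steps.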

Users can add any additional fields to this file. Some of the optional fields are currently used by Eoulsan:

  • Genome or GenomeFile: Path to genome file in fasta format
  • Annotation or GffFile: Path to annotation file of the genome in GFF3 format
  • GtfFile: Path to annotation file of the genome in GTF format
  • AdditionalAnnotation or AdditionalAnnotationFile: Path to additional annotations file of the genome in TSV format
  • Condition: Condition for the sample in the experiment
  • FastqFormat: Fastq format. Eoulsan currently supports 4 formats: fastq-sanger (default), fastq-solexa, fastq-illumina and fastq-illumina-1.5. See Cock et al. for more information about the various fastq formats.
  • RepTechGroup: Technical replicates group. Used to pool read counts between technical replicates in the differential analysis step.
  • Experiment: The name of the different experiments. Each experiment is analyzed separately. This field is also used to name plots and output files of differential analysis.
  • Reference: A boolean field. If this field is present, a kinetic differential analysis is performed. Samples with a true value are considered as the reference. More precisely, it is the condition of that sample that is used as the reference, and only one sample of the reference condition must be marked as true.
  • UUID: Universal Unique Identifier of the sample. This field is generated by the Eoulsan createdesign command. In obfuscated design files, this value does not change.
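
To illustrate why the FastqFormat field matters, the sketch below decodes a quality string for each of the four format names listed above. The ASCII offsets follow the conventions described by Cock et al. (33 for the Sanger variant, 64 for the Solexa and Illumina variants). QUALITY_OFFSETS and decode_quality are hypothetical names introduced for this example and are not part of Eoulsan.

```python
# ASCII offset of the quality characters for each FastqFormat value.
QUALITY_OFFSETS = {
    "fastq-sanger": 33,
    "fastq-solexa": 64,
    "fastq-illumina": 64,
    "fastq-illumina-1.5": 64,
}


def decode_quality(quality_string, fastq_format="fastq-sanger"):
    """Return the numeric quality scores encoded in a FASTQ quality line.

    For fastq-solexa the values are Solexa scores, not Phred scores.
    """
    offset = QUALITY_OFFSETS[fastq_format]
    return [ord(char) - offset for char in quality_string]
```

The same character therefore means different things in different formats: "I" is a score of 40 in fastq-sanger, while a score of 40 is written "h" in fastq-illumina.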

All paths in the design file can be URLs, and Eoulsan handles files compressed with gzip (.gz) or bzip2 (.bz2).
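
As a rough illustration of extension-based handling of compressed files, the sketch below opens a local reads file according to its suffix. It is an assumption about how such handling could look, not a description of Eoulsan's internals; open_reads is a hypothetical helper and URL access is deliberately left out.

```python
import bz2
import gzip


def open_reads(path):
    """Open a local reads file as text, decompressing .gz and .bz2 files."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "rt")
```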

Note: For some fields, such as genome or annotation, only one value is allowed for all the samples of the design file.

Paired-end support

The paired-end files of a sample are written between brackets and must be separated by a comma. The following design file shows an example with paired-end files:

SampleNumber	Name	Reads	Genome	Annotation	FastqFormat	Condition	ReplicateType	UUID
1	s1	[s1a.fq.bz2,s1b.fq.bz2]	mouse_build37.fasta	mouse_build37.gff	fastq-sanger	s1	B	705d190c-de47-4c4f-8ddf-881c9b89ca66
2	s2	[s2a.fq.bz2,s2b.fq.bz2]	mouse_build37.fasta	mouse_build37.gff	fastq-sanger	s2	B	54c8833f-77e1-4e4f-90c2-742e459df7a7
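
The bracket syntax can be read with a small helper such as the one below. It is a minimal sketch of the notation described above; parse_reads is a hypothetical name, and the function only splits the field without checking that the files exist.

```python
def parse_reads(field):
    """Return the list of read files described by a Reads field.

    "[a.fq,b.fq]" gives two paths (paired-end), "a.fq" gives one path.
    """
    field = field.strip()
    if field.startswith("[") and field.endswith("]"):
        return [part.strip() for part in field[1:-1].split(",") if part.strip()]
    return [field]
```

A single-end sample yields a list with one entry, so later code can treat both cases the same way.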

Repository for genomes and annotation

To avoid duplicating genome and annotation files (and to save disk space), Eoulsan can access central repositories dedicated to these types of data. To access these repositories, Eoulsan supports four protocols: genome, gff, gtf and additionalannotation. See the data repository section to learn how to define repositories. Here is an example of a design file using the genome and gff protocols:

SampleNumber	Name	Reads	Genome	Annotation	FastqFormat	Condition	RepTechGroup	Experiment
1	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT1	project1
12	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT1	project1
3	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT2	project1
54	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT2	project1
15	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT3	project1
5	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT4	project1
25	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT3	project1
13	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT3	project1
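
The sketch below shows one way to tell a repository reference such as genome://mouse_build37 apart from an ordinary path. It only splits the value; how Eoulsan resolves the name inside a repository is not covered here. REPOSITORY_PROTOCOLS and split_repository_reference are hypothetical names used for illustration.

```python
REPOSITORY_PROTOCOLS = {"genome", "gff", "gtf", "additionalannotation"}


def split_repository_reference(value):
    """Split "protocol://name" into (protocol, name).

    Values that do not use one of the four repository protocols are
    returned unchanged as (None, value).
    """
    protocol, separator, name = value.partition("://")
    if separator and protocol in REPOSITORY_PROTOCOLS:
        return protocol, name
    return None, value
```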

Technical replicates handling

To handle technical replicates, the RepTechGroup field has been added to the design file. Here is an example of a design file using the RepTechGroup field:

SampleNumber	Name	Reads	Genome	Annotation	FastqFormat	Condition	RepTechGroup	Experiment
1	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT1	project1
12	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT1	project1
3	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT2	project1
54	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT2	project1
15	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	na	project1
5	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	NA	project1
25	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT3	project1
13	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT3	project1

In the example above, there are three technical replicate groups. For all samples with the same RepTechGroup value, read counts are pooled. Samples with a na, NA, Na or nA value are not pooled.
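
The pooling rule can be sketched as follows. This is a simplified illustration, not the code of the differential analysis step: the function name, the gene names and the counts are all hypothetical, and pooling is assumed here to mean summing the per-gene read counts of the samples in a group.

```python
from collections import Counter


def pool_technical_replicates(samples):
    """Sum per-gene read counts of samples sharing a RepTechGroup value.

    samples is a list of (name, rep_tech_group, counts) tuples, where
    counts maps a gene identifier to a read count. Samples whose group is
    "na" (in any letter case) are kept on their own.
    """
    pooled = {}
    for name, group, counts in samples:
        if group.lower() == "na":
            key = ("sample", name)
        else:
            key = ("group", group)
        pooled.setdefault(key, Counter()).update(counts)
    return pooled


example = [
    ("s1", "repT1", {"geneA": 10, "geneB": 2}),
    ("s2", "repT1", {"geneA": 5}),
    ("s3", "NA", {"geneA": 7}),
    ("s4", "na", {"geneA": 1}),
]
pooled = pool_technical_replicates(example)
```

In this made-up example, s1 and s2 are merged into a single repT1 entry, while s3 and s4 stay separate.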

For experiments without technical replicates, all RepTechGroup fields must have a na value. Here is an example:

SampleNumber	Name	Reads	Genome	Annotation	FastqFormat	Condition	RepTechGroup	Experiment
1	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	na	project1
12	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	Na	project1
3	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	na	project1
54	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	NA	project1
15	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	na	project1
5	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	NA	project1
25	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	na	project1
13	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	NA	project1

Multiple experiments

One design file can describe several differential analysis experiments. This is done by using several different values in the Experiment field. Example:

SampleNumber	Name	Reads	Genome	Annotation	FastqFormat	Condition	RepTechGroup	Experiment
1	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT1	project1
12	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT1	project1
3	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT2	project1
54	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT2	project1
15	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c3	repT3	project2
5	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c3	repT3	project2
25	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c4	repT4	project2
13	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c4	repT4	project2
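
Since each experiment is analyzed separately, a first step is to split the samples by their Experiment value. The sketch below does this for rows already loaded as dictionaries; conditions_by_experiment is a hypothetical helper, and sample numbers are used as identifiers.

```python
def conditions_by_experiment(samples):
    """Group sample numbers by Experiment, then by Condition."""
    experiments = {}
    for sample in samples:
        conditions = experiments.setdefault(sample["Experiment"], {})
        conditions.setdefault(sample["Condition"], []).append(
            int(sample["SampleNumber"])
        )
    return experiments


rows = [
    {"SampleNumber": "1", "Condition": "c1", "Experiment": "project1"},
    {"SampleNumber": "12", "Condition": "c1", "Experiment": "project1"},
    {"SampleNumber": "3", "Condition": "c2", "Experiment": "project1"},
    {"SampleNumber": "15", "Condition": "c3", "Experiment": "project2"},
    {"SampleNumber": "25", "Condition": "c4", "Experiment": "project2"},
]
groups = conditions_by_experiment(rows)
```

Here groups has one entry per experiment (project1 and project2), each listing its own conditions and sample numbers.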

Kinetic differential analysis

Some RNA-seq experiments are kinetic, which means that in the differential analysis every condition has to be compared only with a reference condition. This is handled by the Reference field: one sample of the reference condition must have a true value in the Reference field. Example:

SampleNumber	Name	Reads	Genome	Annotation	FastqFormat	Condition	RepTechGroup	Experiment	Reference
1	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT1	project1	true
12	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c1	repT1	project1	false
3	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT2	project1	false
54	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c2	repT2	project1	false
15	s1	s1.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c3	repT3	project2	false
5	s2	s2.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c3	repT3	project2	false
25	s3	s3.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c4	repT4	project2	false
13	s4	s4.fq	genome://mouse_build37	gff://mouse_build37	fastq-illumina	c4	repT4	project2	false

As you can see, it is possible to describe kinetic and non-kinetic experiments in the same design file.
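
The effect of the Reference field can be sketched as below: for each experiment that has a sample marked true, every other condition is paired with the reference condition. This is an illustration under stated assumptions, not Eoulsan's implementation. kinetic_comparisons is a hypothetical name, and experiments without a reference are simply left out of the result because their handling is not described here.

```python
def kinetic_comparisons(samples):
    """Return {experiment: [(condition, reference_condition), ...]}.

    Only experiments with one sample marked "true" in the Reference field
    are included. More than one such sample in an experiment is an error.
    """
    references = {}
    conditions = {}
    for sample in samples:
        experiment = sample["Experiment"]
        condition = sample["Condition"]
        known = conditions.setdefault(experiment, [])
        if condition not in known:
            known.append(condition)
        if (sample.get("Reference") or "").strip().lower() == "true":
            if experiment in references:
                raise ValueError("several reference samples in " + experiment)
            references[experiment] = condition
    return {
        experiment: [(c, ref) for c in conditions[experiment] if c != ref]
        for experiment, ref in references.items()
    }


design_rows = [
    {"Condition": "c1", "Experiment": "project1", "Reference": "true"},
    {"Condition": "c1", "Experiment": "project1", "Reference": "false"},
    {"Condition": "c2", "Experiment": "project1", "Reference": "false"},
    {"Condition": "c3", "Experiment": "project2", "Reference": "false"},
    {"Condition": "c4", "Experiment": "project2", "Reference": "false"},
]
comparisons = kinetic_comparisons(design_rows)
```

With these rows, project1 is kinetic with c1 as its reference condition, and project2 does not appear in the result.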