Filter reads module

This module filters input reads. For example, it can trim polyN read tails, remove reads that are too short and discard reads with a low mean base quality. Eoulsan provides a plugin system for read filters. To enable a filter, set at least one parameter for that filter. If the filter takes no option, add a parameter whose name is the name of the filter and whose value is an empty string.
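As a minimal sketch of the rule above, the fragment below enables one filter that takes no option (illuminaid, declared with an empty value) and one filter that takes a value (quality.threshold). The threshold of 30 is only an illustrative value, and the fragment is assumed to sit inside the <parameters> element of a filterreads step.

```xml
<!-- Filter without option: the name is the filter name, the value is empty -->
<parameter>
	<name>illuminaid</name>
	<value></value>
</parameter>
<!-- Filter with an option: the value (illustrative here) is read by the filter -->
<parameter>
	<name>quality.threshold</name>
	<value>30</value>
</parameter>
```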

Filters are applied to the reads in the order in which their parameters appear in the workflow file. Consequently, the number of reads removed by each filter, as reported in the log, depends on the order of the filter parameters in the workflow file.

When the parameter type is none, the filter does not read the value of the parameter, so the value can be left empty.

Warning: Some filters modify the reads they output (e.g. the trimpolynend filter removes the polyN tails of the reads). A filter such as the quality filter will therefore not produce the same output when it is declared before the trim filter as when it is declared after it.
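The effect of the declaration order can be sketched with two of the parameters documented on this page. In this fragment the polyN tails are trimmed first, so the quality filter receives the trimmed reads; swapping the two <parameter> elements would make the quality filter receive the untrimmed reads instead. The threshold value is illustrative, not a recommendation.

```xml
<!-- Declared first: trims the polyN tails and therefore modifies the reads -->
<parameter>
	<name>trimpolynend</name>
	<value></value>
</parameter>
<!-- Declared second: evaluates the reads as output by the trim filter -->
<parameter>
	<name>quality.threshold</name>
	<value>30</value>
</parameter>
```

With either order, the per-filter read counts in the log reflect the order used, so counts from runs with different orders are not directly comparable.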

  • Internal name: filterreads
  • Available: both local and distributed modes

  • Input port:
    • input: reads in FASTQ format (format: reads_fastq)

  • Output port:
    • output: reads in FASTQ format (format: reads_fastq)

  • Optional parameters:
    Parameter | Type | Description | Default value | Modify reads
    paircheck | none | Check that the two ends of a pair have the same identifier. | N/A | No
    pairedend.accept.paired.end | boolean | Remove all paired-end reads if false. | Not set | No
    pairedend.accept.single.end | boolean | Remove all single-end reads if false. | Not set | No
    illuminaid | none | Remove all reads that do not pass the Illumina filters. | N/A | No
    quality.threshold | float | Threshold for the mean base quality of a read, expressed as a decimal quality score. | Not set | No
    trimpolynend | none | Trim the polyN tails of the reads. | N/A | Yes
    length.minimal.length.threshold | integer | Minimal threshold for the read length, in bases. | Not set | No
    trim.length.threshold | integer | Threshold for the read length, in bases. This filter also trims the polyN tails of the reads. Deprecated: use trimpolynend and length.minimal.length.threshold instead. | Not set | Yes
    readnamestartwith.forbidden.prefixes | string | Remove all reads whose identifier starts with one of the given prefixes (comma-separated). | Not set | No
    readnamestartwith.allowed.prefixes | string | Keep only the reads whose identifier starts with one of the given prefixes (comma-separated). | Not set | No
    readnameregex.forbidden.regex | string | Remove all reads whose identifier matches the regular expression. | Not set | No
    readnameregex.allowed.regex | string | Keep only the reads whose identifier matches the regular expression. | Not set | No
    hadoop.reducer.task.count | integer | Number of Hadoop reducer tasks to use for this step. Only used in Hadoop mode. | Not set | N/A
    maxlength.maximum.length.threshold | integer | Maximum threshold for the read length, in bases. | Not set | No
    readsequenceregex.forbidden.regex | string | Remove all reads whose sequence matches the regular expression. | Not set | No
    readsequenceregex.allowed.regex | string | Keep only the reads whose sequence matches the regular expression. | Not set | No
    slidingwindow.arguments | string | Cut the read once the average quality within the window falls below a threshold. | Not set | No
    trailing.arguments | string | Remove low-quality bases from the end of the read. | Not set | No
    leading.arguments | string | Remove low-quality bases from the beginning of the read. | Not set | No
    headcrop.arguments | string | Remove the specified number of bases from the beginning of the read. | Not set | No
    crop.arguments | string | Remove bases from the end of the read, regardless of quality. | Not set | No
    nanoporesequencetype.keep | string | Keep only one type of Nanopore reads. Available values are: template, complement and consensus. For 1D sequencing, use the consensus value to keep all the reads. | consensus | No
    polyatail.minimal.length | integer | Minimal length of a polyA/polyT tail. This filter only adds a "tail_type" field to the read headers. | 10 | Header
    polyatail.maximal.error.rate | float | Maximal error rate allowed in polyA/polyT tails. | 0.1 | Header
    polyatail.minimal.length.for.error.rate.computation | integer | Minimal length of the tail sequence before the error rate is computed. | 5 | Header
    reversepolyt | none | Reverse complement the reads that have a "tail_type=polyT" field in their header. | N/A | No
    removeinvalidpolya.allowed.tail.type | string | Keep only the reads whose "tail_type" header field has one of the specified values; all other reads are discarded. | polyA,polyT | No
    ggghead | none | Search for a GGG head and a CCC tail and add additional fields to the read header. | N/A | Header
    requireggghead.allow.mismatch | boolean | Remove any sequence without a GGG head. This parameter allows one mismatch in the GGG sequence. | true | No

  • Configuration example:
    <!-- Filter reads step -->
    <step id="myfilterreadsstep" skip="false" discardoutput="true">
    	<module>filterreads</module>
    	<parameters>
    		<parameter>
    			<name>illuminaid</name>
    			<value></value>
    		</parameter>
    		<parameter>
    			<name>trimpolynend</name>
    			<value></value>
    		</parameter>
    		<parameter>
    			<name>length.minimal.length.threshold</name>
    			<value>40</value>
    		</parameter>
    		<parameter>
    			<name>quality.threshold</name>
    			<value>30</value>
    		</parameter>
    	</parameters>
    </step>
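
The header-modifying filters can be combined in the same way. The step below is only a sketch under stated assumptions: the step id is hypothetical, the values are the defaults listed in the parameter table, the tail types are assumed to be named polyA and polyT, and the order is inferred from the parameter descriptions (polyatail must come first because the two other filters rely on the "tail_type" field it adds).

```xml
<!-- Hypothetical polyA/polyT filtering step (id is illustrative) -->
<step id="mypolyafilterstep" skip="false" discardoutput="true">
	<module>filterreads</module>
	<parameters>
		<!-- 1. Add a "tail_type" field to the read headers -->
		<parameter>
			<name>polyatail.minimal.length</name>
			<value>10</value>
		</parameter>
		<!-- 2. Reverse complement the reads tagged tail_type=polyT -->
		<parameter>
			<name>reversepolyt</name>
			<value></value>
		</parameter>
		<!-- 3. Keep only reads with an allowed tail type (names assumed) -->
		<parameter>
			<name>removeinvalidpolya.allowed.tail.type</name>
			<value>polyA,polyT</value>
		</parameter>
	</parameters>
</step>
```

Because filters run in declaration order, moving the polyatail parameter after the other two would leave them without a "tail_type" field to inspect.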