Multi star

Align fastq files using STAR.

Info

ID: multi_star
Namespace: mapping

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.2 -latest \
  -main-script target/nextflow/mapping/multi_star/main.nf \
  --help

Run command

Example of params.yaml
# Input/Output
input_id: # please fill in - example: ["mysample", "mysample"]
input_r1: # please fill in - example: ["mysample_S1_L001_R1_001.fastq.gz", "mysample_S1_L002_R1_001.fastq.gz"]
# input_r2: ["mysample_S1_L001_R2_001.fastq.gz", "mysample_S1_L002_R2_001.fastq.gz"]
reference_index: # please fill in - example: "/path/to/reference"
reference_gtf: # please fill in - example: "genes.gtf"
# output: "$id.$key.output.output"

# Processing arguments
run_htseq_count: true
run_multiqc: true
min_success_rate: 0.5

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.2 -latest \
  -profile docker \
  -main-script target/nextflow/mapping/multi_star/main.nf \
  -params-file params.yaml

Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument groups

Input/Output

Name Description Attributes
--input_id The ID of the sample being processed. This vector should have the same length as the --input_r1 argument. List of string, required, example: "mysample", "mysample", multiple_sep: ";"
--input_r1 Paths to the sequences to be mapped. If using Illumina paired-end reads, only the R1 files should be passed. List of file, required, example: "mysample_S1_L001_R1_001.fastq.gz", "mysample_S1_L002_R1_001.fastq.gz", multiple_sep: ";"
--input_r2 Paths to the sequences to be mapped. If using Illumina paired-end reads, only the R2 files should be passed. List of file, example: "mysample_S1_L001_R2_001.fastq.gz", "mysample_S1_L002_R2_001.fastq.gz", multiple_sep: ";"
--reference_index Path to the reference built by star_build_reference. Corresponds to the --genomeDir argument in the STAR command. file, required, example: "/path/to/reference"
--reference_gtf Path to the GTF reference file. file, required, example: "genes.gtf"
--output Path to output directory. Corresponds to the --outFileNamePrefix argument in the STAR command. file, required, example: "/path/to/foo"
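
The length rule for --input_id and --input_r1 can be checked before launching the pipeline. The following is a minimal sketch and not part of the pipeline: `build_params` is a hypothetical helper, the requirement that --input_r2 match --input_r1 in length is an assumption, and the output is JSON, which Nextflow's -params-file option accepts alongside YAML.

```python
import json


def build_params(input_id, input_r1, reference_index, reference_gtf,
                 publish_dir, input_r2=None):
    """Assemble a parameter dictionary, enforcing the documented length rule."""
    if len(input_id) != len(input_r1):
        raise ValueError("input_id and input_r1 must have the same length")
    # Assumption: when R2 files are given, there is one per R1 file.
    if input_r2 is not None and len(input_r2) != len(input_r1):
        raise ValueError("input_r2 must have the same length as input_r1")
    params = {
        "input_id": list(input_id),
        "input_r1": list(input_r1),
        "reference_index": reference_index,
        "reference_gtf": reference_gtf,
        "publish_dir": publish_dir,
    }
    if input_r2 is not None:
        params["input_r2"] = list(input_r2)
    return params


params = build_params(
    input_id=["mysample", "mysample"],
    input_r1=["mysample_S1_L001_R1_001.fastq.gz",
              "mysample_S1_L002_R1_001.fastq.gz"],
    reference_index="/path/to/reference",
    reference_gtf="genes.gtf",
    publish_dir="output/",
)
# Write this text to a file and pass it with -params-file.
print(json.dumps(params, indent=2))
```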

Processing arguments

Name Description Attributes
--run_htseq_count Whether or not to also run htseq-count after STAR. boolean, default: TRUE
--run_multiqc Whether or not to also run MultiQC at the end. boolean, default: TRUE
--min_success_rate Fail when the success rate is below this threshold. double, default: 0.5
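
The --min_success_rate threshold can be pictured with a short sketch. It assumes that the success rate is the fraction of samples for which the alignment step finished successfully; the page does not spell out the exact definition, and `passes_success_threshold` is a hypothetical name.

```python
def passes_success_threshold(n_succeeded, n_total, min_success_rate=0.5):
    """Return True when the observed success rate is not below the threshold."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # Assumption: success rate = succeeded samples / all samples.
    return n_succeeded / n_total >= min_success_rate


print(passes_success_threshold(1, 2))  # exactly at the threshold
print(passes_success_threshold(1, 3))  # below the threshold
```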

Run Parameters

Name Description Attributes
--runRNGseed random number generator seed. integer, example: 777

Genome Parameters

Name Description Attributes
--genomeFastaFiles path(s) to the fasta files with the genome sequences, separated by spaces. These files should be plain text FASTA files; they cannot be zipped. Required for the genome generation (--runMode genomeGenerate). Can also be used in the mapping (--runMode alignReads) to add extra (new) sequences to the genome (e.g. spike-ins). List of file, multiple_sep: ";"

Splice Junctions Database

Name Description Attributes
--sjdbFileChrStartEnd path to the files with genomic coordinates (chr start end strand) for the splice junction introns. Multiple files can be supplied and will be concatenated. List of string, multiple_sep: ";"
--sjdbGTFfile path to the GTF file with annotations file
--sjdbGTFchrPrefix prefix for chromosome names in a GTF file (e.g. ‘chr’ for using Ensembl annotations with UCSC genomes) string
--sjdbGTFfeatureExon feature type in GTF file to be used as exons for building transcripts string, example: "exon"
--sjdbGTFtagExonParentTranscript GTF attribute name for parent transcript ID (default “transcript_id” works for GTF files) string, example: "transcript_id"
--sjdbGTFtagExonParentGene GTF attribute name for parent gene ID (default “gene_id” works for GTF files) string, example: "gene_id"
--sjdbGTFtagExonParentGeneName GTF attribute name for parent gene name List of string, example: "gene_name", multiple_sep: ";"
--sjdbGTFtagExonParentGeneType GTF attribute name for parent gene type List of string, example: "gene_type", "gene_biotype", multiple_sep: ";"
--sjdbOverhang length of the donor/acceptor sequence on each side of the junctions, ideally = (mate_length - 1) integer, example: 100
--sjdbScore extra alignment score for alignments that cross database junctions integer, example: 2
--sjdbInsertSave which files to save when sjdb junctions are inserted on the fly at the mapping step - Basic … only small junction / transcript files - All … all files including big Genome, SA and SAindex - this will create a complete genome directory string, example: "Basic"
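
The recommendation for --sjdbOverhang (mate length minus one) can be computed as below. This is a minimal sketch: `ideal_sjdb_overhang` is a hypothetical helper, and using the longest mate when read lengths vary is an assumption.

```python
def ideal_sjdb_overhang(read_lengths):
    """Return (longest mate length - 1) for a collection of mate lengths."""
    lengths = list(read_lengths)
    if not lengths:
        raise ValueError("at least one read length is required")
    # Assumption: with mixed read lengths, the longest mate decides.
    return max(lengths) - 1


print(ideal_sjdb_overhang([101, 101]))  # 100
```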

Variation parameters

Name Description Attributes
--varVCFfile path to the VCF file that contains variation data. The 10th column should contain the genotype information, e.g. 0/1 string

Read Parameters

Name Description Attributes
--readFilesType format of input read files - Fastx … FASTA or FASTQ - SAM SE … SAM or BAM single-end reads; for BAM use --readFilesCommand samtools view - SAM PE … SAM or BAM paired-end reads; for BAM use --readFilesCommand samtools view string, example: "Fastx"
--readFilesSAMattrKeep for --readFilesType SAM SE/PE, which SAM tags to keep in the output BAM, e.g.: --readFilesSAMattrKeep RG PL - All … keep all tags - None … do not keep any tags List of string, example: "All", multiple_sep: ";"
--readFilesManifest path to the “manifest” file with the names of read files. The manifest file should contain 3 tab-separated columns: paired-end reads: read1_file_name (tab) read2_file_name (tab) read_group_line. single-end reads: read1_file_name (tab) - (tab) read_group_line. Spaces, but not tabs are allowed in file names. If read_group_line does not start with ID:, it can only contain one ID field, and ID: will be added to it. If read_group_line starts with ID:, it can contain several fields separated by (tab), and all fields will be copied verbatim into SAM @RG header line. file
--readFilesPrefix prefix for the read files names, i.e. it will be added in front of the strings in --readFilesIn string
--readFilesCommand command line to execute for each of the input files. This command should generate FASTA or FASTQ text and send it to stdout. For example: zcat - to uncompress .gz files, bzcat - to uncompress .bz2 files, etc. List of string, multiple_sep: ";"
--readMapNumber number of reads to map from the beginning of the file. -1: map all reads integer, example: -1
--readMatesLengthsIn Equal/NotEqual - lengths of names,sequences,qualities for both mates are the same / not the same. NotEqual is safe in all situations. string, example: "NotEqual"
--readNameSeparator character(s) separating the part of the read names that will be trimmed in output (read name after space is always trimmed) List of string, example: "/", multiple_sep: ";"
--readQualityScoreBase number to be subtracted from the ASCII code to get Phred quality score integer, example: 33
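
The role of --readQualityScoreBase can be shown with a small sketch that converts a FASTQ quality string into Phred scores by subtracting the base from each character's ASCII code. `phred_scores` is a hypothetical helper written for illustration only.

```python
def phred_scores(quality_string, base=33):
    """Subtract the quality score base from each ASCII code."""
    return [ord(char) - base for char in quality_string]


# 'I' has ASCII code 73, '5' has 53, '!' has 33.
print(phred_scores("II5!"))  # [40, 40, 20, 0]
```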

Read Clipping

Name Description Attributes
--clipAdapterType adapter clipping type - Hamming … adapter clipping based on Hamming distance, with the number of mismatches controlled by --clip5pAdapterMMp - CellRanger4 … 5p and 3p adapter clipping similar to CellRanger4. Utilizes Opal package by Martin Sosic: https://github.com/Martinsos/opal - None … no adapter clipping, all other clip* parameters are disregarded string, example: "Hamming"
--clip3pNbases number(s) of bases to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates. List of integer, example: 0, multiple_sep: ";"
--clip3pAdapterSeq adapter sequences to clip from 3p of each mate. If one value is given, it will be assumed the same for both mates. - polyA … polyA sequence with the length equal to read length List of string, multiple_sep: ";"
--clip3pAdapterMMp max proportion of mismatches for 3p adapter clipping for each mate. If one value is given, it will be assumed the same for both mates. List of double, example: 0.1, multiple_sep: ";"
--clip3pAfterAdapterNbases number of bases to clip from 3p of each mate after the adapter clipping. If one value is given, it will be assumed the same for both mates. List of integer, example: 0, multiple_sep: ";"
--clip5pNbases number(s) of bases to clip from 5p of each mate. If one value is given, it will be assumed the same for both mates. List of integer, example: 0, multiple_sep: ";"

Limits

Name Description Attributes
--limitGenomeGenerateRAM maximum available RAM (bytes) for genome generation long, example: NA
--limitIObufferSize max available buffers size (bytes) for input/output, per thread List of long, example: 30000000, 50000000, multiple_sep: ";"
--limitOutSAMoneReadBytes max size of the SAM record (bytes) for one read. Recommended value: >(2*(LengthMate1+LengthMate2+100)*outFilterMultimapNmax) long, example: 100000
--limitOutSJoneRead max number of junctions for one read (including all multi-mappers) integer, example: 1000
--limitOutSJcollapsed max number of collapsed junctions integer, example: 1000000
--limitBAMsortRAM maximum available RAM (bytes) for sorting BAM. If =0, it will be set to the genome index size. 0 value can only be used with --genomeLoad NoSharedMemory option. long, example: 0
--limitSjdbInsertNsj maximum number of junctions to be inserted to the genome on the fly at the mapping stage, including those from annotations and those detected in the 1st step of the 2-pass run integer, example: 1000000
--limitNreadsSoft soft limit on the number of reads integer, example: -1
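
The recommendation for --limitOutSAMoneReadBytes can be evaluated as follows. The sketch reads the recommended bound as the product of 2, (LengthMate1 + LengthMate2 + 100) and outFilterMultimapNmax; `min_limit_out_sam_one_read_bytes` is a hypothetical helper, and the chosen limit should be strictly greater than the value it returns.

```python
def min_limit_out_sam_one_read_bytes(len_mate1, len_mate2,
                                     out_filter_multimap_nmax):
    """Lower bound that --limitOutSAMoneReadBytes should exceed."""
    return 2 * (len_mate1 + len_mate2 + 100) * out_filter_multimap_nmax


# Two 100-base mates with outFilterMultimapNmax = 10.
bound = min_limit_out_sam_one_read_bytes(100, 100, 10)
print(bound)  # 6000
print(100000 > bound)  # the listed example value exceeds the bound
```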

Output: general

Name Description Attributes
--outTmpKeep whether to keep the temporary files after the STAR run is finished - None … remove all temporary files - All … keep all files string
--outStd which output will be directed to stdout (standard out) - Log … log messages - SAM … alignments in SAM format (which normally are output to Aligned.out.sam file), normal standard output will go into Log.std.out - BAM_Unsorted … alignments in BAM format, unsorted. Requires --outSAMtype BAM Unsorted - BAM_SortedByCoordinate … alignments in BAM format, sorted by coordinate. Requires --outSAMtype BAM SortedByCoordinate - BAM_Quant … alignments to transcriptome in BAM format, unsorted. Requires --quantMode TranscriptomeSAM string, example: "Log"
--outReadsUnmapped output of unmapped and partially mapped (i.e. mapped only one mate of a paired end read) reads in separate file(s). - None … no output - Fastx … output in separate fasta/fastq files, Unmapped.out.mate1/2 string
--outQSconversionAdd add this number to the quality score (e.g. to convert from Illumina to Sanger, use -31) integer, example: 0
--outMultimapperOrder order of multimapping alignments in the output files - Old_2.4 … quasi-random order used before 2.5.0 - Random … random order of alignments for each multi-mapper. Read mates (pairs) are always adjacent, all alignments for each read stay together. This option will become the default in future releases. string, example: "Old_2.4"

Output: SAM and BAM

Name Description Attributes
--outSAMmode mode of SAM output - None … no SAM output - Full … full SAM output - NoQS … full SAM but without quality scores string, example: "Full"
--outSAMstrandField Cufflinks-like strand field flag - None … not used - intronMotif … strand derived from the intron motif. This option changes the output alignments: reads with inconsistent and/or non-canonical introns are filtered out. string
--outSAMattributes a string of desired SAM attributes, in the order desired for the output SAM. Tags can be listed in any combination/order. Presets: - None … no attributes - Standard … NH HI AS nM - All … NH HI AS nM NM MD jM jI MC ch Alignment: - NH … number of loci the read maps to: =1 for unique mappers, >1 for multimappers. Standard SAM tag. - HI … multiple alignment index, starts with --outSAMattrIHstart (=1 by default). Standard SAM tag. - AS … local alignment score, +1/-1 for matches/mismatches, score* penalties for indels and gaps. For PE reads, total score for two mates. Standard SAM tag. - nM … number of mismatches. For PE reads, sum over two mates. - NM … edit distance to the reference (number of mismatched + inserted + deleted bases) for each mate. Standard SAM tag. - MD … string encoding mismatched and deleted reference bases (see standard SAM specifications). Standard SAM tag. - jM … intron motifs for all junctions (i.e. N in CIGAR): 0: non-canonical; 1: GT/AG, 2: CT/AC, 3: GC/AG, 4: CT/GC, 5: AT/AC, 6: GT/AT. If splice junctions database is used, and a junction is annotated, 20 is added to its motif value. - jI … start and end of introns for all junctions (1-based). - XS … alignment strand according to --outSAMstrandField. - MC … mate’s CIGAR string. Standard SAM tag. - ch … marks all segments of all chimeric alignments for --chimOutType WithinBAM output. - cN … number of bases clipped from the read ends: 5’ and 3’ Variation: - vA … variant allele - vG … genomic coordinate of the variant overlapped by the read. - vW … 1 - alignment passes WASP filtering; 2,3,4,5,6,7 - alignment does not pass WASP filtering. Requires --waspOutputMode SAMtag. STARsolo: - CR CY UR UY … sequences and quality scores of cell barcodes and UMIs for the solo* demultiplexing. - GX GN … gene ID and gene name for unique-gene reads. - gx gn … gene IDs and gene names for unique- and multi-gene reads. - CB UB … error-corrected cell barcodes and UMIs for solo* demultiplexing. Requires --outSAMtype BAM SortedByCoordinate. - sM … assessment of CB and UMI. - sS … sequence of the entire barcode (CB,UMI,adapter). - sQ … quality of the entire barcode. Unsupported/undocumented: - ha … haplotype (1/2) when mapping to the diploid genome. Requires genome generated with --genomeTransformType Diploid. - rB … alignment block read/genomic coordinates. - vR … read coordinate of the variant. List of string, example: "Standard", multiple_sep: ";"
--outSAMattrIHstart start value for the IH attribute. 0 may be required by some downstream software, such as Cufflinks or StringTie. integer, example: 1
--outSAMunmapped output of unmapped reads in the SAM format 1st word: - None … no output - Within … output unmapped reads within the main SAM file (i.e. Aligned.out.sam) 2nd word: - KeepPairs … record unmapped mate for each alignment, and, in case of unsorted output, keep it adjacent to its mapped mate. Only affects multi-mapping reads. List of string, multiple_sep: ";"
--outSAMorder type of sorting for the SAM output - Paired … one mate after the other for all paired alignments - PairedKeepInputOrder … one mate after the other for all paired alignments, the order is kept the same as in the input FASTQ files string, example: "Paired"
--outSAMprimaryFlag which alignments are considered primary - all others will be marked with 0x100 bit in the FLAG - OneBestScore … only one alignment with the best score is primary - AllBestScore … all alignments with the best score are primary string, example: "OneBestScore"
--outSAMreadID read ID record type - Standard … first word (until space) from the FASTx read ID line, removing /1,/2 from the end - Number … read number (index) in the FASTx file string, example: "Standard"
--outSAMmapqUnique 0 to 255: the MAPQ value for unique mappers integer, example: 255
--outSAMflagOR 0 to 65535: sam FLAG will be bitwise OR’d with this value, i.e. FLAG=FLAG | outSAMflagOR. This is applied after all flags have been set by STAR, and after outSAMflagAND. Can be used to set specific bits that are not set otherwise. integer, example: 0
--outSAMflagAND 0 to 65535: sam FLAG will be bitwise AND’d with this value, i.e. FLAG=FLAG & outSAMflagAND. This is applied after all flags have been set by STAR, but before outSAMflagOR. Can be used to unset specific bits. integer, example: 65535
--outSAMattrRGline SAM/BAM read group line. The first word contains the read group identifier and must start with "ID:", e.g. --outSAMattrRGline ID:xxx CN:yy "DS:z z z". xxx will be added as RG tag to each output alignment. Any spaces in the tag values have to be double quoted. Comma separated RG lines correspond to different (comma separated) input files in --readFilesIn. Commas have to be surrounded by spaces, e.g. --outSAMattrRGline ID:xxx , ID:zzz "DS:z z" , ID:yyy DS:yyyy List of string, multiple_sep: ";"
--outSAMheaderHD @HD (header) line of the SAM header List of string, multiple_sep: ";"
--outSAMheaderPG extra @PG (software) line of the SAM header (in addition to STAR) List of string, multiple_sep: ";"
--outSAMheaderCommentFile path to the file with @CO (comment) lines of the SAM header string
--outSAMfilter filter the output into main SAM/BAM files - KeepOnlyAddedReferences … only keep the reads for which all alignments are to the extra reference sequences added with --genomeFastaFiles at the mapping stage. - KeepAllAddedReferences … keep all alignments to the extra reference sequences added with --genomeFastaFiles at the mapping stage. List of string, multiple_sep: ";"
--outSAMmultNmax max number of multiple alignments for a read that will be output to the SAM/BAM files. Note that if this value is not equal to -1, the top scoring alignment will be output first - -1 … all alignments (up to --outFilterMultimapNmax) will be output integer, example: -1
--outSAMtlen calculation method for the TLEN field in the SAM/BAM files - 1 … leftmost base of the (+)strand mate to rightmost base of the (-)mate. (+)sign for the (+)strand mate - 2 … leftmost base of any mate to rightmost base of any mate. (+)sign for the mate with the leftmost base. This is different from 1 for overlapping mates with protruding ends integer, example: 1
--outBAMcompression -1 to 10 BAM compression level, -1=default compression (6?), 0=no compression, 10=maximum compression integer, example: 1
--outBAMsortingThreadN >=0: number of threads for BAM sorting. 0 will default to min(6,--runThreadN). integer, example: 0
--outBAMsortingBinsN >0: number of genome bins for coordinate-sorting integer, example: 50
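
The combined effect of --outSAMflagAND and --outSAMflagOR can be sketched as below. Following the stated order of application, the FLAG is first ANDed with the --outSAMflagAND value and then ORed with the --outSAMflagOR value. `apply_flag_masks` is a hypothetical helper; the bit values 0x100 (secondary alignment) and 0x400 (duplicate) come from the SAM specification.

```python
def apply_flag_masks(flag, flag_and=65535, flag_or=0):
    """Apply the AND mask first, then the OR mask, to a SAM FLAG."""
    return (flag & flag_and) | flag_or


# Set the duplicate bit (0x400) on a properly paired first mate (FLAG 99).
print(apply_flag_masks(99, flag_or=0x400))  # 1123

# Clear the secondary-alignment bit (0x100) from FLAG 355.
print(apply_flag_masks(355, flag_and=65535 & ~0x100))  # 99
```

With the example values listed for both parameters (65535 and 0), the FLAG is left unchanged.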

BAM processing

Name Description Attributes
--bamRemoveDuplicatesType mark duplicates in the BAM file, for now only works with (i) sorted BAM fed with inputBAMfile, and (ii) for paired-end alignments only - - … no duplicate removal/marking - UniqueIdentical … mark all multimappers, and duplicate unique mappers. The coordinates, FLAG, CIGAR must be identical - UniqueIdenticalNotMulti … mark duplicate unique mappers but not multimappers. string
--bamRemoveDuplicatesMate2basesN number of bases from the 5’ of mate 2 to use in collapsing (e.g. for RAMPAGE) integer, example: 0

Output Wiggle

Name Description Attributes
--outWigType type of signal output, e.g. “bedGraph” OR “bedGraph read1_5p”. Requires sorted BAM: --outSAMtype BAM SortedByCoordinate. 1st word: - None … no signal output - bedGraph … bedGraph format - wiggle … wiggle format 2nd word: - read1_5p … signal from only 5’ of the 1st read, useful for CAGE/RAMPAGE etc - read2 … signal from only 2nd read List of string, multiple_sep: ";"
--outWigStrand strandedness of wiggle/bedGraph output - Stranded … separate strands, str1 and str2 - Unstranded … collapsed strands string, example: "Stranded"
--outWigReferencesPrefix prefix matching reference names to include in the output wiggle file, e.g. “chr”, default “-” - include all references string
--outWigNorm type of normalization for the signal - RPM … reads per million of mapped reads - None … no normalization, “raw” counts string, example: "RPM"

Output Filtering

Name Description Attributes
--outFilterType type of filtering - Normal … standard filtering using only current alignment - BySJout … keep only those reads that contain junctions that passed filtering into SJ.out.tab string, example: "Normal"
--outFilterMultimapScoreRange the score range below the maximum score for multimapping alignments integer, example: 1
--outFilterMultimapNmax maximum number of loci the read is allowed to map to. Alignments (all of them) will be output only if the read maps to no more loci than this value. Otherwise no alignments will be output, and the read will be counted as “mapped to too many loci” in the Log.final.out. integer, example: 10
--outFilterMismatchNmax alignment will be output only if it has no more mismatches than this value. integer, example: 10
--outFilterMismatchNoverLmax alignment will be output only if its ratio of mismatches to mapped length is less than or equal to this value. double, example: 0.3
--outFilterMismatchNoverReadLmax alignment will be output only if its ratio of mismatches to read length is less than or equal to this value. double, example: 1
--outFilterScoreMin alignment will be output only if its score is higher than or equal to this value. integer, example: 0
--outFilterScoreMinOverLread same as outFilterScoreMin, but normalized to read length (sum of mates’ lengths for paired-end reads) double, example: 0.66
--outFilterMatchNmin alignment will be output only if the number of matched bases is higher than or equal to this value. integer, example: 0
--outFilterMatchNminOverLread same as outFilterMatchNmin, but normalized to the read length (sum of mates’ lengths for paired-end reads). double, example: 0.66
--outFilterIntronMotifs filter alignment using their motifs - None … no filtering - RemoveNoncanonical … filter out alignments that contain non-canonical junctions - RemoveNoncanonicalUnannotated … filter out alignments that contain non-canonical unannotated junctions when using annotated splice junctions database. The annotated non-canonical junctions will be kept. string
--outFilterIntronStrands filter alignments - RemoveInconsistentStrands … remove alignments that have junctions with inconsistent strands - None … no filtering string, example: "RemoveInconsistentStrands"

Output splice junctions (SJ.out.tab)

Name Description Attributes
--outSJtype type of splice junction output - Standard … standard SJ.out.tab output - None … no splice junction output string, example: "Standard"

Output Filtering: Splice Junctions

Name Description Attributes
--outSJfilterReads which reads to consider for collapsed splice junctions output - All … all reads, unique- and multi-mappers - Unique … uniquely mapping reads only string, example: "All"
--outSJfilterOverhangMin minimum overhang length for splice junctions on both sides for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif. Does not apply to annotated junctions. List of integer, example: 30, 12, 12, 12, multiple_sep: ";"
--outSJfilterCountUniqueMin minimum uniquely mapping read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif. Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions is satisfied. Does not apply to annotated junctions. List of integer, example: 3, 1, 1, 1, multiple_sep: ";"
--outSJfilterCountTotalMin minimum total (multi-mapping+unique) read count per junction for: (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. -1 means no output for that motif. Junctions are output if one of outSJfilterCountUniqueMin OR outSJfilterCountTotalMin conditions is satisfied. Does not apply to annotated junctions. List of integer, example: 3, 1, 1, 1, multiple_sep: ";"
--outSJfilterDistToOtherSJmin minimum allowed distance to other junctions’ donor/acceptor. Does not apply to annotated junctions. List of integer, example: 10, 0, 5, 10, multiple_sep: ";"
--outSJfilterIntronMaxVsReadN maximum gap allowed for junctions supported by 1,2,3,...,N reads, i.e. by default junctions supported by 1 read can have gaps <=50000b, by 2 reads: <=100000b, by 3 reads: <=200000b, by >=4 reads: any gap <=alignIntronMax. Does not apply to annotated junctions. List of integer, example: 50000, 100000, 200000, multiple_sep: ";"
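
How the --outSJfilterIntronMaxVsReadN list is read can be sketched as a lookup: the i-th value is the maximum gap for a junction supported by i reads, and junctions supported by more reads than there are list entries are limited only by alignIntronMax. `max_gap_for_support` is a hypothetical helper; returning None for the "limited only by alignIntronMax" case is a choice made for this illustration.

```python
def max_gap_for_support(n_reads, limits=(50000, 100000, 200000)):
    """Return the maximum allowed gap for a junction with n_reads support.

    None means the list imposes no limit and only alignIntronMax applies.
    """
    if n_reads < 1:
        raise ValueError("a junction needs at least one supporting read")
    if n_reads <= len(limits):
        return limits[n_reads - 1]
    return None


for n in (1, 2, 3, 4):
    print(n, max_gap_for_support(n))
```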

Scoring

Name Description Attributes
--scoreGap splice junction penalty (independent of intron motif) integer, example: 0
--scoreGapNoncan non-canonical junction penalty (in addition to scoreGap) integer, example: -8
--scoreGapGCAG GC/AG and CT/GC junction penalty (in addition to scoreGap) integer, example: -4
--scoreGapATAC AT/AC and GT/AT junction penalty (in addition to scoreGap) integer, example: -8
--scoreGenomicLengthLog2scale extra score logarithmically scaled with genomic length of the alignment: scoreGenomicLengthLog2scale*log2(genomicLength) integer, example: 0
--scoreDelOpen deletion open penalty integer, example: -2
--scoreDelBase deletion extension penalty per base (in addition to scoreDelOpen) integer, example: -2
--scoreInsOpen insertion open penalty integer, example: -2
--scoreInsBase insertion extension penalty per base (in addition to scoreInsOpen) integer, example: -2
--scoreStitchSJshift maximum score reduction while searching for SJ boundaries in the stitching step integer, example: 1

Alignments and Seeding

Name Description Attributes
--seedSearchStartLmax defines the search start point through the read - the read is split into pieces no longer than this value integer, example: 50
--seedSearchStartLmaxOverLread seedSearchStartLmax normalized to read length (sum of mates’ lengths for paired-end reads) double, example: 1
--seedSearchLmax defines the maximum length of the seeds, if =0 seed length is not limited integer, example: 0
--seedMultimapNmax only pieces that map fewer than this value are utilized in the stitching procedure integer, example: 10000
--seedPerReadNmax max number of seeds per read integer, example: 1000
--seedPerWindowNmax max number of seeds per window integer, example: 50
--seedNoneLociPerWindow max number of one seed loci per window integer, example: 10
--seedSplitMin min length of the seed sequences split by Ns or mate gap integer, example: 12
--seedMapMin min length of seeds to be mapped integer, example: 5
--alignIntronMin minimum intron size, genomic gap is considered intron if its length>=alignIntronMin, otherwise it is considered Deletion integer, example: 21
--alignIntronMax maximum intron size, if 0, max intron size will be determined by (2^winBinNbits)*winAnchorDistNbins integer, example: 0
--alignMatesGapMax maximum gap between two mates, if 0, max intron gap will be determined by (2^winBinNbits)*winAnchorDistNbins integer, example: 0
--alignSJoverhangMin minimum overhang (i.e. block size) for spliced alignments integer, example: 5
--alignSJstitchMismatchNmax maximum number of mismatches for stitching of the splice junctions (-1: no limit). (1) non-canonical motifs, (2) GT/AG and CT/AC motif, (3) GC/AG and CT/GC motif, (4) AT/AC and GT/AT motif. List of integer, example: 0, -1, 0, 0, multiple_sep: ";"
--alignSJDBoverhangMin minimum overhang (i.e. block size) for annotated (sjdb) spliced alignments integer, example: 3
--alignSplicedMateMapLmin minimum mapped length for a read mate that is spliced integer, example: 0
--alignSplicedMateMapLminOverLmate alignSplicedMateMapLmin normalized to mate length double, example: 0.66
--alignWindowsPerReadNmax max number of windows per read integer, example: 10000
--alignTranscriptsPerWindowNmax max number of transcripts per window integer, example: 100
--alignTranscriptsPerReadNmax max number of different alignments per read to consider integer, example: 10000
--alignEndsType type of read ends alignment - Local … standard local alignment with soft-clipping allowed - EndToEnd … force end-to-end read alignment, do not soft-clip - Extend5pOfRead1 … fully extend only the 5p of the read1, all other ends: local alignment - Extend5pOfReads12 … fully extend only the 5p of both read1 and read2, all other ends: local alignment string, example: "Local"
--alignEndsProtrude allow protrusion of alignment ends, i.e. start (end) of the +strand mate downstream of the start (end) of the -strand mate 1st word: int: maximum number of protrusion bases allowed 2nd word: string: - ConcordantPair … report alignments with non-zero protrusion as concordant pairs - DiscordantPair … report alignments with non-zero protrusion as discordant pairs string, example: "0 ConcordantPair"
--alignSoftClipAtReferenceEnds allow the soft-clipping of the alignments past the end of the chromosomes - Yes … allow - No … prohibit, useful for compatibility with Cufflinks string, example: "Yes"
--alignInsertionFlush how to flush ambiguous insertion positions - None … insertions are not flushed - Right … insertions are flushed to the right string

Paired-End reads

Name Description Attributes
--peOverlapNbasesMin minimum number of overlapping bases to trigger mates merging and realignment. Specify >0 value to switch on the “merging of overlapping mates” algorithm. integer, example: 0
--peOverlapMMp maximum proportion of mismatched bases in the overlap area double, example: 0.01

Windows, Anchors, Binning

Name Description Attributes
--winAnchorMultimapNmax max number of loci anchors are allowed to map to integer, example: 50
--winBinNbits =log2(winBin), where winBin is the size of the bin for the windows/clustering, each window will occupy an integer number of bins. integer, example: 16
--winAnchorDistNbins max number of bins between two anchors that allows aggregation of anchors into one window integer, example: 9
--winFlankNbins =log2(winFlank), where winFlank is the size of the left and right flanking regions for each window integer, example: 4
--winReadCoverageRelativeMin minimum relative coverage of the read sequence by the seeds in a window, for STARlong algorithm only. double, example: 0.5
--winReadCoverageBasesMin minimum number of bases covered by the seeds in a window, for STARlong algorithm only. integer, example: 0
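
When --alignIntronMax or --alignMatesGapMax is 0, the limit is described as (2^winBinNbits)*winAnchorDistNbins. The sketch below evaluates that expression with the example values listed for --winBinNbits (16) and --winAnchorDistNbins (9); `window_derived_max_gap` is a hypothetical helper.

```python
def window_derived_max_gap(win_bin_nbits=16, win_anchor_dist_nbins=9):
    """Evaluate (2^winBinNbits) * winAnchorDistNbins."""
    return (2 ** win_bin_nbits) * win_anchor_dist_nbins


print(window_derived_max_gap())  # 589824
```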

Chimeric Alignments

Name Description Attributes
--chimOutType type of chimeric output - Junctions … Chimeric.out.junction - SeparateSAMold … output old SAM into separate Chimeric.out.sam file - WithinBAM … output into main aligned BAM files (Aligned.*.bam) - WithinBAM HardClip … (default) hard-clipping in the CIGAR for supplemental chimeric alignments (default if no 2nd word is present) - WithinBAM SoftClip … soft-clipping in the CIGAR for supplemental chimeric alignments List of string, example: "Junctions", multiple_sep: ";"
--chimSegmentMin minimum length of chimeric segment, if ==0, no chimeric output integer, example: 0
--chimScoreMin minimum total (summed) score of the chimeric segments integer, example: 0
--chimScoreDropMax max drop (difference) of chimeric score (the sum of scores of all chimeric segments) from the read length integer, example: 20
--chimScoreSeparation minimum difference (separation) between the best chimeric score and the next one integer, example: 10
--chimScoreJunctionNonGTAG penalty for a non-GT/AG chimeric junction integer, example: -1
--chimJunctionOverhangMin minimum overhang for a chimeric junction integer, example: 20
--chimSegmentReadGapMax maximum gap in the read sequence between chimeric segments integer, example: 0
--chimFilter different filters for chimeric alignments - None … no filtering - banGenomicN … Ns are not allowed in the genome sequence around the chimeric junction List of string, example: "banGenomicN", multiple_sep: ";"
--chimMainSegmentMultNmax maximum number of multi-alignments for the main chimeric segment. =1 will prohibit multimapping main segments. integer, example: 10
--chimMultimapNmax maximum number of chimeric multi-alignments - 0 … use the old scheme for chimeric detection which only considered unique alignments integer, example: 0
--chimMultimapScoreRange the score range for multi-mapping chimeras below the best chimeric score. Only works with --chimMultimapNmax > 1 integer, example: 1
--chimNonchimScoreDropMin to trigger chimeric detection, the drop in the best non-chimeric alignment score with respect to the read length has to be greater than this value integer, example: 20
--chimOutJunctionFormat formatting type for the Chimeric.out.junction file - 0 … no comment lines/headers - 1 … comment lines at the end of the file: command line and Nreads: total, unique/multi-mapping integer, example: 0
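
As a rough sketch, the chimeric options above could be set in params.yaml as shown below. This assumes the keys are the argument names without the leading dashes, as in the params.yaml example earlier on this page. The values are illustrative only and are not recommended settings.

```yaml
# Chimeric alignment settings (illustrative values, not recommendations)
chimSegmentMin: 12           # any value > 0 enables chimeric output; 12 is an arbitrary choice
chimOutType: ["Junctions"]   # write Chimeric.out.junction
chimJunctionOverhangMin: 20  # minimum overhang for a chimeric junction
chimFilter: ["banGenomicN"]  # disallow Ns in the genome sequence around the junction
chimOutJunctionFormat: 1     # add comment lines at the end of the junction file
```

Leaving chimSegmentMin at 0 disables chimeric output, so the other chim* settings have no visible effect in that case.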

Quantification of Annotations

Name Description Attributes
--quantMode types of quantification requested - - … none - TranscriptomeSAM … output SAM/BAM alignments to transcriptome into a separate file - GeneCounts … count reads per gene List of string, multiple_sep: ";"
--quantTranscriptomeBAMcompression -2 to 10 transcriptome BAM compression level - -2 … no BAM output - -1 … default compression (6?) - 0 … no compression - 10 … maximum compression integer, example: 1
--quantTranscriptomeBan prohibit various alignment types - IndelSoftclipSingleend … prohibit indels, soft clipping and single-end alignments - compatible with RSEM - Singleend … prohibit single-end alignments string, example: "IndelSoftclipSingleend"
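
A minimal sketch of requesting both quantification outputs in params.yaml, again assuming the keys mirror the argument names without the leading dashes. The compression level is an arbitrary illustrative choice.

```yaml
# Quantification settings (illustrative)
quantMode: ["TranscriptomeSAM", "GeneCounts"]    # transcriptome alignments plus per-gene read counts
quantTranscriptomeBan: "IndelSoftclipSingleend"  # keeps the transcriptome output compatible with RSEM
quantTranscriptomeBAMcompression: 1              # arbitrary level within the documented -2 to 10 range
```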

2-pass Mapping

Name Description Attributes
--twopassMode 2-pass mapping mode. - None … 1-pass mapping - Basic … basic 2-pass mapping, with all 1st pass junctions inserted into the genome indices on the fly string
--twopass1readsN number of reads to process for the 1st step. Use very large number (or default -1) to map all reads in the first step. integer, example: -1
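
For example, basic 2-pass mapping over all reads might be configured as follows, assuming the same key convention as the params.yaml example earlier on this page:

```yaml
# 2-pass mapping (illustrative)
twopassMode: "Basic"  # insert 1st-pass junctions into the genome indices on the fly
twopass1readsN: -1    # -1 maps all reads in the first step
```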

WASP parameters

Name Description Attributes
--waspOutputMode WASP allele-specific output type. This is a re-implementation of the original WASP mappability filtering by Bryce van de Geijn, Graham McVicker, Yoav Gilad & Jonathan K Pritchard. Please cite the original WASP paper: Nature Methods 12, 1061-1063 (2015), https://www.nature.com/articles/nmeth.3582 . - SAMtag … add WASP tags to the alignments that pass WASP filtering string

STARsolo (single cell RNA-seq) parameters

Name Description Attributes
--soloType type of single-cell RNA-seq - CB_UMI_Simple … (a.k.a. Droplet) one UMI and one Cell Barcode of fixed length in read2, e.g. Drop-seq and 10X Chromium. - CB_UMI_Complex … multiple Cell Barcodes of varying length, one UMI of fixed length and one adapter sequence of fixed length are allowed in read2 only (e.g. inDrop, ddSeq). - CB_samTagOut … output Cell Barcode as CR and/or CB SAM tag. No UMI counting. --readFilesIn cDNA_read1 [cDNA_read2 if paired-end] CellBarcode_read . Requires --outSAMtype BAM Unsorted [and/or SortedByCoordinate] - SmartSeq … Smart-seq: each cell in a separate FASTQ (paired- or single-end), barcodes are corresponding read-groups, no UMI sequences, alignments deduplicated according to alignment start and end (after extending soft-clipped bases) List of string, multiple_sep: ";"
--soloCBwhitelist file(s) with whitelist(s) of cell barcodes. Only --soloType CB_UMI_Complex allows more than one whitelist file. - None … no whitelist: all cell barcodes are allowed List of string, multiple_sep: ";"
--soloCBstart cell barcode start base integer, example: 1
--soloCBlen cell barcode length integer, example: 16
--soloUMIstart UMI start base integer, example: 17
--soloUMIlen UMI length integer, example: 10
--soloBarcodeReadLength length of the barcode read - 1 … equal to sum of soloCBlen+soloUMIlen - 0 … not defined, do not check integer, example: 1
--soloBarcodeMate identifies which read mate contains the barcode (CB+UMI) sequence - 0 … barcode sequence is on a separate read, which should always be the last file listed in --readFilesIn - 1 … barcode sequence is a part of mate 1 - 2 … barcode sequence is a part of mate 2 integer, example: 0
--soloCBposition position of Cell Barcode(s) on the barcode read. Presently only works with --soloType CB_UMI_Complex, and barcodes are assumed to be on Read2. Format for each barcode: startAnchor_startPosition_endAnchor_endPosition. start(end)Anchor defines the Anchor Base for the CB: 0: read start; 1: read end; 2: adapter start; 3: adapter end. start(end)Position is the 0-based position of the CB start(end) with respect to the Anchor Base. Strings for different barcodes are separated by spaces. Example: inDrop (Zilionis et al, Nat. Protocols, 2017): --soloCBposition 0_0_2_-1 3_1_3_8 List of string, multiple_sep: ";"
--soloUMIposition position of the UMI on the barcode read, in the same format as soloCBposition. Example: inDrop (Zilionis et al, Nat. Protocols, 2017): --soloUMIposition 3_9_3_14 string
--soloAdapterSequence adapter sequence to anchor barcodes. Only one adapter sequence is allowed. string
--soloAdapterMismatchesNmax maximum number of mismatches allowed in adapter sequence. integer, example: 1
--soloCBmatchWLtype matching the Cell Barcodes to the WhiteList - Exact … only exact matches allowed - 1MM … only one match in whitelist with 1 mismatched base allowed. Allowed CBs have to have at least one read with exact match. - 1MM_multi … multiple matches in whitelist with 1 mismatched base allowed, posterior probability calculation is used to choose one of the matches. Allowed CBs have to have at least one read with exact match. This option matches best with CellRanger 2.2.0 - 1MM_multi_pseudocounts … same as 1MM_multi, but pseudocounts of 1 are added to all whitelist barcodes. - 1MM_multi_Nbase_pseudocounts … same as 1MM_multi_pseudocounts, multimatching to WL is allowed for CBs with N-bases. This option matches best with CellRanger >= 3.0.0 - EditDist_2 … allow up to edit distance of 3 for each of the barcodes. May include one deletion + one insertion. Only works with --soloType CB_UMI_Complex. Matches to multiple passlist barcodes are not allowed. Similar to ParseBio Split-seq pipeline. string, example: "1MM_multi"
--soloInputSAMattrBarcodeSeq when inputting reads from a SAM file (--readFilesType SAM SE/PE), these SAM attributes mark the barcode sequence (in proper order). For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeSeq CR UR . This parameter is required when running STARsolo with input from SAM. List of string, multiple_sep: ";"
--soloInputSAMattrBarcodeQual when inputting reads from a SAM file (--readFilesType SAM SE/PE), these SAM attributes mark the barcode qualities (in proper order). For instance, for 10X CellRanger or STARsolo BAMs, use --soloInputSAMattrBarcodeQual CY UY . If this parameter is ‘-’ (default), the quality ‘H’ will be assigned to all bases. List of string, multiple_sep: ";"
--soloStrand strandedness of the solo libraries: - Unstranded … no strand information - Forward … read strand same as the original RNA molecule - Reverse … read strand opposite to the original RNA molecule string, example: "Forward"
--soloFeatures genomic features for which the UMI counts per Cell Barcode are collected - Gene … genes: reads match the gene transcript - SJ … splice junctions: reported in SJ.out.tab - GeneFull … full gene (pre-mRNA): count all reads overlapping genes’ exons and introns - GeneFull_ExonOverIntron … full gene (pre-mRNA): count all reads overlapping genes’ exons and introns: prioritize 100% overlap with exons - GeneFull_Ex50pAS … full gene (pre-mRNA): count all reads overlapping genes’ exons and introns: prioritize >50% overlap with exons. Do not count reads with 100% exonic overlap in the antisense direction. List of string, example: "Gene", multiple_sep: ";"
--soloMultiMappers counting method for reads mapping to multiple genes - Unique … count only reads that map to unique genes - Uniform … uniformly distribute multi-genic UMIs to all genes - Rescue … distribute UMIs proportionally to unique+uniform counts (~ first iteration of EM) - PropUnique … distribute UMIs proportionally to unique mappers, if present, and uniformly if not. - EM … multi-gene UMIs are distributed using Expectation Maximization algorithm List of string, example: "Unique", multiple_sep: ";"
--soloUMIdedup type of UMI deduplication (collapsing) algorithm - 1MM_All … all UMIs with 1 mismatch distance to each other are collapsed (i.e. counted once). - 1MM_Directional_UMItools … follows the “directional” method from the UMI-tools by Smith, Heger and Sudbery (Genome Research 2017). - 1MM_Directional … same as 1MM_Directional_UMItools, but with more stringent criteria for duplicate UMIs - Exact … only exactly matching UMIs are collapsed. - NoDedup … no deduplication of UMIs, count all reads. - 1MM_CR … CellRanger2-4 algorithm for 1MM UMI collapsing. List of string, example: "1MM_All", multiple_sep: ";"
--soloUMIfiltering type of UMI filtering (for reads uniquely mapping to genes) - - … basic filtering: remove UMIs with N and homopolymers (similar to CellRanger 2.2.0). - MultiGeneUMI … basic + remove lower-count UMIs that map to more than one gene. - MultiGeneUMI_All … basic + remove all UMIs that map to more than one gene. - MultiGeneUMI_CR … basic + remove lower-count UMIs that map to more than one gene, matching CellRanger > 3.0.0 . Only works with --soloUMIdedup 1MM_CR List of string, multiple_sep: ";"
--soloOutFileNames file names for STARsolo output: file_name_prefix gene_names barcode_sequences cell_feature_count_matrix List of string, example: "Solo.out/", "features.tsv", "barcodes.tsv", "matrix.mtx", multiple_sep: ";"
--soloCellFilter cell filtering type and parameters - None … do not output filtered cells - TopCells … only report top cells by UMI count, followed by the exact number of cells - CellRanger2.2 … simple filtering of CellRanger 2.2. Can be followed by numbers: number of expected cells, robust maximum percentile for UMI count, maximum to minimum ratio for UMI count. The hardcoded values are from CellRanger: nExpectedCells=3000; maxPercentile=0.99; maxMinRatio=10 - EmptyDrops_CR … EmptyDrops filtering in CellRanger flavor. Please cite the original EmptyDrops paper: A.T.L Lun et al, Genome Biology, 20, 63 (2019): https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y Can be followed by 10 numeric parameters: nExpectedCells maxPercentile maxMinRatio indMin indMax umiMin umiMinFracMedian candMaxN FDR simN. The hardcoded values are from CellRanger: 3000 0.99 10 45000 90000 500 0.01 20000 0.01 10000 List of string, example: "CellRanger2.2", "3000", "0.99", "10", multiple_sep: ";"
--soloOutFormatFeaturesGeneField3 field 3 in the Gene features.tsv file. If “-”, then no 3rd field is output. List of string, example: "Gene Expression", multiple_sep: ";"
--soloCellReadStats Output reads statistics for each CB - Standard … standard output string
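
The STARsolo arguments above could be combined roughly as follows for a droplet-style library with a 16-base cell barcode followed by a 10-base UMI. This is a sketch under assumptions: the keys are taken to be the argument names without the leading dashes, the whitelist path is a hypothetical placeholder, and the barcode geometry simply reuses the example values from the table, so it must be adapted to the actual chemistry.

```yaml
# STARsolo settings for a fixed-length CB + UMI layout (illustrative)
soloType: ["CB_UMI_Simple"]
soloCBwhitelist: ["/path/to/whitelist.txt"]  # hypothetical path; replace with your barcode whitelist
soloCBstart: 1                               # cell barcode start base
soloCBlen: 16                                # cell barcode length
soloUMIstart: 17                             # UMI starts right after the cell barcode
soloUMIlen: 10                               # UMI length
soloBarcodeReadLength: 1                     # expect barcode read length == soloCBlen + soloUMIlen
soloStrand: "Forward"
soloFeatures: ["Gene"]
soloCBmatchWLtype: "1MM_multi"
soloUMIdedup: ["1MM_All"]
soloCellFilter: ["CellRanger2.2", "3000", "0.99", "10"]
```

With soloBarcodeReadLength set to 1, the barcode read length is checked against soloCBlen + soloUMIlen (26 here); setting it to 0 skips that check.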

HTSeq arguments

Name Description Attributes
--stranded Whether the data is from a strand-specific assay. ‘reverse’ means ‘yes’ with reversed strand interpretation. string, default: "yes"
--minimum_alignment_quality Skip all reads with MAPQ alignment quality lower than the given minimum value. MAPQ is the 5th column of a SAM/BAM file and its usage depends on the software used to map the reads. integer, default: 10
--type Feature type (3rd column in GTF file) to be used, all features of other type are ignored (default, suitable for Ensembl GTF files: exon) string, example: "exon"
--id_attribute GTF attribute to be used as feature ID (default, suitable for Ensembl GTF files: gene_id). All features of the right type (see -t option) within the same GTF attribute will be added together. The typical way of using this option is to count all exonic reads from each gene and add the exons but other uses are possible as well. You can call this option multiple times: in that case, the combination of all attributes separated by colons (:) will be used as a unique identifier, e.g. for exons you might use -i gene_id -i exon_number. List of string, example: "gene_id", multiple_sep: ";"
--additional_attributes Additional feature attributes (suitable for Ensembl GTF files: gene_name). Use multiple times for more than one additional attribute. These attributes are only used as annotations in the output, while the determination of how the counts are added together is done based on option -i. List of string, example: "gene_name", multiple_sep: ";"
--add_chromosome_info Store information about the chromosome of each feature as an additional attribute (e.g. column in the TSV output file). boolean_true
--mode Mode to handle reads overlapping more than one feature. string, default: "union"
--non_unique Whether and how to score reads that are not uniquely aligned or ambiguously assigned to features. string, default: "none"
--secondary_alignments Whether to score secondary alignments (0x100 flag). string
--supplementary_alignments Whether to score supplementary alignments (0x800 flag). string
--counts_output_sparse Store the counts as a sparse matrix (mtx, h5ad, loom). boolean_true
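
As an illustration, the HTSeq arguments might be set in params.yaml as follows for a reverse-stranded library annotated with an Ensembl-style GTF. The key names are assumed to match the argument names without the leading dashes, and the values are examples rather than recommendations.

```yaml
# HTSeq counting settings (illustrative)
run_htseq_count: true
stranded: "reverse"                   # strand-specific assay with reversed strand interpretation
minimum_alignment_quality: 10         # skip reads with MAPQ below 10
type: "exon"                          # GTF feature type (3rd column) to count
id_attribute: ["gene_id"]             # counts are added together per gene_id
additional_attributes: ["gene_name"]  # annotation only; does not affect how counts are added
mode: "union"
non_unique: "none"
```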

Authors

  • Angela Oliveira Pisco (author)

  • Robrecht Cannoodt (author, maintainer)