Filter with scrublet

Doublet detection using the Scrublet method (Wolock, Lopez and Klein, 2019).

Info

ID: filter_with_scrublet
Namespace: filter

The method tests for potential doublets by simulating synthetic doublets from the expression profiles of observed cells and comparing each observed cell against them. It returns a “doublet score” for every cell, and potential doublets are called by thresholding that score.
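The idea can be illustrated with a small, self-contained sketch. This is a simplified toy version and not Scrublet's actual implementation: the function names, the toy count matrix, the total-count normalization and the choice of k are all assumptions made for illustration, and the real method also filters genes and runs PCA before searching for neighbours.

```python
import random

def simulate_doublets(cells, n_doublets, rng):
    """Create synthetic doublets by summing the counts of two randomly chosen cells."""
    doublets = []
    for _ in range(n_doublets):
        a, b = rng.sample(range(len(cells)), 2)
        doublets.append([x + y for x, y in zip(cells[a], cells[b])])
    return doublets

def normalize(profile):
    """Scale a count vector so that it sums to 1 (toy total-count normalization)."""
    total = sum(profile)
    return [x / total for x in profile] if total else list(profile)

def euclidean(u, v):
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def doublet_scores(cells, doublets, k):
    """Score each observed cell by the fraction of simulated doublets among its k nearest neighbours."""
    obs = [normalize(c) for c in cells]
    sim = [normalize(d) for d in doublets]
    scores = []
    for i, cell in enumerate(obs):
        # Candidate neighbours: every other observed cell (label 0) and every simulated doublet (label 1).
        neighbours = [(euclidean(cell, o), 0) for j, o in enumerate(obs) if j != i]
        neighbours += [(euclidean(cell, s), 1) for s in sim]
        neighbours.sort(key=lambda pair: pair[0])
        scores.append(sum(label for _, label in neighbours[:k]) / k)
    return scores

rng = random.Random(0)
# Hypothetical toy data: two "cell types" expressing disjoint genes, plus one mixed profile (index 6).
cells = [
    [10, 9, 0, 0], [11, 10, 0, 0], [9, 10, 0, 0],
    [0, 0, 10, 11], [0, 0, 9, 10], [0, 0, 11, 9],
    [10, 10, 10, 10],
]
doublets = simulate_doublets(cells, n_doublets=20, rng=rng)
scores = doublet_scores(cells, doublets, k=5)
for index, score in enumerate(scores):
    print(index, round(score, 2))
```

In this toy data the mixed profile at index 6 sits among the simulated doublets, so it should receive the highest score, while the cells of a single type keep other observed cells of that type among their nearest neighbours.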

For the source code please visit https://github.com/AllonKleinLab/scrublet.

For 10x we expect the doublet rates to be:

Multiplet Rate (%)    # of Cells Loaded    # of Cells Recovered
~0.4%                 ~800                 ~500
~0.8%                 ~1,600               ~1,000
~1.6%                 ~3,200               ~2,000
~2.3%                 ~4,800               ~3,000
~3.1%                 ~6,400               ~4,000
~3.9%                 ~8,000               ~5,000
~4.6%                 ~9,600               ~6,000
~5.4%                 ~11,200              ~7,000
~6.1%                 ~12,800              ~8,000
~6.9%                 ~14,400              ~9,000
~7.6%                 ~16,000              ~10,000
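For cell counts that fall between the listed values, the expected rate can be estimated by linear interpolation. The sketch below is only an illustration: the function name is hypothetical, the values are copied from the 10x figures above, and clamping outside the 500 to 10,000 range is an assumption rather than a documented rule.

```python
# (cells recovered, multiplet rate in %) pairs copied from the 10x figures above.
RATES = [
    (500, 0.4), (1000, 0.8), (2000, 1.6), (3000, 2.3), (4000, 3.1), (5000, 3.9),
    (6000, 4.6), (7000, 5.4), (8000, 6.1), (9000, 6.9), (10000, 7.6),
]

def expected_multiplet_rate(cells_recovered):
    """Linearly interpolate the expected multiplet rate (%) for a number of recovered cells."""
    # Assumption: outside the tabulated range, clamp to the nearest listed value.
    if cells_recovered <= RATES[0][0]:
        return RATES[0][1]
    if cells_recovered >= RATES[-1][0]:
        return RATES[-1][1]
    for (x0, y0), (x1, y1) in zip(RATES, RATES[1:]):
        if x0 <= cells_recovered <= x1:
            return y0 + (y1 - y0) * (cells_recovered - x0) / (x1 - x0)

for n in (1500, 2500, 7500):
    print(n, round(expected_multiplet_rate(n), 2))
```

An estimate like this can serve as a rough sanity check on the fraction of cells that end up flagged as doublets in a sample.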

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.2 -latest \
  -main-script target/nextflow/filter/filter_with_scrublet/main.nf \
  --help

Run command

Example of params.yaml
# Arguments
input: # please fill in - example: "input.h5mu"
modality: "rna"
# layer: "foo"
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
obs_name_filter: "filter_with_scrublet"
do_subset: false
obs_name_doublet_score: "scrublet_doublet_score"
min_counts: 2
min_cells: 3
min_gene_variablity_percent: 85
num_pca_components: 30
distance_metric: "euclidean"
allow_automatic_threshold_detection_fail: false

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.2 -latest \
  -profile docker \
  -main-script target/nextflow/filter/filter_with_scrublet/main.nf \
  -params-file params.yaml

Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.
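
When the component is launched from a script, the parameter file and the command line can be assembled programmatically. The sketch below is a hypothetical helper, not part of the pipeline: the function names are made up, the YAML writer only handles flat string, boolean and numeric values, and the file names are the example values shown above. It prints the command rather than executing it.

```python
import shlex

# Example values taken from the params.yaml shown above; adjust to your data.
params = {
    "input": "input.h5mu",
    "modality": "rna",
    "obs_name_filter": "filter_with_scrublet",
    "do_subset": False,
    "min_counts": 2,
    "min_cells": 3,
    "publish_dir": "output/",
}

def to_yaml(values):
    """Render a flat dict as simple YAML (strings quoted, booleans lowercased)."""
    lines = []
    for key, value in values.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, str):
            rendered = f'"{value}"'
        else:
            rendered = str(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"

def build_command(params_file, profile="docker"):
    """Assemble the nextflow invocation as an argument list."""
    return [
        "nextflow", "run", "openpipelines-bio/openpipeline",
        "-r", "1.0.2", "-latest",
        "-profile", profile,
        "-main-script", "target/nextflow/filter/filter_with_scrublet/main.nf",
        "-params-file", str(params_file),
    ]

yaml_text = to_yaml(params)
command = build_command("params.yaml", profile="singularity")
print(yaml_text)
print(shlex.join(command))
```

Writing yaml_text to params.yaml and passing the argument list to subprocess.run would then launch the run with the chosen backend.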

Argument group

Arguments

Name Description Attributes
--input Input h5mu file file, required, example: "input.h5mu"
--modality string, default: "rna"
--layer Input layer to use as data for calculating doublets. .X is used if not specified. string
--output Output h5mu file. file, example: "output.h5mu"
--output_compression The compression format to be used on the output h5mu object. string, example: "gzip"
--obs_name_filter In which .obs slot to store a boolean array corresponding to which observations should be filtered out. string, default: "filter_with_scrublet"
--do_subset Whether to subset before storing the output. boolean_true
--obs_name_doublet_score Name of the doublet scores column in the obs slot of the returned object. string, default: "scrublet_doublet_score"
--min_counts The minimum number of UMI counts per cell that have to be present for initial cell detection. integer, default: 2
--min_cells The minimum number of cells in which UMIs for a gene must be detected. integer, default: 3
--min_gene_variablity_percent Used for gene filtering prior to PCA. Keep the most highly variable genes (in the top min_gene_variability_pctl percentile), as measured by the v-statistic [Klein et al., Cell 2015]. double, default: 85
--num_pca_components Number of principal components to use during PCA dimensionality reduction. integer, default: 30
--distance_metric The distance metric used for computing similarities. string, default: "euclidean"
--allow_automatic_threshold_detection_fail When Scrublet fails to automatically determine the doublet score threshold, allow the component to continue and set the output columns to NA. boolean_true

Authors

  • Dries De Maeyer (contributor)

  • Robrecht Cannoodt (maintainer, contributor)