Filter with scrublet

Doublet detection using the Scrublet method (Wolock, Lopez and Klein, 2019).

Info

ID: filter_with_scrublet
Namespace: filter

The method tests for potential doublets by using the expression profiles of observed cells to generate synthetic doublets, which are then compared against the observed cells. It returns a “doublet score” for each cell, which it uses to call potential doublets.
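The simulation step described above can be illustrated with a toy sketch. This is not Scrublet's implementation: the function name `simulate_doublets`, the example counts, and the ratio value used here are assumptions chosen for illustration. It only shows the idea of building synthetic doublets by summing the counts of randomly chosen pairs of observed cells.

```python
import random


def simulate_doublets(counts, sim_doublet_ratio=2.0, seed=0):
    """Toy illustration: build synthetic doublets by summing random cell pairs.

    counts: list of per-cell count vectors (one list of ints per cell).
    sim_doublet_ratio: number of synthetic doublets relative to observed cells
        (the value here is illustrative only).
    Returns (doublets, parents): the synthetic profiles and the index pairs used.
    """
    rng = random.Random(seed)
    n_cells = len(counts)
    n_sim = int(n_cells * sim_doublet_ratio)
    doublets, parents = [], []
    for _ in range(n_sim):
        # Pick two observed cells at random and add their counts gene by gene.
        i, j = rng.randrange(n_cells), rng.randrange(n_cells)
        doublets.append([a + b for a, b in zip(counts[i], counts[j])])
        parents.append((i, j))
    return doublets, parents


# Hypothetical data: three cells, three genes.
observed = [[5, 0, 1], [0, 7, 2], [3, 3, 0]]
doublets, parents = simulate_doublets(observed, sim_doublet_ratio=2.0)
```

In the real method, observed cells are then scored by how many of their nearest neighbors are synthetic doublets; that step is left out of this sketch.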

For the source code please visit https://github.com/AllonKleinLab/scrublet.

For 10x, we expect the following doublet rates:

Multiplet Rate (%)	# of Cells Loaded	# of Cells Recovered
~0.4%	~800	~500
~0.8%	~1,600	~1,000
~1.6%	~3,200	~2,000
~2.3%	~4,800	~3,000
~3.1%	~6,400	~4,000
~3.9%	~8,000	~5,000
~4.6%	~9,600	~6,000
~5.4%	~11,200	~7,000
~6.1%	~12,800	~8,000
~6.9%	~14,400	~9,000
~7.6%	~16,000	~10,000
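The 10x figures above can be turned into a rough value for `expected_doublet_rate`. The sketch below is one possible approach, not part of the component: the helper name `expected_doublet_rate_10x` is hypothetical, and linear interpolation between rows (with clamping outside the listed range) is an assumption. It returns a fraction, since the argument is described as the estimated fraction of doublets.

```python
from bisect import bisect_left

# (cells recovered, multiplet rate in %) taken from the 10x figures above.
TABLE_10X = [
    (500, 0.4), (1000, 0.8), (2000, 1.6), (3000, 2.3), (4000, 3.1),
    (5000, 3.9), (6000, 4.6), (7000, 5.4), (8000, 6.1), (9000, 6.9),
    (10000, 7.6),
]


def expected_doublet_rate_10x(n_recovered):
    """Interpolate the multiplet rate for a number of recovered cells.

    Returns a fraction (e.g. 0.039 for 3.9%). Values outside the listed
    range are clamped to the first or last row (an assumption).
    """
    cells = [c for c, _ in TABLE_10X]
    if n_recovered <= cells[0]:
        return TABLE_10X[0][1] / 100
    if n_recovered >= cells[-1]:
        return TABLE_10X[-1][1] / 100
    # Find the two neighboring rows and interpolate linearly between them.
    idx = bisect_left(cells, n_recovered)
    (c0, r0), (c1, r1) = TABLE_10X[idx - 1], TABLE_10X[idx]
    frac = (n_recovered - c0) / (c1 - c0)
    return (r0 + frac * (r1 - r0)) / 100
```

For example, `expected_doublet_rate_10x(5000)` gives roughly 0.039, which could then be passed as `--expected_doublet_rate`.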

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -main-script target/nextflow/filter/filter_with_scrublet/main.nf \
  --help

Run command

Example of params.yaml

# Arguments
input: # please fill in - example: "input.h5mu"
modality: "rna"
# layer: "foo"
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
obs_name_filter: "filter_with_scrublet"
do_subset: false
obs_name_doublet_score: "scrublet_doublet_score"
# expected_doublet_rate: 123.0
# stdev_doublet_rate: 123.0
# n_neighbors: 123
# sim_doublet_ratio: 123.0
min_counts: 2
min_cells: 3
min_gene_variablity_percent: 85.0
num_pca_components: 30
distance_metric: "euclidean"
allow_automatic_threshold_detection_fail: false

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"

nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -profile docker \
  -main-script target/nextflow/filter/filter_with_scrublet/main.nf \
  -params-file params.yaml

Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.
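When runs are scripted, the parameters file can also be generated programmatically. The sketch below is a minimal example under stated assumptions: the helper `write_params` is hypothetical, and it writes JSON rather than YAML, relying on the fact that Nextflow's `-params-file` option accepts either format. Only argument names listed on this page are used.

```python
import json
from pathlib import Path


def write_params(path, input_path, publish_dir, **overrides):
    """Write a Nextflow params file as JSON and return its path.

    input_path and publish_dir are the two values marked "please fill in"
    in the example above; any other argument can be passed as a keyword.
    """
    params = {
        "input": input_path,
        "modality": "rna",
        "publish_dir": publish_dir,
    }
    # Keyword arguments such as expected_doublet_rate=0.039 are added as-is.
    params.update(overrides)
    path = Path(path)
    path.write_text(json.dumps(params, indent=2) + "\n")
    return path
```

Calling `write_params("params.json", "input.h5mu", "output/", expected_doublet_rate=0.039)` produces a file that can be passed with `-params-file params.json` in the run command above.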

Argument group

Arguments

Name	Description	Attributes
`--input`	Input h5mu file	`file`, required, example: `"input.h5mu"`
`--modality`		`string`, default: `"rna"`
`--layer`	Input layer to use as data for calculating doublets. .X is used if not specified.	`string`
`--output`	Output h5mu file.	`file`, example: `"output.h5mu"`
`--output_compression`	The compression format to be used on the output h5mu object.	`string`, example: `"gzip"`
`--obs_name_filter`	In which .obs slot to store a boolean array corresponding to which observations should be filtered out.	`string`, default: `"filter_with_scrublet"`
`--do_subset`	Whether to subset before storing the output.	`boolean_true`
`--obs_name_doublet_score`	Name of the doublet scores column in the obs slot of the returned object.	`string`, default: `"scrublet_doublet_score"`
`--expected_doublet_rate`	The estimated fraction of doublets expected from the experimental setup.	`double`
`--stdev_doublet_rate`	Uncertainty in the expected doublet rate.	`double`
`--n_neighbors`	Number of neighbors used to construct the KNN classifier of observed transcriptomes and simulated doublets.	`integer`
`--sim_doublet_ratio`	Number of doublets to simulate relative to the number of observed transcriptomes.	`double`
`--min_counts`	The minimum number of UMI counts per cell that have to be present for initial cell detection.	`integer`, default: `2`
`--min_cells`	The minimum number of cells in which UMIs for a gene must be detected.	`integer`, default: `3`
`--min_gene_variablity_percent`	Used for gene filtering prior to PCA. Keep the most highly variable genes (in the top min_gene_variability_pctl percentile), as measured by the v-statistic [Klein et al., Cell 2015].	`double`, default: `85`
`--num_pca_components`	Number of principal components to use during PCA dimensionality reduction.	`integer`, default: `30`
`--distance_metric`	The distance metric used for computing similarities.	`string`, default: `"euclidean"`
`--allow_automatic_threshold_detection_fail`	When scrublet fails to automatically determine the doublet score threshold, allow the component to continue and set the output columns to NA.	`boolean_true`

Authors

Dries De Maeyer (contributor)
Robrecht Cannoodt (maintainer, contributor)