Filter with scrublet
Info
ID: filter_with_scrublet
Namespace: filter
Links
The method tests for potential doublets by using the expression profiles of cells to generate synthetic potential doubles which are tested against cells. The method returns a “doublet score” on which it calls for potential doublets.
For the source code please visit https://github.com/AllonKleinLab/scrublet.
For 10x we expect the doublet rates to be: Multiplet Rate (%) - # of Cells Loaded - # of Cells Recovered ~0.4% ~800 ~500 ~0.8% ~1,600 ~1,000 ~1.6% ~3,200 ~2,000 ~2.3% ~4,800 ~3,000 ~3.1% ~6,400 ~4,000 ~3.9% ~8,000 ~5,000 ~4.6% ~9,600 ~6,000 ~5.4% ~11,200 ~7,000 ~6.1% ~12,800 ~8,000 ~6.9% ~14,400 ~9,000 ~7.6% ~16,000 ~10,000
Example commands
You can run the pipeline using nextflow run.
View help
You can use --help as a parameter to get an overview of the possible parameters.
nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -main-script target/nextflow/filter/filter_with_scrublet/main.nf \
  --helpRun command
Example of params.yaml
# Arguments
input: # please fill in - example: "input.h5mu"
modality: "rna"
# layer: "foo"
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
obs_name_filter: "filter_with_scrublet"
do_subset: false
obs_name_doublet_score: "scrublet_doublet_score"
# expected_doublet_rate: 123.0
# stdev_doublet_rate: 123.0
# n_neighbors: 123
# sim_doublet_ratio: 123.0
min_counts: 2
min_cells: 3
min_gene_variablity_percent: 85.0
num_pca_components: 30
distance_metric: "euclidean"
allow_automatic_threshold_detection_fail: false
# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"nextflow run openpipelines-bio/openpipeline \
  -r 2.1.1 -latest \
  -profile docker \
  -main-script target/nextflow/filter/filter_with_scrublet/main.nf \
  -params-file params.yamlReplace -profile docker with -profile podman or -profile singularity depending on the desired backend.
Argument group
Arguments
| Name | Description | Attributes | 
|---|---|---|
| --input | Input h5mu file | file, required, example:"input.h5mu" | 
| --modality | string, default:"rna" | |
| --layer | Input layer to use as data for calculating doublets. .X is used not specified. | string | 
| --output | Output h5mu file. | file, example:"output.h5mu" | 
| --output_compression | The compression format to be used on the output h5mu object. | string, example:"gzip" | 
| --obs_name_filter | In which .obs slot to store a boolean array corresponding to which observations should be filtered out. | string, default:"filter_with_scrublet" | 
| --do_subset | Whether to subset before storing the output. | boolean_true | 
| --obs_name_doublet_score | Name of the doublet scores column in the obs slot of the returned object. | string, default:"scrublet_doublet_score" | 
| --expected_doublet_rate | The estimated fraction of doublets as from the experimental setup. | double | 
| --stdev_doublet_rate | Uncertainty in the expected doublet rate. | double | 
| --n_neighbors | Number of neighbors used to construct the KNN classifier of observed transcriptomes and simulated doublets. | integer | 
| --sim_doublet_ratio | Number of doublets to simulate relative to the number of observed transcriptomes. | double | 
| --min_counts | The number of minimal UMI counts per cell that have to be present for initial cell detection. | integer, default:2 | 
| --min_cells | The number of cells in which UMIs for a gene were detected. | integer, default:3 | 
| --min_gene_variablity_percent | Used for gene filtering prior to PCA. Keep the most highly variable genes (in the top min_gene_variability_pctl percentile), as measured by the v-statistic [Klein et al., Cell 2015]. | double, default:85 | 
| --num_pca_components | Number of principal components to use during PCA dimensionality reduction. | integer, default:30 | 
| --distance_metric | The distance metric used for computing similarities. | string, default:"euclidean" | 
| --allow_automatic_threshold_detection_fail | When scrublet fails to automatically determine the double score threshold, allow the component to continue and set the output columns to NA. | boolean_true |