Umap

UMAP (Uniform Manifold Approximation and Projection) is a manifold learning technique suitable for visualizing high-dimensional data.

Info

ID: umap
Namespace: dimred

Besides tending to be faster than tSNE, it optimizes the embedding such that it best reflects the topology of the data, which we represent throughout Scanpy using a neighborhood graph. tSNE, by contrast, optimizes the distribution of nearest-neighbor distances in the embedding such that these best match the distribution of distances in the high-dimensional space. We use the implementation of umap-learn [McInnes18]. For a few comparisons of UMAP with tSNE, see this preprint

Example commands

You can run the pipeline using nextflow run.

View help

You can use --help as a parameter to get an overview of the possible parameters.

nextflow run openpipelines-bio/openpipeline \
  -r 1.0.1 -latest \
  -main-script target/nextflow/dimred/umap/main.nf \
  --help

Run command

Example of params.yaml
# Inputs
input: # please fill in - example: "input.h5mu"
modality: "rna"
uns_neighbors: "neighbors"

# Outputs
# output: "$id.$key.output.h5mu"
# output_compression: "gzip"
obsm_output: "umap"

# Arguments
min_dist: 0.5
spread: 1.0
num_components: 2
# max_iter: 123
alpha: 1.0
gamma: 1.0
negative_sample_rate: 5
init_pos: "spectral"

# Nextflow input-output arguments
publish_dir: # please fill in - example: "output/"
# param_list: "my_params.yaml"
nextflow run openpipelines-bio/openpipeline \
  -r 1.0.1 -latest \
  -profile docker \
  -main-script target/nextflow/dimred/umap/main.nf \
  -params-file params.yaml
Note

Replace -profile docker with -profile podman or -profile singularity depending on the desired backend.

Argument groups

Inputs

Name Description Attributes
--input Input h5mu file file, required, example: "input.h5mu"
--modality string, default: "rna"
--uns_neighbors The .uns neighbors slot as output by the find_neighbors component. string, default: "neighbors"

Outputs

Name Description Attributes
--output Output h5mu file. file, required, example: "output.h5mu"
--output_compression The compression format to be used on the output h5mu object. string, example: "gzip"
--obsm_output The pre/postfix under which to store the UMAP results. string, default: "umap"

Arguments

Name Description Attributes
--min_dist The effective minimum distance between embedded points. Smaller values will result in a more clustered/clumped embedding where nearby points on the manifold are drawn closer together, while larger values will result on a more even dispersal of points. The value should be set relative to the spread value, which determines the scale at which embedded points will be spread out. double, default: 0.5
--spread The effective scale of embedded points. In combination with min_dist this determines how clustered/clumped the embedded points are. double, default: 1
--num_components The number of dimensions of the embedding. integer, default: 2
--max_iter The number of iterations (epochs) of the optimization. Called n_epochs in the original UMAP. Default is set to 500 if neighbors[‘connectivities’].shape[0] <= 10000, else 200. integer
--alpha The initial learning rate for the embedding optimization. double, default: 1
--gamma Weighting applied to negative samples in low dimensional embedding optimization. Values higher than one will result in greater weight being given to negative samples. double, default: 1
--negative_sample_rate The number of negative edge/1-simplex samples to use per positive edge/1-simplex sample in optimizing the low dimensional embedding. integer, default: 5
--init_pos How to initialize the low dimensional embedding. Called init in the original UMAP. Options are: * Any key from .obsm * 'paga': positions from paga() * 'spectral': use a spectral embedding of the graph * 'random': assign initial embedding positions at random. string, default: "spectral"

Authors

  • Dries De Maeyer (maintainer)