scvi-tools

scvi-tools is an AI for Science-driven library for deep probabilistic analysis of single-cell and spatial omics data, empowering AI Agents to uncover complex biological insights.

1.6K stars, 443 forks, 26 watchers, updated 2026.03.17

GIS Raster/Vector Processing & Formats
Specialized Single-cell AI (Foundation/SSL)
Visualization/Atlas Browsing/Interactive Reporting
Joint Multi-omics Modeling (CITE/ATAC, etc.)
Workflow & Reproducibility (modules/pipelines)
Batch Correction/Integration/Cross-modality Alignment
Data Formats/Reading/Preprocessing & Quality Control
Clustering/Dimensionality Reduction/Trajectory/Lineage Inference
UQ: Bayesian/Ensemble/MC dropout etc.

SciencePedia AI Insight

scvi-tools packages deep probabilistic models for single-cell and spatial omics analysis behind a consistent, scriptable API. AI agents can call its built-in methods for batch correction, multi-omics integration, and cell type annotation directly, which makes it possible to automate complex biological data workflows.

INFRASTRUCTURE STATUS:

Docker Verified

Tutorials Available

Overview

scvi-tools is a library for the deep probabilistic analysis of single-cell and spatial omics data. It provides implementations of a family of scVI (single-cell Variational Inference) models and serves as a foundation for biological data analysis in the AI for Science domain. Its primary purpose is to address challenges in single-cell genomics, helping researchers extract insight from high-dimensional biological datasets.

The tool is applied mainly in bioinformatics, systems biology, and immunology, where it addresses the processing and interpretation of single-cell and spatial omics data. In single-cell multi-omics integration, for instance, scvi-tools can combine data types such as single-cell RNA sequencing (scRNA-seq) and single-cell ATAC sequencing (scATAC-seq), giving a joint view of cellular states and regulatory landscapes that helps in understanding heterogeneous tissues.

scvi-tools also extends multi-omics integration to spatial omics. It can integrate spatial transcriptomics and proteomics data, using spatial coordinates to inform joint models and accounting for spatial autocorrelation so that the spatial organization of cells within tissues is preserved. This capability matters for uncovering spatially resolved biological processes.

A significant application of scvi-tools is in batch correction and data normalization for single-cell RNA-seq datasets. It offers sophisticated methods, including the original scVI model, to remove technical variations and batch effects, allowing for accurate comparison across different experiments or samples. This is particularly valuable in systems immunology where comparing high-dimensional datasets from diverse sources or even cross-species data is common. The library also supports semi-supervised learning for tasks like label transfer through models such as scANVI, which balances data reconstruction with classification to effectively annotate cell types. Through its generative and inference models, scvi-tools provides a principled approach to handle the intrinsic noise and variability in single-cell data, offering amortized inference for efficient and scalable analysis.

Problem 1

The first step in analyzing single-cell RNA-sequencing data often involves dimensionality reduction and the correction of technical batch effects. This exercise covers the fundamental scvi-tools workflow: registering data with setup_anndata, initializing the SCVI model, training it, and retrieving the learned latent space. Mastering this core sequence is essential, as it forms the basis for nearly all subsequent analyses and applications within the scvi-tools ecosystem.

Problem: In the field of single-cell RNA sequencing (scRNA-seq) data analysis, two major challenges are the high dimensionality of the data (thousands of genes) and the presence of technical batch effects. The scvi-tools library provides a probabilistic framework to address these issues simultaneously using deep generative models.

Your task is to write a Python script that demonstrates the core workflow for using the Single-cell Variational Inference (SCVI) model to learn a low-dimensional, batch-corrected representation of scRNA-seq data.

The script must perform the following steps for a series of test cases:

Data Generation: Create a synthetic AnnData object. This object must contain a raw count matrix $X$ of shape $(N_{cells}, N_{genes})$ and a categorical column in its observation annotations (adata.obs) named 'batch' that assigns each cell to one of $N_{batches}$ batches.
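Data Setup: Register the AnnData object for use with the SCVI model by calling setup_anndata. In current scvi-tools releases this is the class method scvi.model.SCVI.setup_anndata; releases before 0.15 exposed it as the function scvi.data.setup_anndata. You must explicitly specify that the raw counts are to be used and indicate the column containing the batch information.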
Model Initialization: Initialize an scvi.model.SCVI model instance with the set-up AnnData object. Use the default hyperparameters for the model architecture.
Model Training: Train the model on the data. To ensure the script executes quickly for testing purposes, you must set max_epochs=1 and train_size=1.0 in the train() method. All other training parameters should be left at their default values.
Latent Representation Retrieval: After training, retrieve the latent representation for all cells. This is a matrix where each row corresponds to a cell and each column corresponds to a dimension in the learned latent space.

The goal is to verify that the model can be initialized, trained, and that it produces a latent representation of the expected dimensions.

Test Suite

The program must run the following test cases, where each case is defined by the tuple $(N_{cells}, N_{genes}, N_{batches})$ :

Test Case 1: $(50, 20, 2)$
Test Case 2: $(100, 50, 3)$

For each test case, your script should generate the synthetic data, run the SCVI workflow, and determine the shape of the resulting latent representation matrix.

Final Output Format

The final line of standard output must be a single list containing the shapes of the retrieved latent representation matrices for all test cases. Each shape should be a tuple $(R, C)$ , where $R$ is the number of rows and $C$ is the number of columns. For example: [(50, 10), (100, 10)].
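
The workflow above can be sketched as follows. This is a minimal sketch rather than a reference solution. It assumes a recent scvi-tools release in which registration is the class method scvi.model.SCVI.setup_anndata. The helper names batch_labels and latent_shape, the Poisson rate, the round-robin batch assignment, and the "counts" layer name are illustrative choices that do not come from the problem statement. Third-party imports sit inside latent_shape so the helpers can be defined without loading the heavy dependencies.

```python
TEST_CASES = [(50, 20, 2), (100, 50, 3)]  # (n_cells, n_genes, n_batches)


def batch_labels(n_cells, n_batches):
    """Assign cells to batches round-robin so every batch is non-empty."""
    return [str(i % n_batches) for i in range(n_cells)]


def latent_shape(n_cells, n_genes, n_batches, seed=0):
    """Build toy counts, train SCVI for one epoch, return the latent shape."""
    # Third-party imports are local so that defining the helpers stays cheap.
    import anndata
    import numpy as np
    import pandas as pd
    import scvi

    rng = np.random.default_rng(seed)
    counts = rng.poisson(5.0, size=(n_cells, n_genes)).astype(np.float32)
    obs = pd.DataFrame(
        {"batch": pd.Categorical(batch_labels(n_cells, n_batches))},
        index=[f"cell_{i}" for i in range(n_cells)],
    )
    adata = anndata.AnnData(X=counts, obs=obs)
    adata.layers["counts"] = adata.X.copy()  # keep raw counts in a named layer

    # Register the raw-count layer and the batch column with the model class.
    scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
    model = scvi.model.SCVI(adata)  # default architecture
    model.train(max_epochs=1, train_size=1.0)

    # One row per cell, one column per latent dimension.
    return tuple(model.get_latent_representation().shape)


def main():
    # Final line of output: the list of latent shapes, one per test case.
    print([latent_shape(*case) for case in TEST_CASES])
```

Calling main() runs both test cases. With SCVI's default of 10 latent dimensions, the printed list should match the example given in the output format above.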

Problem 2

After learning a latent representation, a key downstream task is identifying genes that are differentially expressed between cell populations. This practice delves into model.differential_expression(), showcasing how scvi-tools uses probabilistic modeling and Bayes factors to perform robust differential expression analysis. You will specifically leverage the model's counterfactual reasoning ability to isolate biological differences from technical batch effects, a powerful feature of generative models.

Problem:

Context

In single-cell transcriptomics, observed gene expression is a combination of biological variation (e.g., cell type) and technical variation (e.g., experimental batch). Generative models like single-cell Variational Inference (scVI) learn a low-dimensional latent representation of the data, explicitly modeling and disentangling these factors. A powerful capability of such models is counterfactual reasoning: generating what a cell's gene expression would look like under different technical conditions. This is particularly useful for differential expression (DE) analysis, where we wish to compare biological states while controlling for technical confounders.

Problem Statement

You are given a dataset of synthetic single-cell RNA sequencing data, which contains cells from two distinct cell types (cell_type '0' and '1') processed across two different technical batches (batch '0' and '1'). An scVI model has been trained on this data, learning to correct for batch effects.

Your task is to perform differential expression analysis between cell types using the trained model's probabilistic hypothesis testing framework. To isolate the biological difference between cell types, you must utilize the model's counterfactual generation capability to simulate expression profiles for all cells as if they originated from a single, fixed technical batch.

Perform the following three differential expression tests using the model.differential_expression() method:

Test Case 1 (Primary Comparison): Compare cell_type '0' (group 1) versus cell_type '1' (group 2), explicitly fixing the technical condition to batch '0' for both groups.
Test Case 2 (Batch Consistency Check): Compare cell_type '0' (group 1) versus cell_type '1' (group 2), explicitly fixing the technical condition to batch '1' for both groups.
Test Case 3 (Negative Control): Compare cell_type '0' (group 1) versus cell_type '0' (group 2) - a self-comparison - explicitly fixing the technical condition to batch '0' for both groups.

For each test case, determine the number of genes that show significant differential expression. A gene is considered significantly differentially expressed if its resulting Bayes factor is greater than 3.0.
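
To make the threshold concrete, it helps to translate it into a posterior probability. The sketch below assumes the Bayes factor is reported on the natural-log scale as the posterior log-odds of differential expression, log(p / (1 - p)); that convention is an assumption here and should be checked against the documentation of the installed scvi-tools version. The helper names are hypothetical.

```python
import math


def log_bayes_factor(p_de):
    """Natural-log posterior odds that a gene is differentially expressed."""
    return math.log(p_de) - math.log(1.0 - p_de)


def prob_from_log_bayes_factor(bf):
    """Invert the log-odds: posterior probability implied by a Bayes factor."""
    return 1.0 / (1.0 + math.exp(-bf))


# Posterior probability of DE that corresponds to a Bayes factor of 3.0.
threshold_prob = prob_from_log_bayes_factor(3.0)
print(round(threshold_prob, 4))  # 0.9526
```

Under this assumption, a cutoff of 3.0 keeps only genes whose posterior probability of differential expression exceeds roughly 95%.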

Input Data and Model

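The solution must generate its own synthetic data and train the model. Generate the data with scvi.data.synthetic_iid() using the following settings to ensure reproducibility:

n_batches=2
n_labels=2
n_genes=50
200 cells in total (in recent scvi-tools releases synthetic_iid sizes the data per batch through its batch_size argument, so with two batches this corresponds to batch_size=100)
a random seed of 42, fixed before the data is generated (synthetic_iid has no seed argument of its own in recent releases; seed NumPy instead, e.g. numpy.random.seed(42))

In recent releases the generated categories are named batch_0, batch_1 and label_0, label_1; these play the roles of batches '0' and '1' and cell types '0' and '1' in the test cases above.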

Train an scvi.model.SCVI model on this data with default parameters, after setting up the AnnData object with batch_key="batch" and labels_key="labels". Set a seed for reproducibility before training (e.g., scvi.settings.seed = 42).

Output Format

The final output of your program must be a single line containing a list of three integers, representing the count of significant genes for Test Case 1, Test Case 2, and Test Case 3, respectively. The format must be: [count1, count2, count3].
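
The three tests can be sketched as follows. This is a hedged sketch, not a reference solution: it assumes a recent scvi-tools release in which synthetic_iid takes a per-batch batch_size argument and draws from NumPy's global random state, and in which differential_expression accepts batch_correction, batchid1 and batchid2 and returns a table with a bayes_factor column. The helper names count_significant and run_de_tests are hypothetical. Category names are read from the data rather than hard-coded, since the generated labels may not be literally '0' and '1'.

```python
def count_significant(bayes_factors, threshold=3.0):
    """Count genes whose Bayes factor is strictly above the threshold."""
    return sum(1 for bf in bayes_factors if bf > threshold)


def run_de_tests(seed=42):
    """Train SCVI on synthetic data and run the three comparisons."""
    # Local imports: the libraries are only needed when this function runs.
    import numpy as np
    import scvi

    np.random.seed(seed)  # assumption: synthetic_iid uses NumPy's global RNG
    scvi.settings.seed = seed  # global seed used by scvi-tools for training

    # Assumption: batch_size is cells *per batch*, so 100 x 2 batches = 200.
    adata = scvi.data.synthetic_iid(
        batch_size=100, n_genes=50, n_batches=2, n_labels=2
    )
    scvi.model.SCVI.setup_anndata(adata, batch_key="batch", labels_key="labels")
    model = scvi.model.SCVI(adata)
    model.train()

    # Read category names from the data instead of hard-coding them.
    type_0, type_1 = sorted(adata.obs["labels"].unique())
    batch_0, batch_1 = sorted(adata.obs["batch"].unique())

    cases = [
        (type_0, type_1, batch_0),  # primary comparison
        (type_0, type_1, batch_1),  # batch consistency check
        (type_0, type_0, batch_0),  # negative control (self-comparison)
    ]
    counts = []
    for group1, group2, batch in cases:
        de = model.differential_expression(
            groupby="labels",
            group1=group1,
            group2=group2,
            batch_correction=True,  # condition both groups on a fixed batch
            batchid1=[batch],
            batchid2=[batch],
        )
        counts.append(count_significant(de["bayes_factor"]))
    return counts
```

Calling print(run_de_tests()) produces the required [count1, count2, count3] line. The actual counts depend on the trained model and the library version, so none are quoted here.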

Problem 3

Annotating cell types is a critical but often laborious step, especially when only a subset of cells are labeled, which is where semi-supervised learning excels. This exercise demonstrates a transfer learning workflow, where a pre-trained unsupervised SCVI model is used to initialize a SCANVI model for cell type classification. This approach leverages the rich representation learned from all data while fine-tuning a classifier on the limited available labels, a cornerstone of modern single-cell analysis.

Problem: In the context of single-cell RNA sequencing (scRNA-seq) analysis, a common workflow involves two stages: first, learning a low-dimensional latent representation of the data in an unsupervised manner, and second, using this representation to build a classifier for cell types, often in a semi-supervised setting where only a subset of cells are labeled.

scvi-tools provides the SCVI (Single-cell Variational Inference) model for the first unsupervised stage and the SCANVI (Single-cell ANnotation using Variational Inference) model for the second semi-supervised stage. A powerful feature of SCANVI is its ability to be initialized from a pre-trained SCVI model. This process effectively transfers the learned parameters of the encoder and decoder networks from SCVI to SCANVI, treating the SCVI model's latent space as a foundational representation upon which a classification head is added.

Your task is to write a Python program that demonstrates this transfer learning process. The program must:

Generate a synthetic scRNA-seq dataset containing both count data and cell type labels.
Instantiate and train an unsupervised SCVI model on this dataset.
Instantiate a semi-supervised SCANVI model by taking the pre-trained SCVI model as input. This new model should be configured to use the available cell type labels for the classification task.
Train the SCANVI model.
Verify the functionality of the trained SCANVI model by using it to predict the cell types of the cells in the dataset.

The program should return True if all steps complete successfully and the SCANVI model can produce predictions, and False otherwise.

Problem Statement Requirement

Universal Applicability: The solution should be written in standard Python using the specified libraries.
Test Suite & Answer Specification:
- The program must generate its own synthetic data using scvi.data.synthetic_iid(). This serves as the single test case.
- For computational efficiency within this test environment, you should set max_epochs=1 for the training of both the SCVI and SCANVI models.
- The final answer must be a boolean value (True or False).
Final Output Format: The last line of standard output must be a list containing the single boolean result, e.g., [True]. Any other output for debugging must be printed before this final line.
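
A minimal sketch of the transfer learning steps is given below. It assumes a recent scvi-tools release that provides scvi.model.SCANVI.from_scvi_model and in which synthetic_iid() yields "batch" and "labels" columns. The function name run_transfer_learning and the unlabeled category name "Unknown" are illustrative choices. Because the task asks for False on any failure, the function catches all exceptions, including a missing scvi-tools installation.

```python
def run_transfer_learning(max_epochs=1):
    """Return True if SCVI -> SCANVI training and prediction all succeed."""
    try:
        # Local import so a missing dependency is reported as a failed run.
        import scvi

        adata = scvi.data.synthetic_iid()  # counts plus "batch" and "labels"
        scvi.model.SCVI.setup_anndata(
            adata, batch_key="batch", labels_key="labels"
        )
        scvi_model = scvi.model.SCVI(adata)
        scvi_model.train(max_epochs=max_epochs)

        # Reuse the trained SCVI weights; "Unknown" marks unlabeled cells
        # (none in this toy data, where every cell carries a label).
        scanvi_model = scvi.model.SCANVI.from_scvi_model(
            scvi_model, unlabeled_category="Unknown", labels_key="labels"
        )
        scanvi_model.train(max_epochs=max_epochs)

        # One predicted label per cell is taken as evidence of success.
        predictions = scanvi_model.predict()
        return len(predictions) == adata.n_obs
    except Exception as err:  # the task asks for False on any failure
        print(f"transfer learning failed: {err!r}")
        return False


result = run_transfer_learning()
print([result])
```

In a realistic semi-supervised setting some cells would carry the unlabeled category; here all labels are kept to stay close to the problem statement.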

Problem 4

Modern single-cell experiments often measure multiple data types from the same cell, such as RNA and surface proteins in CITE-seq, requiring integrated analysis methods. This practice introduces TOTALVI, a specialized model within scvi-tools designed for the joint analysis of paired RNA and protein expression data. You will learn the crucial steps of registering multimodal data and training the model to produce a unified latent representation, expanding your skills beyond single-modality analysis.

Problem: You are tasked with performing a joint dimensionality reduction analysis of single-cell RNA-sequencing and antibody-derived tag (protein) expression data, commonly known as CITE-seq. You will use the TOTALVI (Total Variational Inference) model provided by the scvi-tools library. The objective is to learn a joint low-dimensional latent representation of the cells that integrates information from both RNA and protein modalities.

Assume you are provided with a pre-loaded AnnData object, let's call it adata, which is structured as follows:

adata.X: Contains the raw count matrix for RNA expression.
adata.obsm["protein_counts"]: Contains the raw count matrix for protein expression.
adata.obs["batch"]: Contains categorical batch information for each cell.

Your task is to write a Python program that accomplishes the following steps in sequence for a given AnnData object and a set of model hyperparameters:

1. Data Registration: Inform scvi-tools about the structure of your AnnData object. This is a crucial step for multimodal models. You must explicitly specify which key in obsm holds the protein data and which column in obs holds the batch indices.
2. Model Initialization: Instantiate a TOTALVI model. The model should be initialized with the registered adata object and a specified number of latent dimensions, $n_{latent}$ .
3. Model Training: Train the initialized model using variational inference for a specified number of epochs, $max\_epochs$ . To ensure the training process is deterministic and fast for testing purposes, you must configure the training to:
- Use the entire dataset for training (i.e., no validation split).
- Disable the periodic validation check.
- Disable the training progress bar.
4. Latent Representation Extraction: After training, extract the joint latent representation for all cells. This is typically the posterior mean of the latent variables $z$ for each cell.
5. Result Verification: Return the shape of the extracted latent representation matrix.

To test your implementation, your program must include a helper function, create_synthetic_data(), that generates a small, standardized AnnData object. This function should:

Set a fixed random seed (e.g., 42) for reproducibility using numpy.random.seed.
Create an AnnData object with $N_{cells} = 20$ cells, $N_{genes} = 10$ genes, and $N_{proteins} = 5$ proteins.
Populate X (RNA) and obsm["protein_counts"] (protein) with random counts drawn from a Poisson distribution.
Populate obs["batch"] with randomly assigned batch labels (e.g., "batch1", "batch2").

The main part of your program should:

1. Set the global scvi seed to 42 for reproducible training.
2. Define a list of test cases, where each case is a tuple of hyperparameters: $(n_{latent}, max\_epochs)$ . The test cases are: [(10, 2), (5, 1)].
3. Iterate through each test case:
a. Generate a fresh synthetic adata object using create_synthetic_data().
b. Perform the sequence of steps 1-4 described above.
c. Capture the shape of the resulting latent representation matrix as a tuple $(rows, columns)$ .
4. Collect the resulting shapes from all test cases and print them as a list of tuples in the final line of standard output. For example: [(20, 10), (20, 5)].
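
The program described above might look like the sketch below. It assumes a recent scvi-tools release in which TOTALVI.setup_anndata takes protein_expression_obsm_key and TOTALVI.train accepts the keyword arguments collected in TRAIN_KWARGS; the names TRAIN_KWARGS and totalvi_latent_shape and the Poisson rates are illustrative. The early_stopping and reduce_lr_on_plateau switches are not part of the problem statement. They are turned off here on the assumption that both rely on validation metrics, which do not exist when the whole dataset is used for training.

```python
TEST_CASES = [(10, 2), (5, 1)]  # (n_latent, max_epochs)

# Training options for a fast, quiet run without a validation split.
TRAIN_KWARGS = {
    "train_size": 1.0,  # use every cell for training
    "check_val_every_n_epoch": None,  # no periodic validation
    "early_stopping": False,  # nothing to monitor without validation data
    "reduce_lr_on_plateau": False,  # same reason as above
    "enable_progress_bar": False,  # assumed to reach the Lightning trainer
}


def create_synthetic_data(n_cells=20, n_genes=10, n_proteins=5, seed=42):
    """Small AnnData with Poisson RNA/protein counts and two batches."""
    import anndata
    import numpy as np

    np.random.seed(seed)
    adata = anndata.AnnData(
        X=np.random.poisson(5.0, size=(n_cells, n_genes)).astype(np.float32)
    )
    adata.obsm["protein_counts"] = np.random.poisson(
        20.0, size=(n_cells, n_proteins)
    ).astype(np.float32)
    adata.obs["batch"] = np.random.choice(["batch1", "batch2"], size=n_cells)
    return adata


def totalvi_latent_shape(n_latent, max_epochs):
    """Register, train and embed; returns (n_cells, n_latent)."""
    import scvi

    adata = create_synthetic_data()
    scvi.model.TOTALVI.setup_anndata(
        adata, protein_expression_obsm_key="protein_counts", batch_key="batch"
    )
    model = scvi.model.TOTALVI(adata, n_latent=n_latent)
    model.train(max_epochs=max_epochs, **TRAIN_KWARGS)
    return tuple(model.get_latent_representation().shape)


def main():
    import scvi

    scvi.settings.seed = 42  # global seed for reproducible training
    print([totalvi_latent_shape(*case) for case in TEST_CASES])
```

Calling main() runs both test cases. Each shape should be (20, n_latent), in line with the example in the final step above.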

Single-cell Multi-omics Integration

Systems Immunology and High-dimensional Analysis

Strategies for Multi-omics Data Integration

Tool Build Parameters

Primary Language: Python (99.98%)
License: BSD-3-Clause
