scvi-tools is a library for deep probabilistic analysis of single-cell and spatial omics data. It provides implementations of scVI (single-cell Variational Inference) and related models, and serves as a foundation for biological data analysis in the AI for Science domain. Its primary purpose is to address challenges in single-cell genomics, helping researchers extract structure from high-dimensional, noisy biological datasets.
The tool finds extensive application across various scientific fields, particularly in bioinformatics, systems biology, and immunology. It is instrumental in solving critical problems related to single-cell and spatial omics data processing and interpretation. For instance, in the realm of single-cell multi-omics integration, scvi-tools can effectively combine diverse data types, such as single-cell RNA sequencing (scRNA-seq) with single-cell ATAC sequencing (scATAC-seq), providing a holistic view of cellular states and regulatory landscapes, which is crucial for understanding heterogeneous tissues.
scvi-tools also supports multi-omics integration that includes spatial omics. It can integrate spatial transcriptomics and proteomics data, using spatial coordinates to inform joint models and to account for spatial autocorrelation, so that the spatial organization of cells within tissues is preserved. This capability matters for uncovering spatially resolved biological processes.
A significant application of scvi-tools is in batch correction and data normalization for single-cell RNA-seq datasets. It offers sophisticated methods, including the original scVI model, to remove technical variations and batch effects, allowing for accurate comparison across different experiments or samples. This is particularly valuable in systems immunology where comparing high-dimensional datasets from diverse sources or even cross-species data is common. The library also supports semi-supervised learning for tasks like label transfer through models such as scANVI, which balances data reconstruction with classification to effectively annotate cell types. Through its generative and inference models, scvi-tools provides a principled approach to handle the intrinsic noise and variability in single-cell data, offering amortized inference for efficient and scalable analysis.
The first step in analyzing single-cell RNA-sequencing data often involves dimensionality reduction and the correction of technical batch effects. This exercise covers the fundamental scvi-tools workflow: registering data with setup_anndata, initializing the SCVI model, training it, and retrieving the learned latent space. Mastering this core sequence is essential, as it forms the basis for nearly all subsequent analyses and applications within the scvi-tools ecosystem.
Problem: In the field of single-cell RNA sequencing (scRNA-seq) data analysis, two major challenges are the high dimensionality of the data (thousands of genes) and the presence of technical batch effects. The scvi-tools library provides a probabilistic framework to address these issues simultaneously using deep generative models.
Your task is to write a Python script that demonstrates the core workflow for using the Single-cell Variational Inference (SCVI) model to learn a low-dimensional, batch-corrected representation of scRNA-seq data.
The script must perform the following steps for a series of test cases:
- Data Generation: Create a synthetic AnnData object. This object must contain a raw count matrix (cells by genes) and a categorical column in its observation annotations (adata.obs) named 'batch' that assigns each cell to one of the batches.
- Data Setup: Use the scvi.data.setup_anndata function to register the AnnData object for use with the SCVI model. You must explicitly specify that the raw counts are to be used and indicate the column containing the batch information.
- Model Initialization: Initialize an scvi.model.SCVI model instance with the set-up AnnData object. Use the default hyperparameters for the model architecture.
- Model Training: Train the model on the data. To ensure the script executes quickly for testing purposes, you must set max_epochs=1 and train_size=1.0 in the train() method. All other training parameters should be left at their default values.
- Latent Representation Retrieval: After training, retrieve the latent representation for all cells. This is a matrix where each row corresponds to a cell and each column corresponds to a dimension in the learned latent space.
The goal is to verify that the model can be initialized, trained, and that it produces a latent representation of the expected dimensions.
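The five steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: it assumes a recent scvi-tools release in which registration goes through the classmethod scvi.model.SCVI.setup_anndata (older releases exposed scvi.data.setup_anndata, as named in the task), and the helper names assign_batches, make_synthetic_adata and latent_shape, the "counts" layer name, and the Poisson rate are all illustrative choices that do not come from the task text.

```python
def assign_batches(n_cells, n_batches):
    """Round-robin batch labels ('0', '1', ...) so every batch is populated."""
    return [str(i % n_batches) for i in range(n_cells)]


def make_synthetic_adata(n_cells, n_genes, n_batches, seed=0):
    """Build an AnnData with Poisson counts and a categorical 'batch' column."""
    # Third-party imports are kept local so the pure helper above has no dependencies.
    import anndata
    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(seed)
    # The Poisson rate of 5.0 is an arbitrary illustrative choice.
    counts = rng.poisson(lam=5.0, size=(n_cells, n_genes)).astype(np.float32)
    adata = anndata.AnnData(X=counts)
    adata.layers["counts"] = counts.copy()  # raw counts kept in an explicit layer
    adata.obs["batch"] = pd.Categorical(assign_batches(n_cells, n_batches))
    return adata


def latent_shape(n_cells, n_genes, n_batches):
    """Register, initialize, train for one epoch, and return the latent shape."""
    import scvi

    adata = make_synthetic_adata(n_cells, n_genes, n_batches)
    # Assumption: recent releases register data through the model class.
    scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
    model = scvi.model.SCVI(adata)  # default architecture
    model.train(max_epochs=1, train_size=1.0)
    latent = model.get_latent_representation()  # one row per cell
    return tuple(latent.shape)
```

A call such as print([latent_shape(50, 100, 2), latent_shape(100, 100, 2)]) would produce the final output line; the gene and batch counts in that call are hypothetical, since the test-case tuples are not spelled out above. The number of rows in each shape equals the number of cells.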
Test Suite
The program must run the following test cases, where each case is defined by the tuple :
- Test Case 1:
- Test Case 2:
For each test case, your script should generate the synthetic data, run the SCVI workflow, and determine the shape of the resulting latent representation matrix.
Final Output Format
The final line of standard output must be a single list containing the shapes of the retrieved latent representation matrices for all test cases. Each shape should be a tuple giving the number of rows followed by the number of columns. For example: [(50, 10), (100, 10)].
After learning a latent representation, a key downstream task is identifying genes that are differentially expressed between cell populations. This practice delves into model.differential_expression(), showcasing how scvi-tools uses probabilistic modeling and Bayes factors to perform robust differential expression analysis. You will specifically leverage the model's counterfactual reasoning ability to isolate biological differences from technical batch effects, a powerful feature of generative models.
Problem:
Context
In single-cell transcriptomics, observed gene expression is a combination of biological variation (e.g., cell type) and technical variation (e.g., experimental batch). Generative models like single-cell Variational Inference (scVI) learn a low-dimensional latent representation of the data, explicitly modeling and disentangling these factors. A powerful capability of such models is counterfactual reasoning: generating what a cell's gene expression would look like under different technical conditions. This is particularly useful for differential expression (DE) analysis, where we wish to compare biological states while controlling for technical confounders.
Problem Statement
You are given a dataset of synthetic single-cell RNA sequencing data, which contains cells from two distinct cell types (cell_type '0' and '1') processed across two different technical batches (batch '0' and '1'). An scVI model has been trained on this data, learning to correct for batch effects.
Your task is to perform differential expression analysis between cell types using the trained model's probabilistic hypothesis testing framework. To isolate the biological difference between cell types, you must utilize the model's counterfactual generation capability to simulate expression profiles for all cells as if they originated from a single, fixed technical batch.
Perform the following three differential expression tests using the model.differential_expression() method:
- Test Case 1 (Primary Comparison): Compare cell_type '0' (group 1) versus cell_type '1' (group 2), explicitly fixing the technical condition to batch '0' for both groups.
- Test Case 2 (Batch Consistency Check): Compare cell_type '0' (group 1) versus cell_type '1' (group 2), explicitly fixing the technical condition to batch '1' for both groups.
- Test Case 3 (Negative Control): Compare cell_type '0' (group 1) versus cell_type '0' (group 2), a self-comparison, explicitly fixing the technical condition to batch '0' for both groups.
For each test case, determine the number of genes that show significant differential expression. A gene is considered significantly differentially expressed if its resulting Bayes factor is greater than 3.0.
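The significance rule can be illustrated without any model. The sketch below assumes the Bayes factor is reported on a natural-log scale, ln(p / (1 - p)), where p is the posterior probability of the differential expression hypothesis. That scaling is an assumption to check against the installed scvi-tools version, and the function names and example probabilities are purely illustrative.

```python
import math


def log_bayes_factor(p_de, eps=1e-8):
    """Natural-log Bayes factor for a posterior probability of differential expression."""
    # eps guards against log(0) when p_de is exactly 0 or 1.
    return math.log(p_de + eps) - math.log(1.0 - p_de + eps)


def count_significant(bayes_factors, threshold=3.0):
    """Number of genes whose Bayes factor strictly exceeds the threshold."""
    return sum(1 for bf in bayes_factors if bf > threshold)


# Hypothetical posterior probabilities for five genes.
probabilities = [0.99, 0.96, 0.90, 0.50, 0.05]
factors = [log_bayes_factor(p) for p in probabilities]
print(count_significant(factors))  # prints 2: only 0.99 and 0.96 exceed ln-scale 3.0
```

On this scale a threshold of 3.0 corresponds to a posterior probability of roughly 0.95, which is why a probability of 0.90 does not pass.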
Input Data and Model
The solution must generate its own synthetic data and train the model. Use the following parameters for data generation with scvi.data.synthetic_iid() to ensure reproducibility:
- n_batches=2
- n_labels=2
- n_genes=50
- n_cells=200
- random_state=42
Train an scvi.model.SCVI model on this data with default parameters, after setting up the AnnData object with batch_key="batch" and labels_key="labels". Set a seed for reproducibility before training (e.g., scvi.settings.seed = 42).
Output Format
The final output of your program must be a single line containing a list of three integers, representing the count of significant genes for Test Case 1, Test Case 2, and Test Case 3, respectively. The format must be: [count1, count2, count3].
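One way the three comparisons might be wired together is sketched below. Several points are assumptions rather than confirmed details: that differential_expression accepts batchid1 and batchid2 to fix the batch of each group, that its result table has a bayes_factor column, and that synthetic_iid expresses the cell count as batch_size (cells per batch) and takes no random_state argument in the installed release. Label and batch names are read from the data rather than hard-coded, because the generated category names may differ from '0' and '1'. The helper names build_de_cases and count_de_genes are illustrative.

```python
def build_de_cases(labels, batches):
    """Keyword arguments for the three comparisons, given two labels and two batches."""
    first, second = labels[0], labels[1]
    return [
        # Test Case 1: biological comparison, both groups fixed to the first batch.
        dict(group1=first, group2=second, batchid1=[batches[0]], batchid2=[batches[0]]),
        # Test Case 2: same comparison, both groups fixed to the second batch.
        dict(group1=first, group2=second, batchid1=[batches[1]], batchid2=[batches[1]]),
        # Test Case 3: self-comparison used as a negative control.
        dict(group1=first, group2=first, batchid1=[batches[0]], batchid2=[batches[0]]),
    ]


def count_de_genes(threshold=3.0):
    """Train SCVI on synthetic data and count genes above the Bayes factor threshold."""
    import scvi  # imported lazily; heavy third-party dependency

    scvi.settings.seed = 42
    # Assumption: batch_size is cells per batch, so 2 batches x 100 = 200 cells.
    adata = scvi.data.synthetic_iid(batch_size=100, n_genes=50, n_batches=2, n_labels=2)
    scvi.model.SCVI.setup_anndata(adata, batch_key="batch", labels_key="labels")
    model = scvi.model.SCVI(adata)
    model.train()

    labels = sorted(adata.obs["labels"].astype(str).unique())
    batches = sorted(adata.obs["batch"].astype(str).unique())
    counts = []
    for kwargs in build_de_cases(labels, batches):
        result = model.differential_expression(
            groupby="labels", batch_correction=True, **kwargs
        )
        counts.append(int((result["bayes_factor"] > threshold).sum()))
    return counts
```

Ending the script with print(count_de_genes()) would emit the required single-line list. The negative control is a useful sanity check: a self-comparison would be expected to yield few or no significant genes.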
Annotating cell types is a critical but often laborious step, especially when only a subset of cells are labeled, which is where semi-supervised learning excels. This exercise demonstrates a transfer learning workflow, where a pre-trained unsupervised SCVI model is used to initialize a SCANVI model for cell type classification. This approach leverages the rich representation learned from all data while fine-tuning a classifier on the limited available labels, a cornerstone of modern single-cell analysis.
Problem: In the context of single-cell RNA sequencing (scRNA-seq) analysis, a common workflow involves two stages: first, learning a low-dimensional latent representation of the data in an unsupervised manner, and second, using this representation to build a classifier for cell types, often in a semi-supervised setting where only a subset of cells are labeled.
scvi-tools provides the SCVI (Single-cell Variational Inference) model for the first unsupervised stage and the SCANVI (Single-cell ANnotation using Variational Inference) model for the second semi-supervised stage. A powerful feature of SCANVI is its ability to be initialized from a pre-trained SCVI model. This process effectively transfers the learned parameters of the encoder and decoder networks from SCVI to SCANVI, treating the SCVI model's latent space as a foundational representation upon which a classification head is added.
Your task is to write a Python program that demonstrates this transfer learning process. The program must:
- Generate a synthetic scRNA-seq dataset containing both count data and cell type labels.
- Instantiate and train an unsupervised SCVI model on this dataset.
- Instantiate a semi-supervised SCANVI model by taking the pre-trained SCVI model as input. This new model should be configured to use the available cell type labels for the classification task.
- Train the SCANVI model.
- Verify the functionality of the trained SCANVI model by using it to predict the cell types of the cells in the dataset.
The program should return True if all steps complete successfully and the SCANVI model can produce predictions, and False otherwise.
Problem Statement Requirement
- Universal Applicability: The solution should be written in standard Python using the specified libraries.
- Test Suite & Answer Specification:
  - The program must generate its own synthetic data using scvi.data.synthetic_iid(). This serves as the single test case.
  - For computational efficiency within this test environment, you should set max_epochs=1 for the training of both the SCVI and SCANVI models.
  - The final answer must be a boolean value (True or False).
- Final Output Format: The last line of standard output must be a list containing the single boolean result, e.g., [True]. Any other output for debugging must be printed before this final line.
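The transfer-learning sequence described in this problem might look like the sketch below. It assumes SCANVI.from_scvi_model accepts an unlabeled_category and a labels_key, as in recent scvi-tools releases. The "Unknown" category name and the helper names run_safely and transfer_and_predict are illustrative choices, not part of the task text.

```python
def run_safely(step):
    """Return True if step() finishes and yields a non-empty result, else False."""
    try:
        result = step()
        return result is not None and len(result) > 0
    except Exception:
        # Any failure (import, setup, training, prediction) counts as False.
        return False


def transfer_and_predict():
    """Train SCVI, initialize SCANVI from it, train again, and predict labels."""
    import scvi  # imported lazily; heavy third-party dependency

    adata = scvi.data.synthetic_iid()
    scvi.model.SCVI.setup_anndata(adata, batch_key="batch", labels_key="labels")
    scvi_model = scvi.model.SCVI(adata)
    scvi_model.train(max_epochs=1)

    # "Unknown" would mark unlabeled cells; here every cell carries a label.
    scanvi_model = scvi.model.SCANVI.from_scvi_model(
        scvi_model, unlabeled_category="Unknown", labels_key="labels"
    )
    scanvi_model.train(max_epochs=1)
    return scanvi_model.predict()  # one predicted label per cell
```

Ending the script with print([run_safely(transfer_and_predict)]) would produce the required final line. Wrapping the whole pipeline in one guarded call keeps the True or False decision in a single place.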
Modern single-cell experiments often measure multiple data types from the same cell, such as RNA and surface proteins in CITE-seq, requiring integrated analysis methods. This practice introduces TOTALVI, a specialized model within scvi-tools designed for the joint analysis of paired RNA and protein expression data. You will learn the crucial steps of registering multimodal data and training the model to produce a unified latent representation, expanding your skills beyond single-modality analysis.
Problem: You are tasked with performing a joint dimensionality reduction analysis of single-cell RNA-sequencing and antibody-derived tag (protein) expression data, commonly known as CITE-seq. You will use the TOTALVI (Total Variational Inference) model provided by the scvi-tools library. The objective is to learn a joint low-dimensional latent representation of the cells that integrates information from both RNA and protein modalities.
Assume you are provided with a pre-loaded AnnData object, let's call it adata, which is structured as follows:
- adata.X: Contains the raw count matrix for RNA expression.
- adata.obsm["protein_counts"]: Contains the raw count matrix for protein expression.
- adata.obs["batch"]: Contains categorical batch information for each cell.
Your task is to write a Python program that accomplishes the following steps in sequence for a given AnnData object and a set of model hyperparameters:
1. Data Registration: Inform scvi-tools about the structure of your AnnData object. This is a crucial step for multimodal models. You must explicitly specify which key in obsm holds the protein data and which column in obs holds the batch indices.
2. Model Initialization: Instantiate a TOTALVI model. The model should be initialized with the registered adata object and a specified number of latent dimensions.
3. Model Training: Train the initialized model using variational inference for a specified number of epochs. To ensure the training process is deterministic and fast for testing purposes, you must configure the training to:
   - Use the entire dataset for training (i.e., no validation split).
   - Disable the periodic validation check.
   - Disable the training progress bar.
4. Latent Representation Extraction: After training, extract the joint latent representation for all cells. This is typically the posterior mean of the latent variables for each cell.
5. Result Verification: Return the shape of the extracted latent representation matrix.
To test your implementation, your program must include a helper function, create_synthetic_data(), that generates a small, standardized AnnData object. This function should:
- Set a fixed random seed (e.g., 42) for reproducibility using numpy.random.seed.
- Create an AnnData object with cells, genes, and proteins.
- Populate X (RNA) and obsm["protein_counts"] (protein) with random counts drawn from a Poisson distribution.
- Populate obs["batch"] with randomly assigned batch labels (e.g., "batch1", "batch2").
The main part of your program should:
- Set the global scvi seed to 42 for reproducible training.
- Define a list of test cases, where each case is a tuple of hyperparameters: (number of latent dimensions, number of training epochs). The test cases are: [(10, 2), (5, 1)].
- Iterate through each test case:
  a. Generate a fresh synthetic adata object using create_synthetic_data().
  b. Perform the sequence of steps 1-4 described above.
  c. Capture the shape of the resulting latent representation matrix as a tuple.
- Collect the resulting shapes from all test cases and print them as a list of tuples in the final line of standard output. For example: [(20, 10), (20, 5)].
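A possible shape for this program is sketched below, under several assumptions. The cell count of 20 is inferred from the example output, while the gene and protein counts and the Poisson rates are hypothetical defaults. The keyword names check_val_every_n_epoch, early_stopping and enable_progress_bar are assumed to be accepted by TOTALVI.train in the installed release. Early stopping is switched off here because, without a validation split, there would be no validation metric to monitor.

```python
def training_kwargs(n_epochs):
    """Options for a fast, validation-free TOTALVI run (assumed keyword names)."""
    return dict(
        max_epochs=n_epochs,
        train_size=1.0,                # use every cell for training
        check_val_every_n_epoch=None,  # no periodic validation
        early_stopping=False,          # assumption: nothing to monitor without validation data
        enable_progress_bar=False,     # assumption: forwarded to the underlying trainer
    )


def create_synthetic_data(n_cells=20, n_genes=50, n_proteins=5):
    """Small CITE-seq-like AnnData; gene and protein counts are hypothetical defaults."""
    import anndata
    import numpy as np

    np.random.seed(42)
    rna = np.random.poisson(5.0, size=(n_cells, n_genes)).astype(np.float32)
    adata = anndata.AnnData(X=rna)
    adata.obsm["protein_counts"] = np.random.poisson(
        20.0, size=(n_cells, n_proteins)
    ).astype(np.float32)
    adata.obs["batch"] = np.random.choice(["batch1", "batch2"], size=n_cells)
    return adata


def totalvi_latent_shape(n_latent, n_epochs):
    """Register, initialize, train, and return the joint latent shape."""
    import scvi  # imported lazily; heavy third-party dependency

    adata = create_synthetic_data()
    scvi.model.TOTALVI.setup_anndata(
        adata, protein_expression_obsm_key="protein_counts", batch_key="batch"
    )
    model = scvi.model.TOTALVI(adata, n_latent=n_latent)
    model.train(**training_kwargs(n_epochs))
    return tuple(model.get_latent_representation().shape)


def main():
    import scvi

    scvi.settings.seed = 42
    test_cases = [(10, 2), (5, 1)]  # (latent dimensions, epochs)
    print([totalvi_latent_shape(n_latent, n_epochs) for n_latent, n_epochs in test_cases])
```

Calling main() runs both test cases. Keeping the training options in one helper makes it easy to confirm that the three required settings (full training split, no validation check, no progress bar) are applied to every run.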
Tool Build Parameters
| Primary Language | Python (99.98%) |
| License | BSD-3-Clause |

