Cellxgene Census Explorer

v20260403

cellxgene-census

Provides programmatic access to the CZ CELLxGENE Census (61M+ cells) for querying standardized single-cell expression, metadata, embeddings, and stats across tissues and diseases, supporting scanpy and ML workflows.

SingleCell Genomics Atlas API Python Bioinformatics LargeScale

Get Skill

274 downloads

Overview

CZ CELLxGENE Census

Overview

The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.

The Census includes:

61+ million cells from human and mouse
Standardized metadata (cell types, tissues, diseases, donors)
Raw gene expression matrices
Pre-calculated embeddings and statistics
Integration with PyTorch, scanpy, and other analysis tools

When to Use This Skill

This skill should be used when:

Querying single-cell expression data by cell type, tissue, or disease
Exploring available single-cell datasets and metadata
Training machine learning models on single-cell data
Performing large-scale cross-dataset analyses
Integrating Census data with scanpy or other analysis frameworks
Computing statistics across millions of cells
Accessing pre-calculated embeddings or model predictions

Installation and Setup

Install the Census API:

uv pip install cellxgene-census

For machine learning workflows, install additional dependencies:

uv pip install cellxgene-census[experimental]

Core Workflow Patterns

1. Opening the Census

Always use the context manager to ensure proper resource cleanup:

import cellxgene_census

# Open latest stable version
with cellxgene_census.open_soma() as census:
    # Work with census data

# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    # Work with census data

Key points:

Use context manager (with statement) for automatic cleanup
Specify census_version for reproducible analyses
Default opens latest "stable" release

2. Exploring Census Information

Before querying expression data, explore available datasets and metadata.

Access summary information:

# Get summary statistics
summary = census["census_info"]["summary"].read().concat().to_pandas()
print(f"Total cells: {summary['total_cell_count'][0]}")

# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()

# Filter datasets by criteria
covid_datasets = datasets[datasets["disease"].str.contains("COVID", na=False)]

Query cell metadata to understand available data:

# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["cell_type"]
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")

# Count cells by tissue
tissue_counts = cell_metadata.groupby("tissue_general").size()

Important: Always filter for is_primary_data == True to avoid counting duplicate cells unless specifically analyzing duplicates.

3. Querying Expression Data (Small to Medium Scale)

For queries returning < 100k cells that fit in memory, use get_anndata():

# Basic query with cell type and tissue filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",  # or "Mus musculus"
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
    obs_column_names=["assay", "disease", "sex", "donor_id"],
)

# Query specific genes with multiple filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
    obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
    obs_column_names=["cell_type", "tissue_general", "donor_id"],
)

Filter syntax:

Use obs_value_filter for cell filtering
Use var_value_filter for gene filtering
Combine conditions with and, or
Use in for multiple values: tissue in ['lung', 'liver']
Select only needed columns with obs_column_names

Getting metadata separately:

# Query cell metadata
cell_metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general", "donor_id"]
)

# Query gene metadata
gene_metadata = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["feature_id", "feature_name", "feature_length"]
)

4. Large-Scale Queries (Out-of-Core Processing)

For queries exceeding available RAM, use axis_query() with iterative processing:

import tiledbsoma as soma

# Create axis query
query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=soma.AxisQuery(
        value_filter="tissue_general == 'brain' and is_primary_data == True"
    ),
    var_query=soma.AxisQuery(
        value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
    )
)

# Iterate through expression matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
    # batch is a pyarrow.Table with columns:
    # - soma_data: expression value
    # - soma_dim_0: cell (obs) coordinate
    # - soma_dim_1: gene (var) coordinate
    process_batch(batch)

Computing incremental statistics:

# Example: Calculate mean expression
n_observations = 0
sum_values = 0.0

iterator = query.X("raw").tables()
for batch in iterator:
    values = batch["soma_data"].to_numpy()
    n_observations += len(values)
    sum_values += values.sum()

mean_expression = sum_values / n_observations

5. Machine Learning with PyTorch

For training models, use the experimental PyTorch integration:

from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    # Create dataloader
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Training loop
    for epoch in range(num_epochs):
        for batch in dataloader:
            X = batch["X"]  # Gene expression tensor
            labels = batch["obs"]["cell_type"]  # Cell type labels

            # Forward pass
            outputs = model(X)
            loss = criterion(outputs, labels)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

Train/test splitting:

from cellxgene_census.experimental.ml import ExperimentDataset

# Create dataset from experiment
dataset = ExperimentDataset(
    experiment_axis_query,
    layer_name="raw",
    obs_column_names=["cell_type"],
    batch_size=128,
)

# Split into train and test
train_dataset, test_dataset = dataset.random_split(
    split=[0.8, 0.2],
    seed=42
)

6. Integration with Scanpy

Seamlessly integrate Census data with scanpy workflows:

import scanpy as sc

# Load data from Census
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="cell_type == 'neuron' and tissue_general == 'cortex' and is_primary_data == True",
)

# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# Visualization
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])

7. Multi-Dataset Integration

Query and integrate multiple datasets:

# Strategy 1: Query multiple tissues separately
tissues = ["lung", "liver", "kidney"]
adatas = []

for tissue in tissues:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
    )
    adata.obs["tissue"] = tissue
    adatas.append(adata)

# Concatenate
combined = adatas[0].concatenate(adatas[1:])

# Strategy 2: Query multiple datasets directly
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
)

Key Concepts and Best Practices

Always Filter for Primary Data

Unless analyzing duplicates, always include is_primary_data == True in queries to avoid counting cells multiple times:

obs_value_filter="cell_type == 'B cell' and is_primary_data == True"

Specify Census Version for Reproducibility

Always specify the Census version in production analyses:

census = cellxgene_census.open_soma(census_version="2023-07-25")

Estimate Query Size Before Loading

For large queries, first check the number of cells to avoid memory issues:

# Get cell count
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["soma_joinid"]
)
n_cells = len(metadata)
print(f"Query will return {n_cells:,} cells")

# If too large (>100k), use out-of-core processing

Use tissue_general for Broader Groupings

The tissue_general field provides coarser categories than tissue, useful for cross-tissue analyses:

# Broader grouping
obs_value_filter="tissue_general == 'immune system'"

# Specific tissue
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"

Select Only Needed Columns

Minimize data transfer by specifying only required metadata columns:

obs_column_names=["cell_type", "tissue_general", "disease"]  # Not all columns

Check Dataset Presence for Gene-Specific Queries

When analyzing specific genes, verify which datasets measured them:

presence = cellxgene_census.get_presence_matrix(
    census,
    "homo_sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A']"
)

Two-Step Workflow: Explore Then Query

First explore metadata to understand available data, then query expression:

# Step 1: Explore what's available
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general"]
)
print(metadata.value_counts())

# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)

Available Metadata Fields

Cell Metadata (obs)

Key fields for filtering:

cell_type, cell_type_ontology_term_id
tissue, tissue_general, tissue_ontology_term_id
disease, disease_ontology_term_id
assay, assay_ontology_term_id
donor_id, sex, self_reported_ethnicity
development_stage, development_stage_ontology_term_id
dataset_id
is_primary_data (Boolean: True = unique cell)

Gene Metadata (var)

feature_id (Ensembl gene ID, e.g., "ENSG00000161798")
feature_name (Gene symbol, e.g., "FOXP2")
feature_length (Gene length in base pairs)

Reference Documentation

This skill includes detailed reference documentation:

references/census_schema.md

Comprehensive documentation of:

Census data structure and organization
All available metadata fields
Value filter syntax and operators
SOMA object types
Data inclusion criteria

When to read: When you need detailed schema information, full list of metadata fields, or complex filter syntax.

references/common_patterns.md

Examples and patterns for:

Exploratory queries (metadata only)
Small-to-medium queries (AnnData)
Large queries (out-of-core processing)
PyTorch integration
Scanpy integration workflows
Multi-dataset integration
Best practices and common pitfalls

When to read: When implementing specific query patterns, looking for code examples, or troubleshooting common issues.

Common Use Cases

Use Case 1: Explore Cell Types in a Tissue

with cellxgene_census.open_soma() as census:
    cells = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["cell_type"]
    )
    print(cells["cell_type"].value_counts())

Use Case 2: Query Marker Gene Expression

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
        obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
    )

Use Case 3: Train Cell Type Classifier

from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Train model
    for epoch in range(epochs):
        for batch in dataloader:
            # Training logic
            pass

Use Case 4: Cross-Tissue Analysis

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
    )

    # Analyze macrophage differences across tissues
    sc.tl.rank_genes_groups(adata, groupby="tissue_general")

Troubleshooting

Query Returns Too Many Cells

Add more specific filters to reduce scope
Use tissue instead of tissue_general for finer granularity
Filter by specific dataset_id if known
Switch to out-of-core processing for large queries

Memory Errors

Reduce query scope with more restrictive filters
Select fewer genes with var_value_filter
Use out-of-core processing with axis_query()
Process data in batches

Duplicate Cells in Results

Always include is_primary_data == True in filters
Check if intentionally querying across multiple datasets

Gene Not Found

Verify gene name spelling (case-sensitive)
Try Ensembl ID with feature_id instead of feature_name
Check dataset presence matrix to see if gene was measured
Some genes may have been filtered during Census construction

Version Inconsistencies

Always specify census_version explicitly
Use same version across all analyses
Check release notes for version-specific changes

Info

Category Data Science

Name cellxgene-census

Version v20260403

Size 10.7KB

Source K-Dense-AI/claude-scientific-skills

Updated At 2026-04-17