Analysis for "K27M in canonical and noncanonical H3 variants occurs in distinct oligodendroglial cell lineages in brain midline gliomas" (Jessa et al, Nature Genetics, 2022)
Contents:
- The `code/scripts/` and `R-4/code/scripts` folders (the execution of these scripts is performed in the `data/scRNAseq`, `data/scATACseq`, `data/ChIPseq`, etc. folders, not included here).
- `code` and `R-4/code`, with the associated .md and rendered HTML files. The rendered HTML files can be viewed at https://fungenomics.github.io/HGG-oncohistones/ under the “Code to reproduce key analyses” section.

Brief explanation of the directory structure:
- `renv` –> renv-managed folder for R 3.6
- `renv.lock` –> lockfile containing all package versions for R 3.6 analysis
- `code` –> code for R 3.6 analysis, contains the .Rmd files that run the high-level analyses and produce figures included in the paper
  - `functions` –> contains .R files with custom functions used throughout the analysis
  - `scripts` –> contains .R and bash scripts for analyses that are repeated on individual samples, as well as helper scripts, e.g. for creating references
  - `infos_templates` –> contains example config files for scripts in the `scripts` folder
  - `seurat_v3_resources` –> this is a copy of the folder referenced by `SEURAT_V3_ASSETS` in certain bash scripts, or `file.path(params$assets, "resources")` in the various `preprocessing_*.Rmd` scripts. It contains resources used for single-cell preprocessing.
- `R-4` –> code for R 4.1 analysis (has a similar directory structure as the above main directory)
  - `code` –> contains .Rmd files, functions, and scripts for R 4.1 analysis
  - `renv` –> renv-managed folder for R 4.1
  - `renv.lock` –> lockfile containing all package versions for R 4.1 analysis
- `include` –> contains templates, palettes, etc, for this repository
- `rr_helpers.R` –> contains helper functions for working with this GitHub repository template (`rr`)

Code to reproduce analyses is saved in `code` and `R-4/code`.
(The reason two different R versions are used is given further below.) When these analyses depend on inputs from pipelines, I’ve tried to note within the R Markdown documents where these scripts/pipelines are located.
This table contains pointers to code for the key analyses associated with each figure. The links in the `Analysis` column lead to rendered HTMLs.
Figure | Analysis | Path |
---|---|---|
Fig 1 | Oncoprints summarizing tumor and cell line cohort | ./code/00-oncoprints.Rmd |
Ext Fig 1 | Summary figures for extended mouse brain scRNAseq atlas | ./code/05-mouse_atlas.Rmd |
Fig 1, Ext Fig 2 | cNMF analysis of variable gene programs | ./R-4/code/01-cNMF_programs.Rmd |
Fig 1 | Cell type identity in tumors with automated consensus projections | ./R-4/code/02-consensus_projections.Rmd |
Ext Fig 2 | Analysis of human fetal thalamus and hindbrain scRNAseq data | ./code/01A-human_thalamus.{Rmd,html} and ./code/01B-human_hindbrain.{Rmd,html} |
Ext Fig 2 | Validation of cell type projections using human thalamic fetal brain reference | ./R-4/code/02-consensus_projections.Rmd |
Ext Fig 3 | Characterization of malignant ependymal cells | ./code/01C-ependymal_cells.Rmd |
Fig 2, 4 | Scatterplots for RNAseq/K27ac/K27me3 between H3K27M tumor subtypes | ./code/02-bulk_comparisons.Rmd |
Fig 2, Ext Fig 4 | Systematic HOX analysis/quantification | ./code/03A-HOX.Rmd |
Fig 3 | Analysis of thalamic patterning | ./code/04-thalamus.Rmd |
Fig 4-6, Ext Fig 4-6 | Analysis of dorsal-ventral patterning and NKX6-1/PAX3 activation | ./code/03B-NKX61_PAX3.Rmd |
Fig 6 | Analysis of ACVR1 cell lines | ./code/06-ACVR1.Rmd |
Fig 7, 8, Ext Fig 7 | Analysis of histone marks in tumors & cell lines | ./code/07A-histone_marks.Rmd |
Ext Fig 8 | Comparison of tumor epigenomes to scChIP of normal cell types | ./R-4/code/03A-celltype_epigenomic_similarity.Rmd |
Fig 8, Ext Fig 9 | Heatmaps of H3K27me2/3 in CRISPR experiments | ./code/07D-deeptools_*.sh and ./code/07E-deeptools_*.sh |
Most color palettes (e.g. for tumor groups, genotypes, locations, cell types, HOX genes, etc) and ggplot2 theme elements (`theme_min()`, `no_legend()`, `rotate_x()`, etc) are defined in `include/style.R`.
Supplementary tables (included with the manuscript) and processed data tables (on Zenodo) were assembled from the following input/output/figure source data files. (Only tables produced with the code included here are listed below.)
Supplementary table | Path |
---|---|
6 | ./output/05/TABLE_mouse_sample_info.tsv |
7 | ./output/05/TABLE_mouse_cluster_info.tsv |
8 | ./R-4/output/02/TABLE_cNMF_programs_per_sample.tsv |
9 | ./R-4/output/02/cNMF_metaprogram_signatures.malignant_filt.tsv |
10 | ./R-4/output/02/TABLE_reference_cnmf_program_overlaps.tsv |
11 | ./output/01A/TABLE_thalamus_QC.tsv and ./output/01B/TABLE_hindbrain_QC.tsv |
12 | ./output/01A/info_clusters3.tsv and ./output/01B/info_clusters3.tsv |
13 | ./output/03A/TABLE_HOX_expression_per_transcript.tsv |
14 | ./output/03A/TABLE_HOX_H3K27ac_H3K27me3_per_transcript.tsv |
16 | ./figures/03B/enhancer_diff-1.source_data.tsv |
Processed data table | Path |
---|---|
1a | ./output/02/TABLE_bulk_counts.tsv |
1b | ./output/02/TABLE_dge_H3.1_vs_H3.3.tsv |
1c | ./output/02/TABLE_dge_thal_vs_pons.tsv |
2a | ./output/07A/TABLE_K27me3_CGIs.tsv |
2b | ./output/07A/TABLE_K27me2_100kb_bins.tsv |
3a | ./output/02/TABLE_promoter_H3K27ac_H3K27me3_per_sample.tsv |
This section describes the scripts used for preprocessing of single-cell data from this project. That includes: sn/scRNAseq, scATACseq, and scMultiome (joint RNA & ATAC in the same cells). This document refers to sn and scRNAseq generally as ‘scRNAseq’. Please see the sample metadata for the technology used to profile each sample. Please see the Methods section of the manuscript for more details on the single-cell profiling.
The pipeline for scRNAseq processing applied per-sample is summarized in this schematic. In general, `scripts` contain the code to run the analysis and `config` files contain the parameters or settings specific to a certain iteration of the analysis.
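The split between scripts and config files can be sketched in base R as follows. This is only an illustration, not code from this repository: the two-column `param`/`value` layout, the parameter names, and the `read_config()` and `run_analysis()` functions are all hypothetical.

```r
# Hypothetical config: one row per parameter (this layout is an assumption,
# not the format of the config files in this repository).
config_path <- tempfile(fileext = ".tsv")
writeLines(c("param\tvalue",
             "sample_id\tsample_A",
             "min_cells\t3",
             "seed\t100"),
           config_path)

# Read the config into a named list of character values.
read_config <- function(path) {
  tab <- read.delim(path, stringsAsFactors = FALSE, colClasses = "character")
  setNames(as.list(tab$value), tab$param)
}

# Hypothetical analysis step: the code stays fixed, only the config changes
# from one iteration of the analysis to the next.
run_analysis <- function(config) {
  set.seed(as.integer(config$seed))
  sprintf("Processing %s (min_cells = %d)",
          config$sample_id, as.integer(config$min_cells))
}

config <- read_config(config_path)
message(run_analysis(config))
```

Running the same script on another sample would then only require a different config file, which is the point of keeping the two apart.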
Following Cellranger, the scRNAseq samples have all been processed with the lab’s preprocessing workflow (`./code/scripts/scRNAseq_preprocessing.Rmd`). Each sample is then subject to several downstream analyses as described in the schematic above, with the associated scripts indicated.
The pipeline for scATACseq processing applied per-sample is summarized in this schematic:
Following Cellranger, preprocessing of the scATAC data is done with a script that builds off the scRNAseq workflow, at `./R-4/code/scripts/preprocessing_scATAC.Rmd`. This workflow is run in the scATAC pipeline at `./R-4/data/scATACseq/pipeline_10X_ATAC`, with one folder per sample. Each sample is then subject to several downstream analyses as described in the schematic above, run in that sample’s folder, with the associated scripts.
The pipeline for scMultiome processing applied per-sample is summarized in this schematic:
Following Cellranger, preprocessing of the scMultiome data is done with a script that builds off the scRNAseq workflow, at `./R-4/code/scripts/preprocessing_scMultiome.Rmd`. This workflow is run in the scMultiome pipeline at `./R-4/data/scMultiome/pipeline_10X_Multiome`, with one folder per sample. Each sample is then subject to several downstream analyses as described in the schematic above, run in that sample’s folder, with the associated scripts.
For scRNAseq, scATACseq and scMultiome samples, the cell metadata provided with the paper contains several columns matching the analyses used in the paper:
- `Cell_type_granular_mouse_correlations` –> cell-type projection to the extended mouse atlas, based on the Spearman correlation, using the cluster label (REGION-TIMEPOINT_CLUSTER)
- `Cell_type_mouse_correlations` –> cell-type projection to the extended mouse atlas, based on the Spearman correlation, summarized to a broader cell class (ontology is described in Table S7)
- `Cell_type_consensus_Jessa2022` –> consensus cell-type projection to the extended mouse atlas, based on agreement between Spearman correlation and at least one other cell-type projection method. Cells without a consensus are classified as “Uncertain”, see Methods for details
- `Malignant_normal_consensus_Jessa2022` –> assignment as normal or malignant, used to decide whether cells should be included in downstream analyses

The cell annotations/metadata are included in the processed data deposition on Zenodo and on GEO (GSE210568).
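To make the correlation-based and consensus columns more concrete, here is a toy base R sketch of the general idea: label each cell with the reference cluster it correlates with best (Spearman), then keep that label only if at least one other method agrees. Everything here is an assumption for illustration: the simulated expression values, the cluster names, and the two "other" methods (`method_B`, `method_C`) are hypothetical, and the actual procedure is the one described in the Methods.

```r
set.seed(1)

# Toy reference: mean expression of 50 genes in 3 hypothetical atlas clusters.
genes <- paste0("gene", 1:50)
reference <- matrix(rexp(150), nrow = 50,
                    dimnames = list(genes, c("OPC", "Astrocyte", "Neuron")))

# Toy query: 4 cells, each a noisy copy of one reference cluster.
truth <- c("OPC", "OPC", "Astrocyte", "Neuron")
query <- reference[, truth] + matrix(rnorm(200, sd = 0.05), nrow = 50)
colnames(query) <- paste0("cell", 1:4)

# Correlation-based projection: label each cell with the best-correlated cluster.
rho <- cor(query, reference, method = "spearman")  # cells x clusters
spearman_label <- colnames(rho)[max.col(rho, ties.method = "first")]

# Calls from two other (hypothetical) projection methods, one row per cell.
other_methods <- cbind(method_B = c("OPC", "Neuron", "Astrocyte", "OPC"),
                       method_C = c("OPC", "Astrocyte", "Astrocyte", "OPC"))

# Consensus: keep the Spearman label if at least one other method agrees,
# otherwise mark the cell as "Uncertain".
agrees <- rowSums(other_methods == spearman_label) >= 1
consensus <- ifelse(agrees, spearman_label, "Uncertain")

print(data.frame(cell = colnames(query), spearman_label, consensus))
```

In this toy data, `cell2` and `cell4` end up as “Uncertain” because neither hypothetical method matches their Spearman label.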
As described in the Methods, we used the harmony package for integration of single-cell datasets.
- `code/scripts/integrate_harmony.R`, which expects a config file `info.experiment.tsv` to be present (example at `code/infos_templates/harmony.info.experiment.tsv`)
- `R-4/data/integrations`, with one directory for each group of samples being integrated, and the config files within

Human fetal brain data for the hindbrain and thalamus were obtained from two studies, Eze et al, Nature Neuroscience, 2021, and Bhaduri et al, Nature, 2021.
- `./code/scripts/scRNAseq_preprocessing.Rmd`
- `./code/01A_2-human_thalamus.{Rmd,html}`
- `./code/01B-human_hindbrain.{Rmd,html}`

`rr` template & helpers

This repository uses the `rr` template, which contains a set of R markdown templates to help me ensure reproducibility. It also provides a set of helper functions (located in `rr_helpers.R` and prefixed by `rr_` in the function name) to help encourage documentation.
The R libraries for this project are managed with the package `renv`. The R versions used are 3.6.1 and 4.1.2, and `renv` manages one library for each R version.
The `renv` package:
- manages the `renv` folder (for R 3.6.1) or `R-4/renv` folder (for R 4.1.2) - the libraries themselves are not on GitHub
- provides the lockfiles `renv.lock` and `R-4/renv.lock`, which can be used to reproduce the R package environment.

The reason for using two different R versions is that certain analyses involving 10X Multiome data require versions of Seurat/Signac that depend on R > 4.
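As a rough sketch of how the two lockfiles might be used, the snippet below picks the lockfile matching the running R version and, optionally, restores it with `renv::restore()`. The helper `pick_lockfile()` and the 4.0.0 cutoff are assumptions made for illustration. The restore call is switched off by default because it installs packages, and the repository may expect it to be run from within each project directory.

```r
# Choose the lockfile matching an R version (paths as listed in this README).
# The function name and the 4.0.0 cutoff are assumptions for this sketch.
pick_lockfile <- function(r_version = getRversion()) {
  r_version <- as.numeric_version(r_version)
  if (r_version >= "4.0.0") "R-4/renv.lock" else "renv.lock"
}

lockfile <- pick_lockfile()
message("R ", as.character(getRversion()), " -> ", lockfile)

# Restoring installs packages, so it is left off by default in this sketch.
run_restore <- FALSE
if (run_restore && requireNamespace("renv", quietly = TRUE)) {
  renv::restore(lockfile = lockfile, prompt = FALSE)
}
```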
Each markdown/HTML file has a “Reproducibility report” at the bottom, indicating when the document was last rendered, the most recent git commit when it was rendered, the seed, and the R session info.
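A report with these four fields could be assembled in base R roughly as follows. This is a hedged sketch, not the `rr` template's implementation: `repro_report()` is a hypothetical name, and the git commit is reported as `NA` when the code is not run inside a git repository or git is unavailable.

```r
# Collect the fields listed above; the function name is hypothetical.
repro_report <- function(seed) {
  commit <- tryCatch(
    system2("git", c("rev-parse", "--short", "HEAD"),
            stdout = TRUE, stderr = FALSE),
    error = function(e) NA_character_,
    warning = function(w) NA_character_
  )
  if (length(commit) != 1) commit <- NA_character_
  list(
    rendered = format(Sys.time(), "%Y-%m-%d %H:%M:%S"),
    commit   = commit,  # NA outside a git repository
    seed     = seed,
    session  = sessionInfo()
  )
}

seed <- 100
set.seed(seed)
report <- repro_report(seed)

cat("Last rendered:", report$rendered, "\n")
cat("Git commit:   ", report$commit, "\n")
cat("Seed:         ", report$seed, "\n")
print(report$session)
```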
Lightweight testing is performed in certain cases (e.g. validating metadata) using the `ensurer` package, combined with the `testrmd` testing framework for R Markdown documents. Certain reusable `ensurer` contracts (reusable tests) are stored in `./code/functions/testing.R`.
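The idea of a reusable contract can be illustrated without extra dependencies. The sketch below uses plain base R rather than the `ensurer` API, so it shows the pattern only and is not the code in `testing.R`. The `make_contract()` helper, the contract names, and the metadata columns are hypothetical.

```r
# Build a reusable check: returns its input unchanged or stops with a message.
make_contract <- function(condition, description) {
  function(x) {
    if (!isTRUE(condition(x))) {
      stop("Contract failed: ", description, call. = FALSE)
    }
    invisible(x)
  }
}

# Hypothetical contracts for a sample metadata table.
ensure_unique_ids <- make_contract(
  function(df) anyDuplicated(df$sample_id) == 0,
  "sample_id values must be unique"
)
ensure_no_missing_group <- make_contract(
  function(df) !anyNA(df$group),
  "group must not contain missing values"
)

# Toy metadata (values are made up for the example).
meta <- data.frame(sample_id = c("S1", "S2", "S3"),
                   group = c("H3.1K27M", "H3.3K27M", "H3.3K27M"),
                   stringsAsFactors = FALSE)

# Valid metadata passes through both checks unchanged.
checked <- ensure_no_missing_group(ensure_unique_ids(meta))

# Invalid metadata (a duplicated row) stops with the contract's message.
bad <- rbind(meta, meta[1, ])
result <- tryCatch(ensure_unique_ids(bad),
                   error = function(e) conditionMessage(e))
print(result)
```

Because each contract returns its input, checks can be chained at the point where metadata is loaded, so a violation stops the document from rendering with bad inputs.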
The following are tracked / available on GitHub:
- `.Rmd` files, containing the code, and `.md` and rendered HTML files, containing code and outputs
- `figures`, when sufficiently small
- `desc` files for outputs, under `outputs`
- Lockfiles from the `renv` package

The following are not tracked / available on GitHub:
- Figures in `png`/`pdf` format, and some figure source data

If you use or modify code provided here, please cite this work as follows:
Selin Jessa, Steven Hébert, Samantha Worme, Hussein Lakkis, Maud Hulswit, Srinidhi Varadharajan, Nisha Kabir, and Claudia L. Kleinman. (2022). HGG-oncohistones analysis code. Zenodo. https://doi.org/10.5281/zenodo.6647837