Single cell ATAC

Single-Cell Chromatin Accessibility Assays

Motivation

The diploid human genome contains roughly 6.4 billion base pairs, which at about 0.34 nm per base pair amounts to roughly 2 meters of linearized DNA per nucleus. To maintain nuclear integrity and limit the search space for DNA-binding proteins, cells compact their genomes such that only 1-4% is accessible at a given time¹. This is achieved by coiling DNA around histone proteins and condensing the resulting nucleosomes into larger macromolecular structures. The regions that remain accessible are highly enriched in genomic content relevant to regulating transcription and defining cell type and state, including gene promoters and enhancers². A single-cell approach to measuring chromatin accessibility shares the two previously stated benefits of single-cell analysis over bulk approaches. Namely, i) single-cell approaches allow for the unbiased interrogation of multiple cell types within a complex tissue sample, and ii) single-cell approaches resolve chromatin reconfiguration in actively differentiating systems at higher resolution than bulk assays. Most variants uncovered in genome-wide association studies (GWAS) of neurodevelopmental disorders are located in non-coding regions, underscoring the importance of assessing non-coding regulatory elements³. Using a single-cell chromatin accessibility assay, we are able to uncover the cell types in which these disease-associated non-coding regions are accessible⁴. By tracking accessible sites in single cells, one can infer the activity of transcription factors, track the opening of enhancers, and infer their recruitment to promoter regions.

Method

Several strategies have been developed to catalog the small sections of the genome that are accessible in each cell. All share the common through-line of exploiting the greater susceptibility of exposed DNA to insult compared with compacted DNA: the more vulnerable DNA is fragmented, and the resulting fragments are captured for sequencing library preparation. DNase-based methods were until recently the most prominent, being used in the Encyclopedia of DNA Elements (ENCODE) project. In this approach, the genome, while in its native state, is treated with DNase I, a protein that can digest both single- and double-stranded DNA (Figure 4). Enzymes are limited by their protein footprint, so condensed heterochromatic regions are sterically protected from enzymatic action. The fragmented DNA can then be captured and sequencing adapters appended for massively parallel sequencing (Figure 4a)¹. While DNase approaches have been adapted to a single-cell format, the method remains difficult to titrate: changes in DNase I concentration or incubation time greatly affect library quality⁵. An alternative method is THS-seq, wherein a hyperactive transposase (the protein Tn5) is loaded with a bacteriophage T7 promoter sequence and inserts it into regions accessible to Tn5 (tagmentation), again relying on steric hindrance to select for open regions (Figure 4b). Tagmented DNA is isolated, and in vitro transcription is used to amplify regions via the added T7 promoter. RNA intermediates reflecting the open regions of chromatin are then reverse transcribed and sequencing adapters are added⁶.

By far the most widely used assay for accessible chromatin is ATAC-seq (Assay for Transposase Accessible Chromatin using sequencing)⁷. This method uses the same Tn5 protein used in THS-seq, but with a simplified workflow. The Tn5 enzyme is, as previously mentioned, sterically limited to regions of open chromatin. ATAC-seq involves loading the enzyme with the adapters necessary for PCR and sequencing. At open regions, the Tn5 enzyme both fragments the genomic DNA and appends the PCR adapters in the same reaction (Figure 4c). The excised accessible DNA can then be amplified selectively by using complementary primers. This assay is far more efficient at capturing open genomic regions than the other approaches and has been adapted to optimize cell isolation and tagmentation conditions⁸. Recently, commercialized versions of single-cell ATAC were made available that encapsulate cells or nuclei within microfluidic droplets⁹,¹⁰. Another means of single-cell ATAC, popularized by us and others, is sci-ATAC, which uses the aforementioned split-pool barcoding approach (combinatorial indexing) (Figure 4d)⁴,¹¹.

**Figure 4.** Single-cell chromatin accessibility assays. a) Single-cell DNase-seq uses DNase I (purple) to digest open chromatin; adapters are subsequently added. b) Single-cell transposome hypersensitive site sequencing (scTHS-seq) uses a transposase (green) to introduce a bacteriophage T7 promoter sequence into open chromatin and amplifies DNA through an RNA intermediate via in vitro transcription (orange). c) Single-cell assay for transposase accessible chromatin (scATAC) uses two species of transposase to introduce i5 and i7 adapter sequences. Cells are then encapsulated in droplets with an oligonucleotide-coated gel bead to uniquely index each cell. d) Single-cell combinatorial indexing assay for transposase accessible chromatin (sci-ATAC) uses two species of transposase (purple and orange) to introduce two adapters directly into open chromatin. e) Symmetrical strand sci-ATAC uses a single species of transposase and a subsequent adapter switching strategy to amplify open regions, further detailed in Chapter 2. Labeled DNA oligonucleotides are colored consistently across panels.

Combinatorial indexing is performed through the addition of indexes at both the multiplexed tagmentation stage and the final PCR stage. This method is beneficial in that the number of assayable cells per preparation scales with the number of index combinations, which grows multiplicatively with the indexes used at each stage. sci-ATAC libraries have uncovered a trove of regulatory information; however, the number of captured fragments is inherently limited. To be successfully captured in PCR, a fragment must carry both an i7 and an i5 adapter. This means that ~50% of fragments are lost as i5-i5 or i7-i7 tagmentations (Figure 5). I describe a correction to this strategy through the use of a single Tn5 species and an adapter switching strategy, named symmetrical strand sci (“s3”, Figure 4e). As the above summary shows, all protocols share the general goals of fragmenting and capturing unprotected genomic regions. Consequently, the information gathered by all assays is similar in that it is essentially a count of captured genomic fragments aligned to a reference genome.

**Figure 5.** Tagmentation with two separate adapter-loaded Tn5 species incurs a loss in efficiency. In a captured molecule, i5 and i7 adapters must be added in the proper orientation for PCR. i5-i5 (top left) and i7-i7 (bottom right) tagmentations are not sequenceable, despite the genomic regions being open.
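The arithmetic behind combinatorial indexing and the adapter-pairing loss can be sketched in a few lines of Python. This is a minimal illustration rather than a description of any specific protocol: the function names, the 96 x 96 index layout, and the assumptions that nuclei are distributed uniformly across tagmentation indexes and that each fragment end receives an i5 or i7 adapter independently with equal probability are all hypothetical choices made for the example.

```python
from itertools import product


def barcode_space(n_tagmentation_indexes: int, n_pcr_indexes: int) -> int:
    """Number of distinct cell identifiers from two rounds of indexing."""
    return n_tagmentation_indexes * n_pcr_indexes


def collision_probability(nuclei_per_pcr_well: int, n_tagmentation_indexes: int) -> float:
    """Probability that a nucleus shares its tagmentation index with at least
    one other nucleus in the same PCR well (and so the same full identifier).

    Assumes every nucleus carries a uniformly random tagmentation index.
    """
    others = nuclei_per_pcr_well - 1
    return 1.0 - (1.0 - 1.0 / n_tagmentation_indexes) ** others


def amplifiable_fraction(p_i5: float = 0.5) -> float:
    """Fraction of fragments whose two ends carry different adapters.

    Each end independently receives i5 with probability p_i5, else i7.
    Only i5-i7 and i7-i5 fragments can be amplified with an i5/i7 primer pair.
    """
    ends = {"i5": p_i5, "i7": 1.0 - p_i5}
    return sum(
        ends[left] * ends[right]
        for left, right in product(ends, repeat=2)
        if left != right
    )


# Hypothetical layout: 96 tagmentation indexes x 96 PCR indexes.
n_identifiers = barcode_space(96, 96)
# Loading fewer nuclei per PCR well lowers the chance of two nuclei sharing an identifier.
sparse_loading = collision_probability(15, 96)
dense_loading = collision_probability(100, 96)
# With equal amounts of i5- and i7-loaded Tn5, half of all fragments are usable.
usable = amplifiable_fraction(0.5)
```

Under these assumptions the usable fraction peaks at one half when the two transposase species are balanced, which is the ~50% loss shown in Figure 5; any imbalance between the two species lowers it further.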

Analysis

Single-cell chromatin accessibility data is count data. Single-cell ATAC-seq methods are by far the most widely used and their analysis will be detailed below; however, similar analysis can be performed with any of the alternative protocols listed above. Genomic DNA fragments captured in the assay are sequenced and aligned to a reference genome by an alignment algorithm (Figure 6)¹²,¹³. Reads which overlap in alignment (“pile-ups”) are used to define discretized regions of open chromatin, essentially assuming that the chromatin accessibility protocol biases sequence capture toward unprotected regions. The calling of discrete open chromatin regions, or peaks, is done with a peak-calling algorithm such as MACS2¹⁴, which models the expected read count across the mappable genome with a Poisson distribution and scans the genome in sliding windows. If a region contains significantly more reads than expected under this null model, a peak is called and an open region of the genome is uncovered¹⁴. These peaks are then used to “bin” the genome into sites with evidence of accessibility. Peak calling is applied to the full data set, pooled across cells, since any given diploid cell can have at most four captured reads at a given base (two copies of each top and bottom DNA strands). Once peaks are called on the entire data set, cell identity is mapped back to individual reads via the cell identifier (the unique combination of indexes) to generate a sparse cell x peak matrix⁴,¹¹, populated by the number of reads per cell aligning to an open region.
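The two steps above, calling peaks on pooled reads and then mapping reads back to cells, can be illustrated with a toy sketch. This is not MACS2's implementation: it uses fixed, non-overlapping windows, a single genome-wide Poisson rate, and an arbitrary significance cutoff, and the function names, read positions and cell identifiers are invented for the example.

```python
import math
from collections import Counter


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


def call_peaks(positions, genome_length, window=200, alpha=1e-3):
    """Return windows whose pooled read count exceeds the Poisson expectation.

    The null rate is the genome-wide average number of reads per window.
    """
    lam = len(positions) * window / genome_length
    counts = Counter(pos // window for pos in positions)
    return [
        (w * window, (w + 1) * window)
        for w in sorted(counts)
        if poisson_upper_tail(counts[w], lam) < alpha
    ]


def cell_by_peak_counts(reads, peaks):
    """Build a sparse cell x peak matrix as {(cell_id, peak_index): count}."""
    matrix = Counter()
    for cell_id, pos in reads:
        for j, (start, end) in enumerate(peaks):
            if start <= pos < end:
                matrix[(cell_id, j)] += 1
                break
    return matrix


# Toy data: (cell identifier, aligned read position) on a 10 kb "genome".
reads = [
    ("cellA", 1010), ("cellA", 1050), ("cellB", 1100), ("cellC", 1150),
    ("cellB", 1190), ("cellA", 5020), ("cellC", 5080), ("cellC", 5120),
    ("cellB", 5150), ("cellA", 300), ("cellB", 7400), ("cellC", 9100),
]

# Peaks are called on all reads pooled together, ignoring cell identity.
peaks = call_peaks([pos for _, pos in reads], genome_length=10_000)
# Cell identity is restored only when counting reads per peak.
matrix = cell_by_peak_counts(reads, peaks)
```

Only non-zero entries are stored, which mirrors the sparsity of real data, where most cell and peak pairs have no reads.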

Cells are then grouped together based on similarity of peak coverage to overcome single-cell data sparsity. Natural language processing approaches like latent semantic indexing (LSI) apply a weighting scheme in which peaks accessible in many cells are decreased in importance¹⁵. Alternatively, machine-learning approaches such as latent Dirichlet allocation (LDA) are used to generate “topics”, or groups of peaks commonly seen together within the data. From there the cell x peak matrix is reduced from hundreds of thousands of peaks to a couple of dozen topics, where the number of topics scales with the complexity of the data set. This both addresses the data sparsity of single cells and captures biological information within peaks, wherein shared open sites tend to be enriched in common transcription factor motifs or linked to biological ontology¹⁶. Following dimensionality reduction, cells are grouped together based on their shared topic weighting by Louvain-based clustering algorithms¹⁷. For visualization, cells are projected into two-dimensional space via a machine learning algorithm like uniform manifold approximation and projection (UMAP) or t-distributed stochastic neighbor embedding (t-SNE)¹⁸.
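As a rough illustration of the weighting step in LSI, the sketch below applies a term frequency-inverse document frequency (TF-IDF) transform to a small cell x peak matrix, treating cells as documents and peaks as terms. The exact formula varies between tools; the log(1 + N/df) form used here is one common variant chosen for the example, and the subsequent singular value decomposition that completes LSI is omitted. The function name and toy counts are hypothetical.

```python
import math


def tf_idf(counts):
    """Weight a cell x peak count matrix (list of rows) by TF-IDF.

    Counts are binarized first, so each peak is either accessible or not
    in a given cell.
    """
    n_cells = len(counts)
    n_peaks = len(counts[0])
    binary = [[1 if c > 0 else 0 for c in row] for row in counts]

    # Document frequency: number of cells in which each peak is accessible.
    doc_freq = [sum(row[j] for row in binary) for j in range(n_peaks)]
    # Peaks open in many cells receive a smaller inverse document frequency.
    idf = [math.log(1 + n_cells / df) if df else 0.0 for df in doc_freq]

    weighted = []
    for row in binary:
        total = sum(row)  # number of accessible peaks in this cell
        weighted.append(
            [(x / total) * idf[j] if total else 0.0 for j, x in enumerate(row)]
        )
    return weighted


# Toy matrix: 3 cells x 3 peaks. Peak 0 is open in every cell,
# peaks 1 and 2 are each open in a single cell.
counts = [
    [1, 1, 0],
    [2, 0, 0],
    [1, 0, 3],
]
weights = tf_idf(counts)
```

In the first cell, the cell-specific peak 1 ends up with a larger weight than the ubiquitous peak 0, which is the down-weighting of commonly used peaks described above.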

**Figure 6.** Workflow of single-cell ATAC-seq data analysis. Reads are generated through sequencing, aligned to a reference, de-duplicated and filtered based on quality control metrics; read pile-ups along the genome are called as peaks; then a counts matrix of cell identifier by read count per peak is generated. This counts matrix is then reduced in dimensionality, clustered and projected into 2D space. From there, cluster aggregates (all cells combined within a cluster) provide sufficient power for differential accessibility analysis, and can be used for trajectory analysis, transcription factor motif usage and the assessment of cis-co-accessible networks for promoter-enhancer interactions.

Following the unbiased clustering of cells, differences in peak usage between clusters are assessed by logistic regression tests. Additionally, the activity of transcription factors can be inferred per cell, based on the accessibility of transcription factor-specific DNA binding motifs. If each peak with reads for a cell is binarized, transcription factor activity can be inferred from the overrepresentation of motifs present in open sites¹⁹,²⁰. Given that enhancers and promoters are recruited in concert to drive transcription, the accessibility of both should co-occur if a site is acting in an enhancer-like function. To assess this agnostically within a data set, we look for the co-occurrence of accessibility in local enhancers linked to a peak region overlapping a known promoter. Cis-co-accessible networks (CCANs) are anchored at the promoter peak and generated through correlation to other accessible nearby peaks across cells with sufficient coverage. This network of enhancers and gene promoters correlates better with gene transcription than either promoter accessibility alone or average gene body accessibility²¹,²². This is possible through the statistical power generated by the many independent samples produced in a single-cell library preparation. In order to leverage single-cell data to assess cell differentiation or epigenomic shifts, we can order cells in reduced dimensionality space and calculate a minimum spanning tree, or L1-graph, which traverses across cells, minimizing the residual distance from the tree. This ordering of cells allows us to infer programmatic shifts in the epigenome during cell state shifts²³.
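The idea of anchoring a co-accessibility network at a promoter peak can be sketched with a plain Pearson correlation over binarized accessibility. This is a deliberately simplified illustration: published tools typically add cell aggregation and regularization steps that are omitted here, and the function names, the 500 kb distance limit, the correlation cutoff and the toy data are all assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector carries no co-accessibility signal
    return cov / math.sqrt(var_x * var_y)


def coaccessible_peaks(promoter, midpoints, accessibility,
                       max_distance=500_000, min_corr=0.5):
    """Return {peak: correlation} for nearby peaks co-accessible with a promoter.

    midpoints:     {peak name: genomic midpoint on the same chromosome}
    accessibility: {peak name: list of 0/1 values, one per cell}
    """
    links = {}
    for name, midpoint in midpoints.items():
        if name == promoter:
            continue
        # Restrict to local (cis) candidates around the promoter.
        if abs(midpoint - midpoints[promoter]) > max_distance:
            continue
        r = pearson(accessibility[promoter], accessibility[name])
        if r >= min_corr:
            links[name] = r
    return links


# Toy data: four peaks scored across eight cells.
midpoints = {
    "promoter": 1_000_000,
    "enhancer_1": 1_050_000,
    "enhancer_2": 1_200_000,
    "distal": 5_000_000,
}
accessibility = {
    "promoter":   [1, 1, 1, 0, 0, 0, 1, 0],
    "enhancer_1": [1, 1, 1, 0, 0, 0, 0, 0],
    "enhancer_2": [0, 1, 0, 1, 0, 1, 0, 1],
    "distal":     [1, 1, 1, 0, 0, 0, 1, 0],
}
links = coaccessible_peaks("promoter", midpoints, accessibility)
```

Here only `enhancer_1` is linked to the promoter: `enhancer_2` is nearby but anti-correlated, and `distal` is perfectly correlated but falls outside the distance limit, which shows why both proximity and co-occurrence are required.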

In recent work, whole-organism atlases have been generated on human and mouse development⁴,²⁴. While not focused directly on corticogenesis, these data sets reveal the waves of transcription factor motif accessibility changes as stem cells progress towards maturing neurons. As excitatory neurons mature in the human cortex, there is a marked opening of Rfx and Tal-related transcription factor binding sites (e.g. RFX2, TWIST2, NEUROD1) and a closing of early radial glial marker sites like SOX2 and POU factors (e.g. POU2F1), reflecting a concordance with known transcriptomic changes⁴. Further, chromatin accessibility across cortical neurons reflects the spatial organization of cortical layering in the murine brain²⁴. However, many questions about chromatin dynamics, radial glia (RG) division, fate specification, and regulatory network formation persist that require a focused approach.

References