2025-06-22 13:43来源:本站
没有使用统计方法来预先确定样本量。
在线免费获得组蛋白修饰,DNA可及性,DNA甲基化和RNA表达的所有基因组图。可在http://www.ncbi.nlm.nih.gov/geo/geo/geo/Roadmap/epigeNomics/上找到,可在短读取存档或DBGAP上存放的原始测序数据链接。所有用于分析实验的主要处理数据(包括映射的读取)都包含在人类表观基因组图集(http://www.epigenomeatlas.org)的第9版中。与此集合中的每个数据集相关联的完整元数据都在GEO上存档,并描述了为每个分析实验收集的样本,测定,数据处理详细信息和质量指标。
该纲要的第9版包含来自多个分析实验的均匀预处理和映射的数据(来自多个个体的技术和生物学重复,来自多个中心的数据集)。为了降低冗余,提高数据质量并实现我们的综合分析所需的统一性,实验进行了额外的处理,以获取111个合并表观基因瘤的综合数据(有关其他详细信息,请参见下面的部分)。为每个合并表观基因组分配了数字表观基因组标识符(EID;例如,E001)和表观基因组名称的助记符。补充表1(QCSummary片)总结了单个版本的9个样品映射到合并的表观基因组ID。总结了综合表观基因组等关键的元数据,例如年龄,性别,解剖学,表观基因组类别(请参见补充表1,表观遗传学纸板),种族和固体/液态状态。在集成分析中,还使用了与编码项目(E114至E129之间的表观基因组ID)相对应的数据集(具有E114至E129的表观基因组ID)23。如下所述,从127个合并表观基因组中进行的所有数据集都经过处理过滤器,以确保基于读取长度的可读性和测序深度的均匀性。
127个表观基因瘤中的每一个都包含用于组蛋白修饰的核心芯片序列数据集-H3K4ME1,H3K4ME3,H3K27ME3,H3K36ME3,H3K9ME3,H3K9ME3,以及相应的全细胞提取物序列的对照。九十八个表观基因组和六十二个表观基因组分别巩固了H3K27AC和H3K9AC组蛋白CHIP-SEQ数据集。较小的表观基因组具有用于附加组蛋白标记的芯片序列数据集,总共提供了1,319个合并数据集(补充表1,QCSummary Sheet)。53个表观基因组具有DNA可及性(DNASE-SEQ)数据集。五十六个表观基因组具有mRNA-seq基因表达数据。对于127个巩固的表观基因组,在95个表观基因组中共有104个DNA甲基化数据集涉及甲硫酸盐治疗(WGB或RRBS分析)或MEDIP-SEQ和MRE-SEQ和MRE-SEQ分析的组合。除了这里在111个参考表观基因瘤中分析的1,936个数据集外,NIH路线图表观基因组学项目还产生了另外的869个全基因组数据集,该数据集与GEO,人类表观基因组图案图和NCBI链接,并公开且公开且免费。
我们从具有RNA-Seq数据的56个参考表观基因概念中统一地重新处理的mRNA-SEQ数据集。对于RNA-seq分析,在库构造45之后,我们使用BWA对准器对75 bp-l-t-l-l-l-l-l-l读取,并分别生成读取覆盖率,以获得正面和负链特异性库。我们在RNA-Seq库中使用了几个QC指标,包括内含子 - 外观比率,基因间读取分数,链特异性(用于绞合的RNA-SEQ方案),3'-5'偏见,GC偏置和RPKM发现率(补充表1,RNASEQQCCSMUMMARY表1)。We quantified exon and gene expression using a modified RPKM measure8, whereby we used the total number of reads aligned into coding exons for the normalization factor in RPKM calculations, and excluded reads from the mitochondrial genome, reads falling into genes coding for ribosomal proteins, and reads falling into top 0.5% expressed exons.基因的RPKM使用对所有合并外显子的总读数总数计算出按总外显子长归一化的基因。所得文件包含所有注释外显子和编码和非编码基因(不包括核糖体基因)的RPKM值,以及内含子(使用Gencode V10注释)。我们还报告了所有显着的基因间RNA-seq重叠群的坐标,这些重叠群未重叠带注释的基因。
阅读映射。来自表观基因组地图集的版本9的测序数据集涉及使用Pash读取Mapper34绘制到人类基因组的HG19组装上的1502.1亿个测序读取。使用这些读取映射(如上所述的RNA-seq数据集(RNA-seq数据集)除外)用于构建111个巩固表观基因组。仅保留唯一的映射读数,并将乘法映射读数过滤。包含映射读取的床文件是从http://www.epigenomeatlas.org获得的。每种测定类型和实验的比对参数在GEO上存档的相关公开版本9元数据中指定。对于编码数据集,从http://hgdownload.cse.ucsc.edu/goldenpath/goldenpath/hg19/encodedcc/下载了包含映射读取的BAM文件。仅保留唯一的映射读数,并丢弃了乘法映射读数。
可视性过滤,合并和亚采样。原始版本9读取对齐文件包含预先扩展到200 bp的读取。但是,整个发行版的原始读取长度在9个原始数据集上存在显着差异,这反映了项目过程中中心之间的差异和测序技术的变化(36 bp,50 bp,50 bp,76 bp和100 bp)。为了避免由于可视性而引起的人为差异,对于每个合并数据集,原始映射读数均匀地截断为36 bp,然后使用36 bp的自定义可视性轨道对其进行重新验证,以仅保留映射到位置(考虑到Strand)的映射(考虑到该位置),在这些位置上,这些位置在这些位置上是唯一在基因组中的。然后将过滤的数据集跨技术/生物学重复合并,并在必要时为每个标准化表格组中的每个组蛋白标记或DNase-Seq获得一个单一的合并样品。补充表1总结了单个版本的9个主要数据示例文件的映射到对应于127个合并参考表观基因质基因组的合并数据文件。
为了避免由于测序深度的差异而引起的人工信号强度差异,所有合并的组蛋白标记数据集(除了附加组蛋白标记了七个深层剖面的表观基因瘤,图2J)均均匀地采样至3000万个读数的最大深度(所有合并样品的中位数读取深度)。对于七个深层参考的表观基因组(图2J),组蛋白标记数据集被亚采样至最大4500万个读数(中间深度)。将合并的DNase-seq数据集亚采样至最大深度为5000万个读取(中间深度)。然后将这些均匀的子采样数据集用于所有进一步的处理步骤(峰值呼叫,信号覆盖轨道,染色质状态)。
高峰电话。For the histone ChIP-seq data, the MACSv2.0.10 peak caller was used to compare ChIP-seq signal to a corresponding whole-cell extract (WCE) sequenced control to identify narrow regions of enrichment (peaks) that pass a Poisson P value threshold 0.01, broad domains that pass a broad-peak Poisson P value of 0.1 and gapped peaks which are broad domains (P < 0.1) that include at least one narrow peak (P < 0.01) (https://github.com/taoliu/MACS/)32. Fragment lengths for each data set were pre-estimated using strand cross-correlation analysis and the SPP peak caller package (https://code.google.com/p/phantompeakqualtools/)37 and these fragment length estimates were explicitly used as parameters in the MACS2 program (–shift-size = fragment_length/2).
For DNase-seq data, we used two methods to identify DNase I accessible sites. First, the Hotspot algorithm was used to identify fixed-size (150 bp) DNase hypersensitive sites, and more general-sized regions of DNA accessibility (hotspots) using an FDR of 0.01 (http://www.uwencode.org/proj/hotspot)104. MACSv2.0.10 was also used to call narrow peaks using the same settings specified above for the histone mark narrow peak calling.
Narrow peaks and broad domains were also generated for the unconsolidated, 36-bp mappability filtered histone mark ChIP-seq and DNase-seq Release 9 data sets using MACSv2.0.10 with the same settings as specified above.
Genome-wide signal coverage tracks. We used the signal processing engine of the MACSv2.0.10 peak caller to generate genome-wide signal coverage tracks. Whole-cell extract was used as a control for signal normalization for the histone ChIP-seq coverage. Each DNase-seq data set was normalized using simulated background data sets generated by uniformly distributing equivalent number of reads across the mappable genome. We generated two types of tracks that use different statistics based on a Poisson background model to represent per-base signal scores. Briefly, reads are extended in the 5′ to 3′ direction by the estimated fragment length. At each base, the observed counts of ChIP-seq/DNase I-seq extended reads overlapping the base are compared to corresponding dynamic expected background counts (λlocal) estimated from the control data set. λlocal is defined as max(λBG, λ1k, λ5k, λ10k) where λBG is the expected counts per base assuming a uniform distribution of control reads across all mappable bases in the genome and λ1k, λ5k, λ10k are expected counts estimated from the 1 kb, 5 kb and 10 kb window centred at the base. λlocal is adjusted for the ratio of the sequencing depth of ChIP-seq/DNase-seq data set relative to the control data set. The two types of signal score statistics computed per base are as follows.
(1) Fold-enrichment ratio of ChIP-seq or DNase counts relative to expected background counts λlocal. These scores provide a direct measure of the effect size of enrichment at any base in the genome.
(2) Negative log10 of the Poisson P-value of ChIP-seq or DNase counts relative to expected background counts λlocal. These signal confidence scores provide a measure of statistical significance of the observed enrichment.
The −log10(P value) scores provide a convenient way to threshold signal (for example, 2 corresponds to a P value threshold of 1 × 10−2), similar to what is used in identifying enriched regions (peak calling). We recommend using the signal confidence score tracks for visualization. A universal threshold of 2 provides good separation between signal and noise. Both types of signal tracks were also generated for the unconsolidated data sets using the same parameter settings described above.
Quality control. For the primary Release 9 data sets, data quality enrichment scores were computed as the fraction of the uniquely mapped reads overlapping with areas of enrichment. Several methods were employed to select signal enrichment regions. The SPOT quality score was computed based on regions identified with the HotSpot peak caller104; the FindPeaks quality score was inferred based on peak calls made using the FindPeaks36 software; finally, a Poisson metric was derived by modelling the read distribution in genome-tiling 1,000-bp windows with a Poisson distribution and selecting as enriched regions windows with P < 0.05. All the quality scores in Release 9 are in agreement, with strong pairwise correlation (Pearson correlation >0.9). Concordance between centres was confirmed and data analysis pipeline was validated at the outset of the project using data sets for the H1 cell line. The same pipeline was subsequently used to produce Release 9 data. ChIP-seq data for six histone modifications (H3K4me3, H3K27me3, H3K9ac, H3K9me3, H3K36me3 and H3K4me1) were independently generated for the H1 cell line by three REMCs (Broad, UCSD, UCSF-UBC). To quantify concordance, the reads from each experiment were mapped (Level 1 data), read density tracks (Level 2 data) were generated using the EDACC’s primary data processing pipeline, and finally Pearson correlation coefficients were computed between each pair of experiments, as well as between experiments and H1 input acting as a control for background correlation between signals (Supplementary Table 2). The methylome processing pipeline was characterized experimentally on four independent samples38,39.
For the uniformly reprocessed and consolidated ChIP-seq and DNase-seq data sets, strand cross-correlation measures were used to estimate signal-to-noise ratios (https://code.google.com/p/phantompeakqualtools/)37. Data sets for each mark were rank-ordered based on the normalized strand cross-correlation coefficient (NSC) and flagged if the scores were significantly below the median value or in the range of NSC values for WCE extract controls. Consolidated data sets with extremely low sequencing depth (<10M reads) were also flagged. Each standardized epigenome was then manually assigned a subjective quality flag of 1 (high), 0 (medium) or −1 (low), based on the number of flagged data sets it contained. The SPOT, FindPeaks and Poisson quality scores were also recomputed for the consolidated data sets. We observed high correlations of the NSC scores with the SPOT (Pearson correlation of 0.7) and FindPeaks scores (Pearson correlation of 0.65). All QC measures are provided in Supplementary Table 1 (Sheets QCSummary and AdditionalQCScores).
To identify potential antibody cross-reactivity or mislabelling issues, a pairwise correlation heat map (Extended Data Fig. 1e) was computed across all consolidated data sets for H3K4me1, H3K4me3, H3K36me3, H3K27me3, H3K9me3, H3K27ac, H3K9ac and DNase. We computed the Pearson correlation between all pairs of the signal tracks based on signal in chromosomes 1–22 and chromosome X. We used the signal confidence score tracks (−log10(Poisson P value)) where we first computed the average signal scores within each consecutive 25-bp interval. To order the experiments in the heat map we defined the distance between two pairs of experiments as 1-correlation value and used a travelling salesman problem formulation105.
We used PASH38 alignments for the WGBS and RRBS read alignments. From the number of converted and unconverted reads at each individual CpG the total coverage and fractional methylation were reported. The data were uniformly post-processed and formatted into two matrices for each chromosome. One matrix contained read coverage information for each base (C and G) in every CpG (row) and for each reference epigenome (column). Another matrix similarly contained fractional methylation ranging from 0 to 1. For the locations where coverage was ≤3 we considered data as missing. For MeDIP/MRE methylation data we used the output of the mCRF tool31 that reports fractional methylation in the range from 0 to 1 and uses an internal BWA mapping. The mCRF results were combined in a single matrix per chromosome for all reference epigenomes where available.
To capture the significant combinatorial interactions between different chromatin marks in their spatial context (chromatin states) across 127 epigenomes, we used ChromHMM v.1.10106, which is based on a multivariate Hidden Markov Model.
A ChromHMM model applicable to all 127 epigenomes was learned by virtually concatenating consolidated data corresponding to the core set of five chromatin marks assayed in all epigenomes (H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3). The model was trained on 60 epigenomes with highest-quality data (Fig. 2k), which provided sufficient coverage of the different lineages and tissue types (Supplementary Table 1; Sheet QCSummary). The ChromHMM parameters used were as follows: reads were shifted in the 5′ to 3′ direction by 100 bp. For each consolidated ChIP-seq data set, read counts were computed in non-overlapping 200-bp bins across the entire genome. Each bin was discretized into two levels, 1 indicating enrichment and 0 indicating no enrichment. The binarization was performed by comparing ChIP-seq read counts to corresponding whole-cell extract control read counts within each bin and using a Poisson P value threshold of 1 × 10−4 (the default discretization threshold in ChromHMM). We trained several models in parallel mode with the number of states ranging from 10 states to 25 states. We decided to use a 15-state model (Fig. 4a–f and Extended Data Fig. 2b) for all further analyses since it captured all the key interactions between the chromatin marks, and because larger numbers of states did not capture sufficiently distinct interactions. The trained model was then used to compute the posterior probability of each state for each genomic bin in each reference epigenome. The regions were labelled using the state with the maximum posterior probability.
A second ‘expanded’ model applicable to 98 epigenomes that also have an H3K27ac ChIP-seq data set was learned by virtually concatenating consolidated data corresponding to the core set of five chromatin marks and H3K27ac. The model was trained on 40 high-quality epigenomes using the same parameters as those used for the primary model (Supplementary Table 1; Sheet QCSummary). We trained several models with the number of states ranging from 15 states to 25 states. An 18-state model was used for further analyses (Extended Data Fig. 2c) based on similar considerations.
To assign biologically meaningful mnemonics to the states, we used the ChromHMM package to compute the overlap and neighbourhood enrichments of each state relative to various types of functional annotations (Fig. 4b, c, f and Extended Data Fig. 2b, c and Supplementary Fig. 2).
For any set of genomic coordinates representing a genomic feature and a given state, the fold enrichment of overlap is calculated as the ratio of ‘the joint probability of a region belonging to the state and the feature’ versus ‘the product of independent marginal probability of observing the state in the genome’ times ‘the probability of observing the feature’, namely the ratio between the (number of bases in state AND overlap feature)/(number of bases in genome) and the [(number of bases overlap feature)/(number of bases in genome) × (number of bases in state)/(number of bases in genome)]. The neighbourhood enrichment is computed for genomic bins around a set of single-base-pair anchor locations such as transcription start sites.
For the overlap enrichment plots in the figures, the enrichments for each genomic feature (column) across all states is normalized by subtracting the minimum value from the column and then dividing by the max of the column. So the values always range from 0 (white) to 1 (dark blue); that is, it’s a column-wise relative scale. For the neighbourhood positional enrichment plots, the normalization is done across all columns; that is, the minimum value over the entire matrix is subtracted from each value and divided by the maximum over the entire matrix.
The functional annotations used were as follows (all coordinates were relative to the hg19 version of the human genome): (1) CpG islands obtained from the UCSC table browser. (2) Exons, genes, introns, transcription start sites (TSSs) and transcription end sites (TESs), 2-kb windows around TSSs and 2-kb windows around TESs based on the GENCODEv10 annotation (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV10/) restricted to GENCODE biotypes annotating long transcripts. (3) Expressed and non-expressed genes, their TSSs and TESs. Genes were classified into the expressed or non-expressed class based on their RNA-seq expression levels in the H1-ES cells (Fig. 4c) and GM12878 (Extended Data Fig. 2b) cell lines. A gaussian mixture model with two components was fit on expression levels of all genes to obtain thresholds for the two classes. (4) Zinc finger genes (obtained by searching the ENSEMBL annotation for genes with gene names starting with ZNF). (5) Transcription factor binding sites (TFBS) based on ENCODE ChIP-seq data in the H1-ES cell line. The uniformly processed transcription factor ChIP-seq peak locations were downloaded from the ENCODE repository: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/. We also computed percentage transcription factor binding site coverage for state calls in the GM12878 and K562 cell lines using corresponding transcription factor ChIP-seq data from ENCODE which matched and supported the mnemonics and state interpretations obtained from the H1 cell line (Supplementary Fig. 2). (6) Conserved GERP elements based on 34 way placental mammalian alignments http://mendel.stanford.edu/SidowLab/downloads/gerp/ (Supplementary Fig. 3). (7) Enrichment for conserved GERP elements subtracting parts of the above-mentioned GERP elements that overlap exons.
We also learned independent 15-state models individually on each of the 127 epigenomes using the core set of 5 marks and the same parameter settings as for the primary model. To compare the individual models to the joint 15-state primary model, we stacked the emission vectors for all states from all the models and hierarchically clustered them using Euclidean distance and Ward linkage (Extended Data Fig. 2a). The individual epigenome models consistently and repeatedly identified states that were also recovered by the joint model (Extended Data Fig. 2a). Two additional clusters which included states recovered by the independent models learned in individual cell types, but not recovered in the joint model, were HetWk, characterized by weak presence of H3K9me3, and Rpts, characterized by presence of H3K9me3 along with a diversity of other marks, which was enriched in a large number of repeat elements.
For each of the seven deeply profiled reference epigenomes (Fig. 2j) we independently learned chromatin states on observed data for all available histone modifications or variants, and DNase in the reference epigenome. The same binarization and model learning procedure was followed as for the core set of 5 marks. We chose to consistently focus on a larger set of 50-states to capture the additional state distinctions afforded by using additional marks (Supplementary Fig. 4). Enrichments for annotations, including some of those described above for the 15-state model, were computed using ChromHMM. The HiC domains were obtained from ref. 107; the lamina-associated domains are described below; conserved element sets were the hg19 lift-over from ref. 73; repetitive element definitions were from RepeatMasker.
The distribution of DNA methylation (per cent CpG methylation from WGBS data) and DNA accessibility (DNase-seq −log10(P value) signal confidence scores) was computed using regions belonging to each of the 15 chromatin states based on the core set of 5 marks and the 18 chromatin states from the expanded model across all reference epigenomes for which these data sets were available (Fig. 4d, e).
CpGs with a minimum read coverage of 5 were used to calculate the average methylation percentages within genomic regions labelled with each chromatin state from the 15-state primary model and 18 state expanded model. only regions containing more than 3 CpGs with at most 200 bp between consecutive CpGs were used. Plots were generated using ggplot2 package for R (v.3.02). The average methylation levels for the chromatin states across DNA methylation platforms (WGBS, RRBS and mCRF) were analysed using Standard Least Square models in JMP (v.11.0; SAS Ins.). The model included the platforms (3 levels), chromatin states (15 levels) and the interactions (Extended Data Fig. 4).
Genome-wide DamID binding data for human lamin B1 in SHEF-2 ES cells were obtained from GEO series GSE22428 (ref. 63). Lamina associated domains were determined using a similar method to the one described in ref. 64. First, hg18-based data coordinates were converted to hg19-based coordinates using UCSC’s liftOver tool. Data were smoothed using a running median filter with a window size of 5 probes, after which domains were detected by estimating border and domain positions and comparing these to domains defined on 100 randomized instances of the same data set. Parameters are chosen such that the false discovery rate (FDR) for detected domains is 1%.
For each state s for the core 15-state joint model we computed the number of genomic bins that were labelled with that state in at least one epigenome (Gs). From among these bins we counted the number of bins (gs,i) that were labelled as being in state s in exactly i epigenomes (i = 1...127). We converted these counts to fractions (gs,i/Gs) and computed the cumulative fraction that is consistently labelled with the same chromatin state in at most N epigenomes (N = 1...127). States whose cumulative fractions rise faster than others represent those that are less constitutive (more variable). We repeated the same procedure restricted to 43 high-quality and non-redundant Roadmap epigenomes (using only 1 representative epigenome from those corresponding to ES cells, iPS lines or epigenomes for the same tissue type from different individuals and excluding ENCODE cell lines) (Supplementary Table 1, Sheet VariationAnalysis) (Supplementary Fig. 6a). Analogous analysis was performed on states from the 18-state expanded model (Extended Data 5a and Supplementary Fig. 6b).
The observed cumulative fractions of cell-type specificity are a function of the composition of cell types in the compendium and do depend to some extent on the variability of data quality for the different marks. For example, the enhancer mark (H3K4me1) does have a much better signal-to-noise ratio than the transcribed mark (H3K36me3). One might expect this to result in more spurious variation of states associated with the transcribed mark. However, contrary to this expectation, the cumulative fractions for states involving only the transcription mark (Tx and TxWk) and not the enhancer mark indicate that these states are in fact less variable and more constitutive across cell types. On the other hand, all states composed of the enhancer mark (H3K4me1), irrespective of whether they do (TxFlnk, EnhG) or do not (EnhBiv, Enh, BivFlnk, TssAFlnk) include the transcription mark (H3K36me3), are far more cell-type specific. These observations indicate that the increased variability of states is largely due to the enhancer mark (H3K4me1) than the transcribed mark (H3K36me3). As replicates are not available in all epigenomes, we did not correct for inter-replicate variation in this analysis, but in the state-switching analysis below we utilize samples from the same tissue as quasi-replicates.
To avoid spurious switching due to differences in data quality, we restricted this analysis to chromatin states from the 43 high-quality and non-redundant Roadmap epigenomes (see above). Using the 15 state primary model, we computed the empirical switching frequency of any pair of states across all pairs of 43 epigenomes. For a given pair of states A and B, we counted the number of genomic bins that were labelled as (A,B) or (B,A) in all pairs of epigenomes. The switching frequency matrix (which is symmetric) was then row-normalized to convert the switching frequencies to switching probabilities. This is done to avoid a dependence on the total number of epigenomes. Also, the switching probabilities unlike switching frequencies are not dominated by states that are highly prevalent (for example, quiescent state). Supplementary Fig. 7b shows the empirical switching probabilities for all pairs of states across the 43 epigenomes. To differentiate between chromatin state dynamics across tissues (inter-tissue) relative to variation of states across individuals or replicates from the same tissue (intra-tissue), we also computed analogous switching frequencies by restricting to subgroups of epigenomes from the same tissue type (Supplementary Table 1, Sheet VariationAnalysis). The frequencies were added across all sub-groups and then row-normalized to switching probabilities. Supplementary Fig. 7a shows the intra-tissue switching probabilities. We then computed the relative enrichment of state switches as the log10 ratio of inter-tissue switching probability across the 43 epigenomes relative to the intra-tissue switching probabilities (Fig. 5c). We repeated this analysis on the 16 ENCODE cell lines and obtained similar conclusions regarding relative enrichment of state switches (Supplementary Fig. 7c). Analogous analyses were performed using the 18-state expanded model in Roadmap Epigenomics samples (Extended Data Fig. 5c) and ENCODE samples (Supplementary Fig. 7d).
To study large-scale chromatin structure we first calculated ChromHMM (15-state model) state frequencies identified in 200-bp genome-wide bins across 127 epigenomes. Then we averaged state frequencies over the 2-Mb genomic regions, thus defining vectors of length 1,458 for each state. The unsupervised clustering of a 15 × 1,458 matrix (using Pearson correlation as a similarity measure and complete linkage) revealed 11 distinct genomic clusters enriched in different subsets of chromatin states (Fig. 5d, top heat map). Clusters had different sizes, with the smallest one (c1) containing only 27 bins, while the largest cluster (c9), occupied predominantly by a ‘quiescent’ state for all epigenomes, had 377 bins. For each 2-Mb bin in each cluster we calculated average gene density, lamin B1 signal (see section 4 above) and overlap with different cytogenetic bands (Fig. 5d, bottom, which displays also average levels across each cluster). We also show chromosomal locations of the clusters as well as distributions of CpG island frequency across the 2-Mb bins in each cluster (Extended Data Fig. 5d).
As a general resource for epigenomic comparisons across all epigenomes for which DNA methylation data is available, we defined DMRs using the method of Lister et al.108, combining all differentially methylated sites (DMSs) within 250-bp of one another into a single DMR and excluded any DMR with less than 3 DMSs. For each DMR in each sample, we computed its average methylation level, weighted by the number of reads overlapping it109. This resulted in a methylation level matrix with rows of DMRs and columns of samples.
For analysing differentiation of hESCs in Fig. 4h, we used a second set of DMRs. We used a pairwise comparison strategy between ES cells and three in vitro derived cell types representative of the three germ layers (mesoderm, endoderm, ectoderm) and performed DMR calling as previously described53. only DMRs losing more than 30% methylation compared to the ES cell state at a significance level of P ≤ 0.01 were retained. Subsequently, we computed weighted methylation levels for all three DMR sets across HUES64, mesoderm, endoderm and ectoderm as well as three consecutive stages of in vitro derived neural progenitors (please see accompanying paper53 for details on the cell types). Finally, we plotted the corresponding distribution using the R function vioplot in the vioplot package. In order to identify potential regulators associated with the loss of DNA methylation at these regions, we determined binding sites of a compendium of transcription factors profiled in distinct cell lines and types that overlapped with each set of hypomethylated DMRs51. Next, we determined a potential enrichment over a random genomic background by randomly sampling 100 equally sized sets of genomic regions, respecting the chromosomal and size distribution of the different DMR sets and determined their overlap with the same transcription factor binding site compendium to estimate a null distribution. only transcription factors that showed fewer binding sites across the control regions in 99 of the cases were considered for further analysis. Next, we computed the average enrichment over background for each transcription factor with respect to the 100 sets of random control regions for each germ layer DMR and report this enrichment level in Fig. 4h right, where we capped the relative enrichment at 12.
For studying breast epithelia differentiation, DMRs were called from WGBS, requiring at least five aligned reads to call differentially methylated CpG, and at least three differentially methylated CpGs within a distance of 200 bp of each other45. For studying tissue environment versus developmental origin, DMRs were called from MeDIP and MRE data using the M&M algorithm56.
For variation in methylation of each chromatin state across epigenomes (Fig. 4g and Extended Data Fig. 4f), we first excluded any contiguous chromatin state region containing less than three CpG sites. Then, the mean of the methylation level for all contained CpG sites was calculated for each region, and for each epigenome density values were calculated for these mean methylation values between 0% and 100%, with density values estimated over n = 1,000 points with a gaussian kernel, with the default ‘nrd0’ bandwidth from the R stats package density function. Finally, for each chromatin state, we plotted the ln(density + 1) for each epigenome as rows, with the colour scale set with white as the minimum ln(density + 1) value and red, green, or blue, for WGBS, mCRF and RRBS, respectively, set as the maximum ln(density + 1) value in the matrix. Rows were ordered by the epigenomic lineage and grouping ordering shown in Fig. 2a. In Extended Data Fig. 4f, epigenomes were first grouped by methylation platform, and then ordered by Fig. 2a within each platform. The chromatin state methylation profiles in the cell lines versus primary cells/tissue cells were analysed using a mixed model with repeated measures. Overall effect of the group (cell lines versus primary cells/tissue cells) was tested using epigenomes within group as the error term. Testing for group effect was performed for each of the 15 chromatin states, resulting in a Bonferroni correction on the P values for the 15 tests.
To identify patterns of coordinated changes of histone marks over enhancers during heart muscle development, we compared ES cells, mesendoderm cells, and left ventricle tissue57. We identified relevant enhancers as those that show changes in at least one histone mark between a specific cell type cluster (heart muscle in our case) and other cell types using LIMMA (Linear Model for Microarray Analysis). We applied FDR-corrected P value significance threshold of 0.05 to obtain cluster-specific enhancers. For each tissue type (heart muscle in our case) we then clustered the enhancers into five clusters (C1–C5) based on their multi-mark epigenomic profiles using the k-means algorithm implemented in the Spark tool (Fig. 4i). The tools used to generate Fig. 4i are integrated into the Epigenomic Toolset within the Genboree Workbench and are accessible for online use at http://www.genboree.org.
For each analysed mark, we calculated Pearson correlation values between all pairwise combinations of reference epigenomes using the mark’s signal confidence scores (–log10(Poisson P value)) within 200 bp of the genomic regions deemed relevant for that mark. Relevance of regions is determined by whether a region was called in a particular (mark-matched) chromatin state with posterior probability of >0.95 in any of the reference epigenomes. For H3K4me1, H3K27ac and H3K9ac we used state Enh; for H3K4me3 state TssA; for H3K27me3 state ReprPC; for H3K36me3 state Tx; and for H3K9me3 state Het, unless otherwise noted (all based on the 15-state core model).
The resulting correlation matrices were used as the basis for a distance matrix for complete-linkage hierarchical clustering, followed by optimal leaf ordering110. Bootstrap support values are derived from 1,000 random samplings with replacement from all regions considered for a particular mark and a bootstrap tree was estimated for each resampling. The bootstrap support for a branch corresponds to the fraction of bootstrapped trees that support the bipartition induced by the branch.
In parallel to this, all correlation matrices mentioned above were used to perform Multi-Dimensional Scaling analyses using R.
For each of the 39 Roadmap reference epigenomes with DNase data, peak positions are combined across reference epigenomes by defining peak island areas, defined by stacking all DNase peak positions across epigenomes, and considering the full width at half maximum (FWHM). Note that for this we are only considering peak locations, not intensities. The goal of this is to obtain an estimate of the area of open chromatin, not to quantify the level of ‘openness’, as these data are not available for all reference epigenomes. In cases when peak islands overlap, they are merged because it means that the original DNase peak area populations overlap at least for half of the epigenomes with DNase peaks in that area (given the FWHM approach). Peak island summits are defined as the median peak summit of all peak island member DNase peaks. This results in a total of 3,516,964 DNase enriched regions across epigenomes.
We then annotate each of the 3.5M DNase peaks with the chromatin states they overlap with in each of the 111 Roadmap reference epigenomes, using the core 15-state chromatin state model, and focusing on states TssA, TssAFlnk and TssBiv for promoters, and EnhG, Enh and EnhBiv for enhancers, and state BivFlnk (flanking bivalent Enh/Tss) for ambiguous regions. Out of these, 2.5M regions are called as either enhancer or promoter across any of the 111 Roadmap reference epigenomes. Note that because DNase data are not available for all Roadmap epigenomes, the set of regulatory regions defined may exclude DNase regions active in cell types for which DNase was not profiled (Fig. 2g). Although most regions are undisputedly called exclusively promoter or enhancer, there are 535,487 regions that needed further study to decide whether they should be called promoters, enhancers, or both (‘dyadic’). We arbitrate on these regions by first clustering them (using the methods in the following section) with an expected cluster size of 10,000 regions, and then for each cluster calculating (a) the mean posterior probabilities for promoter and enhancer calls separately, and (b) the mean number of reference epigenomes in which regions were called promoter or enhancer. Clusters of regions for which the differences in mean posterior probabilities (a) is smaller than 0.05, or for which the absolute log2 ratio of the number of epigenomes called as promoter or enhancer (b) is smaller than 0.05, are called true ‘dyadic’ regions, along with a small number of ‘ambiguous’ regions in state BivFlnk. Note that this particular clustering is only to arbitrate on these regions using group statistics instead of one-by-one; the final clusterings are described next. Overall, we define 2.3M putative enhancer regions (12.63% of genome), 80,000 promoter regions (1.44% of genome) and 130,000 dyadic regions (0.99% of genome), showing either promoter or enhancer signatures across epigenomes.
To cluster regulatory (that is, enhancer, promoter or dyadic) regions based on their activity patterns across all reference epigenomes, we expressed each region in terms of a binary vector of length n×s, where n is the number of reference epigenomes (111) and s is the number of chromatin states considered. For enhancers and promoters, s = 3, as both of these types of regions are made up of 3 chromatin states in the 15-state ChromHMM model (enhancers, EnhG, Enh and EnhBiv; promoters, TssA, TssAFlnk and TssBiv).
The thus obtained binary matrices are subsequently clustered using a variation of a k-centroid clustering algorithm111. Instead of Euclidean distance we use a Jaccard-index-based distance. This is done to be able to correctly cluster highly cell-type-restricted regions. From a computational point of view, we optimized the method to both deal with the size of the used data matrices and leverage their sparsity, to efficiently compute and update distances for matrices with sizes on the order of 106 × 103. The algorithm has been further modified to converge when less than 0.01% of cluster assignments change between iterations.
We selected the number of clusters k by tuning the expected number of regions within each cluster to be approximately 1,000 for promoter and dyadic regions, and approximately 10,000 for enhancer regions, given their much larger count (81,000, 129,000 and 2.3M for promoter, dyadic and enhancer, respectively). This results in a value of k = 233 for enhancer clusters (for 10k elements per cluster), and the algorithm converged on k = 226 non-empty clusters, which are used for subsequent analyses.
Clusters are visualized (Fig. 7a) by ‘diagonalizing’ when possible. First, ‘ubiquitous’ clusters (defined as having at least 50% of epigenomes with an enhancer/promoter density of >显示了25%)。然后,剩余的簇按照表观基因组具有最大增强子密度的排序。
已经使用出色的工具55进行了与基因组目录(基因本体论(GO),人类表型本体论(HPO)的邻近基因成员的富集分析。特别是,出色的Web API用于自动提交区域描述并检索结果以进行后续解析。我们将自己限制在富集比至少为2的结果的解释中,而多个假设检验校正了基于二项式和基于高几何分布测试的p值<0.01。
为了可视化图7b,c中富集项的代表性子集,我们选择了代表性术语进行显示(通过重新订购行来对角线化富集矩阵后)。我们使用加权袋的方法来选择高度丰富的术语,这些术语包含许多单词,这些单词在基因组标签中表现出类似的富集模式。简而言之,沿对角富集矩阵的行名称(基因 - 设定项)滑动,我们收集单词计数,并将其乘以从Great获得的整数 - LOG10(Q值)。我们在图7b的33号尺寸的滑动窗口中这样做(导致35个术语),图7c(以15个术语为15)。对于窗口中的每个单词,这些值相对于所有行名称上的相同单词都表示,并在何种程度上注册了它们的代表性过高。然后,根据其组成的所有单词的平均代表性过度代表,将窗口中的每个基因组项分配得分。最后,基因集是基于这种平均代表性及其重要性的共同排名。排名最佳的基因集标签被选为该窗口的代表标签。所有术语均显示在图11D中,可在http://compbio.mit.edu/roadmap上下载。
我们从主要大规模数据库中收集了1,772个已知的转录因子识别基序(位置权重矩阵)68,112,113,114,115,115,116,117,并测量了每种增强器模块的增强剂的富集,与226个增强器模块相比,使用Refs 9,68的226个profester(如226个增强器)的构图。我们使用0.75相关截止群聚类,导致300个基群簇68,并为每个基序群集选择该基序在任何增强器模块中具有最高富集的基序,以进行进一步分析。
我们将每个增强子模块和转录因子的表达得分计算为跨细胞类型的转录因子表达式之间的Pearson相关性,其中具有表达数据(零范围化的log(RPKM),零被log(0.0005))和模块的“中心”代替。对于每个增强器模块,其中心被定义为长度111的向量,其中包含该模块中的区域的比例,称为(任何类型的)增强子中的111个表观基因组中的每个模块。该表达评分旨在充当细胞类型模块中转录因子的“表达式”。然后,我们将每个转录因子的表达式增强值计算为该表达评分的相关性以及跨增强子模块的相应基序的富集。图8显示了至少一个log2 = 1.5的log2富集或耗尽的簇,至少有一个log2 = 1.5,至少有一个基序的群集在图8中显示了前40个基序,并具有log2富集或耗尽的簇,并且扩展数据图8a(每个因子仅在图8中显示一个基序)。
我们显示了所有增强子模块中的所有84个基序(log2≥1.5),整个226个增强子模块(补充图13A)以及它们在101个模块中显着丰富(扩展数据8a)。同样,我们在整个111个单个参考表观基因源组中(补充图13B),特别是在15个富集的表观基因瘤中展示了所有10个富集的基序(补充图13B)(补充图13C)。最后,我们显示了整个17个组织组中的所有19个富集基序(补充图13d),特别是在10组中显示出明显富集的10组(补充图13E)。
为了可视化调节器 - 细胞类型链接(图8),我们使用这些基序模块富集计算了每种细胞类型和基序之间的边缘权重。对于每种基序和单元格类型,我们计算了log2基序富集乘积的所有模块的总和,以及模块中心内的单元格类型的值(仅通过用0替换<0.7的值来考虑高度相关的单元格类型)。我们显示所有最终的边缘权重,至少为1.5,并使用Cytoscape118可视化网络。
基于上面提到的相同基序富集方法,我们计算了每个库中组织特异性数字基因组足迹(DGF)区域中的基序富集。通过在42个DGF文库中选择不超过20个DGF文库的DGF区域来鉴定组织特异性的DGF区域。为了生成扩展数据图9B,我们根据其组织类型将每个库中的图案富集标准化为每个基序(行)的z得分(行),并为每个DGF库(列)涂色。
我们计算了每个驱动程序基序(扩展数据9C和10)的位置富集与每种细胞类型中数字基因组足迹(DGF)位点有关的位置富集(补充表5B)。对于每个驱动器转录因子基序,我们相对于最接近的DGF位点(中心视图)的中心(中心视图)和相对于最接近DGF位点的边界(边界视图)而产生了与基序位置(基序实例的中心)相对应的两个视图。我们仅考虑了在100 bp以内的最接近DGF站点的主题实例。对于中心视图,我们绘制了基序的发生密度与到DGF中心的距离不同。对于边界视图,我们考虑了图案实例的中心和DGF边界的任何一侧之间的最短距离,并给出了DGF内部图案实例的负距离,否则为正距离值。与中心视图相似,我们在每个单元格的边界视图中绘制了基序密度与派生距离值。
为了访问DGF在每种细胞类型中的基序浓度的重要性,我们计算了DGF富集比为距离与DGF中心小于20 bp的基序实例的数量与直接侧翼窗口中的该数量(即具有距离的距离数量)与DGF中心的数量与DGF中心的数量较大,而DGF中心的数量较大,大于20 bp和40 bp。作为对照,我们从给定基序的洗牌版本中随机采样了相同数量的基序实例,并在洗牌基序实例中获得了DGF富集比。真实基序的DGF富集比通过平均值和标准偏差从1000倍随机抽样中进一步转换为Z评分。然后,从z得分和Bonferroni校正中进一步计算了调整后的P值的细胞类型数量。
在补充表5a中,将预测表观基因组修饰的基序与DGF进行了比较。这是在可用DGF和预测基序的三种细胞类型中完成的:“ H1 BMP4衍生的sendoderm培养细胞”(E004),“ H1 BMP4衍生的滋养层状细胞培养细胞”(E005)和“ H1衍生的中质干细胞”(E006)。考虑了以下七个输入的预测基序:H3K27Me3,H3K27AC,H3K9ME3,H3K36ME3,H3K4ME1,H3K4ME3和DNA甲基化谷(DMV)11。为了识别重叠,对相应修改的修改峰进行了扫描预测基序,并记录了基序和序列之间最佳匹配的位置。然后,我们计算了最佳主题匹配的位置的次数,将DGF重叠至少1 bp。将这些计数与随机识别的重叠数量进行比较,这是通过将DGF与修饰峰内的随机位置进行比较来计算得出的。报告的随机频率是100重复的平均值。为了计算倍数富集,我们将观察到的频率除以随机频率。
我们测试了组蛋白标记H3K4ME1,H3K4ME3,H3K36ME3,H3K9ME3,H3K27ME3,H3K27ME3,H3K9AC和H3K27AC以及dnase eps eps eps eps的epem2,我们测试了组蛋白标记H3K4ME1,H3K4ME3,H3K9ME3,H3K27AC和H3K27AC以及dnase eps eps的eps 2n peaks的epem2,我们在各个基因组研究(GWAS)中富集了SNP。将使用的SNP策划为NHGRI GWAS Catalogue75,并于2014年9月12日通过UCSC Table Browser119获得。我们将富集分析限制为染色体1-22和X染色体。我们将一项研究定义为一项研究是带有带状特征和PubMed ID的独特组合。为了减少分配给同一研究的SNP对之间的依赖性,我们修剪了SNP,使得在同一染色体上没有两个SNP彼此之间彼此1 MB之内。修剪过程以P值排名为P值的每个SNP,最重要的是首先出现,如果在1 MB内没有同一染色体上的SNP,我们将保留SNP。我们计算了针对修剪的GWAS目录作为背景的每一组修剪的SNP峰值呼叫的富集的超几何P值。我们通过生成100个随机的PRUNED GWAS目录的随机版本,将SNP分配给SNP,分配了100个随机的研究和计算参考表观基因组组合的平均分数,从而分配了SNP,我们分别估计每个标记的映射从P值映射到所有研究和参考表观遗传组组合跨测试的虚假发现率,并分配了SNP的平均水平(在该级别上的平均级别)。根据实际GWAS目录除以数字。