The WGS of 53,831 samples reported here was performed on samples previously collected and consented as part of 33 NHLBI-funded research projects. All studies were approved by the appropriate institutional review boards (Supplementary Information 4). All sequencing used DNA extracted from whole blood, except for 17 Framingham samples (lymphoblastoid cell lines) and the HapMap samples NA12878 and NA19238 (lymphoblastoid cell lines), which are routinely used as sequencing controls. Cell line samples were checked for mycoplasma contamination, and identities were verified by aligning the sequence data to the human genome and comparing with previous genetic analyses.
WGS at an average depth of at least 30× (paired-end, 150-bp reads) was performed on Illumina HiSeq X Ten instruments at six sequencing centres over several years (Supplementary Table 17). The PCR-free library preparation kits used for all sequencing were purchased from KAPA Biosystems and are equivalent to the protocol in the Illumina TruSeq PCR-Free Sample Preparation Guide (Illumina, FC-121-2001). Further details are available from the TOPMed website (https://www.nhlbiwgs.org/topmed-whole-genome-sequencing-project-freeze-5b-phases-1-and-2). In addition, 1,606 samples from four contributing studies had undergone 30× coverage WGS before the start of the TOPMed sequencing project and were included in this dataset. That sequencing was performed on Illumina HiSeq 2000 or 2500 instruments with 2 × 100 bp or 2 × 125 bp paired-end reads, in some cases using PCR amplification.
Sequence data processing was performed periodically to produce 'freezes' of genotype data comprising all samples available at the time. Following a previously published protocol, all sequences were remapped to the hs38dh 1000 Genomes build 38 human genome reference, including decoy sequences77, using BWA-MEM76. Variant discovery and genotype calling were performed jointly across all samples in a given freeze using the GotCloud78,79 pipeline. This process resulted in a single multi-study genotype call set. The support vector machine quality filter for variant sites was trained using a set of site-specific quality metrics, with known variants from arrays and the 1000 Genomes Project as positive controls and variants with Mendelian inconsistencies as negative controls (see the online documentation80 for further details). After removing all sites with a minor allele count of less than 2, genotypes were phased using Eagle 2.481. Sample-level quality control included checks for pedigree errors, discrepancies between self-reported and genetic sex, and concordance with previous genotyping array data. Any errors detected were resolved before submission to dbGaP. Details of WGS data acquisition, processing and quality control differ between TOPMed data freezes. Freeze-specific methods are described on the TOPMed website (https://www.nhlbiwgs.org/data-sets) and in documents included with each TOPMed accession released on dbGaP (for example, see document phd008024.1 in phs000956.v4.p1).
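For illustration, a minimal Python sketch of the remapping step is shown below. It wraps a generic 'bwa mem | samtools sort' invocation; the paths, thread counts and flag set are assumptions, not the exact TOPMed production pipeline, which follows the published functional-equivalence protocol.

```python
# Minimal sketch of the remapping step, assuming bwa and samtools are on PATH.
# Paths, thread counts and flags are illustrative placeholders.
import subprocess

REF = "hs38DH.fa"                                  # GRCh38 reference with decoys (local path)
FASTQ1, FASTQ2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"
OUT = "sample.sorted.bam"

bwa = subprocess.Popen(
    ["bwa", "mem", "-t", "8", "-K", "100000000", "-Y", REF, FASTQ1, FASTQ2],
    stdout=subprocess.PIPE,
)
subprocess.run(
    ["samtools", "sort", "-@", "4", "-o", OUT, "-"],
    stdin=bwa.stdout, check=True,
)
bwa.stdout.close()
bwa.wait()
```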
A copy of the individual-level sequence data for each study participant is stored on the Google and Amazon clouds. Access requires an approved dbGaP data access request and is mediated by the NCBI sequence data delivery pilot mechanism. This mechanism uses the Fusera software82, which runs on the user's cloud instance, to handle authentication and authorization with dbGaP. It provides read access to sequence data for one or more TOPMed (or other) samples as .cram files (with associated .crai index files) mounted in a FUSE virtual file system on the cloud computing instance. Samples are identified by 'SRR' run accession numbers assigned in the NCBI Sequence Read Archive (SRA) database and are displayed in the SRA Run Selector (https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under each study's phs number. The phs numbers of all TOPMed studies can easily be found by searching dbGaP for the string 'TOPMed'. The Fusera software is restricted to running on Google or Amazon cloud instances to avoid incurring data egress charges. Fusera, SAMtools and other tools are also packaged in Docker containers for ease of use and can be downloaded from Docker Hub83.
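Once a .cram/.crai pair is mounted through the FUSE file system, it can be read like any local CRAM, for example with pysam. The sketch below is illustrative only; the mount path and SRR accession are hypothetical placeholders.

```python
# Hedged sketch: read a mounted TOPMed CRAM with pysam (paths are placeholders).
import pysam

cram_path = "/fusera-mount/SRR1234567/SRR1234567.cram"   # hypothetical mounted file
reference = "hs38DH.fa"                                   # local copy of the reference FASTA

with pysam.AlignmentFile(cram_path, "rc", reference_filename=reference) as cram:
    # Count reads overlapping a small region of chromosome 1 as a simple smoke test.
    n = cram.count("chr1", 1_000_000, 1_010_000)
    print(f"reads overlapping chr1:1,000,000-1,010,000: {n}")
```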
Several sample sets from three different WGS data freezes were used in the analyses presented here: freeze 3 (GRCh37 alignments, approximately 18,000 samples jointly called in 2016), freeze 5 (GRCh38 alignments, approximately 65,000 samples jointly called in 2017) and freeze 8 (GRCh38 alignments, approximately 140,000 samples jointly called in 2019). Extended Data Table 3 indicates the sample set used for each of the several different types of analysis described in this paper. Most analyses used the 53,831 samples derived from freeze 5 ('general variant analyses' in Extended Data Table 3) or a subset approved for population genetics studies ('population genetics' in Extended Data Table 3). The set of 53,831 samples was chosen from freeze 5 using samples consistent with dbGaP sharing at the time of analysis, excluding: (1) duplicate samples from the same participant; (2) one member of each monozygotic twin pair; (3) samples with questionable identity or low read depth (<98% of variant sites at depth ≥ 10×); and (4) samples with consent types inconsistent with analyses presented here. The 'unrelated' sample set consisting of 40,722 samples refers to a subset of the 53,831 samples of individuals who are unrelated with a threshold of third degree (less closely related than first cousins), identified using the PC-AiR method84. Exact numbers of samples used in each analysis are listed in Supplementary Table 18.
From around 10,000 BioMe study samples present in TOPMed freeze 8, we randomly selected 1,000 samples for which whole-exome sequencing (WES) data were available. These samples were whole-exome sequenced using Illumina v4 HiSeq 2500 at an average 36.4× depth. Genetic variants were jointly called using the GATK v.3.5.0 pipeline across all 31,250 BioMe samples with WES data. A series of quality control filters, known as the Goldilocks filter, were applied before data delivery to the Charles Bronfman Institute for Personalized Medicine (IPM). First, a series of filters was applied to particular cells comprising combinations of sites and samples—that is, genotypic information for one individual at one locus. Quality scores were normalized by depth of coverage and used with depth of coverage itself to filter cells, using different thresholds for SNVs and short indels. For SNVs, cells with depth-normalized quality scores less than 3, or depth of coverage less than 7, were set to missing. For indels, cells with depth-normalized quality scores less than 5, or depth of coverage less than 10, were set to missing. Then, variant sites at which all samples carrying variation had heterozygous (0/1) genotype calls and all samples carrying heterozygous variation failed the allele balance cut-off were removed from the dataset at this stage. The allele balance cut-off, as with the depth and quality scores used for cell filtering above, differed depending on whether the site was a SNV or an indel: SNVs required at least one sample to carry an alternative allele balance ≥ 15%, and indels required at least one sample to carry an alternative allele balance ≥ 20%. These filters resulted in the removal of 441,406 sites, leaving 8,761,478 variants in the dataset. After subsetting to 1,000 randomly selected individuals, we had 1,076,707 autosomal variants that passed quality control. We further removed variants with call rate <99% (that is, missing in more than 10 individuals), reducing the number of analysed autosomal variants to 1,044,517. The comparison results of TOPMed WGS and BioMe WES data are described in Supplementary Information 1.3.1.
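A simplified sketch of the cell- and site-level logic described above is given below, using the thresholds from the text. The matrix layout (sites × samples) and genotype coding are illustrative assumptions, not the GATK/IPM implementation.

```python
# Simplified Goldilocks-style filter sketch; thresholds from the text, data layout assumed.
import numpy as np

def goldilocks_filter(gq_norm, dp, ab, gt, is_indel):
    """gq_norm, dp, ab, gt: arrays of shape (n_sites, n_samples); is_indel: (n_sites,).
    gt coding: -1 missing, 0 hom-ref, 1 het, 2 hom-alt. Returns (filtered gt, site mask)."""
    gt = gt.astype(float).copy()

    # Cell-level filter: depth-normalised quality and depth, with SNV/indel thresholds.
    q_thr = np.where(is_indel, 5, 3)[:, None]
    d_thr = np.where(is_indel, 10, 7)[:, None]
    gt[(gq_norm < q_thr) | (dp < d_thr)] = -1               # set failing cells to missing

    # Site-level filter: drop sites where every carrier is heterozygous and no
    # heterozygous carrier passes the allele-balance cut-off.
    ab_thr = np.where(is_indel, 0.20, 0.15)[:, None]
    het = gt == 1
    no_hom_alt = (gt == 2).sum(axis=1) == 0
    any_passing_het = (het & (ab >= ab_thr)).any(axis=1)
    keep_site = ~(no_hom_alt & het.any(axis=1) & ~any_passing_het)
    return gt[keep_site], keep_site
```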
Investigators of the Framingham Heart Study (FHS) evaluated WGS data from TOPMed in comparison with sequencing data from CHARGE Consortium WGS and WES datasets. Supplementary Table 19 provides the counts and depth of each sequencing effort. The overlap of these three groups is 430 FHS study participants, on whom we report here. We use a subset of 263 unrelated study participants to calculate the numbers of singletons and doubletons, MAF, heterozygosity and call rates, to avoid bias from the family structure. Supplementary Information 1.3.2 provides further detail on the sequencing efforts and a detailed description of the comparison results.
pLOF variants were identified using Loss Of Function Transcript Effect Estimator (LOFTEE) v.0.3-beta85 and Variant Effect Predictor (VEP) v.9486. The genomic coordinates of coding elements were based on GENCODE v.2915. Only stop-gained, frameshift and splice-site-disrupting variants annotated as high-confidence pLOF variants were used in the analysis. pLOF variants with allele frequency > 0.5% or falling within regions masked owing to poor accessibility were excluded from the analysis (see Supplementary Information 1.5 for details).
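For illustration, the sketch below shows how high-confidence pLOF variants might be selected from a VEP + LOFTEE-annotated VCF with pysam. The CSQ field layout depends on the VEP command line used, so the field names ('Consequence', 'LoF') and file path are assumptions to adapt, not the exact pipeline.

```python
# Hedged sketch: select high-confidence (HC) pLOF variants with AF <= 0.5% from a VEP VCF.
import pysam

PLOF_CONSEQUENCES = {"stop_gained", "frameshift_variant",
                     "splice_acceptor_variant", "splice_donor_variant"}

vcf = pysam.VariantFile("annotated.vcf.gz")                         # placeholder path
csq_format = vcf.header.info["CSQ"].description.split("Format: ")[1].split("|")
i_cons, i_lof = csq_format.index("Consequence"), csq_format.index("LoF")

for rec in vcf:
    af = rec.info.get("AF", (0.0,))[0]
    if af is None or af > 0.005:                                    # exclude AF > 0.5%
        continue
    for ann in rec.info.get("CSQ", ()):
        fields = ann.split("|")
        if set(fields[i_cons].split("&")) & PLOF_CONSEQUENCES and fields[i_lof] == "HC":
            print(rec.chrom, rec.pos, rec.ref, rec.alts, fields[i_cons])
            break
```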
We evaluated the enrichment and depletion of pLOF variants (allele frequency < 0.5%) in gene sets (that is, terms) from Gene Ontology (GO)87,88. For each gene annotated with a particular GO term, we computed the number of pLOF variants per protein-coding base pair, L, and proportion of singletons, S. We then tested for lower or higher mean L and S in a GO term using bootstrapping (1,000,000 samples) with adjustment for the gene length of the protein-coding sequence (CDS): (1) sort all genes by their CDS length in ascending order and divide them into equal-size bins (1,000 genes each); (2) count how many genes from a GO term are in each bin; (3) from each bin, sample with replacement the same number of genes and compute the average L and S; (4) count how many times sampled L and S were lower or higher than observed values. The P values were computed as the proportion of bootstrap samples that exceeded the observed values. The fold change of average L and S was computed as a ratio of observed values to the average of sampled values. We tested all 12,563 GO terms that included more than one gene. The P-value significance threshold was thus ~2 × 10−6. The enrichment and depletion of pLOF variants in public gene databases was tested in a similar way.
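A sketch of the CDS-length-matched bootstrap for one GO term and one statistic (for example, L) follows. The pandas inputs are illustrative, and a smaller default number of bootstrap samples is used here than the 1,000,000 used in the published analysis.

```python
# CDS-length-binned bootstrap sketch for a single GO term (inputs are illustrative).
import numpy as np
import pandas as pd

def go_bootstrap(genes, term_genes, stat="L", n_boot=10_000, bin_size=1000, seed=1):
    """genes: DataFrame indexed by gene symbol with columns 'cds_length' and `stat`."""
    rng = np.random.default_rng(seed)
    df = genes.rename_axis("gene").sort_values("cds_length").reset_index()
    df["bin"] = np.arange(len(df)) // bin_size              # equal-size CDS-length bins
    in_term = df["gene"].isin(set(term_genes))

    observed = df.loc[in_term, stat].mean()
    per_bin = df.loc[in_term, "bin"].value_counts()          # term genes per length bin

    boot_sums = np.zeros(n_boot)
    for b, n in per_bin.items():
        vals = df.loc[df["bin"] == b, stat].to_numpy()
        boot_sums += rng.choice(vals, size=(n_boot, n), replace=True).sum(axis=1)
    boot_means = boot_sums / per_bin.sum()

    p_depletion = np.mean(boot_means <= observed)
    p_enrichment = np.mean(boot_means >= observed)
    fold_change = observed / boot_means.mean()
    return observed, fold_change, p_depletion, p_enrichment
```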
We compared sequencing depth at protein-coding regions in TOPMed WGS and ExAC WES data. The ExAC WES depth at each sequenced base pair on human genome build GRCh37 was downloaded from the ExAC browser website (http://exac.broadinstitute.org). When sequencing depth summary statistics for a base pair were missing, we assumed depth <10× for this base pair. Only protein-coding genes from consensus coding sequence were analysed and the protein-coding regions (CDS) were extracted from GENCODE v.29. When analysing ExAC sequencing depth, we used GENCODE v.29 lifted to human genome build GRCh37. When comparing sequencing depth for each gene individually in TOPMed and ExAC, we used only genes present in both GRCh38 and GRCh37 versions of GENCODE v.29.
Analysis of unmapped reads was performed using 53,831 samples from TOPMed data freeze 5. From each sample, we extracted and filtered all read pairs with at least one unmapped mate and used them to discover human sequences that were absent from the reference. The pipeline included four steps: (1) per-sample de novo assembly of unmapped reads; (2) contig alignment to the Pan paniscus, Pan troglodytes, Gorilla gorilla and Pongo abelii genome references and subsequent hominid-reference-based merging and scaffolding of sequences pooled together from all samples; (3) reference placement and breakpoint calling; and (4) variant genotyping. The detailed description of each step is provided in Supplementary Information 1.7.
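The sketch below shows one way the input to step (1) might be collected: read pairs with at least one unmapped mate, written to a BAM for per-sample de novo assembly. The additional read filtering used in the actual pipeline (see Supplementary Information 1.7) is omitted here.

```python
# Hedged sketch: extract read pairs with at least one unmapped mate using pysam.
import pysam

with pysam.AlignmentFile("sample.cram", "rc", reference_filename="hs38DH.fa") as inp, \
     pysam.AlignmentFile("unmapped_pairs.bam", "wb", template=inp) as out:
    for read in inp.fetch(until_eof=True):          # until_eof also returns unmapped reads
        if read.is_secondary or read.is_supplementary:
            continue
        if read.is_unmapped or read.mate_is_unmapped:
            out.write(read)
```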
Details of the Stargazer genotyping pipeline have been described previously43. In brief, SNVs and indels in CYP2D6 were assessed from a VCF file generated using GATK-HaplotypeCaller89. The VCF file was phased using the program Beagle90 and the 1000 Genomes Project haplotype reference panel. Phased SNVs and indels were then matched to star alleles. In parallel, read depth was calculated from BAM files using GATK-DepthOfCoverage89. Read depth was converted to copy number by performing intra-sample normalization43. After normalization, structural variants were assessed by testing all possible pairwise combinations of pre-defined copy number profiles against the observed copy number profile of the sample. For new SVs, breakpoints were statistically inferred using changepoint91. Information regarding new SVs was stored and used to identify subsequent SVs in copy number profiles. Output data included individual diplotypes, copy number plots and a VCF of SNVs and indels that were not used to define star alleles.
We excluded indels and multi-allelic variants, and categorized the remaining variants as common (allele frequency ≥ 0.005) or rare (allele frequency < 0.005), and as coding or noncoding based on protein-coding exons from Ensembl 9492. Variant counts were analysed across 2,739 non-empty (that is, with at least one variant) contiguous 1-Mb chromosomal segments, and counts in segments at the end of chromosomes with length L < 106 bp were scaled up proportionally by the factor 106 × L−1. For each segment, the coding proportion, C, was calculated as the proportion of bases overlapping protein-coding exons. The distribution of C is fairly narrow, with 80% of segments having C ≤ 0.0195, 99% having C ≤ 0.067 and only 3 segments having C ≥ 0.10. Owing to the significant negative correlation between C and the number of variants in a segment, and potential mapping effects, we used linear regression to adjust the variant counts per segment according to the model count = β × C + A + count_adj, where A is the proportion of segment bases overlapping the accessibility mask (Supplementary Information 1.5). Unless otherwise noted, we present analyses and results that use these adjusted count values.
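A sketch of this per-segment adjustment is shown below: regress raw counts on the coding proportion C and the accessibility-mask proportion A, and keep the residuals as the adjusted counts. A coefficient is fitted for both C and A here, and the file and column names are illustrative assumptions.

```python
# OLS adjustment sketch: residuals serve as the adjusted per-segment variant counts.
import pandas as pd
import statsmodels.api as sm

segments = pd.read_csv("segment_counts.tsv", sep="\t")     # columns: count, C, A (assumed)
X = sm.add_constant(segments[["C", "A"]])
fit = sm.OLS(segments["count"], X).fit()
segments["count_adj"] = fit.resid                           # adjusted counts used downstream
print(fit.params)
```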
Distinct base classifications were defined by both coding and noncoding annotations (based on Ensembl 9492) and CADD in silico prediction scores21 (downloaded from the CADD server for all possible SNVs). For each base, we used the maximum possible CADD score (when using the minimum CADD score, results were qualitatively the same). Bases beyond the final base with a CADD score on each chromosome were excluded. This resulted in six distinct types of concatenated segments: high (CADD ≥ 20), medium (10 ≤ CADD < 20) and low (CADD < 10) CADD scores for coding bases, and similarly for noncoding bases. Common (allele frequency ≥ 0.005) and rare (allele frequency < 0.005) variant counts were then calculated across these concatenated segments. Multi-allelic variants and those in regions masked due to accessibility were excluded. Counts in segments at the end of chromosomes were scaled up as in the contiguous analysis.
From the TOPMed freeze 5 dataset, we selected a subset of 1,000 unrelated individuals of African ancestry, 1,000 unrelated individuals of East Asian ancestry and 1,000 unrelated individuals of European ancestry, with the ancestry of each individual inferred across 7 global reference populations using RFMix93. In each of these subsamples, we recalculated the allele counts of each SNV and extracted SNVs that were singletons within that sample, then calculated the distance to the nearest singleton (either upstream or downstream from the focal singleton) occurring within the same individual. Note that a singleton defined here is not necessarily a singleton in the entire TOPMed freeze 5 dataset. We chose to limit the size of each population subsample to n = 1,000 for three reasons: first, to ensure the different population subsamples carried roughly a similar number of singletons; second, to ensure homogeneous ancestry within each subsample so that our analysis of singleton clustering patterns was not an artefact of admixed haplotypes; third, to limit the incidence of recurrent mutations at hypermutable sites, which can alter the underlying mutational spectrum of singleton SNVs in large samples94. Although the TOPMed Consortium sequenced individuals from several other diverse population groups (for example, Samoan, Hispanic/Latino individuals), the majority of these individuals were of admixed ancestry and the singletons from these smaller samples reflected mutations that have accumulated over a longer period of time, so the mutation spectra and genome-wide distributions of these samples would be more susceptible to distortion by other evolutionary processes such as selection and biased gene conversion31.
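A sketch of the per-individual inter-singleton distance calculation follows. The input is an illustrative DataFrame with one row per singleton (columns: individual, chrom, pos), already restricted to a single n = 1,000 population subsample.

```python
# Distance from each singleton to the nearest same-individual singleton on the same chromosome.
import numpy as np
import pandas as pd

def nearest_singleton_distances(singletons: pd.DataFrame) -> pd.Series:
    out = []
    for _, grp in singletons.groupby(["individual", "chrom"]):
        grp = grp.sort_values("pos")
        pos = grp["pos"].to_numpy()
        if len(pos) < 2:
            continue
        gaps = np.diff(pos)
        # nearest neighbour is the smaller of the upstream and downstream gaps
        nearest = np.minimum(np.r_[gaps, np.inf], np.r_[np.inf, gaps])
        out.append(pd.Series(nearest, index=grp.index))
    return pd.concat(out)
```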
To quantify the effects of external branch length heterogeneity on singleton clustering patterns, we used the stdpopsim library95 to simulate variants across chromosome 1 for 2,000 European and 2,000 African haploid samples, using a previously reported demographic model10. Simulations were performed using a per-site, per-generation mutation rate96 of 1.29 × 10−8, and using recombination rates derived from the HapMap genetic map97. Because our aim was to compare these simulated singletons to unphased singletons observed in the TOPMed data, we randomly assigned each of the 2,000 haploid samples from each population into one of 1,000 diploid pairs, and calculated the inter-singleton distances per diploid sample, ignoring the haplotype on which each simulated singleton originated.
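A hedged sketch of such a simulation with the stdpopsim library is shown below. The model identifier ('OutOfAfrica_2T12'), genetic map name and sample configuration are assumptions using the 0.1.x-era API; the published analysis used the demographic model of ref. 10, the HapMap genetic map and a mutation rate of 1.29 × 10−8.

```python
# Illustrative stdpopsim simulation of chromosome 1 for two populations (assumed API/model).
import stdpopsim

species = stdpopsim.get_species("HomSap")
model = species.get_demographic_model("OutOfAfrica_2T12")
contig = species.get_contig("chr1", genetic_map="HapMapII_GRCh37")
samples = model.get_samples(2000, 2000)    # 2,000 African + 2,000 European haploid samples
engine = stdpopsim.get_engine("msprime")
ts = engine.simulate(model, contig, samples, seed=42)
print("simulated variant sites:", ts.num_sites)
```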
The distribution of singletons suggests an underlying nonhomogeneous Poisson process, where the rate of incidence varies across the genome. In other areas of research, it has been shown that the waiting times between events arising from other nonhomogeneous Poisson processes, such as volcano eruptions or extreme weather events, can be accurately modelled as a mixture of exponential distributions98,99. Taking a similar approach, we model the distribution of inter-singleton distances across all Si singletons in individual i as a mixture of K exponential component distributions (fk(di; θi,k)), given by:

Pr(di) = ∑k=1…K λi,k fk(di; θi,k),
where θi,1 < θi,2 < … < θi,K and λi,k = Si,k/Si is the proportion of singletons arising from component k, such that λi,1 + λi,2 + … + λi,K = 1.
We estimate the parameters of this mixture (λi,1, …, λi,K, θi,1, …, θi,K) using the expectation–maximization algorithm as implemented in the mixtools R package100. Code for this analysis is available for download from the GitHub repository101. To identify an optimal number of mixture components, we iteratively fit mixture models for increasing values of K and calculated the log-likelihood of the observed data D given the parameter estimates, stopping at K components if the P value of the likelihood ratio test between K − 1 and K components was >0.01 (χ2 test with two degrees of freedom). The goodness-of-fit plateaued at four components for the majority of individuals, so we used the four-component parameter estimates from each individual in all subsequent analyses.
Now let ki,j indicate which of the four processes generated singleton j in individual i. We calculated the probability that singleton j was generated by process k as:

Pr(ki,j = k | di,j) = λi,k fk(di,j; θi,k) / ∑k′=1…4 λi,k′ fk′(di,j; θi,k′)
We then classified the process-of-origin for each singleton according to the following optimal decision rule, assigning each singleton to its most probable component:

k̂i,j = argmaxk Pr(ki,j = k | di,j)
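A minimal NumPy EM sketch for the K-component exponential mixture and the posterior classification is given below. The published analysis used the mixtools R package; this re-implementation is illustrative and parametrises each component by its mean θk (an assumption; the scale versus rate convention does not change the classification).

```python
# EM for an exponential mixture plus posterior (MAP) component assignment.
import numpy as np

def fit_exp_mixture(d, K=4, n_iter=500, seed=0):
    """d: 1D array of inter-singleton distances for one individual."""
    rng = np.random.default_rng(seed)
    lam = np.full(K, 1.0 / K)                                # mixture proportions λ_k
    theta = np.sort(rng.uniform(0.5, 2.0, K)) * d.mean()     # component means θ_k
    for _ in range(n_iter):
        # E-step: posterior probability that each distance arose from component k
        dens = lam * np.exp(-d[:, None] / theta) / theta + 1e-300
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixture proportions and component means
        lam = resp.mean(axis=0)
        theta = (resp * d[:, None]).sum(axis=0) / resp.sum(axis=0)
    loglik = np.log(dens.sum(axis=1)).sum()
    return lam, theta, resp, loglik

# Usage: lam, theta, resp, ll = fit_exp_mixture(distances)
# component = resp.argmax(axis=1)   # most likely process of origin for each singleton
```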
After assigning singletons to the most likely mixture component, we pooled singletons across individuals of a given ancestry group and counted the number of occurrences in each component in non-overlapping 1-Mb windows throughout the genome. We defined hotspots as the top 5% of 1-Mb bins containing the most singletons in a component in each ancestry group.
In each 1-Mb window, we calculated the average signal for 12 genomic features (H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me3, H3K9ac, H3K9me3, exon density, DNase hypersensitivity, CpG island density, lamin-associated domain density and recombination rate), using the previously described source datasets31. For each mixture component, we then applied the following negative binomial regression model to estimate the effects of each feature on the density of that component in 1-Mb windows:

Ya,k,w ~ NegativeBinomial(μa,k,w),  log(μa,k,w) = β0 + β1X1,w + β2X2,w + … + β12X12,w,

where Ya,k,w is the number of singletons in ancestry subsample a of mixture component k in window w and X1,w, …, X12,w are the signals of each of the 12 genomic features in the corresponding window w.
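A sketch of this per-component regression using statsmodels is shown below. The dispersion parameter is fixed for illustration and the input table is an assumption; the original analysis may have used different software to fit the negative binomial model.

```python
# Negative binomial GLM sketch: 1-Mb-window singleton counts on 12 genomic features.
import pandas as pd
import statsmodels.api as sm

windows = pd.read_csv("component_k_windows.tsv", sep="\t")  # counts + 12 features per window (assumed)
features = ["H3K27ac", "H3K27me3", "H3K36me3", "H3K4me1", "H3K4me3", "H3K9ac", "H3K9me3",
            "exon_density", "DNase", "CpG_island", "LAD", "recomb_rate"]
X = sm.add_constant(windows[features])
nb_fit = sm.GLM(windows["n_singletons"], X,
                family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(nb_fit.summary())
```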
In these analyses, we used 39,722 unrelated individuals who had provided consent for population genetics research. Each individual was grouped into their TOPMed study, except for individuals from the AFGen project, which were treated as one study (Extended Data Tables 1, 2). Individuals from the FHS and ARIC studies who overlapped with the AFGen project remained in their respective studies and were not grouped into the AFGen project. Individuals for whom the population group was either missing or 'other' were removed from the analysis. We then removed all indels, multi-allelic variants and singletons from the remaining 39,168 individuals. Each study was then split by population group. We excluded studies that had fewer than 19 samples from the analysis; however, all 39,168 samples were used when defining singletons for filtering. We used the Jaccard index102, J, defined for each pair of individuals as the number of rare variants (2 ≤ sample count ≤ 100) shared by the two individuals divided by the number in the union of their rare variants, where the sample count indicates the number of individuals carrying the variant as either a heterozygote or a homozygote. We then determined the average J value between and within each study.
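A sketch of the pairwise Jaccard calculation follows. Here `carriers` maps each individual to the set of rare variant IDs (2 ≤ sample count ≤ 100) they carry as a heterozygote or homozygote; the data structure is illustrative.

```python
# Average pairwise Jaccard index over a set of individuals (illustrative input structure).
from itertools import combinations

def mean_jaccard(carriers: dict) -> float:
    values = []
    for a, b in combinations(carriers, 2):
        union = carriers[a] | carriers[b]
        if union:
            values.append(len(carriers[a] & carriers[b]) / len(union))
    return sum(values) / len(values) if values else 0.0
```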
To confirm that J is not biased by sample size, we randomly sampled 500 individuals from each of two studies with European (AFGen and FHS) and African (COPDGene and JHS) population groups in TOPMed freeze 3, without replacement. We then recalculated J between and within these randomly sampled studies, considering alternative allele counts between 2 and 100 within these 2,000 individuals.
We used the RefinedIBD program103 to call segments of identity-by-descent (IBD) sharing of length ≥ 2 cM on the autosomes, using passing SNVs with MAF > 5%. All 53,831 samples were included in this analysis, using genotype data phased with Eagle 2.481. Because logarithm of the odds (LOD) scores for IBD are typically deflated in populations that have experienced strong bottlenecks, we used a LOD score threshold of 1.0 rather than the default of 3.0. To account for possible phasing and genotyping errors, we filled the gap between two IBD segments in the same pair of individuals if the gap was at most 0.5 cM long and contained at most one discordant genotype. Owing to the lower LOD threshold, regions of lower variant density can have an excess of spurious IBD segments; we therefore identified regions with highly elevated levels of IBD using a previously described procedure104 and removed all IBD segments falling entirely within those regions.
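A hedged sketch of the gap-filling rule is given below: consecutive IBD segments for the same pair of individuals are merged when the intervening gap is at most 0.5 cM. The additional requirement of at most one discordant genotype in the gap needs genotype data and is omitted from this sketch.

```python
# Merge IBD segments separated by short gaps (genotype-discordance check omitted).
def merge_ibd_segments(segments, max_gap_cm=0.5):
    """segments: iterable of (start_cm, end_cm) for one pair on one chromosome."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap_cm:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))   # fill the gap
        else:
            merged.append((start, end))
    return merged
```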
We divided the data by study and, within each study, by population group. Eighty individuals without appropriate consent or population-group information were excluded from the analyses of IBD sharing levels and recent effective population size. We computed the total length of IBD segments for each pair of individuals and averaged these totals within each population group in a study and between each pair of population groups. We also estimated the recent effective population size of each group using IBDNe104.
We selected 2,416 samples from TOPMed data freeze 3 that (1) had a high proportion of estimated European ancestry; (2) were unrelated; and (3) were consented for population genetics research. Supplementary Information 1.10 provides further details on the ancestry estimation and filters.
We performed several steps to filter the genome for high-quality neutral sites, based on a previously described ascertainment scheme30 (Supplementary Information 1.10). After filtering, positions in the genome were annotated for selection at linked sites using background selection coefficients (McVicker's B statistic60). We used all sites with a B value for general analyses; however, when performing demographic inference, we restricted the analysis to genomic regions in the top 1% of the genome-wide distribution of B (B ≥ 0.994). These sites correspond to regions of the genome inferred to experience the weakest background selection (that is, the weakest effects of selection at linked sites). Sites in the genome were also polarized into ancestral and derived states using the ancestral annotation from the GRCh37 e71 ancestral sequence. After retaining only polymorphic bi-allelic sites, we had 20,324,704 sites, of which 191,631 positions had B ≥ 0.994. We also identified 91,177 four-fold degenerate synonymous sites (regardless of B) that were polymorphic (bi-allelic) and had high-confidence ancestral and derived states.
We performed demographic inference with the moments105 program by fitting a model of exponential growth with three parameters (NEur0, NEur, TEur) to the site frequency spectrum. This included two free parameters: the onset time of exponential growth (TEur) and the final population size after growth (NEur). The ancestral size parameter (that is, the population size at the onset of growth) was held constant in our model, such that the relative starting size of the population was always 1. We applied the inference procedure to either the four-fold degenerate sites or the sites with B ≥ 0.994. The site frequency spectra used for inference were unfolded, based on the polarization step described above. The inference procedure was run with sample sizes (2N) of 1,000, 2,000, 3,000, 4,000 and 4,832 chromosomes. To convert the scaled genetic parameters output by the inference procedure into physical units, we used the resulting theta (also inferred by moments) and a mutation rate106 of 1.66 × 10−8 to generate the corresponding effective population sizes (Ne). To convert generations into years, we assumed a generation time of 25 years. 95% confidence intervals were generated by resampling 1,000 times and using the Godambe information matrix to generate parameter uncertainties107. A more detailed description is provided in Supplementary Information 1.10.
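A hedged sketch of such an exponential-growth fit in the dadi-style moments API is shown below; the exact function calls, bounds and starting values used in the published analysis are not reported here, so everything in this block is illustrative.

```python
# Illustrative one-population exponential-growth fit with moments (assumed API usage).
import moments

fs = moments.Spectrum.from_file("europe_unfolded.fs")        # unfolded SFS (placeholder path)
ns = fs.sample_sizes

def growth(params, ns):
    # params = (nu, T): nu = N_Eur/N_Eur0, T = onset time of growth in genetic units.
    return moments.Demographics1D.growth(params, ns)

p0 = [10.0, 0.1]                                              # initial guess for (nu, T)
popt = moments.Inference.optimize_log(p0, fs, growth,
                                      lower_bound=[1e-2, 1e-3],
                                      upper_bound=[1e4, 1.0])
model = growth(popt, ns)
theta = moments.Inference.optimal_sfs_scaling(model, fs)      # theta = 4*N_Eur0*mu*L
print("nu, T:", popt, "theta:", theta)
```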
We started with 39,649 unrelated individuals selected from TOPMed data freeze 5 who had consented to population genetics analyses (Extended Data Table 3). Because the singleton density score (SDS) requires thousands of samples and a baseline demographic model, we stratified the data by population group and restricted our SDS analysis to those population groups for which we had a well-documented demographic history: broadly European, broadly African and broadly East Asian. To avoid potential problems introduced by admixture, we required our samples to have more than 90% inferred European, African or East Asian ancestry, as estimated by a seven-way ancestry inference pipeline (Supplementary Information 1.11). This left n = 21,196 European samples, n = 2,117 African samples and n = 1,355 East Asian samples. We specifically excluded Amish samples from the European group because they are a distinct founder population. We analysed each population group separately. Only bi-allelic sites with an unambiguous ancestral state inferred using the WGSA pipeline108 were used. Sites near chromosome boundaries, near centromeres and in regions of poor accessibility were excluded. We performed all demographic history simulations and SDS computations in each population group using previously published R scripts61. We then normalized the raw SDS scores within 1% frequency bins and treated the normalized scores as z-scores, converting them into P values as previously described61. The raw and normalized SDS scores are included in Supplementary Data 2.
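A sketch of the within-frequency-bin normalisation and conversion to P values follows. A two-sided conversion is shown for illustration (whether the published conversion was one- or two-sided follows ref. 61), and the input table is an assumption.

```python
# Normalise raw SDS within 1% frequency bins and convert z-scores to two-sided P values.
import pandas as pd
from scipy.stats import norm

sds = pd.read_csv("raw_sds.tsv", sep="\t")                    # columns: daf, raw_sds (assumed)
sds["bin"] = (sds["daf"] * 100).astype(int)                   # 1% derived-allele-frequency bins
grouped = sds.groupby("bin")["raw_sds"]
sds["sds_norm"] = (sds["raw_sds"] - grouped.transform("mean")) / grouped.transform("std")
sds["p"] = 2 * norm.sf(sds["sds_norm"].abs())                 # treat normalised scores as z-scores
```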
We divided each autosome and the X chromosome into overlapping chunks (each chunk 1 Mb in size, with 0.1 Mb of overlap between consecutive chunks) and phased them using Eagle v.2.481. We removed all singleton sites and compressed the haplotype chunks into M3VCF format. We then concatenated the compressed haplotype chunks for each chromosome to produce the final reference panel.
For all TOPMed individuals, genetic ancestry was estimated using the top four principal components projected onto the principal component space of 938 Human Genome Diversity Project (HGDP) individuals. For each TOPMed individual, we identified the 10 closest individuals among the 2,504 individuals from the 1000 Genomes Project phase 3, based on Euclidean distance in the principal component space estimated by verifyBamID2. If all 10 closest individuals from the 1000 Genomes Project phase 3 belonged to the same super-population—African, admixed American, East Asian, European or South Asian—we estimated that the TOPMed individual also belonged to that super-population. Of the 97,256 reference panel individuals, 90,339 (93%) were assigned to a super-population, with the following breakdown: African, 24,267 individuals; admixed American, 17,085 individuals; European, 47,159 individuals; East Asian, 1,184 individuals; and South Asian, 644 individuals. We randomly selected 100 individuals from the BioMe study described above and selected markers on chromosome 20 that are present on the Illumina HumanOmniExpress (8v1-2_a) array. The selected genotypes were phased and then imputed with Minimac4111 using each of the 1000 Genomes Project phase 3 (n = 2,504), Haplotype Reference Consortium (HRC, n = 32,470) and TOPMed (n = 97,256) reference panels, and imputation accuracy was estimated as the squared correlation coefficient (r2) between the imputed dosages and the genotypes called from the sequence data. Allele frequencies were estimated among all TOPMed individuals estimated to belong to the same super-population, and r2 values were averaged across variants in each MAF category. For the purpose of calculating the average r2, variants present in the 100 sequenced individuals but that could not be imputed were assumed to have r2 = 0. The minimum MAF with r2 > 0.3 was calculated from the average r2 in each MAF category by finding the MAF at which the average r2 crossed 0.3, using linear interpolation. The average number of rare variants (MAF < 0.5%) and the fraction of imputable rare variants (r2 > 0.3) were calculated based on the number of non-reference alleles in the imputed samples above the minimum MAF, according to Hardy–Weinberg equilibrium.
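A sketch of the accuracy aggregation is shown below: per-variant r2 between the imputed dosage and the sequence-based genotype, averaged within MAF bins, followed by linear interpolation to find the smallest MAF at which the average r2 reaches 0.3. The inputs and bin edges are illustrative assumptions.

```python
# Per-variant r2 and MAF-binned interpolation of the r2 = 0.3 crossing point.
import numpy as np
import pandas as pd

def per_variant_r2(dosage: np.ndarray, genotype: np.ndarray) -> np.ndarray:
    """dosage, genotype: arrays of shape (n_variants, n_samples)."""
    d = dosage - dosage.mean(axis=1, keepdims=True)
    g = genotype - genotype.mean(axis=1, keepdims=True)
    r = (d * g).sum(axis=1) / np.sqrt((d ** 2).sum(axis=1) * (g ** 2).sum(axis=1))
    return r ** 2

def min_maf_above(maf, r2, bin_edges, threshold=0.3):
    df = pd.DataFrame({"maf": maf, "r2": r2})
    df["bin"] = pd.cut(df["maf"], bin_edges)
    means = df.groupby("bin", observed=True).agg(maf=("maf", "mean"), r2=("r2", "mean"))
    means = means.sort_values("maf")
    # assumes mean r2 increases with MAF; interpolate the MAF where r2 crosses the threshold
    return float(np.interp(threshold, means["r2"], means["maf"]))
```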
After phasing the UK Biobank genotype data in chromosome chunks using Eagle v.2.481, we converted the phased data from GRCh37 to GRCh38 using LiftOver112. Imputation was performed using Minimac4111.
We compared genotype correlation between the exome sequencing data released by the UK Biobank (SPB pipeline113) and the TOPMed-imputed genotypes. This comparison evaluated 49,819 individuals and 3,052,260 autosomal variants found in both the exome sequence data and the TOPMed-imputed dataset (matched on chromosome, position and alleles, and with imputation quality of at least 0.3 in the TOPMed-imputed data). We divided the variants into MAF bins, where the MAF from the exome data was used to define the bins, and calculated the Pearson correlation within each bin.
We tested single pLOF (nonsense, frameshift and essential splice site) variants85,86 for association with 1,419 phecodes, which were constructed from ICD-10 (International Classification of Diseases, 10th Revision) codes to define cases and controls. The construction of phecodes has been described previously114. We performed association analyses in 'white British' individuals, resulting in 408,008 individuals after applying the following quality control criteria: (1) the sample had not withdrawn consent from the UK Biobank study as of the end of 2019; (2) 'submitted sex' matched 'inferred sex'; (3) phased autosomal data were available; (4) the sample was not an outlier for genotype missingness or heterozygosity; (5) no putative sex chromosome aneuploidy; (6) no excessive number of relatives; (7) not excluded from kinship inference; and (8) included in the UK Biobank-defined 'white British' ancestry subset. For the association analyses, we used the logistic mixed-model test implemented in SAIGE114, with year of birth and the first four principal components (computed from the white British subset) as covariates. For pLOF burden testing, a burden variable was generated for each autosomal gene with at least two rare pLOF variants (n = 12,052 genes) by summing the dosages of rare pLOF variants for each individual. The association of this burden dosage with the 1,419 traits was tested using SAIGE, including the same covariates used in the single-variant tests. For both single-variant and burden tests, we used 5 × 10−8 as the genome-wide significance threshold.
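A sketch of constructing the per-gene pLOF burden variable is given below: for each autosomal gene with at least two rare pLOF variants, the rare pLOF dosages are summed per individual. The actual association testing was performed with SAIGE; the inputs here are illustrative.

```python
# Build a genes x samples burden matrix from a rare-pLOF dosage matrix (illustrative inputs).
import pandas as pd

def plof_burden(dosages: pd.DataFrame, variant_gene: pd.Series) -> pd.DataFrame:
    """dosages: rare pLOF variants x samples dosage matrix; variant_gene maps each variant
    (index of `dosages`) to its gene. Returns a genes x samples burden matrix."""
    by_gene = dosages.groupby(variant_gene)
    burden = by_gene.sum()                                    # summed dosage per individual
    n_variants = by_gene.size()
    return burden.loc[n_variants[n_variants >= 2].index]      # keep genes with >= 2 pLOF variants
```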
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.