66,083个人基因组中的核定线粒体DNA序列_科学

　　我们研究了来自全血DNA的68,348个基因组基因组罕见疾病项目的基因组和来自基因组癌症项目的26,488个癌症基因组。根据基因组学英格兰样本处理指南提取并处理DNA（https://legacy.genomicsengland.co.uk/about-genomics-genomics-genomics-england/the-100000-100000-genomes-project/inemes-project/information-for-gor-gmc-gmc-staff/sample/sample/sample-guidance/）。在流体管（Brooks）中接收DNA样品，并登录到英国BioCentre的实验室管理信息系统（LIMS）。自动化库准备后，使用自动定量PCR对库进行量化，群集和测序。使用Illumina Truseq DNA无PCR高吞吐量样品制备试剂盒或Illumina Truseq Nano高吞吐量样品制备Kit46制备库。

　　英格兰东部剑桥南部国家研究伦理委员会提供的道德批准是参考编号13/EE/0325，参与者为这项批准的研究提供了书面知情同意书。100,000个基因组项目的稀有疾病部门的所有同意参与者均通过国家卫生服务局（NHS）的13个中心招募，涵盖了英格兰所有NHS患者。

　　所有样品均基于测序提供商（Illumina）和基因组学英格兰内部QC检查（https://research-help.genomicsengland.co.uk/display/display/gere/gere/sample+sample+qc）的测序质量和覆盖范围。我们仅包括与Homo Sapiens NCBI GRCH38组件对齐的样品（n = 58,335）。将所有样品测序至少产生85 GB的序列数据，测序质量至少为30个。在丢弃重复项后，对齐覆盖了至少95％的基因组，并具有15倍或上面的95％的基因组（映射质量> 10）。此外，所有包含的样品都通过了一组基本QC指标：（1）样品污染（验证bamid freemix47）< 0.03, (2) ratio of single nucleotide variant (SNV) Heterozygous-to-Homozygous (Het-to-Hom) calls < 3, (3) total number of SNVs between 3.2 M–4.7 M, (4) array concordance > 90%, (5) median fragment size >250 bp，（6）过量的嵌合读数< 5%, (7) percentage of mapper reads >60％和（8）辍学的百分比< 10%. 57,961 genomes were passed WGS QCs. We further excluded the samples with the average depth of mitochondrial genomes below 500x after re-aligned the mitochondrial reads (see details below). For the rare disease genomes study, we included 53,574 individuals, 25,436 male and 28,138 females, age from 0 to 99 years (Extended Data Fig. 1a,b). The average depth of WGS was 42x (s.d. = 7.7x) and average depth of mtDNA was 1,990x (s.d = 866x) (Extended Data Fig. 1c).

　　In the family related analysis, WGS family selection quality checks are processed for rare disease genomes, reporting abnormalities of sex chromosomes and reported versus genetic sex summary checks (computed from family relatedness, mendelian inconsistencies, and sex chromosome checks). For the sex determination, the coverage data for the X and Y chromosomes was compared to the average coverage for the sample autosomes using Plink v1.9048 (www.cog-genomics.org/plink/1.9/). The resulting output is compared with the participant sex provided at sample collection. Relatedness checks were based on verification of the mendelian inconsistencies between members of a trio/family. The individual VCF files were merged into a family VCF with BCFTools (v1.3.1)49 and the mendelian inconsistencies again checked with Plink. The relationships are also checked by calculated genomic identity-by-descent values for all pairwise relationships in a family using Plink and comparing with expected values for reported relationship (https://research-help.genomicsengland.co.uk/). We further processed an independent relatedness check using our previously published method50. In brief, a list of 32,665 autosomal SNPs was selected to estimate relatedness. By filtering the merged VCF and the 1000 Genomes reference set51 with the selected SNPs, the pc-relate function from the GENESIS package was applied to obtain the pairwise relatedness52. The first 20 principal components were used to weight the population structure, and the reference set was used to increase genetic diversity accounted for by the principal component analysis. Finally, we included 8,201 families whose relatedness was consistent between two independent prediction methods and the clinical records.

　　We initially studied 26,488 cancer genomes from Genomics England Cancer Project. Samples were prepared using an Illumina TruSeq DNA Nano, TruSeq DNA PCR-Free or FFPE library preparation kit and then sequenced on a HiSeq X generating 150 bp paired-end reads. Germline samples were sequenced to produce at least 85 Gb of sequences with sequencing quality of at least 30. For tumour samples at least 212.5 Gb was required. Alignments for the germline sample covered at least 95% of the genome at 15x or above with well-mapped reads (mapping quality >10）丢弃重复项（https://research-help.genomicsengland.co.uk/）。

　　对于样品交叉污染检查，使用验证bamid47算法处理种系样品，并将通行状态分配给少于污染的3％的样品。用conpair算法处理肿瘤样品，其通过状态表明污染低于1％，如图所述https://research-help.genomicsengland.co.uk/display/gere/10.+further+reading+and+documentation？preview=/38047056/45023724/cancer%2520a纳尔分析％2520Technical％2520 INFORMATION％2520 document％2520V1-11％2520Main.pdf＃ID-10.FurtherReading和documentation-Technicaldocumentation。

　　在上述QC步骤之后，来自12,509个肿瘤样品和11,913个匹配的正常组织（生殖线）样品的12,509个肿瘤 - 正常组织对仍然存在。使用5种不同的方法（FF，FFPF，CD128分类的细胞，EDTA和抽吸物）和三种不同的文库类型（PCR，PCR，PCR-FFPE和PCR无）制备样品。我们通过比较从不同的方法和库类型制备的样品中检测到的NUMT的平均数量来执行其他QC。我们观察到不同组之间的平均数量数量显着差异（补充图8a）。为了避免样品制备和图书馆类型引起的偏见，我们仅包括使用FF和库中的10,713个肿瘤 - 正常样品对制备，从21种癌症类型中的9648个个体中无PCR（扩展数据）（图6A）。肿瘤样本的平均WGS深度为117倍（S.D. 10.1倍），平均生殖线为43倍（S.D. 9.3倍）（补充图8B）。肿瘤样品的平均mtDNA深度为27,119倍（S.D. 13,642X），平均种系深度为3,549X（S.D. 2,452X）（补充图8C）。

　　使用1000个基因组项目第3阶段（1kgp3）51的种族来估算广泛的遗传祖先作为事实，通过为1kgp3样本生成PC并将所有参与者投射到这些情况下。我们包括五个广泛的超级人群：非洲（AFR），Ambixed American（AMR），东亚（EAS），南亚（SAS）和欧洲（Eur）。简短步骤如下：（1）从1kgp3中选择了所有无关样本，（2）我们在数据集中选择了188,382个高质量的SNP，（3）我们进一步过滤了MAF> 0.05> 0.05 in> 0.05 in 1kgp3（以及在我们的数据中），（4）（4）我们计算了前20个主要组成部分，我们计算了前20个主要数据，我们已将1个组成部分（5）分配给54号（5），（5），我们投入了（5），（5），我们投入了（5），（5），（5），（5），我们投入了（5）。（6）我们训练了一个随机的森林模型，以基于（i）的前8个1kgp3主成分预测祖先，（ii）设置nTrees = 400，（iii）列车和预测1KGP3 AMR，AFR，AFR，EAS，EAR，EUR，EUR和SAS超级人群。完整的详细信息可以在https://research-help.genomicsengland.co.uk/display/gere/ancestry+inference中找到。还使用我们先前发表的方法50对遗传血统进行了预测和检查。未被分配给5个超级人群中任何一个的个人被标记为“其他”。在这项研究中，我们预测了1,280 AFR，170 AMR，342 EAS，5,758 SAS，42,202欧元和3,363欧元（图2A）。在癌症种系基因组中，我们包括312 AFR，17 AMR，71 EAS，338 SAS，8,348 EUR和314其他（扩展数据图6C，D）。

　　我们根据稀有疾病基因组中每个人群独有的数字进行了均匀的歧管近似和投影（UMAP）55。使用RUMAP软件包分析了UMAP，并在R中具有默认参数，并使用R中的M3C软件包进行可视化。

　　使用SamTools57从每个WGS BAM文件中提取了与线粒体基因组对齐的测序读数子集。我们在产生的较小的BAM文件上运行了mtoolbox（v1.0）58，以生成重新对准的mtDNA BAM文件。重新对准的BAM文件用于调用变体。我们还使用第二个变体呼叫者VARSCAN259来调用来自重新对准的BAM文件（-strand-filter 1， - min-var-freq 0.001， - min-readS2 1， - min-min-avg-avg-famex-qual 30）的mtDNA变体。VARSCAN2中使用的MPILEUP文件是由Samtools生成的，其中包括-d 0 -Q 30 -Q 30。从Varscan2提取了等位基因分数。我们仅保留单个核苷酸多态性（SNP），每个链上的每个链读数都超过2个。排除了落入低复杂区域内的变体（66-71、300–316、513–525、3106–3107、12418–12425和16182–16194）。

　　使用Haplogrep260,61进行线粒体DNA单倍群分配。

　　为了检测Numts，我们使用了先前发表和验证的方法5,15。从对齐的WGS BAM文件中，我们使用samblaster62提取了不一致的读取对，并包括读对，其中一端与核基因组对齐，另一端与mtDNA参考序列对齐。丢弃了等于零的映射质量的读数。然后，根据共享相同的方向以及它们是否在500 bp的距离之内，将不一致的读数聚集在一起。我们检测到至少两对不一致读数支持的簇，并在我们的主要分析中滤除了不到五对不一致的读取的簇。将核DNA和mtDNA的1,000 bp的距离内的数字分组为相同的麻烦。我们基于至少两对不一致的读数和至少五对不一致的读取的数字生成了两组NUMTS（补充表1）。我们观察到平均数量和WGS深度的弱相关性（R2 = 0.134，P< 2.2 × 10−16) and mitochondrial genome depth (R2 = 0.092, P < 2.2 × 10−16) (Supplementary Figs. 9a,b) indicating that, although some NUMTs may be missed due to low depth, they are unlikely to have an impact on our conclusions. There was no detected difference of the number of detecting reads with the frequency of NUMTs, suggesting the detection of NUMTs were not biased by the sequencing quality (Supplementary Fig. 9c).

　　To identify putative breakpoints spanning nuclear DNA and a mtDNA-derived sequence (nuclear-mtDNA breakpoints), we searched for the split reads within a distance of 1,000 bp of discordant reads which were then re-aligned using BLAT63. We further analysed the re-aligned reads where one end of the read mapped to nuclear DNA and the other end of the same read mapped to mtDNA-derived sequence. We defined the breakpoints by at least three split reads within the same NUMT. Each NUMT should have one nuclear breakpoint and two mitochondrial breakpoints, with the exception of NUMTs occurring with other nuclear genome structure variations. The breakpoints with 200 bp flanking regions on nuclear genome were annotated using gencode v2964, gnomAD for pIL scores65 and a list of datasets were downloaded from UCSC66 and the publications (see details below). When the NUMTs were involved in multiple genes, we kept the genes with the highest pIL score. The breakpoints on the mitochondrial genome were annotated using MitoMap67.

　　To detect putative concatenated NUMTs, first we searched for the breakpoints spanning two locations on the mtDNA-derived sequence (mtDNA–mtDNA breakpoints). We extracted the split reads which only aligned to mtDNA sequence. Those split reads were further re-aligned using BLAT. We analysed the reads where the two ends of the same read mapped to two locations on the mtDNA sequence. We then filtered the breakpoints as follows: (1) each breakpoint had at least 3 split reads observed in at least one individual, (2) each breakpoint had at least 2 split reads observed in the same individual, (3) we excluded the split reads mapped to nearby the start and end of mtDNA genome (the beginning and end of D-loop region), (4) we excluded two concatenated positions less than 50 bp away (they may be mtDNA deletions). Note our method had its limitations—we were not able to separate mtDNA–mtDNA breakpoints within NUMTs from true mtDNA if the breakpoints located around the beginning and end of D-loop region. Thus, our analysis likely missed the concatenated NUMTs where mtDNA–mtDNA breakpoints around the beginning and end of D-loop region. However, our aim was to detect confident concatenated NUMTs and show concatenated NUMTs exist in the humans. After applying the stringent filtering (above), we detected 8,686 breakpoints from 151 different mtDNA–mtDNA breakpoints in 8,450 individuals (Extended Data Fig. 3d). 279 out of 8,686 breakpoints (140 different breakpoints) from 148 individuals were ultra-rare (frequency < 0.1%). One breakpoint (12867–14977) was exceptionally common (frequency 38.4%), which was also commonly seen in an independent dataset in our previous study5. To confirm mtDNA–mtDNA breakpoints from the nuclear genome, we performed two independent analyses: (1) we compared the mtDNA–mtDNA breakpoints observed in the offspring and their two parents. If the mtDNA–mtDNA breakpoints were present in the offspring and their fathers, but not in their mothers, we defined them as father-transmitted mtDNA–mtDNA breakpoints. If the mtDNA–mtDNA breakpoints were present in the offspring and their mothers, but not in their fathers, we defined them as mother-transmitted mtDNA–mtDNA breakpoints. Note we were not able to identify the transmission patterns if the mtDNA–mtDNA breakpoints were present in all three family members using the short-read sequencing technique. (2) For the rare and ultra-rare mtDNA–mtDNA breakpoints (F < 1%), we checked whether the individuals carrying the same mtDNA–mtDNA breakpoints also carried the same NUMT.

　　Known NUMTs were downloaded from UCSC and previous publications16,17,18,19. Bedtools49 was used to search for the known NUMTs in our dataset. Using a conservative approach, we defined the NUMTs as known providing the known NUMTs within 1,000 bp NUMT flanks (upstream 500 bp + downstream 500 bp) detected in this study on the nuclear genome, regardless of the fragments of inserted mtDNA sequences.

　　For the enrichment analysis on both nuclear and mtDNA genomes, we studied 1,637 different confident NUMTs with at least 5 discordant reads using a 2-tailed permutation test. Genomics duplications, simple repeats, dbRIP_HS-ME90, regulatory elements, CpG islands, satellites, retrotransposons (including LINEs and SINEs) and TSS were downloaded from UCSC66 (https://genome.ucsc.edu/). Using this information to compute the frequency of each dataset in 200 bp NUMT flanks (upstream 100 bp + downstream 100 bp). Empirical P values were calculated by resampling 1,000 sets of random positions matched to observed NUMTs. For the enrichment on each nuclear genome chromosome, we excluded the Y chromosome due to the complex duplicated structure of Y chromosome sequences limiting confident alignment.

　　To investigate the relationship between different chromosomes and NUMTs, we applied linear regression in R (http://CRAN.R-project.org/)68.

　　where Nnumt is number of NUMTs detected in each chromosome, Lchr is the length of chromosome, Pcentro, Pcpg, Pline, Pltr, Pretroposon, Psine, Pmicrosat, Prmsk, Prepeats, Pdups and Preg are log2-transformed proportions of centromere, CpG islands, LINES, LTRs, retroposon, SINEs, microsatellites, repeats, simple repeats, genomics duplications and regulatory elements on each chromosome.

　　To study the relationship between NUMT insertion and mitochondrial deletion, we compared the frequency of NUMT breakpoint with the frequency of mitochondrial DNA deletion breakpoint. A list of 1,312 mtDNA deletions were downloaded from mitoBreak database69. We calculated the frequencies of breakpoints in different mtDNA regions—D-loop, 13 coding genes, 2 RNAs and combined 22 tRNAs, and compared the distribution with the distribution of breakpoints for germline and tumour-specific NUMTs using linear regression.

　　We used the most conservative methods to define the de novo NUMTs from father–mother–offspring trios. We only included NUMTs with at least five pairs of discordant reads in the offspring and none of discordant read detected in the parents.

　　We applied for the same approach to define tumour-specific NUMTs in cancer genomes. Tumour-specific NUMTs were defined by at least five pairs of discordant reads in the tumour samples and none of discordant reads in the matched normal samples. Lost NUMTs in cancer genomes were defined by at least five pairs of discordant reads in the normal samples and no more than one pair of discordant reads in the matched tumour samples.

　　De novo NUMT insertion rate in trios and cancer genomes was estimated as follows:

　　where ρ(germline) is the rate of de novo NUMT insertion in trios, ρ(tumour) is the rate of tumour-specific NUMT insertion in tumour samples, NumtTtrio is the number of de novo NUMT event in trios, NumtTumour is the number of tumour-specific NUMTs, Ntrio is the number of total trios and Ngenome is the number of total normal–tumour pairs.

　　To understand the relationship between donor age, sex and the average number of NUMTs, we applied linear regression to each dataset using R (http://CRAN.R-project.org/).

　　 Model 1 < − lm(N Age + Sex + DPmt)

　　 Model 2 < − lm(Nsoma Age + Sex + DPmt)

　　Where N and Nsoma are average numbers of NUMTs and tumour-specific NUMTs, Age is donor age, Sex is donor sex and DPmt is average mitochondrial DNA sequencing depth.

　　Read alignment against human reference genome GRCh38-Decoy+EBV was performed with ISAAC (version iSAAC-03.16.02.19)70, SNVs and short insertions–deletions (indels) variant calling together with tumour − normal subtraction was performed using Strelka (version 2.4.7)71. Strelka filters out the following germline variant calls: (1) all calls with a sample depth three times higher than the chromosomal mean, (2) site genotype conflicts with proximal indel call, (3) locus read evidence displays unbalanced phasing patterns, (4) genotype call from variant caller not consistent with chromosome ploidy, (5) the fraction of basecalls filtered out at a site >0.4，（6）基因座质量得分< 14 for heterozygous or homozygous SNP, (7) locus quality score < 6 for heterozygous, homozygous or het-alt indels, (8) locus quality score < 30 for other small variant types or quality score is not calculated. Strelka filters out the following somatic variant calls: (1) all calls with a normal sample depth three times higher than the chromosomal mean, (2) all calls where the site in the normal sample is not a homozygous reference, (3) somatic SNV calls with empirically fitted VQSR score < 2.75 (recalibrated quality score expressing the phred scaled probability of the somatic call being a false positive observation), (4) somatic indels where fraction of basecalls filtered out in a window extending 50 bases to either side of the indel’s call position is >0.3，（5）具有质量得分的躯体插入< 30 (joint probability of the somatic variant and a homo ref normal genotype), (6) all calls that overlap LINE repeat region.

　　Structural variants (SVs) and long indel (>50 bp）使用Manta（版本0.28.0）72进行呼叫，该命令结合了配对和分读证据以发现和分数。使用Canvas（版本1.3.1）73调用副本号变体（CNV），该副本使用覆盖范围和次要等位基因频率来分配复制号。这些工具过滤了以下变体调用：（1）蒙塔名式SVS的样本深度接近一个或两个变体断裂端的SV，是染色体平均值的三倍，（2）具有躯体质量得分的麦田群岛的SV< 30, (3) Manta-called somatic deletions and duplications with length >10KB，（4）卫生员的躯体小型变体（<1kb) where fraction of reads with MAPQ0 around either break-end >0.4，（5）帆布称为长度< 10kb, (6) Canvas-called somatic CNVs with quality score < 10. The full details of bioinformatics pipeline can be found at https://research-help.genomicsengland.co.uk/pages/viewpage.action?pageId=38046624.

　　PRDM9 determines the locations of meiotic recombination hotspots where meiotic DNA DSBs are formed. To investigate the mechanism of NUMT insertions, we compared the NUMTs with a set of 170,198 published PRDM9-binding peaks cross the genome74. We counted the number of NUMTs overlapping PRDM9-binding peaks and performed the permutation analysis (see the details in ‘Enrichment analysis’). Next, we calculated the distance between the breakpoint of each NUMT (from both the germline and tumour-specific NUMTs) with the nearest PRDM9-binding site.

　　A list of known human DNA repair genes was downloaded from Human DNA Repair Genes website (https://www.mdanderson.org/documents/Labs/Wood-Laboratory/human-dna-repair-genes.html)38,39. We extracted the somatic missense mutations in DNA repair genes from all cancer samples, and compared the relationship between samples carrying the mutations and tumour-specific NUMTs.

　　Somatic mutation signatures are the consequence of multiple mutational processes that the human body is subjected to throughout life. Each different process generates a unique combination of mutation types that are called mutation signatures (https://cancer.sanger.ac.uk/signatures/signatures_v2/). Mutational signature was computed using the R package nnls (https://CRAN.R-project.org/package=nnls). The details of how the signatures were computed is described in Alexandrov et al., 201375 and online document https://research-help.genomicsengland.co.uk/pages/viewpage.action?pageId=38046624.

　　The Genomics England PanelApp (https://panelapp.genomicsengland.co.uk/)76 list of genes and genomic entities were used to provide a list of potential disease genes (N = 5,883). NUMTs were identified that had a frequency of < 1%, and their breakpoints within 200 bp flanking regions of one of these genes. Consequence annotation was done with gencode v29, including gene, intron, exon, CDS, start codon, stop codon, five prime UTR and three prime UTR regions64. NUMTs which were annotated as falling in an exon were analysed in detail. For each gene, we considered the strength of evidence that the gene is associated with a disease, the inheritance pattern of the disorder, the reported types of pathogenic variants and reported mechanism of disease (for example, haploinsufficiency, gain of function or repeat expansion), using information from OMIM (https://omim.org/)77 and by searching PubMed (https://pubmed.ncbi.nlm.nih.gov/). For the established disease genes, we considered available clinical information for each proband which included their Human Phenotype ontology terms91, family history and age at enrolment. We assumed that the rare NUMT was present on one allele only, unless it was present in both parents or there was documented consanguinity (where parental data was not available). For recessive disorder genes containing a NUMT, we looked whether it was present in one or both parents (if available), whether there was a family history of consanguinity, and at the sequence data to see whether there was a second rare variant. The location of the NUMT insertion was explored in UCSC genome browser66.

　　Participants with a confirmed genetic diagnosis were identified from the Genomic Medicine Centre exit questionnaire (https://research-help.genomicsengland.co.uk/pages/viewpage.action?pageId=38046767). Genomic coordinates of the causative variant were compared with the genomic coordinates of the NUMTs using bedtools49.

　　Participants with mitochondrial DNA maintenance disorders78 were identified from the Genomic Medicine Centre exit questionnaire and from our previous analysis of participants with suspected mitochondrial disorders79. We also identified affected family members who had genome sequencing data available. 122 NUMTs were detected from 20 individuals. only 4 NUMTs (2 different NUMTs) from two families in exons. We compared the average number of NUMTs in these participants to the rest of the rare disease participants.

　　To determine whether a NUMT insertion was a driver mutation in the development of cancers, NUMTs with 200 base pairs flanking region were identified which were located genes of interest. Our genes of interest were defined as those on the COSMIC (Catalogue of Somatic Mutations in Cancer) Cancer Gene Census list (tier 1 and tier 2) which includes genes known to contain mutations causally implicated in cancer28. We also used a list of known human DNA repair genes38,39. The location of the NUMT insertion in relation to these gene lists was explored in the UCSC genome browser.

　　To validate NUMT detection in short-read sequencing, we carried out whole-genome sequencing on Oxford Nanopore PromethION in 39 individuals from rare disease genomes. To maximize sequencing yield, 4 μg of germline DNA from 100KGP participants was fragmented to 15–30 Kb with Covaris g-tubes (4,000 rpm, 1 min, 1–3 passes until the desired length was achieved) and then depleted of low molecular weight DNA (<10 Kb) with the Short Read Eliminator kit (Circulomics, SS-100-101-01) as described by the manufacturer. After checking DNA size distribution on an Agilent Femto Pulse system, a sequencing library was generated with the Oxford Nanopore SQK-LSK109 kit, starting from 1.2 µg of high molecular weight-enriched DNA. Samples were quantified with a Qubit fluorometer (Invitrogen, Q33226) and 500 ng loaded onto a PromethION R.9.4.1 flow cell following manufacturer’s instructions. In experiments where throughput was limited by a rapid increase in unavailable pores, the library was re-loaded following a nuclease flush ~20hrs after the initial run. base-calling was performed with Guppy-3.2.6/3.2.8 in high accuracy mode. Full details of the protocol can be found at https://research-help.genomicsengland.co.uk/display/GERE/Genomic+Data+from+ONT?preview=/38046759/38047942/v1_protocol_ONT_LSK109.pdf. Sequencing reads were aligned to GRCh38 using minimap280 version 2.17. QC statistics and plots were generated using Nanoplot81 version 1.26.0. The full details of bioinformatics pipeline can be found at https://research-help.genomicsengland.co.uk/display/GERE/Genomic+Data+from+ONT?preview=/38046759/38047944/PromethION%20SV%20calling%20pipeline%20GRCh38.docx. We then extracted the long reads aligned to the same region where a NUMT detected using short-read sequencing from the same individual. The extracted long reads were re-aligned using BLAT. The observed NUMTs were also manually inspected on Integrated Genomics Viewer (IGV)82. 182 out of 184 NUMTs (29 out of 31 distinct NUMTs) detected using short-read sequencing were also seen in long-read sequencing data. Two NUMTs from the same individual were missing in long-read sequencing likely due to the low number of aligned reads in long-read sequencing.

　　Whole-genome-wide methylation detection was carried out using call-methylation function from Nanopolish v0.13.383 in 39 individuals. The methylation detection output includes the position of the CG dinucleotide on the reference genome, the ID of the read that was used to make the call, and the log-likelihood ratio. We extracted the long reads mapped to mtDNA genome, and further grouped them into two groups: (1) long reads also mapped to nuclear genome, (2) long reads only mapped to mtDNA genome. Next, we calculated methylation frequency of each site using the calculate_methylation_frequency.py script from the package in each read group. The methylation calls detected by the 1st group were from NUMTs, and the calls detected by the 2nd group were from true mtDNA. We used the methylation profile of true mtDNA as reference, and NUMTs methylation was estimated as the log2 ratio of methylation frequency of each site between NUMTs and true mtDNA from the same individual. Note, if the individuals carried concatenated NUMTs, the calls detected by 2nd group were from mixed true mtDNA and concatenated NUMTs. We were not able to separate the long reads mapped to the middle of concatenated NUMTs where the reads also only mapped to mtDNA genome and true mtDNA genome.

　　In this analysis, we focused on the concatenated NUMTs and the large NUMTs where long reads were confidently aligned to NUMTs. We only included the calls with at least 3 reads mapped to NUMTs and at least 10 reads mapped to true mtDNA sequences. We also used 4 reads, 5 reads, 6 reads, 7 reads, 8 reads 9 reads and 10 reads as the cut-offs to detect NUMTs methylation. We observed the same distribution of methylation frequency across different cut-offs (Fig. 3a), indicating read-thresholds did not affect our results.

　　We performed a de novo assembly of all 335,891 NUMTs detected in this study. The steps of processes were: (1) we clustered the discordant reads detected from each NUMT in the same individual. (2) The consensus sequence of NUMT contig was generated using CAP384. (3) The contigs were then aligned against mitochondrial reference genome85 using Blat63 and Clustal Omega86. (4) The aligned sequences from Clustal Omega were used to detect the nucleotide changes between NUMT sequences and mitochondrial reference genome sequences using BioPython87. To ensure the confident calls, we applied the additional filtering as follows: (1) we only included NUMTs shorter than 1,000 bp; (2) we excluded the variants within 5 bp of NUMT breakpoints; (3) we removed the variants where the aligned reference allele were different from mtDNA reference genome at the same position; (4) we only included single nuclear variations; (5) we excluded the individuals carrying many more variants than the overall population (> 平均变体 + 3×s.d.）。

　　为了定义NUMT特异性变体，我们应用了其他过滤：（1）我们排除了携带相同常见或稀有麻子的50％以上的个体和75％的个体，携带相同的超稀有麻烦。这种严格的过滤策略旨在提供最大的信心，即在将NUMT序列插入核基因组之后，任何NUMT特异性变体都很可能发生，从而损害了分析的灵敏度。（2）我们排除了仅在1个个体中检测到的变体，以最大程度地减少测序误差的可能性；（3）为了获得最自信的Numt特异性突变，我们仅包括在同一家族的至少两个人中检测到的变体。在主要文本中，我们报道了3组Numt特异性变体。总计A组，在应用步骤（1）之后；子组B，一个步骤（2）之后；和子组C，一个步骤（3）。

　　使用先前描述的方法估算了NUMT的年龄19。我们使用Clustal Omega从人，黑猩猩和每个Numt Contig的共有序列中对齐线粒体序列。来自黑猩猩的祖先线粒体序列是从Ensembl（Pan_tro_3.0）下载的。对齐的序列用于使用Biopython产生核苷酸变化。我们计算了将人类等位基因与人类和祖先线粒体序列在每个NOMT区域不同的地点总数与位点的总数相匹配的比率。相对于估计的人类 - 奇曼Zee差异时间为600万年，该比率用于得出每个数字的大约年龄。为了确保有信心的结果，我们应用了以下滤波：（1）我们仅包括长度在50至1,000 bp之间的数字；（2）我们排除了人和黑猩猩之间没有不同等位基因的数字；（3）从超过50％的携带相同麻烦的个体和至少2个人中估计了年龄。应用此过滤后，我们排除了仅在一个人中看到的所有私人发药。（4）我们排除了串联的数字。

　　本研究中的所有统计分析均在文本中提出，并使用R68（http://cran.r-project.org/）和Python（http://wwww.python.org）进行。使用Python中的R和Matplotlib（https://matplotlib.org）生成图。使用Circos（http://circos.ca/)88制造了Circos地块。使用ChromoMAP89制作染色体图。

　　使用Shiny V1.7.1（https://cran.r-project.org/package=shiny）开发了在本研究中检测到的NOMTS的Web界面。

　　通过https://wwei.shinyapps.io/numts/在本研究中检测到的NOMTS。

　　有关研究设计的更多信息可在与本文有关的自然研究报告摘要中获得。

左文资讯声明：未经许可，不得转载。