2025-06-23 17:29来源:本站
小心选择了该项目中包含的分离株,以代表整个物种的酿酒酵母。补充表1列出了所有孤立细节,包括生态和地理起源,提供者和参考。我们通过包括葡萄酒和清酒发酵,酿造和乳制品,以及诸如土壤,昆虫,昆虫,昆虫,昆虫,树木和水果等自然环境,最大程度地提高了孤立的生态起源。地理起源也高度多样化,并且具有全球分布(补充表1)。除了研究实验室和酵母收集提供的918个分离株外,我们还包括了以前的研究中测序的93个菌株6,7,8,在这项研究中总共分析了1,011个样本。在测序之前,我们试图将分离株保持在自然状态,以提供全球杂质性和杂交水平的全球图景。但是,在918个选定的分离株中,有124个单倍体的单倍体被删除,并在测序之前对93个外部分离株进行了遗传操纵并制成纯合子。
酵母细胞培养物在30°C的15 mL YPD培养基到早期固定期生长过夜。随后,使用Masterpure酵母DNA纯化试剂盒和基因组Illumina HISEQ 2000测序文库提取总基因组DNA,用于918菌株,插入尺寸在300至600 bp之间。每个Illumina HISEQ2000车道都有十个库是多路复用的,并进行了配对的测序,产生了102个基部的读数。
内部质量控制过程应用于通过Illumina质量过滤器的读取。从读取和低质量核苷酸中除去光明的测序适配器和引物序列(Q< 20) were discarded from both ends of the reads. Reads shorter than 30 nucleotides after trimming were removed. These trimming and removal steps were achieved using software designed in-house, based on the FastX package. The last step identifies and discards read pairs that mapped to the phage phiX genome (GenBank code: NC_001422.1) using SOAP51. A total of 3.35 Tb of high-quality genomic sequence was generated with a mean sequencing depth of 232 × per isolate (ranging from 50 × to 1,014 ×, Supplementary Fig. 1c). For the publically available Illumina paired-end reads related to 93 strains (see ‘S. cerevisiae sequenced isolates’), the mean sequencing depth is 169 × (from 20 × to 570 × , Supplementary Fig. 1c).
For each isolate, the reads were mapped to the S. cerevisiae S288C reference genome (version R64-1-1) with Burrows–Wheeler Aligner (BWA v.0.7.4-r385)52, using default parameters. Duplicated reads were marked with Picard-tools (v.1.124) (http://picard.sourceforge.net) and local realignment around indels and variant calling were performed with GATK (v.3.3-0)53. Default parameters were applied except for the realignment step (GATK IndelRealigner), in which the following parameters were set: ‘–maxReadsForConsensuses 500–maxReadsForRealignment 40000–maxConsensuses 60–maxPositionalMoveAllowed 400–entropyThreshold 0.2’. GATK Variant Annotator was run to add allele balance information in the.vcf files.
The natural ploidy of the 794 natural isolates (see ‘S. cerevisiae sequenced isolates’), as well as their aneuploidy and segmental duplication content were investigated by combining three complementary approaches:
First, measurement of the cell DNA content using high-throughput flow cytometry: DNA content was analysed using a propidium iodide (PI) staining assay. Cells were first pulled out from glycerol stocks in liquid YPD in 96-well plates (30 °C, overnight). Five microlitres of the culture were transferred into 195 μl of fresh YPD and incubated for 8 h at 30 °C. Then, 3 μl were taken and resuspended in 100 μl of cold 70% ethanol. Cells were fixed overnight at 4 °C, washed twice with PBS, resuspended in 100 μl of staining solution (15 μM PI, 100 μg/ml RNase A, 0.1% v/v Triton-X, in PBS) and finally incubated for 3 h at 37 °C in the dark. Ten thousand cells for each sample were analysed on a FACS-Calibur flow cytometer using the HTS module for processing 96-well plates. Cells were excited at 488 nM and fluorescence was collected with a FL2-A filter. The distributions of both FL2-A and FSC-H values have been processed to find the two main density peaks, which correspond to the two cell populations in G1 and G2. The peaks were detected using the densityClust R package after removing the cells reaching the FACS saturation (either FLS-A or FSC-H values equal to 1,000). We categorized the values of FLS-A, which correlate with the DNA quantity, to estimate the ploidy according to the following scheme: strains with G1 cell values between 39 and 181 and G2 values between 148 and 255 were labelled as haploid; strains with G1 cell values between 145 and 265 and G2 values between 295 and 500 were labelled as diploid; strains with G1 cell values between 245 and 355 and G2 values between 500 and 700 were labelled as triploid; strains with G1 cell values between 295 and 500 and G2 values between 700 and 905 were labelled as tetraploid; strains with G1 cell values between 395 and 605 and G2 values over 905 were labelled as over 4n; strains with other combinations of values have been manually evaluated.
Second, study of sequencing coverage: systematic analysis of the coverage depth along the genome was performed with 1-kb non-overlapping sliding windows, which enabled the survey of chromosomal copy number variations as well as segmental duplications. The ratio between the coverage of the aneuploid chromosomes and the rest of the genome was also used to validate the ploidy of isolates.
Third, investigation of the allele balance ratio associated with heterozygous SNPs, as heterozygous sites should fit an expected range of ratios according to the copy number of the chromosome being considered (see ‘SNPs filtering and matrix’).
The precise locations of segmental duplications were manually investigated in the .vcf files (Supplementary Table 16).
For each sample, variants were first called with GATK HaplotypeCaller (see ‘Reads mapping and variant calling’). At this stage, isolates with less than 5% of heterozygous sites (average percentage of heterozygous sites detected in a sample of 104 haploid and/or homozygous diploid isolates) were considered as homozygous. The raw files were then post-processed to deal with highly confident variants to be included in our complete SNPs matrix, based on both coverage and allele balance information:
First, a minimal coverage depth of 50 × was required for a SNP to be retained for the 918 isolates that were sequenced in this study; this coverage depth was lowered to 10 × for the other 93 previously sequenced isolates.
Second, for the haploid and homozygous isolates (< 5% of heterozygous sites), the fraction of heterozygous SNPs detected was considered as representing false positives and was therefore filtered out.
Third, for heterozygous isolates, heterozygous sites were filtered according to their allele balance ratio (ABHet). The thresholds for allele balance ratios were determined according to the allelic frequency distribution all over the heterozygous samples at each level of ploidy (from 2n to 5n). A heterozygous site was rejected when its ratio did not fit the expected range according to the copy number of the chromosome being considered (or region, in the case of segmental duplication).
The joint calling method of GATK was run with the cleaned .vcf files to create a complete genotyping matrix (.gvcf format, see ‘Data availability’). This matrix of SNPs included 1,625,809 segregating sites across our 1,011 isolates (Supplementary Table 1).
SnpEff (v.4.1)54 was used to annotate and predict the effect of the variants. Non-synonymous SNPs predicted as deleterious by SIFT (v.5.2.2)17 as well as nonsense mutations were considered as deleterious for protein function. Insertions and deletions were considered to cause frame shifts when their sum produced a number not divisible by three in a single gene.
A sequence representative of each isolate was constructed by inserting these filtered SNPs into the reference sequence with GATK FastaAlternateReferenceMaker (see ‘Data availability’).
We used Abyss software (v.1.5.2)55 with the option ‘-k 64’ to produce the de novo assemblies (see ‘Data availability’). The pre-assembly filtering step was performed with condetri v.2.2 to remove the 6 bases closest to the 5′ end and to discard low-quality 3′-end bases of reads. The resulting assemblies had a median N50 of 136 kb and a median number of contigs of 3,259. The median length of the genome is 12.1 Mb and the median GC content is 38.06 (Supplementary Table 17).
We set up a custom pipeline to identify non-reference genome material. Each genome was aligned to the reference sequence (S288C, version R64-1-1) using blastn with the following settings: ‘-gapopen 5 -gapextend 5 -penalty -5 -reward 1 -evalue 10 -word_size 11 -no_greedy’.
The CDH and CFH strains were excluded from the identification of non-reference genome material owing to the presence of Staphylococcus epidermis contamination. The sequences aligned with an identity greater than 95% were divided in three categories to be further processed. If the aligned sequence belonged to contigs shorter than 100 bp or if the aligned sequence was up to 200 bp and belonged to a contig with a length that was shorter than the length of the alignment plus 75 bp, the contig was discarded. If the aligned sequence was in the range 200–1,500 bp only the aligned sequence was discarded. If the aligned sequence was longer than 1,500 bp, it was divided into segments of 250 bp. Each sub-sequence was aligned again to the reference and discarded if found with an alignment identity of over 95% on an alignment length of at least 187 bp (75% of the subsequence length). After this step the relative position of the retained sequence has been evaluated. If two or more of them belonged to the same contig and were separated each other by less than 100 bp, the sequence from the starting of the first one to the end of the final one was kept as a whole. In the subsequent step, all the kept sequences from the 1,011 genomes were sorted for length in decreasing order. The set of sequence was then aligned against itself (with the same criteria as the first step) to eliminate repeated elements. When two sub-sequences were found to have an alignment identity of over 95%, the one belonging to the shorter sequence was eliminated. The process led to 12,325 sequences for 9.3 total Mb.
To annotate ORFs in dispensable regions, we set up an integrative yeast gene annotation pipeline by combining different existing annotation approaches, which gave rise to an evidence-leveraged final annotation24. We independently ran the three individual components (RATT56, YGAP57 and MAKER58) for gene annotation, and subsequently integrated their results using EVM59. Proteomes of the Saccharomyces species (S. cerevisiae, S. paradoxus, S. mikatae, S. kudriavzevii, S. arboricolus, S. uvarum and S. eubayanus) were retrieved and used in our annotation pipeline to provide protein alignment support for annotated gene models60,61.
We compiled the pangenome by adding the 2,245 non-reference ORFs annotated here to the 6,713 genomic reference ORFs listed in the set ‘orf_genomic_all’ from the Saccharomyces Genome Database (SGD) database (updated 13 January 2015, https://downloads.yeastgenome.org/sequence/S288C_reference/orf_dna/). Three RDN genes were also added (RDN18-1, RDN25-1 and RDN37-1) from the set ‘rna_genomic’ available from the SGD database (https://downloads.yeastgenome.org/sequence/S288C_reference/rna/). We applied a graph-based pipeline to this set of ORFs, to remove duplicate and closely related sequences. This step also removed overlapping ORFs present in the ‘orf_genomic_all’ SGD dataset. A disconnected graph was created in which each node is an ORF and each edge represents an alignment identity of over 95% on at least 75% of the sequence of the smaller ORF in the couple. Each connected subgraph represents a single ORF family. For each of these families a representative has been chosen. The connectivity has been computed for each node. The first choice for the representative was the most central, non-dubious reference ORF, if any of them were present. The second choice was the most central reference ORF; if only non-reference ORFs were present in the family, the most central of these was taken.
This led to a catalogue of 7,796 non-redundant ORFs (see ‘Data availability’), which represent the S. cerevisiae pangenome (Supplementary Table 3). Among the reference ORFs, the number of similar ORFs collapsed for each final ORFs has a wide range (up to 67, which is the cluster of Ty1 elements), although usually this number does not exceed 2. Other large clusters are the Y′ elements one (59 ORFs) and the Ty2 elements (27 ORFs). Out of the 6,713 reference ORFs, 5,679 were not redundant and 1,032 were collapsed into 402 unique ORFs; 89% of these unique ORFs (n = 357) are duplicated ORFs.
To assess the copy number of each ORF of the pangenome, we mapped the reads from each strain to the pangenomic ORFs with BWA,using default parameters and the option –U 0. The result was then filtered using samtools view with the options –bSq 20 –F 260. The median coverage for each ORF was taken as coverage for the ORF in the specific isolate. The ratio between the values of individual ORFs and the values of genome coverage on the reference of the isolate (as the median of the median coverage for each nuclear chromosome) was considered as the copy number for the haploid genome (see ‘Data availability’).
The mapping was also used as a confirmation step for the presence of the ORFs in each strain, leading to the identification of 4,940 ORFs present in the 1,011 strains of the collection, representing the core genome plus 2,856 ORFs present in different subsets of the population. Fifty-two ORFs were removed because they were present in single strains with low coverage (~10% of the genome wide coverage) and were likely to be contaminations from Escherichia coli and Clavispora lusitaniae. Eighty-nine other ORFs that did not have sufficient coverage were kept in the pangenome, but were not used for subsequent analyses owing to poor mapping. Three of the core ORFs are present but not annotated in the S288C reference and were annotated by our annotation pipeline as 584-snap_masked-1700-AIE_1, 610-snap_masked-2999-BGP_1 and 611-snap_masked-3001-BGP_1.
To evaluate the difference between domesticated clades and wild clades, we normalized the data by calculating the clade copy-number median for each ORF to avoid sample bias. The distributions of medians in the domesticated and wild clades were then compared using the Mann–Whitney–Wilcoxon test (R function wilcox.test) (Supplementary Fig. 28 and Supplementary Tables 18, 19).
We constructed a local ORFs database for 57 representative species that deeply probed both closely related Saccharomyces species as well as a highly divergent yeast species62 (Supplementary Table 20). In addition, we added the ORFs of 12 representative S. cerevisiae and S. paradoxus strains with complete genome sequenced by long reads (PacBio)24. For each annotated variable ORF, we first performed a BLASTN search (‘-evalue 1E-6’) against this local ORFs database to find its best hit. ORFs without hits in our local yeast ORFs database underwent a further round of BLAST searching (-evalue 1E-6) against the NCBI non-redundant database (ftp://ftp.ncbi.nlm.nih.gov/blast/db/). based on the sequence identity and query coverage of the top hits, we classified the variable genes into different categories.
For all isolates, sequences of the protein-coding genes were inferred from the filtered SNPs and inserted into the reference sequence with GATK FastaAlternateReferenceMaker. For each gene, the coding sequences were aligned and the ratio of nonsynonymous to synonymous polymorphisms (dN/dS) was computed with the yn00 program in PAML software63. Median values were used for comparison.
The 1,544,489 biallelic segregating sites were used to construct a neighbour-joining tree (Fig. 1), using the R packages ape and SNPrelate. The .gvcf matrix was first converted into a .gds file and individual dissimilarities were estimated for each pair of individuals with the snpgdsDiss function. The bionj algorithm was then run on the distance matrix that was obtained.
The genomic content distance (see ‘Data availability’) has been calculated as the number of ORF differences in the pangenome presence/absence profile (that is, the number of ORFs present in only one strain for each pairwise strain comparison) (see ‘Data availability’).
As an estimate of the scaled mutation rate, we computed π, the average pairwise nucleotide diversity θw, the proportion of segregating sites and Tajima’s D value, which represents the difference between π and θw.
Variscan 2.064 was run (runmode = 12, 10-kb non-overlapping windows) on multiple alignments of the concatenated chromosomes that were representative of the isolates.
Model-based ancestry estimation was performed on the biallelic SNPs using ADMIXTURE v.1.2321 in unsupervised mode.
Principal components analysis on the biallelic SNPs was performed using EIGENSOFT v.6.0.1. The ‘-w’ argument was used to calculate the principal components using only a subset of the samples, with the remaining samples then being projected onto the resulting components.
The matrix of presence/absence of ORFs in the population has been analysed using the discriminant analysis of principal components (DAPC) algorithm implemented in the R package adegenet 2.0.1. DAPC describes clusters by maximizing the between-cluster variance while minimizing the within-cluster variance. The number of components retained for the principal component analysis calculation was 150, accounting for > 88% of total variance. For the subsequent DAPC calculation, the alpha-score indicates 25 as the optimal number of discriminant principal components to be retained. Clustering was performed using the K-means with different number of groups (n = 5, 10, 15, 20, 25, 30, 35, 40, 45 and 50).
The Plink package65 was used to compute r2, the correlation coefficient between pairs of loci that stands as a measure of association for linkage disequilibrium. All pairs of polymorphic sites were investigated through .map and .ped files generated with vcftools66, excluding SNPs with a MAF lower than 5%.
We averaged r2 based on the SNP distance (100-bp intervals) over 25-kb regions and calculated the half-length of r2, which is the distance at which linkage disequilibrium decays to half of its maximum value.
Heterozygous isolates were investigated for LOH regions with an R script generated in-house (see ‘Data availability’). Regions over 50 kb with less than 10 heterozygous sites per 50 kb were considered to be under LOH (Supplementary Table 7).
To construct the tree, we used 22 S. cerevisiae isolates representative of the species genetic diversity that were sequenced with Oxford Nanopore technology67. We annotated these 22 assemblies with the pipeline described above. The annotated protein-coding genes were pooled together with the S. cerevisiae reference genome (SGD R64-1-1) and another 18 yeast strains for orthology identification. These 18 other yeast strains included 7 S. cerevisiae strains, 5 S. paradoxus strains and 6 out-group strains from other Saccharomyces yeast species as previously described24. The orthology identification was carried out using Proteinortho (v.5.15)68 and synteny information was considered (the PoFF feature of Proteinortho). This leads to the delineation of 2,018 1-to-1 orthologous groups across all the 41 sampled genomes. For each orthologous group, the protein sequences across the 41 strains were aligned with MUSCLE (v.3.8.1551)69, and the resulting protein alignment was further used to guide the corresponding CDS alignment using PAL2NAL (v.14)70. A concatenated multi-gene matrix was built for the CDS alignment of these 2,018 orthologous groups, which was further partitioned based on codon positions (for example, 1st, 2nd and 3rd codon positions). We used RAxML (v.8.2.6) to build the maximum likelihood tree based on the GTRGAMMA model with 100 rapid bootstraps. As an alternative, we also performed phylogenetic analysis using the consensus tree approach, in which we built individual gene trees for each of the 2,018 orthologous groups using the same method described for the concatenated tree. These individual gene trees were further summarized by ASTRAL (v.4.7.12)71 to produce the ‘species tree’. Both the concatenated tree and the consensus tree were visualized in FigTree (v.1.4.2) (http://tree.bio.ed.ac.uk/software/figtree/).
Quantitative high-throughput phenotyping was performed using end-point colony growth on solid medium. Strains were pregrown in flat-bottom 96-well microplates containing liquid YPD medium. The replicating ROTOR HDA benchtop robot (Singer Instruments) was used to mix and pin strains onto a solid YPD matrix plate at a density of 384 wells. The matrix plates were incubated overnight at 30 °C to allow sufficient growth and replicated on 36 medium conditions, including YPD 30 °C as the pinning and growth control condition (Supplementary Table 2). Each isolate was present in quadruplicates on the corresponding matrix (interplate replicates) and at two different positions (intraplate replicates). The plates were incubated at 30 °C for 40 h and were scanned at a resolution of 600 dpi and 16-bit grayscale. Quantification of the colony size from plate images was performed using the software package Gitter72. Each value was normalized using the growth ratio between stress media and standard YPD medium at 30 °C (see ‘Data availability’). Pairwise Pearson’s correlations of fitness trait values between replicates were calculated for each condition.
Mixed-model association analysis was performed using FaST-LMM v.2.0719. We used the normalized phenotypes by replacing the observed value with the corresponding quantile from a standard normal distribution, as FaST-LMM expects normally distributed phenotypes. In this step, we used the markers showing a MAF >5%。我们还将缺少基因型的丢失过滤为“ FS”:已设置了任意阈值,以排除少于1,000个个体中的所有变体的总矩阵。
用于关联的命令如下:“ fastlmmc -bfile $ snp -bfilesim $ snp -pheno $ pheno $ pheno -out $ coose_file - verboseOutput”。
混合模型为标准线性回归添加了多基因术语,旨在规避相关性和种群分层的影响。为了量化大容量通货膨胀的程度和多余的假阳性速率,我们计算了每种条件的基因组膨胀因子λ(补充图34)。该因子定义为在预期中位数上观察到的测试统计量分布的中位数之间的比率。例如,用于关联的标准等位基因测试的λ基于1度自由的χ2分布的中值(0.456)。在无关联和未链接变体的无效模型下,期望为λ为1。λ大于1表示相关的p值膨胀的p值,这可能是由于尚未考虑的混杂因素。
我们通过置于100次个体之间的表型值来估计每种条件的特定性状P值阈值。显着性阈值是5%分位数(排列的最低p值第五)。使用此方法,通过该阈值的变体具有5%的家庭错误率(补充图33)。
通过将NULL模型的遗传方差除以使用Fast-LMM计算的NULL模型(遗传方差和残余方差)的总方差(补充图6)来完成基因组遗传性的估计。此处报告的值基于分位数归一化的表型。为了计算通过我们的显着关联标记所解释的方差,我们将它们与“ -bfilesim”选项一起包含在协方差矩阵中,并再次进行相同的计算(图6)。
有关实验设计的更多信息可在与本文有关的自然研究报告摘要中获得。
1002酵母基因组网站(http://1002genomes.u-strasbg.fr/files/)提供了对“ -scripts.tar.gz”的访问,其中包含用于(1)从大型集会中提取所有非报酬材料的PERL和R脚本;(2)崩溃的ORF比特定阈值更相似;(3)检测LOH的区域。
除11个无法分发的分离株外,所有菌株均可根据要求提供(请参阅补充表1)。
Illumina读取可在序列读取档案中获得,登录编号ERP014555。
1002酵母基因组网站(http://1002genomes.u-strasbg.fr/files/)提供:
-1011matrix.gvcf.gz:所有SNP和Indels都以人口级别(.GVCF格式)呼叫。
-1011gwasmatrix.tar.gz:用于GWAS的矩阵,其中包含所有以1,000个或更多分离株为1,000个或更多MAF> 5%的双重位置以及CNV(编码为0/1/2,0/1/2,缺席/0.5-1副本/0.5-1副本/多个副本/多个副本)(bed,.bim和.bim和.fam和.fam和.fam)。
-1011distancematrixbasedonsnps.tab.gz:对于每对菌株,该值是基于SNP的百分比,是非相同碱基的。与纯合差异相比,杂合差异半加权。
-1011distanceMatrixbasedOnorfs.tab.gz:对于每个菌株,值是两个分离株中仅一个中存在的ORF的数量。
-1011 Assemblies.tar.gz:1,011个分离株(.fasta格式)的从头组装。
-allefrencegeneswithsnpsandindelsinferred.tar.gz:参考基因组中发现的基因序列,其中已自动推断出每个分离株自动推断出SNP和Indels。
-Allorfs_pangenome.fasta.gz:7,796个pangenomic orfs(.fasta格式)的序列。
-genesmatrix_preseceabersence.tab.gz:每个分离株的存在和/或不存在的模式,其中ORF的存在标记为1,其不存在为0。
-genesmatrix_copynumber.tab.gz:每个pangenomic ORF的估计拷贝数,每个分离株。给出了单倍体基因组的值,因此可以找到非全能值(同源染色体上的不同拷贝数)。
-genesmatrix_frameshift.tab.gz:对于每个分离株,根据Indels影响的碱数量,在每个基因中都报告了每个基因的存在或不存在(分别由1或0表示)。
-phenomatrix_35conconditionsnormalizedbyypd.tab.gz:在35个应力条件与30°C标准YPD培养基之间的生长比,用于971个分离株。
所有其他数据都可以根据合理的要求从相应的作者那里获得。