扩展的人类微生物组项目中的应变,功能和动态

2025-06-24 08:54来源:本站

  没有使用统计方法来预先确定样本量,因为此处包含的数据是从先前在人类微生物组项目研究的第一波浪潮中收集的生物测量来得出的。由于未包括治疗或表型组,因此未进行实验或盲的随机分组。

  如HMP11所述,进行了样本收集,存储,处理和WMS测序。有关IRB审查,知情同意,主题排除标准,采样协议和时间表的详细信息可以在先前的出版物1,13,40中找到。SRA使用BMTagger去除人DNA后,从SRA获得了此处分析的所有宏基因组(扩展数据图7a)。使用SRA SDK Toolkit19的FastQ-Dump实用程序,将所有SRA本机格式读取文件转换为FASTQ,以进行进一步分析。

  每个示例中的一个或多个SRA读取文件是每个读取方向串联的,以为每个示例创建一对FastQ文件。使用PICARD(http://broadinstitute.github.io/picard/)将这些FASTQ转换为未对齐的BAM,并使用修改版的Picard EntipalibraryComplexity模块删除了精确的重复。最后,使用trimbwastyle.usedbam.pl脚本从UC Davis Genome Center的BioInformatics Core进行了修剪和长度过滤(-Q2 -L60)(-Q2 -L60)(https://github.com/genome/genome/blob/master/lib/lib/perl/gerl/genome/site/tgi/hmp/hmpsraprocess/trimbwastyle.ususe.usedbam.pl)。

  分类学分析(下图)后,根据中位物种级别的bray – curtis与来自同一身体部位的其他样品的相似性,确定了生态异常的WMS样品,以进一步样本质量控制。如果样品的中值差异超过了与其身体部位的所有中位差异的上部栅栏上的上部围栏(比第三四分位数高于第三四分位数的四分位数范围),则样品被标记为离群值并丢弃。该过程删除了86(3.6%)WMS样品,这些样本对于各自的身体部位高度非典型。下游分析使用了其余2355个样本。

  使用graphlan220对元基因组样品进行分类学分析,该expaphlan220使用了进化枝特异性标记的库来提供Panmicrobial(细菌,细菌,古细菌,病毒性和真核生物谱分析(http://huttenhower.sph.sph.harvard.eduu/metaphlan2))。paphlan2概述了从HMP1概括了观察到的生态模式(扩展数据图1b),并同意直接读取映射到参考基因组。在所有样品中,映射的读数覆盖了每个中等优势菌株的参考基因组序列的平均81.7%(中值92.8%)。这些菌株在所有样品上的平均覆盖深度(对齐的读取的总碱基对除以参考基因组中的总碱基对)为3.9倍,其深度覆盖量的均值因体位的变化从0.04×(右二键率fossa)到11.1×(舌头porsum portsum)(补充表8)。在每个身体部位的前两个变异轴中,批处理效应不可见(扩展数据图1D)。

  使用应变phlan14进行应变表征。菌株磷酸盐表征了生物体中的aphlan2标记基因中的单核苷酸变体。对于给定样品,我们需要至少80%的标记才能使给定物种具有最小的平均读取深度为10倍,以确保足够的数据执行单倍型调用。总共151种在至少两个WMS样品中满足了这些要求(补充表2)。使用Kimura两参数距离17评估菌株之间的距离(可从扩展数据表1b获得)。spaphlan2和应变phlan都与默认设置一起使用。

  参考基因组覆盖范围是通过HMP1-II菌株和参考基因组之间的不对称系统发育距离的补体(1- unifrac G41)评分。所有覆盖估计均在补充表2中列出。

  通过类似于轮廓评分的度量,检测到具有小众相关的亚种的物种,该测度比较了每个身体部位中菌株的平均系统发育差异与跨越体位部位的菌株(在同一物种中)的差异。具体而言,我们首先定义一个身体位点分数D(U,V)的身体位点u和v的给定物种的分数D(U,V)为:

  其中SX是通过人体位点X中应变磷覆盖要求的一组样品,而D(i,j)是样品I和J中显性单倍型之间的两参数距离。然后将每个物种的小众缔合得分A(图1b)定义为所有定向的身体位点u和v的最大观察到的d(u,v),其中两个地点中至少满足了五个样品的菌株覆盖范围要求。也就是说,对于一组车身站点b:

  该分数的一个问题是,在一个站点中呼叫单核苷酸变化的技术难度更大,可能会导致没有存在的明显小众关联。但是,这并不是一个问题,因为计算出利基 - 关联评分的所有站点都是口头位点具有相似的技术变异性(图1A)。这是一个限制的副产品,即该物种必须在多个地点具有足够的存在(通过应变覆盖率的五个样本),这是在生态上更相似的口服位点以外的。

  使用Humann224(http://huttenhower.sph.harvard.edu/humann2)进行功能分析。简而言之,对于给定的样品,Humann2构建了来自spaphlan2在样品中检测到的物种子集的样本特异性参考数据库(pangenomes是给定物种的ORF的预定代表)。然后,Humann2映射样本读取此数据库,以量化基因的存在和每种物种的丰度。通过翻译搜索与基于联合国联合国的蛋白质序列catalogue43的翻译搜索进一步映射。最后,对于在核苷酸和蛋白质水平上量化的基因家族,Humann2从功能表征的子集中重建途径,并根据metacyc Pathway database44评估社区的总体,物种分辨和未分类的途径丰度。

  代谢途径的分析集中在1,087个HMP1-II元基因组上,代表了每个受试者在六个靶向身体部位的第一次测序访问。排除了给定(主题,身体部位)组合的后续样品和技术复制,以避免在其方向上偏见人口估计。我们将特定身体部位的“核心”途径定义为一种“核心”途径,该途径在至少75%的受试者唯一样本中以相对丰度> 10-4检测到。我们进一步过滤了这些高度普遍的途径,以确保明智的分类范围和自信的分类归因。具体而言,如果其生物循环的分类学范围不包括任何与人相关的微生物属(定义为在至少5个具有相对丰度> 10-3的HMP受试者中检测到的属),则排除潜在的核心途径,或者> 50%的路径副本“未经分类”的分类属性属于> 25%> 25%。这些过滤标准总共产生了950个核心(途径,身体部位)关联,涵盖了258个独特的metacyc途径。值得注意的是,这些数字对上述确切的参数设置不敏感,但前提是,(1)大多数人口普遍性(即,> 50%),(2)非高级检测阈值[即(pathlays of Pathroway of Pathroway)-1],以及(3)某些形式的综合性(例如,(3)构成的综合性,以限制了质子的变化(例如),(2)限制了某些形式的质子 - 替代性 - 例如,(3)9)。

  我们将途径的分类范围量化为在生物气体中注释的独特属的比例。我们将这项措施细分为“人类相关”和“非人类相关”属(如上所述)的范围,并专注于后一种措施以避免循环推理(根据定义,在人类微生物组中广泛分布在人类相关类群中的函数)。作为进一步的控制,我们还将Humann2直接应用于其潜在的Pangenome数据库,以将途径与> 4,000个微生物物种相关联。要保守地将核心途径定义为“丰富人类微生物组”,我们要求将它们注释为 <10% of non-human-associated genera in BioCyc, and also directly annotated to <10% of non-human-associated pangenomes. The second criterion further reduced cases of rare variants of common pathways (as defined by metaCyc) being called as enriched in metagenomes owing to cross-detection of the common pathway.

  We defined a core pathway to be strongly enriched in a particular body site if the first quartile of the abundance of the pathway at that site was >2× larger than the third quartile of abundance at sites from all other body areas (that is, the focal and background abundance distributions must be very well separated, as opposed to just significantly different). Notably, this definition only requires core pathways at oral body sites to separate well from non-oral sites, and not other oral sites (very few core pathways at oral body sites were strongly enriched relative to other oral sites).

  We investigated the relationship between coreness and essentiality of functions using a dataset of around 300 essential COG45 gene families determined in E. coli46 (the ‘Keio collection’). We computed COG abundance across the 1,087 metagenomes introduced above by summing the abundance of individual UniRef gene families (as computed by HUMAnN2) according to UniProt-derived COG annotations47. We considered a COG to be confidently detected in a sample if its relative abundance exceeded 10−4. Among detected COGs, essential COGs (n = 272) were both more globally prevalent than non-essential COGs (n = 3,629; median 0.94 versus 0.24) and core to more body sites (mean 4.7 versus 1.2; core defined here as >75%的现场盛行);这两种趋势在统计学上都是高度显着的(P< 0.001) by Wilcoxon signed-rank tests and robust to a smaller detection threshold (10−6).

  A Gaussian process is a nonparametric probabilistic model for performing inferences about sampled continuous functions. This section covers the justification of the specific Gaussian process model used to model microbial and functional abundances (referred to here as ‘features’) in the microbiome, and discusses its assumptions, advantages and drawbacks. Implementation details are presented in the following section.

  In a Gaussian process, the joint distribution of the modelled function at any finite set of points follows a multivariate normal distribution. Without loss of generality, Gaussian processes can be parameterized solely by their covariance function or kernel, defining the covariance of the output between any pair of sample points. This pairwise definition permits the use of the irregular temporal sampling present in the HMP1-II dataset (Extended Data Fig. 4a). The shape of the covariance function of the Gaussian process determines several properties of the modelled function, such as its smoothness, how quickly it changes, and which features of the input vector it is sensitive to. Our first goal here was thus to assess the strength of the evidence for several common covariance functions describing biologically meaningful behaviours, and to determine which components should be included in a parsimonious model that captures the observable dynamics for the majority of features. The candidate set of covariance functions we considered includes: fast variation (‘biological noise’), inter-individual differences, an Ornstein–Uhlenbeck process, a squared-exponential covariance function, and seasonal dynamics with a period of one year (formulae can be found in Supplementary Table 10).

  All candidate covariance functions describe stationary processes, given the inherently limited state space of relative abundances, although they differ in their temporal dynamics and in their implications for the biological systems that generate these behaviours. "Fast variation," meaning variation on a timescale faster than measurable, is represented by a Gaussian white noise process. Inter-individual differences are modelled by constant covariance between samples from the same person. The two time-varying components, the Ornstein–Uhlenbeck process and the squared-exponential covariance function, both describe monotonically decreasing covariance as the difference in time between two samples increases; that is, time points closer to one another will be more similar than those farther apart. These two functions primarily differ in the smoothness of the underlying function. The Ornstein–Uhlenbeck process is the only stationary Markovian Gaussian process with non-trivial covariance over time, and produces functions that are not differentiable, and thus very jagged, resembling Brownian motion. For example, this covariance function is expected for the abundance of a slowly changing feature under continuous stochastic perturbation from the environment. Meanwhile, the squared-exponential covariance function describes functions that are infinitely differentiable, and are thus extremely smooth. This function implies a considerable amount of latent state relevant to the process generating the abundance of the feature. Both of these time-varying covariance functions are parameterized by their length scale, the characteristic time scale at which the function changes. Lastly, the seasonal component is represented by the canonical periodic covariance function from Gaussian process literature, with its period fixed at one year, but with an unknown length scale. Here, a model refers to a combination of these covariance functions.

  Models were compared based on their marginal likelihoods (also termed ‘evidence’), reported in bits (that is, log2 ratio of marginal likelihoods = log2 Bayes factors) of evidence against a given model when compared to the best model for a feature (Supplementary Table 10). More than 3.3 bits is considered strong evidence against a model, and more than 6.6 bits is considered decisive. Marginal likelihoods were estimated from Markov chain Monte Carlo (MCMC) samples of the posterior distribution by a truncated harmonic mean of the un-normalized posterior distribution at the sampled points. Truncation was performed, as this estimator is known to have poor convergence characteristics because MCMC samples with very low likelihoods have an unreasonable influence on the harmonic mean. Comparisons were performed for models fit to the abundances of the top 10 most prevalent species (with at least 70% non-zero abundances) and top 5 most abundant pathways at each targeted body site (Supplementary Table 10). Comparisons were also performed for a set of simulated features with known dynamics (‘controls’), which were sampled from the corresponding Gaussian process with 5% of variance due to technical noise and the remaining variance distributed evenly between components.

  To determine which of these components have statistical support in the data, we employed a standard greedy search through the space of possible models, which starts from the simplest model (all variation is technical) and iteratively rejects simpler models in favour of a more complex one if the evidence against the simpler model exceeded six bits. The set of more complex models considered at each iteration are those with only one more parameter, and contain the simpler model as a special case (pseudocode presented in Supplementary Table 6). This procedure selected models that included the two simplest components, biological noise and inter-individual differences, 47 and 53 times among the 72 features tested, respectively. Among more complex components, the Ornstein–Uhlenbeck component was selected 13 times, whereas neither the squared-exponential covariance function nor the seasonal component were selected for a single tested feature. These trends were robust to increases in the model rejection threshold, with the evidence for the Ornstein–Uhlenbeck component remaining significant to at least 10 bits, whereas the squared-exponential covariance function and seasonal components are only selected for more lenient thresholds (≤4 bits). We note, however, that this procedure had difficulty identifying the squared-exponential covariance function and seasonal components in control samples that included other components (in particular, biological noise), indicating that these components are difficult to distinguish given the available temporal sampling pattern. Thus, although the data clearly currently prefer the Ornstein–Uhlenbeck component over the squared-exponential covariance function, and do not support the inclusion of a seasonal component, we are not sufficiently powered to eliminate these as potentially significant contributors to the dynamics of the microbiome. Finally, the null model with only technical noise was rejected for 71 out of the 73 features, often with very high evidence (median 69.6 bits).

  For the remainder of the analysis, we thus converged on a model with four components: inter-individual differences, an Ornstein–Uhlenbeck process, biological noise, and technical noise. Let U, T, B, and N be the respective magnitudes of these components, and l be the timescale of the Ornstein–Uhlenbeck process. Estimation of these parameters (hyperparameters in Gaussian process nomenclature) was performed by fitting a Gaussian process with the following covariance function to all features (species and pathways) with at least 75% prevalence within a site (Fig. 3, Supplementary Table 5):

  This function describes the covariance between samples i and j, where tx and sx are respectively the sampling date and subject identifier of sample x. All four parameters were fit simultaneously by MCMC (next section). Since the three magnitude components must sum to the variability of the population, this can be seen as a decomposition of variance into sources of variability that differ in their temporal signature. As we are interested only in the three biological components here, we therefore normalize out the estimated technical noise component (that is, [U, T, B] N) before visualizing the decomposition on a standard ternary plot (Fig. 3b, c). For illustration, we show three examples that illustrate the three types of dynamics on a plot designed to allow a direct comparison between the data and the fit Gaussian processes (Extended Data Fig. 4d–f).

  The identifiability of any component of a time-dependent model is limited by the temporal sampling pattern available. The current dataset contains only up to three time points per person, with the time between samples roughly evenly distributed between one month and one year for each body site (Extended Data Fig. 4a). Processes too fast to measure will contribute to the biological noise component, whereas processes much slower than the maximum time intervals available contribute to the inter-individual component. We tested what time scales would be detected by the Ornstein–Uhlenbeck component, and which would contribute to the inter-individual or biological noise components, by simulating data from Ornstein–Uhlenbeck processes of varying length scales and performing parameter fits (Extended Data Fig. 4b). These show that the time-varying component is sensitive to processes with characteristic length scales of around 3 to 24 months.

  We note that the resolution of the time-varying component is only possible because of the large spread in the time differences between samples available in the HMP1-II dataset (Extended Data Fig. 4a). In another common longitudinal study design, in which a small number of samples are gathered per person with a fixed time interval between them, this would not be possible, although this design may make the analysis simpler (samples can be grouped by time point and a method such as Gaussian processes would not be necessary). Likewise, richer longitudinal data in the form of longer time series would allow even more to be inferred about the dynamics of the microbiome. Of particular interest, this would enable differences in the temporal component(s) to be resolved between people. Here, with only up to three time points per person, the fit model parameters describing temporal changes (B, T, and l) are only a best-fit over the population. Such a sampling pattern would also provide the opportunity to differentiate more conclusively between the Markovian Ornstein–Uhlenbeck process and other possible non-Markovian processes (such as described by the squared-exponential covariance function, or an intermediate such as the Matérn covariance functions), indicative of latent state or time-delayed events in the microbiome.

  The HMP1-II dataset also includes many technical replicates (252 in total), which were instrumental in distinguishing the two fast-varying components (biological and technical noise). We encourage the addition of a non-trivial number of technical replicates in future longitudinal studies, not simply for validation but also to allow a quantitative characterization of diversity that is not captured in the remainder of the experiment owing to limited sampling rates. Since technical noise is also estimated with the other variance components, estimates of the relative magnitude of the technical noise are also reported (Supplementary Table 5). The proportion of variance due to technical noise was generally lower for species abundances (median of 5.4%, 90th percentile of 19.3%) than for pathways (median of 16%, 90th percentile of 44%), consistent with the observation that true biological variation between pathway abundances is lower than between species abundances1. Noise levels in pathways were predominantly influenced by body site, with pathways in the anterior nares having the greatest noise (median of 40%).

  We assessed the accuracy of the parameter fitting process under these noise conditions by simulating samples from mixtures of the three components and performing parameter fits for each targeted body site (Extended Data Fig. 4c). For all noise levels, pure components were always inferred with high confidence, with inter-individual differences being the most identifiable. Mixtures of inter-individual dynamics with biological noise were also confidently recovered, whereas mixtures of inter-individual and biological noise were more variable, and mixtures of inter-individual and time-varying dynamics were biased towards a greater influence of time-varying dynamics. Thus, when the time-varying component is present, parameter estimates should be considered biased away from the inter-individual corner of the ternary diagram. Mixtures of all three components had the greatest uncertainty. Among body sites, inferences at the anterior nares and posterior fornix sampling distributions were the most unreliable, owing to the relatively limited number of samples at these sites (Extended Data Fig. 4a), reflected as a large number of highly uncertain features at these sites (Fig. 3). At 20% technical noise (the 90th percentile of the noise distribution for species), parameter estimates degrade noticeably, and tend towards the mean of the prior (an even mixture of all components). This therefore results in the low-confidence species and pathways tending to locate towards the centre of the ternary diagrams (Fig. 3).

  We note that for a particular feature (microbe or pathway abundance), each of the non-technical components represents the sum of all processes with that temporal signature that affect that feature, and these do not necessarily reflect intrinsic properties of the feature. Examples of extrinsic processes that are likely to produce biological noise include, among others, day-to-day dietary differences, the timing of sample collection relative to meals, tooth brushing and other personal hygiene, spatial variation of the microbiome within subjects (for example, gradients across the stool), and weekend/workday differences. Extrinsic sources of inter-individual differences may arise from culture/ethnicity (ethnicity is strongly associated with the abundances of several microbes1), differences in habits (for example, habitual versus infrequent tooth brushers and flossers), and long-term dietary differences, among others. Finally, time-varying processes may include properties such as weight or slowly changing preferences in diet.

  All parameter fits and model comparisons were performed by MCMC sampling with the GPstuff toolbox version 4.6 in MATLAB. Before fitting, relative abundances were first arcsine square-root transformed, filtered for outliers using the Grubbs outlier test (significance threshold 0.05), and standardized to have zero mean and unit variance. A gamma-distributed prior with shape 3.1 and mean 10 months was imposed on the lengthscale parameter of all time-varying components. These parameters for l were selected based on the intervals between samples, and guarantee that the model is identifiable when the biological noise and/or inter-individual difference components are included by ensuring that l is neither too short nor too long. All parameters of all models were fit simultaneously. All models were fit using a Gaussian likelihood. This function performs poorly for highly non-Gaussian distributions, which frequently occur in microbiome data in the form of zero-inflated abundance distributions. For this reason, the dynamics analysis was performed for highly prevalent features (species with ≥ 75% prevalence within a site, and core pathways). One exception was made for this: species with mean abundance when present at ≥ 2% and non-zero in at least 50 samples were also included, so as to include important species such as Prevotella copri that have lower prevalence but exceptional abundance when present. Other models specifically accounting for zero-inflation (both technical and real) will be needed to study the dynamics of the rarer microbiome.

  Evidence presented in Supplementary Table 5 was calculated from 5 MCMC chains per model, with 150 samples after a 20 sample burn-in, which were started from a random point in the prior distribution. Parameter estimates presented in Fig. 3 and Supplementary Table 5 were fit with the additional constraint that U + T + B + N = 1, to eliminate an additional degree of freedom from the model. A Dirichlet(1, 1, 1, 1) prior was imposed on [U, T, B, N]. For each feature tested here, a more thorough MCMC sampling was performed than for the model selection, consisting of 10 chains with 200 samples each (after 30 burn-in and thinning every other sample), starting from a random point from the prior distribution. In all cases, all parameters were fit simultaneously. Convergence was assessed with the statistic48. Over all 196 species and 950 pathways tested, 97% of statistics were <1.1 for all parameters (median 1.01, max 1.17), indicating good convergence.

  Associations between microbial and pathway abundances and metadata were determined using MaAsLin1,49. MaAsLin tests a sparse multivariate generalized linear model against each feature independently. Relative abundances were first arcsine square-root transformed for variance stabilization, and the Grubbs test was used (significance level 0.05) to remove outliers. A univariate prescreen was applied using boosting to identify potentially associated features, and significantly associated covariates among the remaining features were identified with a multivariate linear model without zero-inflation. Unless otherwise stated, a final FDR < 0.1 (Benjamini–Hochberg controlled across feature tests) was used as a significance threshold.

  The same model was applied to all features (microbial and pathway) during this analysis and included the following covariates: broad dietary characterization, whether the subject was breastfed, temperature, introitus pH, posterior fornix pH, gender, age, ethnicity, study day processed, sequencing centre, clinical centre, number of quality bases, percentage of human reads, systolic blood pressure, diastolic blood pressure, pulse, whether the subject had given birth, HMP1/HMP1-II, and BMI. A summary of these metadata can be found in Extended Data Table 1a. Of note, several recently identified confounders such as transit time8 for stool samples were not collected during sampling.

  We benchmarked several assemblers including IDBA-UD38, metaVelvet50, SOAPDenovo251, Newbler (Roche, basel, Switzerland), Ray52, SPAdes53, and Velvet54 using eight samples (SRS017820, SRS014126, SRS052668, SRS017820, SRS048870, SRS020220, SRS057205 and SRS017820) across five body sites that represented a range of metagenomic complexity. On the basis of the assembly size, median length, fragmentation level, and N50 length, we chose IDBA-UD to process all HMP1-II samples.

  Following quality control, sequence reads for each sample were run through a ‘digital normalization’ pipeline before assembly. This process was designed to reduce, as much as possible, the volume of information from the most dominant source taxa (without sacrificing the ability to assemble what remains) so that lower-abundance taxa could be assembled more evenly, instead of having their reads discarded by the assembler software as not being sufficiently covered (compared to the dominant taxa).

  Median k-mer coverage was first estimated for all reads using the khmer Python library55. These data were then used to filter input reads so as to normalize k-mer coverage within preselected bounds: for each k-mer of length 20 nucleotides in each read, the total number of observations of the k-mer was used as a proxy for coverage. Reads for which median k-mer coverage was already greater than 20 were discarded. Remaining reads were then trimmed at the first instance of a single-copy k-mer (representing putative error sequences). Reads with a post-trim length of less than the k-mer length (20 nucleotides) were also discarded. Surviving reads were trimmed again, this time at the first instance of a high-abundance (>50×) k-mer; again, reads whose post-trim length was less than 20 nucleotides were discarded. For remaining reads, we re-normalized (based on median k-mer coverage as in the first step) to remove all reads whose median k-mer coverage was >5×。在消除具有高度代表(冗余)K-MER或严重代表性不足(误差)K-MER的初始读取之后,这是对推定冗余序列的更具侵略性的过滤器。

  对于此质量控制和归一化后的随后组装,我们将K增加到32个核苷酸(以最大程度地提高其剩余读数的灵敏度),并从所有其余读取中构建了一个重叠图。然后将此图分为具有内部重叠的可能性很小的读数组,将组件分离为预先计算的“ stoptags”:Khmer在其初始分析扫描中自动识别为不可靠的组装组装 - 感染性节点时自动识别的K-MER序列。然后将读取从每个此类分区中提取到单独的FASTA文件中。测试了每个分区的更连贯的亚组,从最小一致的开始(按照图可分离性排名)。重新分配如上所述,但具有更具侵略性的参数:在重新分配之前,明确检测并删除了最初计算的重叠图中的stoptag(其中包括从较早的图表后的其余图中生成新的stoptags))。最不吻合的读取组以这种方式完全分为子分区:进一步的迭代风险过于拟合,并且不能保证会融合到有意义的结果。

  在数字归一化之后,每个最终分区都与其他分区独立于IDBA-UD组装。对于(20、30、40,...,80)中的k值,IDBA-UD将尝试使用大小为K的k-Mers组装其分区(通过De Bruijn图方法),然后将合并并扩展所有通过的结果以产生该分区的最终组装(需要100个核苷酸的最小重叠元长)。对于每个样品(或池),然后将所有(独立生产的)分区组件加入。作为减少最终串联组件中存在的任何冗余的最后一步,我们基于40个核苷酸或更多的重叠,合并并扩展了所有组装重叠群(跨所有分区),以产生最终的“合并”序列收集。

  为了评估组装质量,我们进行了许多组装后质量控制检查,包括检查与组件保持一致的速率以及识别嵌合体的速率,这是由错误组件引起的潜在问题。

  为了检查将哪些读数合并到组件中,使用Bowtie V1将样品读取与它们的组件对齐,从而导致至少有一个对齐的读数和未能对齐的读数的计数。总读取包括来自人类宿主的读物。由于使用BMtagger将人读物掩盖为SRA的所有NS,因此人类的读物会影响不结盟读数的一部分。为了评估效果,我们计算了蒙版读数的数量,以获取人类读数的计数。这些由人体位点总结在扩展数据中图7c。

  为了检查Chimaeric重叠群和错误组件的速度,我们对HMP期间生成的2个模拟数据集进行了组装评估,该数据集是由所有21种相等丰度(偶数)的21种生物创建的社区,一个是一个相等的丰度(偶数)。我们使用相同的协议和对齐的所有21个输入基因组组合了这些模拟社区,并对齐组装重叠群。我们发现,所有组装的重叠群的分别为94.21%和96.84%,这些重叠群唯一与单个参考基因组对齐的偶数和交错覆盖的模拟群落(此处的“对齐”表示与≥95%的序列身份对齐,超过其长度的≥95%)。与与与其他菌株相符的重叠群相比,与与其他菌株相比,与与其他菌株相比,与密切相关的葡萄球菌和链球菌菌株对齐的重叠群表现出略多的非排他性匹配(或交叉匹配)。For the even set, an average of 97.85% of all Staphylococcus- and Streptococcus-aligned contigs were uniquely aligned to their reference strain, with an average of 92.98% for the staggered set, as compared to averages across all other strains of 99.89% (even) and 98.98% (staggered), neatly reflecting the inherent genetic ambiguity of these taxonomically narrow亚组却显示出很强的区分相关菌株的能力。

  恢复统计量与交错集中的输入覆盖范围不太相关,这表明我们的管道(至少为4倍覆盖率)与在这些尺度上的相对丰度高达三个数量级的相对丰度差异很强。在这种情况下,系统发育接近似乎比覆盖范围比覆盖范围显示出对组装唯一性的影响(尽管仍然非常弱)。与21个参考菌株中的任何一个(≥95%的身份时长度的超过95%的长度超过95%)分别为5.6%和3.0%的重叠群的分数分别为5.6%和3.0%。因此,我们可以假定这些比例是我们管道产生的嵌合体和错误组件的综合速率,与其他Chimaera组装指标一致。

  使用metagenemark-3.2557进行组装重叠群中的ORF检测。使用RAPSEARCH259,将最终的ORF序列用作针对(1)UNIREF10058的搜索的输入;(2)使用HMMER-3.062的PFAM60和TIGRFAM61 HMM型号;(3)TMHMM63用于识别跨膜螺旋;(4)对膜脂蛋白脂质附着位点的正则搜索,以鉴定推定的信号肽。后三个搜索是按照Ergatis Workflow监视System64实现的。

  注释是由Attributor(https://github.com/jorvis/attributor)使用层次方案从IGS原核生物注释管道中开发的65。Attributor分配了基于层次结构的证据,包括对HMM模型的命中,UNIREF100序列,TMHMM预测的螺旋跨度和脂蛋白基序,分配了普通名称,基因符号,酶委员会(EC)数字和基因本体论(GO)项(GO)项(GO)术语。任务是独家的,这意味着对于每个ORF,属性都采取了最有力的证据,并根据该证据分配所有属性。属性不会从多个来源分配,以确保分配给单个ORF的注释属性不会冲突。属性注释作为GFF3和FASTA文件输出(扩展数据表1b)。

  Rarefaction curves were generated by extracting predicted polypeptides from the metaGeneMark output for each sample, and estimating a ‘unique gene family’ count for rarefied sample size n as follows, using usearch v.8.1.1861 x6466: (1) Concatenate the metaGeneMark predicted polypeptides from a random sampling of n samples that were not technical replicates, eliminating duplicates;(2)通过减小长度进行排序序列;(3)以90%身份(使用usearch cluster_fast)处的集群序列;(4)从结果中检索“独特的基因家族”。从每个n的50个随机子集估计唯一簇的数量。为n = 1,10,20的每个身体部位重复此过程,直到身体部位可用的唯一样品数量为止。

  除了上述分类学和功能分析外,所有样品的单个原始读取均直接对齐到metareF42参考基因组。在对齐之前,使用Biocode FastQ :: Filter_fast_fastq_by_n_content Utilility(https://github.com/jorvis/jorvis/biococe/blob/blob/master/master/master/fastq/fastq/filter_fastq_fast_fast_fastq_by_by_content.py),使用Biocode fast_by_by_n_content实用程序丢弃所有读取的NS百分比。然后使用bowtie267(v2.2.4)来对齐读取,以使用默认,配对端对齐选项并包括单例读取。所得的SAM文件被转换为BAM,分类,然后将每个示例分配为两个单独的文件,一个唯一的匹配读数之一,另一个是未对齐的读取。整个管道都封装在Biocode Generate_to_to_to_metaref_seed_alignment.py Pipeline脚本(https://github.com/jorvis/jorvis/biocode/biocode/blob/master/master/sandbox/jorvis/generate/generate/generate_to_to_to_to_to_to_to_to_to_to_to_to_to_to_to_to_metareAref_Seed_Alignment.py)。

  使用Bowtie(v0.12.9),将每个样品的质量缩放读数映射到同一样品中的组合重叠群上,最大最佳优先搜索框架值,PHRED33分数质量设置,21个碱基对种子长度和每种种子不匹配2不匹配。据报道,所有读取的所有对齐方式(除非给定的读取有20多个),并保证了命中率最佳层面和质量打破的联系。未报告亚地区的命中。

  注释管道和高斯过程分析的代码可从扩展数据表1b获得。

  序列数据可从HMP DACC(http://hmpdacc.org)或Amazon(https://aws.amazon.com/datasets/human-microbiome-project/)获得序列数据;WMS reads and accompanying metadata are available at the Sequence Read Archive (SRA; https://www.ncbi.nlm.nih.gov/sra) and the Database of Genotypes and Phenotypes (dbGaP; https://www.ncbi.nlm.nih.gov/gap) under two studies: SRP002163(Bioproject PrjNA48479)和SRP056641(Bioproject PrjNA275349)。来自扩展数据表1的公共和私人元数据可与HMP DACC(https://www.hmpdacc.org/hmsmcp2/)一起获得元基因组分类单元的丰度表,并分别通过DBGAP,分别具有访问号码phs000228.v3.p1。所有其他数据都可以根据合理的要求从通讯作者那里获得。

左文资讯声明:未经许可,不得转载。