Mega-scale experimental analysis of protein folding stability in biology and design

　　We first collected all monomeric proteins in the PDB within the 30–100 amino acid length range in June 2021. Next, we excluded structures that have only one helix, that contain other molecules (for example, proteins, nucleic acids or metals), that are annotated as having DNase, RNase or protease-inhibition activity, or that contain multiple cystines. We then removed redundant sequences (amino acid sequence distance <2). We then predicted the structures of these PDB sequences using AlphaFold (even though the PDB structures were known), and used the AlphaFold models to trim amino acids from the N- and C-termini that had a low number of contacts with any other residues. Finally, we selected domains with up to 72 amino acids after excluding N- or C-terminal flexible loops.

　　EEHH protein design was performed in three steps: (1) backbone construction, (2) sequence design, (3) selection of designs for deep mutational analysis. Backbone construction (the de novo creation of a compact, three-dimensional backbone with a pre-specified secondary structure) was performed using a blueprint-based approach described previously51,52. All blueprints are included as Blueprints_for_EEHH.zip in Source data.

　　We used a TrRosetta hallucination protocol described in previous reports31,32 and available at https://github.com/gjoni/trDesign/tree/master/02-GD to unconditionally generate protein backbones and sequences with lengths ranging from 46 to 69 amino acids by maximizing the Kullback–Leibler divergence between the predicted and background distance/angle distributions. Predicted distograms and anglegrams were used to obtain 3D structures of these models as described in the TrRosetta paper53. We selected the best designs according to the match between the predicted distogram and the 3D structure.

　　All sequences were reverse-translated and codon-optimized using DNAworks2.0 (ref. 54). Sequences were optimized using E. coli codon frequencies because we used an in vitro translation kit derived from E. coli. Oligonucleotide libraries encoding amino acid sequences of Library 1 were purchased from Agilent Technologies.

　　We selected ~250 designed proteins and ~50 natural proteins that are shorter than 45 amino acids. We then created amino acid sequences for deep mutational scanning and padded them with Gly, Ala and Ser residues so that all sequences have 44 amino acids. The library contains ~244,000 sequences in total and was purchased from Agilent Technologies as oligonucleotides of length 230 nt.

　　We selected ~350 natural proteins that have monomeric PDB structures and 72 or fewer amino acids after removing N- and C-terminal linkers. We then created amino acid sequences for deep mutational scanning and padded them with Gly, Ala and Ser residues so that all sequences have 72 amino acids. The library contains ~650,000 sequences in total. It also includes scrambled sequences used to construct the unfolded state model. It was purchased from Twist Bioscience as oligonucleotides of length 250 nt.
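
　　The substitution-and-padding step described for these libraries can be sketched as follows. This is only an illustration: the function names, the repeating `GSA` motif and the choice to pad at the C terminus are assumptions, since the text states only that Gly, Ala and Ser were used to bring every sequence to a fixed length.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_mutants(wild_type):
    """Yield (label, sequence) for every single substitution of wild_type."""
    for pos, wt_aa in enumerate(wild_type):
        for aa in AMINO_ACIDS:
            if aa != wt_aa:
                # Label uses 1-based positions, e.g. "A1C".
                label = f"{wt_aa}{pos + 1}{aa}"
                yield label, wild_type[:pos] + aa + wild_type[pos + 1:]


def pad_to_length(sequence, target_length, linker="GSA"):
    """Pad with a repeating Gly/Ser/Ala motif (motif and placement are assumed)."""
    if len(sequence) > target_length:
        raise ValueError("sequence is longer than the target length")
    missing = target_length - len(sequence)
    padding = (linker * (missing // len(linker) + 1))[:missing]
    return sequence + padding
```

　　For a domain of N residues this yields 19 × N substitution variants, each of which would then be padded to the library length (44 or 72 amino acids in the libraries above).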

　　We selected ~150 designed proteins and created amino acid sequences for deep mutational scanning of these proteins. We also included comprehensive deletions and Gly or Ala insertions for all wild-type proteins included in Library 1 and Library 2. Additionally, amino acid sequences for comprehensive double mutant analysis of polar amino acid pairs were included. The library contains ~840,000 sequences in total and was purchased from Twist Bioscience as oligonucleotides of length 250 nt.

　　Amino acid sequences for exhaustive double mutant analysis of amino acid pairs located in close proximity were included. We also included overlapping sequences to calibrate the effective protease concentration and to check consistency between libraries. The library contains ~900,000 sequences in total and was purchased from Twist Bioscience as oligonucleotides of length 300 nt.

　　Oligonucleotide libraries were amplified by PCR using KOD PCR Master Mix (Toyobo) to add a T7 promoter, a PA tag at the N terminus and a His tag at the C terminus of the proteins. The number of cycles was chosen based on a test qPCR run using SsoAdvanced Universal SYBR Green Supermix (Bio-Rad) to avoid overamplification. The PCR product was gel-extracted to isolate the product of the expected length. We then used the T7-Scribe Standard RNA IVT Kit (Cellscript) to synthesize mRNA using the DNA fragment as a template.

　　We followed the protocol essentially as described24,55, with some modifications, described below.

　　We prepared the photocrosslinking reaction solution (usually at 40 μl scale) using 100 mM NaCl, 20 mM Tris-HCl (pH 7.5), 1 μM cnvK linker (EME), 1 μM mRNA. The solution was incubated at 95 °C for 5 min, then slowly cooled down to 45 °C (0.1 °C s−1) using a thermal cycler. Then the solution including the duplex was irradiated with UV light at 365 nm using a 6 W Handheld lamp (Thermofisher) for 15 min. At 40 μl scale (40 pmol cnvK linker and 40 pmol mRNA total), this produces crosslinked mRNA sufficient for 48 proteolysis reactions.

　　We used the PUREfrex 2.0 (GeneFrontier) translation system according to the manufacturer's protocol. We typically used a 160 μl total reaction including 40 μl of the mRNA–cnvK linker duplex product from Step 1 and RiboLock RNase Inhibitor (Thermofisher). We incubated the reaction at 37 °C for 2 h. After the incubation, 500 mM EDTA (16 μl for a 160 μl reaction) was added to the sample to dissociate ribosomes. Then, an equal volume (160 μl for a 160 μl reaction) of 2× binding/washing buffer (20 mM Tris pH 7.5, 2 mM EDTA, 2 M NaCl, 0.2% Tween) was added. The solution was added to Dynabeads MyOne Streptavidin C1 (Thermofisher, 200 μl for 40 pmol mRNA) to pull down the protein–mRNA complex and incubated at room temperature for 20 min. Before use, the streptavidin beads were pre-washed with (1) 100 mM NaOH, 50 mM NaCl, then (2) 100 mM NaCl to remove any RNase activity. After the streptavidin pull-down, the beads were washed once with 1× binding/washing buffer and rinsed twice with TBS (10 mM Tris-HCl pH 7.5, 100 mM NaCl). We then added reverse transcription solution (Primescript RT Reagent Kit; Takara) to the beads carrying the protein–mRNA complex and incubated them at 37 °C for 30 min.

　　After the reverse transcription, the protein–cDNA complex was eluted with His-binding buffer (30 mM Tris pH 7.4, 0.5 M NaCl, 0.05% Tween) containing RNase T1 (Thermofisher), usually at 400 μl scale. The eluate was added to His Mag Sepharose Ni (Cytiva) (800 μl for 40 pmol starting mRNA) and incubated at room temperature for 30 min. Then the complex was eluted with His-binding buffer containing 400 mM imidazole (usually 400 μl) and the eluate was buffer-exchanged into PBS using a Zeba Spin Desalting Column (Thermofisher). The complex was then snap-frozen in liquid nitrogen and stored at −80 °C until the protease assay. When starting from 40 pmol cnvK linker and 40 pmol mRNA for step 2, we would typically finish this step with 400 μl of protein–cDNA complex divided into 4 frozen tubes (100 μl each) for four sets of 12 protease experiments (48 conditions total).

　　Proteolysis reactions were performed in two ‘replicates’ of 12 conditions each (11 protease concentrations in a threefold dilution series and one condition with no protease). Replicate 1 used a maximum protease concentration of 25 μM and replicate 2 used 43.3 μM (25 × 3^0.5 μM). For one replicate (12 reactions), we started from ~25 μl complex, diluted this in PBS up to 240 μl, then added 20 μl to each of the 12 Protein LoBind tubes used for that replicate. Each reaction contained protein–cDNA complex equivalent to 0.83 pmol starting cnvK linker and 0.83 pmol starting mRNA. To start each reaction, we added 40 μl of protease solution to each tube. After 5 min of protease digestion at room temperature, we added 200 μl chilled 2% BSA in PBS to quench the reaction. The solution was then added to 40 μl Dynabeads Protein G (Thermofisher) preincubated with anti-PA tag antibody (Wako; Clone number: NZ-1; 1 μg antibody per 30 μl beads) and incubated at 4 °C for 1 h. The beads were washed three times with washing buffer (PBS including 800 mM NaCl and 1% Triton) and rinsed three times with PBS, then the complex was eluted with 50 μl PBS including 250 μg ml−1 PA peptide (Wako) and 200 μg ml−1 BSA (Thermofisher). Trypsin experiments used Trypsin-EDTA (0.25%) with phenol red (Thermo Fisher Scientific) for consistency with ref. 18, and chymotrypsin experiments used α-chymotrypsin from bovine pancreas (Sigma).
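
　　As a check on the concentrations above, the two dilution series can be written out explicitly. This is a small sketch with hypothetical function and variable names; it relies only on the stated maxima (25 μM and 43.3 μM, which equals 25 × √3 μM) and the threefold step, and it represents the no-protease condition as 0.

```python
import math


def dilution_series(max_conc_um, n_protease=11, fold=3.0):
    """Return protease concentrations (uM), highest first, plus one 0 uM condition."""
    series = [max_conc_um / fold ** i for i in range(n_protease)]
    return series + [0.0]  # final entry is the no-protease control


replicate_1 = dilution_series(25.0)
replicate_2 = dilution_series(25.0 * math.sqrt(3))

# Merging both replicates gives 22 non-zero concentrations spaced by sqrt(3).
merged = sorted((c for c in replicate_1 + replicate_2 if c > 0), reverse=True)
```

　　Offsetting the second replicate by a factor of √3 places its concentrations midway (on a log scale) between those of the first, so the two replicates together sample the titration twice as finely.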

　　The amount of cDNA for each specific sequence in the eluates was quantified by qPCR using SsoAdvanced Universal SYBR Green Supermix and primers specific to each sequence. The qPCR was performed using the CFX96 Touch Real-Time PCR Detection System (Bio-Rad), and the qPCR cycles were determined by the CFX Maestro Software (Bio-Rad). See Extended Data Fig. 1.

　　For DNA library analysis, one-half volume (25 μl) of the eluted cDNA of the complex was amplified by PCR using SsoAdvanced Universal SYBR Green Supermix (Bio-Rad) to add P5 and P7 NGS adapter sequences. The number of cycles was chosen based on a test qPCR run using the same PCR reagents to avoid overamplification. The DNA fragment length and concentration were confirmed with a 4200 TapeStation System (Agilent), then the samples were analysed on a NovaSeq 6000 System (Illumina).

　　Each library in a sequencing run was identified via a unique 6- or 8-bp barcode. Following sequencing, reads were paired using the PEAR program56, then the adapter sequences were removed by Cutadapt57. A read was counted for a sequence only if it perfectly matched one of the ordered sequences at the nucleotide level.
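
　　The perfect-match counting rule can be illustrated with a few lines of Python. This is a sketch rather than the authors' pipeline: it assumes the reads have already been merged and adapter-trimmed (the PEAR and Cutadapt steps above) and are in the same orientation as the ordered oligonucleotide sequences; the function name is hypothetical.

```python
from collections import Counter


def count_perfect_matches(reads, ordered_sequences):
    """Count reads that are identical to an ordered nucleotide sequence."""
    ordered = set(ordered_sequences)
    # Start every ordered sequence at zero so unobserved sequences are reported.
    counts = Counter({seq: 0 for seq in ordered})
    for read in reads:
        if read in ordered:  # any mismatch, insertion or deletion is discarded
            counts[read] += 1
    return counts
```

　　Requiring exact identity discards reads with sequencing or synthesis errors instead of assigning them to the nearest library member, which matters when library members differ by a single codon.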

　　We used Bayesian inference to infer K50 and ΔG values for all sequences in our library. This analysis uses two main models. The first model is called the ‘K50 model’ and infers each sequence’s K50 values based on the sequencing count data. The second model is called the ‘unfolded state model’ and predicts each sequence’s unfolded state K50 value (K50,U) based on its sequence. Both models are implemented in Python 3.9 using the Numpyro package58 version 0.8.0. In Supplementary Notes, we describe the structure of each model and the procedure for fitting each model. Our scripts to reproduce the complete fitting process are provided in the Source Data.

　　Instead of sampling K50 values using 24 samples per protease at one time as described in step 5 above, we sampled K50 values using one experiment set (that is, 12 samples) and obtained K50 for trypsin replicates 1 and 2, and chymotrypsin replicates 1 and 2. Note that we still used the calibrated protease concentrations to improve consistency between replicates. The replicates were conducted on different days using the same preparation of the protein–cDNA complex.

　　The data on purified proteins shown in Fig. 1g was taken from refs. 29,59,60,61,62,63,64,65,66,67,68,69,70,71.

　　Our data (Fig. 2) were filtered for quality in three stages. First, our Bayesian procedure produces confidence intervals for K50 and ΔG estimates, producing a quality estimate for each individual measurement. Second, we evaluated the quality of each full mutational scan, classified these into categories, and removed the low quality categories from our main analysis (below). Third, we filtered our mutational scanning data to remove mutants that showed evidence of causing cleavage from the folded state or intermolecular disulfide cross-linking.

　　Nearly all low-confidence ΔG estimates result from stabilities that are outside the main dynamic range of the assay (−1 to 5 kcal mol−1). This is due to the very steep slope of ΔG with respect to K50 in this range (see Fig. 1e). For all figures, we clip all ΔG estimates to the range of −1 to 5 kcal mol−1 before further analysis. In the table of all data, the ‘dG_ML’ column categorizes sequences as ‘<−1’ and ‘>5’ if the 95% confidence interval is fully outside the range. Of the sequences with ΔG estimates between −1 and 5 kcal mol−1, the median sequence had a 95% confidence interval width of 0.14 kcal mol−1, and 99.9% of sequences had confidence intervals smaller than 0.96 kcal mol−1. Although a very small fraction of ΔG estimates were low confidence (that is, had a wide confidence interval), we still included these sequences in all analyses. Note that these confidence intervals only reflect the model’s uncertainty stemming from the finite deep sequencing counts; other uncertainties (such as uncertainty in K50,U, K50,F, protease concentrations, the validity of the kinetic model, and so on) are not reflected in these confidence intervals.
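
　　The clipping and out-of-range labelling described above can be sketched as follows. The function names are hypothetical, and the exact strings stored in the ‘dG_ML’ column are assumed to be `<-1` and `>5`; only the range (−1 to 5 kcal mol−1) and the rule that the whole 95% confidence interval must lie outside it come from the text.

```python
DG_MIN, DG_MAX = -1.0, 5.0  # dynamic range of the assay, kcal/mol


def clip_dg(dg):
    """Clip a dG estimate to the assay's dynamic range (used for figures)."""
    return min(max(dg, DG_MIN), DG_MAX)


def dg_ml_label(dg, ci_low, ci_high):
    """Label a sequence as out of range only if its whole 95% CI is outside."""
    if ci_high < DG_MIN:
        return "<-1"
    if ci_low > DG_MAX:
        return ">5"
    return dg  # interval overlaps the range: keep the numeric estimate
```

　　For example, an estimate of 5.1 kcal mol−1 with an interval of 4.8 to 5.6 keeps its numeric value under this rule, whereas an estimate of 6.0 with an interval of 5.2 to 7.0 is labelled `>5`.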

　　All mutational scanning data were classified into 12 groups (0 to 11) according to the protocol in Extended Data Fig. 8. Groups 0 and 1 contain the mutational scans that passed all quality filters. Domains in group 0 have wild-type ΔG values below 4.75 kcal mol−1 so that stabilizing mutations can still fall within the cDNA proteolysis assay’s dynamic range. Group 1 contains the remaining high-quality domains. Groups 2–11 contain mutational scans that failed one or more quality filters. All mutational scans are included in only one group, so a mutational scan classified as ‘group 5’ (for poor correlation between independent trypsin and chymotrypsin results) might also fail other filters (such as having a poor slope or intercept between trypsin and chymotrypsin results).

　　Below, we define each group, along with a short explanation of possible causes.

　　Group 0: Passing all quality filters.

　　Group 1: Passing all quality filters, but wild-type ΔG > 4.75 kcal mol−1, so stabilizing mutants may not be resolved relative to the wild type.

　　Group 2: Poorly expressed in the assay, based on low next-generation sequencing counts.

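　　Group 3: Wild-type protein is too unstable to reveal a sequence–stability relationship. This could be due to a truly unstable wild-type sequence, or to rapid cleavage of some segment in the folded state.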

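　　Group 4: Inconsistent wild-type stability (ΔG). We frequently observed this for high-stability proteins in our first library, where the wild-type stability exceeded the dynamic range of the assay.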

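　　Group 5: Poor correlation between trypsin and chymotrypsin experiments. This can indicate that one or both proteases are not probing global unfolding, leading to different mutational patterns between the proteases.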

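　　Group 6: Poor slope between trypsin and chymotrypsin experiments. This can indicate that some cleavage occurs from the folded state for one or both proteases. If cleavage can occur from the folded state for one protease, the model's K50,F will differ from the true K50,F, producing a slope between the inferred ΔG values and the true ΔG values (see Fig. 1e).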

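　　Group 7: Too many stabilizing mutants. In a typical folded domain, most mutations are neutral, so a large fraction of stabilizing mutations suggests that the wild-type ΔG may not have been measured accurately. In addition, when the vast majority of hydrophobic substitutions at surface sites are stabilizing, this suggests that the domain may be stabilized by nonspecific intermolecular interactions. For these reasons, we removed domains showing these patterns.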

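　　Groups 8 and 9: Contain multiple cysteines, with apparently good (G8) or poor (G9) results. Disulfide bonds can disrupt our assay by preventing the C-terminal cDNA from separating from the N terminus of the protein even after the protein has been cleaved by the protease. In general, we found that proteins with multiple cysteines performed poorly in our assay, and many of them fell into groups 2–7. Because of these results, we decided to remove the remaining proteins with >1 Cys (group 9). However, two proteins appeared to produce good results. Although we chose not to include these proteins in our main analysis, they are classified as group 8 (high-quality data from proteins with >1 Cys).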

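　　Group 10: Poor intercept between trypsin and chymotrypsin experiments. A poor intercept indicates that our trypsin and chymotrypsin experiments do not agree on the overall position of the mutational scan. This depends on the unfolded state model for each protease (the inferred K50,U for each protease). Because the two proteases disagree on the ΔG values for these sequences, the ΔG values may be less reliable than those in groups 0 and 1. However, the ΔΔG values for this group are still consistent between the two proteases.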

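　　Group 11: Possibly cleavable in the folded state. In many cases, excessive cleavage from the folded or partially folded state will lead to low wild-type stability (G3), poor correlation between proteases (G5) or a poor slope (G6). However, even among mutational scans that passed these filtering criteria, we saw some evidence of folded-state cleavage. Specifically, we observed cases in which mutating a wild-type cleavage site led to increased protease resistance (higher K50) and apparently higher stability (ΔG) for one protease but not the other (for example, R16 in Extended Data Fig. 6a,b). An apparent stability increase for only one protease indicates that either the site can be cleaved from the folded state by that protease, or the mutation lowers the unfolded state susceptibility (K50,U) in a way that our model does not correctly account for. Because these conditions reduce the reliability of the ΔG estimates, we removed these mutational scans from our analysis. Code for performing this filtering is provided (data_quality_filtering_script.ipynb).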
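
　　Groups 5, 6 and 10 rest on three summary statistics comparing the trypsin-derived and chymotrypsin-derived ΔG values of one mutational scan: the correlation, the slope and the intercept. A minimal sketch of how such statistics could be computed is shown below. The function name and the choice of trypsin as the x variable are assumptions, an ordinary least-squares fit is assumed, and no pass/fail thresholds are given because the actual criteria are defined in Extended Data Fig. 8 and in data_quality_filtering_script.ipynb.

```python
from math import sqrt


def agreement_stats(dg_trypsin, dg_chymotrypsin):
    """Return (pearson_r, slope, intercept) for chymotrypsin dG vs trypsin dG."""
    n = len(dg_trypsin)
    if n != len(dg_chymotrypsin) or n < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_x = sum(dg_trypsin) / n
    mean_y = sum(dg_chymotrypsin) / n
    sxx = sum((x - mean_x) ** 2 for x in dg_trypsin)
    syy = sum((y - mean_y) ** 2 for y in dg_chymotrypsin)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(dg_trypsin, dg_chymotrypsin))
    slope = sxy / sxx                      # least-squares slope
    intercept = mean_y - slope * mean_x    # least-squares intercept
    pearson_r = sxy / sqrt(sxx * syy)
    return pearson_r, slope, intercept
```

　　For two proteases that both report on global unfolding, one would expect a correlation near 1, a slope near 1 and an intercept near 0; systematic departures from each correspond to groups 5, 6 and 10 respectively.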

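　　In the previous stage, we filtered out entire domains. Here, we filtered out individual mutants from domains that otherwise passed filtering (that is, domains in group 0 or group 1). We focused on two types of mutations that could disrupt our assay. First, we filtered out data for mutations that introduce a new cleavage site into a poorly structured region of the protein and cause apparent destabilization. Because these mutants are located at poorly structured sites, the apparent destabilization is likely to result from cleavage in the folded or partially folded state. The filter was based on (1) apparent destabilization on introducing a new cleavage site, and (2) little variation in stability among the other amino acids at that site, which suggests that the site is poorly structured and that cleavage may occur in the folded state. Second, we filtered out data for Cys mutants introduced into poorly structured regions of the protein that cause apparent stabilization. Likewise, because these mutants are located at poorly structured sites, the apparent stabilization is likely to result from the formation of intermolecular or intramolecular disulfide bonds that prevent dissociation of the C-terminal cDNA after protease cleavage. Code for performing this filtering is provided (data_quality_filtering_script.ipynb).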

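　　All sequences in Dataset 2 and Dataset 3 are included in tsuboyama2023_dataset2_dataset3_20230416.csv. All sequences in this file have inferred ΔG estimates, but only the sequences in Dataset 3 have tabulated ΔΔG estimates. ΔΔG can of course be calculated for the remaining sequences in Dataset 2, but these ΔΔG values will be biased towards destabilizing mutations because stabilizing mutations will typically be indistinguishable from the wild-type stability. Note that Datasets 2 and 3 contain a very small number of sequences with low-quality data (wide confidence intervals), because these sequences come from mutational scans that are high quality overall. Although these tables include all K50, ΔG and ΔΔG data (for Dataset 3), low-quality data (including the mutants filtered in stage 3) have been filtered out and replaced with a ‘-’ symbol in the columns labelled ‘_ML’ (for machine learning).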
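
　　When reading this file, the ‘-’ placeholder in the ‘_ML’ columns has to be handled before values are treated as numbers. The sketch below uses only the Python standard library; the helper names are hypothetical, `dG_ML` is the column named earlier in this section, and the assumption that other non-numeric entries (such as out-of-range labels) should be kept as text is ours.

```python
import csv


def parse_ml_value(raw):
    """Convert one '_ML' cell: '-' means filtered out, numbers become floats."""
    raw = raw.strip()
    if raw == "-":
        return None          # low-quality data removed by the authors
    try:
        return float(raw)
    except ValueError:
        return raw           # keep any other non-numeric label as text


def read_ml_column(handle, column="dG_ML"):
    """Read one '_ML' column from an open CSV file handle."""
    return [parse_ml_value(row[column]) for row in csv.DictReader(handle)]
```

　　A typical call would open tsuboyama2023_dataset2_dataset3_20230416.csv with `open(path, newline="")` and pass the handle to `read_ml_column`; entries returned as `None` should be excluded from model training rather than treated as zero.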

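　　We performed principal component analysis to identify the factors that influence the stability of different amino acids (Fig. 3). For this, we used 17,093 sites in 365 domains that were classified as G0 above. All folding stability data were clipped to between −1 and 5 kcal mol−1, because folding stabilities outside the dynamic range are unreliable; the mean stability of the 20 amino acids at each site was then subtracted from the data. Using these data, we performed principal component analysis with the scikit-learn library in Python 3.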
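
　　The preprocessing applied before the principal component analysis (clipping, then subtracting each site's mean over the 20 amino acids) can be sketched in plain Python as follows. The function name and the input layout (one row of 20 ΔG values per site) are assumptions made for illustration; the decomposition itself is not reproduced here.

```python
def preprocess_site_profiles(dg_by_site, low=-1.0, high=5.0):
    """Clip each site's dG values to [low, high] and subtract the site mean.

    dg_by_site: list of rows, one per site, each holding the dG values
    (kcal/mol) of the amino acids at that site.
    """
    centred = []
    for row in dg_by_site:
        clipped = [min(max(value, low), high) for value in row]
        site_mean = sum(clipped) / len(clipped)
        centred.append([value - site_mean for value in clipped])
    return centred
```

　　The resulting sites × 20 matrix is the kind of input that could be passed to scikit-learn's `PCA` class. Subtracting the per-site mean removes the overall tolerance of each site to substitution, so the components describe which amino acids are preferred at a site rather than how sensitive the site is.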

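　　Burial and contact numbers (Fig. 3b and Extended Data Fig. 9h) were calculated from the AlphaFold models1 of all sequences with Bio.PDB72 in Biopython73, using the accompanying script burial_side_chain_chain_contact_fig3_fig6.ipynb. The calculation is based on the ‘sidechain_neighbors’ LayerDesign method in Rosetta reported previously18. Briefly, to calculate burial or contacts for residue X, we count the number of residues within a cone projecting 9 Å out from the Cβ atom of residue X, in the direction of the Cα–Cβ vector of residue X. ‘Burial’ (Fig. 6h) indicates the number of Cα atoms in the cone. The contact counts (Fig. 3d) each count different atoms within the cone: ‘side-chain contact number’ (Fig. 3d) counts all Cβ atoms; ‘aromatic side-chain contact number’ counts all CE2 atoms of Phe, Tyr and Trp; ‘acidic side-chain contact number’ counts all Glu OE1 and Asp OD1 atoms; and ‘basic side-chain contact number’ counts all Lys NZ and Arg NE atoms.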
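
　　The cone-based count can be illustrated with a simplified geometric sketch. This is not the accompanying script: the function name is hypothetical, the 45° half-angle is an arbitrary placeholder because the text gives only the 9 Å reach, and hard distance and angle cutoffs are used here, whereas the Rosetta method it is based on may weight neighbours smoothly.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def count_in_cone(ca, cb, atoms, max_dist=9.0, half_angle_deg=45.0):
    """Count atoms inside a cone with apex at cb pointing along ca -> cb.

    ca, cb: (x, y, z) coordinates of residue X's C-alpha and C-beta atoms.
    atoms:  coordinates of the candidate atoms (e.g. other residues' C-alpha
            atoms for burial, or C-beta atoms for side-chain contacts).
    half_angle_deg is an assumed value, not taken from the source.
    """
    axis = _sub(cb, ca)
    axis_len = _norm(axis)
    cos_limit = math.cos(math.radians(half_angle_deg))
    count = 0
    for atom in atoms:
        offset = _sub(atom, cb)
        dist = _norm(offset)
        if dist == 0.0 or dist > max_dist:
            continue  # skip the apex itself and atoms beyond the cone's reach
        cos_angle = sum(a * o for a, o in zip(axis, offset)) / (axis_len * dist)
        if cos_angle >= cos_limit:
            count += 1
    return count
```

　　Passing different atom sets to the same function gives the different quantities described above: Cα atoms for burial, Cβ atoms for the side-chain contact number, and the listed side-chain atoms for the aromatic, acidic and basic contact numbers.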

　　Using the DSSP algorithm74,75, we obtained secondary structure information from the AlphaFold models (Fig. 3b).

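　　Double mutants (Fig. 4) were selected for analysis in two ways. First, we manually selected polar interactions in which the amino acid pair appeared to be important for stability in the single-mutant analysis. These pairs were mainly included in Library 3. Second, we used the program confind76,77 to identify interacting residues. We selected all pairs with notable interactions (such as polar interactions and cation–π interactions), along with a randomly selected subset of the more common interactions (such as hydrophobic interactions). These pairs were included in Library 4.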

　　The thermodynamic coupling model and the procedure for fitting the model (Fig. 4) are described in the Supplementary Notes.

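　　The wild-type sequence prediction model (Fig. 5) and the procedure for fitting the model are described in the Supplementary Notes.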

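　　To calculate the normalized mean GEMME score, which represents the sensitivity of the wild-type amino acid to substitutions as inferred from evolutionary information (ΔΔE in the previous reports36,37), we ran GEMME42 with default parameters on each natural amino acid sequence. We calculated a single score for each site by averaging the scores of 19 amino acids (excluding Cys), then standardized each domain separately (subtracting the domain mean and dividing by the domain standard deviation) so that the site scores within the domain have a mean of zero and a standard deviation of one. Finally, we flipped the sign of the scores so that positive values indicate high sensitivity to mutation (that is, the raw GEMME scores of the non-wild-type amino acids are very negative). We define this standardized per-site score as the normalized GEMME score. To build the input multiple sequence alignments, we ran five iterations of the profile HMM homology search tool jackhmmer78,79 using the EVcouplings framework81. We used the default bitscore threshold of 0.5 bits per residue.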
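
　　The normalization steps (per-site averaging without Cys, per-domain standardization, sign flip) can be sketched as follows. The function name and input layout are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def normalized_gemme_scores(site_scores):
    """Return one normalized GEMME score per site for a single domain.

    site_scores: list with one dict per site, mapping amino acid letters to
    raw GEMME scores for that site. Cys entries are ignored.
    """
    site_means = [
        mean(score for aa, score in scores.items() if aa != "C")
        for scores in site_scores
    ]
    domain_mean = mean(site_means)
    domain_sd = pstdev(site_means)  # population SD: an assumption
    if domain_sd == 0:
        raise ValueError("all sites have the same mean score")
    # Flip the sign so that positive values mean high sensitivity to mutation.
    return [-(m - domain_mean) / domain_sd for m in site_means]
```

　　Because the standardization is done within each domain, the score ranks sites relative to the rest of the same domain; scores from different domains are on a common scale only in that relative sense.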

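　　For most structural analyses, we used structure models predicted by AlphaFold1. We ran AlphaFold with default parameters and selected the model with the highest pLDDT score for each sequence. For designed sequences, we skipped the step of generating multiple sequence alignments.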

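　　We did not use statistical tests here. We did not perform multiple experiments under exactly the same conditions, but we used two different proteases and two different sets of protease concentrations to confirm reproducibility. In addition, we confirmed that identical amino acid sequences show consistent K50 values across different libraries (Extended Data Fig. 5).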

　　Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
