Protein Structure Prediction with Mass Spectrometry Data

Knowledge of protein structure is crucial to our understanding of biological function and is routinely used in drug discovery. High-resolution techniques to determine the three-dimensional atomic coordinates of proteins are available. However, such methods are frequently limited by experimental challenges such as sample quantity, target size, and efficiency. Structural mass spectrometry (MS) is a technique in which structural features of proteins are elucidated quickly and relatively easily. Computational techniques are needed to convert sparse MS data into protein models that agree with the data. This review features cutting-edge computational methods that predict protein structure from data generated by MS techniques such as chemical cross-linking, hydrogen–deuterium exchange, hydroxyl radical protein footprinting, limited proteolysis, ion mobility, and surface-induced dissociation. Additionally, we address future directions for protein structure prediction with sparse MS data.


INTRODUCTION
Proteins are involved in nearly every life process, making them important subjects for studying the molecular basis of disease. Additionally, protein structures can be harnessed for structure-based drug discovery with existing and designed drug-like molecules (1). However, a disparity currently exists between the number of known protein sequences and the number of determined structures. Methodologies to elucidate protein structure are vital to our understanding of molecular biology and for continued use in drug discovery.
Multiple experimental techniques exist to determine high-resolution protein structure. In the popular X-ray crystallography method, a high concentration of a protein target is first crystallized. Then, the crystals are struck with an X-ray beam to produce a diffraction pattern from which atomic protein coordinates can be determined (2). While powerful, crystallography is rate-limited by the crystallization process, as ascertaining experimental conditions ideal for crystal growth can be a tedious, if not impossible, process. X-ray crystallography has historically been more successful for ordered and monomeric proteins. Nuclear magnetic resonance (NMR) spectroscopy is another high-resolution technique that uses the chemical shifts of protein atoms for structure determination (3). In most cases, this technique is limited to smaller proteins to avoid overlapping peaks. Cryo-electron microscopy (cryo-EM) has recently emerged as a promising structure determination technique that can elucidate larger, more complex proteins while bypassing the need for crystallization, probing the protein in more physiological conditions (4). However, further optimization of cryo-EM methodologies is required to consistently determine higher-resolution density maps.
Due to the limitations of these techniques, many proteins or protein complexes currently evade high-resolution structure determination. Thus, additional experimental methods are needed to provide insight into structural features. Structural mass spectrometry (MS) is a powerful complementary approach that can overcome the limitations of the above-mentioned methods with its high sensitivity, lack of a theoretical size limit, and speed. Although the data provided by MS are too sparse for full high-resolution structure elucidation, structural MS can be used to examine the size, solvent accessibility, and topography of proteins (5-7). Several MS techniques exist that can elucidate elements of protein tertiary and quaternary structure, including chemical cross-linking (XL-MS) (8,9), hydrogen-deuterium exchange (HDX-MS) (10), covalent labeling (CL-MS) (11,12), limited proteolysis (13), ion mobility (IM-MS) (14), and surface-induced dissociation (SID-MS) (15), reviewed here (Figure 1). In XL-MS, residue modifications provide insight into the spatial proximity of modified residues. HDX-MS, CL-MS, and limited proteolysis data are used to infer residue solvent exposure. IM-MS data reveal information about the size and shape of proteins, while SID-MS is used to analyze protein complex connectivity and stoichiometry. Sparse experimental data from structural MS generally must be interpreted in combination with computational methods to elucidate protein structure.
Computational methods have increasingly been employed to complement experimental techniques in order to elucidate protein structures (16,17). As experimental data become more readily available, software packages can be employed to combine sparse data with advanced structure sampling and scoring techniques. A number of computational tools currently exist for protein structure modeling, including the Rosetta software suite (17,18), I-TASSER (19), Phyre2 (20), Integrative Modeling Platform (IMP) (21), HADDOCK (22), and MODELLER (23). Sparse experimental data can be incorporated during the computational modeling process or used as a filter during post-model generation analysis. Here, we highlight work that combines computational efforts for protein structure examination with sparse experimental data from MS. We discuss work that incorporates XL-MS, HDX-MS, CL-MS, limited proteolysis, IM-MS, and SID-MS experimental data into computational modeling.
Figure 1 (caption, partially recovered): (f) Appearance energies (AEs) can be deduced from surface-induced dissociation, in which a protein complex collides with a surface (vertical black bar) and breaks apart, providing insight into the stoichiometry and connectivity of the complex. Data from these techniques are then incorporated into computational modeling techniques such as protein-protein docking to examine complexes, structure prediction via ab initio or homology modeling, and molecular dynamics based on experimental restraints.

CHEMICAL CROSS-LINKING
XL uses reagents to chemically link two amino acids, particularly the side chain atoms within lysine residues, to assess proximity within a protein or within protein complexes (9). After digestion and separation via liquid chromatography, cross-links can be identified via tandem MS. XL-MS experiments provide insight into protein structure. Residues that are distant from one another in amino acid sequence can be identified as being within spatial proximity. Interactions between protein complex subunits can also be inferred from residues that are identified as cross-linked. Only residues that are solvent-exposed should be modified by a cross-linking reagent. As such, the cross-linking agent can give insight into proximity between surface residues, from which contact information can be further derived with computational methods that use XL-MS data. XL-MS efforts have been incorporated into the Critical Assessment of Structure Prediction (CASP) challenges to integrate high-density XL-MS data into prediction methods (24). Kahraman and coworkers (25) developed methodologies for applying cross-linking data to homology modeling, de novo modeling, and protein-protein docking. A database called XLdb was also assembled that contained XL-MS data for individual proteins and protein complexes with the corresponding Protein Data Bank entries, providing a source of accessible data for the MS and computational communities. Building on an earlier publication that established Xwalk, a program that determines the shortest distance between cross-linked amino acids within solvent-accessible regions (26), distance restraints determined from XL-MS data were implemented in the Rosetta scoring function. The Rosetta functionality penalized models that conflicted with the experimental data. For instance, models with residues participating in a cross-link that were spatially farther apart than the spacer length of the cross-linker received a penalty. The distance restraints were also applied as filters to examine existing models. Overall, the use of the cross-linking distance restraints was found to improve both the root-mean-square deviation (RMSD) of the top-scoring models and protein-protein docking (Figure 2). Similar methodology was applied by Lössl and colleagues (27) to work in which cross-linking data were used to determine differences in conformational ensembles and interaction modes of singular and interacting proteins. Additionally, recent work by Piotrowski and colleagues (28) used XL distance restraints with Rosetta to build models of calmodulin interacting with bMunc13-2 and subsequently to identify a unique binding mode.
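The restraint idea described above, in which a model is penalized when cross-linked residues sit farther apart than the cross-linker can span, can be illustrated with a minimal Python sketch. This is not the Rosetta or Xwalk implementation: the function names, the quadratic flat-bottom form, and the use of straight-line (Euclidean) distance rather than a solvent-accessible surface path are all assumptions made for illustration.

```python
import math


def crosslink_penalty(coords, crosslinks, max_distance, weight=1.0):
    """Sum a flat-bottom penalty over cross-linked residue pairs.

    coords       -- dict mapping residue id -> (x, y, z) in angstroms
    crosslinks   -- iterable of (residue_i, residue_j) pairs observed by XL-MS
    max_distance -- longest distance (angstroms) the cross-linker is assumed to span
    weight       -- hypothetical scaling factor for the penalty

    Pairs closer than max_distance contribute nothing; pairs farther apart
    contribute weight * (excess distance)^2. The quadratic form is an
    illustrative assumption, not the published functional form.
    """
    total = 0.0
    for res_i, res_j in crosslinks:
        distance = math.dist(coords[res_i], coords[res_j])
        excess = distance - max_distance
        if excess > 0.0:
            total += weight * excess ** 2
    return total


def satisfied_fraction(coords, crosslinks, max_distance):
    """Fraction of cross-links whose residues lie within max_distance.

    This mirrors the use of restraints as a post hoc filter on existing models.
    """
    pairs = list(crosslinks)
    if not pairs:
        return 1.0
    hits = sum(
        1 for i, j in pairs if math.dist(coords[i], coords[j]) <= max_distance
    )
    return hits / len(pairs)
```

A model could then be rescored by adding `crosslink_penalty(...)` to its energy, or filtered by requiring `satisfied_fraction(...)` to exceed a chosen cutoff; both choices are left to the user in this sketch.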
XL-MS data were used in the protein structure investigation of human serum albumin protein domains by Belsom and colleagues (29). Instead of traditional XL reagents, this work employed a photo-XL agent that led to increased XL data to probe the protein both in isolation and within blood samples. Upon modeling with XL-MS data as restraints and residue contacts predicted with newly developed software, serum albumin protein models were successfully identified with low RMSD values (3-6 Å) for both the purified protein and the protein in blood samples. A similar approach was explored in work from dos Santos et al. (30) in which XL-MS data along with coevolutionary information were applied to protein structure prediction. In that work, simplified models containing only alpha carbons were used with restraints from XL-MS and coevolutionary data via direct coupling analysis to elucidate tertiary structure. Models were evaluated by clustering and template modeling score for multiple proteins. Quality models were identified, supporting the effectiveness of the proposed methodology.
Hauri et al. (31) used computationally determined models for a very large (1.8-MDa) protein complex found in human plasma to examine specific peptides from XL-MS, an effort denoted as targeted XL-MS. Targeted XL-MS used different MS acquisition techniques to discriminate between computational models of the protein complex modeled by Rosetta's homology modeling protocol. Proteins from the complex were docked together to produce a collection of potential models that represented the quaternary structure of the complex. Models of the protein-protein complex that scored well with the cross-linking data were used to identify a short list of potentially cross-linked lysine pairs. Models then underwent a flexible backbone docking workflow with cross-linking data as distance restraints. Overall, the development of targeted XL-MS paved the way for continued improvement of the quaternary structure prediction of highly complex systems. Recent work by Khakzad (32) and others sought to elucidate another large protein complex, the membrane attack complex. A streamlined protocol for targeted XL-MS was pursued to examine the bacterial protein complex in human plasma. The cross-linking results were used to obtain a complete model of the complex that was corroborated with existing models from crystallography and cryo-EM. This work further demonstrated the applicability of XL-MS, particularly to complex targets from bacterial systems relevant to human disease.
XL force field (XLFF), a force field that relied upon XL-MS restraints, was applied to Rosetta's ab initio protocol by Ferrari and colleagues (33). This was accomplished by determining the probability of identifying residues that could potentially cross-link within a nonredundant set of proteins from the Protein Data Bank. The resulting probability curve was then used to determine a potential energy function reliant on the cross-linker length and the residues involved in linkage. Usage of the XLFF resulted in higher quality, more native-like models occurring within the top-scoring model distributions.
In addition to the inclusion of cross-linking data within Rosetta, software has been developed outside the Rosetta suite. Degiacomi and coworkers (34) implemented a software tool called DynamXL to consider the implications of protein dynamics when modeling cross-linking data. In contrast to other methods that rely upon the beta carbon for distance measurements, the DynamXL algorithm employed the side chain nitrogen atom of lysine for distance calculations, which was suggested as being more experimentally accurate and less computationally expensive. Additionally, the method took the flexibility of residue side chains into account by examining different rotamers and backbone conformations. The work sought to minimize the elimination of reasonable cross-links while simultaneously excluding impossible cross-links, which led to fewer errors when classifying cross-links. Overall, the application of this methodology led to improved RMSD values from protein-protein docking, highlighting the accuracy of the implementation.
Recent work by Mintseris & Gygi (35) explored high-density XL-MS efforts in combination with IMP and Rosetta. The methodology was used to model carbonic anhydrase proteins and the yeast proteasome. To minimize computational cost, the implemented software reduced sampling of decoy and target peptides to minimize false discovery rates and simplify false discovery rate calculations. Alternative reagents that established cross-links with additional residue types increased the cross-linking density, thus providing better results. XL-MS data were applied to the modeling of inhibitor-bound carbonic anhydrase via restraints applied during protein-protein docking with Rosetta. High-quality models were identified. Additionally, the work tackled the modeling of the yeast proteasome with both Rosetta and IMP based on the XL-MS data. Coarse-grained models of the complex were elucidated, and regions were verified by existing cryo-EM models.

HYDROGEN-DEUTERIUM EXCHANGE
HDX is a prevalent nonspecific covalent labeling technique in which a protein is exposed to a deuterium-rich solvent (10). Amide hydrogen atoms are able to exchange with deuterium atoms to label the protein backbone. After digestion and separation with liquid chromatography, MS can be used to identify regions of exchange. HDX-MS has also been used with other techniques such as electron capture dissociation to assess hydrogen-bonding configurations (36). Regions of the protein are more likely to be modified by HDX if the amide hydrogens are solvent accessible and not actively participating in a hydrogen bond. HDX data are often resolved to the fragment level, but occasionally residue-specific modifications are reported. From there, data can be expressed as percentage modification, rate constants, or protection factors (PF), all of which are routinely used as inputs into computational modeling to guide results based on agreement with HDX data.
HDX-MS data have been used with homology modeling, as seen in work from Zhang and coworkers (37). Homology modeling with MODELLER, Phyre2, and I-TASSER was used to model the tertiary structure of cytochrome c. HDX-MS results were taken into account when examining the models. Additionally, the relationship between HDX modification and solvent-accessible surface area (SASA) was examined to identify the best models. The models from Phyre2 showed the best agreement with the HDX-MS results, and the SASA values from this model led to better correlations with the percent modification identified from HDX experiments. The results of this work effectively demonstrated that both HDX data and solvent exposure could be used to identify better homology models and to improve on our previous understanding of the cytochrome c mechanism. While HDX-MS data have not been applied to ab initio modeling, HDX-NMR data have been recently implemented into protein structure prediction (38).
HDX-MS data, in combination with molecular dynamics (MD) simulations, were employed to examine empirical and fractional population models for G-protein-signaling regulator proteins in work from Mohammadiarani et al. (39). Using long-timescale MD simulations with AMBER and CHARMM force fields, PFs were calculated from simulation frames and then compared to experimentally determined percent modification data. Fractional population models were determined to be more accurate and less prone to error than empirical models, suggesting that the SASA of amide hydrogens coupled with the distance between the amide hydrogen and first polar atom could be used for accurate predictions. This work also indicated that amide hydrogen atoms could fluctuate in exposure over a sub-100-ps timescale. HDX-MS and MD simulations were also applied to examine interactions between lipids and membrane proteins, such as lipid-induced conformational changes in proteins, in work from Martens and coworkers (40). The framework developed in the study emphasized a multistep protocol. After using HDX-MS to evaluate the protein in both the presence and absence of lipids, interactions were interpreted via MD simulations in various bilayer conditions. The interactions identified from the simulation were then corroborated by experimental mutagenesis of relevant sites. The methodology presented in this work was suggested as a basis for further study of various lipid-protein interactions in membranes. Beyond this work, size-exclusion chromatography, in combination with HDX-MS and circular dichroism, was used with computational techniques such as homology modeling and MD simulations to examine the activity of transaminases in work from Makarov and others (41). This study demonstrated that the protocol could be applied to enzyme-directed evolution efforts.
Recently, Zhang and colleagues (42) used both XL-MS and HDX-MS data to evaluate protein-protein docking models of interleukin 7 (IL-7) and its alpha receptor (Figure 3). HDX-MS analysis was performed on IL-7 both free and bound with its receptor to elucidate changes in exposure. XL-MS was also applied to the system to identify residues involved in the receptor-binding interface of IL-7. Protein-protein docking with RosettaDock produced models of the complex, and top-scoring models were subsequently clustered. Clustering data were analyzed for different numbers of cross-links and subsequently validated by HDX data. When examining the cross-linking data, it was deduced that some cross-links that suggested an interface at a particular region were undermined by the HDX data that implied protection at the same region, suggesting that a two-pronged approach was necessary to verify findings. Solvent exposure was additionally examined using SASA for identified models to determine whether the models agreed with regions of protection and exposure identified by HDX. Overall, this methodology emphasized the importance of applying more than one structural MS technique to quaternary structure prediction.
HDX-MS data have also been applied to antibody-antigen modeling. Huang et al. (43) used HDX-MS data along with electron-transfer dissociation to examine binding of the mAb1 antibody with a cytokine with implications in autoimmune disease. SASA calculations and protein-protein docking provided additional insight into the antibody-antigen binding interface. The study emphasized the importance of HDX-MS data and complementary computational efforts for epitope elucidation. Additionally, recent efforts from Jeliazkov et al. (44) were applied to the improvement of Rosetta software for antigen-antibody modeling, RosettaAntibody and SnugDock. The SnugDock feature relies on flexible docking to elucidate the complementarity determining region (CDR) loop, which is implicated in antigen binding and unique among antibody structures, and to configure an adjustment of the heavy and light fragments relevant to antigen-antibody interactions. Restraints from HDX-MS data were used to score antigen-antibody complexes based on agreement with the data. When the HDX-MS restraints were tested on an antibody-antigen complex with available labeling data, the restraint-based methodology led to a more native-like structure of the CDR loop.

HYDROXYL RADICAL PROTEIN FOOTPRINTING
Hydroxyl radical protein footprinting (HRPF) is a nonspecific CL-MS technique in which hydroxyl radicals can covalently modify 19 of the 20 amino acid types in proteins (11). Generated via photolysis or radiolysis of water or hydrogen peroxide, hydroxyl radicals modify residues with varying degrees of reliability and reactivity, as indicated by a broad range of relative intrinsic reactivities (12). Rate constants for labeled peptide fragments and individual residues can be determined and used to calculate PF, the relative intrinsic reactivity divided by the labeling rate constant for the particular residue. Because HRPF labeling is more likely to occur in regions that are solvent exposed, residues that are more protected (have a higher PF) are correlated with lower solvent exposure and vice versa.
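The PF definition given above (relative intrinsic reactivity divided by the labeling rate constant) can be written out directly. The short Python sketch below computes PF and its natural logarithm for one residue; the function names are hypothetical and the numeric values in the usage note are invented purely for illustration.

```python
import math


def protection_factor(intrinsic_reactivity, rate_constant):
    """PF = relative intrinsic reactivity / measured labeling rate constant.

    Both inputs must be positive. A slowly labeled (protected) residue has a
    small rate constant and therefore a large PF.
    """
    if intrinsic_reactivity <= 0 or rate_constant <= 0:
        raise ValueError("reactivity and rate constant must be positive")
    return intrinsic_reactivity / rate_constant


def ln_protection_factor(intrinsic_reactivity, rate_constant):
    """Natural logarithm of PF, a form often used when relating PF to exposure."""
    return math.log(protection_factor(intrinsic_reactivity, rate_constant))
```

For example, with a made-up intrinsic reactivity of 20 and measured rate constants of 2 and 10, the first residue has PF = 10 and the second PF = 2, so the first would be interpreted as the more buried of the two.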
Xie and colleagues (45) recently examined the relationship between residue protection and solvent exposure using MD simulations. The work emphasized that normalization of HRPF data should be sequence dependent and not based on standard values determined from free amino acids. With labeling data for myoglobin and lysozyme, a method was proposed in which accurate side chain SASA values are derived from HRPF data by normalizing labeling data based on sequence context. This was validated by improvements in correlation between labeling data and SASA. When examining the relationship between normalized PF and relative SASA, the correlation was determined to worsen as the relative intrinsic reactivity of the amino acids considered decreased, suggesting that only residues with higher intrinsic reactivity should be used in structural analysis based on PF. When the rate constant of a particular residue in the folded protein was normalized with the rate constant of the same residue in the denatured protein, the correlation improved for all non-sulfur-containing residues (Figure 4). A prediction equation that established a relationship between relative SASA and the normalized rate constant was determined such that relative SASA could be calculated from HRPF data. When the prediction equation was tested with homology models of lysozyme, models with backbone RMSD less than 3 Å could be differentiated from models with backbone RMSD greater than 4 Å.
Our group has used HRPF labeling data for protein structure prediction. We used the relationship between the natural logarithm of PF (lnPF) and a residue exposure metric, spherical neighbor count, for 15 relaxed crystal structures of calmodulin as a prediction equation. The equation was then implemented in the first available software to use HRPF data for protein structure prediction (46). When tested on ab initio models for four benchmark proteins, the addition of our score term within the Rosetta framework led to improvement in the best-scoring model RMSD and funnel-like quality of the score versus RMSD distributions. Results were further validated through the use of a confidence metric that assessed the funnel-like quality of the score versus RMSD distribution when RMSD was calculated to the best-scoring model. Follow-up work explored the incorporation of labeling data into the ab initio folding algorithm, as opposed to using labeling data for model rescoring (47).
More recently, we sought to improve the correlation between the lnPF and the neighbor counts of HRPF-labeled residues, as we hypothesized that accounting for side chain flexibility would improve the relationship (48). We used a conical neighbor count for a subset of residue types selected based on intermediate to high intrinsic reactivity and simulated side chain flexibility with MD simulations and a Rosetta mover ensemble for four benchmark proteins. Upon determining that the normalized root-mean-square error of lnPF versus conical neighbor count was comparable between MD and the mover ensemble, we developed a new Rosetta score term. We scored 20,000 ab initio models with our term, then calculated a total score by combining the HRPF score with the Rosetta score. The top 20 scoring models were used as inputs for mover model generation, then scored with both Rosetta and HRPF data. Upon including mover models in our distributions, we found that the best-scoring model was identified at accurate atomic detail for three of the four proteins, indicating that HRPF with a Rosetta mover ensemble can be used to significantly improve model quality.
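The rescoring step described above (combine an experimental score with the Rosetta score, then keep the top-ranked models) can be sketched in a few lines of Python. The dictionary keys, the `weight` parameter, and the simple additive combination are assumptions for illustration; the only convention relied on is that lower Rosetta scores indicate better models.

```python
def rank_by_total_score(models, weight=1.0, top_n=20):
    """Return the top_n models ordered by total score (lowest first).

    models -- list of dicts with keys 'name', 'rosetta_score', 'hrpf_score'
              (hypothetical field names)
    weight -- hypothetical factor applied to the experimental score term
    top_n  -- how many models to keep for a follow-up refinement stage
    """
    def total(model):
        return model["rosetta_score"] + weight * model["hrpf_score"]

    return sorted(models, key=total)[:top_n]
```

In practice the relative weight of the experimental term would have to be calibrated on benchmark proteins; the default of 1.0 here is only a placeholder.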

OTHER COVALENT LABELING METHODS AND LIMITED PROTEOLYSIS
Besides the popular HDX and HRPF techniques, other covalent labels have also been used to elucidate protein structure. Carbene, another nonspecific covalent labeling reagent, has been used for structural MS. Carbene footprinting was applied by Manzi and coworkers (49) to examine the binding sites of lysozyme and a large protease. Additional work by Manzi et al. (50) demonstrated that carbene footprinting could be applied to more complex cases by elucidating the interfaces of a trimer membrane protein. Radical trifluoromethylation, in which 18 amino acids can be modified, has also been used for covalent labeling structural MS. Myoglobin, beta-lactoglobulin, and membrane protein vitamin K epoxide reductase were explored by radical trifluoromethylation in novel efforts by Cheng and coworkers (51). This work paved the way for an additional study in which trifluoromethyl radicals were produced via synchrotron radiolysis (52). Radical trifluoromethylation is a particularly promising technique for future structure prediction efforts.
In addition to nonspecific covalent labeling reagents, other covalent labeling reagents that modify only specific residues have been used to probe protein structure. Diethylpyrocarbonate (DEPC) is a readily available labeling reagent that modifies cysteine, lysine, histidine, serine, threonine, and tyrosine residues along with the N-terminus. The residue microenvironment has been recently shown to play a role in labeling weakly nucleophilic serine, threonine, and tyrosine (STY) residues, as labeled STY residues with lower solvent exposure were found to be in the vicinity of hydrophobic residues (53). Based on this study, we developed a score term within Rosetta to reward models that demonstrated agreement with DEPC labeling data (54). Labeled STY residues with 5% to 35% relative SASA were rewarded for having more hydrophobic neighbors, while unlabeled STY residues with the same solvent exposure were rewarded for having fewer hydrophobic neighbors. Additionally, our term rewarded labeled histidine and lysine residues with higher solvent exposure, as residues that are more exposed are more likely to be covalently labeled. The DEPC score was added to the Rosetta score, and models were ranked by total score. We tested our term with ab initio and homology models for six benchmark proteins and found that the best-scoring model RMSD and funnel-like quality of the score versus RMSD distributions improved with use of our term.
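The reward rules described above can be expressed as a toy scoring function. The Python sketch below is only an illustration of the logic: the linear reward forms, the unit weights, the field names, and the treatment of the 5% to 35% window as inclusive are assumptions, and the published Rosetta score term may use different functional forms. Following the Rosetta convention, a more negative score is more favorable.

```python
STY = {"SER", "THR", "TYR"}      # weakly nucleophilic residues
HIS_LYS = {"HIS", "LYS"}         # residues rewarded for exposure when labeled


def depc_score(residues, sty_weight=1.0, exposure_weight=1.0):
    """Toy DEPC agreement score for one model (lower is better).

    residues -- list of dicts with hypothetical keys:
        'type'                  three-letter residue name
        'labeled'               True if DEPC labeling was observed
        'rel_sasa'              relative SASA between 0.0 and 1.0
        'hydrophobic_neighbors' count of nearby hydrophobic residues
    """
    score = 0.0
    for res in residues:
        if res["type"] in STY and 0.05 <= res["rel_sasa"] <= 0.35:
            if res["labeled"]:
                # Labeled STY: more hydrophobic neighbors is rewarded.
                score -= sty_weight * res["hydrophobic_neighbors"]
            else:
                # Unlabeled STY: fewer hydrophobic neighbors is rewarded,
                # expressed here as a cost that grows with the neighbor count.
                score += sty_weight * res["hydrophobic_neighbors"]
        elif res["type"] in HIS_LYS and res["labeled"]:
            # Labeled His/Lys: higher solvent exposure is rewarded.
            score -= exposure_weight * res["rel_sasa"]
    return score
```

The value returned by `depc_score` would be added to a model's Rosetta score before ranking, as described in the text.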
Similar to covalent labeling, limited proteolysis is a technique in which a protein is exposed to a low concentration of protease that cleaves solvent-accessible regions of the protein (13,55). Hennig and coworkers (56) developed a pipeline between MDMDAT, software that analyzes MS data, and HADDOCK, a protein-protein docking algorithm. Limited proteolysis data were first analyzed by MDMDAT and then used by HADDOCK to dock the protein Rpn13 with ubiquitin. This work demonstrated that limited proteolysis data could be applied to a protocol for protein complex modeling that was easier and quicker than structure determination methods such as NMR. Limited proteolysis was also applied to examine protein complexes in work by Proctor and colleagues (57). Limited proteolysis elucidated by MS guided the modeling of the Cu/Zn superoxide dismutase (SOD1) trimer protein complex. Software was developed to translate locations of proteolysis into restraints that were applied to discrete MD simulations. Such restraints emphasized the importance of regions affected by proteolysis being solvent exposed. After coarse-grained and full-atom MD simulations to isolate the lowest energy model, computational mutagenesis was applied to examine interface residues of importance to SOD1 trimer generation.

ION MOBILITY
IM is a structural native MS technique in which proteins are subjected to soft ionization in the gas phase and then exposed to a nitrogen or helium gas chamber in which an electric field is applied. Instead of residue- or fragment-resolved data, as for the previously described techniques, IM-MS provides insight into the shape of the protein. Commonly calculated from IM-MS data is the collision cross section (CCS), which is the rotationally averaged two-dimensional projection area of the protein. Computational methods currently exist to predict CCS from protein structure, including the trajectory method (58,59), projection superposition approximation (60), and projection approximation (61).
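To make the idea of a rotationally averaged projection area concrete, the following Python sketch estimates it for a structure represented as spheres. It is a simplified illustration in the spirit of a projection approximation, not a reimplementation of any published method: the grid-based area estimate, the `probe_radius` parameter, the default grid spacing, and the number of orientations are all assumptions, and real CCS predictors use calibrated atomic radii and treatments of the buffer gas.

```python
import math
import random


def random_rotation_rows(rng):
    """First two rows of a uniformly random rotation matrix.

    A normalized four-dimensional Gaussian vector is a uniformly distributed
    unit quaternion; only the rows needed for projection onto x-y are built.
    """
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    norm = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / norm for c in q)
    row_x = (1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w))
    row_y = (2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w))
    return row_x, row_y


def projected_area(centers, radii, rows, grid):
    """Area of the union of projected discs, estimated on a square grid."""
    row_x, row_y = rows
    covered = set()
    for (x, y, z), r in zip(centers, radii):
        px = row_x[0] * x + row_x[1] * y + row_x[2] * z
        py = row_y[0] * x + row_y[1] * y + row_y[2] * z
        for i in range(math.floor((px - r) / grid), math.ceil((px + r) / grid) + 1):
            cx = (i + 0.5) * grid  # x coordinate of the cell center
            for j in range(math.floor((py - r) / grid), math.ceil((py + r) / grid) + 1):
                cy = (j + 0.5) * grid
                if (cx - px) ** 2 + (cy - py) ** 2 <= r * r:
                    covered.add((i, j))
    return len(covered) * grid * grid


def rotationally_averaged_area(centers, radii, probe_radius=0.0,
                               n_orientations=200, grid=0.5, seed=0):
    """Average the projected area over random orientations (square angstroms)."""
    rng = random.Random(seed)
    inflated = [r + probe_radius for r in radii]
    total = 0.0
    for _ in range(n_orientations):
        total += projected_area(centers, inflated, random_rotation_rows(rng), grid)
    return total / n_orientations
```

For a single sphere the result should approach the area of a circle with the same radius, which offers a quick sanity check; for elongated or multi-domain shapes the average depends on how often the parts overlap in projection.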
In work by Bleiholder & Liu (62), MD simulations were employed to model ubiquitin at various charge states for ion spectra prediction. The structure relaxation approximation (SRA) method was introduced to examine the similarity of ubiquitin ions to the native protein. SRA operated with input MD simulation frames by removing solvent, adjusting the charge state via charged residues with high exposure, relaxing the structure with a short simulation of the gas-phase protein, calculating average cross sections with the projection superposition approximation, and then determining the IM spectrum based on Gaussian distributions of the averaged cross sections. The method was validated by the agreement of residue interactions between the crystal structure and modeled states, demonstrating that ubiquitin remained native-like during the procedure.
Hall and colleagues (63) examined a modeling method in which coarse-grained models of protein complexes were evaluated with a scoring function based on their agreement with CCS data. Complexes from the Protein Data Bank were used to validate the use of coarse-grained models, and the CCS values of the coarse-grained models were demonstrated to be similar to those calculated using all-atom models. The coarse-grained model relied on spheres to represent individual proteins, while a complex was represented by multiple spheres. For the scoring function, volume and CCS restraints were implemented based on the findings from a benchmark set. This method was then applied to influenza B virus neuraminidase, for which models were scored based on volume and CCS restraints and then clustered by similarity to other models. The most native-like model was identified within the largest cluster. The method was further applied to tryptophan synthase and nitrobenzene dioxygenase complexes. The nitrobenzene dioxygenase case study successfully identified high-quality models, while the tryptophan synthase case revealed the importance of symmetry data, which were obtained from other experiments. This work confirmed that IM-MS data can play a valuable role in protein complex structure investigation.
Eschweiler and coworkers (64) used IM-MS data and computational modeling to elucidate a structural model of the urease activation complex. CCS values were determined for the subcomplexes of interest and used to guide coarse-grained model generation with IMP, representing subunits within the complex as individual spheres. A Monte Carlo algorithm was applied to sample conformational space with the aid of restraints from both CCS data and previous experimental data that established connectivity between particular subunits. IMPACT was applied to determine CCS values for complex models, followed by clustering and comparison to existing complex structures. This study effectively modeled a very large complex using numerous restraints from experimental and calculated CCS, XL-MS, and small-angle X-ray scattering data. A similar methodology was applied in recent work by Wang and colleagues (65). In order to model apolipoprotein E oligomers relevant to Alzheimer's disease, IM-MS data were used to identify coarse-grained models using IMP. Additionally, collision-induced unfolding was used to examine the monomer and tetramer of apolipoprotein E. This work deviated from the use of a single sphere for each individual subunit within the complex. Instead, the monomer was modeled with two domains, or two spheres, within the coarse-grained model, which agreed with the CCS data. A Monte Carlo algorithm was applied to identify models, which were subsequently clustered by similarity to determine a likely complex structure. Electron-capture dissociation was also used to validate models based on identification of flexible portions of the complex, demonstrating the capability of IM-MS and IMP modeling coupled with additional experimental techniques.
Finally, our group (66) has developed Rosetta functionality to use IM-MS data in protein tertiary structure prediction. An algorithm called Projection Approximation using Rough Circular Shapes (PARCS) was implemented to calculate CCS values from protein structure. PARCS was shown to perform as accurately and efficiently as the popular IMPACT method. A score term reliant upon IM-MS data was also incorporated into the Rosetta framework based on the PARCS predictions. The score term penalized models for differences between observed and predicted CCS. It was first tested on models for a benchmark set of proteins with PARCS-computed CCS values, in which the RMSD of best-scoring models was improved for 82 of the 100 proteins examined (Figure 5). The funnel-like quality of the score versus RMSD distributions for model sets also tended to improve upon scoring with IM-MS data. Additionally, the score term was examined with ab initio and homology models for 23 proteins for which experimental IM-MS data were available, with the RMSD improving or exhibiting no change in all 23 instances. This work further solidified the capability of IM-MS methods to elucidate protein structure.
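To make the idea of a projection-based CCS calculation concrete, the following Python sketch averages the projected area of a set of spheres over random orientations and then converts the difference between predicted and observed CCS into a penalty. It is a generic projection approximation under stated assumptions, not the PARCS algorithm or the Rosetta score term: the probe radius, grid spacing, number of orientations, and the flat-bottom linear penalty are illustrative choices, and all function names are hypothetical.

```python
import math
import random


def _random_unit_vector(rng):
    """Draw a direction uniformly on the unit sphere."""
    while True:
        x, y, z = (rng.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 1e-12:
            return (x / norm, y / norm, z / norm)


def _plane_basis(n):
    """Return two orthonormal vectors spanning the plane perpendicular to n."""
    helper = (1.0, 0.0, 0.0) if abs(n[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = (helper[1] * n[2] - helper[2] * n[1],
         helper[2] * n[0] - helper[0] * n[2],
         helper[0] * n[1] - helper[1] * n[0])
    norm = math.sqrt(sum(c * c for c in u))
    u = tuple(c / norm for c in u)
    v = (n[1] * u[2] - n[2] * u[1],
         n[2] * u[0] - n[0] * u[2],
         n[0] * u[1] - n[1] * u[0])
    return u, v


def _union_area(circles, grid_step):
    """Area covered by (x, y, r) circles, counted on a square grid."""
    xmin = min(x - r for x, y, r in circles)
    xmax = max(x + r for x, y, r in circles)
    ymin = min(y - r for x, y, r in circles)
    ymax = max(y + r for x, y, r in circles)
    nx = math.ceil((xmax - xmin) / grid_step)
    ny = math.ceil((ymax - ymin) / grid_step)
    hits = 0
    for i in range(nx):
        px = xmin + (i + 0.5) * grid_step
        for j in range(ny):
            py = ymin + (j + 0.5) * grid_step
            if any((px - x) ** 2 + (py - y) ** 2 <= r * r
                   for x, y, r in circles):
                hits += 1
    return hits * grid_step ** 2


def projection_approximation_ccs(coords, radii, probe_radius=1.0,
                                 n_orientations=50, grid_step=0.5, seed=0):
    """Orientation-averaged projected area of spheres inflated by a probe."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_orientations):
        u, v = _plane_basis(_random_unit_vector(rng))
        circles = [(sum(a * b for a, b in zip(p, u)),
                    sum(a * b for a, b in zip(p, v)),
                    r + probe_radius)
                   for p, r in zip(coords, radii)]
        total += _union_area(circles, grid_step)
    return total / n_orientations


def ccs_penalty(predicted_ccs, observed_ccs, tolerance=0.0, weight=1.0):
    """Assumed penalty form: zero within the tolerance, linear beyond it."""
    deviation = abs(predicted_ccs - observed_ccs)
    return weight * max(0.0, deviation - tolerance)
```

A penalty of this kind would be added to the energy of each candidate model, so that models whose predicted CCS disagrees with the measurement are ranked lower.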

SURFACE-INDUCED DISSOCIATION
Recently emerging as a structural native MS technique, surface-induced dissociation (SID) relies on the breakage of interfaces within a protein complex when the complex strikes a surface. During SID-MS, protein complexes undergo soft ionization and are then collided with a surface, which can provide insight into the stoichiometry and interfaces within a protein complex. The dissociation observed in SID experiments can be correlated with identified assembly pathways (67–69).
We have demonstrated that predicting SID appearance energy (AE) from protein structure is possible (70). AE, specified as 10% fragmentation, was predicted from quantities such as the number of residues at the interface; number of unsatisfied hydrogen bonds; and rigidity factor, which was determined by intermolecular interactions such as hydrogen bonds, salt bridges, and disulfide bonds. A weighted sum of these terms was used in a prediction equation such that a strong correlation was observed between predicted and experimental AE. The development of this model suggested that the methodology could be applied to structure prediction applications.
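The form of such a prediction equation can be sketched in a few lines of Python. The published weights are not reproduced here; the weights, the optional intercept, and all names below are placeholders that a user would fit to experimental AE values. The Pearson correlation helper shows one way the agreement between predicted and experimental AE might be quantified.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceFeatures:
    """Interface descriptors named in the text (values supplied by the user)."""
    n_interface_residues: int
    n_unsatisfied_hbonds: int
    rigidity_factor: float


def predict_appearance_energy(features, weights, intercept=0.0):
    """Weighted sum of interface descriptors; weights are hypothetical inputs."""
    w_res, w_hbond, w_rigid = weights
    return (intercept
            + w_res * features.n_interface_residues
            + w_hbond * features.n_unsatisfied_hbonds
            + w_rigid * features.rigidity_factor)


def pearson_r(xs, ys):
    """Pearson correlation between predicted and experimental values."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Usage with placeholder weights (not the published values).
example = InterfaceFeatures(n_interface_residues=30,
                            n_unsatisfied_hbonds=4,
                            rigidity_factor=0.5)
predicted = predict_appearance_energy(example, weights=(2.0, -1.0, 10.0))
```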
Our group (71) then developed a computational algorithm to use SID-MS data for protein complex structure prediction. The number of residues at the interface, rigidity factor, and buried hydrophobic surface area were combined to better predict AE. The new model that combined these three terms was then used in the creation of a Rosetta scoring term that combined SID data with RosettaDock scoring. It was first tested on 57 protein systems, using AE values calculated from crystal structures in place of experimental AE, with 54 of the 57 cases demonstrating improvement or no change in best-scoring model RMSD. When using experimentally determined AE from SID-MS, six of the nine complexes examined demonstrated near-native structures within the top three scoring models (Figure 6). Additionally, a confidence metric was established in this work, using the average score per residue for the best 1,000 models to independently verify the accuracy of scoring. The confidence metric allowed identification of successful predictions, as complexes with more-negative scores per residue tended to have improved RMSD values compared to complexes with a higher score per residue. Overall, this work demonstrated that SID data with RosettaDock can be used to improve protein complex structure prediction effectively. In recent follow-up work, using SID-MS data with cryo-EM data was shown to result in improved flexible docking results for protein complexes and required less prior knowledge of structures (72).
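The confidence metric lends itself to a short illustration. The Python sketch below assumes that lower (more negative) scores are better, as is conventional for Rosetta energies, and that the average is normalized by the total number of residues in the complex; the function name and the exact normalization are assumptions rather than the published definition.

```python
def confidence_metric(model_scores, n_residues, top_n=1000):
    """Average score per residue over the best-scoring (lowest) models."""
    if n_residues <= 0:
        raise ValueError("n_residues must be positive")
    if not model_scores:
        raise ValueError("model_scores must not be empty")
    best = sorted(model_scores)[:top_n]   # lowest scores first
    return sum(best) / len(best) / n_residues


# Usage: four models of a 100-residue complex, keeping the best two.
metric = confidence_metric([-100.0, -200.0, -300.0, 0.0],
                           n_residues=100, top_n=2)
```

A more negative value would flag a prediction as more likely to be reliable, in line with the trend reported above.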

FUTURE DIRECTIONS OF THE FIELD
While advances in MS and computational technologies have propelled the field forward in recent years, obstacles still exist and will require provocative solutions to overcome.
As MS data are too sparse to determine protein structure unambiguously, computational techniques will remain relevant to the interpretation of MS data for structure elucidation. One way in which the community can support computational method development is through the establishment of central data repositories. Such databases currently exist for other experimental techniques (73–75). Kahraman and coworkers (25) have started to pave the way for this effort by establishing a cross-linking database. Hopefully, other MS databases will follow in the near future. Publicly available data sets can lead to the creation and development of freely accessible, competitive algorithms that can harness sparse experimental data, such as the MS data outlined here, to improve structure prediction with machine learning and artificial intelligence methodologies.
Because MS data are sparse, even advanced computational methodologies will inevitably predict false positive structures. Going forward, integrative structural modeling that combines multiple sets of experimental data will be instrumental in reducing the rate at which false positives occur. Further exploration of protein complexes remains a key endeavor for the future of protein structure modeling.
Recently, the performance of AlphaFold at CASP14 has raised questions about the role of experimental techniques in protein structure determination (77). AlphaFold relies on artificial intelligence to accomplish protein structure prediction from amino acid sequences (78). Its impressive global distance test median score of 92.4 (79) redefined the field's expectations of how precise modeling algorithms could be. This inevitably caused speculation about the ability to determine protein structure purely computationally. We believe that this is unlikely to happen in the near future. We anticipate that computational researchers will continue to establish techniques that mimic AlphaFold. Callaway indicated in his Nature synopsis of CASP14 (77) that purely computational structure determination is unlikely and that, instead, sparse experimental data will soon be sufficient for unambiguous structure elucidation in combination with the new wave of artificial intelligence technologies. As such, we anticipate that MS data will play a continued, if not growing, role alongside tools like AlphaFold.
An additional future avenue of protein structure prediction from MS data is citizen science. FoldIt is one such tool that enlists video game enthusiasts for structure prediction (80). With its colorful graphical user interface and endearing symbols for relevant scientific concepts such as steric hindrance and solvent exposure of hydrophobic regions, FoldIt uses the Rosetta software suite to reward user-sampled conformations of proteins. Users can advance through multiple levels of the game while supporting scientific efforts by sampling protein conformations that may be inaccessible to automated protein sampling algorithms. Overall, games such as FoldIt inspire a new generation of scientists while tackling the sampling problem and examining novel protein conformations.
In summary, the future of MS techniques with complementary computational methods appears promising. The combination of MS and computational protocols will, in our opinion, lead to the elucidation of many challenging protein structures.

CONCLUSION
The field of structural mass spectrometry has significantly benefited from the development of hybrid computational techniques for MS-guided protein structure prediction. Algorithms that use XL-MS, HDX-MS, HRPF-MS, limited proteolysis, IM-MS, and SID-MS data for tertiary and quaternary structure prediction, described here, successfully allow structure elucidation from sparse MS data. The field will continue to thrive with efforts to maintain accessible data sets and software packages, to combine multiple techniques for the purpose of protein complex elucidation, and to pursue out-of-the-box methods such as FoldIt that recruit the general public into structure prediction efforts. While it is encouraging to see how far the field has progressed recently, it remains even more exciting to envision where the field will go with continued advances in techniques and technology.

Figure 1 Mass spectrometry-based methods (blue box, top) and computational modeling (green box, bottom) explored in this review. (a) Chemical cross-linking involves the modification of residues, commonly lysine, to provide information on spatial proximity. (b) Hydrogen–deuterium exchange examines the exchange rate of amide hydrogens with deuterium solvent to give insight into solvent exposure and residue flexibility. (c) Covalent labeling is reliant on the irreversible covalent modification of residues, illuminating solvent exposure and topology. (d) Limited proteolysis uses a protease enzyme to cleave proteins into fragments based on solvent exposure. (e) Ion mobility is used to investigate the shape and size of proteins based on the collision cross-sectional area. (f) Appearance energies (AEs) can be deduced from surface-induced dissociation, in which a protein complex collides with a surface (vertical black bar) and breaks apart, providing insight into the stoichiometry and connectivity of the complex. Data from these techniques are then incorporated into computational modeling techniques such as protein-protein docking to examine complexes, structure prediction via ab initio or homology modeling, and molecular dynamics based on experimental restraints.

Figure 2 Improvement of model prediction and scoring with chemical cross-linking (XL) mass spectrometry data. (a) Best-scoring models of IgBP1 (green) complexed with PP2AA (purple), with the opaque cartoon depicting the best-scoring model from the largest cluster and the more transparent cartoons depicting the best-scoring models from the second to the fourth largest clusters. Cross-links are depicted as green, red, and blue spheres, with black spheres representing mutations. (b) Rosetta score versus root-mean-square deviation (RMSD) to the largest cluster plot for models with a minimum of six interprotein XLs (gray), a minimum of six interprotein XLs with binding interfaces larger than 900 Å² (blue), and representative models from the four biggest clusters (red). Figure adapted from Reference 25 (CC BY 3.0).

Figure 4 Comparison of prediction equations using SASA and hydroxyl radical protein footprinting (HRPF) data. (a) Prediction equation between relative SASA (<SASA>_N/<SASA>_GXG, SASA normalized by SASA values standardized for each residue, X, in a glycine tripeptide) and normalized protection factor (slope_N/relative intrinsic reactivity) using myoglobin data for residue types tryptophan, tyrosine, phenylalanine, histidine, leucine, and isoleucine. (b) Lysozyme <SASA> calculated using the prediction equation derived from panel a versus SASA observed in molecular dynamics (MD) simulations. (c) Prediction equation between relative SASA of the native (<SASA>_N/<SASA>_GXG) and rate constant ratio (slope_N/slope_D) for all non-sulfur-containing myoglobin residues, where N denotes native and D denotes denatured. (d) Lysozyme SASA calculated using the prediction equation shown in panel c versus SASA observed in MD simulations. Figure adapted from Reference 45 (CC BY 4.0).

Figure 5 Incorporation of ion mobility-mass spectrometry (IM-MS) data into Rosetta improved the root-mean-square deviation (RMSD) of the best-scoring models. (a) Depiction of a protein and its projection on a plane upon space-filling measures by the Projection Approximation using Rough Circular Shapes (PARCS) application. (b) Structural alignments of the crystal structure (gray) with the best-scoring model when scoring without (burgundy) and with (yellow) IM-MS data. (c) Comparison of best-scoring model RMSDs when scoring with and without IM-MS data. The black line indicates no change in RMSD with and without experimental data. Helium buffer gas conditions are depicted by teal dots, while nitrogen buffer gas conditions are shown by gold dots. Figure adapted with permission from SM Bargeen Alam Turzo.

Figure 6 Use of surface-induced dissociation mass spectrometry (SID-MS) data improved the root-mean-square deviation (RMSD) of best-scoring models. Alignment of the crystal structures (green) with one of the top three best-scoring models when (left) scoring without SID-MS data (purple) and when (right) including SID-MS data in scoring (pink) for three protein complexes (PDB IDs 1GNH, 1GZX, and 1SAC). Figure adapted with permission from Justin Seffernick.

Protein complexes play roles in many biological processes, and structural changes to complexes can lead to human disease (76). Elucidation of protein complex structure can provide insight into the mechanisms of such complexes. Structural information can complement efforts to target disease-associated protein complexes with drugs. The study of protein complexes benefits greatly from integrative experimental techniques to combat modeling ambiguities. This has been nicely demonstrated in work by Zhang and colleagues (42) that applied both HDX and XL data to quaternary structure investigation. The field should continue to emphasize the combination of multiple techniques to elucidate structural features of protein complexes.