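School of Artificial Intelligence and Big Data, Hefei University, Hefei, Anhui 230031, China
XU Min, male, born in July 1991 in Wuhu, Anhui Province. He is a master's student at the School of Artificial Intelligence and Big Data, Hefei University. His main research interests are artificial intelligence and bioinformatics. E-mail: xumin@stu.hfuu.edu.cn
HU Chun-ling, female, born in January 1970 in Tongling, Anhui Province. She is a professor and master's supervisor at the School of Artificial Intelligence and Big Data, Hefei University. Her main research interests are artificial intelligence and bioinformatics. E-mail: huchunling@hfuu.edu.cn
HU Ting, female, born in June 2000 in Anqing, Anhui Province. She is a master's student at the School of Artificial Intelligence and Big Data, Hefei University. Her main research interest is drug-target affinity prediction. E-mail: 24085403019@stu.hfuu.edu.cn
ZHANG Fang-fang, female, born in January 2002 in Heze, Shandong Province. She is a master's student at the School of Artificial Intelligence and Big Data, Hefei University. Her main research interests are artificial intelligence and protein-RNA interactions. E-mail: 24085404032@stu.hfuu.edu.cn
DAI Xiang-long, male, born in November 1999 in Bozhou, Anhui Province. He is a master's student at the School of Artificial Intelligence and Big Data, Hefei University. His main research interests are artificial intelligence and protein-RNA interactions. E-mail: daixianglong@stu.hfuu.edu.cn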
Received: 2025-09-27
Accepted: 2025-11-10
Published in print: 2025-11-25
XU Min, HU Chun-ling, HU Ting, et al. Sequence-Based and Cross-Modal Alignment Model for Protein Function Prediction[J]. Acta Electronica Sinica, 2025, 53(11): 4022-4034. DOI:10.12263/DZXB.20250851
Protein function prediction is one of the core tasks in bioinformatics. Although existing methods can fuse multimodal protein features, they still suffer from insufficient prediction accuracy, and their reliance on limited experimental data restricts their scope of application. To address these problems, this study proposes SCMAGO (Sequence-based and Cross-Modal Alignment Model for Protein Function Prediction), which takes the protein sequence as its sole input. Tertiary structure and family and domain information are predicted from the sequence with the mainstream tools AlphaFold2 and InterProScan, respectively. The protein large language model ESMC (Evolutionary Scale Model Cambrian) produces the sequence embeddings, a geometric vector perceptron graph neural network (GVP-GNN) extracts tertiary-structure features, and a broadcast embedding method yields the family and domain representations. SCMAGO uses a two-step cross-modal alignment: first, bidirectional cross-attention aligns sequence and structure features at the residue level; second, graph attention pooling further fuses the family and domain features. Experimental results show that SCMAGO outperforms existing baseline methods on the Swiss-Prot dataset, with F_max values of 0.487, 0.739, and 0.736 and AUPR values of 0.507, 0.760, and 0.800 for biological process (BP), molecular function (MF), and cellular component (CC), respectively. SCMAGO also maintains stable prediction performance on proteins with sequence identity below 40%.
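The F_max values reported above can be read against the following minimal sketch. It assumes the protein-centric definition commonly used in the CAFA evaluations: at each threshold, precision is averaged over proteins with at least one predicted term, recall is averaged over all annotated proteins, and F_max is the largest resulting F-score. The function name `f_max`, the threshold grid of 0.01 to 1.00, and the toy GO term labels are illustrative assumptions; the evaluation protocol actually used for SCMAGO is not described in this section.

```python
from typing import Dict, List, Set


def f_max(scores: List[Dict[str, float]],
          truths: List[Set[str]],
          steps: int = 100) -> float:
    """Protein-centric F_max over thresholds 1/steps, 2/steps, ..., 1.

    scores[i] maps GO term -> predicted score for protein i.
    truths[i] is the set of true GO terms for protein i.
    The threshold grid is an assumption, not taken from the paper.
    """
    n_annotated = sum(1 for true in truths if true)
    if n_annotated == 0:
        return 0.0

    best = 0.0
    for k in range(1, steps + 1):
        t = k / steps
        precisions = []   # one entry per protein with >= 1 predicted term
        recall_sum = 0.0  # summed over all annotated proteins
        for pred, true in zip(scores, truths):
            called = {term for term, s in pred.items() if s >= t}
            hits = len(called & true)
            if called:
                precisions.append(hits / len(called))
            if true:
                recall_sum += hits / len(true)
        if not precisions:
            continue  # no protein has a prediction at this threshold
        pr = sum(precisions) / len(precisions)
        rc = recall_sum / n_annotated
        if pr + rc > 0:
            best = max(best, 2 * pr * rc / (pr + rc))
    return best


# Toy example with made-up term labels (not real data).
toy_scores = [{"GO:A": 0.9, "GO:B": 0.4}, {"GO:A": 0.2, "GO:C": 0.8}]
toy_truths = [{"GO:A"}, {"GO:B", "GO:C"}]
print(round(f_max(toy_scores, toy_truths), 3))  # 0.857
```

In the toy example the best threshold lies between 0.41 and 0.80, where average precision is 1.0 and average recall is 0.75, giving F = 6/7. Averaging precision only over proteins that receive a prediction is what makes the measure protein-centric rather than a pooled micro-average.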