
A Deep Learning Approach for Phenotype Prediction: G2PDeep-v2
Advances in molecular profiling have made it possible to measure biological processes with remarkable precision. However, genome-wide data are often highly complex and difficult to interpret. Recent developments in artificial intelligence, particularly machine learning, have enabled the development of tools that can extract meaningful patterns from large biological datasets, supporting phenotype prediction and biomarker discovery.
Analysing multiple layers of biological information simultaneously, using different types of genomic data, provides a more comprehensive understanding of living systems. This enables more accurate predictions of genetic profiles that may be associated with higher risk of disease or greater resistance.
Research published in the Open Access journal Biomolecules presents G2PDeep-v2, a web-based deep learning framework for analysing multi-omics data across a wide range of organisms. The authors used the platform to demonstrate improved phenotype prediction and biomarker discovery in cancer survival studies and plant disease resistance.
Performance, compatibility, usability, and interpretability are the core principles of G2PDeep-v2. The platform combines a deep learning model with traditional machine learning models on a user-friendly web interface, enabling predictions from up to three types of multi-omics data simultaneously, such as gene or protein expression data.
This allows researchers to uncover complex relationships between genes, proteins, and other molecular features, making it easier to predict phenotypes and identify biomarkers that are responsible for the diversity of observable traits.
Complexity of biological systems
Biological systems are highly complex and contain layers of interconnected molecular processes. To better understand this complex network, researchers analyse biological data referred to as “omics” data. The six types of omics data used in this study are:
- Gene expression
- MicroRNA (miRNA) expression
- Protein expression
- DNA methylation
- Single nucleotide polymorphisms
- Copy number variations
Each data type provides a piece of the biological puzzle but only tells part of the story. By combining multiple layers, researchers can build a more complete picture of how biological systems function and how genetic changes contribute to phenotypes in virtually any organism.
G2PDeep-v2 helps make sense of multi-omics data
Many existing approaches analyse only a single type of omics data to address bioinformatics problems, limiting their ability to capture the full biological complexity of a system. More recent methods attempt to integrate information from multiple omics layers to improve predictive performance.
However, these models are often designed for specific studies, making them difficult to adapt and limiting their application. In addition, many models generate predictions without providing a clear biological reasoning, requiring further analysis to connect results to meaningful insights.
G2PDeep-v2 offers a flexible and user-friendly platform for multi-omics analysis. Its web-based interface removes the need for coding or manual model construction, allowing users to upload data and perform analyses with minimal technical expertise. Further integration of tools that link newly identified genes to databases such as KEGG provides valuable biological context for significant findings. This design enables researchers across disciplines that rely on genomic data to analyse highly complex datasets more effectively.
Overall, G2PDeep-v2 represents a significant step towards simplifying multi-omics analysis into an end-to-end platform that supports both accurate prediction and biological interpretation.
Deep learning approach for multi-omics integration
Deep learning is a branch of machine learning, itself a form of artificial intelligence, that allows computers to learn patterns from data. Unlike traditional models, deep learning uses artificial neural networks made up of multiple layers. Each layer learns increasingly complex features, allowing the model to capture non-linear and hidden relationships that simpler models may miss.
A key feature of G2PDeep-v2 is its ability to process up to three different types of omics data in parallel before combining the learned features. This enables the integration of multiple biological perspectives and improved modelling of the molecular mechanisms that underlie complex traits and disease.
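To make the idea of parallel branches followed by feature fusion concrete, the following is a minimal sketch in plain Python. It is not the G2PDeep-v2 architecture: the dense layers, the hidden size, the random untrained weights, and the names `build_model` and `predict` are all assumptions chosen for illustration. It only shows the general pattern of giving each omics type its own branch and concatenating the branch outputs before a final prediction.

```python
import math
import random

def dense(inputs, weights, biases):
    """One fully connected layer followed by a ReLU activation."""
    return [
        max(0.0, sum(x * w for x, w in zip(inputs, row)) + b)
        for row, b in zip(weights, biases)
    ]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def build_model(input_sizes, hidden=4, seed=0):
    """One branch per omics type, plus one output unit over the fused features."""
    if not 1 <= len(input_sizes) <= 3:
        raise ValueError("this sketch accepts one to three omics types")
    rng = random.Random(seed)
    branches = [init_layer(n, hidden, rng) for n in input_sizes]
    head_w, head_b = init_layer(hidden * len(input_sizes), 1, rng)
    return {"branches": branches, "head": (head_w[0], head_b[0])}

def predict(model, omics_inputs):
    """Run each omics vector through its own branch, concatenate, then score."""
    if len(omics_inputs) != len(model["branches"]):
        raise ValueError("expected one input vector per branch")
    fused = []
    for (weights, biases), x in zip(model["branches"], omics_inputs):
        fused.extend(dense(x, weights, biases))  # features learned per omics type
    head_w, head_b = model["head"]
    logit = sum(f * w for f, w in zip(fused, head_w)) + head_b
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid maps the logit into (0, 1)

# Hypothetical sample with 5 gene-expression, 3 miRNA, and 4 SNP features.
model = build_model([5, 3, 4])
sample = [[0.2, 1.1, 0.0, 0.7, 0.3], [0.5, 0.1, 0.9], [0, 1, 2, 1]]
score = predict(model, sample)
```

Because the weights are random, `score` carries no biological meaning here. In a real model the branches and the output layer would be trained jointly on labelled samples.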
Case studies demonstrating G2PDeep-v2
G2PDeep-v2 was trained on comprehensive studies that included multi-omics data. To demonstrate its real-world applicability, its ability to extract meaningful biological information from complex datasets was evaluated using two case studies:
- Long-term survival prediction in breast cancer.
- Prediction of soybean cyst nematode resistance in soybean lines.
These examples show how the framework can be applied across different organisms and biological problems.
Case study 1: long-term survival prediction and marker discovery for cancer
In the first case study, G2PDeep-v2 was used to predict long-term survival in patients with breast invasive carcinoma using integrated multi-omics data. Patients were classified as long-term survivors if they lived for more than three years after diagnosis.
The strongest performance was achieved when gene expression, miRNA expression, and single nucleotide polymorphism data were combined, with the deep learning model reaching a mean area under the curve score of 0.907, demonstrating its strong predictive power.
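For readers unfamiliar with the metric, the sketch below shows one common way to label samples with the three-year rule and to compute an area under the ROC curve, averaged over several evaluation splits. The survival times, scores, and function names are made up for illustration, and averaging across splits is an assumption about how a mean score might be obtained, not a description of the authors' protocol.

```python
def label_long_term(survival_years, threshold_years=3.0):
    """Return 1 if the patient lived longer than the threshold, else 0."""
    return 1 if survival_years > threshold_years else 0

def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def mean_auc(splits):
    """Average the AUC over a list of (labels, scores) evaluation splits."""
    return sum(auc(labels, scores) for labels, scores in splits) / len(splits)

# Made-up survival times in years; exactly 3.0 does not count as "more than three".
years = [5.2, 1.4, 3.0, 7.9, 2.1, 4.4]
labels = [label_long_term(y) for y in years]

# Made-up model scores for two hypothetical evaluation splits.
split_a = (labels, [0.9, 0.2, 0.4, 0.8, 0.6, 0.5])
split_b = (labels, [0.7, 0.1, 0.3, 0.9, 0.2, 0.6])
average = mean_auc([split_a, split_b])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, which is why a value above 0.9 is read as strong predictive performance.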
Beyond making accurate predictions, G2PDeep-v2 also extracted genetic information that varied across the two cohorts. Among the top 100 genes highlighted by the model, six were already known cancer-causing genes. This supports the biological relevance of the findings and demonstrates the platform's potential to aid in the identification of biomarkers for future diagnostic research.
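The comparison between model-highlighted genes and known cancer genes can be pictured as a simple ranking and set-overlap step. The sketch below assumes each gene has a single importance score. The gene names, scores, and reference list are placeholders, not values from the study, and the function names are hypothetical.

```python
def top_genes(importance, k):
    """Return the k genes with the highest importance scores, best first."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [gene for gene, _ in ranked[:k]]

def overlap_with_known(ranked_genes, known_genes):
    """Keep the ranked genes that also appear in a reference gene list."""
    known = set(known_genes)
    return [gene for gene in ranked_genes if gene in known]

# Placeholder importance scores; a real analysis would use model-derived values.
importance = {
    "GENE_A": 0.91,
    "GENE_B": 0.15,
    "GENE_C": 0.64,
    "GENE_D": 0.33,
    "GENE_E": 0.78,
}
known_cancer_genes = ["GENE_C", "GENE_B", "GENE_Z"]  # placeholder reference list

top3 = top_genes(importance, 3)
hits = overlap_with_known(top3, known_cancer_genes)
```

In this toy example only `GENE_C` is both highly ranked and on the reference list. The same logic applied to a top-100 list would count how many highlighted genes are already documented.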
Further analysis showed upregulated genes that were involved in pathways linked to breast cancer development, such as cell growth, signalling, and tumour progression. This demonstrated that the model not only predicts survival effectively but also highlights meaningful biological mechanisms.
Case study 2: disease resistance prediction for soybean cyst nematode
The second case study focused on predicting soybean cyst nematode resistance using genetic variation data from over 1,000 soybean lines.
The deep learning model successfully distinguished resistant plants from susceptible ones. Importantly, the model highlighted a previously under-studied gene, Glyma.13g030200, as strongly associated with resistance. The same gene was suggested to provide nematode resistance in rice. Additional analyses showed that genetic differences near this gene may influence how it is regulated, suggesting a possible biological explanation for its role in resistance.
These findings indicate that G2PDeep-v2 can uncover novel candidate genes that may be valuable for future breeding and experimental validation.
Building a general platform for genomic analysis
G2PDeep-v2 provides the first web server that allows researchers to build, train, and use models on their own multi-omics data. The platform has wide-reaching applications across biomedical science and agriculture. Diverse datasets can be brought together into a single predictive model to help uncover the biological mechanisms that drive disease susceptibility and treatment response.
Dr Dong Xu, a co-author of the study, summarises the impacts of G2PDeep-v2 as:
“We focus on developing bioinformatics platforms, not just algorithms, that can be widely adopted by the research community. G2PDeep demonstrates how advanced AI methods can be translated into practical tools with broad scientific impact.”
Current limitations of G2PDeep-v2
The platform currently supports the integration of up to three types of omics data at a time. Samples containing four or more omics layers are in limited supply, but where such richer datasets are available, the three-type cap restricts the tool’s ability to model more complex biological systems.
The current implementation is also limited to binary classification, where there are only two possible outcomes, such as long-term survival versus non-long-term survival. More complex tasks, including multi-class prediction, are not yet supported, which limits its applicability to a broader range of biological and clinical questions.
Finally, while G2PDeep-v2 has been evaluated using both human and plant datasets, there is no example of cross-species validation. This limits the direct comparison of predictive patterns and biomarkers across different organisms and reduces the ability to assess how well learned biological representations generalise between species.
These limitations identify important directions for future development and highlight opportunities to further expand the scope and flexibility of the platform.
MDPI’s commitment to sharing research results openly
The development of a tool such as G2PDeep-v2 aligns closely with MDPI’s mission to make scientific research accessible to all. This free, web-based framework enables researchers to use advanced deep learning methods without requiring extensive technical expertise. The model has broad potential applications across fields that depend on complex omics data, including precision medicine, drug discovery, agriculture, and genomic epidemiology.
More studies on deep learning and phenotype prediction can be found across the Open Access journals Machine Learning and Knowledge Extraction and Biomolecules. Alternatively, the full MDPI journal list is available on the MDPI website.










