Recent advances in genetic feature marker discovery through differential expression and biostatistical analysis

Ankita Saha; Shibakali Gupta; Chyan Paul; Saurav Mallik; Korhan Cengiz

doi:10.36922/AIH025180036


REVIEW ARTICLE

Ankita Saha¹,², Shibakali Gupta³, Chyan Paul⁴, Saurav Mallik⁵,⁶*, Korhan Cengiz⁷*

¹ Department of Computer Science, Swami Vivekananda University, Barrackpore, West Bengal, India

² Department of Science and Management, ABS Academy of Management and Health Science, Durgapur, West Bengal, India

³ Department of Computer Science and Engineering, University Institute of Technology, Burdwan University, West Bengal, India

⁴ Department of Computer Science and Engineering, Swami Vivekananda University, Barrackpore, West Bengal, India

⁵ Department of Biostatistics, University of Miami, Florida, United States of America

⁶ College of Pharmacy, University of Arizona, Tucson, Arizona, United States of America

⁷ Department of Electrical Engineering, Prince Mohammad Bin Fahd University, Al Khobar, Saudi Arabia

AIH, 025180036 https://doi.org/10.36922/AIH025180036

Received: 28 April 2025 | Revised: 17 July 2025 | Accepted: 1 August 2025 | Published online: 9 September 2025

© 2025 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License ( https://creativecommons.org/licenses/by/4.0/ )

Abstract

Genetic feature discovery is essential for understanding complex diseases and traits. This review compares differential expression analysis methods and statistical hypothesis tests used in genetic feature marker discovery, including Student’s t-test, the Chi-square test, analysis of variance, Empirical Bayes methods, and Significance Analysis of Microarrays. Our analysis highlights the strengths and weaknesses of these approaches in terms of methodology, applications, performance, and accuracy. While statistical tests offer straightforward interpretation, machine learning techniques are better suited to high-dimensional data and complex biological interactions. We conducted two mini-experiments: (i) identification of differentially expressed, upregulated, and downregulated genes with Student’s t-test and Welch’s t-test under different normalization methods and p-value correction strategies, using the GSE31699 dataset from the NCBI Gene Expression Omnibus; and (ii) gene set enrichment analysis, covering Kyoto Encyclopedia of Genes and Genomes pathways and the Gene Ontology categories biological process, cellular component, and molecular function, using the GSE30760 dataset with the DAVID 2021 tool. We also discuss the potential of hybrid approaches that combine statistical tests with machine learning and optimization techniques for enhanced feature discovery. Future work will focus on multi-omics data integration, the development of explainable AI methods, and scalable algorithms. This review aims to serve as a guide for researchers involved in genetic marker identification, covering both statistical and computational perspectives on differential expression and gene set enrichment studies.
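
For readers who want a concrete picture of the first mini-experiment, the following is a minimal Python sketch of per-gene testing followed by p-value correction. It is not the authors' pipeline: the data are synthetic rather than GSE31699, the expression matrix is assumed to be already normalized, Benjamini-Hochberg is used as one example of a correction strategy, and the function names `benjamini_hochberg` and `call_degs` as well as the 0.05 threshold are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def call_degs(case, control, equal_var=False, alpha=0.05):
    """Per-gene two-sample t-test; rows are genes, columns are samples.

    equal_var=False gives Welch's t-test, equal_var=True gives Student's.
    Direction is taken from the sign of the mean difference (case - control).
    """
    _, p_raw = stats.ttest_ind(case, control, axis=1, equal_var=equal_var)
    p_adj = benjamini_hochberg(p_raw)
    significant = p_adj < alpha
    diff = case.mean(axis=1) - control.mean(axis=1)
    up = significant & (diff > 0)
    down = significant & (diff < 0)
    return p_adj, up, down


# Synthetic, already-normalized expression values (assumption, not real data).
rng = np.random.default_rng(0)
n_genes, n_case, n_ctrl = 200, 10, 10
control = rng.normal(0.0, 1.0, size=(n_genes, n_ctrl))
case = rng.normal(0.0, 1.0, size=(n_genes, n_case))
case[:10] += 5.0    # genes 0-9 simulated as upregulated
case[10:20] -= 5.0  # genes 10-19 simulated as downregulated

p_adj, up, down = call_degs(case, control)            # Welch's t-test
_, up_s, down_s = call_degs(case, control, equal_var=True)  # Student's t-test
print("Welch:   up =", int(up.sum()), "down =", int(down.sum()))
print("Student: up =", int(up_s.sum()), "down =", int(down_s.sum()))
```

Switching `equal_var` is the only change needed to compare the two tests on the same matrix; a normalization step or a different correction method could be slotted in before or after the test without altering the rest of the sketch.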

Keywords

Genetic feature discovery

Statistical tests

KEGG pathway analysis

Gene set enrichment analysis

Funding

None.

Conflict of interest

The authors declare that they have no competing interests.



Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Print ISSN: 3041-0894, Published by AccScience Publishing