AccScience Publishing / AIH / Online First / DOI: 10.36922/AIH025180036
REVIEW ARTICLE

Recent advances in genetic feature marker discovery through differential expression and biostatistical analysis

Ankita Saha1,2 Shibakali Gupta3 Chyan Paul4 Saurav Mallik5,6* Korhan Cengiz7*
1 Department of Computer Science, Swami Vivekananda University, Barrackpore, West Bengal, India
2 Department of Science and Management, ABS Academy of Management and Health Science, Durgapur, West Bengal, India
3 Department of Computer Science and Engineering, University Institute of Technology, Burdwan University, West Bengal, India
4 Department of Computer Science and Engineering, Swami Vivekananda University, Barrackpore, West Bengal, India
5 Department of Biostatistics, University of Miami, Florida, United States of America
6 College of Pharmacy, University of Arizona, Tucson, Arizona, United States of America
7 Department of Electrical Engineering, Prince Mohammad Bin Fahd University, Al Khobar, Saudi Arabia
Received: 28 April 2025 | Revised: 17 July 2025 | Accepted: 1 August 2025 | Published online: 9 September 2025
© 2025 by the Author(s). This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution 4.0 International License ( https://creativecommons.org/licenses/by/4.0/ )
Abstract

Genetic feature discovery is essential for understanding complex diseases and traits. This review compares differential expression analysis methods and statistical hypothesis tests used in genetic feature marker discovery, including Student’s t-test, the Chi-square test, analysis of variance, empirical Bayes methods, and Significance Analysis of Microarrays. Our analysis highlights the strengths and weaknesses of these approaches in terms of methodology, applications, performance, and accuracy. While statistical tests offer straightforward interpretation, machine learning techniques are better suited to high-dimensional data and complex biological interactions. We conducted two mini-experiments: (i) identification of differentially expressed, upregulated, and downregulated genes using Student’s t-test and Welch’s t-test under different normalization methods and p-value correction strategies, using the GSE31699 dataset from the NCBI Gene Expression Omnibus; and (ii) gene set enrichment analysis, covering Kyoto Encyclopedia of Genes and Genomes pathways and Gene Ontology terms (biological process, cellular component, and molecular function), using the GSE30760 dataset with the DAVID 2021 tool. Furthermore, we discuss the potential of hybrid approaches that combine statistical tests with machine learning and optimization techniques for enhanced feature discovery. Future work will focus on multi-omics data integration, the development of explainable AI methods, and scalable algorithms. This review aims to serve as a guide for researchers involved in genetic marker identification, covering both statistical and computational perspectives on differential expression and gene set enrichment studies.
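As an illustration of the first mini-experiment's general workflow, the following is a minimal Python sketch, not the authors' pipeline. It runs Student's and Welch's t-tests gene by gene, applies a Benjamini-Hochberg correction, and splits the significant genes into upregulated and downregulated sets. The data are synthetic (the GSE31699 dataset is not loaded), and the function names, group sizes, effect sizes, and the 0.05 threshold are all assumptions made for the example.

```python
import numpy as np
from scipy import stats


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values (step-up FDR control)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted


def differential_expression(case, control, equal_var, alpha=0.05):
    """Per-gene two-sample t-test on genes x samples arrays (log scale).

    equal_var=True gives Student's t-test, equal_var=False gives Welch's.
    """
    _, p = stats.ttest_ind(case, control, axis=1, equal_var=equal_var)
    q = benjamini_hochberg(p)
    diff = case.mean(axis=1) - control.mean(axis=1)
    significant = q < alpha
    return {
        "up": np.flatnonzero(significant & (diff > 0)),
        "down": np.flatnonzero(significant & (diff < 0)),
        "q": q,
    }


# Synthetic stand-in for an expression matrix (hypothetical sizes and shifts).
rng = np.random.default_rng(0)
n_genes, n_per_group = 1000, 10
control = rng.normal(0.0, 1.0, (n_genes, n_per_group))
case = rng.normal(0.0, 1.0, (n_genes, n_per_group))
case[:50] += 3.0      # genes 0-49 simulated as upregulated
case[50:100] -= 3.0   # genes 50-99 simulated as downregulated

student = differential_expression(case, control, equal_var=True)
welch = differential_expression(case, control, equal_var=False)

print("Student: up =", len(student["up"]), "down =", len(student["down"]))
print("Welch:   up =", len(welch["up"]), "down =", len(welch["down"]))
```

With equal group sizes and equal variances, as simulated here, the two tests should give very similar gene lists. Welch's test matters more when group variances or sample sizes differ.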

Keywords
Genetic feature discovery
Statistical tests
KEGG pathway analysis
Gene set enrichment analysis
Funding
None.
Conflict of interest
The authors declare that they have no competing interests.
Artificial Intelligence in Health, Electronic ISSN: 3029-2387 Print ISSN: 3041-0894, Published by AccScience Publishing