Tea plant (Camellia sinensis) is an economically important crop, and improving quality through breeding remains a core challenge. To address the long cycle of traditional breeding (≥2 years) and the inefficiency of phenotypic identification, this study used 90 progeny from natural hybridization of the chlorotic cultivar ‘Anjihuangye’ and integrated genotypic data from 40 326 core single nucleotide polymorphism (SNP) loci with two-year phenotypic observations (chlorotic∶non-chlorotic = 54∶36). Three machine learning models (logistic regression, random forest, and support vector machine) were systematically compared for predictive performance. The random forest model performed best in 10-fold cross-validation, with an accuracy of 78.96%, significantly higher than that of the other models (P<0.05). Feature importance analysis identified two critical genetic markers: Chr8_142477650 (encoding the chloroplast-localized pyruvate dehydrogenase E1 beta subunit) and Chr8_126475215 (involved in RNA processing regulation). However, independent validation on 109 germplasm accessions with diverse yellowing traits showed that the prediction accuracy of the model dropped to 21.10%; deviations in feature weights caused by heterogeneous genetic backgrounds were the main limiting factor. This study established a machine learning prediction framework for the tea yellowing trait that shortens the phenotypic identification cycle from 24 months to real-time genotype analysis and enables trait prediction at an early stage of breeding. Although cross-cultivar generalizability requires improvement, the SNP-phenotype association model provides an extensible paradigm for deciphering genotype-phenotype complexity in tea plants and represents an innovative application of artificial intelligence to predicting complex traits in woody perennials.
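To put the 78.96% cross-validated accuracy in context, it helps to know what a trivial classifier would score on the same class ratio (54 chlorotic to 36 non-chlorotic). The following is a minimal sketch, not the authors' pipeline: it uses only the class counts reported in the abstract, and the function names and the round-robin stratified fold assignment are illustrative assumptions.

```python
import random
from collections import Counter


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping class ratios roughly equal.

    Hypothetical helper: indices of each class are shuffled and dealt
    round-robin into the folds.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def majority_baseline_cv(labels, k=10, seed=0):
    """Mean held-out accuracy of always predicting the training majority class."""
    accuracies = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        hits = sum(labels[i] == majority for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Class counts taken from the abstract: 54 chlorotic, 36 non-chlorotic progeny.
labels = ["chlorotic"] * 54 + ["non-chlorotic"] * 36
baseline = majority_baseline_cv(labels)
print(f"majority-class baseline accuracy: {baseline:.2%}")
```

With this class ratio the baseline is close to 60%, so any trained model should be judged by how far it exceeds that figure under the same 10-fold scheme rather than by its raw accuracy alone.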
XU Xin, LI Yaqi, YANG Yiyang, XU Qi, QIAN Xuefei, MA Chunlei, MEI Jufen. AI in Tea Breeding: A Case Study on Prediction of the Yellowing Trait[J]. Journal of Tea Science, 2025, 45(3): 393-401.
DOI: 10.13305/j.cnki.jts.2025.03.006