Tea plant (Camellia sinensis) is an important economic crop, and quality improvement is a core goal of its breeding. To address the long cycle of conventional tea breeding (≥2 years) and the low efficiency of phenotypic identification, this study used 90 progeny from natural hybridization of the chlorotic (yellow-leaf) cultivar ‘Anjihuangye’. Genotype data for 40 326 core single nucleotide polymorphism (SNP) loci were combined with phenotypic observations of the chlorotic trait from two consecutive years (chlorotic∶non-chlorotic = 54∶36), and three machine learning models (logistic regression, random forest, and support vector machine) were systematically compared for predictive performance. The random forest model performed best in 10-fold cross-validation, with an accuracy of 78.96%, significantly higher than that of the other models (P<0.05). Feature importance analysis identified two key genetic marker loci: Chr8_142477650 (associated with a gene encoding the chloroplast-localized pyruvate dehydrogenase E1 β subunit) and Chr8_126475215 (associated with a gene involved in the regulation of RNA processing). However, in independent validation on 109 chlorotic germplasm accessions from multiple sources, prediction accuracy fell to 21.10%; shifts in feature weights caused by differences in genetic background were the main limiting factor. This study establishes a machine learning prediction framework for the chlorotic trait in tea plants, replacing a 24-month phenotypic identification cycle with prediction directly from genotype data and allowing the trait to be predicted at an early stage of breeding. Although cross-cultivar generalization still needs improvement, the SNP-phenotype association model provides an extensible research paradigm for dissecting complex genotype-phenotype relationships in tea plants and represents an innovative application of artificial intelligence to the prediction of complex traits in woody plants.
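To put the cross-validated accuracy in context, the following minimal sketch splits a 54∶36 label vector into 10 stratified folds and computes the accuracy of a trivial majority-class predictor. It is an illustration only and not the authors' pipeline: the function names `stratified_kfold` and `majority_baseline_accuracy`, the random seed, and the use of a majority-class baseline are all assumptions introduced here, and no SNP data or trained model is involved.

```python
import random
from collections import Counter


def stratified_kfold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios roughly match the full set.

    Hypothetical helper written for this illustration (not from the study).
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    offset = 0
    for idx in by_class.values():
        rng.shuffle(idx)
        # Deal indices round-robin, continuing where the previous class stopped,
        # so that fold sizes stay balanced.
        for j, i in enumerate(idx):
            folds[(offset + j) % k].append(i)
        offset += len(idx)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def majority_baseline_accuracy(labels, k=10, seed=0):
    """Mean test accuracy of always predicting the majority class of the training folds."""
    scores = []
    for train, test in stratified_kfold(labels, k, seed):
        majority = Counter(labels[i] for i in train).most_common(1)[0][0]
        scores.append(sum(labels[i] == majority for i in test) / len(test))
    return sum(scores) / len(scores)


# 1 = chlorotic, 0 = non-chlorotic; only the 54:36 ratio comes from the abstract.
labels = [1] * 54 + [0] * 36
baseline = majority_baseline_accuracy(labels)
print(f"majority-class baseline accuracy: {baseline:.2%}")
```

Under these assumptions the baseline is 60%, the share of chlorotic progeny, so a cross-validated accuracy of 78.96% is best read against that floor rather than against 50%. As a separate arithmetic note, 21.10% of 109 accessions is consistent with 23 correct predictions (23/109 ≈ 0.2110).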