Multimodal Metric Learning for Fine-Grained Video Retention Prediction in Ego-Centric Networks

https://doi.org/10.15407/jai2026.01.127

Гусєв В.С.¹, Вергунова І.М.¹

¹ Taras Shevchenko National University of Kyiv

gusevvovik@gmail.com; vergunova@hotmail.com

https://orcid.org/0000-0002-9274-0625 https://orcid.org/0000-0003-3052-9143

UDC: 004.8:004.94
Publication language: English
Stuc. intelekt. 2026; 31; (1):127-139

Abstract: Predicting user retention in digital video content is a pivotal challenge in modern recommendation systems. Global-scale models effectively leverage massive interaction logs, but they often fail to capture the nuanced engagement dynamics of “ego-centric” networks, that is, single-creator channels where viewer retention is driven by specific stylistic signatures rather than broad topic relevance. In this work, we present a framework for predicting fine-grained retention curves in Minecraft gameplay videos using a novel Multimodal Metric-Regularized Regression approach. We introduce a normalization pipeline to standardize heterogeneous content and extract features using state-of-the-art foundation models: InternVideo2 (spatiotemporal), M2D2 (semantic audio), SigLIP 2 (visual static), and E5-Small (textual). Unlike traditional direct regression methods, we propose a two-stage training paradigm. First, we structure the latent space using ArcFace loss to maximize the geodesic distance between high- and low-performing content. Second, we train a Cross-Modal Transformer to regress the retention curve from this discriminative manifold. Our experimental results demonstrate that this metric-regularization strategy reduces Mean Absolute Error (MAE) by 30% compared to direct regression baselines, achieving an XAUC of 0.74. Furthermore, latent space visualization reveals distinct clusters corresponding to “viral” hooks and “churn” patterns, offering interpretable insights into the audio-visual drivers of viewer engagement.
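
The first stage described in the abstract relies on ArcFace loss, which adds an angular margin to the target-class angle on a unit hypersphere. The following is a minimal NumPy sketch of that general technique, not the authors' implementation. The function names, the `scale` and `margin` defaults, and the idea of bucketing videos into discrete performance classes (`labels`) are assumptions made for illustration; the page does not specify them. The sketch also omits the usual handling of angles where theta + margin exceeds pi.

```python
import numpy as np


def arcface_logits(embeddings, class_weights, labels, scale=30.0, margin=0.5):
    """Return ArcFace logits: scale * cos(theta + margin) for the target class,
    scale * cos(theta) for every other class (hypothetical helper)."""
    # Project embeddings and class centres onto the unit hypersphere.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    # Cosine similarity equals the dot product of unit vectors.
    cos = np.clip(e @ w.T, -1.0, 1.0)
    theta = np.arccos(cos)
    # Boolean mask selecting each sample's target-class entry.
    target = np.zeros_like(cos, dtype=bool)
    target[np.arange(len(labels)), labels] = True
    # Penalise only the target class by widening its angle.
    return scale * np.where(target, np.cos(theta + margin), cos)


def cross_entropy(logits, labels):
    """Mean softmax cross-entropy over a batch, computed stably."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


if __name__ == "__main__":
    # Two toy embeddings, each aligned with its own class centre.
    # Assumption: class 0 = "low retention", class 1 = "high retention".
    emb = np.array([[1.0, 0.0], [0.0, 1.0]])
    centres = np.array([[1.0, 0.0], [0.0, 1.0]])
    y = np.array([0, 1])
    loss = cross_entropy(arcface_logits(emb, centres, y, scale=4.0), y)
    print(f"ArcFace loss: {loss:.4f}")
```

Because the margin lowers the target logit even for perfectly aligned samples, the loss keeps pushing classes apart in angle, which is the property the abstract attributes to the metric-regularization stage.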

Keywords: user attention survival function, ego-networks, metric learning, metric-regularized regression, hyperspherical space, spatiotemporal dynamics, semantic depth

Artificial Intelligence (Штучний інтелект)

Scientific journal
