Thematic Clustering of English-Ukrainian Text Corpora Using Multilingual Vector Representations

https://doi.org/10.15407/jai2026.01.035

Pavlyshenko B.M.¹, Stasiuk M.I.¹

¹ Ivan Franko National University of Lviv

mykola.stasiuk@lnu.edu.ua

https://orcid.org/0009-0006-5256-2103

UDC: 004.85:81'322:81'246.2
Publication language: English
Stuc. intelekt. 2026; 31; (1):35-47

Abstract: This study investigates how the choice of multilingual sentence-embedding model affects unsupervised thematic clustering of a balanced bilingual corpus of English and Ukrainian documents. While modern embedding architectures project texts from multiple languages into a shared semantic space, it remains unclear how well different models preserve a geometrically separable thematic structure suitable for clustering. A balanced dataset of 1,404 documents across four high-level thematic domains (History, Culture, Science, and Technology) was constructed using a high-confidence zero-shot labeling procedure. Three multilingual sentence-embedding models (LaBSE, paraphrase-xlm-r-multilingual-v1, and paraphrase-multilingual-mpnet-base-v2) were evaluated in combination with three clustering approaches: KMeans, Agglomerative Hierarchical Clustering, and Fuzzy C-Means. Visual analysis using UMAP projections and quantitative evaluation via intrinsic (Silhouette, Calinski-Harabasz, Davies-Bouldin) and extrinsic (Adjusted Rand Index, Normalized Mutual Information) metrics show that clustering performance is determined primarily by the geometric properties of the embedding space rather than by the choice of clustering algorithm. LaBSE produces the most stable and well-separated thematic regions and achieves the highest agreement with the ground-truth categories. The paraphrase-xlm-r model exhibits reduced thematic separability, while the multilingual MPNet variant shows intermediate behavior, with stronger performance under hierarchical clustering. Across all models, English and Ukrainian documents remain interwoven within thematic regions, confirming effective cross-lingual alignment; the degree of semantic separability, however, varies significantly between embedding architectures. The findings indicate that representation quality plays a dominant role in unsupervised thematic analysis of bilingual corpora, so embedding model selection has more impact than clustering strategy when analyzing cross-lingual semantic structure.
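The evaluation loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes scikit-learn for clustering and metrics, and it uses synthetic vectors from `make_blobs` as a stand-in for real sentence embeddings so that it runs without downloading a model. The helper names `embed` and `evaluate` are hypothetical. The vector dimension, random seeds, and Ward linkage are illustrative choices, not settings reported in the abstract. Fuzzy C-Means is left out because scikit-learn does not provide it.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    silhouette_score,
)


def embed(texts, model_name="sentence-transformers/LaBSE"):
    """Hypothetical helper: encode documents with a multilingual model.

    Requires the sentence-transformers package and a model download,
    so it is defined here but not called in this sketch.
    """
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(model_name).encode(texts)


def evaluate(X, y_true, n_clusters=4):
    """Cluster X with two algorithms and score each partition."""
    algorithms = {
        "KMeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        # Ward linkage is an assumption; the abstract does not name one.
        "Agglomerative (Ward)": AgglomerativeClustering(
            n_clusters=n_clusters, linkage="ward"
        ),
    }
    results = {}
    for name, algo in algorithms.items():
        labels = algo.fit_predict(X)
        results[name] = {
            # Intrinsic metrics: use only the geometry of X and the labels.
            "Silhouette": silhouette_score(X, labels),
            "Calinski-Harabasz": calinski_harabasz_score(X, labels),
            "Davies-Bouldin": davies_bouldin_score(X, labels),
            # Extrinsic metrics: compare the labels with the known categories.
            "ARI": adjusted_rand_score(y_true, labels),
            "NMI": normalized_mutual_info_score(y_true, labels),
        }
    return results


# Synthetic stand-in for embeddings: 1,404 vectors in 4 balanced groups,
# mirroring the corpus size and number of thematic domains only.
X, y_true = make_blobs(n_samples=1404, centers=4, n_features=32, random_state=0)

results = evaluate(X, y_true)
for name, scores in results.items():
    summary = ", ".join(f"{metric}={value:.3f}" for metric, value in scores.items())
    print(f"{name}: {summary}")
```

To compare embedding models rather than synthetic data, `X` would be replaced by the output of `embed(texts, model_name)` for each model in turn, with `y_true` taken from the zero-shot labels. Higher Silhouette, Calinski-Harabasz, ARI, and NMI values and lower Davies-Bouldin values indicate better-separated clusters.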

Keywords: multilingual sentence embeddings, unsupervised clustering, cross-lingual semantic alignment, English-Ukrainian bilingual corpus, distance distribution, thematic analysis

Штучний інтелект (Artificial Intelligence)

Scientific journal
