Artificial Intelligence

Scientific journal

ISSN 2710-1673

ONLINE: ISSN 2710-1681


Thematic clustering of English-Ukrainian text corpora using multilingual vector representations

Pavlyshenko B.M.1, Stasiuk M.I.1
1 Ivan Franko National University of Lviv
mykola.stasiuk@lnu.edu.ua

UDC: 004.85:81'322:81'246.2
Publication language: English
Stuc. intelekt. 2026; 31; (1):35-47

Abstract: This study investigates how multilingual sentence-embedding models affect unsupervised thematic clustering of a balanced bilingual corpus of English and Ukrainian documents. Modern embedding architectures project texts from multiple languages into a shared semantic space, but it remains unclear how well different models preserve a geometrically separable thematic structure suitable for clustering. A balanced dataset of 1,404 documents across four high-level thematic domains (History, Culture, Science, and Technology) was constructed using a high-confidence zero-shot labeling procedure. Three multilingual sentence-embedding models (LaBSE, paraphrase-xlm-r-multilingual-v1, and paraphrase-multilingual-mpnet-base-v2) were evaluated in combination with three clustering approaches: KMeans, Agglomerative Hierarchical Clustering, and Fuzzy C-Means. Visual analysis using UMAP projections, together with quantitative evaluation via intrinsic metrics (Silhouette, Calinski-Harabasz, Davies-Bouldin) and extrinsic metrics (Adjusted Rand Index, Normalized Mutual Information), shows that clustering performance is determined primarily by the geometric properties of the embedding space rather than by the choice of clustering algorithm. LaBSE produces the most stable and well-separated thematic regions and achieves the highest agreement with the ground-truth categories. The paraphrase-xlm-r model exhibits reduced thematic separability, while the multilingual MPNet variant shows intermediate behavior, with stronger performance under hierarchical clustering. Across all models, English and Ukrainian documents remain interwoven within thematic regions, confirming effective cross-lingual alignment; the degree of semantic separability, however, varies significantly between embedding architectures. The findings indicate that representation quality plays a dominant role in unsupervised thematic analysis of bilingual corpora. Embedding model selection is therefore a more impactful factor than clustering strategy when analyzing cross-lingual semantic structure.
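
The two extrinsic metrics named in the abstract compare a clustering against reference categories without requiring cluster IDs to match category names. The following is a minimal standard-library sketch of both, not the authors' evaluation code. The function names are hypothetical, and the arithmetic-mean normalization used for NMI is an assumption, since the abstract does not say which normalization was applied.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have equal length")
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)                     # reference category sizes
    cols = Counter(labels_pred)                     # cluster sizes

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treated as trivially identical
    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # degenerate partitions (assumed convention)
    return (sum_cells - expected) / (maximum - expected)


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_true, labels_pred):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have equal length")
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    mi = sum(
        (c / n) * log(c * n / (rows[a] * cols[b]))
        for (a, b), c in cells.items()
    )
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    return mi / denom if denom > 0 else 1.0


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the study.
    truth = ["History", "History", "Science", "Science"]
    clusters = [0, 0, 1, 2]
    print(round(adjusted_rand_index(truth, clusters), 3))  # 0.571
    print(round(normalized_mutual_information(truth, clusters), 3))
```

Both scores are invariant to relabeling of clusters, which is why they suit a setting where KMeans, hierarchical, and fuzzy clusterings assign arbitrary cluster IDs to the four thematic domains.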

Keywords: multilingual sentence embeddings, unsupervised clustering, cross-lingual semantic alignment, English-Ukrainian bilingual corpus, distance distribution, thematic analysis
