Cross-modal retrieval aims to bridge the semantic gap between different modalities, such as visual and textual data, enabling accurate retrieval across them. Despite significant advancements with models like CLIP that align cross-modal representations, a persistent challenge remains: the hubness problem, where a small subset of samples (hubs) dominates as nearest neighbors, leading to biased representations and degraded retrieval accuracy. Existing methods often mitigate hubness through post-hoc normalization techniques that rely on prior data distributions, which may not be available in real-world scenarios. In this paper, we introduce NeighborRetr, a novel method that mitigates hubness directly during training by balancing the learning of hubs and adaptively adjusting the relations among different kinds of neighbors. Our approach not only mitigates the hubness problem but also enhances retrieval performance, achieving state-of-the-art results on multiple cross-modal retrieval benchmarks. Furthermore, NeighborRetr generalizes robustly to new domains with substantial distribution shifts, highlighting its effectiveness in real-world applications.
Figure 1: Overview of the NeighborRetr approach, illustrating the hubness balancing process and the identification mechanism for distinguishing between good and bad neighbors in cross-modal retrieval.
Our analysis reveals that vanilla CLIP creates a problematic distribution in which bad hubs dominate retrieval results. Examining k-occurrence frequency, we observe: (1) bad hubs with large Nk(x) frequently appear among the top-15 nearest neighbors, (2) good neighbors are distributed across lower frequencies, and (3) many anti-hubs rarely appear in any retrieval results. NeighborRetr significantly reduces bad hubs, enhances good ones, and minimizes anti-hubs, creating a balanced embedding space better aligned with the ground truth.
Figure 2: Distribution of k-occurrence frequency in CLIP embeddings, demonstrating NeighborRetr's effectiveness in balancing the embedding space by reducing bad hubs and enhancing good neighbors while minimizing anti-hubs.
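The k-occurrence statistic Nk(x) above counts how often a gallery item x appears among the top-k nearest neighbors over all queries; items with very large counts are hubs, items with near-zero counts are anti-hubs. The following is a minimal sketch of that measurement (the function name and the random toy data are our own, not from the paper):

```python
import numpy as np

def k_occurrence(query_emb, gallery_emb, k=15):
    """Count how often each gallery item appears in the top-k
    nearest neighbors of the queries (the N_k(x) statistic)."""
    # Cosine similarity between L2-normalised embeddings.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sim = q @ g.T                            # (num_queries, num_gallery)
    topk = np.argsort(-sim, axis=1)[:, :k]   # indices of the k nearest gallery items
    # Hubs have counts far above k; anti-hubs have counts near zero.
    return np.bincount(topk.ravel(), minlength=len(gallery_emb))

# Toy data: with random embeddings the counts are already uneven,
# illustrating how hubness arises even without semantic structure.
rng = np.random.default_rng(0)
queries = rng.normal(size=(1000, 64))
gallery = rng.normal(size=(200, 64))
nk = k_occurrence(queries, gallery, k=15)
```

Plotting a histogram of `nk` reproduces the kind of k-occurrence distribution shown in Figure 2.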
NeighborRetr introduces a comprehensive framework to address the hubness problem in cross-modal retrieval:
We measure sample centrality using an efficient memory bank approach to identify hubs, which allows us to directly emphasize the learning of hubs within each modality during training.
Our approach distinguishes between good and bad hubs by incorporating centrality into similarity measures, dynamically promoting good hubs while penalizing bad ones.
We employ a uniform marginal constraint to ensure anti-hubs have retrieval probabilities comparable to normal samples, creating a more balanced embedding space.
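The three components above can be sketched together as follows. This is an illustrative reconstruction under our own assumptions, not the paper's official implementation: the bank size, the trade-off weight `alpha`, the temperature `tau`, and the Sinkhorn-style iteration used to approximate the uniform marginal constraint are all hypothetical choices.

```python
import numpy as np

class CentralityBank:
    """Sketch of a memory bank for estimating sample centrality:
    a sample's centrality is its mean similarity to recently
    stored embeddings, so hub-like samples score high."""
    def __init__(self, dim, size=256, seed=0):
        rng = np.random.default_rng(seed)
        bank = rng.normal(size=(size, dim))
        self.bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        self.ptr = 0

    def update(self, emb):
        # Overwrite the oldest slots with the newest batch (ring buffer).
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        idx = (self.ptr + np.arange(len(emb))) % len(self.bank)
        self.bank[idx] = emb
        self.ptr = int((self.ptr + len(emb)) % len(self.bank))

    def centrality(self, emb):
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        return (emb @ self.bank.T).mean(axis=1)

def centrality_adjusted_sim(sim, cand_centrality, alpha=1.0):
    # Subtract each candidate's centrality from its similarity scores,
    # demoting bad hubs whose high similarity merely reflects being
    # close to everything; alpha is a hypothetical trade-off weight.
    return sim - alpha * cand_centrality[None, :]

def uniform_marginal(sim, n_iters=5, tau=0.05):
    # Sinkhorn-style normalisation: repeatedly equalise each candidate's
    # total retrieval probability so anti-hubs are not starved.
    P = np.exp(sim / tau)
    P = P / P.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        P = P / P.sum(axis=0, keepdims=True)  # balance candidate marginals
        P = P / P.sum(axis=1, keepdims=True)  # restore row probabilities
    return P

# Toy usage with random embeddings.
rng = np.random.default_rng(1)
bank = CentralityBank(dim=32)
queries = rng.normal(size=(10, 32))
cands = rng.normal(size=(8, 32))
bank.update(cands)
q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
c = cands / np.linalg.norm(cands, axis=1, keepdims=True)
sim = q @ c.T
adj = centrality_adjusted_sim(sim, bank.centrality(cands))
P = uniform_marginal(sim)
```

In the actual method these quantities feed into the training loss rather than being applied post hoc, which is what distinguishes NeighborRetr from test-time normalization approaches.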
Our experiments demonstrate NeighborRetr's robust cross-domain generalization between the MSR-VTT and ActivityNet datasets. When trained on MSR-VTT and tested on ActivityNet, our method achieves the lowest hub occurrence and the best retrieval performance, indicating that addressing hubness during training significantly benefits cross-domain adaptation. Unlike QB-Norm, which relies on test-time adjustments, our approach performs better under large distribution shifts.
Figure 3: Cross-domain adaptation performance between MSR-VTT and ActivityNet datasets, demonstrating NeighborRetr's superior generalization capabilities and significantly lower hub occurrence compared to baseline methods.
NeighborRetr effectively ranks videos relevant to text queries by balancing similarity and centrality scores. Higher-ranked videos show larger gaps between their similarity and centrality values, indicating our model's ability to prioritize less central samples and reduce bias towards over-represented data. Among the neighbors at ranks 2-5, our method identifies good neighbors while maintaining semantic diversity, showcasing adaptability to various contexts.
Figure 4: Text-to-video retrieval visualization showing how NeighborRetr effectively balances similarity and centrality scores to improve ranking quality while maintaining semantic diversity across retrieved results.
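The ranking behavior described above can be illustrated with a toy re-ranking example. The candidate names and all score values here are made up for illustration; the point is only the mechanism, in which a strong hub with the highest raw similarity is demoted once its centrality is subtracted:

```python
import numpy as np

# Hypothetical scores: candidate B has the highest raw similarity to the
# query but is also a strong hub (high mean similarity to the memory
# bank), so the centrality-penalised score ranks A above it.
candidates = ["A", "B", "C", "D"]
similarity = np.array([0.82, 0.85, 0.60, 0.55])
centrality = np.array([0.20, 0.45, 0.25, 0.18])

final = similarity - centrality          # large gap => confident, non-hub match
ranking = [candidates[i] for i in np.argsort(-final)]
print(ranking)                           # A now outranks the hub B
```

This mirrors Figure 4, where top-ranked videos exhibit large similarity-minus-centrality gaps while over-represented hub videos fall in the ranking.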
@article{lin2025neighborretr,
title = {NeighborRetr: Balancing Hub Centrality in Cross-Modal Retrieval},
author = {Lin, Zengrong and Wang, Zheng and Qian, Tianwen and Mu, Pan and Chan, Sixian and Bai, Cong},
journal = {arXiv preprint arXiv:2503.10526},
year = {2025}
}