Fine-Tuning Dense Rankers Using Synthetic Data
Sun 15.02, 09:00 - 09:30
- Graduate Student Seminar
- Bloomfield 526
Abstract:
Dense retrieval models are commonly used in modern information retrieval, owing to their ability to capture nuanced semantic relationships between queries and documents. Traditionally, these models are fine-tuned using large-scale, human-annotated passage corpora, whose construction is costly, domain-dependent, and difficult to scale. Recent progress in large language models (LLMs) has enabled automatic generation of synthetic passages, raising the question of whether such data can replace or complement human supervision for training dense retrievers. This thesis studies the effectiveness of LLM-generated passage corpora for dense retrieval training. We investigate four passage corpus construction paradigms spanning different supervision regimes: fully supervised human-written passages, zero-shot LLM generation, unsupervised few-shot generation using raw text exemplars, and supervised few-shot generation using limited human-labeled query-passage pairs. We evaluate the impact of these corpus variants on two widely used retrieval architectures (cross-encoder rerankers and bi-encoder retrievers) in both in-distribution and out-of-distribution evaluation settings. Our findings reveal systematic, architecture-dependent effects of synthetic supervision. In out-of-distribution evaluation, LLM-generated passages provide stronger training signals than human-written passages for both cross-encoder and bi-encoder models, indicating improved robustness and transferability. Conversely, under in-distribution evaluation, human-authored passages remain more effective. For cross-encoder rerankers, combining human-written passages with zero-shot generated passages consistently outperforms training on human data alone.
For bi-encoder retrievers, pretrained models such as E5 do not benefit from naive inclusion of zero-shot generated passages; however, we identify two few-shot generation strategies (one unsupervised, one lightly supervised) that achieve performance competitive with human-written passages in small-corpus settings and yield substantial gains when used for corpus enrichment. To support reproducibility and further research, we construct and release multiple synthetic passage datasets generated under these paradigms using state-of-the-art LLMs, including GPT-4.1-mini and Gemini-2.5-Flash.
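The architectural distinction the abstract relies on can be sketched as follows. This is a minimal illustration, not the thesis code: the "embeddings" are toy bag-of-words vectors standing in for learned transformer encoders, and the cross-encoder score is faked with joint token overlap.

```python
# Toy sketch of the two retrieval architectures compared in the abstract.
# Assumption: bag-of-words vectors stand in for learned dense embeddings.

from collections import Counter
from math import sqrt


def embed(text: str) -> Counter:
    """Toy stand-in for a learned encoder: bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def bi_encoder_score(query: str, passage: str) -> float:
    # Bi-encoder: query and passage are encoded INDEPENDENTLY, then compared.
    # Passage vectors can therefore be precomputed and indexed, which is what
    # makes bi-encoders suitable for first-stage retrieval at scale.
    return cosine(embed(query), embed(passage))


def cross_encoder_score(query: str, passage: str) -> float:
    # Cross-encoder: the model sees query and passage JOINTLY. Here we mimic
    # that with token overlap over the pair; a real reranker feeds the
    # concatenated pair through a transformer to produce a relevance score,
    # so it cannot precompute passage representations.
    q, p = embed(query), embed(passage)
    overlap = sum((q & p).values())
    return overlap / max(len(query.split()), 1)
```

In the pipeline the thesis evaluates, a bi-encoder typically retrieves candidate passages at scale and a cross-encoder reranks the short list, which is why the two architectures respond differently to synthetic training passages.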

