In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks during inference using only a few demonstrations. However, ICL performance is highly dependent on the selection of these demonstrations. Recent work explores retrieval-based methods for selecting query-specific demonstrations, but these approaches often rely on surrogate objectives such as metric learning, failing to directly optimize ICL performance. Consequently, they struggle to identify truly beneficial demonstrations. Moreover, their discriminative retrieval paradigm is ineffective when the candidate pool lacks sufficient high-quality demonstrations. To address these challenges, we propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL outperforms existing methods at selecting the most effective demonstrations, leading to better ICL performance.
Existing retrieval-based methods for demonstration selection face critical challenges. The most significant is the misalignment between the retriever's surrogate learning objective and the intrinsic optimization goal of ICL: the relevance scores a discriminative model learns through metric-learning objectives do not necessarily reflect a candidate's effectiveness as an in-context demonstration for the LLM.
Furthermore, the scarcity of effective demonstration candidates poses another challenge. As illustrated in the figure below, most demonstration examples are ineffective for most queries, making retriever optimization particularly challenging.
Figure: The distribution of the useful-example ratio across test sets from different datasets. The x-axis represents the ratio of useful examples to total examples, where a "useful example" is one that helps the LLM generate the correct output with in-context learning. This ratio remains low for most test queries, indicating that the majority of demonstration examples are ineffective for the LLM.
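The useful-example ratio described above can be estimated empirically: for each query, check how many candidate demonstrations, when prepended as a one-shot prompt, let the LLM produce the correct answer. A minimal sketch, where `llm_predict` is a hypothetical black-box LLM call (an assumption, not part of the paper):

```python
def useful_ratio(query, answer, candidates, llm_predict):
    """Fraction of candidate demonstrations that make the LLM
    answer `query` correctly when used as a one-shot prompt.
    `llm_predict(prompt)` is a hypothetical black-box LLM call."""
    if not candidates:
        return 0.0
    useful = 0
    for demo in candidates:
        # One-shot prompt: demonstration input/output, then the query.
        prompt = f"{demo['input']} {demo['output']}\n{query}"
        if llm_predict(prompt).strip() == answer:
            useful += 1
    return useful / len(candidates)
```

Averaging this ratio over a test set yields the per-dataset distributions summarized in the figure.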
To address these challenges, we propose GenICL, a novel generative preference learning framework that directly optimizes demonstration selection for ICL using LLM feedback. We reformulate ICL as a generative Bayesian optimization problem, introducing a latent variable to bridge demonstration selection and LLM inference.
Our optimization objective can be formulated as:

$$p(y \mid x) = \sum_{z \in \mathcal{D}} p_{\text{LLM}}(y \mid x, z)\, p_{\theta}(z \mid x)$$

Variable definitions: $x$ is the test query, $y$ is the target output, $z$ is the latent demonstration drawn from the candidate pool $\mathcal{D}$, $p_{\text{LLM}}$ is the frozen LLM's conditional likelihood, and $p_{\theta}(z \mid x)$ is the demonstration-selection distribution being optimized.
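One concrete form of LLM feedback is the likelihood the frozen LLM assigns to the gold output when a candidate demonstration is placed in the prompt: a candidate is preferred when it raises that likelihood. A minimal sketch of likelihood-based ranking, where `log_likelihood` stands in for a real LLM scoring call (an assumption, not the paper's implementation):

```python
def rank_by_feedback(x, y, candidates, log_likelihood):
    """Rank candidate demonstrations by how much each raises the
    LLM's log-likelihood of the gold output y given query x.
    `log_likelihood(y, x, z)` is a placeholder for a real LLM scorer."""
    scored = [(z, log_likelihood(y, x, z)) for z in candidates]
    # Higher log-likelihood of y => more preferred demonstration.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [z for z, _ in scored]
```

The resulting ranking can supply preferred/dispreferred pairs for preference learning.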
GenICL optimizes this objective through preference learning, which focuses on the relative effectiveness between demonstration samples. This approach allows our method to capture finer-grained information in scenarios where effective demonstrations are scarce, and aligns more closely with the intrinsic objective of ICL.
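Pairwise preference learning of this kind is commonly instantiated with a Bradley-Terry style logistic loss: given a demonstration $z^+$ preferred by LLM feedback and a dispreferred $z^-$, the selector is trained to score $z^+$ above $z^-$. A sketch of such a pairwise loss (one common instantiation, not necessarily GenICL's exact objective):

```python
import math

def pairwise_preference_loss(score_pos, score_neg):
    """Bradley-Terry / logistic pairwise loss: -log sigmoid(s+ - s-),
    i.e. log(1 + exp(-(s+ - s-))). Small when the preferred
    demonstration is scored well above the dispreferred one."""
    margin = score_pos - score_neg
    # log1p(exp(-margin)) is a numerically stable form of -log(sigmoid).
    return math.log1p(math.exp(-margin))
```

Because the loss depends only on the score margin, it exploits relative orderings between candidates even when absolutely "good" demonstrations are rare.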
GenICL achieves superior performance compared to existing demonstration selection methods across multiple datasets and task categories. The results demonstrate consistent improvements across classification tasks, multi-choice reasoning, and text generation tasks.
Classification tasks, grouped by category: Topic (AGNews), Reading Comprehension (BoolQ, MultiRC), Paraphrase (PAWS, QQP), NLI (RTE, SNLI), and Sentiment (Sentiment140, SST2).

| Method | AGNews | BoolQ | MultiRC | PAWS | QQP | RTE | SNLI | Sentiment140 | SST2 |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 31.4 | 64.7 | 57.0 | 53.0 | 57.9 | 59.9 | 39.6 | 49.3 | 54.2 |
| Random | 65.0 | 69.6 | **60.4** | 49.6 | 54.0 | 65.7 | 40.4 | 78.8 | 64.1 |
| BM25 | 90.0 | 74.0 | 58.7 | 56.5 | 80.3 | 59.9 | 47.7 | 88.3 | 84.7 |
| SBERT | 89.8 | 73.6 | 53.3 | 58.3 | 81.7 | 60.2 | 56.2 | 94.1 | 87.8 |
| E5base | 90.6 | 71.0 | 54.0 | 55.6 | 77.3 | 68.5 | 53.7 | 93.0 | 92.4 |
| CBDS | 67.3 | 77.6 | 49.3 | 57.6 | 64.2 | 56.3 | 43.5 | 92.5 | 69.2 |
| EPR | 91.8 | 74.8 | 50.4 | 57.7 | 81.7 | 66.8 | 68.4 | 91.4 | 88.7 |
| LLM-R | 92.4 | 74.9 | 50.2 | 57.5 | 80.9 | 61.7 | 80.0 | 91.6 | 93.4 |
| GenICL (ours) | **92.6** | **78.1** | 56.9 | **63.9** | **82.0** | **72.9** | **84.6** | **94.7** | **95.0** |
Multi-choice tasks: Coreference (Winogrande) and Commonsense Reasoning (COPA, HellaSwag, OpenBookQA). Text generation tasks: Summarization (AESLC, Gigaword), CommonGen, Data-to-text (DART, E2ENLG), and CloseQA (SQuADv1).

| Method | Winogrande | COPA | HellaSwag | OpenBookQA | AESLC | Gigaword | CommonGen | DART | E2ENLG | SQuADv1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 61.8 | 66.0 | 71.5 | 41.6 | 5.7 | 15.4 | 19.2 | 22.8 | 34.6 | 2.2 |
| Random | 62.1 | 74.0 | 72.1 | 43.0 | 6.5 | 27.2 | 36.3 | 34.5 | 51.1 | 47.3 |
| BM25 | 66.4 | 77.0 | 74.6 | 47.8 | 23.9 | 32.5 | 38.3 | 55.4 | 53.8 | 53.7 |
| SBERT | 66.9 | 81.0 | **75.2** | 49.2 | 22.3 | 31.7 | 37.8 | 54.6 | 50.6 | 62.5 |
| E5base | 66.7 | 84.0 | 75.0 | 51.0 | 23.4 | 31.9 | 37.6 | 54.9 | 52.3 | 61.9 |
| CBDS | 66.3 | 83.0 | 73.6 | 47.6 | 20.5 | 29.2 | 34.5 | 52.2 | 50.8 | 59.8 |
| EPR | 66.5 | 82.0 | **75.2** | 49.6 | **26.0** | 32.4 | 39.2 | 56.2 | 53.6 | 64.3 |
| LLM-R | 66.9 | 85.0 | 74.6 | 50.8 | **26.0** | 32.5 | 37.2 | 56.0 | 54.4 | 61.8 |
| GenICL (ours) | **68.0** | **86.0** | 74.6 | **51.8** | 24.4 | **33.0** | **41.0** | **56.4** | **55.1** | **65.7** |
Note: Bold numbers indicate the best performance in each column. GenICL consistently outperforms existing methods across most tasks, demonstrating the effectiveness of our generative preference learning approach.
We present several case studies to illustrate how GenICL selects effective demonstrations across different task types. These examples showcase the quality and relevance of demonstrations chosen by our method.
| Task Name | AGNews |
|---|---|
| Test Input | "Dominant US captures gold with 79th straight win The US softball team completed its scorched-earth run through the Olympics on Monday with a 5-1 win over Australia, America's third straight gold medal." What is this text about? World, Sports, Business, or Technology? |
| Demonstration | "US Women Shatter Olympic 800-Meter Freestyle Relay Record The United States has shattered a 17-year-old world record in the women's Olympic 800-meter freestyle relay." What is this text about? World, Sports, Business, or Technology? Sports |
| Test Answer | "Sports" |

| Task Name | Sentiment140 |
|---|---|
| Test Input | zomg!!! I have a G2!!!!!!! What is the sentiment of this tweet? |
| Demonstration | My brother got his update for his G1 and I ain't got shit What is the sentiment of this tweet? Negative |
| Test Answer | "Positive" |

| Task Name | RTE |
|---|---|
| Test Input | The west has preferred to focus on endangered animals, rather than endangered humans. African elephants are hunted down and stripped of tusks and hidden by poachers. Their numbers in Africa slumped from 1.2m to 600,000 in a decade until CITES - the Convention on International Trade in Endangered Species - banned the trade in ivory. Based on the paragraph above can we conclude that "African elephants are endangered by ivory poachers."? Yes or No? |
| Demonstration | Three leading Japanese banks have announced an alliance forming the world's largest financial group. Fuji Bank, Dai-Ichi Kangyo and the Industrial Bank of Japan say their operations will be integrated by the spring of 2002. Based on the paragraph above can we conclude that "Merger of Japanese Banks creates the world's biggest bank."? Yes or No? Yes |
| Test Answer | "Yes" |

| Task Name | SST2 |
|---|---|
| Test Input | Review: "instead of a hyperbolic beat-charged urban western , it 's an unpretentious , sociologically pointed slice of life ." Is this movie review sentence negative or positive? |
| Demonstration | Review: "offers an unexpected window into the complexities of the middle east struggle and into the humanity of its people ." Is this movie review sentence negative or positive? Positive |
| Test Answer | "Positive" |

| Task Name | CommonGen |
|---|---|
| Test Input | Concepts: kid, yard, ball. Write a sentence that includes all these words. |
| Demonstration | Concepts: kid, grass, crawl. Write a sentence that includes all these words. A kid is about to crawl through some grass. |
| Test Answer | "A kid is playing with a ball in his yard." |
These examples demonstrate how GenICL leverages LLM feedback to directly optimize demonstration selection for in-context learning. Unlike traditional retrieval-based methods that rely on surrogate objectives, our generative preference learning framework identifies truly beneficial demonstrations that lead to better ICL performance across diverse task types including classification, natural language inference, sentiment analysis, and text generation.
@inproceedings{zhang2025learning,
author = {Zhang, Zheng and
Lan, Shaocheng and
Song, Lei and
Bian, Jiang and
Li, Yexin and
Ren, Kan},
title = {Learning to Select In-Context Demonstration Preferred by Large Language Model},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
year = {2025},
}