Learning to Select In-Context Demonstration Preferred by Large Language Model

The 63rd Annual Meeting of the Association for Computational Linguistics
ACL 2025 Findings

¹ShanghaiTech University  ²Microsoft Research Asia  ³State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China
*Corresponding author
[Figure: GenICL pipeline overview]

Abstract

In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks at inference time using only a few demonstrations. However, ICL performance depends heavily on which demonstrations are selected. Recent work explores retrieval-based methods that select query-specific demonstrations, but these approaches rely on surrogate objectives such as metric learning and do not directly optimize ICL performance; consequently, they struggle to identify truly beneficial demonstrations. Moreover, their discriminative retrieval paradigm is ineffective when the candidate pool lacks sufficient high-quality demonstrations. To address these challenges, we propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL outperforms existing methods at selecting the most effective demonstrations, leading to better ICL performance.

Motivation

Existing retrieval-based methods for demonstration selection face critical challenges. The most significant is the misalignment between the retriever's surrogate learning objective and the intrinsic optimization goal of ICL: the relevance scores that a discriminative model is trained to approximate with metric-learning objectives do not necessarily reflect a candidate's effectiveness as an in-context demonstration for LLMs.

The scarcity of effective demonstration candidates poses a further challenge. As illustrated in the figure below, most demonstration examples are ineffective for most queries, which makes retriever optimization particularly difficult.

[Figure: Distribution of the useful-example ratio]

The distribution of the useful-example ratio across test sets from different datasets. The x-axis is the ratio of useful examples to total examples, where a 'useful example' is one that helps the LLM generate the correct output under in-context learning. The ratio stays low for most test queries, indicating that the majority of demonstration examples are ineffective for the LLM.
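To make the metric concrete, here is a minimal sketch of how such a useful-example ratio could be computed. This is our illustration rather than the authors' released code, and `llm_generate` is a hypothetical stand-in for an arbitrary LLM inference call.

```python
def useful_example_ratio(query: str, gold_answer: str,
                         candidate_pool: list[tuple[str, str]],
                         llm_generate) -> float:
    """Fraction of candidates that, used as a one-shot demonstration,
    lead the LLM to produce the correct answer for this query.

    llm_generate(prompt) -> str is a hypothetical stand-in for any
    LLM inference call; it is not part of the paper's codebase.
    """
    useful = 0
    for demo_input, demo_output in candidate_pool:
        # Prepend the candidate as a one-shot demonstration.
        prompt = f"{demo_input}\n{demo_output}\n\n{query}\n"
        if llm_generate(prompt).strip() == gold_answer:
            useful += 1
    return useful / len(candidate_pool)
```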

Method

To address these challenges, we propose GenICL, a novel generative preference learning framework that directly optimizes demonstration selection for ICL using LLM feedback. We reformulate ICL as a generative Bayesian optimization problem, introducing a latent variable to bridge demonstration selection and LLM inference.

Our optimization objective can be formulated as:

$$P_{\mathcal{M}}(Y \mid \{(X_k, Y_k)\}_{k=1}^K, X) = \int_{z} P_{\mathcal{M}}(Y \mid z, X)\, P_{\mathcal{M}}(z \mid \{(X_k, Y_k)\}_{k=1}^K, X)\, dz$$

Variable Definitions:

  • $(X, Y)$: A task sample with input $X$ and ground-truth output $Y$
  • $\{(X_k, Y_k)\}_{k=1}^K$: A set of $K$ demonstrations selected from the demonstration pool $\mathcal{P}$
  • $\mathcal{P}$: The demonstration pool containing all training samples from all tasks
  • $z$: A latent variable that bridges demonstration selection and LLM inference
  • $P_{\mathcal{M}}$: The probability distribution of the LLM $\mathcal{M}$
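Since the demonstration pool is finite, one natural reading of the integral (our interpretation; the paper may formalize this differently) is a marginalization over discrete candidates $z \in \mathcal{P}$:

$$P_{\mathcal{M}}(Y \mid \{(X_k, Y_k)\}_{k=1}^K, X) \approx \sum_{z \in \mathcal{P}} P_{\mathcal{M}}(Y \mid z, X)\, P_{\mathcal{M}}(z \mid \{(X_k, Y_k)\}_{k=1}^K, X),$$

where the first factor measures how well a candidate $z$ supports the LLM in producing $Y$, and the second acts as a selection distribution over candidates.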

GenICL optimizes this objective through preference learning, which models the relative effectiveness of candidate demonstrations rather than assigning each an absolute relevance score. This lets the method capture finer-grained signal in scenarios where effective demonstrations are scarce, and aligns more closely with the intrinsic objective of ICL.
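As a rough illustration of this idea, the sketch below implements a generic Bradley–Terry-style preference loss under our own assumptions (it is not the paper's exact objective): candidate demonstrations are ranked by the LLM's log-likelihood of the gold answer, and the selector is trained to score the preferred candidate higher.

```python
import torch
import torch.nn.functional as F

def preference_loss(score_pref: torch.Tensor,
                    score_dispref: torch.Tensor,
                    beta: float = 1.0) -> torch.Tensor:
    """Bradley-Terry-style preference loss over demonstration pairs.

    score_pref / score_dispref: selector scores for the demonstration
    the LLM found more / less helpful, where "more helpful" means a
    higher LLM log-likelihood of the gold answer given that
    demonstration and the query.
    """
    return -F.logsigmoid(beta * (score_pref - score_dispref)).mean()

# Hypothetical usage (rank_by_llm_feedback and selector are stand-ins):
#   pref, dispref = rank_by_llm_feedback(demo_a, demo_b, query, gold)
#   loss = preference_loss(selector(pref, query), selector(dispref, query))
```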

Experimental Results

Main Results on Classification, Multi-Choice, and Text Generation Tasks

GenICL achieves superior performance compared to existing demonstration selection methods across multiple datasets and task categories. The results demonstrate consistent improvements across classification tasks, multi-choice reasoning, and text generation tasks.

Classification Tasks

Task categories: Topic (AGNews); Reading Comprehension (BoolQ, MultiRC); Paraphrase (PAWS, QQP); NLI (RTE, SNLI); Sentiment (Sentiment140, SST2).

| Method | AGNews | BoolQ | MultiRC | PAWS | QQP | RTE | SNLI | Sentiment140 | SST2 |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 31.4 | 64.7 | 57.0 | 53.0 | 57.9 | 59.9 | 39.6 | 49.3 | 54.2 |
| Random | 65.0 | 69.6 | **60.4** | 49.6 | 54.0 | 65.7 | 40.4 | 78.8 | 64.1 |
| BM25 | 90.0 | 74.0 | 58.7 | 56.5 | 80.3 | 59.9 | 47.7 | 88.3 | 84.7 |
| SBERT | 89.8 | 73.6 | 53.3 | 58.3 | 81.7 | 60.2 | 56.2 | 94.1 | 87.8 |
| E5base | 90.6 | 71.0 | 54.0 | 55.6 | 77.3 | 68.5 | 53.7 | 93.0 | 92.4 |
| CBDS | 67.3 | 77.6 | 49.3 | 57.6 | 64.2 | 56.3 | 43.5 | 92.5 | 69.2 |
| EPR | 91.8 | 74.8 | 50.4 | 57.7 | 81.7 | 66.8 | 68.4 | 91.4 | 88.7 |
| LLM-R | 92.4 | 74.9 | 50.2 | 57.5 | 80.9 | 61.7 | 80.0 | 91.6 | 93.4 |
| GenICL (ours) | **92.6** | **78.1** | 56.9 | **63.9** | **82.0** | **72.9** | **84.6** | **94.7** | **95.0** |

Multi-Choice and Text Generation Tasks

Task categories: Coreference (Winogrande); Commonsense Reasoning (COPA, HellaSwag, OpenBookQA); Summarization (AESLC, Gigaword); CommonGen (CommonGen); Data-to-text (DART, E2ENLG); CloseQA (SQuADv1).

| Method | Winogrande | COPA | HellaSwag | OpenBookQA | AESLC | Gigaword | CommonGen | DART | E2ENLG | SQuADv1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 61.8 | 66.0 | 71.5 | 41.6 | 5.7 | 15.4 | 19.2 | 22.8 | 34.6 | 2.2 |
| Random | 62.1 | 74.0 | 72.1 | 43.0 | 6.5 | 27.2 | 36.3 | 34.5 | 51.1 | 47.3 |
| BM25 | 66.4 | 77.0 | 74.6 | 47.8 | 23.9 | 32.5 | 38.3 | 55.4 | 53.8 | 53.7 |
| SBERT | 66.9 | 81.0 | **75.2** | 49.2 | 22.3 | 31.7 | 37.8 | 54.6 | 50.6 | 62.5 |
| E5base | 66.7 | 84.0 | 75.0 | 51.0 | 23.4 | 31.9 | 37.6 | 54.9 | 52.3 | 61.9 |
| CBDS | 66.3 | 83.0 | 73.6 | 47.6 | 20.5 | 29.2 | 34.5 | 52.2 | 50.8 | 59.8 |
| EPR | 66.5 | 82.0 | **75.2** | 49.6 | **26.0** | 32.4 | 39.2 | 56.2 | 53.6 | 64.3 |
| LLM-R | 66.9 | 85.0 | 74.6 | 50.8 | **26.0** | 32.5 | 37.2 | 56.0 | 54.4 | 61.8 |
| GenICL (ours) | **68.0** | **86.0** | 74.6 | **51.8** | 24.4 | **33.0** | **41.0** | **56.4** | **55.1** | **65.7** |

Note: Bold numbers indicate the best performance in each column. GenICL consistently outperforms existing methods across most tasks, demonstrating the effectiveness of our generative preference learning approach.

Case Study

We present several case studies to illustrate how GenICL selects effective demonstrations across different task types. These examples showcase the quality and relevance of demonstrations chosen by our method.

Task: AGNews
  • Test input: "Dominant US captures gold with 79th straight win The US softball team completed its scorched-earth run through the Olympics on Monday with a 5-1 win over Australia, America's third straight gold medal." What is this text about? World, Sports, Business, or Technology?
  • Demonstration: "US Women Shatter Olympic 800-Meter Freestyle Relay Record The United States has shattered a 17-year-old world record in the women's Olympic 800-meter freestyle relay." What is this text about? World, Sports, Business, or Technology? Sports
  • Test answer: "Sports"

Task: Sentiment140
  • Test input: zomg!!! I have a G2!!!!!!! What is the sentiment of this tweet?
  • Demonstration: My brother got his update for his G1 and I ain't got shit What is the sentiment of this tweet? Negative
  • Test answer: "Positive"

Task: RTE
  • Test input: The west has preferred to focus on endangered animals, rather than endangered humans. African elephants are hunted down and stripped of tusks and hidden by poachers. Their numbers in Africa slumped from 1.2m to 600,000 in a decade until CITES - the Convention on International Trade in Endangered Species - banned the trade in ivory. Based on the paragraph above can we conclude that "African elephants are endangered by ivory poachers."? Yes or No?
  • Demonstration: Three leading Japanese banks have announced an alliance forming the world's largest financial group. Fuji Bank, Dai-Ichi Kangyo and the Industrial Bank of Japan say their operations will be integrated by the spring of 2002. Based on the paragraph above can we conclude that "Merger of Japanese Banks creates the world's biggest bank."? Yes or No? Yes
  • Test answer: "Yes"

Task: SST2
  • Test input: Review: "instead of a hyperbolic beat-charged urban western , it 's an unpretentious , sociologically pointed slice of life ." Is this movie review sentence negative or positive?
  • Demonstration: Review: "offers an unexpected window into the complexities of the middle east struggle and into the humanity of its people ." Is this movie review sentence negative or positive? Positive
  • Test answer: "Positive"

Task: CommonGen
  • Test input: Concepts: kid, yard, ball. Write a sentence that includes all these words.
  • Demonstration: Concepts: kid, grass, crawl. Write a sentence that includes all these words. A kid is about to crawl through some grass.
  • Test answer: "A kid is playing with a ball in his yard."

These examples demonstrate how GenICL leverages LLM feedback to directly optimize demonstration selection for in-context learning. Unlike traditional retrieval-based methods that rely on surrogate objectives, our generative preference learning framework identifies truly beneficial demonstrations that lead to better ICL performance across diverse task types including classification, natural language inference, sentiment analysis, and text generation.

BibTeX

@inproceedings{zhang2025learning,
  author    = {Zhang, Zheng and
               Lan, Shaocheng and
               Song, Lei and
               Bian, Jiang and
               Li, Yexin and
               Ren, Kan},
  title     = {Learning to Select In-Context Demonstration Preferred by Large Language Model},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
  year      = {2025},
}