In-context learning (ICL) enables large language models (LLMs) to adapt to new tasks during inference using only a few demonstrations. However, ICL performance is highly dependent on the selection of these demonstrations. Recent work explores retrieval-based methods for selecting query-specific demonstrations, but these approaches often rely on surrogate objectives such as metric learning, failing to directly optimize ICL performance. Consequently, they struggle to identify truly beneficial demonstrations. Moreover, their discriminative retrieval paradigm is ineffective when the candidate pool lacks sufficient high-quality demonstrations. To address these challenges, we propose GenICL, a novel generative preference learning framework that leverages LLM feedback to directly optimize demonstration selection for ICL. Experiments on 19 datasets across 11 task categories demonstrate that GenICL outperforms existing methods at selecting the most effective demonstrations, leading to better ICL performance.
Existing retrieval-based methods for demonstration selection face critical challenges. The most significant is the misalignment between the retriever's surrogate learning objective and the intrinsic optimization goal of ICL: the relevance scores a discriminative model learns through metric-learning objectives do not necessarily reflect a candidate's effectiveness as an in-context demonstration for the LLM.
Furthermore, the scarcity of effective demonstration candidates poses another challenge. As illustrated in the figure below, most demonstration examples are ineffective for most queries, making retriever optimization particularly challenging.
Figure: The distribution of the useful-example ratio across test sets from different datasets. The x-axis represents the ratio of useful examples to total examples, where a "useful example" is one that helps the LLM generate the correct output with in-context learning. This ratio remains low for most test queries, indicating that the majority of demonstration examples are ineffective for the LLM.
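The useful-example ratio described above can be estimated empirically: for each query, check how many candidate demonstrations, when prepended as a one-shot prompt, let the LLM produce the correct answer. A minimal sketch, where `llm_predict` is a hypothetical black-box LLM call (an assumption, not part of the paper):

```python
def useful_ratio(query, answer, candidates, llm_predict):
    """Fraction of candidate demonstrations that make the LLM
    answer `query` correctly when used as a one-shot prompt.
    `llm_predict(prompt)` is a hypothetical black-box LLM call."""
    if not candidates:
        return 0.0
    useful = 0
    for demo in candidates:
        # One-shot prompt: demonstration input/output, then the query.
        prompt = f"{demo['input']} {demo['output']}\n{query}"
        if llm_predict(prompt).strip() == answer:
            useful += 1
    return useful / len(candidates)
```

Averaging this ratio over a test set yields the per-dataset distributions summarized in the figure.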
To address these challenges, we propose GenICL, a novel generative preference learning framework that directly optimizes demonstration selection for ICL using LLM feedback. We reformulate ICL as a generative Bayesian optimization problem, introducing a latent variable to bridge demonstration selection and LLM inference.
Our optimization objective can be formulated as:

$$p(y \mid x) = \sum_{z \in \mathcal{D}} p_{\text{LLM}}(y \mid x, z)\, p_{\theta}(z \mid x)$$

Variable definitions: $x$ is the test query, $y$ is the target output, $z$ is the latent demonstration drawn from the candidate pool $\mathcal{D}$, $p_{\text{LLM}}$ is the frozen LLM's conditional likelihood, and $p_{\theta}(z \mid x)$ is the demonstration-selection distribution being optimized.
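One concrete form of LLM feedback is the likelihood the frozen LLM assigns to the gold output when a candidate demonstration is placed in the prompt: a candidate is preferred when it raises that likelihood. A minimal sketch of likelihood-based ranking, where `log_likelihood` stands in for a real LLM scoring call (an assumption, not the paper's implementation):

```python
def rank_by_feedback(x, y, candidates, log_likelihood):
    """Rank candidate demonstrations by how much each raises the
    LLM's log-likelihood of the gold output y given query x.
    `log_likelihood(y, x, z)` is a placeholder for a real LLM scorer."""
    scored = [(z, log_likelihood(y, x, z)) for z in candidates]
    # Higher log-likelihood of y => more preferred demonstration.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [z for z, _ in scored]
```

The resulting ranking can supply preferred/dispreferred pairs for preference learning.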
GenICL optimizes this objective through preference learning, which focuses on the relative effectiveness between demonstration samples. This approach allows our method to capture finer-grained information in scenarios where effective demonstrations are scarce, and aligns more closely with the intrinsic objective of ICL.
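Pairwise preference learning of this kind is commonly instantiated with a Bradley-Terry style logistic loss: given a demonstration $z^+$ preferred by LLM feedback and a dispreferred $z^-$, the selector is trained to score $z^+$ above $z^-$. A sketch of such a pairwise loss (one common instantiation, not necessarily GenICL's exact objective):

```python
import math

def pairwise_preference_loss(score_pos, score_neg):
    """Bradley-Terry / logistic pairwise loss: -log sigmoid(s+ - s-),
    i.e. log(1 + exp(-(s+ - s-))). Small when the preferred
    demonstration is scored well above the dispreferred one."""
    margin = score_pos - score_neg
    # log1p(exp(-margin)) is a numerically stable form of -log(sigmoid).
    return math.log1p(math.exp(-margin))
```

Because the loss depends only on the score margin, it exploits relative orderings between candidates even when absolutely "good" demonstrations are rare.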
GenICL achieves superior performance compared to existing demonstration selection methods across multiple datasets and task categories. The results demonstrate consistent improvements across classification tasks, multi-choice reasoning, and text generation tasks.
Classification tasks, grouped by category: Topic (AGNews), Reading Comprehension (BoolQ, MultiRC), Paraphrase (PAWS, QQP), NLI (RTE, SNLI), and Sentiment (Sentiment140, SST2).

| Method | AGNews | BoolQ | MultiRC | PAWS | QQP | RTE | SNLI | Sentiment140 | SST2 |
|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 31.4 | 64.7 | 57.0 | 53.0 | 57.9 | 59.9 | 39.6 | 49.3 | 54.2 |
| Random | 65.0 | 69.6 | **60.4** | 49.6 | 54.0 | 65.7 | 40.4 | 78.8 | 64.1 |
| BM25 | 90.0 | 74.0 | 58.7 | 56.5 | 80.3 | 59.9 | 47.7 | 88.3 | 84.7 |
| SBERT | 89.8 | 73.6 | 53.3 | 58.3 | 81.7 | 60.2 | 56.2 | 94.1 | 87.8 |
| E5base | 90.6 | 71.0 | 54.0 | 55.6 | 77.3 | 68.5 | 53.7 | 93.0 | 92.4 |
| CBDS | 67.3 | 77.6 | 49.3 | 57.6 | 64.2 | 56.3 | 43.5 | 92.5 | 69.2 |
| EPR | 91.8 | 74.8 | 50.4 | 57.7 | 81.7 | 66.8 | 68.4 | 91.4 | 88.7 |
| LLM-R | 92.4 | 74.9 | 50.2 | 57.5 | 80.9 | 61.7 | 80.0 | 91.6 | 93.4 |
| GenICL (ours) | **92.6** | **78.1** | 56.9 | **63.9** | **82.0** | **72.9** | **84.6** | **94.7** | **95.0** |
Multi-choice tasks: Coreference (Winogrande) and Commonsense Reasoning (COPA, HellaSwag, OpenBookQA). Text generation tasks: Summarization (AESLC, Gigaword), CommonGen, Data-to-text (DART, E2ENLG), and CloseQA (SQuADv1).

| Method | Winogrande | COPA | HellaSwag | OpenBookQA | AESLC | Gigaword | CommonGen | DART | E2ENLG | SQuADv1 |
|---|---|---|---|---|---|---|---|---|---|---|
| Zero-shot | 61.8 | 66.0 | 71.5 | 41.6 | 5.7 | 15.4 | 19.2 | 22.8 | 34.6 | 2.2 |
| Random | 62.1 | 74.0 | 72.1 | 43.0 | 6.5 | 27.2 | 36.3 | 34.5 | 51.1 | 47.3 |
| BM25 | 66.4 | 77.0 | 74.6 | 47.8 | 23.9 | 32.5 | 38.3 | 55.4 | 53.8 | 53.7 |
| SBERT | 66.9 | 81.0 | **75.2** | 49.2 | 22.3 | 31.7 | 37.8 | 54.6 | 50.6 | 62.5 |
| E5base | 66.7 | 84.0 | 75.0 | 51.0 | 23.4 | 31.9 | 37.6 | 54.9 | 52.3 | 61.9 |
| CBDS | 66.3 | 83.0 | 73.6 | 47.6 | 20.5 | 29.2 | 34.5 | 52.2 | 50.8 | 59.8 |
| EPR | 66.5 | 82.0 | **75.2** | 49.6 | **26.0** | 32.4 | 39.2 | 56.2 | 53.6 | 64.3 |
| LLM-R | 66.9 | 85.0 | 74.6 | 50.8 | **26.0** | 32.5 | 37.2 | 56.0 | 54.4 | 61.8 |
| GenICL (ours) | **68.0** | **86.0** | 74.6 | **51.8** | 24.4 | **33.0** | **41.0** | **56.4** | **55.1** | **65.7** |
Note: Bold numbers indicate the best performance in each column. GenICL consistently outperforms existing methods across most tasks, demonstrating the effectiveness of our generative preference learning approach.
We present several case studies to illustrate how GenICL selects effective demonstrations across different task types. These examples showcase the quality and relevance of demonstrations chosen by our method.
| Task Name | AGNews |
|---|---|
| Test Input | "Dominant US captures gold with 79th straight win The US softball team completed its scorched-earth run through the Olympics on Monday with a 5-1 win over Australia, America's third straight gold medal." What is this text about? World, Sports, Business, or Technology? |
| Demonstration | "US Women Shatter Olympic 800-Meter Freestyle Relay Record The United States has shattered a 17-year-old world record in the women's Olympic 800-meter freestyle relay." What is this text about? World, Sports, Business, or Technology? Sports |
| Test Answer | "Sports" |

| Task Name | Sentiment140 |
|---|---|
| Test Input | zomg!!! I have a G2!!!!!!! What is the sentiment of this tweet? |
| Demonstration | My brother got his update for his G1 and I ain't got shit What is the sentiment of this tweet? Negative |
| Test Answer | "Positive" |

| Task Name | RTE |
|---|---|
| Test Input | The west has preferred to focus on endangered animals, rather than endangered humans. African elephants are hunted down and stripped of tusks and hidden by poachers. Their numbers in Africa slumped from 1.2m to 600,000 in a decade until CITES - the Convention on International Trade in Endangered Species - banned the trade in ivory. Based on the paragraph above can we conclude that "African elephants are endangered by ivory poachers."? Yes or No? |
| Demonstration | Three leading Japanese banks have announced an alliance forming the world's largest financial group. Fuji Bank, Dai-Ichi Kangyo and the Industrial Bank of Japan say their operations will be integrated by the spring of 2002. Based on the paragraph above can we conclude that "Merger of Japanese Banks creates the world's biggest bank."? Yes or No? Yes |
| Test Answer | "Yes" |

| Task Name | SST2 |
|---|---|
| Test Input | Review: "instead of a hyperbolic beat-charged urban western , it 's an unpretentious , sociologically pointed slice of life ." Is this movie review sentence negative or positive? |
| Demonstration | Review: "offers an unexpected window into the complexities of the middle east struggle and into the humanity of its people ." Is this movie review sentence negative or positive? Positive |
| Test Answer | "Positive" |

| Task Name | CommonGen |
|---|---|
| Test Input | Concepts: kid, yard, ball. Write a sentence that includes all these words. |
| Demonstration | Concepts: kid, grass, crawl. Write a sentence that includes all these words. A kid is about to crawl through some grass. |
| Test Answer | "A kid is playing with a ball in his yard." |
These examples demonstrate how GenICL leverages LLM feedback to directly optimize demonstration selection for in-context learning. Unlike traditional retrieval-based methods that rely on surrogate objectives, our generative preference learning framework identifies truly beneficial demonstrations that lead to better ICL performance across diverse task types including classification, natural language inference, sentiment analysis, and text generation.
@inproceedings{zhang2025learning,
author = {Zhang, Zheng and
Lan, Shaocheng and
Song, Lei and
Bian, Jiang and
Li, Yexin and
Ren, Kan},
title = {Learning to Select In-Context Demonstration Preferred by Large Language Model},
booktitle = {Findings of the Association for Computational Linguistics: ACL 2025},
year = {2025},
}