Large Language Models in Software Engineering: A Critical Review of Evaluation Strategies and Future Directions
worked on by: Ali Bektas
Context
The emergence of Transformer-based Large Language Models (LLMs) has generated significant interest in their applications within Software Engineering (SE). This growing interest is reflected in an increasing number of publications exploring the intersection of LLMs and SE. Researchers have conducted various surveys on the utilization of LLMs in SE [8, 6, 4] and have studied general evaluation strategies for LLMs [1, 5, 2]. Additionally, there has been a focus on specific evaluation aspects for particular SE tasks, such as code generation, comment generation, and test generation. Despite these efforts, the literature still lacks a comprehensive review of evaluation strategies specifically tailored for SE research utilizing LLMs. Addressing this gap is critical for ensuring that evaluations of LLM-based solutions are reliable and relevant to practical SE applications.
Goals
The primary goal of this thesis is to critically examine the evaluation methodologies employed in LLM-based applications for SE tasks. By systematically reviewing the literature identified by Hou et al. [6], this research aims to assess the reliability and relevance of existing evaluation strategies in determining the quality of LLM-based solutions for SE tasks.
The study seeks to identify limitations and shortcomings in current evaluation practices, raising awareness within the research community about these issues. By providing insights into these limitations, the research aims to guide the community in refining and improving evaluation strategies in future studies, thereby enhancing the rigor and practical applicability of LLM-based research in Software Engineering.
Approach
1. Initial Review and Sorting According to the Subject of Evaluation:
Review the titles and abstracts of the papers identified by Hou et al. [6], sorting them based on the subject of their evaluation. The primary focus is to identify papers where the subject of evaluation is a "solution" to an SE problem utilizing LLMs.
Based on preliminary research, papers are categorized into four groups. The primary focus for this study is on two of these groups where the subject of evaluation is a solution to an SE task:
- Solution-Oriented Research
- Existing Solution Evaluation
Criteria for identifying papers where the subject of evaluation is a "solution" to an SE problem utilizing LLMs:
- The paper proposes a new solution (e.g., a framework for code repair in which LLMs are one component) or refines an existing solution (e.g., through fine-tuning) to one or more SE tasks.
- The proposed or refined solution utilizes Large Language Models (LLMs).
- The paper directly evaluates the effectiveness of the proposed or existing LLM-based solution for the specified SE task(s).
Papers meeting these criteria fall into either the "Solution-Oriented Research" category (for new or refined solutions) or the "Existing Solution Evaluation" category (for evaluations of existing solutions).
Papers in the other two identified categories ("LLM Property Investigation" and "Benchmark and Dataset Development") are not considered to have a solution to an SE task as their primary subject of evaluation, and thus are not the focus of this study.
This categorization ensures that the analysis focuses on evaluations directly relevant to the practical applicability of LLM-based solutions in Software Engineering tasks.
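The screening criteria above can be written down as a small decision rule. The following Python sketch is one possible reading, not a prescribed procedure: the names `ScreeningAnswers` and `categorize` are hypothetical, and the sketch assumes that a paper which evaluates an LLM-based solution without proposing or refining one belongs to "Existing Solution Evaluation". Out-of-scope papers are returned as `None`, because the criteria do not distinguish between the two remaining categories.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    SOLUTION_ORIENTED = "Solution-Oriented Research"
    EXISTING_SOLUTION = "Existing Solution Evaluation"


@dataclass(frozen=True)
class ScreeningAnswers:
    """Reviewer's yes/no answers after reading title and abstract (hypothetical structure)."""
    proposes_or_refines_solution: bool
    solution_uses_llm: bool
    evaluates_solution_on_se_task: bool


def categorize(answers: ScreeningAnswers) -> Optional[Category]:
    """Return the in-scope category, or None if the paper is out of scope."""
    # The subject of evaluation must be an LLM-based solution to an SE task.
    if not (answers.solution_uses_llm and answers.evaluates_solution_on_se_task):
        return None
    # Assumption: new or refined solutions go to Solution-Oriented Research,
    # everything else that passed the check evaluates an existing solution.
    if answers.proposes_or_refines_solution:
        return Category.SOLUTION_ORIENTED
    return Category.EXISTING_SOLUTION


if __name__ == "__main__":
    example = ScreeningAnswers(True, True, True)
    print(categorize(example))
```

Recording the answers explicitly, rather than only the final category, keeps each screening decision traceable.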
2. Definition of "Reliability" and "Relevance":
Define the terms "reliability" and "relevance" in the context of this review. Develop a scoring schema ranging from 0 to 5, where 0 indicates that the evaluation is unreliable and irrelevant for assessing the solution for the targeted SE activity, and 5 indicates that the evaluation strategy is fully reliable and relevant.
3. In-Depth Review of Evaluation Strategies:
The review process consists of two phases: an initial review of a subset to develop the evaluation template, followed by a comprehensive review of all papers using this template.
Phase 1: Initial Subset Review and Template Development
From the 177 papers where the subject of evaluation is a solution to an SE task, 14 papers (~8%) are randomly selected using proportional stratified sampling. This selection mirrors the distribution of SE activities and ML problem types observed in Hou et al. [6] Fig. 8. The initial review of this subset focuses on identifying key elements of evaluation strategies. Based on this review, a standardized template is developed, structured around four main elements:
- Evaluation
- Dataset/Benchmark
- Ground-Truth
- Baseline
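The proportional stratified sampling used to select the Phase 1 subset can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the rounding rule (largest remainder) is one common choice that the proposal does not specify, and the stratum sizes in the usage example are placeholders rather than the actual distribution from Hou et al. [6].

```python
import random
from collections import defaultdict


def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes.

    Fractional quotas are rounded with the largest-remainder method
    (an assumption of this sketch) so the allocation sums to sample_size.
    """
    total = sum(stratum_sizes.values())
    quotas = {k: n * sample_size / total for k, n in stratum_sizes.items()}
    allocation = {k: int(q) for k, q in quotas.items()}
    remaining = sample_size - sum(allocation.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - allocation[k], reverse=True)
    for k in by_remainder[:remaining]:
        allocation[k] += 1
    return allocation


def stratified_sample(papers, stratum_of, sample_size, seed=0):
    """Draw a proportional stratified random sample of papers."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    strata = defaultdict(list)
    for paper in papers:
        strata[stratum_of(paper)].append(paper)
    allocation = proportional_allocation(
        {k: len(members) for k, members in strata.items()}, sample_size
    )
    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, allocation[k]))
    return sample


if __name__ == "__main__":
    # Placeholder stratum sizes summing to 177; not taken from Hou et al. [6].
    sizes = {"development": 100, "maintenance": 50, "testing": 27}
    print(proportional_allocation(sizes, 14))
```

With strata this coarse, very small strata can receive zero papers; whether to force a minimum of one paper per stratum is a design decision the sketch leaves open.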
Each of these main elements includes relevant sub-elements. For example, the Evaluation element might include a sub-element such as "qualitative eval" if the evaluation includes qualitative aspects, or the category of the metric used (e.g., if BLEU and CodeBLEU were used, the category "text similarity based metric" would apply).
For each main element, the template includes fields for:
- A "reliability" score (0-5) with accompanying reasoning
- A "relevance" score (0-5) with accompanying reasoning
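To make the template structure concrete, the following Python sketch models it as nested records. All class and field names are hypothetical; only the four main elements, the 0-5 score range, and the requirement of accompanying reasoning come from the description above. The final template may well differ once Phase 1 is complete.

```python
from dataclasses import dataclass, field
from typing import List

SCORE_MIN, SCORE_MAX = 0, 5


@dataclass
class Score:
    """A 0-5 score together with the reviewer's reasoning."""
    value: int
    reasoning: str

    def __post_init__(self):
        if not SCORE_MIN <= self.value <= SCORE_MAX:
            raise ValueError(f"score must be in [{SCORE_MIN}, {SCORE_MAX}], got {self.value}")
        if not self.reasoning.strip():
            raise ValueError("a score needs accompanying reasoning")


@dataclass
class ElementAssessment:
    """Assessment of one main element of an evaluation strategy."""
    reliability: Score
    relevance: Score
    sub_elements: List[str] = field(default_factory=list)


@dataclass
class PaperReview:
    """Filled-out template for a single paper (hypothetical layout)."""
    paper_id: str
    evaluation: ElementAssessment
    dataset_benchmark: ElementAssessment
    ground_truth: ElementAssessment
    baseline: ElementAssessment


if __name__ == "__main__":
    # Illustrative values only; not an assessment of any real paper.
    evaluation = ElementAssessment(
        reliability=Score(2, "Single run, no variance reported."),
        relevance=Score(1, "Only text similarity metrics were used."),
        sub_elements=["text similarity based metric"],
    )
    print(evaluation)
```

Validating scores at construction time catches out-of-range values and missing reasoning while a review is being entered, rather than during later analysis.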
Phase 2: Systematic Application of the Template to the Complete Set of Papers
Using the developed template, the review process continues with the remaining papers. For each paper, the template is filled out, providing a structured analysis of its evaluation strategy. Throughout both phases, the review process focuses on identifying possible limitations and shortcomings of the evaluations, particularly concerning their reliability and relevance for assessing the practical usability of the proposed solutions for SE tasks.
4. Findings and Discussion:
After completing the review, describe the findings, including grouping papers according to their task context and reliability/relevance scores. Discuss observed patterns in terms of reliability and relevance. Identify commonalities among approaches with similar limitations and shortcomings.
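The grouping step could be supported by a simple aggregation over the completed reviews. The sketch below is only an illustration: the record layout, the function name `summarize_by_task`, and the use of a single reliability and relevance score per paper are assumptions (the template records scores per main element, and how they are combined is not fixed here). The example values are invented for demonstration.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_task(records):
    """Group review records by task context and average their scores.

    Each record is a dict with the hypothetical keys
    "task", "reliability", and "relevance".
    """
    grouped = defaultdict(list)
    for record in records:
        grouped[record["task"]].append(record)
    return {
        task: {
            "papers": len(rows),
            "mean_reliability": mean(r["reliability"] for r in rows),
            "mean_relevance": mean(r["relevance"] for r in rows),
        }
        for task, rows in grouped.items()
    }


if __name__ == "__main__":
    # Invented example records; not results of the review.
    records = [
        {"task": "code generation", "reliability": 2, "relevance": 1},
        {"task": "code generation", "reliability": 4, "relevance": 2},
        {"task": "test generation", "reliability": 3, "relevance": 3},
    ]
    for task, summary in summarize_by_task(records).items():
        print(task, summary)
```

Means over a 0-5 ordinal scale hide the spread of scores, so medians or full score distributions per group may be the more informative summary.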
5. Conclusion:
Conclude the research by summarizing the findings, highlighting the most critical areas for improvement in the evaluation strategies, and suggesting future directions for research.
Expected Outcome
This research expects to find that the evaluation strategies employed in most LLM-based SE studies are limited in their reliability and relevance for assessing a solution's quality on real-world SE tasks. Preliminary observations suggest that many studies rely heavily on quantitative evaluations using off-the-shelf benchmarks and established evaluation metrics. Current literature, such as the review by Hou et al. [6], and studies like Evtikhiev et al. [3] and Shin et al. [7], document the difficulties and shortcomings of these evaluation elements, especially in terms of their expressiveness and applicability to practical SE problems. This research anticipates uncovering patterns in the limitations of these evaluation strategies and aims to guide the development of more robust and relevant evaluation methodologies for future LLM-based SE research. By improving the rigor and applicability of evaluation practices, this work will contribute to the broader goal of advancing the field of Software Engineering.
References
[1] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 2023.
[2] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[3] Mikhail Evtikhiev, Egor Bogomolov, Yaroslav Sokolov, and Timofey Bryksin. Out of the BLEU: how should we assess quality of the code generation models? Journal of Systems and Software, 203:111741, 2023.
[4] Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. Large language models for software engineering: Survey and open problems. arXiv preprint arXiv:2310.03533, 2023.
[5] Zishan Guo, Renren Jin, Chuang Liu, Yufei Huang, Dan Shi, Linhao Yu, Yan Liu, Jiaxuan Li, Bojian Xiong, Deyi Xiong, et al. Evaluating large language models: A comprehensive survey. arXiv preprint arXiv:2310.19736, 2023.
[6] Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. Large language models for software engineering: A systematic literature review. arXiv preprint arXiv:2308.10620, 2023.
[7] Jiho Shin, Hadi Hemmati, Moshi Wei, and Song Wang. Assessing evaluation metrics for neural test oracle generation. IEEE Transactions on Software Engineering, 2024.
[8] Zibin Zheng, Kaiwen Ning, Jiachi Chen, Yanlin Wang, Wenqing Chen, Lianghong Guo, and Weicheng Wang. Towards an understanding of large language models in software engineering tasks. arXiv preprint arXiv:2308.11396, 2023.