AI agents in software testing: a human-in-the-loop assurance model

Authors

Kopovskyi, S.

DOI:

https://doi.org/10.46299/j.isjea.20260502.03

Keywords:

artificial intelligence agents, software testing, human-in-the-loop supervision, risk governance, quality assurance, continuous integration and continuous delivery, trustworthy artificial intelligence

Abstract

AI agents are increasingly integrated into software testing workflows, including test case generation, regression prioritization, execution analysis, and defect triage. While these capabilities can improve throughput and expand exploratory coverage, their adoption in release-relevant contexts remains methodologically undergoverned. Current studies predominantly evaluate task-level utility of model outputs, but provide limited guidance on how to define delegation boundaries, approval authority, evidence requirements, and escalation rules when AI-generated artifacts may affect merge or release decisions. The aim of this study is to develop a governance-oriented framework for the controlled use of AI agents in software testing. The research applies a design-oriented conceptual methodology that synthesizes evidence from software engineering, software testing, and trustworthy artificial intelligence governance literature. As a result, two linked methodological artifacts are proposed: a human-in-the-loop assurance model and an activity-level risk/control matrix. The model distinguishes assistive, supervised, and conditional autonomous modes, while the matrix relates testing activities, artifact criticality, and autonomy level to required human controls, traceability obligations, and escalation triggers. Risk is operationalized through a heuristic composite formulation (S = C + I + U + L + V), used to calibrate governance intensity rather than to support statistical prediction. The practical value of the study lies in providing a structured baseline for integrating AI-agent-supported testing into continuous integration and continuous delivery workflows. The main limitation is the absence of empirical cross-context validation of the proposed governance parameters.
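The abstract names the composite risk formulation S = C + I + U + L + V and the three autonomy modes (assistive, supervised, conditional autonomous) without specifying factor scales or cut-off values. The sketch below is one minimal way such a heuristic could be implemented; the 1–5 ordinal scales, the threshold values, and the factor interpretations in the comments are illustrative assumptions, not parameters taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class RiskProfile:
    """Five-factor risk profile for a testing activity.

    Each factor is scored on a hypothetical 1-5 ordinal scale
    (the paper does not fix the scales; these are assumptions).
    """
    criticality: int   # C: criticality of the artifact for merge/release decisions
    impact: int        # I: blast radius of an undetected defect
    uncertainty: int   # U: confidence gap in the AI agent's output
    legal: int         # L: regulatory or compliance exposure
    visibility: int    # V: how observable downstream failures are

    def composite(self) -> int:
        """Heuristic composite score S = C + I + U + L + V."""
        return (self.criticality + self.impact + self.uncertainty
                + self.legal + self.visibility)


def autonomy_mode(profile: RiskProfile) -> str:
    """Map the composite score to an autonomy mode.

    Higher S means tighter governance (less agent autonomy).
    The thresholds (10 and 17) are illustrative, not from the paper.
    """
    s = profile.composite()
    if s <= 10:
        return "conditional-autonomous"  # agent acts; humans audit samples
    if s <= 17:
        return "supervised"              # human approval gates merge/release
    return "assistive"                   # agent drafts only; human decides
```

Because S is described as calibrating governance intensity rather than predicting outcomes statistically, a simple additive score with ordinal inputs, as above, is sufficient; no weighting or probabilistic model is implied.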

References

Chen, M., Tworek, J., Jun, H., Yuan, Q., Ponde de Oliveira Pinto, H., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., ... Zaremba, W. (2021). Evaluating large language models trained on code. arXiv. https://arxiv.org/abs/2107.03374

OpenAI. (2023). GPT-4 technical report. arXiv. https://arxiv.org/abs/2303.08774

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2024). SWE-bench: Can language models resolve real-world GitHub issues? In The Twelfth International Conference on Learning Representations (ICLR). https://proceedings.iclr.cc/paper_files/paper/2024/file/edac78c3e300629acfe6cbe9ca88fb84-Paper-Conference.pdf

Wang, J., Huang, Y., Chen, C., Liu, Z., Wang, S., & Wang, Q. (2024). Software testing with large language models: Survey, landscape, and vision. IEEE Transactions on Software Engineering, 50(4), 911–936. https://doi.org/10.1109/TSE.2024.3368208

Yang, L., Yang, C., Gao, S., Wang, W., Wang, B., Zhu, Q., Chu, X., Zhou, J., Liang, G., Wang, Q., & Chen, J. (2024). On the evaluation of large language models in unit test generation. In Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE '24) (pp. 1607–1619). Association for Computing Machinery. https://doi.org/10.1145/3691620.3695529

Wu, Q., Bansal, G., Zhang, J., Wu, Y., Li, B., Zhu, E., Jiang, L., Zhang, X., Zhang, S., Liu, J., Awadallah, A. H., White, R. W., Burger, D., & Wang, C. (2024). AutoGen: Enabling next-gen LLM applications via multi-agent conversation. In First Conference on Language Modeling (COLM). OpenReview. https://openreview.net/forum?id=BAakY1hNKS

Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., ... Tang, J. (2024). AgentBench: Evaluating LLMs as agents. In The Twelfth International Conference on Learning Representations (ICLR). https://proceedings.iclr.cc/paper_files/paper/2024/hash/e9df36b21ff4ee211a8b71ee8b7e9f57-Abstract-Conference.html

Xi, Z., Chen, W., Guo, X., He, W., Ding, Y., Hong, B., Zhang, M., Wang, J., Jin, S., Zhou, E., Zheng, R., Fan, X., Wang, X., Xiong, L., Zhou, Y., Wang, W., Jiang, C., Zou, Y., Liu, X., ... Gui, T. (2025). The rise and potential of large language model based agents: A survey. Science China Information Sciences, 68(2), 121101. https://doi.org/10.1007/s11432-024-4222-0

Tabassi, E. (2023). Artificial intelligence risk management framework (AI RMF 1.0) (NIST AI 100-1). National Institute of Standards and Technology.

Autio, C., et al. (2024). Artificial intelligence risk management framework: Generative AI profile (NIST AI 600-1). National Institute of Standards and Technology.

International Organization for Standardization. (2023). ISO/IEC 42001:2023: Information technology—Artificial intelligence—Management system. https://www.iso.org/standard/81230.html

International Organization for Standardization. (2021). ISO/IEC/IEEE 29119-2:2021: Software and systems engineering—Software testing—Part 2: Test processes. https://www.iso.org/standard/79428.html

International Organization for Standardization. (2023). ISO/IEC 23894:2023: Information technology—Artificial intelligence—Guidance on risk management. https://www.iso.org/standard/77304.html

Schäfer, M., Nadi, S., Eghbali, A., & Tip, F. (2024). An empirical evaluation of using large language models for automated unit test generation. IEEE Transactions on Software Engineering, 50(1), 85–105. https://doi.org/10.1109/TSE.2023.3334955

Garousi, V., & Zhi, J. (2013). A survey of software testing practices in Canada. Journal of Systems and Software, 86(5), 1354–1376. https://doi.org/10.1016/j.jss.2012.12.051

Published

2026-04-01

How to Cite

Kopovskyi, S. (2026). AI agents in software testing: a human-in-the-loop assurance model. International Science Journal of Engineering & Agriculture, 5(2), 21–27. https://doi.org/10.46299/j.isjea.20260502.03
