your debt
recovery strategy

Streamline the debt recovery process with a chatbot that delivers live calls, texts and emails. Escalate high risk calls to live agents with a co-pilot.

Trusted by leading enterprise
Case Studies
View All
Learn best practices & techniques from the #dreamteam
Don't just dream of the future... Build It!
Schedule a
call to
learn more

Please choose a time slot below to schedule a call with our experts to discover how we can assist you in reaching your goals!

Ready to get started?
Member Name
Title and role @ architech
Schedule a meeting to learn more...
your debt
recovery strategy
Case Studies
View All
Trusted by
leading enterprise
Learn best
practices &
techniques from
the #dreamteam
Don't just dream
of the future...
Build It!

Please choose a time slot below to schedule a call with our experts to discover how we can assist you in reaching your goals!

Ready to get started?
Schedule a
call to
learn more
Schedule a meeting to learn more

Automated testing has become an indispensable part of modern software development and quality assurance processes. However, developing and maintaining automated test scripts remains a significant bottleneck. Writing robust, maintainable test code requires specialized coding skills and considerable time investment. Recent advances in large language models present an opportunity to automate parts of the test generation process using AI.

OpenAI released a fine-tuning feature for their models last month. This experiment was evaluated using fine-tuned Generative Pre-trained Transformer 3.5 (GPT-3.5) models to assist with generating Cypress end-to-end test scripts.

Out-of-the-box chatbot models like ChatGPT can produce test scripts if you give them HTML from a website as input. However, the results often require heavy editing to tailor the outputted code to project needs. Fine-tuning offers a way to adapt the models to match specific coding conventions and testing frameworks.

The goal was to assess whether fine-tuned models could improve test script quality (compared to non-fine-tuned models) and reduce the need for manual amendments across three aspects:

  • Following existing code conventions
  • Using optimal page element selectors and
  • Matching desired test file structure

ChatGPT output
ChatGPT output
Figure 1.1 Problem with default ChatGPT output
Desired output
Figure 1.2 Problem with default ChatGPT output

Background on ChatGPT and Fine-Tuning

ChatGPT is a widely used conversational AI system launched in 2022. It is a fine-tuned version of GPT-3 or the more recent GPT-4, optimized for diverse conversational responses across many domains. The base GPT-3 model was originally released in 2020 but saw extremely limited usage until fine-tuned into the user-friendly ChatGPT system.

The key difference with fine-tuning is it adapts these general-purpose models to specialized use cases by training on custom datasets. For this experiment, a GPT-3.5 model was fine-tuned on a small set of just 10 website HTML code snippets paired with corresponding Cypress test scripts tailored to the project’s needs.

Fine-tuning allows teaching the model to use distinct coding conventions, suited for the specific project you are working on. This guided training process creates an AI assistant that is purpose-built for a particular project, as opposed to general conversational models like ChatGPT.

Learn more about our FREE AI Ideation Workshop

Experimental Setup and Model Evaluation

To quantify improvements from fine-tuning, four models were evaluated:

· ChatGPT using the GPT-3.5 engine
· ChatGPT using the more advanced GPT-4 engine
· ChatGPT with GPT-4 using few-shot prompting
· Custom fine-tuned GPT-3.5 model

Few-shot prompting aims to rapidly adapt a model by providing a single example input-output pair as an example of the desired behavior. This prompts the model without needing a full training round.

Figure 2.1 ChatGPT Prompt and generated Cypress code

Figure 2.2 ChatGPT Prompt and generated Cypress code

The models were assessed on a set of 10 new website HTMLs not used in training data. Each model produced Cypress test scripts for the same HTML code snippets. Outputs were scored across three quality metrics:

· Runs from the first try without any amendments
· Matches desired single test file structure
· Follows existing variable naming conventions

Together these measure how well the models align with project coding standards and test code organization — all without human tweaking.

Results and Analysis

Figure 3 Model comparison results

As expected, the fine-tuned model achieved the best results across metrics — generating test scripts that required no amendments to run from the first time and precisely matched the desired coding convention.

However, even the top-of-line GPT-4 could not match its performance, especially in adhering to coding conventions. This highlights the superiority of full fine-tuning even using a small training dataset like the 10 examples here.

Surprisingly, the few-shot prompting technique showed a significant boost in performance compared to the standard GPT-4 prompting method. This underscores the crucial role that effective prompting plays in optimizing the model’s output.

The zero-shot GPT-3.5 model expectedly performed worst given its lack of exposure to the desired variable naming convention (note that both GPT3.5 and GPT4 were not given the expected variable

naming convention unlike few-shot and fine-tuned models). The overall results corroborate the potential value of fine-tuning for automated test generation. With larger training datasets, fine-tuned models may eventually produce extremely high-quality output, ultimately achieving the highest possible scores on the given evaluation scale.

Additional information

Currently, only GPT-3 models can be fine-tuned with GPT-4 support planned for the near future. Once available, even better fine-tuning performance is anticipated given GPT-4’s enhanced capabilities.

For internal testing purposes, the compute costs of fine-tuning and model output generation were minimal given the low volume. However, at the production scale, the price might be more relevant.

Model prices approximate comparison:

· fine-tuned GPT3.5 is 4 times more expensive than default GPT3.5
· GPT4 is 10+ times more expensive than default GPT3.5, making it at least 2 times more expensive than fined-tuned GPT3.5

That being said, we see that for this particular task, fined-tuned GPT3.5 performed better, and it is cheaper than GPT4.

A key challenge in this experiment was collecting enough high-quality training data pairs. Complete HTML pages and full test scripts were used as examples, making dataset creation very time-intensive.

For less complex tasks, smaller individual training examples could enable rapid fine-tuning using hundreds of samples instead of the minimal 10 here.


In summary, fine-tuning large language models even on small datasets appears highly promising for automated test generation. As language model capabilities improve and fine-tuning techniques mature, AI-assisted testing may soon see widespread adoption for accelerating test creation.

Eimantas Macius
QA Engineer

You may also like