The fields of large language models (LLMs) and application development
are increasingly intertwined, as web application developers turn to LLMs
as coding partners to improve their efficiency and productivity.
However, it remains unclear which LLMs offer the highest productivity
alongside a satisfactory user experience in the context of web application
development. To explore this issue, we evaluated selected LLMs by measuring
their interactions during the automatic generation of code (applications)
for predefined tasks, and by administering the UMUX questionnaire to users who
tested the generated applications. The models used in the experiment were
Claude 3.7 Sonnet, Gemini 2.5 Pro Preview, and GPT-4o. Each of these
LLMs was assigned three tasks containing incomplete and ambiguous
instructions, allowing us to assess each model's ability to generate
applications from prompts of the kind developers often receive from
clients. We observed how successful the models were in
generating functional applications and how many additional prompts were
needed to produce a working solution. Our results show that Claude 3.7
Sonnet delivers the highest developer productivity, requiring the least
developer intervention and demonstrating the greatest degree of autonomy.
The generated applications were then evaluated through user testing,
followed by the
UMUX questionnaire. While Gemini 2.5 Pro Preview achieved the highest
average UMUX score, it required more iterations than Claude 3.7 Sonnet.