Web Application Testing with LLMs
Testing is an important part of software engineering, and when properly implemented, test automation can improve efficiency by an order of magnitude.
GenAI, and more specifically Large Language Models (LLMs), can enable people with less coding expertise to participate in software development - including the creation of automated tests.
Prompting LLMs to Write UI Tests
For UI-based testing, such as testing web applications, you can prompt the LLM with detailed step-by-step instructions: which testing framework to use, how to set it up, what each test should do, and how each test should be conducted. Ideally, the LLM then generates runnable code that uses a test framework like Playwright, Selenium, or Cypress.
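As an illustration, the kind of code an LLM might produce could look roughly like the Playwright (TypeScript) sketch below. The URL, the link text, the category selector, and the expected count are assumptions made for the example, not details verified against a real page.

```typescript
// Hypothetical example of the kind of test an LLM might generate.
// The locators and the expected count are assumptions for illustration.
import { test, expect } from '@playwright/test';

test('power tools section lists the expected number of categories', async ({ page }) => {
  await page.goto('https://www.bosch.com');

  // Navigate to the power tools section (assumed link text).
  await page.getByRole('link', { name: 'Power Tools' }).click();

  // Count the category tiles (assumed CSS class) and check the total.
  const categories = page.locator('.category-tile');
  await expect(categories).toHaveCount(17);
});
```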
However, there is also an alternative approach.
Letting the LLM Run the Tests Itself
Instead of asking the LLM to write code, what if we asked it to run the test directly?
Here’s a simple prompt I gave to Claude 3.7 Sonnet: 'Use your browser, go to bosch.com and find if there are 17 product categories in power tools. Report the result just with pass or fail.'
That’s it, with no other context provided. Claude opened the Bosch website, navigated to the power tools section, noticed that not all categories were visible at first and had to be expanded, counted all the categories, and returned the result.
Here is the video of the test in action:
On the left side, Claude explains what it’s doing step by step. On the right, you see the actual browser navigating the site. Here is the link to the full conversation.
To demonstrate how it handles failures, here’s a slightly different test: 'Use your browser, go to bosch.com and find if there are 18 product categories in power tools. Report the result just with pass or fail.'
As expected, this one fails, since there aren’t 18 categories. Here is the link to this conversation.
How Is This Done?
This behavior is enabled through Playwright, the web automation framework from Microsoft, and the Model Context Protocol (MCP), a protocol that allows LLMs to interface with external tools, such as a browser.
MCP was originally developed by Anthropic, but it is gaining traction and will be supported by other providers as well; OpenAI’s recent announcement is a good example.
MCP relies on two core components: a client and a server.
In the example linked above, I used Claude Desktop, the desktop application for accessing Claude, as the MCP client. It acts as an LLM interface that interprets user prompts, sends commands to the MCP server, receives feedback, and issues further instructions accordingly.
Configuration instructions for Claude Desktop are available here (link).
Playwright-mcp is an MCP server implementation for Playwright. It receives commands from the MCP client and performs the corresponding browser actions, such as navigation, clicks, and other interactions. The playwright-mcp setup guide explains how to configure the MCP client for integration.
Apart from installing and configuring Claude Desktop, I only needed to install Playwright and Node.js. That’s it.
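In practice, wiring the two together mostly comes down to registering playwright-mcp as an MCP server in Claude Desktop's JSON configuration file. The snippet below is a rough sketch of that entry; it assumes the server is published as the @playwright/mcp npm package, and the exact format may differ from the official setup guide.

```json
{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}
```

With an entry like this in place, Claude Desktop launches the server and routes its browser commands through it.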
The current limitation to desktop clients is a drawback, but future MCP implementations will likely support prompting LLMs over APIs, enabling cloud-based usage and integration with other tools.
When can I have it?
This is still very early — and far from perfect:
LLMs sometimes struggle with cookie pop-ups or dropdown interactions
Execution is slow, much slower than a well-written Playwright script
It incurs token costs, although those are dropping fast
But what it offers is powerful: easy test implementation, flexibility, and robustness. If the web page changes, because links are moved or renamed, or even if the navigation changes completely, a traditional script might break. But an LLM, like a human tester, can adapt and still find the right path. That adaptability is something current tools don’t handle well.
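To make the fragility concrete, here is a hypothetical Playwright sketch: the link text, its possible rename, and the URL pattern are all assumptions for illustration. A script pinned to exact text fails as soon as the site renames the entry, whereas an LLM driving the browser can read the new navigation and continue.

```typescript
import { test, expect } from '@playwright/test';

// Hypothetical sketch of why hard-coded scripts are brittle. The link text
// and the URL pattern are assumptions made for illustration.
test('open the power tools section', async ({ page }) => {
  await page.goto('https://www.bosch.com');

  // Brittle: pinned to one exact link text. If the site renames the entry,
  // this click times out and the test fails, even though a human (or an LLM
  // driving the browser) could easily find the renamed link.
  await page.getByRole('link', { name: 'Power Tools' }).click();

  // A slightly more resilient variant covers a few anticipated renames,
  // but only the ones the author thought of in advance:
  // await page.getByRole('link', { name: /power tools|tools & accessories/i }).click();

  await expect(page).toHaveURL(/power-tools/i);
});
```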
The Future of Testing
This approach could increase the efficiency of software development and also bridge the gap between technical and non-technical team members, potentially democratizing test creation and maintenance. As LLM capabilities improve and costs drop, execution speed and reliability will also get better. However, the future of testing likely involves a hybrid approach: traditional automation for execution speed and scale, and LLM-driven testing for adaptability and complex exploratory scenarios. Organizations that embrace both will be best positioned to maintain quality while accelerating their development cycles.