ISTQB’s certifications in AI Testing (CT‑AI) and Testing with Generative AI (CT‑GenAI) provide a solid theoretical foundation – but what do they look like in practice? In this blog post, we explore how the knowledge from these certifications can be translated into concrete, value‑driven accessibility testing.
Introduction
When Prasanth, a senior consultant at System Verification, completed the ISTQB AI Testing (CT‑AI) certification last year, he made an observation that stayed with us for a while. Many of the ideas in the syllabus did not feel completely new. Instead, they finally provided clear names for problems we had already been encountering in everyday testing work.
A few months later, Tobias, a principal consultant at System Verification, completed the ISTQB Testing with Generative AI (CT‑GenAI) certification, and naturally we compared impressions. Rather than discussing the theory in isolation, the conversation quickly moved toward a practical question: where do these concepts actually appear in real projects?
CT‑AI covers the specific challenges of testing AI‑based systems: how to deal with machine learning, non‑deterministic behaviour, limited test oracles, and risk‑based strategies for systems that learn or adapt. CT‑GenAI focuses on working with generative AI tools in testing workflows—understanding their outputs, managing hallucination risks, and validating results that cannot be checked against a fixed expected value.
Accessibility testing turned out to be a surprisingly good example. In projects with CI pipelines and automated accessibility scans, it becomes visible very quickly where automation works well and where human judgement is still required. Some aspects of accessibility can be validated deterministically and automated quite reliably. Other aspects depend heavily on context and real user experience. This combination makes accessibility testing a concrete proving ground for many of the challenges described in the two courses.
The Oracle Problem in Accessibility Testing
One concept from AI testing that becomes visible very quickly in accessibility work is the Oracle Problem: situations where the correct result of a system is difficult or impossible to define precisely.
Automated accessibility tools such as axe‑core, Pa11y, or Lighthouse are very good at detecting violations where the expected result is clearly defined. Typical examples include missing form labels, invalid ARIA roles, or insufficient colour contrast.
In those situations, the result is binary: the violation either exists or it does not.
However, accessibility testing also includes situations where interpretation is required. Alt text is a simple example. An automated tool can check whether an image contains an alt attribute, but it cannot reliably evaluate whether the description is actually meaningful for screen reader users.
Both of the following alt text values would pass automated checks:
- “image”
- “Submit registration form button”
Both are technically valid, but only the second tells a screen reader user anything useful.
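The gap between "passes the check" and "is actually useful" can be narrowed slightly with heuristics. The sketch below flags alt text that is present but probably not meaningful; the word list and thresholds are our own illustrative assumptions, not rules from axe-core or WCAG.

```python
# Sketch: flag alt text that passes a presence check but is unlikely to be
# meaningful. The word list and length threshold are illustrative assumptions.
GENERIC_ALT = {"image", "img", "photo", "picture", "icon", "graphic"}

def alt_text_needs_review(alt: str) -> bool:
    """Return True if the alt text should be escalated to a human reviewer."""
    normalized = alt.strip().lower()
    if not normalized:                  # empty alt can be fine for decorative
        return True                     # images, but only a human can confirm
    if normalized in GENERIC_ALT:       # "image" passes tools, helps nobody
        return True
    return len(normalized.split()) < 2  # one-word descriptions are rarely specific

print(alt_text_needs_review("image"))                            # True
print(alt_text_needs_review("Submit registration form button"))  # False
```

Heuristics like this do not decide correctness; they only shrink the pile a human has to look at.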
One possible strategy in these situations is to compare outputs from different AI‑based description or captioning models against the same set of images. Where both agree, the result is likely reliable. Where they disagree, that disagreement highlights cases worth human review. The differences do not tell you which answer is right, but they reliably surface the cases where interpretation is genuinely ambiguous.
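The agreement check can be sketched very simply. Below, disagreement between two model outputs is approximated by token overlap; the captions and the threshold are hypothetical, and a real implementation would use a stronger similarity measure.

```python
def token_set(text: str) -> set[str]:
    return set(text.lower().split())

def needs_human_review(caption_a: str, caption_b: str,
                       threshold: float = 0.5) -> bool:
    """Flag descriptions where two models disagree (low token overlap)."""
    a, b = token_set(caption_a), token_set(caption_b)
    if not a or not b:
        return True
    jaccard = len(a & b) / len(a | b)
    return jaccard < threshold

# Agreement: both models describe the same thing
print(needs_human_review("a red stop sign",
                         "a red stop sign at a crossing"))   # False
# Disagreement: worth a human look
print(needs_human_review("a red stop sign",
                         "a person riding a bicycle"))       # True
```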
Combining Different Testing Approaches
The Oracle Problem makes clear that no single approach will cover everything. That is not a theoretical observation—it shapes how the work needs to be structured.
Automated accessibility checks provide very fast feedback during development. In some projects, these checks already run on every pull request and scan hundreds of components at different stages: component level, integration, and before release. Catching issues early keeps them cheap to fix.
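A deterministic check of this kind can be surprisingly small. The sketch below looks for form inputs without an associated label; it is a simplified stand-in for what tools like axe-core do far more thoroughly, and the HTML snippet is invented for illustration.

```python
from html.parser import HTMLParser

class LabelCheck(HTMLParser):
    """Deterministic check: every <input> needs an associated <label>,
    an aria-label, or an aria-labelledby. A simplified sketch of what
    real accessibility scanners do with far more rules and rigour."""
    def __init__(self):
        super().__init__()
        self.label_for: set = set()
        self.inputs: list = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "label" and "for" in a:
            self.label_for.add(a["for"])
        elif tag == "input" and a.get("type") != "hidden":
            self.inputs.append(a)

    def violations(self) -> list:
        return [a for a in self.inputs
                if not (a.get("id") in self.label_for
                        or a.get("aria-label")
                        or a.get("aria-labelledby"))]

checker = LabelCheck()
checker.feed('<label for="email">Email</label><input id="email">'
             '<input id="phone">')   # second input has no label
print(len(checker.violations()))     # 1
```

Because the expected result is unambiguous, a check like this can run on every pull request with no human in the loop.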
In practice, the most effective accessibility work reaches further back than test execution. Building accessibility criteria into requirements and design reviews catches structural problems before they become expensive. During development, accessibility linters give developers immediate feedback in the IDE, long before anything reaches a pipeline.
With agentic AI coding tools, there is another opportunity: accessibility expertise can be made explicit in the agent’s context rather than relying on developer memory. This can take the form of a dedicated accessibility reviewer persona or integrating accessibility expectations into a senior frontend developer role definition. Code analytics tools can then provide visibility into whether those rules are actually being applied in the parts of the codebase that matter most.
AI‑supported analysis adds another layer: pattern detection across large defect sets, suggested fixes, and classification by severity. Useful—but with a catch. Large language models can produce answers that sound plausible and still be wrong. We have seen generated fix suggestions that were technically coherent but addressed the wrong ARIA pattern for the context. They would have passed a non‑specialist review.
That is why AI‑generated suggestions always go through a specialist before being acted on. Not as bureaucracy, but as quality control.
Manual accessibility testing remains essential. Exploratory and experience‑based testing helps evaluate real user scenarios and usability aspects that automation cannot reliably assess.
In one project, automated scans reliably detected missing labels, but issues with dynamic status messages only became visible during manual screen reader testing. When a user submits a form and a confirmation message appears dynamically, accessibility guidelines require that message to be announced to screen reader users via a live region. An automated tool can check whether a live region exists in the DOM, but it cannot simulate the interaction, verify timing, or judge whether the wording makes sense in context. That required a tester using an actual screen reader and working through the real user journey.
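The static half of that check, the part automation can do, fits in a few lines. The sketch below only asserts that a live region exists in the markup; whether the message is announced at the right moment, with sensible wording, still needs a tester with a real screen reader. The HTML is invented for illustration.

```python
from html.parser import HTMLParser

class LiveRegionCheck(HTMLParser):
    """Static half of the live-region check: does the DOM contain an
    element screen readers would announce (aria-live, or role status/alert)?
    Timing, wording, and the actual announcement are out of reach here."""
    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("aria-live") in ("polite", "assertive") \
           or a.get("role") in ("status", "alert"):
            self.found = True

page = LiveRegionCheck()
page.feed('<form>...</form><div role="status">Registration complete</div>')
print(page.found)   # True
```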
Automation narrows the scope. It does not replace judgment.
Limits of Automated Accessibility Checks
Even within automation, not all checks are equally reliable. Automated accessibility testing is extremely helpful, but its limitations are real and worth understanding in detail.
Colour contrast checks are a good illustration. In one project, a hero banner had white text over a photograph. The automated tool calculated the contrast ratio against an average background colour and passed it. A manual check showed that in parts of the image, the text was nearly unreadable.
This becomes even harder with gradients, background images, or semi‑transparent layers. The calculated value stops being a reliable proxy for what users actually perceive, and manual verification ends up doing the work the tool cannot.
The issue here is not randomness. The tool produces the same result every time. The difficulty is that correctness depends on human perception and context, which the tool cannot model.
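The banner case can be reproduced with the WCAG contrast formula itself. The sketch below uses the standard relative-luminance and contrast-ratio definitions from WCAG 2.x; the two background colours are invented stand-ins for the photo's averaged colour and its brightest region.

```python
def relative_luminance(rgb):
    """WCAG 2.x relative luminance of an sRGB colour (0-255 per channel)."""
    def channel(c):
        c = c / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)),
                    reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

white = (255, 255, 255)
# Against the photo's *averaged* colour, white text passes AA (>= 4.5:1) ...
print(contrast_ratio(white, (100, 100, 100)) >= 4.5)   # True
# ... while against the brightest region of the same photo it fails clearly
print(contrast_ratio(white, (220, 220, 220)) >= 4.5)   # False
```

Both numbers are computed with the same formula; the failure is entirely in which background colour the tool chose to measure against.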
Human in the Loop and Automation Bias
Another issue that appears in practice is automation bias: the tendency to trust automated results more than they deserve, especially after long stretches of green builds.
We noticed this ourselves. After weeks of clean accessibility scans, the team gradually stopped questioning what the scans were actually covering. Nobody was being careless—it simply became background noise.
The mitigation is simple but effective: periodic manual spot checks of a random sample of passed results. It keeps the team honest about what the tooling really covers and what it does not.
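The mechanics of such a spot check are trivial, which is part of why it works. A minimal sketch, where sample size, cadence, and what counts as a "passed result" are all team decisions, not rules from any standard:

```python
import random

def spot_check_sample(passed_results, k=10, seed=None):
    """Pick a random sample of passed checks for manual review.
    A seed keeps the sample reproducible within one audit."""
    rng = random.Random(seed)
    return rng.sample(list(passed_results), min(k, len(passed_results)))

# Hypothetical list of checks the pipeline marked green this week
passed = [f"check-{i}" for i in range(200)]
sample = spot_check_sample(passed, k=5, seed=42)
print(len(sample))   # 5
```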
Non‑Determinism in Generative AI
When generative AI tools are used in testing workflows, their probabilistic nature introduces a different kind of uncertainty. The same prompt can produce slightly different outputs when run twice—not necessarily wrong, just different.
This means you cannot validate generative AI output the same way you validate a deterministic check. What works instead is evaluating outputs against characteristics: is the output relevant, technically accurate, and aligned with the actual problem?
One practical technique is using a second language model to review the output of the first against those criteria. This does not remove uncertainty, but it catches a reasonable share of obvious issues before a human reviewer steps in.
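The structure of such a criteria-based review can be sketched independently of any particular model. Below, the criteria are toy heuristics standing in for the second model's judgement; in a real setup each criterion would be a prompt to a reviewer model, and anything that fails still goes to a human.

```python
# Sketch of criteria-based review of generative output. The criterion
# functions are toy placeholders for a second model's judgement.
CRITERIA = {
    "non_empty": lambda out, ctx: bool(out.strip()),
    "mentions_problem": lambda out, ctx: ctx["topic"].lower() in out.lower(),
}

def review(output: str, context: dict) -> dict:
    """Evaluate one generated answer against named criteria.
    Any failed criterion routes the output to a human reviewer."""
    return {name: check(output, context) for name, check in CRITERIA.items()}

result = review("Use aria-describedby to link the error text to the field.",
                {"topic": "aria-describedby"})
print(all(result.values()))   # True
```

The value is less in the individual heuristics than in making the criteria explicit: a review that names what "good" means is auditable, a gut feeling is not.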
Conclusion
The certifications gave us better language for things we were already navigating. That matters, especially when explaining to stakeholders why a green build does not automatically mean the product is accessible.
What the courses do not fully prepare you for is the messiness of applying these ideas under real project constraints. In practice, teams make pragmatic trade‑offs. That is fine—but it is important to be honest about them.
Accessibility testing turned out to be a particularly good lens for this. The mix of deterministic checks and human judgment, the real impact on users, and the clear limits of tooling surface the fundamental questions quickly. For anyone working through these certifications and looking for a concrete place to apply them, accessibility testing is well worth the time.
Key Takeaways
- The Oracle Problem shows up constantly in accessibility testing, especially where tools can detect presence but not meaning.
- Effective accessibility validation requires a combination of automated checks, AI‑supported analysis, and manual testing.
- Generative AI outputs can be plausible without being correct. Human review is essential to make AI‑assisted testing trustworthy.
- Exploratory and experience‑based testing is where many real accessibility issues are discovered—and where human testers still add irreplaceable value.