My name is Faseeh Ahmad, and I have a PhD in robotics and AI. My research focused on building self-reliant robots, systems that can adapt to their environment and recover from failures on their own. These systems are often described as embodied AI agents, where intelligence is not just about reasoning, but about sensing, acting, and learning in the physical world.
Figure 1: Example lab tasks used to study execution failures and recovery. (a) Peg-in-hole: insert the blue peg into the hole. (b) Drawer task: place the blue cube into the drawer. (c) Object sorting: match objects to same-color bins.
Over the past eight years, I have worked with a wide range of robotic platforms, from mobile robots to industrial manipulators, in both research labs and industrial settings. My goal was always the same: move beyond controlled demos and make robots useful in real environments. That means dealing with uncertainty, noise, timing issues, and situations you did not plan for.
To achieve this, I used a mix of classical planning, motion and task planning, and machine learning, including deep learning and reinforcement learning. In recent years, I have also worked with foundation models, large pre-trained AI models that can be adapted to many tasks through language, such as large language models, vision-language models, diffusion models, and vision-language-action models.
What all of this taught me is that robotics is not just complex, it is fragile in subtle ways. Even a simple robot movement depends on many tightly connected components. When one assumption breaks, the entire system can fail. This is where software quality stops being a background concern and becomes central.
Much of my work started in lab environments. Typical tasks included peg-in-hole insertion, opening drawers, and sorting objects into bins, as illustrated in Figure 1. These tasks may sound simple, but they already require perception, planning, control, and execution to work together. We organized these behaviors using behavior trees, which structure robot actions into modular decision and execution blocks that adapt to changes during runtime. The real challenge appears when the world changes in ways the system did not expect.
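To make this concrete, here is a minimal behavior-tree sketch in Python. It is not the implementation we used, and names such as `hole_is_free` and `insert_peg` are invented for illustration; it only shows how sequence and fallback nodes compose condition checks, recovery skills, and actions into one structure that is ticked at runtime.

```python
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Sequence:
    """Ticks children in order; stops at the first child that does not succeed."""
    def __init__(self, children):
        self.children = children

    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class Fallback:
    """Ticks children in order; stops at the first child that does not fail."""
    def __init__(self, children):
        self.children = children

    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.FAILURE:
                return status
        return Status.FAILURE

class Action:
    """Leaf node wrapping a condition check or a skill such as a grasp."""
    def __init__(self, name, run):
        self.name = name
        self.run = run  # callable returning a Status

    def tick(self):
        return self.run()

# Hypothetical peg-in-hole tree: ensure the hole is free (clearing it if needed),
# then insert the peg. The lambdas stand in for real perception and control code.
peg_in_hole = Sequence([
    Fallback([
        Action("hole_is_free", lambda: Status.SUCCESS),
        Action("clear_obstacle", lambda: Status.SUCCESS),
    ]),
    Action("insert_peg", lambda: Status.SUCCESS),
])

print(peg_in_hole.tick())  # Status.SUCCESS
```

The fallback node is what gives the tree its reactivity: if the condition check fails at runtime, the next child is tried instead of aborting the whole task.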
Consider the peg-in-hole task. The plan assumes the hole is free. Now imagine an obstacle blocks the hole during execution. That obstacle was never part of the original task model: the plan itself is correct, but execution fails because the assumption that the hole is free no longer holds. Different situations also call for different recovery strategies. In one case the obstacle can be grasped and removed; in another it must be pushed aside, because the gripper's opening is not wide enough to grasp the object securely.
This type of failure is difficult to handle. It is not a planning failure, because the planner had no information about the obstacle. It is also not a simple software bug. The failure happens during execution, triggered by changes in the environment that were never modelled beforehand.
Debugging such failures is hard for several reasons. The problem emerges from interactions between perception, decision-making, and the physical world. Sensor noise, timing differences between cameras, and partial observations all contribute. On top of that, modern robots rely on large software stacks and learning-based components. When an AI model generates an action that fails, it is rarely obvious why.
In my work, I addressed these execution failures by monitoring the robot during task execution and reasoning about what went wrong. Vision-language models (VLMs) helped interpret what the robot was seeing, identify the failure, and suggest how the execution policy should change. The behavior tree was then updated on the fly, allowing the robot to recover and to handle similar situations in the future.
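The following sketch shows the shape of that loop, not the actual system: `query_vlm`, `camera.capture`, and the tree's `insert_fallback` method are all stand-ins made up for illustration.

```python
def query_vlm(image, failed_action):
    """Stand-in for a vision-language model call.

    In practice this would send the current camera image and a description of
    the failed action to a VLM and parse its answer into a diagnosis such as
    {"cause": "obstacle in hole", "recovery": "clear_obstacle"}.
    """
    return {"cause": "obstacle blocking the hole", "recovery": "clear_obstacle"}

def monitor_and_recover(tree, camera, failed_action, skill_library):
    """Ask the VLM why execution failed and patch the behavior tree."""
    diagnosis = query_vlm(camera.capture(), failed_action)
    recovery_skill = skill_library.get(diagnosis["recovery"])
    if recovery_skill is None:
        raise RuntimeError(f"no recovery skill available for: {diagnosis['cause']}")
    # Add the recovery skill as a fallback before the failed action, so the
    # same situation can be handled automatically the next time it occurs.
    tree.insert_fallback(before=failed_action, node=recovery_skill)
    return tree
```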
What this experience made clear is that failures in robotics are rarely isolated issues. They are system-level problems that only appear when software, hardware, perception, and the environment interact. This is exactly where traditional assumptions about software quality start to break down.
Testing was not the primary focus in much of my robotics work. The goal was to make the robot solve the task reliably. Safety and quality mattered, but they were often implicit rather than systematically addressed.
In practice, testing often meant adding assertions, logs, and runtime checks. This helps to some extent, but it is not enough. It is time-consuming, and it assumes you already know what can go wrong. In dynamic environments, that assumption rarely holds. It is extremely difficult to anticipate all variables and edge cases.
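As a concrete example of what those ad-hoc checks tend to look like, here is a small sketch; the threshold and the function are made up, but the pattern of assertions plus logging around a single action is typical.

```python
import logging

logger = logging.getLogger("execution_monitor")

GRASP_FORCE_LIMIT_N = 15.0  # illustrative safety threshold, not a real spec

def check_grasp(force_reading_n, object_detected):
    """Runtime sanity checks sprinkled around a grasp action."""
    assert force_reading_n >= 0.0, "force sensor returned a negative value"
    if force_reading_n > GRASP_FORCE_LIMIT_N:
        logger.warning("grasp force %.1f N exceeds limit, aborting", force_reading_n)
        return False
    if not object_detected:
        logger.error("gripper closed but no object detected in hand")
        return False
    return True
```

Each check encodes one failure mode someone already thought of; anything outside that list passes silently.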
More structured approaches such as unit testing, static checks, runtime monitoring, and integration testing are far more effective. They require more effort and resources, but the payoff is significant, especially for systems that interact with the real world.
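For instance, even a small pytest-style unit test can pin down geometric assumptions before they ever reach hardware. The `inverse_kinematics` function below is a hypothetical two-link example written only to have something testable; it is not code from my projects.

```python
import math
import pytest

JOINT_LIMITS = (-math.pi, math.pi)  # illustrative limits for every joint

def inverse_kinematics(target_xy, l1=0.5, l2=0.5):
    """Hypothetical 2-link planar IK, returning one of the two solutions."""
    x, y = target_xy
    d = (x**2 + y**2 - l1**2 - l2**2) / (2 * l1 * l2)
    if abs(d) > 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(d)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return (q1, q2)

def test_solution_respects_joint_limits():
    q1, q2 = inverse_kinematics((0.4, 0.3))
    for q in (q1, q2):
        assert JOINT_LIMITS[0] <= q <= JOINT_LIMITS[1]

def test_unreachable_target_is_rejected():
    with pytest.raises(ValueError):
        inverse_kinematics((5.0, 5.0))
```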
Simulation and digital twins play a key role here. In robotics, simulation is almost always the best place to start. It allows testing of behaviors, failure cases, and safety constraints before touching real hardware. Simulations are not perfect and never fully capture reality, but they are invaluable for exposing weaknesses early.
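A hedged sketch of what that simulation-first testing can look like, with a hypothetical `SimEnvironment` standing in for whatever simulator or digital twin is actually used:

```python
import random

class SimEnvironment:
    """Hypothetical stand-in for a physics simulator or digital twin."""
    def __init__(self, seed, obstacle_in_hole=False):
        self.rng = random.Random(seed)
        self.obstacle_in_hole = obstacle_in_hole

    def run_episode(self, behavior):
        # A real implementation would step the simulator while ticking the
        # behavior tree; here we just return a dummy outcome.
        return {"success": not self.obstacle_in_hole, "collisions": 0}

def evaluate_in_simulation(behavior, n_episodes=100):
    """Run the behavior across randomized scenarios before touching hardware."""
    failures = []
    for i in range(n_episodes):
        env = SimEnvironment(seed=i, obstacle_in_hole=(i % 5 == 0))
        outcome = env.run_episode(behavior)
        if not outcome["success"] or outcome["collisions"] > 0:
            failures.append((i, outcome))
    return failures

print(len(evaluate_in_simulation(behavior=None)))  # 20 of the 100 scenarios fail
```

The value is less in the dummy numbers than in the habit: every failure case found this way is one that never has to be discovered on the real robot.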
My frustrations with testing became even clearer when working with learning-based systems. For instance, in a pushing task in our setup, where the goal is to move an object to a specified target location, we tuned the parameters using reinforcement learning and defined a reward function to guide the process. Designing reward functions that truly reflect task progress is hard. Even when a model appears to work, it can still produce parameters that make little sense. Often, the only way to know is to run them on the robot.
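To make the reward-design problem concrete, here is an illustrative shaped reward for such a pushing task. The terms and weights are invented for this sketch, not taken from our setup, and the docstring notes one way a reward like this can be gamed.

```python
import numpy as np

def pushing_reward(object_pos, target_pos, ee_pos, prev_dist):
    """Illustrative shaped reward for pushing an object toward a target.

    Looks reasonable, yet it can still be exploited: an agent can collect the
    contact bonus by hovering next to the object without ever moving it
    closer to the target.
    """
    dist = np.linalg.norm(object_pos - target_pos)
    progress = prev_dist - dist                      # reward getting closer
    contact_bonus = 0.1 if np.linalg.norm(ee_pos - object_pos) < 0.05 else 0.0
    success_bonus = 1.0 if dist < 0.02 else 0.0
    return 2.0 * progress + contact_bonus + success_bonus
```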
That might be acceptable for simple tasks. For delicate or safety-critical tasks, it is not. Learning-based components do not fail in obvious or consistent ways, and traditional testing approaches struggle to capture this behavior.
These experiences led me to an uncomfortable but important conclusion: traditional testing practices struggle not because they are flawed, but because the systems themselves have changed.
Robots and AI systems are no longer static pieces of software. They perceive, decide, and act in environments we cannot fully predict. Quality can no longer be something we check at the end.
This raises a deeper question: if testing is no longer enough, what does quality assurance need to become in order to make AI-driven, embodied systems trustworthy?
That is where the second part of this discussion begins.