Field Notes · Systems · 9 min read · April 10, 2026

Nexus on a Live Floor: Reading a Fault the Way an Engineer Does

Dylan McCarthy

Founder & Engineer

A field study of Nexus, our operational intelligence system, running on a 14-line manufacturing floor. The question was simple: can a system read control logic, supervisory data, and documentation together well enough to diagnose a fault the way an experienced engineer would?

This is a field study, not a product engagement. We brought Nexus to a working manufacturing floor to test one of the questions at the center of our research: can a system understand an operational environment well enough to reason over it, the way an experienced engineer carries a working model of a plant in their head? A discrete manufacturer running 14 production lines gave us a real place to find out.

Why a Real Floor

An experienced controls engineer walking a plant floor knows what each machine does, how the lines depend on each other, and what normal does and does not sound like. Most software never builds that model. It stores readings and draws charts and leaves the understanding to the person reading them. Operational Intelligence, the first of our four research directions, asks whether a system can hold the model itself.

The only honest way to study that is to run it where being wrong has a cost. This facility was averaging 28 minutes of downtime per fault event, not because the faults were hard to fix, but because tracing a root cause through disconnected system layers was. Controls engineers were manually bridging Ignition SCADA alarms, OPC-UA tag mappings, Rockwell Studio 5000 ladder logic, and Cisco IE network diagnostics. Every minute spent on that trace was a minute the line sat idle, at roughly $4,200 per hour across the lines by the facility's own OEE tracking.
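To put the per-event cost in concrete terms, here is a minimal Python sketch of the arithmetic. It assumes the $4,200 hourly figure accrues linearly over an average 28-minute event; the function name `downtime_cost` is ours, chosen for illustration.

```python
HOURLY_DOWNTIME_COST_USD = 4200  # quoted above from the facility's OEE tracking
AVG_DOWNTIME_MIN = 28            # average downtime per fault event, quoted above


def downtime_cost(minutes: float, hourly_cost: float = HOURLY_DOWNTIME_COST_USD) -> float:
    """Cost of an outage, assuming cost accrues linearly with idle time."""
    return hourly_cost * minutes / 60


cost_per_event = downtime_cost(AVG_DOWNTIME_MIN)
print(f"${cost_per_event:,.0f} per average fault event")  # $1,960 per average fault event
```

Under that assumption, an average event costs about $1,960 before anyone has touched a tool.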

The Diagnostic Problem

Industrial environments are built in layers that work well alone and communicate poorly when something breaks. Alarms surface in SCADA. The tags behind them map to addresses in an OPC-UA server. The logic driving those tags lives inside ladder routines, Function Block Diagrams, and Add-On Instructions nested several levels deep. The hardware state blocking the logic sits on a switch network or an I/O module. No single interface shows all of it at once.

When a fault fires, an engineer typically has to locate the alarm source, identify the underlying tag path and map it to a PLC element, open the logic and find the rung that drives the tag, trace upstream permissives across multiple routines or even multiple PLCs, check device status on the network, and verify I/O and safety system state. For a single failed sensor that takes 15 to 20 minutes. For a fault with several contributing conditions buried in nested logic, it can take over an hour.

What We Deployed

Nexus was given structured, read-only access to three sources: the Ignition SCADA project (alarm configuration, tag paths, device mappings), the Studio 5000 L5X exports for all 14 PLCs (full logic, cross-references, and AOI definitions), and live CIP and EtherNet/IP device diagnostics through the existing FactoryTalk infrastructure.
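As a rough illustration of what reading an L5X export can yield, the sketch below pulls tag dependencies out of ladder rungs. It is not the Nexus parser. The element names (`Routine`, `RLLContent`, `Rung`, `Text`) reflect our understanding of the L5X layout, the sample controller and tag names are invented, and the regular expression only handles simple series rungs with no branches.

```python
import re
import xml.etree.ElementTree as ET

# Invented, heavily trimmed sample in the general shape of an L5X export.
SAMPLE_L5X = """<RSLogix5000Content SchemaRevision="1.0">
  <Controller Name="PLC_07">
    <Programs>
      <Program Name="MainProgram">
        <Routines>
          <Routine Name="Conveyor_Reset" Type="RLL">
            <RLLContent>
              <Rung Number="0" Type="N">
                <Text><![CDATA[XIC(Safety_OK)XIC(Reset_PB)OTE(Reset_Latch);]]></Text>
              </Rung>
            </RLLContent>
          </Routine>
        </Routines>
      </Program>
    </Programs>
  </Controller>
</RSLogix5000Content>
"""

# Matches a few common bit instructions and captures their single operand.
INSTRUCTION = re.compile(r"\b(XIC|XIO|OTE|OTL|OTU)\(([^)]+)\)")
INPUT_OPS = {"XIC", "XIO"}
OUTPUT_OPS = {"OTE", "OTL", "OTU"}


def rung_dependencies(l5x_text: str) -> dict:
    """Map each output tag to the input tags examined on the same rung."""
    root = ET.fromstring(l5x_text)
    deps = {}
    for rung in root.iter("Rung"):
        text = rung.findtext("Text") or ""
        instructions = INSTRUCTION.findall(text)
        inputs = [tag for op, tag in instructions if op in INPUT_OPS]
        for op, tag in instructions:
            if op in OUTPUT_OPS:
                deps.setdefault(tag, []).extend(inputs)
    return deps


print(rung_dependencies(SAMPLE_L5X))
# {'Reset_Latch': ['Safety_OK', 'Reset_PB']}
```

A map like this, built across every routine and AOI, is the kind of raw material a dependency graph can be assembled from.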

Its job is diagnostic traversal. An operator describes a symptom in plain language ("Line 4 E-stop reset is not latching," "Filler head 3 faults on every cycle start") and the system traverses the dependency graph from that symptom to its root cause, then explains the path in plain language.
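The traversal idea can be sketched in a few lines of Python. This is a simplified model under our own assumptions, not the Nexus implementation: the graph is a plain dictionary of upstream dependencies, health is a boolean per node, and every node name is hypothetical.

```python
from typing import Dict, List


def trace_root_cause(symptom: str,
                     depends_on: Dict[str, List[str]],
                     healthy: Dict[str, bool]) -> List[str]:
    """Walk upstream from a symptom, following dependencies that are not healthy.

    Returns the path from the symptom to the deepest unhealthy node found.
    Only the first unhealthy dependency at each step is followed, so a fault
    with several contributing conditions would need a fuller search.
    """
    path = [symptom]
    visited = {symptom}
    current = symptom
    while True:
        blocked = [dep for dep in depends_on.get(current, [])
                   if not healthy.get(dep, True) and dep not in visited]
        if not blocked:
            return path
        current = blocked[0]
        visited.add(current)
        path.append(current)


# Hypothetical graph: alarm -> tag -> rung -> permissives -> input device.
depends_on = {
    "alarm:Line4_EStop_Reset": ["tag:Line4/EStop/ResetLatch"],
    "tag:Line4/EStop/ResetLatch": ["rung:PLC04/Safety/12"],
    "rung:PLC04/Safety/12": ["permissive:Air_Pressure_OK",
                             "permissive:Guard_Door_Closed"],
    "permissive:Guard_Door_Closed": ["io:Door_Switch_Input"],
}
healthy = {
    "tag:Line4/EStop/ResetLatch": False,
    "rung:PLC04/Safety/12": False,
    "permissive:Air_Pressure_OK": True,
    "permissive:Guard_Door_Closed": False,
    "io:Door_Switch_Input": False,
}

path = trace_root_cause("alarm:Line4_EStop_Reset", depends_on, healthy)
print(" -> ".join(path))
```

The last node on the returned path is the candidate root cause, and the path itself is what gets explained back to the operator.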

The intelligence layer stays inside the plant network boundary. Nexus runs on a hardened edge server on the OT segment. No configuration data, tag data, or production records leave the facility, and there is no cloud dependency in the diagnostic path. It is read-only at the control level by architecture: no write access to any PLC, no command path through SCADA, no connection to any drive or actuator. The only thing it produces is a diagnostic report. That constraint is itself part of the research. The question we are studying is understanding, not control.

The Event That Made the Case

During the study period, a conveyor on Line 7 faulted overnight and would not clear on operator reset. Floor operators ran the standard reset sequence six times across two shifts without success. By the time the day-shift engineer arrived, the line had been down nearly four hours, and the working theory was a wiring issue in the safety circuit, which could mean pulling conduit.

Nexus was given a single input: the alarm text from the event log. It traced from the alarm through the alarm journal to the OPC-UA tag, followed the tag into the ladder logic for PLC-07, located the rung holding the reset output latched off, traced the blocking permissive upstream through two nested AOI layers, and identified a GuardLogix Safety I/O module with a dropped CIP connection as the upstream condition. It also flagged two conveyors on Lines 8 and 9 that shared the same module in their interlock chain and were one timeout event away from faulting under load.
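The warning about Lines 8 and 9 is the same dependency graph read in the opposite direction: start at the failed module and collect everything that depends on it. A minimal Python sketch follows, assuming a dictionary-based graph; the node names are hypothetical, and Nexus's internal representation is not described here.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_of(node: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every node that depends, directly or transitively, on `node`."""
    # Invert the edges so we can walk from a dependency to its consumers.
    dependents: Dict[str, Set[str]] = {}
    for consumer, upstreams in depends_on.items():
        for upstream in upstreams:
            dependents.setdefault(upstream, set()).add(consumer)

    found: Set[str] = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for consumer in dependents.get(current, ()):
            if consumer not in found:
                found.add(consumer)
                queue.append(consumer)
    return found


# Hypothetical interlock chains sharing one safety I/O module.
depends_on = {
    "Line7.Conveyor.Reset": ["SafetyIO_Module_A"],
    "Line8.Conveyor.Interlock": ["SafetyIO_Module_A"],
    "Line9.Conveyor.Interlock": ["SafetyIO_Module_A"],
    "Line3.Filler.Start": ["SafetyIO_Module_B"],
}

at_risk = sorted(downstream_of("SafetyIO_Module_A", depends_on))
print(at_risk)
# ['Line7.Conveyor.Reset', 'Line8.Conveyor.Interlock', 'Line9.Conveyor.Interlock']
```

Anything in that set other than the line already faulted is a candidate for inspection before it fails.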

Total diagnostic trace time: 28 seconds. Estimated manual trace time for the same nested path: 35 minutes.

The engineer confirmed a red fault LED on the module. The true cause was a loose fiber connection at the backplane, a physical problem that needed a physical fix. The fix took four minutes. The two downstream conveyors were inspected before the shift ended, preventing two more events.

What the Study Showed

Three months in, average fault-to-resolution time dropped from 28 minutes to under 6. Nexus handles the trace; the engineer handles the fix. Measured against confirmed root causes, diagnostic accuracy on the three most common fault categories (I/O hardware faults, safety permissive failures, and logic-level inhibits) was 94%.

  • Average fault trace time: 28 min baseline to under 60 seconds
  • Mean time to resolution: 28 min to 6 min
  • Estimated weekly production time recovered: 14+ hours across all lines
  • Secondary faults caught proactively before failure: 11 in the first 90 days
  • Engineer time returned to capital and preventive work: roughly 8 hrs/week

Why It Works, and Where It Stops

Industrial fault tracing is mostly a retrieval and graph-traversal problem, not a reasoning problem. The information needed to resolve most faults already exists inside the control system. The bottleneck is access speed: getting the right person to the right place in the right order under time pressure. A system that holds the full control graph does not search; it traverses. The answer that takes an engineer 20 minutes to locate takes Nexus seconds.

The remaining cases, novel failure modes and hardware degradation with ambiguous signatures, still need an experienced engineer. Nexus does not replace that judgment. It removes the time the engineer would have spent on everything that did not need it, so they reach the hard part faster. That boundary, what the system can carry and what stays human, is exactly the next question our research takes up under Human and AI Collaboration.

This is one of a series of field notes and essays on building systems that understand and act in real operations. Nexus is where the ideas get tested.