Software · AI · Research · 15 min read · April 1, 2026

How We Integrate AI Into Existing Business Software

Dylan McCarthy

Engineer

Most businesses don't need to rebuild their software to add AI. They need AI that understands their specific data, respects their access controls, and connects to the tools their teams already use.

The default framing for AI integration is a new product: a standalone chatbot, a new app, a separate tool that users log into. This framing is wrong for most businesses. The software a company uses to run its operations represents years of workflow refinement, institutional knowledge encoded in data structures, and user habits that don't transfer to new interfaces easily. Replacing it to add AI capability is expensive, risky, and usually unnecessary.

The more accurate framing is enhancement: taking the software that already exists and making it more capable at the specific tasks where AI adds real value. That means AI that can answer questions about the company's own data. AI that can draft outputs in the formats the existing workflow expects. AI that can take actions in the systems users already work in. None of this requires a full rebuild. It requires connecting an LLM to the right context and giving it the right tools.

This post covers how we approach that problem in practice: the architecture for grounding AI on proprietary data, the tool-calling pattern for connecting it to existing systems, the access control requirements that are non-negotiable, and the failure modes to design around before they occur in production.

The Grounding Problem

A general-purpose LLM has broad knowledge of the world and strong reasoning capability. It has no knowledge of your company's specific data: your customers, your equipment records, your pricing, your inventory, your open work orders. Without access to that data, AI-generated outputs are generic. They may be well-structured and grammatically correct, but they aren't grounded in the specific context that makes them useful.

Retrieval-augmented generation (RAG) is the primary technique for grounding LLM outputs on proprietary data. The architecture has three components:

A vector database or semantic search layer. Documents, records, and knowledge base content are converted into embedding vectors using an embedding model and stored in a vector database (Pinecone, Weaviate, pgvector as a PostgreSQL extension, Qdrant). At query time, the user's question is embedded using the same model and used to retrieve the most semantically similar records from the database. The retrieved content is then passed to the LLM as context.
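The embed-store-retrieve loop above can be sketched with a toy in-memory store. The documents and vectors here are invented for illustration; in production the embeddings come from an embedding model and live in pgvector, Pinecone, or similar, but the ranking logic is the same cosine-similarity comparison.

```python
import math

# Toy "vector store": each record pairs a chunk of text with a precomputed
# embedding. The three-dimensional vectors here are made up for illustration.
STORE = [
    {"text": "Work order W-4471: pump bearing replaced March 15.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Facility F-3 safety checklist, revised 2024.",       "vec": [0.1, 0.8, 0.2]},
    {"text": "Vendor pricing sheet for hydraulic seals.",          "vec": [0.0, 0.2, 0.9]},
]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, k=2):
    """Return the top-k chunks by similarity; these become the LLM's context."""
    ranked = sorted(STORE, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)
    return [r["text"] for r in ranked[:k]]
```

At query time the user's question would be embedded with the same model that produced the stored vectors, then passed to `retrieve`.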

A retrieval pipeline. Raw retrieval from a vector database has known failure modes. The top-k results by cosine similarity are not always the most relevant results. Retrieval quality varies significantly based on how content was chunked before embedding, which embedding model was used, and how the query was structured. A production retrieval pipeline includes reranking (using a cross-encoder model to score retrieved results for relevance before passing them to the LLM), filtering (restricting retrieval to documents the authenticated user is authorized to see), and logging (capturing what was retrieved for each query so retrieval quality can be measured over time).

Structured data access alongside unstructured retrieval. Vector search works well for unstructured content: documents, notes, knowledge base articles, past communications. It works poorly for structured queries: "what is the current inventory count for part number 8812-C," "which work orders are overdue for facility F-3," "what was the average response time last quarter." These questions should be answered by querying the database directly, not by searching for similar text. The AI integration needs to route to structured data access for queries that have structured answers, and to vector search for queries where the answer lives in unstructured content.
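A minimal sketch of that routing decision. The keyword heuristic below is deliberately crude and purely illustrative; a production system typically lets the LLM itself choose between a SQL tool and a retrieval tool via tool calling.

```python
def route(query: str) -> str:
    """Route a query to structured data access ("sql") or semantic
    retrieval ("vector_search"). Marker list is an assumption for
    this sketch, not a recommended production heuristic."""
    structured_markers = ("how many", "count", "average", "overdue", "total")
    if any(marker in query.lower() for marker in structured_markers):
        return "sql"
    return "vector_search"
```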

The Tool-Calling Architecture

Tool calling is the mechanism that connects the LLM to the systems it needs to act on. The model is given a set of tools, each described with a name, a description of what it does, and a schema for its inputs. When the model determines that answering the user's request requires calling one of the available tools, it generates a structured call that the application layer executes against the actual system. The result is passed back to the model, which incorporates it into the response.

For an AI integration into existing business software, the tools are usually: a database query tool, an API call tool for each integrated external system, and a document retrieval tool. The design of these tools has more impact on output quality than the choice of LLM.

Tool descriptions that match how users think. The model selects tools based on the descriptions. A tool described as "queries the asset_maintenance table with a SQL WHERE clause" requires the model to know both that assets are tracked in a table called asset_maintenance and the correct query syntax for that table. A tool described as "retrieves maintenance history and current status for a specific asset, given its asset ID or name" requires the model to know only what a user would know. The latter produces dramatically more reliable tool selection.
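Here is what the user-centric version looks like as an OpenAI-style tool definition. The tool name, fields, and example asset ID are illustrative, not tied to any specific backend; the point is that the description speaks in user terms ("maintenance history for a specific asset"), not schema terms.

```python
# An OpenAI-style function/tool definition. All names here are illustrative.
GET_ASSET_HISTORY = {
    "type": "function",
    "function": {
        "name": "get_asset_history",
        "description": (
            "Retrieves maintenance history and current status for a "
            "specific asset, given its asset ID or name."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "asset": {
                    "type": "string",
                    "description": "Asset ID (e.g. 'A-1042') or asset name.",
                },
                "limit": {
                    "type": "integer",
                    "description": "Maximum number of history entries to return.",
                },
            },
            "required": ["asset"],
        },
    },
}
```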

Input validation before execution. Every tool call should be validated against the tool's input schema before executing against the backend system. The model sometimes generates calls with incorrect argument types, missing required fields, or out-of-range values. A validation layer catches these before they reach the database or external API, returns a specific error to the model explaining what was wrong, and allows the model to self-correct and retry. Without this layer, invalid tool calls either throw unhandled exceptions or produce incorrect outputs silently.
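A minimal sketch of that validation layer, checking required fields and argument types against a schema fragment like the one above. A real system would use a full JSON Schema validator or pydantic; the key design point is that errors come back as readable strings the model can use to self-correct and retry.

```python
def validate_args(args: dict, schema: dict) -> list:
    """Check tool-call arguments against a minimal JSON-Schema-like dict
    before execution. Returns a list of error messages (empty = valid),
    which the application returns to the model for a retry."""
    errors = []
    props = schema.get("properties", {})
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field: {field}")
    type_map = {"string": str, "integer": int, "number": (int, float), "boolean": bool}
    for name, value in args.items():
        if name not in props:
            errors.append(f"unknown field: {name}")
            continue
        expected = type_map.get(props[name].get("type"))
        if expected and not isinstance(value, expected):
            errors.append(f"field {name}: expected type {props[name]['type']}")
    return errors
```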

Read-write separation. Read-only tools (query, retrieve, lookup) and write tools (create, update, delete, send) should be clearly separated in the tool definition and should be subject to different authorization requirements. Not every user who can ask questions about the data should be able to trigger writes. The authorization check for write tools happens at the tool execution layer, not by trusting that the model only calls write tools when appropriate.
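A sketch of that separation enforced at the execution layer. The tool and role names are illustrative; what matters is that the check runs in application code before any write executes, and that unknown tools are denied by default.

```python
# Illustrative tool names; read and write tools are kept in separate sets.
READ_TOOLS = {"query_work_orders", "get_asset_history", "search_documents"}
WRITE_TOOLS = {"create_work_order", "update_asset_status", "send_notification"}

def authorize_tool_call(tool_name: str, user_roles: set) -> bool:
    """Authorization happens here, at execution time, never by trusting
    that the model only calls write tools when appropriate."""
    if tool_name in READ_TOOLS:
        return bool(user_roles & {"reader", "editor"})
    if tool_name in WRITE_TOOLS:
        return "editor" in user_roles
    return False  # unknown tools are denied by default
```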

Structured outputs. Tools that return data should return typed, structured objects, not raw SQL result sets or unformatted API responses. The model has to interpret the tool's return value and incorporate it into a response. A raw SQL result set returned as a list of tuples requires the model to infer the meaning of each field from its position. A structured object with named fields and clear types is easier for the model to reason about correctly, and errors in interpreting the return value are easier to diagnose.
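The difference is easy to see in code. Below, raw positional tuples from a query are converted into named, typed objects before being returned to the model; the field names are illustrative.

```python
from dataclasses import dataclass, asdict

@dataclass
class WorkOrderResult:
    """Typed tool return value; field names are illustrative."""
    work_order_id: str
    status: str
    assigned_team: str
    due_date: str  # ISO 8601 date string

def to_tool_response(rows):
    """Convert raw DB row tuples into objects the model can interpret
    by field name instead of by position."""
    return [asdict(WorkOrderResult(*row)) for row in rows]
```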

Access Control: The Non-Negotiable Constraint

AI that can access data the user isn't supposed to see is a security vulnerability, not a feature. The access control model for an AI integration has to be at least as strict as the access control model for the underlying systems, which means it has to be enforced at the data access layer, not by trusting the model's judgment.

The concrete requirement: every tool call executed on behalf of a user must apply the same access control filters that would be applied if that user were querying the system directly. A field technician who can only see work orders assigned to their team should not be able to ask the AI about work orders assigned to other teams and receive accurate answers. The query tool that retrieves work orders must apply the same row-level filter the application's API applies for that user's role.

This is implemented by passing the authenticated user's identity and role into the tool execution context, and designing each tool to apply appropriate filters before querying. The model should not receive any data the user isn't authorized to see, even intermediately. Returning the full dataset and asking the model to "only show the parts this user should see" is not a valid implementation.

The access control model also applies to retrieval from the vector database. Documents indexed in the vector store should be tagged with the access level required to see them. At retrieval time, the query should filter to documents the authenticated user is authorized to access before running the semantic similarity search, not after. Post-retrieval filtering that removes unauthorized results still returns them from the database, which is a weaker guarantee than pre-retrieval filtering that never fetches them.
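Pre-retrieval filtering can be sketched as follows: unauthorized records are dropped from the candidate set before any similarity ranking runs, so they are never fetched at all. The access-level tags and toy vectors are illustrative; real vector databases express this as a metadata filter on the search query.

```python
import math

def filtered_search(store, query_vec, allowed_levels, k=3):
    """Filter by access level BEFORE ranking by similarity, so
    unauthorized records never enter the candidate set."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    visible = [r for r in store if r["access_level"] in allowed_levels]
    return sorted(visible, key=lambda r: cosine(query_vec, r["vec"]), reverse=True)[:k]
```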

Connecting to Systems the Team Already Uses

For AI integration into existing business software, the highest-value connections are usually to the systems users interact with most: the CRM, the ERP, the project management tool, the ticketing system. Each connection is a tool the AI can call to read from or write to that system.

The integration quality varies significantly by system. Modern SaaS platforms (Salesforce, HubSpot, Jira, ServiceNow) have REST APIs that are well-documented, consistent, and support fine-grained access control via API keys or OAuth. Building tool wrappers around these APIs is straightforward: typically a week of development per integration to handle authentication, rate limiting, error handling, and response normalization.

Legacy systems present a different problem. An ERP system from 2005 may expose data only through a SOAP API with XML responses, a proprietary binary protocol, or a shared database that's accessed by direct SQL query. None of these are insurmountable, but they require more integration work and carry higher maintenance burden when the source system is updated.

The integration priority decision should be driven by where the AI generates the most value, not by which systems are easiest to connect. A field inspection app that can access the maintenance history for the asset being inspected while the technician is on-site generates more value than one that can only answer questions about general best practices. The integration that connects to the asset management system is the one that makes the difference.

Streaming Responses and Latency

LLM response generation takes time. A multi-step agent that retrieves context from the vector store, queries the database for structured data, and then generates a response may take five to fifteen seconds to complete from the user's perspective. In a conversational interface, a blank screen for that duration feels broken.

Streaming solves this by sending the response token by token as it's generated, rather than buffering the full response and sending it at once. From the user's perspective, the response starts appearing almost immediately and continues filling in. The latency is the same, but the perceived latency is dramatically lower because the user sees progress.

Streaming also allows intermediate state to be communicated during multi-step processing. A tool-calling sequence where the model looks up a record, retrieves relevant documents, and generates a response can show the user what it's doing at each step: "Looking up work order W-4471... Retrieving maintenance history... Generating response..." This feedback is especially important for operations workflows, where the user needs to know the AI is working with current, real data rather than generating a plausible-sounding hallucination.

The implementation uses server-sent events (SSE) or WebSocket streaming from the backend to the frontend. Most LLM provider APIs support streaming natively. The tool call results can be streamed as structured events that the frontend renders as intermediate states, with the final text response streaming via SSE.
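A minimal sketch of the backend side of that SSE stream: a generator that frames each event (status updates and token deltas alike) in the `data:` wire format, terminated by a sentinel. The event shapes and the `[DONE]` sentinel are assumptions for this sketch, though the sentinel convention matches what several LLM provider streaming APIs use.

```python
import json

def sse_stream(events):
    """Yield server-sent-event frames for a mixed stream of tool-step
    status updates and response token deltas. Event dict shapes are
    illustrative."""
    for event in events:
        yield f"data: {json.dumps(event)}\n\n"
    yield "data: [DONE]\n\n"

frames = list(sse_stream([
    {"type": "status", "text": "Looking up work order W-4471..."},
    {"type": "token", "text": "The last preventive maintenance"},
]))
```

In a web framework, this generator would be returned as a `text/event-stream` response body, with the frontend rendering `status` events as intermediate states and appending `token` events to the visible reply.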

Preventing Hallucination Through System Design

Hallucination, the model generating confident-sounding output that is factually incorrect, is the primary reliability concern for production AI integrations. The design-level response to hallucination is not primarily about model selection. It's about constraining the model's outputs to be grounded in verified data rather than generated from parametric knowledge.

The techniques that actually reduce hallucination in production:

Require citation. For responses that answer questions about the company's data, require the model to cite the specific record or document it's responding from. The citation requirement forces the model to ground its response in retrieved content rather than generating from memory. A response that says "Based on work order W-4471, the last preventive maintenance was completed on March 15" is verifiable. A response that says "preventive maintenance for this equipment type is typically done quarterly" is not, and it may be wrong for this specific asset.
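The citation requirement can be enforced mechanically before a response reaches the user. The bracketed citation-marker format below is an assumption for this sketch; the point is that whatever convention the system prompt mandates, a post-generation check can reject responses that carry no citation at all.

```python
import re

# Hypothetical citation marker convention, e.g. "[work_order:W-4471]"
# or "[doc:maint-guide-7]". Use whatever format your system prompt mandates.
CITATION_PATTERN = re.compile(r"\[(?:work_order|doc):[\w-]+\]")

def is_grounded(response: str) -> bool:
    """Return True if the response carries at least one citation marker,
    i.e. it claims grounding in a specific retrieved record."""
    return bool(CITATION_PATTERN.search(response))
```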

Return structured data for structured questions. When a user asks "how many open work orders does facility F-3 have," the correct response is not a paragraph that includes the number. It's a direct query to the database that returns the count, displayed as a number. Routing structured queries to structured data access rather than generating the answer from context eliminates hallucination for that query type entirely.

Use low temperature for factual queries. Temperature controls the randomness of the model's output. Higher temperature produces more varied and creative output. Lower temperature (0.0 to 0.3) produces more deterministic output that sticks closer to the most likely continuation given the context. For factual queries about business data, lower temperature reduces the probability of the model improvising details that aren't in the retrieved context.

System prompt constraints on scope. The system prompt should explicitly instruct the model that it should only answer questions it can support with retrieved data, and should say so clearly when it can't find the information rather than generating a plausible answer. "I don't have that information in the available records" is a correct answer. A confident incorrect answer is not.

Evaluation: Knowing Whether It's Working

An AI integration that is not continuously evaluated will degrade over time without anyone noticing. The retrieval quality drifts as the document corpus changes. New types of user queries don't match the prompt structure. A model update changes behavior on specific task types. Without an evaluation harness, these regressions surface through user complaints rather than monitoring.

The evaluation harness for an AI integration has three layers:

Retrieval quality evaluation. Given a set of test queries with known correct documents, does the retrieval pipeline surface those documents in the top-k results? This is measured as recall at k (how many of the correct documents are in the top-k results) and mean reciprocal rank (how high in the result list the first correct document appears). Retrieval quality should be measured before any model changes that affect the embedding process, and after any changes to the document corpus that might affect retrieval coverage.
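Both metrics are short to implement. The sketch below treats documents as IDs; `recall_at_k` measures what fraction of the known-relevant documents appear in the top-k results, and `mrr` scores how high the first relevant document ranks.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant documents found in the top-k results."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def mrr(retrieved: list, relevant: set) -> float:
    """Reciprocal rank of the first relevant document (0.0 if none found).
    Averaging this over a query set gives mean reciprocal rank."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```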

Tool call accuracy evaluation. Given a set of user queries with known correct tool calls, does the model select the correct tool and generate correct arguments? This is the most direct measure of whether the AI is connecting user intent to the right system actions. Tool call failures in production are often traced to queries that weren't represented in this evaluation set.

Response quality evaluation. For a sample of query-response pairs, are the responses accurate, grounded in retrieved context, and appropriately scoped? This is the hardest to automate because response quality is partially subjective. LLM-as-judge evaluation, where a separate model scores responses against a rubric, scales to large sample sizes. Human evaluation on a smaller sample calibrates whether the automated judge's scores match human judgment.

Running these evaluations before any deployment and on a sampling of production queries continuously is the infrastructure that allows you to confidently update the integration over time rather than being afraid to touch it.

The Integration Is Not the Product

The AI integration is one component of a system. Whether it is useful is determined by the surrounding system: the data it can access, the access controls it respects, the interfaces it's embedded in, and the operational monitoring that catches failures.

The teams that get real value from AI integration into existing software treat it as an engineering problem with the same rigor they'd apply to any other backend service: designed for failure, instrumented for observability, evaluated continuously, and operated with a defined ownership model. The model selection and prompt design are a small fraction of the total engineering surface. The data layer, the tool architecture, the access control model, and the evaluation harness are where the real reliability comes from.

The teams that get poor results treat AI integration as a configuration task: pick a model, write a system prompt, point it at the data, declare it done. The gap between those two approaches is the gap between an AI feature that people actually use and one that gets bypassed within a month because it gives wrong answers too often to trust.

For businesses with existing software that runs real workflows, the right integration makes the software meaningfully more capable without requiring users to learn a new system. That's a high bar to clear. The architecture covered here is how we clear it in practice.

Have a project in mind?

We build software, automation systems, and AI tools for businesses that need them to actually work.

Start a project