
Beyond the AI Label. Why Telemetry Matters as You Build Your Travel Reasoning Workflows

  • Graham Anderson
  • Jan 14
  • 10 min read

Updated: Jan 26

Series: From automation tools to agentic orchestration - a telemetry guide for the travel industry.


As we kick off 2026 in earnest, every travel tech company claims, or will soon claim, to have "AI capabilities." However, a more honest assessment is that most of us are still experimenting, not operating at scale.


We are now using Model Context Protocol (MCP) to let reasoning models query our databases, call our APIs and retrieve policy documents. We see promising pilots where natural language triggers workflows and automations that previously required a swivel chair, five different screens and manual data entry.


Here is something that tech leaders and their teams will wrestle with as AI capabilities harden into production solutions or products. The telemetry infrastructure we built for deterministic software cannot tell us what we need to know about reasoning models. When an LLM interprets a traveller's request, decides which tools to call and hands off to traditional automation, we see the final result. We cannot reconstruct the reasoning path that got us there.


The Question That Reveals the Gap

Ask your team this question in your next stand-up or catch-up: "If our AI-powered booking assistant makes a recommendation that a customer disputes, can we show them the step-by-step reasoning the system used to arrive at that recommendation, make the booking, take the payment and so on?"


For most, the answer is "no." Not because the system is failing, but because we are using logging designed to capture "Function X called Function Y." What we actually need is "Model interpreted request as Z, retrieved context from A, considered options B and C and selected B because of constraint D."


This is semantic telemetry, a new category of observability designed specifically for reasoning systems. It captures not just what happened (traditional logs) but why the system made the decisions it did (reasoning, context, alternatives considered). If you have never heard this term, you are not alone. Most travel tech leaders have not. That is the problem.
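
To make the distinction concrete, here is a minimal sketch in Python (the field names and values are illustrative, not a standard schema) contrasting a traditional log entry with a semantic telemetry event for the same recommendation.

```python
# Illustrative only: the same recommendation captured two ways.

# Traditional log: records *what* happened.
traditional_log = {
    "timestamp": "2026-01-14T09:32:11Z",
    "service": "booking-assistant",
    "event": "recommendation_returned",
    "status": 200,
}

# Semantic telemetry: also records *why* it happened.
semantic_event = {
    "timestamp": "2026-01-14T09:32:11Z",
    "service": "booking-assistant",
    "interpretation": "traveller values flexibility more than the lowest price",
    "context_retrieved": ["loyalty_profile", "corporate_travel_policy_v3"],
    "options_considered": ["cheapest non-refundable fare", "flexible fare"],
    "option_selected": "flexible fare",
    "selection_reason": "corporate policy requires refundable tickets for this traveller",
}
```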


The industry is currently in a 'visibility honeymoon.' We are so impressed that these models can reason at all that we have stopped asking how they are doing it. However, in the transition from experiment to infrastructure, 'Magic' is a liability. You cannot manage, optimise or defend what you cannot see. We do not need to slow down our AI deployment; we need to catch our telemetry up to our ambitions.


The Three-Tier Reality Check

In January 2026, your "AI" portfolio likely breaks down into three distinct tiers of visibility:

| Tier | Technology | Telemetry Status | The 2026 Risk |
| --- | --- | --- | --- |
| Tier 1: Rules | If-Then logic and low-code automation tools | ✅ Full Visibility. Decades of experience. | Logic bugs in code/rules. |
| Tier 2: ML | Predictive models, fraud scoring, forecasting, pricing | ⚠️ Partial Visibility. We log scores, rarely "features." | Invisible drift: models failing as travel patterns shift. |
| Tier 3: Reasoning | LLM-powered agents, MCP workflows | ❌ Zero Visibility. The middle is a black box. | Hallucinations & injection: unexplainable business decisions. |

Why This Article Focuses on Tier 3

Tier 1 (rules-based systems) has well-understood debugging. If the logic is wrong, you fix the code. Tier 2 (ML models) has established monitoring practices - though "invisible drift" remains a challenge worthy of its own article (and there are plenty out there).


A pricing model trained on 2025 demand patterns might silently degrade as traveller behaviour shifts, with no alert that its predictions are becoming increasingly wrong. Traditional ML monitoring tracks accuracy on test sets - not whether the real world has changed underneath the model. This "invisible drift" is why semantic telemetry must eventually extend to predictive systems, not just reasoning ones.
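
As a side note rather than this article's focus, one lightweight way to catch that kind of drift is to compare the live distribution of a key input feature against the distribution the model was trained on. Below is a minimal sketch using a population stability index; the feature, bin count and alert threshold are illustrative assumptions.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Compare a training-period feature distribution against live traffic.

    Values above roughly 0.2 are commonly treated as a sign that the real
    world has shifted under the model.
    """
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_counts, _ = np.histogram(expected, bins=edges)
    actual_counts, _ = np.histogram(actual, bins=edges)
    # Convert counts to proportions, avoiding divide-by-zero.
    expected_pct = np.clip(expected_counts / expected_counts.sum(), 1e-6, None)
    actual_pct = np.clip(actual_counts / actual_counts.sum(), 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Illustrative feature: booking lead time in days, 2025 training data vs this week.
train_lead_times = np.random.gamma(shape=2.0, scale=20.0, size=10_000)
live_lead_times = np.random.gamma(shape=2.0, scale=12.0, size=2_000)  # behaviour has shifted

if population_stability_index(train_lead_times, live_lead_times) > 0.2:
    print("Alert: demand-pattern drift detected - schedule a retraining review")
```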


However, Tier 3 (reasoning systems) is where the visibility gap is most acute. These are the systems where:

  • Non-determinism is the feature, not the bug.

  • The "logic" is a language model's interpretation, not executable code.

  • Traditional logging was never designed to capture "why did the AI think this was the right answer?"

That is the gap this telemetry guide addresses.


In January 2026, most travel companies are deploying AI faster than they are instrumenting it. If you are reading this and realising you have Tier 3 systems in production without semantic telemetry, you are not behind; you are in the majority. The gap between 'AI in production' and 'AI we can explain' is widening across the industry. The companies closing this gap now, before regulatory enforcement and before lawsuits, are the ones building sustainable AI capabilities rather than ticking the 'we have AI' box.


The "Non-Determinism" Mystery

In a deterministic system, the same input yields the same output. In a 2026 reasoning workflow, that is no longer true.


The scenario: a tester asks for a "quiet hotel in Lisbon" four times over two days:

  • Query 1: Historic guesthouse in Alfama (charming but narrow streets, tourist crowds, tram noise).

  • Query 2: Modern hotel in Baixa (central shopping district, very crowded and touristy).

  • Query 3: Contemporary hotel in Parque das Nações (quiet but lacks traditional Lisbon charm, feels impersonal).

  • Query 4: Riverside hotel in Belém ✓ (perfect match: peaceful southwestern district, spacious, away from crowds but well-connected).


The fourth result is perfect. The tester wants to know why the system finally got it right. Traditional logs show what happened. All four queries executed successfully, all returned results from the hotel inventory API, all returned a 200 OK status. What those logs cannot show is why the AI weighted "peaceful district away from crowds" over "central location" in Query 4 but not in Queries 1-2.


Semantic telemetry shows that between Query 1 and Query 4, the agent retrieved a specific piece of user history - a previous stay where the guest mentioned appreciating "calm mornings by the water away from tourist crowds" - and adjusted its interpretation of "quiet" from "low nightlife activity" (which Alfama provides at night) to "peaceful atmosphere throughout the day in a less touristy district" (which Belém delivers).
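
A semantic trace for Query 4 might look something like the sketch below (Python; the field names are assumptions and the values are drawn from the scenario above).

```python
# Illustrative semantic trace for Query 4 of the "quiet hotel in Lisbon" scenario.
query_4_trace = {
    "request": "quiet hotel in Lisbon",
    "interpretation": {
        "initial": "quiet = low nightlife activity",
        "revised": "quiet = peaceful atmosphere throughout the day, less touristy district",
        "revision_trigger": "retrieved stay history: guest valued 'calm mornings "
                            "by the water away from tourist crowds'",
    },
    "context_retrieved": ["guest_stay_history", "hotel_inventory_api"],
    "candidates_considered": ["Alfama guesthouse", "Baixa hotel",
                              "Parque das Nações hotel", "Belém riverside hotel"],
    "selected": "Belém riverside hotel",
    "selection_reason": "peaceful southwestern district, away from crowds but well-connected",
}
```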


Without semantic tracing, you cannot replicate your successes and you certainly cannot debug your failures.


Unlike deterministic software where logging can be retrofitted, you cannot recreate LLM reasoning paths after the fact. You cannot rewind and replay. If you did not capture the reasoning, the temperature settings and the context window state at the exact moment it happened, that logic is gone forever.
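
In practice that means writing the telemetry record at the moment of the call, freezing the generation settings and the exact context the model saw alongside the response. A minimal sketch follows, assuming a generic call_model client function and a simple file-based trace store, both hypothetical.

```python
import datetime
import json
import uuid

def append_to_trace_store(line: str) -> None:
    # Placeholder sink; in production this would feed your log pipeline or telemetry exporter.
    with open("semantic_traces.jsonl", "a") as f:
        f.write(line + "\n")

def traced_completion(call_model, prompt: str, context_snippets: list[str],
                      temperature: float = 0.7) -> str:
    """Call the model and freeze the ephemeral state that cannot be recreated later.

    `call_model` stands in for whichever LLM client you use; the point is that the
    generation settings and the exact context live in the same record as the response.
    """
    record = {
        "trace_id": str(uuid.uuid4()),
        "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "temperature": temperature,
        "context_window": context_snippets,  # exactly what the model saw, at this moment
        "prompt": prompt,
    }
    record["response"] = call_model(prompt, context_snippets, temperature)
    append_to_trace_store(json.dumps(record))
    return record["response"]
```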


The Security Risk. How AI Could Amplify Existing Fraud Patterns

Security researchers have highlighted a growing vulnerability in AI agents with tool-calling capabilities: over-privileged output. These systems are often so well-optimised for "helpfulness" that they inadvertently act as high-speed oracles for sensitive business logic. While we are yet to see a headline-grabbing breach of this type in travel, the gap between AI capability and traditional telemetry reveals a looming "silent" fraud crisis.


The Traditional Fraud Pattern

Consider a long-standing hotel fraud problem: unauthorised use of corporate discount codes. Hotels negotiate rate codes with specific organisations - government agencies, corporations or associations.


Traditionally, this fraud is "leaky" but slow. Unauthorised guests might stumble upon a code or find one on a social media thread. As discovering which codes are active and what they require (such as ID verification) is a manual, tedious process, the misuse is usually limited to a handful of bookings per property. It is a manageable nuisance.


The AI-Amplified Risk. Shifting From Nuisance to Macro-Risk

An AI booking assistant changes the economics of this fraud by increasing the fraud velocity. A bad actor can use the AI to systematically "brute force" or enumerate rate codes. As the AI is designed to be frictionless, it may confirm which codes are valid, the depth of the discount and, crucially, whether the property's policy strictly enforces ID verification.


Scenario Analysis:

  • The Leakage.  A high-value corporate code is validated by the AI and shared via social "deal" communities.

  • The Volume.  If this leads to just 25 unauthorised bookings per day across a mid-sized hotel cluster over a 60-day period, the property is looking at 1,500 compromised transactions.

  • The Impact. With an average discount of $130 per night, the unrealised revenue exceeds $195,000 before a human auditor ever sees the trend.


For a global brand, multiplying this pattern across hundreds of properties transforms a helpful customer interaction into a multi-million dollar systemic risk.


The Telemetry Gap

The core of the problem is that traditional logs are outcome-oriented. They would simply report: "1,500 bookings processed successfully with valid rate codes." What would be missing is the intent-oriented data that only surfaces at the conversational level:

  1. Systematic Enumeration. That a single session queried 50+ codes in a three-minute window.

  2. Policy Disclosure. That the AI explicitly confirmed "No ID required" for a specific high-value code.

  3. Anomalous Correlation. That usage of a specific code spiked 800% shortly after that specific AI interaction.


Each individual booking appears legitimate in isolation. The pattern only surfaces during a quarterly audit - long after the revenue has evaporated.


Why Semantic Telemetry Matters

Systems that capture query patterns, not just transaction outcomes, can flag systematic rate code enumeration the moment it begins. This is the essence of semantic telemetry. It provides visibility into the "why" behind the interaction.
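
To illustrate what capturing query patterns can mean in practice, here is a minimal sketch that flags a session querying an unusually large number of distinct rate codes in a short window. The thresholds mirror the scenario above; the function and variable names are illustrative.

```python
import time
from collections import defaultdict, deque

# Thresholds mirror the scenario above; tune them to your own traffic.
MAX_DISTINCT_CODES = 50
WINDOW_SECONDS = 180  # three minutes

_code_queries = defaultdict(deque)  # session_id -> deque of (timestamp, rate_code)

def record_rate_code_query(session_id: str, rate_code: str) -> bool:
    """Record one conversational rate-code lookup; return True if the session
    looks like systematic enumeration rather than a normal guest enquiry."""
    now = time.time()
    window = _code_queries[session_id]
    window.append((now, rate_code))
    # Slide the window: drop events older than three minutes.
    while window and now - window[0][0] > WINDOW_SECONDS:
        window.popleft()
    distinct_codes = {code for _, code in window}
    return len(distinct_codes) >= MAX_DISTINCT_CODES

# Usage, called from the conversation layer rather than the booking transaction:
# if record_rate_code_query(session_id, requested_code):
#     flag_for_fraud_review(session_id)  # hypothetical downstream action
```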


Without this infrastructure, travel companies are essentially building high-speed doors without installing security cameras. To protect margins in an AI-driven market, telemetry must evolve from tracking what happened to understanding what was intended.


Redefining "Done" for Reasoning Systems

Here is an uncomfortable question: "Have we lowered our standards because the technology is new?"


When we built our core booking systems, payment processing and inventory management, we did not debate whether monitoring was worth the investment. We understood that a system you cannot observe is not production-ready. We built logging, alerting and audit trails as part of "done" - not as optional enhancements.


Yet as we rush to deploy AI features, many organisations are calling systems "done" when they cannot explain what those systems are doing (and why they are doing what they are doing). We are accepting a definition of production-ready that we would never have tolerated for our transactional systems.


This is not about AI requiring new capabilities. It is about applying the same operational standards we have always had.

What "Done" Always Meant

What "Done" Often Means Today for AI

Code that works ✅

Model that produces outputs ✅

Tests that prove it ✅

Basic integration working ✅

Visibility into behaviour ✅

Visibility into reasoning ❌

Ability to debug failures ✅

Ability to reconstruct decisions ❌

Audit trail for disputes ✅

Explanation for recommendations ❌

The gap is not that telemetry is expensive. The gap is that we are shipping incomplete systems.


When engineering leadership says "we need to add telemetry," they are not asking for budget to build something extra. They are asking for permission to finish the work properly. Arguably, you can ship experiments without complete telemetry. You cannot responsibly scale to production without it.


Learning from the Mission-Critical

We don't need to invent new standards; we simply need to apply the rigour of our own industry's high-stakes sectors to our digital interfaces.

  • Aviation. We do not fly planes without flight data recorders. We need the same "black box" mindset for AI - the ability to reconstruct why a system made a specific turn.

  • Financial Services.  Algorithmic trading requires a clear audit trail. Our booking assistants, which handle revenue and customer identity, should be held to the same standard of traceability.


A Tier 3 reasoning system is "done" when:

✅ It produces the intended outputs.

✅ It includes the appropriate telemetry layer for its risk profile.

✅ You can reconstruct reasoning for disputed decisions.

✅ You can prove compliance with applicable regulations.

✅ You can optimise what is working and debug what is failing.

This is not a new standard. It is the standard we have always had, now applied to reasoning systems.


The real question is not, "Should we invest in telemetry?" It is, "Are we willing to call something production-ready when it does not meet our operational standards?"


What You Need to Instrument Now

To move from pilot to production in 2026, your reasoning workflows must capture more than just inputs and outputs. To meet emerging transparency standards (like the EU AI Act) and ensure operational control, you must log:

  • Interpretation Logs. How did the model parse ambiguous customer intent?

  • Context Retrieval (RAG) Tracing. Which specific data points were pulled from your inventory or policy databases to inform this response?

  • Tool Selection Logic. Why was the "Refund API" triggered instead of the "FAQ Search"?

  • Confidence & Human-in-the-Loop. How certain was the model? Did it correctly escalate a high-stakes decision to a human agent?


A Practical Implementation. Start Layered and Know Your Boundaries

Do not try to instrument everything at once - you will drown in data. Instead, use a tiered approach where the risk profile of the transaction determines the depth of the telemetry.

  • Layer 1 - Always On (The "What"). Metadata for every request. Model version, timestamp and final outcome. This is your high-level audit trail.

  • Layer 2 - Sampled (The "How"). Full semantic traces for a percentage (e.g. 10%) of successful traffic. This provides a "gold standard" baseline to identify model drift or performance decay.

  • Layer 3 - Logic-Triggered (The "Deep Dive"). Verbose tracing (Chain of Thought, RAG sources, alternative paths) activated during inference when:

    • Low Confidence: The model’s self-reported confidence score falls below a set threshold.

    • Sensitive Tool Access: The agent attempts to call a "Power Tool" (e.g., payment, refund, or PII access).

    • Guardrail Violation: A secondary "supervisor" model flags a potential policy breach.


The Critical Distinction. These triggers fire before the transaction completes. You are capturing the "uncertainty" while the evidence still exists. If you wait for a customer complaint, the reasoning path has already evaporated.
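
A sketch of that routing decision, combining the always-on, sampled and logic-triggered layers described above (the 10% sample rate comes from the list; the tool names, confidence threshold and guardrail flag are assumptions):

```python
import random

POWER_TOOLS = {"payment_api", "refund_api", "pii_lookup"}  # illustrative tool names
CONFIDENCE_THRESHOLD = 0.7                                 # assumed threshold
SAMPLE_RATE = 0.10                                         # Layer 2 sampling rate from the list above

def telemetry_depth(confidence: float, requested_tool: str | None,
                    guardrail_flagged: bool) -> str:
    """Decide, before the transaction completes, how deeply to trace this request."""
    # Layer 3: deep dive whenever a logic trigger fires.
    if (confidence < CONFIDENCE_THRESHOLD
            or requested_tool in POWER_TOOLS
            or guardrail_flagged):
        return "layer_3_verbose"   # chain of thought, RAG sources, alternative paths
    # Layer 2: sampled full semantic traces as a gold-standard baseline.
    if random.random() < SAMPLE_RATE:
        return "layer_2_sampled"
    # Layer 1: metadata only, always on.
    return "layer_1_metadata"
```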


Working With Commercial AI (Claude, GPT, Gemini)

When using commercial APIs, your telemetry happens at the application boundary. You are not instrumenting the model; you are instrumenting your interaction with it.

| Layer | What the API Provides | What You Must Capture (Your Code) |
| --- | --- | --- |
| Layer 1 | Request/response, token usage | Session ID, user context |
| Layer 2 | Function call parameters | The RAG "snippets" sent to the model |
| Layer 3 | JSON-structured reasoning | Why specific tools were selected/rejected |

Pro Tip for Tech Leaders: do not just log the AI's final answer; prompt your models to return a "Reasoning Object" alongside the response.
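
A minimal sketch of that pattern, assuming a generic call_llm wrapper around whichever commercial API you use; the keys in the reasoning object are illustrative, not part of any vendor's schema.

```python
import json

REASONING_INSTRUCTION = (
    "Answer the traveller's request, then return a single JSON object with the keys "
    "'answer', 'interpretation', 'context_used', 'tools_considered' and 'confidence'."
)

def log_semantic_event(request: str, event: dict) -> None:
    # Placeholder sink; route this into your Layer 1/2/3 telemetry pipeline in production.
    print(json.dumps({"request": request, **event}))

def answer_with_reasoning(call_llm, user_request: str) -> dict:
    """Ask the model for a structured reasoning object and capture it at the
    application boundary, alongside the final answer."""
    raw = call_llm(system=REASONING_INSTRUCTION, user=user_request)
    try:
        reasoning_object = json.loads(raw)
    except json.JSONDecodeError:
        # Never lose the trace just because the model broke the output format.
        reasoning_object = {"answer": raw, "parse_error": True}
    log_semantic_event(user_request, reasoning_object)
    return reasoning_object
```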


Your Monday Morning Action. Run The Maturity Assessment

Block 2 hours with your technical and product leads this week.

  • Part 1: The Inventory (45 minutes). List every "AI" feature. Is it Tier 1, 2 or 3? Be honest - ignore the marketing narrative.

  • Part 2: The Dispute Test (45 minutes). Pick your most prominent Tier 3 pilot. If a customer disputes a result today, can you reconstruct the reasoning path? Business leaders: ask your CTO to show you the reasoning path for your top AI pilot; the answer determines whether you are ready to scale or just running an expensive experiment. CTOs: pick your highest-visibility Tier 3 system and role-play the dispute scenario with your team. If you cannot reconstruct reasoning today, this is your Q1 priority - not Q3, not 'when we have time.'

  • Part 3: The Gap Map (30 minutes). Identify the top three workflows where "black box" reasoning creates financial or legal risk. These are your Q1 2026 priorities.

Further Reading & Resources

  • OpenTelemetry Semantic Conventions for GenAI  (opentelemetry.io/docs/specs/semconv/gen-ai). The emerging industry standard for tracing LLM inputs, outputs and tool calls.

  • EU AI Act Regulatory Framework (digital-strategy.ec.europa.eu). Official guidance on transparency and explainability requirements for high-risk AI systems in 2026.

  • Model Context Protocol Specification (modelcontextprotocol.io). Technical documentation on the protocol enabling reasoning models to connect to data sources and tools.

  • Building MCP Servers in the Real World (newsletter.pragmaticengineer.com/p/mcp-deepdive): How engineering teams across industries - including travel companies like GetYourGuide - are implementing MCP servers, with honest discussion of security challenges and practical constraints.

 
 
 
