Langfuse March 20, 2026

How to Actually Read Your AI Agent’s Langfuse Dashboard: A Hands-On Guide for LLM Developers

You built four different agent architectures, ran a benchmark, and got scores of 100%, 96.7%, 96.7%, and 40%. Now what? You know “who’s better” but not “why.” This guide teaches you how to use Langfuse to open the black box.


Table of Contents

  1. Who This Guide Is For
  2. My Experiment Setup
  3. Stop 1: The Tracing Page
  4. Stop 2: Inside a Trace
  5. Stop 3: Inside the LLM Call
  6. Real-World Debug: A ReAct FAIL Case
  7. Stop 4: Comparing Four Agents
  8. Stop 5: Scores and Sessions Pages
  9. Advanced: Process-Level Scoring Beyond Pass/Fail
  10. Exporting Data for Further Analysis
  11. Self-Assessment: Do You Actually Know This?
  12. Appendix: Quick Troubleshooting Reference
  13. Appendix: Langfuse Page Navigation

Who This Guide Is For

You already know how to build agents with LangGraph (or any LLM framework) and you’ve run some evaluations. But you’ve realized that pass rates alone aren’t enough — you want to know what happened inside the agent at every step, where the money went, and which step went wrong.

If you haven’t set up a Langfuse account or sent any traces yet, this is not a setup tutorial. This guide starts where setup ends: your traces are in, the dashboard is open, now what?

My Experiment Setup

To give this walkthrough a concrete subject, I used a small math evaluation project: four agent architectures (CoT, ReAct, Constrained, and Multi-Agent), each answering the same 30 problems.

That makes 4 x 30 = 120 traces, all sent automatically to Langfuse. Below is what I saw when I opened the dashboard, how I read it, and every mistake I made along the way.


Stop 1: The Tracing Page — Your Trace List

Open Langfuse, click “Tracing” in the left sidebar. This is the page you’ll spend the most time on.

Tracing page

Page Layout

The page has several zones. These two matter most:

Left filter panel: This is your key tool for comparing agents. Under Trace Name, you’ll see all distinct trace names with their counts. You can also filter by Session ID, User ID, Metadata, Score, and more.

Center trace list: Each row is one agent answering one question. Columns include Timestamp, Name, Input/Output, and usually Score and Observation count on the right edge.

What to look for: Scan the Score column first. 1.0 means PASS, 0.0 means FAIL. The FAIL rows are the ones worth clicking into.

Gotcha: Make Sure You’re Looking at Your Own Project

The first time I opened Langfuse, I spent several minutes staring at traces before realizing I was looking at the Demo Project, not my own.

Demo Project

The clue is in the top-left corner — if it says “Langfuse Demo > langfuse-docs” instead of your org name, you’re in the demo. Click the dropdown to switch to your own project.

Gotcha: Input/Output Columns Are Empty

After switching to my own project, I noticed the Input and Output columns in the trace list were completely blank.

Empty Input/Output

Why: LangGraph’s CallbackHandler automatically records prompts and responses inside the inner ChatOpenAI observation, but it does not write them to the outermost trace level. If you don’t explicitly pass input and output parameters when creating the trace, the list page shows nothing.

Fix: Add an input parameter to start_as_current_observation(), and call span.update(output=...) after the agent finishes. Important: span.update() must be called inside the with span: block — I initially placed it outside, and because the span had already closed, the output never got written.
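The fix can be sketched as follows. This is a minimal sketch, not the original run_eval.py: it assumes the v3 Python SDK, where start_as_current_observation() is used as a context manager that yields a span with an update() method. The helper name run_traced, the agent_fn callable, and the as_type="span" argument are illustrative assumptions.

```python
from typing import Any, Callable


def run_traced(client: Any, name: str, question: str,
               agent_fn: Callable[[str], str]) -> str:
    """Run one question so the trace list shows both Input and Output.

    `client` is assumed to behave like the object returned by
    langfuse.get_client() in the v3 Python SDK.
    """
    # Passing input= here is what fills the Input column of the trace list.
    with client.start_as_current_observation(
        as_type="span", name=name, input=question
    ) as span:
        answer = agent_fn(question)
        # Still inside the `with` block, so the span is open and the
        # output is recorded. After the block exits it would be too late.
        span.update(output=answer)
    return answer
```

With a real client, a call might look like run_traced(get_client(), "eval-cot-001", question, my_agent), where my_agent is whatever function invokes your graph.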

How to Filter for Agent Comparison

With 120 traces in the list, you need to filter by agent type. If your trace names follow a pattern like eval-cot-001, eval-react-018, each name is unique and the filter panel lists 120 entries with count 1 each — you can’t select “all CoT” in one click.

Fix: Use TEXT mode for fuzzy search. The Trace Name filter has SELECT and TEXT buttons. Switch to TEXT mode and type cot to filter all traces whose names contain “cot”. Type react for all ReAct traces.

An even better approach is to use tags in your code (e.g., tags=["cot"]), so you can filter directly in the Tags section of the filter panel.
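To see why SELECT mode fails and TEXT mode works with this naming pattern, here is a small stand-alone sketch. The short agent names "constrained" and "multi" are assumptions (only eval-cot-NNN and eval-react-NNN appear in my traces above), and the two list comprehensions only imitate exact-match and substring filtering; they are not how Langfuse implements its filters.

```python
AGENTS = ["cot", "react", "constrained", "multi"]  # last two names are hypothetical


def trace_name(agent: str, question_no: int) -> str:
    """Build a name in the eval-<agent>-<NNN> pattern, e.g. eval-cot-001."""
    return f"eval-{agent}-{question_no:03d}"


def trace_tags(agent: str) -> list:
    """One tag per agent type, so the Tags filter can select a whole group."""
    return [agent]


names = [trace_name(a, n) for a in AGENTS for n in range(1, 31)]

# SELECT mode is an exact match: every name is unique, so one hit at most.
exact_hits = [n for n in names if n == "eval-cot-001"]

# TEXT mode acts like a substring search: "cot" matches all 30 CoT traces.
text_hits = [n for n in names if "cot" in n]

print(len(names), len(exact_hits), len(text_hits))  # 120 1 30
```

Tags avoid the substring trick entirely: every CoT trace carries the same tag value, so the Tags filter shows one entry with a count of 30.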


Stop 2: Inside a Trace — The Waterfall View

This is the most valuable screen in Langfuse. Click any row in the trace list to open its detail page.

Trace waterfall

Page Layout

Left side: Step tree

This is the full execution tree of the agent’s internals. The top level is the trace itself, with nested spans and generations below it. Each node shows:

Right side: Detail panel

Click any node on the left, and the right panel shows that step’s complete information: raw Input/Output content, Metadata, Scores.

Bottom-left: LangGraph flow diagram

Your graph structure is visualized here. A CoT agent shows __start__ -> reason -> __end__ as a straight line. A ReAct agent shows agent <-> tools loops with iteration counts (e.g., agent (2/2), tools (2/2)).

Key Concept: Trace Level vs. Observation Level

This was the concept that took me the longest to understand. A single trace contains multiple nested observations:

eval-cot-001          <- Trace level (outermost): what the Tracing list shows
 |-- eval-cot-001     <- Span: LangGraph wrapper
     |-- LangGraph    <- Span: graph execution
         |-- reason   <- Span: graph node
             |-- ChatOpenAI  <- Observation level (innermost): LLM call details live here

The Tracing list page reads from the outermost trace’s Input/Output. Clicking ChatOpenAI in the waterfall reads from the innermost observation’s Input/Output. They are different.

This distinction becomes critical when setting up LLM-as-a-Judge later — choosing “Run on Traces” vs. “Run on Observations” gives the evaluator completely different data.
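The tree above can be modelled in a few lines to make the difference concrete. This is a simplified stand-in, not Langfuse's data model: the Node class and its field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A simplified stand-in for one node in the step tree."""
    name: str
    kind: str  # "TRACE", "SPAN", or "GENERATION"
    input: Optional[str] = None
    output: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def generations(node: Node) -> List[Node]:
    """Collect every GENERATION node at or below `node`."""
    found = [node] if node.kind == "GENERATION" else []
    for child in node.children:
        found.extend(generations(child))
    return found


llm = Node("ChatOpenAI", "GENERATION",
           input="system prompt + question", output="Step 1 ... ANSWER: 4")
trace = Node("eval-cot-001", "TRACE", children=[
    Node("eval-cot-001", "SPAN", children=[
        Node("LangGraph", "SPAN", children=[
            Node("reason", "SPAN", children=[llm]),
        ]),
    ]),
])

print(trace.input, trace.output)             # None None: what the list page reads
print([g.name for g in generations(trace)])  # ['ChatOpenAI']: where the text lives
```

An evaluator that runs on traces reads the top node; one that runs on observations reads what generations() returns.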

Different Agents Have Different Trees

CoT agent trees are shallow: One reason node with a single ChatOpenAI call inside. Simple and direct.

ReAct agent trees have loops:

ReAct tree expanded

The structure is agent -> RunnableSequence -> ChatOpenAI (first LLM call), then tools -> calculator (tool execution), then another agent (second LLM call). should_continue is LangGraph’s routing decision — “Did the LLM request a tool call? If yes, continue the loop; if no, end.”
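The routing decision is easy to picture as a function. The sketch below uses plain dicts as a stand-in for LangChain message objects and a string constant for the end node, so it runs without LangGraph installed. Treat it as an illustration of the rule, not the code from my project.

```python
from typing import Any, Dict

END = "__end__"  # label of the terminal node in the flow diagram


def should_continue(state: Dict[str, Any]) -> str:
    """Route to the tools node if the last message requested a tool call."""
    last_message = state["messages"][-1]
    if last_message.get("tool_calls"):
        return "tools"
    return END


first_round = {"messages": [
    {"role": "assistant", "tool_calls": [{"name": "calculator", "args": "0.65 * 40"}]},
]}
second_round = {"messages": [
    {"role": "assistant", "content": "ANSWER: 20", "tool_calls": []},
]}

print(should_continue(first_round))   # tools
print(should_continue(second_round))  # __end__
```

Each "tools" result adds one more agent + tools pair to the waterfall, which is why ReAct trees grow with the number of tool rounds.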

Multi-Agent trees are the deepest: Each sub-agent (planner, calculator, verifier) is its own subtree nested inside the main graph.


Stop 3: Inside the LLM Call — Seeing the Prompt and Response

In the step tree, find the ChatOpenAI node (or whatever LLM provider you use) and click it. This is the single most important layer for debugging.

ChatOpenAI detail page

Top Metric Bar

A row of badges shows everything about this LLM call:

System / User / Assistant

The right panel expands the full LLM conversation into three sections:

System: Your prompt template. Verify it matches what you intended — sometimes prompts don’t get injected correctly, and this is where you’ll catch it. Click “Expand system prompt” for the full text.

User: The actual input question.

Assistant: The LLM’s complete response. For a CoT agent, you’ll see the step-by-step reasoning (Step 1, Step 2…). This is the core value of CoT — the reasoning process is made visible. If the answer is wrong, you can pinpoint exactly which step went off track.

Metadata Section

Metadata details

Scroll down to find metadata. Important fields include: langgraph_node (which graph node this is), ls_provider (LLM provider), ls_model_name (model), ls_temperature. These can be used for filtering and grouping later.

ReAct-Specific: Tool Call Arguments

In a ReAct agent’s first-round ChatOpenAI response, you’ll see the tool call function names and arguments. For example:

1. calculator
   Arguments: 0.65 * 40

2. calculator
   Arguments: 0.75 * (0.65 * 40)

This tells you what calculation the LLM “wanted to perform.” If the expression is wrong, the bug is here. If the expression is correct but the final answer is wrong, the problem is in the next LLM call (interpreting the tool result).
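To check an expression you copied out of the Arguments field, you can re-evaluate it yourself. The helper below is a hypothetical re-check tool, not the calculator the agent uses. It handles only +, -, *, / and parentheses, and walks the syntax tree instead of calling eval().

```python
import ast
import operator

_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expression: str) -> float:
    """Evaluate +, -, *, / arithmetic without calling eval()."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression, mode="eval"))


for args in ["0.65 * 40", "0.75 * (0.65 * 40)"]:
    print(args, "->", evaluate(args))
```

If the value printed here matches what the calculator node returned, the tool call was fine and the next place to look is the following LLM call.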


Real-World Debug: A ReAct FAIL Case

Let me walk through a real debugging case. This is the actual path I followed to locate a root cause in Langfuse.

Spotting the Problem

In the Tracing page, I saw eval-react-018 with a score of correctness: 0.00.

ReAct FAIL trace

Several things stand out: correctness: 0.00 in the top-left confirms a wrong answer. The step tree is much deeper than CoT — two rounds of agent + tools loops. Token usage is 779 (2.4x CoT’s 321). Cost is $0.0035. The bottom-left graph shows agent (2/2) and tools (2/2).

First LLM Call: Reasoning Was Correct

The problem: “In a class of 40 students, 65% passed the exam. Of those who passed, 75% scored above 80. How many students scored above 80?”

Clicking the first ChatOpenAI node, the LLM’s reasoning and tool call arguments were both correct: 0.65 * 40 -> calculator returned 26, 0.75 * (0.65 * 40) -> calculator returned 19.5.

Second LLM Call: Here’s Where It Went Wrong

The LLM received 19.5 from the calculator and responded:

Since the number of students must be a whole number, we interpret this as approximately 20 students scoring above 80. ANSWER: 20

The LLM decided on its own that “student count can’t be a decimal” and rounded 19.5 up to 20. But the ground truth was 19.5.

Conclusion

This isn’t a reasoning error or a tool call error. It’s a conflict between the agent’s “helpful” behavior and the evaluation standard. The ReAct agent made a reasonable real-world inference, but the eval expected the precise mathematical answer.

Without Langfuse: You only know “ReAct got question 18 wrong.”

With Langfuse: In under two minutes you can trace the path: trace list -> click the FAIL trace -> step tree -> first-round reasoning correct -> tool call arguments correct -> calculator returned 19.5 -> second-round LLM rounded up -> root cause identified.

Then you can decide: fix the agent’s prompt (tell it not to round), or fix the ground truth (accept 20 as correct).
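If you go with the second option, the grading function needs a lenient mode. Here is one way it might look. The accept_rounded flag and the floor/ceil policy are assumptions for illustration, not the checker my eval actually uses.

```python
import math


def is_correct(predicted: float, truth: float,
               accept_rounded: bool = False, tol: float = 1e-6) -> bool:
    """Compare an answer to the ground truth.

    With accept_rounded=True, a prediction also passes when it equals the
    ground truth rounded down or up (a hypothetical lenient policy).
    """
    if math.isclose(predicted, truth, abs_tol=tol):
        return True
    if accept_rounded:
        return predicted in (math.floor(truth), math.ceil(truth))
    return False


print(is_correct(20, 19.5))                       # False: strict grading
print(is_correct(20, 19.5, accept_rounded=True))  # True: lenient grading
```

Whichever policy you pick, apply it to all four agents, otherwise the accuracy numbers stop being comparable.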


Stop 4: Comparing Four Agents

Results after all 120 runs (30 questions x 4 agents):

| Metric | CoT | ReAct | Constrained | Multi-Agent |
| --- | --- | --- | --- | --- |
| Accuracy | 96.7% | 96.7% | 100.0% | 40.0% |
| Avg Latency | 3.6s | 5.2s | 4.9s | 5.5s |
| Avg Tokens / Question | 465 | 1,468 | 1,756 | 1,394 |
| Avg Cost / Question | $0.004 | $0.006 | $0.006 | $0.006 |
| LLM Calls / Question | 1.1 | 2.9 | 3.5 | 3.1 |
| Reasoning Quality (LLM Judge) | 0.98 | 0.95 | 0.91 | 0.97 |
| Waterfall Depth | Shallow (1 LLM call) | Medium (multi-round loops) | Medium | Deep (nested sub-graphs) |

Several counterintuitive findings:

Constrained scored 100% accuracy but the lowest Reasoning Quality (0.91). It used the most tokens (1,756/question) and the most LLM calls (3.5/question) to achieve a perfect score. It works like a student who brute-forces every problem — the process isn’t elegant, but the answers are always right.

Multi-Agent scored only 40% accuracy but a high Reasoning Quality of 0.97. This is the most counterintuitive result — each sub-agent’s individual reasoning quality is high, but the combined result is terrible. The problem lies in inter-agent communication, not in any single agent’s reasoning ability. This is exactly the value of process-level scoring: if you only look at accuracy, you’d think Multi-Agent is “the dumbest”; but Reasoning Quality tells you it “thinks well but collaborates poorly.”

CoT is the overall winner. Fastest (3.6s), fewest tokens (465), cheapest ($0.004), highest Reasoning Quality (0.98), and 96.7% accuracy. The only question it got wrong was due to a rounding edge case. On this benchmark, the simplest architecture was the best.

ReAct used 3x the tokens but matched CoT’s accuracy exactly. ReAct averaged 2.9 LLM calls per question (CoT needed only 1.1), consuming 3x the tokens, yet achieved the same 96.7%. The extra tool calls provided no accuracy advantage on simple math problems — though they might make all the difference on tasks requiring external data retrieval.
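The "3x the tokens" claim is easy to check from the table. The snippet below copies the accuracy and token averages and also derives tokens per correct answer. That second number is an extra illustrative metric I compute here, not a score stored in Langfuse.

```python
# Averages per question, copied from the comparison table.
results = {
    "CoT":         {"accuracy": 0.967, "tokens": 465},
    "ReAct":       {"accuracy": 0.967, "tokens": 1468},
    "Constrained": {"accuracy": 1.000, "tokens": 1756},
    "Multi-Agent": {"accuracy": 0.400, "tokens": 1394},
}

baseline = results["CoT"]["tokens"]
for agent, row in results.items():
    token_ratio = row["tokens"] / baseline
    # Tokens spent per correct answer: a derived, illustrative metric.
    tokens_per_correct = row["tokens"] / row["accuracy"]
    print(f"{agent:<12} {token_ratio:4.1f}x tokens  "
          f"{tokens_per_correct:7.0f} tokens per correct answer")
```

ReAct comes out at roughly 3.2x CoT's tokens. Multi-Agent looks far worse once its 40% accuracy is factored in, because most of its tokens were spent on wrong answers.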


Stop 5: Scores and Sessions Pages

Scores Page

Click “Scores” in the left sidebar. This aggregates all score data across your traces. The correctness, latency_seconds, and any process-level scores you sent via create_score() appear here. You can filter by trace name to see each agent’s pass rate.

Sessions Page

If your code sets a session_id (e.g., eval-3f66d0fe), all traces from the same eval run are grouped into one session. This makes it easy to compare “this run’s results vs. the last run after I changed the prompt.”

Dashboard Page

The Dashboard provides aggregate metrics: total traces, average latency, total cost, score trends. Use the time filter to lock onto your eval run’s time range.


Advanced: Process-Level Scoring Beyond Pass/Fail

After running all 120 evaluations and reviewing the comparison table, you might ask: “Correctness only tells me right or wrong, but how do I know if the agent’s reasoning quality is good?”

That’s the right question. An agent might get the right answer with terrible reasoning (lucky guess), and another might get the wrong answer with nearly perfect reasoning (like the rounding issue in question 18). Pass rate alone can’t distinguish these two cases.

Two Independent Questions: “How to Compute” and “How to Deliver”

This is a core concept that took me a while to grasp.

“How to compute the score” has two approaches: Rule-based (deterministic metrics calculated with code) and LLM-as-a-Judge (another LLM reads the reasoning process and grades it).

“How to deliver the score to Langfuse” also has two approaches: Use Langfuse’s built-in evaluator UI, or call the create_score() API from your own code.

These two dimensions are independent. You can use Langfuse’s UI to run an LLM judge, or you can call Gemini from your own code and send the score back via create_score(). The result on the Scores page looks identical. Langfuse doesn’t care where the score came from — it just stores and displays.
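The delivery side can be as small as one loop. This sketch assumes a client exposing create_score(trace_id=..., name=..., value=...), which is how I call it in the v3 Python SDK. The send_scores helper itself is hypothetical.

```python
from typing import Any, Mapping


def send_scores(client: Any, trace_id: str, scores: Mapping[str, float]) -> int:
    """Attach each named score to one trace and return how many were sent.

    `client` is assumed to expose create_score(trace_id=..., name=..., value=...).
    """
    for name, value in scores.items():
        client.create_score(trace_id=trace_id, name=name, value=float(value))
    return len(scores)
```

The function never asks how a value was produced. A rule-based count and a judge's 0.9 travel through the same call, which is why they look identical on the Scores page.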

Method 1: Rule-Based Scoring

I added four process-level metrics to run_eval.py, automatically computed after each question and sent back to Langfuse:

| Score Name | Meaning | How It’s Computed |
| --- | --- | --- |
| tool_call_count | How many tool calls the agent made | Count ToolMessage instances in messages |
| reasoning_steps | How many reasoning steps were taken | Count “Step 1”, “Step 2” patterns |
| answer_clarity | Whether the answer was cleanly formatted | Check if final_answer field contains ANSWER format |
| efficiency | Overall efficiency (0-1) | Agent-type-specific formula penalizing excess tool calls or time |

These don’t require changing the agent code — the scoring logic runs in the eval runner after the agent finishes. The agent doesn’t need to know it’s being graded.

| | Rule-Based | LLM-as-a-Judge |
| --- | --- | --- |
| What it can evaluate | Quantifiable metrics (counts, time, format) | Subjective quality (reasoning clarity, logical rigor) |
| Cost | Zero | Requires additional LLM tokens |
| Consistency | 100% deterministic | May vary slightly between runs |
| Requires re-running agents? | Yes | Not necessarily |
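Three of the four rule-based metrics can be sketched in a few lines. This is a simplified approximation, not the code from run_eval.py: messages are plain dicts with a "type" and "content" instead of LangChain message objects, the two regular expressions are my guesses at reasonable patterns, and efficiency is left out because its formula is agent-specific.

```python
import re
from typing import Dict, List

STEP_PATTERN = re.compile(r"\bStep\s+\d+")
ANSWER_PATTERN = re.compile(r"ANSWER:\s*-?\d+(?:\.\d+)?")


def process_scores(messages: List[dict], final_answer: str) -> Dict[str, float]:
    """Compute three rule-based process metrics from one finished run.

    `messages` is a simplified stand-in for the agent's message history:
    dicts with a "type" ("ai", "tool", ...) and a "content" string.
    """
    ai_text = " ".join(m["content"] for m in messages if m["type"] == "ai")
    return {
        "tool_call_count": float(sum(m["type"] == "tool" for m in messages)),
        "reasoning_steps": float(len(STEP_PATTERN.findall(ai_text))),
        "answer_clarity": 1.0 if ANSWER_PATTERN.search(final_answer) else 0.0,
    }


history = [
    {"type": "ai", "content": "Step 1: find 65% of 40. Step 2: take 75% of that."},
    {"type": "tool", "content": "26"},
    {"type": "tool", "content": "19.5"},
    {"type": "ai", "content": "ANSWER: 20"},
]
print(process_scores(history, "ANSWER: 20"))
```

Because the function only reads the finished message history, it can live in the eval runner and leave the agent code untouched.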

Method 2: LLM-as-a-Judge

Langfuse has a built-in LLM-as-a-Judge feature. Here’s the complete setup flow.

Step 1: Create an LLM Connection

On the LLM-as-a-Judge page in the left sidebar, first establish a connection to an LLM provider.

LLM Connection

I used Google AI Studio’s free tier. Set the LLM adapter to google-ai-studio, enter your API key, and leave everything else at default.

Step 2: Choose a Model

Model selection

I recommend starting with gemini-2.5-flash — cheap, fast, and sufficient for evaluating math reasoning. Upgrade to gemini-2.5-pro later if quality isn’t good enough.

Step 3: Create a Custom Evaluator

Evaluator template list

Langfuse offers built-in evaluator templates (Correctness, Hallucination, Relevance, etc.), but these are designed for general-purpose use. For math reasoning evaluation, you need a custom rubric. Click “+ Create Custom Evaluator” in the bottom-right.

Custom evaluator form

Write your rubric. My prompt uses both {{input}} (the problem) and {{output}} (the LLM’s reasoning response), so the judge can evaluate reasoning against the actual problem statement.

Step 4: Variable Mapping

Variable mapping

Each {{variable}} needs an Object Field assignment. The options are Input, Output, and Metadata.

Gotcha: Object Field Set to the Wrong Value

I initially set {{output}}’s Object Field to Input — which meant the judge received the prompt sent to the LLM, not the response from it. The Preview section helps you verify what each variable actually contains: blue text shows the input mapping, orange text shows the output mapping.
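The effect of a wrong mapping is easy to reproduce outside Langfuse. The sketch below imitates variable substitution with a regular expression. The rubric text, the render helper, and the sample observation are all hypothetical, and this is not how Langfuse fills templates internally.

```python
import re

# Hypothetical rubric; substitute your own evaluator prompt.
RUBRIC = (
    "Problem:\n{{input}}\n\n"
    "Reasoning to grade:\n{{output}}\n\n"
    "Rate the reasoning quality from 0 to 1."
)


def render(template: str, observation: dict, mapping: dict) -> str:
    """Fill each {{variable}} from the observation field it is mapped to."""
    def fill(match: "re.Match") -> str:
        source_field = mapping[match.group(1)]
        return str(observation[source_field])
    return re.sub(r"\{\{(\w+)\}\}", fill, template)


observation = {
    "Input": "What is 65% of 40?",
    "Output": "Step 1: 0.65 * 40 = 26. ANSWER: 26",
}

correct = render(RUBRIC, observation, {"input": "Input", "output": "Output"})
wrong = render(RUBRIC, observation, {"input": "Input", "output": "Input"})

print("ANSWER: 26" in correct)  # True: the judge sees the response
print("ANSWER: 26" in wrong)    # False: the judge sees the prompt twice
```

The wrong mapping still renders a valid prompt, so nothing errors out. The judge just grades the wrong text, which is why the Preview check matters.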

Gotcha: Evaluator Only Scores New Traces

Evaluator run config

If the “Run on live incoming observations” toggle is enabled, the evaluator only grades observations that arrive after the evaluator was created. It will not retroactively process existing traces. If you want the judge to evaluate all traces, set up the evaluator before running your eval.

My approach: create the evaluator first, then re-run the full eval. As new traces come in, the evaluator grades them automatically.

Step 5: Verify Results

Judge results

My evaluator finished with 315 results. Why 315 instead of 120? Because it was configured to Run on Observations with a Type = GENERATION filter — it graded each individual ChatOpenAI call, not each trace. ReAct has 2-3 GENERATION observations per trace, Multi-Agent has even more. 120 traces x ~2.6 GENERATION observations each = ~315.

This is actually better — you can see that ReAct’s “first round: decide which tool to call” and “second round: interpret the result” each get their own Reasoning Quality score.
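As a sanity check, the count can be estimated from the LLM Calls / Question row of the comparison table. This assumes one GENERATION observation per LLM call. The table values are rounded to one decimal, so the estimate will not match 315 exactly.

```python
QUESTIONS = 30
# Average LLM calls per question, from the comparison table (rounded to 0.1).
avg_llm_calls = {"CoT": 1.1, "ReAct": 2.9, "Constrained": 3.5, "Multi-Agent": 3.1}

expected = {agent: round(QUESTIONS * calls) for agent, calls in avg_llm_calls.items()}
total = sum(expected.values())

print(expected)
print(f"estimated {total} GENERATION observations; "
      f"observed 315, i.e. {315 / 120:.3f} per trace")
```

The estimate lands within a few observations of 315, which is what you would expect from averages rounded to one decimal place.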

Gotcha: Evaluator Log Is Empty

After creating my evaluator, I clicked into its log and found zero results. Two possible causes: Hobby plan limitations (requires upgrade to run), or the evaluator is still queued for processing. I refreshed after ten minutes and results started appearing. If nothing shows up, the alternative is to call the Gemini API from your own code and send scores back via create_score().


Exporting Data for Further Analysis

Gotcha: CSV Export Contains Incomplete Data

Langfuse supports CSV export. But I exported several times and kept getting partial data.

The first time, I got only 30 rows (only the constrained agent) — because a filter was still active when I exported.

The second time, I got 133 AGENT-type observations with no token/cost data and no scores — because I exported from the Observations tab, and scores are attached at the trace level, not the observation level.

The correct approach: Go to Tracing -> Traces (not Observations), clear all filters, then export. That CSV will include all scores (correctness, tool_call_count, efficiency, etc.) for each trace.
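A quick check after each export would have caught both of my mistakes. The sketch below uses only the standard library. The column names you pass in are assumptions, since the header names depend on your export, and check_export is a hypothetical helper.

```python
import csv
import io
from typing import List


def check_export(csv_text: str, expected_rows: int,
                 required_columns: List[str]) -> List[str]:
    """Return a list of problems found in an exported CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for column in required_columns:
        if column not in (reader.fieldnames or []):
            problems.append(f"missing column: {column}")
        elif all(not row[column] for row in rows):
            problems.append(f"column is empty in every row: {column}")
    return problems
```

For a real file, read it first, for example check_export(open("traces.csv").read(), 120, ["correctness"]), where the file name and column name are placeholders. A leftover filter shows up as a row-count problem, and an export from the wrong tab shows up as missing or empty score columns.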


Self-Assessment: Do You Actually Know This?

The following 8 scenario questions all come from real usage experience. Each has one correct answer. If you can answer all of them correctly, you’re ready to navigate Langfuse dashboards independently.

Fundamentals

Scenario 1: You discover that the ReAct agent got question 18 wrong. You want to see how it interpreted the calculator’s return value of 19.5 in its second LLM call. What do you do?

Answer: B. The core debug path is Tracing -> click into a specific trace -> find the relevant LLM call in the waterfall -> read the full prompt and response. The Scores page can only tell you “right or wrong” — it can’t show you the reasoning process.

Scenario 2: You’ve run all 120 evaluations and want to quickly view only the CoT agent’s traces. But the Trace Name filter lists each name with a count of 1 (because names like eval-cot-001 are unique). What do you do?

Answer: B. The Trace Name filter has SELECT and TEXT modes. SELECT is exact match; TEXT supports fuzzy search — typing cot filters all traces whose names contain “cot.”

Scenario 3: Your Tracing list shows completely blank Input/Output columns, but you’re certain the agents ran successfully. What’s the most likely cause?

Answer: B. LangGraph’s CallbackHandler automatically records prompts and responses inside the inner ChatOpenAI observation, but does not write them to the outermost trace. The Tracing list reads from the trace level’s Input/Output, so it’s empty. Fix this by explicitly passing an input parameter when creating the trace and calling span.update(output=...) before the span closes.

Scenario 4: In your LLM-as-a-Judge setup, you set {{output}}’s Object Field to Input. What happens?

Answer: C. Object Field determines which field of the observation the variable pulls from. Input = the prompt sent to the LLM. Output = the LLM’s response. If you pick the wrong one, the judge is evaluating prompt quality instead of reasoning quality. Use the Preview feature to verify what each variable actually contains.

Advanced

Scenario 5: Your LLM-as-a-Judge evaluator is configured with Run on Observations, Type = GENERATION. You ran 120 traces, but the evaluator shows 315 completions. Why?

Answer: B. One CoT trace has ~1 GENERATION, one ReAct trace has ~3, Multi-Agent has ~3. 120 traces x ~2.6 average GENERATION observations = ~315. The evaluator grades each GENERATION observation individually, not each trace.

Scenario 6: The comparison table shows Constrained agent with 100% accuracy but only 0.91 Reasoning Quality (the lowest of all four). What does this mean?

Answer: B. This is precisely the value of process-level scoring — accuracy and reasoning quality are two independent dimensions. Constrained averaged 3.5 LLM calls and 1,756 tokens per question (CoT needed only 1.1 calls and 465 tokens). It works harder but doesn’t make mistakes. The gap between outcome and process is only visible when you look at both scores simultaneously.

Scenario 7: You want to export the complete CSV for this eval run, including each trace’s correctness and process scores. You export from the Observations tab, but all score columns are empty. Why?

Answer: C. Trace level and Observation level are two different concepts in Langfuse. create_score() is called with a trace_id, so scores attach to traces. To export a CSV with scores, go to Tracing -> Traces and export from there, not from the Observations tab.

Scenario 8: You created the evaluator and then re-ran the full eval (30 questions x 4 agents). But you notice that the earlier test trace you ran manually (eval-cot-001) has no judge score. Why?

Answer: B. “Run on live incoming observations” means only new arrivals get evaluated — no retroactive processing. This is by design, not a bug. If you want all traces evaluated, set up the evaluator before running your eval.


Appendix: Quick Troubleshooting Reference

| Problem | Cause | Fix |
| --- | --- | --- |
| Trace list Input/Output is empty | Code didn’t pass input/output to trace | Add input= parameter to start_as_current_observation() |
| span.update() has no effect | Called outside the with span: block | Move it inside the with block, before the span closes |
| Trace Name filter shows count 1 for each | Trace names are unique (e.g., eval-cot-001) | Switch to TEXT mode for fuzzy search |
| LLM-as-a-Judge log is empty | Hobby plan limitation or still queued | Wait, or write your own LLM judge code + create_score() |
| Evaluator didn’t score old traces | “Run on live” only evaluates new arrivals | Set up evaluator before running eval |
| Judge ran 315 times instead of 120 | Run on Observations; each trace has multiple GENERATIONs | This is normal: each LLM call is scored individually |
| CSV export has incomplete data | Filters still active, or exported from Observations tab | Go to Traces page, clear all filters, then export |
| {{output}} received the prompt instead of the response | Object Field set to Input instead of Output | Change to Output; use Preview to verify |

Appendix: Langfuse Page Navigation

| What you want to find | Where to look |
| --- | --- |
| All traces | Tracing -> Traces |
| Traces for a specific agent | Tracing -> Trace Name in TEXT mode |
| Step-by-step agent internals | Click into trace -> step tree -> ChatOpenAI |
| Full LLM prompt and response | ChatOpenAI node -> Preview -> System/User/Assistant |
| Token usage and cost | ChatOpenAI node -> top metric bar |
| Each agent’s pass rate | Scores page, filter by correctness |
| Process-level metrics | Scores page, filter by tool_call_count / efficiency |
| LLM judge reasoning quality scores | Scores page, filter by Reasoning Quality |
| All results from one eval run | Sessions page |
| Overall cost and latency trends | Dashboard page |
| LangGraph graph structure | Trace detail page, bottom-left flow diagram |
| Setting up LLM-as-a-Judge | Left sidebar -> LLM-as-a-Judge -> + Set up evaluator |