Observability for On‑Prem LLMs: Using Arize Phoenix with Ollama





How to Use Arize Phoenix with Ollama

A Practical Guide for LLM‑Ops Engineers and Data Scientists


1. Introduction

Arize Phoenix is an open‑source observability platform that lets teams monitor, debug, and evaluate large‑language‑model (LLM) applications. It can record traces, run automatic evaluations, and surface visual insights that help you spot drift, bias, or performance regressions.

Ollama is a lightweight runtime for serving open‑weight LLMs locally, and it exposes an OpenAI‑compatible API. By running Ollama locally, you can keep your data in‑house, cut inference costs, and experiment quickly.
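
Because the endpoint is OpenAI‑compatible, any OpenAI SDK can talk to a local model. A minimal sketch, assuming the openai Python package is installed and a model such as llama3.1:8b has already been pulled:

from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
# The API key is required by the client but ignored by Ollama.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.1:8b",  # any model you have pulled with Ollama
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)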

Combining Phoenix with Ollama gives you:

| Feature | Phoenix | Ollama |
| --- | --- | --- |
| Trace collection | OTLP‑compatible ingestion | Works with any SDK that speaks the OpenAI API |
| Model evaluation | Pre‑built templates (relevance, faithfulness, toxicity, etc.) | Your local model can serve as the evaluator |
| Visualization | Embedding heatmaps, trace graphs, metrics dashboards | Immediate feedback on local prompts |
| Cost | Free, open source | Zero cloud‑usage costs |

2. What Is Arize Phoenix?

Phoenix is built on OpenTelemetry and ingests traces over OTLP (the OpenTelemetry protocol). It provides:

  • Trace ingestion – collect request‑response pairs from any framework (LangChain, LlamaIndex, DSPy, etc.).
  • Automatic evaluation – run your LLM output against a prompt or reference set using a library of templates (faithfulness, toxicity, coherence, etc.).
  • Embeddings visualizer – cluster analysis and dimensionality reduction of user queries or knowledge‑base documents.
  • Dashboards – metrics such as latency, error rates, accuracy, and drift alerts.

Phoenix is intentionally “playground‑first”: you can spin up a local UI and test everything before deploying to production.
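
For quick experiments you can launch that playground straight from Python. A minimal sketch, assuming the arize-phoenix package from Section 5 is installed:

import phoenix as px

# Start a local Phoenix server (UI plus OTLP collector) in the background
session = px.launch_app()
print(session.url)  # typically http://localhost:6006/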


3. Why Combine Phoenix with Ollama?

| Pain Point | Why Phoenix Helps | Why Ollama Helps |
| --- | --- | --- |
| Latency | Visualize and compare latency distributions across models | Run inference locally, no network round‑trip |
| Data privacy | Store traces locally, no third‑party transmission | Keep data on‑premises |
| Cost | Free tooling | Zero cloud inference cost |
| Rapid iteration | Playground allows instant parameter tweaks | Quick local inference without API throttling |

4. Prerequisites

  1. Python 3.10+ (recommended in a virtual environment).
  2. Docker (optional, for running Phoenix locally).
  3. Ollama installed locally – see https://ollama.ai/ (a quick connectivity check follows this list).
  4. An API key for a hosted provider (e.g., OpenAI), only if you also want to run evaluations against a cloud model (optional).
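
The connectivity check mentioned above can be a couple of lines against Ollama's native /api/tags endpoint (a sketch; the requests package is assumed):

import requests

# Ollama lists the locally available models at /api/tags
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()
models = [m["name"] for m in resp.json().get("models", [])]
print("Available models:", models)  # e.g. ['llama3.1:8b', 'nomic-embed-text:latest']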

5. Installing Phoenix

Phoenix can be installed as a Python package or run in a Docker container.
The Python route is easiest for experimentation:

python -m venv venv
source venv/bin/activate
pip install "arize-phoenix[evals,llama-index]"  # pulls in core, evals, and LlamaIndex integration

Alternatively, run the prebuilt Docker image:

docker run -d -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest

Once the container is running, open the UI at http://localhost:6006 (port 4317 accepts OTLP traces over gRPC).


6. Configuring Phoenix to Use Ollama

Phoenix treats any OpenAI‑compatible endpoint as a “provider.”
Ollama exposes an OpenAI‑compatible endpoint at http://localhost:11434/v1.

6.1 Set Environment Variables

export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama  # any non-empty value works; Ollama does not check the key

6.2 Create a Prompt Playground Session

  1. In the Phoenix UI, click Playground → New Session.
  2. Under AI Provider, select Custom.
  3. Enter the base URL and API key above.
  4. Choose a model from the list (e.g., llama3.1:8b).

You can now send prompts directly to your local Ollama instance from the Phoenix UI and immediately see the trace, latency, and evaluation results.


7. Sending Traces from Your Own Code

Phoenix integrates with frameworks such as LlamaIndex and LangChain. In current releases this is done with an OpenInference instrumentor that exports every LLM and retrieval call over OTLP (older LlamaIndex versions shipped a dedicated Phoenix callback handler). A sketch for LlamaIndex, assuming the packages from Section 5 are installed:

from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding

# Export spans to the local Phoenix instance and instrument LlamaIndex
tracer_provider = register(project_name="ollama-rag", endpoint="http://localhost:6006/v1/traces")
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

# Use local Ollama models for both generation and embeddings
Settings.llm = Ollama(model="llama3.1:8b", base_url="http://localhost:11434", request_timeout=120.0)
Settings.embed_model = OllamaEmbedding(model_name="nomic-embed-text", base_url="http://localhost:11434")

documents = SimpleDirectoryReader("./data").load_data()  # hypothetical folder with your documents
index = VectorStoreIndex.from_documents(documents)       # build your RAG index

query = "Explain the benefits of using local LLMs."
response = index.as_query_engine().query(query)
print(response)

All requests will be automatically sent to Phoenix via OTLP.
You’ll see each trace appear in the Traces tab, complete with timestamps, request/response payloads, and any evaluation metrics you have configured.
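
If you call Ollama through the plain OpenAI client rather than a framework, the same pattern works with the OpenAI instrumentor. A sketch, assuming the openinference-instrumentation-openai package is installed:

from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Export spans to the local Phoenix instance and instrument the OpenAI client
tracer_provider = register(project_name="ollama-raw", endpoint="http://localhost:6006/v1/traces")
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
reply = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is observability?"}],
)
print(reply.choices[0].message.content)  # the call now appears in the Traces tab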


8. Evaluating Responses with Phoenix

Phoenix ships with a rich library of evaluation templates, e.g., RAG_RELEVANCY_PROMPT_TEMPLATE. You can also write your own.

8.1 Using a Built‑In Template

import pandas as pd
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    OpenAIModel,
    llm_classify,
)

# Assume `query` and `retrieved_text` are strings: one row per (query, document) pair to judge
df = pd.DataFrame({"input": [query], "reference": [retrieved_text]})

judge = OpenAIModel(
    model="llama3.1:8b",
    base_url="http://localhost:11434/v1",  # passed through to the underlying OpenAI client
    api_key="ollama",
)

relevance = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=judge,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
print(relevance[["label", "explanation"]])

The labels are not uploaded automatically; log them back to Phoenix so you can compare evaluations across runs.
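
For example, a minimal sketch, assuming the evaluation dataframe is indexed by the span IDs of the traces it refers to:

import phoenix as px
from phoenix.trace import SpanEvaluations

# Attach the relevance labels to their spans so they appear alongside each trace
px.Client().log_evaluations(
    SpanEvaluations(eval_name="RAG Relevancy", dataframe=relevance)
)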

8.2 Custom Evaluation Prompts

Create a prompt that asks the LLM to score its own answer:

CUSTOM_PROMPT = """
You are evaluating the following answer to a user query:
Q: {query}
A: {answer}
Rate the answer on a scale of 0–10 for relevance and factual accuracy.
Return a JSON object: {{"relevance": int, "accuracy": int}}
"""

import json
import pandas as pd
from phoenix.evals import OpenAIModel, llm_generate

df = pd.DataFrame({"query": [query], "answer": [answer]})  # column names match {query}/{answer} in the template
scores = llm_generate(
    dataframe=df,
    template=CUSTOM_PROMPT,
    model=OpenAIModel(model="llama3.1:8b", base_url="http://localhost:11434/v1", api_key="ollama"),
    output_parser=lambda output, row_index: json.loads(output),  # parse the JSON the model returns
)
print(scores)  # dataframe with 'relevance' and 'accuracy' columns

9. Visualizing Embeddings

Phoenix’s Embedding Visualizer helps you understand how your data is clustered.

  1. Generate embeddings for your queries or documents (e.g., with a local embedding model served by Ollama; a sketch follows this list).
  2. Load them into Phoenix by describing your dataframe with a schema and launching the app:

import pandas as pd
import phoenix as px

# df has one row per document: "text", "vector" (the embedding), plus optional label columns such as "topic"
schema = px.Schema(
    embedding_feature_column_names={
        "document_embedding": px.EmbeddingColumnNames(
            vector_column_name="vector",
            raw_data_column_name="text",
        )
    }
)
inferences = px.Inferences(dataframe=df, schema=schema, name="my_docs")
px.launch_app(primary=inferences)

  3. In the UI, open the my_docs inferences and click the document_embedding feature.
     You’ll see a 2‑D/3‑D scatter plot, cluster boundaries, and the ability to filter by label.
     Use this to spot outliers or verify that your RAG knowledge base covers the query space.
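
A sketch of step 1 using Ollama’s native embeddings endpoint, assuming an embedding model such as nomic-embed-text has been pulled:

import pandas as pd
import requests

texts = ["Local LLMs keep data on-premises.", "Phoenix visualizes traces and embeddings."]

def embed(text: str) -> list[float]:
    # POST /api/embeddings returns {"embedding": [...]} for a single prompt
    r = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=30,
    )
    r.raise_for_status()
    return r.json()["embedding"]

df = pd.DataFrame({"text": texts, "vector": [embed(t) for t in texts]})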

10. Advanced Use Cases

| Scenario | How Phoenix Helps | Tips |
| --- | --- | --- |
| RAG system debugging | Trace each retrieval step, compare retrieved docs to ground truth | Use the LlamaIndex instrumentation from Section 7 to see which docs were fetched |
| Bias & fairness monitoring | Run periodic evaluations with labeled prompts | Store evaluation metrics in Phoenix, alert on drift |
| Latency SLA enforcement | Continuous latency dashboards, threshold alerts | Feed an external alerting tool (e.g., PagerDuty) from Phoenix metrics |
| Multi‑model comparison | Store traces for several Ollama models | Compare accuracy and latency across models side by side in the Phoenix UI |

11. Troubleshooting Common Issues

| Symptom | Likely Cause | Fix |
| --- | --- | --- |
| Traces don’t appear | Phoenix OTLP endpoint unreachable | Verify the Docker port mapping (-p 6006:6006 -p 4317:4317) or the local address (http://localhost:6006) |
| Model requests fail | Wrong OPENAI_BASE_URL | Ensure it points to Ollama’s v1 endpoint (http://localhost:11434/v1) |
| Evaluation metrics missing | Evaluation template or model misconfigured | Pass the correct template and make sure OpenAIModel has the right model name and base_url |
| Embedding upload errors | Mismatched vector dimensions | All vectors in a dataset must share one dimension (e.g., 768); re‑embed everything with the same model |

12. Summary

Arize Phoenix turns a local Ollama deployment into a production‑grade LLM observability platform. By pointing Phoenix at the Ollama endpoint and enabling the built‑in instrumentation, you gain:

  • Instant trace visualization
  • Automated evaluation with a library of templates
  • Embedding insights for data coverage and drift detection
  • Dashboards that surface latency, accuracy, and error rates

Because both tools are open source, you can keep all data on‑premise and avoid costly cloud usage while still enjoying the benefits of a modern observability stack.

Happy building! 🚀

