How to Use Arize Phoenix with Ollama
A Practical Guide for LLM‑Ops Engineers and Data Scientists
1. Introduction
Arize Phoenix is an open‑source observability platform that lets teams monitor, debug, and evaluate large‑language‑model (LLM) applications. It can record traces, run automatic evaluations, and surface visual insights that help you spot drift, bias, or performance regressions.
Ollama is a local, lightweight LLM host that exposes an OpenAI‑compatible API. By running Ollama locally, you can keep your data in‑house, cut inference costs, and experiment quickly.
Combining Phoenix with Ollama gives you:
| Feature | Phoenix | Ollama |
|---|---|---|
| Trace collection | OTLP‑compatible | Any SDK that speaks OpenAI API |
| Model evaluation | Pre‑built templates (relevance, faithfulness, toxicity, etc.) | Directly feed your local model |
| Visualization | Embedding heatmaps, trace graphs, metrics dashboards | Immediate feedback on local prompts |
| Cost | Free, open source | Zero cloud‑usage costs |
2. What Is Arize Phoenix?
Phoenix is built on top of OpenTelemetry (OTLP) and provides:
- Trace ingestion – collect request‑response pairs from any framework (LangChain, LlamaIndex, DSPy, etc.).
- Automatic evaluation – run your LLM output against a prompt or reference set using a library of templates (faithfulness, toxicity, coherence, etc.).
- Embeddings visualizer – cluster analysis and dimensionality reduction of user queries or knowledge‑base documents.
- Dashboards – metrics such as latency, error rates, accuracy, and drift alerts.
Phoenix is intentionally “playground‑first”: you can spin up a local UI and test everything before deploying to production.
3. Why Combine Phoenix with Ollama?
| Pain Point | Why Phoenix Helps | Why Ollama Helps |
|---|---|---|
| Latency | Visualize and compare latency distributions across models | Run inference locally, no network round‑trip |
| Data privacy | Store traces locally, no third‑party transmission | Keep data on‑premises |
| Cost | Free tooling | Zero cloud inference cost |
| Rapid iteration | Playground allows instant parameter tweaks | Quick local inference without API throttling |
4. Prerequisites
- Python 3.10+ (recommended in a virtual environment).
- Docker (optional, for running Phoenix locally).
- Ollama installed locally – see https://ollama.ai/.
- An OpenAI‑compatible API key if you want to evaluate against external reference data (optional).
5. Installing Phoenix
Phoenix can be installed as a Python package or run in a Docker container.
The Python route is easiest for experimentation:
python -m venv venv
source venv/bin/activate
pip install "arize-phoenix[evals,llama-index]" # pulls in core, evals, and LlamaIndex integration
Alternatively, run the prebuilt Docker image:
docker run -d -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest
Once the container is running, open the UI at http://localhost:6006 (port 4317 receives OTLP traces).
6. Configuring Phoenix to Use Ollama
Phoenix treats any OpenAI‑compatible endpoint as a “provider.”
Ollama exposes an OpenAI‑compatible endpoint at http://localhost:11434/v1.
6.1 Set Environment Variables
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=YOUR_LOCAL_KEY  # any non-empty string works; Ollama does not validate the key
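Before pointing Phoenix at the endpoint, it is worth confirming that Ollama answers OpenAI-style requests. A minimal sanity check with the official openai Python client, assuming you have already pulled the model with ollama pull llama3.1:8b:
from openai import OpenAI

# Ollama ignores the API key, but the client requires a non-empty string
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Reply with one word: ready"}],
)
print(resp.choices[0].message.content)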
6.2 Create a Prompt Playground Session
- In the Phoenix UI, click Playground → New Session.
- Under AI Provider, select Custom.
- Enter the base URL and API key above.
- Choose a model from the list (e.g., llama3.1:8b).
You can now send prompts directly to your local Ollama instance from the Phoenix UI and immediately see the trace, latency, and evaluation results.
7. Sending Traces from Your Own Code
Phoenix provides a lightweight callback handler that you can plug into frameworks like LlamaIndex or LangChain. In LlamaIndex, the simplest route is to register it once as the global handler so every query is traced automatically:
import llama_index
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.llms import OpenAI

# Register Phoenix as LlamaIndex's global handler so every retrieval and
# LLM call is exported to Phoenix as a trace
llama_index.set_global_handler("arize_phoenix")

# Route the OpenAI-compatible client to the local Ollama server
llm = OpenAI(
    model="llama3.1:8b",
    api_base="http://localhost:11434/v1",
    api_key="ollama",  # Ollama does not validate the key
)

service_context = ServiceContext.from_defaults(llm=llm)
index = VectorStoreIndex(..., service_context=service_context)  # build your RAG index

query = "Explain the benefits of using local LLMs."
response = index.as_query_engine().query(query)
print(response)
All requests will be automatically sent to Phoenix via OTLP.
You’ll see each trace appear in the Traces tab, complete with timestamps, request/response payloads, and any evaluation metrics you have configured.
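If Phoenix runs in Docker rather than in the same Python process, point the instrumentation at the collector before issuing queries. A minimal sketch, assuming the default port mapping from section 5:
import os

# Tell the Phoenix instrumentation where to export traces
# (adjust the host/port if you mapped the container differently)
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "http://localhost:6006"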
8. Evaluating Responses with Phoenix
Phoenix ships with a rich library of evaluation templates, e.g., RAG_RELEVANCY_PROMPT_TEMPLATE. You can also write your own.
8.1 Using a Built‑In Template
import pandas as pd
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# Assume `query` (the user question) and `reference_text` (a retrieved document)
# are strings; the column names must match the template's variables
df = pd.DataFrame({"input": [query], "reference": [reference_text]})

eval_model = OpenAIModel(
    model="llama3.1:8b",
    base_url="http://localhost:11434/v1",
)

relevance = llm_classify(
    dataframe=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=eval_model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
)
print(relevance["label"])  # one relevance label per row
The resulting labels can be logged back to Phoenix and compared across runs alongside your traces.
8.2 Custom Evaluation Prompts
Create a prompt that asks the LLM to score its own answer:
CUSTOM_PROMPT = """
You are evaluating the following answer to a user query:
Q: {query}
A: {answer}
Rate the answer on a scale of 0–10 for relevance and factual accuracy.
Return a JSON object: {{"relevance": int, "accuracy": int}}
"""
import pandas as pd
from phoenix.evals import OpenAIModel, llm_generate

# Column names must match the template variables ({query}, {answer})
df = pd.DataFrame({"query": [query], "answer": [str(response)]})

results = llm_generate(
    dataframe=df,
    template=CUSTOM_PROMPT,
    model=OpenAIModel(
        model="llama3.1:8b",
        base_url="http://localhost:11434/v1",
    ),
)
print(results)  # one JSON string per row, ready to parse with json.loads
9. Visualizing Embeddings
Phoenix's Embedding Visualizer helps you understand how your data is clustered.
- Load your query or document embeddings (e.g., via openai.embeddings.create pointed at Ollama; see the sketch at the end of this section).
- Push them to Phoenix using the SDK:
import pandas as pd
import phoenix as px

# `vectors`, `texts`, and `labels` are parallel lists (embeddings, raw text, metadata)
df = pd.DataFrame({"text": texts, "embedding": vectors, "label": labels})
schema = px.Schema(
    embedding_feature_column_names={
        "my_docs": px.EmbeddingColumnNames(
            vector_column_name="embedding", raw_data_column_name="text"
        )
    }
)
px.launch_app(primary=px.Inferences(dataframe=df, schema=schema, name="my_docs"))
- In the UI, open Embeddings → Dataset → my_docs.
You’ll see a 2‑D/3‑D scatter plot, cluster boundaries, and the ability to filter by label.
Use this to spot outliers or verify that your RAG knowledge base covers the query space.
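To generate the vectors locally, you can reuse the same OpenAI-compatible client against Ollama's embeddings endpoint. A minimal sketch, assuming you have pulled an embedding model such as nomic-embed-text:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

texts = ["What is Phoenix?", "How do I run Ollama locally?"]
resp = client.embeddings.create(model="nomic-embed-text", input=texts)
vectors = [item.embedding for item in resp.data]  # 768-dimensional for nomic-embed-text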
10. Advanced Use Cases
| Scenario | How Phoenix Helps | Tips |
|---|---|---|
| RAG system debugging | Trace each retrieval step, compare retrieved docs to ground truth | Use LlamaIndex + Phoenix callbacks to see which docs were fetched |
| Bias & fairness monitoring | Run periodic evaluation with labeled prompts | Store evaluation metrics in Phoenix, alert on drift |
| Latency SLA enforcement | Continuous latency dashboards, threshold alerts | Set up an external alerting rule (e.g., PagerDuty) via Phoenix webhook |
| Multi‑model comparison | Store traces for several Ollama models | Use the Model Comparison view to see accuracy vs latency |
11. Troubleshooting Common Issues
| Symptom | Likely Cause | Fix |
|---|---|---|
| Traces don’t appear | Phoenix collector endpoint unreachable | Verify the Docker port mapping (-p 6006:6006 -p 4317:4317) or the local address (http://localhost:6006). |
| Model requests fail | Wrong OPENAI_BASE_URL | Ensure it points to Ollama’s v1 endpoint (http://localhost:11434/v1). |
| Evaluation metrics missing | Wrong template or model configuration | Pass the correct template to llm_classify / llm_generate and ensure OpenAIModel has the right model name and base_url. |
| Embedding upload errors | Mismatched vector dimensions | All vectors in a dataset must share the same dimension (e.g., nomic-embed-text produces 768-dimensional vectors). |
12. Summary
Arize Phoenix turns a local Ollama deployment into a production‑grade LLM observability platform. By simply pointing Phoenix at the Ollama endpoint and enabling the built‑in callback handlers, you gain:
- Instant trace visualization
- Automated evaluation with a library of templates
- Embedding insights for data coverage and drift detection
- Dashboards that surface latency, accuracy, and error rates
Because both tools are open source, you can keep all data on‑premise and avoid costly cloud usage while still enjoying the benefits of a modern observability stack.
Happy building! 🚀