# Test Suites

## NCCL Profiler OpenTelemetry Test Suite

The NCCL Profiler OpenTelemetry test suite exercises the NCCL Profiler OpenTelemetry plugin's telemetry export functionality. These tests verify that:

1. The LGTM stack (Loki, Grafana, Tempo, Mimir) and OTel Collector are accessible
2. vLLM inference triggers NCCL operations
3. The NCCL profiler exports metrics to Prometheus via OpenTelemetry
### Hardware Requirements
Minimum 2 NVIDIA GPUs required. These tests validate NCCL profiler metrics, which are only generated when multi-GPU parallelism (tensor or pipeline) triggers inter-GPU communication. A single GPU will not produce NCCL metrics.
### Software Requirements
Before running these tests, ensure the following are running:
- LGTM Stack and OTel Collector
- vLLM with the NCCL Profiler plugin enabled and tensor parallelism (`--tensor-parallel-size 2`) or pipeline parallelism (`--pipeline-parallel-size 2`)

Using the production-test-framework container automates much of the lifecycle management for these services, but they can also be managed manually.
### Test Structure

#### Shared fixtures

Any fixtures that need to be used across multiple test suites should be placed in `tests/suites/conftest.py`.

#### Profiler OTEL `conftest.py`

Provides fixtures and constants for the profiler OTEL tests:
| Fixture | Description |
|---|---|
| `prometheus_url` | Prometheus API endpoint (default: `http://localhost:9090`) |
| `grafana_url` | Grafana endpoint (shared; default: `http://localhost:3000`) |
| `vllm_client` | Client for the vLLM inference API |
| `vllm_ready` | Waits for vLLM to become healthy |
| `inference_completed` | Runs inference to trigger NCCL operations |
| `nccl_profiler_metrics` | List of expected Prometheus metric names |
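As an illustration (not the suite's actual code), fixtures along these lines could back the table above; the metric list mirrors the Expected Metrics section below, while the `/health` path and the `base_url` attribute on the client are assumptions:

```python
import time
import urllib.error
import urllib.request

import pytest

# Expected metric names; mirrors the "Expected Metrics" section. P2P metrics
# are omitted here because they only appear with pipeline parallelism.
EXPECTED_NCCL_METRICS = [
    "nccl_profiler_collective_bytes_total",
    "nccl_profiler_collective_time_microseconds_sum",
    "nccl_profiler_collective_count_sum",
    "nccl_profiler_rank_bytes_total",
    "nccl_profiler_rank_latency_microseconds_sum",
    "nccl_profiler_transfer_size_bytes_sum",
]


@pytest.fixture(scope="session")
def nccl_profiler_metrics():
    return EXPECTED_NCCL_METRICS


@pytest.fixture(scope="session")
def vllm_ready(vllm_client):
    # Poll vLLM's health endpoint until it answers 200; the 5-minute timeout
    # matches the default mentioned under Troubleshooting.
    # (`vllm_client.base_url` and the `/health` path are assumptions.)
    deadline = time.monotonic() + 300
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{vllm_client.base_url}/health") as resp:
                if resp.status == 200:
                    return
        except urllib.error.URLError:
            pass
        time.sleep(5)
    pytest.fail("vLLM did not become healthy within the timeout")
```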
#### `test_profiler_metrics.py`

Contains the test class `TestNCCLProfilerTelemetry`:
| Test | Description |
|---|---|
| `test_otel_collector_accessible` | Verifies the Prometheus endpoint is reachable |
| `test_grafana_accessible` | Verifies the Grafana dashboard is reachable |
| `test_nccl_metrics_exported_after_inference` | Runs inference and validates that all expected NCCL metrics appear in Prometheus |
### Expected Metrics

The tests validate that the following metrics are exported to Prometheus (defined in `telemetry.cc`):
#### Collective Metrics

- `nccl_profiler_collective_bytes_total` - Total bytes in collective ops
- `nccl_profiler_collective_time_microseconds_sum` - Time spent in collective ops
- `nccl_profiler_collective_count_sum` - Number of collective ops

Note: Collective metrics appear with tensor parallelism (`--tensor-parallel-size`).
#### P2P Metrics

- `nccl_profiler_p2p_bytes_bytes_sum` - Bytes in P2P ops
- `nccl_profiler_p2p_time_microseconds_sum` - Time in P2P ops

Note: P2P metrics only appear when using pipeline parallelism (`--pipeline-parallel-size`).
#### Rank/Transfer Metrics

- `nccl_profiler_rank_bytes_total` - Bytes transferred between ranks
- `nccl_profiler_rank_latency_microseconds_sum` - Latency between ranks
- `nccl_profiler_transfer_size_bytes_sum` - Transfer sizes per channel
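As a usage example, once these counters land in Prometheus, a Grafana panel could chart per-second collective throughput with a standard PromQL rate query (the 5-minute window is an arbitrary choice, not something the dashboards necessarily use):

```promql
rate(nccl_profiler_collective_bytes_total[5m])
```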
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `VLLM_HOST` | `localhost` | vLLM server host |
| `VLLM_PORT` | `8080` | vLLM server port |
| `PROMETHEUS_HOST` | `localhost` | Prometheus host |
| `PROMETHEUS_PORT` | `9090` | Prometheus port |
| `GRAFANA_HOST` | `localhost` | Grafana host |
| `GRAFANA_PORT` | `3000` | Grafana port |
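The table's defaults combine into endpoint URLs in the obvious way; a minimal sketch, assuming the conventional `http://<host>:<port>` form:

```python
import os


def base_url(host_var, port_var, default_port):
    # Fall back to the documented defaults when a variable is unset.
    host = os.getenv(host_var, "localhost")
    port = os.getenv(port_var, default_port)
    return f"http://{host}:{port}"


vllm_url = base_url("VLLM_HOST", "VLLM_PORT", "8080")
prometheus_url = base_url("PROMETHEUS_HOST", "PROMETHEUS_PORT", "9090")
grafana_url = base_url("GRAFANA_HOST", "GRAFANA_PORT", "3000")
```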
### Troubleshooting

#### No metrics appearing in Prometheus
- Check the vLLM logs for NCCL profiler initialization messages.
- Verify that the configured OTEL endpoint is correct.
- Check the Prometheus targets to confirm the exporter is being scraped.
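To debug the last two checks from outside the test suite, Prometheus can be queried directly over its HTTP API (`/api/v1/targets` and `/api/v1/label/__name__/values` are standard endpoints); this stand-alone sketch assumes the default host and port from the environment table:

```python
import json
import urllib.request

PROMETHEUS = "http://localhost:9090"  # default from the environment table


def get_json(path):
    with urllib.request.urlopen(f"{PROMETHEUS}{path}") as resp:
        return json.load(resp)


def nccl_metric_names(all_names):
    # Keep only series exported by the NCCL profiler plugin.
    return [n for n in all_names if n.startswith("nccl_profiler_")]


if __name__ == "__main__":
    # 1. Are the scrape targets (including the OTel Collector) healthy?
    for target in get_json("/api/v1/targets")["data"]["activeTargets"]:
        print(target["scrapeUrl"], target["health"])

    # 2. Which nccl_profiler_* series does Prometheus know about?
    names = get_json("/api/v1/label/__name__/values")["data"]
    print(nccl_metric_names(names))
```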
#### Tests timeout waiting for vLLM

vLLM needs time to download and load the model; the default timeout is 5 minutes. Check the vLLM container logs to monitor progress.
#### Missing P2P metrics

P2P metrics require pipeline parallelism; enable it in `docker-compose.yml`.
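For example, a hypothetical fragment of the vLLM service definition (the service name and surrounding keys are illustrative; only the flag swap matters):

```yaml
services:
  vllm:
    # Replace --tensor-parallel-size 2 with pipeline parallelism so the
    # profiler emits P2P metrics (service name and layout are illustrative).
    command: ["--pipeline-parallel-size", "2"]
```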
## Grafana Dashboards Test Suite

The Grafana dashboards test suite validates that the dashboards in the repository stay in sync with the running Grafana instance. It does not require vLLM or GPUs.
### What is tested

- **dashboards.yml versus repository files:** Every `options.path` entry in `dashboards.yml` must point to a JSON file that exists under `deployments/dashboards/` in the repository (same basename). The path must be exactly `/var/lib/grafana/dashboards/<filename>.json`, matching the Docker Compose mount of `deployments/dashboards` to that location in the Grafana container, so the listed path resolves to the intended file. Duplicate basenames across providers are rejected.
- **Dashboard presence in Grafana:** The dashboards listed in `dashboards.yml` are available in Grafana.
- **Each dashboard loads:** For each dashboard returned by Grafana's search API, the dashboard UID API returns HTTP 200.
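The first check can be sketched as a pure function over the parsed `providers:` list from `dashboards.yml` (Grafana's standard dashboard-provisioning schema, where each entry carries `options.path`); the function name and return convention here are illustrative, not the suite's actual code:

```python
import os


def check_provider_paths(providers, repo_dir="/mnt/dashboards"):
    """Validate dashboard-provisioning entries against repository files.

    `providers` is the parsed `providers:` list from dashboards.yml.
    Returns a list of human-readable problems; an empty list means OK.
    """
    problems = []
    seen = set()
    for provider in providers:
        path = provider.get("options", {}).get("path", "")
        name = os.path.basename(path)
        # Entries must reference JSON files.
        if not name.endswith(".json"):
            problems.append(f"{path}: not a .json file")
        # Paths must sit exactly under the container mount point.
        if os.path.dirname(path) != "/var/lib/grafana/dashboards":
            problems.append(f"{path}: not under /var/lib/grafana/dashboards")
        # Duplicate basenames across providers are rejected.
        if name in seen:
            problems.append(f"{name}: duplicate basename across providers")
        seen.add(name)
        # The same basename must exist in the repository checkout.
        if not os.path.isfile(os.path.join(repo_dir, name)):
            problems.append(f"{name}: missing from {repo_dir}")
    return problems
```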
### Requirements

- Grafana (LGTM stack) must be running. When running via `make test`, the production-test-framework container mounts the repo's `deployments/dashboards` directory at `/mnt/dashboards` so the tests can read `dashboards.yml` and verify that repo files match the provisioning paths.
- The suite uses the same Grafana URL as the profiler OTEL suite (`GRAFANA_HOST`, `GRAFANA_PORT`; default `http://localhost:3000`).
### Environment variables

| Variable | Default | Description |
|---|---|---|
| `DASHBOARDS_DIR` | `/mnt/dashboards` | Path to the dashboards directory (set by the mount when run in the container) |