Test Suites

NCCL Profiler OpenTelemetry Tests Suite#

The NCCL Profiler Open Telemetry test suite tests the NCCL Profiler OpenTelemetry plugin's telemetry export functionality. These tests verify that: 1. The LGTM stack (Loki, Grafana, Tempo, Mimir) and OTel Collector are accessible 2. vLLM inference triggers NCCL operations 3. The NCCL profiler exports metrics to Prometheus via OpenTelemetry

Hardware Requirements#

Minimum 2 NVIDIA GPUs required. These tests validate NCCL profiler metrics, which are only generated when multi-GPU parallelism (tensor or pipeline) triggers inter-GPU communication. A single GPU will not produce NCCL metrics.

Verify GPU availability

nvidia-smi

Software Requirements#

Before running these tests, ensure the following are running:

LGTM Stack and OTel Collector
vLLM with NCCL Profiler plugin enabled and tensor parallelism (--tensor-parallel-size 2) or pipeline parallelism (--pipeline-parallel-size 2)

Using the production-test-framework container will automate a lot of the managing the lifecycle of these services, but they can also be managed manually.

Test Structure#

Shared fixtures#

Any fixtures that need to be used across multiple test suites should be placed in tests/suites/conftest.py.

Profiler OTEL `conftest.py`#

Provides fixtures and constants for the profiler OTEL tests:

Fixture	Description
`prometheus_url`	Prometheus API endpoint (default: `http://localhost:9090`)
`grafana_url`	Grafana endpoint (shared; default: `http://localhost:3000`)
`vllm_client`	Client for vLLM inference API
`vllm_ready`	Waits for vLLM to be healthy
`inference_completed`	Runs inference to trigger NCCL operations
`nccl_profiler_metrics`	List of expected Prometheus metric names

`test_profiler_metrics.py`#

Contains the test class TestNCCLProfilerTelemetry:

Test	Description
`test_otel_collector_accessible`	Verifies Prometheus endpoint is reachable
`test_grafana_accessible`	Verifies Grafana dashboard is reachable
`test_nccl_metrics_exported_after_inference`	Runs inference and validates all expected NCCL metrics appear in Prometheus

Expected Metrics#

The tests validate that the following metrics are exported to Prometheus (defined in telemetry.cc):

Collective Metrics#

nccl_profiler_collective_bytes_total - Total bytes in collective ops
nccl_profiler_collective_time_microseconds_sum - Time spent in collective ops
nccl_profiler_collective_count_sum - Number of collective ops

Notes

Collective metrics appear with tensor parallelism (--tensor-parallel-size).

P2P Metrics#

nccl_profiler_p2p_bytes_bytes_sum - Bytes in P2P ops
nccl_profiler_p2p_time_microseconds_sum - Time in P2P ops

Notes

P2P metrics only appear when using pipeline parallelism (--pipeline-parallel-size).

Rank/Transfer Metrics#

nccl_profiler_rank_bytes_total - Bytes transferred between ranks
nccl_profiler_rank_latency_microseconds_sum - Latency between ranks
nccl_profiler_transfer_size_bytes_sum - Transfer sizes per channel

Environment Variables#

Variable	Default	Description
`VLLM_HOST`	`localhost`	vLLM server host
`VLLM_PORT`	`8080`	vLLM server port
`PROMETHEUS_HOST`	`localhost`	Prometheus host
`PROMETHEUS_PORT`	`9090`	Prometheus port
`GRAFANA_HOST`	`localhost`	Grafana host
`GRAFANA_PORT`	`3000`	Grafana port

Troubleshooting#

No metrics appearing in Prometheus#

Check vLLM logs for NCCL profiler initialization:
```
make profiler-otel-logs
```

Verify the OTEL endpoint is correct:

# Should be http://nccl-profiler-otel-lgtm:4318 (HTTP, not gRPC)
docker inspect nccl-profiler-vllm | grep OTEL

Check Prometheus targets:
```
http://localhost:9090/targets
```

Tests timeout waiting for vLLM#

vLLM needs time to download and load the model. The default timeout is 5 minutes. Check logs:

docker logs nccl-profiler-vllm -f

Missing P2P metrics#

P2P metrics require pipeline parallelism. Modify docker-compose.yml:

command: ["vllm serve ... --pipeline-parallel-size 2"]

Grafana Dashboards Tests Suite#

The Grafana dashboards test suite validates that the dashboards in the repository stay in sync with the running Grafana instance. It does not require vLLM or GPUs.

What is tested#

dashboards.yml versus repository files: Every options.path entry in dashboards.yml must point to a JSON file that exists under deployments/dashboards/ in the repository (same basename). The path must be exactly /var/lib/grafana/dashboards/<filename>.json, matching the Docker Compose mount of deployments/dashboards to that location in the Grafana container, so the listed path resolves to the intended file. Duplicate basenames across providers are rejected.
Dashboard presence in Grafana: The dashboards listed in dashboards.yml are available in Grafana
Each dashboard loads: For each dashboard returned by Grafana’s search API, the dashboard UID API returns HTTP 200.

Requirements#

Grafana (LGTM stack) must be running. When running via make test, the production-test-framework container mounts the repo’s deployments/dashboards directory at /mnt/dashboards so the tests can read dashboards.yml and verify repo files match the provisioning paths.
The suite uses the same Grafana URL as the profiler OTEL suite (GRAFANA_HOST, GRAFANA_PORT; default http://localhost:3000).

Environment variables#

Variable	Default	Description
`DASHBOARDS_DIR`	`/mnt/dashboards`	Path to the dashboards directory (set by mount when run in container)