Architecture
This document describes the high-level architecture of Mosaic, how its components interact, and how Collective Communication Layer (CCL) telemetry flows from GPUs to dashboards. At a high level, Mosaic consists of:
- Profiler Plugin: NCCL profiler plugin that emits OpenTelemetry compatible metrics
- Mappers: Collect metadata required to correlate CCL metrics
- Exporters: Collect server, GPU and process information
- Observability Stack: Grafana LGTM backend for storage and visualization

Collective Communication Layer (CCL)#
Mosaic focuses on collective operations, which dominate performance in distributed GPU workloads.
Examples include AllReduce, AllGather, AllToAll.
These collectives are invoked by deep learning frameworks but executed inside NCCL.
Why CCL Visibility Matters?
- Performance issues often manifest as communication pathologies
- Traditional tools provide postmortem or offline analysis
- Mosaic captures live, production-safe telemetry
Profiler Plugin#
The Mosaic profiler plugin is a lightweight library embedded into NCCL (NVIDIA Collective Communications Library) that collects and exports performance metrics to OpenTelemetry collectors. This plugin captures collective operations, P2P operations, and proxy operation transfer rates with detailed performance data for monitoring and analysis.
This plugin implements the NCCL Profiler Plugin v4 interface and provides:
- Real-time Metrics Collection: Captures NCCL collective, P2P, and proxy operation metrics
- OpenTelemetry Integration: Exports metrics to OpenTelemetry collectors via HTTP
- Lock-free Event Recording: Zero-allocation critical path with pre-allocated circular buffers
- Configurable Telemetry: Supports customizable reporting intervals and batch timeouts
- Event Filtering: Configurable event mask to control which NCCL events are profiled
- Multi-threaded Support: Thread-safe operation across multiple NCCL contexts
- Linear Regression Analysis: Automatic latency and bandwidth calculation from transfer data
The plugin uses a lock-free circular buffer design with pre-allocated memory to ensure minimal overhead on the critical path. Events are processed asynchronously by a background telemetry thread that aggregates metrics and exports them to OpenTelemetry collectors.
Feature#
- Comprehensive Event Tracking: Monitors collective operations, P2P operations, and proxy operations
- Histogram Metrics: Provides detailed metrics for bytes, time, latency, and transfer rates
- Rank Transfer Metrics: Tracks data transfer between ranks with latency and bandwidth analysis
- Channel Transfer Metrics: Per-channel transfer statistics for granular performance analysis
- Lock-free Critical Path: Pre-allocated circular buffers ensure zero memory allocation during event recording
- Asynchronous Processing: Background telemetry thread processes and exports metrics without blocking NCCL operations
- Windowed Aggregation: Events are aggregated in windows (50k events or the configured time interval) before export
- Linear Regression Metrics: Automatic calculation of latency (intercept) and rate (slope) from transfer data
- Debug Logging: Optional trace logging for troubleshooting
Available Metrics#
Collective Metrics
Collective#
| Name | Description | Unit |
|---|---|---|
nccl_profiler_collective_bytes |
Total bytes transferred in collective operations | bytes |
nccl_profiler_collective_time |
Average time per collective operation | µs |
nccl_profiler_collective_count |
Number of collective operations | count |
nccl_profiler_collective_num_transfers |
Average number of transfers per collective | count |
nccl_profiler_collective_transfer_size |
Average transfer size per collective | bytes |
nccl_profiler_collective_transfer_time |
Average transfer time per collective | µs |
P2P Metrics
P2P#
| Name | Description | Unit |
|---|---|---|
nccl_profiler_p2p_bytes |
Average bytes per P2P operation | bytes |
nccl_profiler_p2p_time |
Average time per P2P operation | µs |
nccl_profiler_p2p_num_transfers |
Average number of transfers per P2P operation | count |
nccl_profiler_p2p_transfer_size |
Average transfer size for P2P | bytes |
nccl_profiler_p2p_transfer_time |
Average transfer time for P2P | µs |
Rank Metrics
Rank#
| Name | Description | Unit |
|---|---|---|
nccl_profiler_rank_bytes |
Bytes sent from rank to rank | bytes |
nccl_profiler_rank_latency |
Latency from rank to rank (*1) | µs |
nccl_profiler_rank_rate |
Transfer rate from rank to rank (*2) | MBps |
Transfer Time Metrics
Transfer Time#
| Name | Description | Unit |
|---|---|---|
nccl_profiler_transfer_size |
Average transfer size per channel | bytes |
nccl_profiler_transfer_time |
Average transfer time per channel | µs |
nccl_profiler_transfer_latency |
Transfer latency per channel (*1) | µs |
OpenTelemetry Collector#
The OpenTelemetry Collector acts as the control plane for telemetry. It
- Receive OTLP metrics from profiler plugin
- Apply batching and aggregation
- Enrich metrics with resource attributes
- Route data to backend systems
LGTM Stack#
Mosaic integrates natively with the Grafana LGTM stack:
- Grafana – visualization and dashboards
- Mimir – scalable metrics storage
- Loki – logs (roadmap)
- Tempo – traces (roadmap)
Mosaic comes with the following dashboards:
- Overall health - Shows overall system health (e.g. number of failed GPU, transfer time anomaly)
- Comprehensive CCL metrics - shows every CCL metrics available
- GPU status - Show GPU-specific metrics (e.g. power consumption anomaly)
Reference Implementation#
Mosaic profiler plugin needs to be included in the environment where the training/inferencing workload is executed, which can be a bare metal server, a VM, a container, or even on Kubernetes cluster.
While we aim at providing a universal solution that theoretically works on all above-mentioned environments, it is obvious that we cannot support and validate all possible combinations. Therefore, we are defining a reference implementation which we validate, ship and support as prebuilt artifacts. We also provide documentation on how to compile and include Mosaic profiler plugin into your runtime environment.
Currently, Mosaic reference implementation is defined as follows:
- Workload
- vLLM inference
- Deployment
The following will be brought up by Docker compose
- LGTM stack
- Mosaic profiler plugin
- Mosaic mappers and exporters
- Ray + vLLM
- Hardware
- NVIDIA GPUs
- RTX 2000 Ada
- RTX Pro 6000 Blackwell
- AMD GPUs (experimental1)
- Radeon AI Pro R9700
- MI300X
- NVIDIA GPUs
This reference is opinionated but intentionally simple.
Known Issue
Our reference implementation is based on vLLM 0.15.0, which has a known issue related to pipeline parallelism. Please refer to the GitHub issue for details: https://github.com/vllm-project/vllm/issues/27116
Design Boundaries#
- NCCL telemetry collection
- OpenTelemetry metric emission
- Grafana dashboards
- Reference implementation
Everything in Mosaic, plus
- Automated mitigation
- Advanced orchestration logic
- Large-scale (>32 GPU) abstractions
- Kubernetes integration
Extensibility#
Mosaic is designed to evolve:
- Additional metrics (network, memory)
- Logs and traces via OTel
- Vendor expansion beyond NVIDIA
- Deeper integration with orchestration platforms such as Kubernetes
Summary#
Mosaic’s architecture is built around a simple idea:
Treat collective communication as observable infrastructure.
By combining lightweight in-workload profiling with standard OpenTelemetry pipelines and proven Grafana backends, Mosaic delivers production-ready GPU observability without sacrificing performance.
-
Implementation is completed. Marked as "experimental" since automated integration test is still in progress ↩