Troubleshooting
Common Issues
- Plugin not loading:
  - Verify NCCL_PROFILER_PLUGIN=otel is set
  - Check that LD_LIBRARY_PATH includes the plugin directory
  - Ensure the plugin library exists and is executable
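The plugin-loading checks can be scripted. The following is a minimal sketch, not part of the plugin: find_plugin_lib is a hypothetical helper, and the file name libnccl-profiler-otel.so is an assumption based on NCCL's usual libnccl-profiler-<name>.so naming, so adjust it to match your build output.

```shell
# Hypothetical helper: search a colon-separated path list for a library file.
# Usage: find_plugin_lib <file-name> [search-path]; defaults to LD_LIBRARY_PATH.
# The body runs in a subshell so the IFS and globbing changes do not leak out.
find_plugin_lib() (
  lib_name="$1"
  search_path="${2:-${LD_LIBRARY_PATH:-}}"
  set -f      # disable globbing while splitting the path list
  IFS=':'
  for dir in $search_path; do
    [ -n "$dir" ] || continue
    if [ -f "$dir/$lib_name" ]; then
      printf '%s\n' "$dir/$lib_name"
      exit 0
    fi
  done
  exit 1
)

# Assumed file name; adjust to the library your build actually produces.
plugin_lib="libnccl-profiler-otel.so"

echo "NCCL_PROFILER_PLUGIN=${NCCL_PROFILER_PLUGIN:-<unset>}"
if found=$(find_plugin_lib "$plugin_lib"); then
  ls -l "$found"   # shows the permission bits for the library
else
  echo "$plugin_lib not found in LD_LIBRARY_PATH" >&2
fi
```

If the library is reported as missing, add its directory to LD_LIBRARY_PATH and rerun the check before launching the NCCL job.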
- No metrics being sent:
  - Verify NCCL_PROFILER_OTEL_TELEMETRY_ENDPOINT is correctly set
  - Check that NCCL_PROFILER_OTEL_ENABLE=1 (default)
  - Check that NCCL_PROFILER_OTEL_TELEMETRY_ENABLE=1 (default)
  - Verify the OpenTelemetry collector is running and accessible
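To see which of these settings are in effect in the current shell, a small report like the one below can help. It is a sketch only: show_setting is a hypothetical helper, and the fallback labels simply restate the defaults listed above.

```shell
# Hypothetical helper: print a variable's value, or a fallback note when it is unset or empty.
show_setting() {
  name="$1"
  fallback="$2"
  # Indirect lookup of the variable named in $name (POSIX-compatible).
  eval "value=\${$name:-}"
  if [ -n "$value" ]; then
    printf '%s=%s\n' "$name" "$value"
  else
    printf '%s is unset (%s)\n' "$name" "$fallback"
  fi
}

show_setting NCCL_PROFILER_OTEL_TELEMETRY_ENDPOINT "no endpoint configured"
show_setting NCCL_PROFILER_OTEL_ENABLE "defaults to 1"
show_setting NCCL_PROFILER_OTEL_TELEMETRY_ENABLE "defaults to 1"
```

Run it in the same shell or job script that launches the NCCL application, since variables exported elsewhere are not visible to the job.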
- Connection issues:
  - Verify the OpenTelemetry collector is running and accessible
  - Check network connectivity to the collector endpoint
  - Ensure the endpoint URL format is correct (http://host:port)
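The format and connectivity checks can be combined into a short script. This is a sketch under stated assumptions: is_valid_endpoint is a hypothetical helper that accepts only the documented http://host:port shape (whether the plugin also accepts https or a path suffix is not covered here), and the curl probe assumes the collector answers plain HTTP on that port.

```shell
# Hypothetical helper: succeed only if the argument looks like http://host:port.
is_valid_endpoint() {
  printf '%s\n' "$1" | grep -Eq '^http://[A-Za-z0-9.-]+:[0-9]+$'
}

endpoint="${NCCL_PROFILER_OTEL_TELEMETRY_ENDPOINT:-}"

if is_valid_endpoint "$endpoint"; then
  if command -v curl >/dev/null 2>&1; then
    # Without --fail, curl exits 0 on any HTTP response, even an error status,
    # so success here only means something answered HTTP on that port.
    if curl --silent --show-error --max-time 5 --output /dev/null "$endpoint"; then
      echo "collector reachable at $endpoint"
    else
      echo "cannot reach $endpoint" >&2
    fi
  fi
else
  echo "endpoint '$endpoint' does not match http://host:port" >&2
fi
```

A failed probe does not say why the collector is unreachable; check firewalls, DNS resolution, and whether the collector process is listening on the expected port.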
- Buffer overflow warnings:
  - Increase window trigger settings if events are generated too quickly
  - Check that the telemetry thread is processing windows in time
  - Verify the system has sufficient CPU for background processing
Debug Logging
Enable trace logging for detailed debugging:
# Build with trace logging
make TRACE=1
# Run with NCCL debug output
export NCCL_DEBUG=INFO
export NCCL_PROFILER_PLUGIN=otel
# ... other environment variables
This will provide detailed logs about plugin initialization, metric collection, and telemetry export.
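On multi-process runs the debug output from all ranks interleaves on the console. One option, sketched here, is to redirect it to per-process files with NCCL_DEBUG_FILE, a standard NCCL variable. The /tmp path is an illustrative assumption; pick a location that suits your cluster.

```shell
# Write NCCL debug output to one file per host and process.
# %h expands to the hostname and %p to the process ID.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_FILE=/tmp/nccl-debug.%h.%p.log
export NCCL_PROFILER_PLUGIN=otel
```

Each rank then writes its own log file, which makes it easier to check initialization on a single process.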