Insights: NVIDIA/nccl
Overview
- 0 Merged pull requests
- 2 Open pull requests
- 6 Closed issues
- 4 New issues
There hasn’t been any commit activity on NVIDIA/nccl in the last week.
2 Pull requests opened by 2 people
- Add cross data center communications and network topology awareness to NCCL (#1659, opened Mar 24, 2025)
- [nccl-profiler] fix profiling logic for proxyCtrl idle/active events (#1662, opened Mar 26, 2025)
6 Issues closed by 4 people
- NCCL error: unhandled system error (#1657, closed Mar 28, 2025)
- `ncclP2pImportShareableBuffer()`: `cuMemImportFromShareableHandle()` fails with CUDA failure 101 (#1647, closed Mar 25, 2025)
- Question regarding the number of Channels versus the number of SMs used in communication. (#1656, closed Mar 24, 2025)
- Is well-overlapped pipelining achievable in NVLS + Tree or NVLS + IB SHARP? (#1651, closed Mar 24, 2025)
- NCCL Cuda failure 999 'unknown error'** (#1653, closed Mar 24, 2025)
4 Issues opened by 4 people
- NCCL data race (#1663, opened Mar 27, 2025)
- How ext-tuner plugin enforces algorithm and protocol? (#1661, opened Mar 26, 2025)
- Segmentation fault while running all_reduce_perf 5090 (#1660, opened Mar 25, 2025)
- NCCL profiler proxyStep start/stop counts mismatch in plugin v3. (#1658, opened Mar 24, 2025)
12 Unresolved conversations
Sometimes conversations happen on old items that aren’t yet closed. Here is a list of all the Issues and Pull Requests with unresolved conversations.
- `ncclInternalError: Internal check failed` when using `irecv` with non-contiguous tensor in NCCL backend (#1655, commented on Mar 23, 2025 • 0 new comments)
- `ncclInternalError: Internal check failed` and `CUDA error: invalid device ordinal` when using `dist.recv_object_list` with NCCL backend in multi-GPU setup (#1654, commented on Mar 23, 2025 • 0 new comments)
- Why get poor performance when using different channels with samechannels = 0? (#1239, commented on Mar 24, 2025 • 0 new comments)
- NCCL failure caused by NET/IB completion error (#1405, commented on Mar 24, 2025 • 0 new comments)
- How data is moving between GPUs? (#1644, commented on Mar 25, 2025 • 0 new comments)
- When the number of nodes increases, the bandwidth performance of alltoall is unstable (#1531, commented on Mar 25, 2025 • 0 new comments)
- [Bug] NCCL all_reduce failed with A800 when NCCL_ALGO uses Ring (#1055, commented on Mar 26, 2025 • 0 new comments)
- NCCL INFO NET/IB : No device found (#952, commented on Mar 26, 2025 • 0 new comments)
- all-reduce slower on v2.20.5 compared to v2.18.5 on AWS g5.48xlarge (8 x A10G) (#1298, commented on Mar 26, 2025 • 0 new comments)
- Why tree algorithms are specifically targeted at All-Reduce? (#1473, commented on Mar 26, 2025 • 0 new comments)
- Is there a performance difference between dmabuf and peer memory? (#908, commented on Mar 28, 2025 • 0 new comments)
- Fix for various issues in the graph module (#1635, commented on Mar 24, 2025 • 0 new comments)