TY - JOUR
T1 - Segmented Merge
T2 - A New Primitive for Parallel Sparse Matrix Computations
AU - Ji, Haonan
AU - Lu, Shibo
AU - Hou, Kaixi
AU - Wang, Hao
AU - Jin, Zhou
AU - Liu, Weifeng
AU - Vinter, Brian
N1 - Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2021/10
Y1 - 2021/10
N2 - Segmented operations, such as segmented sum, segmented scan and segmented sort, are important building blocks for parallel irregular algorithms. We in this work propose a new parallel primitive called segmented merge. Its function is in parallel merging q sub-segments to p segments, both of possibly nonuniform lengths which easily cause the load balancing and the vectorization problems on massively parallel processors, such as GPUs. Our algorithm resolves these problems by first recording the boundaries of segments and sub-segments, then assigning roughly the same number of elements for GPU threads, and finally iteratively merging the sub-segments in each segment in the form of binary tree until there is only one sub-segment in each segment. We implement the segmented merge primitive on GPUs and demonstrate its efficiency on parallel sparse matrix transposition (SpTRANS) and sparse matrix–matrix multiplication (SpGEMM) operations. We conduct a comparative experiment with NVIDIA vendor library on two GPUs. The experimental results show that our algorithm achieve on average 3.94× (up to 13.09×) and 2.89× (up to 109.15×) speedup on SpTRANS and SpGEMM, respectively.
AB - Segmented operations, such as segmented sum, segmented scan and segmented sort, are important building blocks for parallel irregular algorithms. We in this work propose a new parallel primitive called segmented merge. Its function is in parallel merging q sub-segments to p segments, both of possibly nonuniform lengths which easily cause the load balancing and the vectorization problems on massively parallel processors, such as GPUs. Our algorithm resolves these problems by first recording the boundaries of segments and sub-segments, then assigning roughly the same number of elements for GPU threads, and finally iteratively merging the sub-segments in each segment in the form of binary tree until there is only one sub-segment in each segment. We implement the segmented merge primitive on GPUs and demonstrate its efficiency on parallel sparse matrix transposition (SpTRANS) and sparse matrix–matrix multiplication (SpGEMM) operations. We conduct a comparative experiment with NVIDIA vendor library on two GPUs. The experimental results show that our algorithm achieve on average 3.94× (up to 13.09×) and 2.89× (up to 109.15×) speedup on SpTRANS and SpGEMM, respectively.
KW - GPU
KW - Parallel computing
KW - Segmented merge
KW - Sparse matrix
UR - http://www.scopus.com/inward/record.url?scp=85103393417&partnerID=8YFLogxK
U2 - 10.1007/s10766-021-00695-1
DO - 10.1007/s10766-021-00695-1
M3 - Journal article
AN - SCOPUS:85103393417
SN - 0885-7458
VL - 49
SP - 732
EP - 744
JO - International Journal of Parallel Programming
JF - International Journal of Parallel Programming
IS - 5
ER -