Abstract
T-overlap query is the basis of set similarity query and has been applied in many important fields. Most existing approaches employ a pruning-and-verification framework, thus in low efficiency. Modern GPU has much higher parallelism as well as memory bandwidth than CPU and can be used to accelerate T-overlap query. In this paper, we use hash segmentation to divide inverted lists into segments, then design an efficient inverted index called GHSII on GPU using hash segmentation. Based on GHSII, a new segmentation parallel T-overlap algorithm, GSPS, is proposed. GSPS uses segment at a time to scan segments and uses shared memory to decrease the number of accesses to device memory. Furthermore, an optimized algorithm called GSPS-TLLO using a heuristic query order is proposed to solve the problem of load imbalance. Experiments are carried out on two real datasets and the results show that GSPS-TLLO outperforms the state-of-the-art GPU parallel T-overlap algorithms.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
When there is no ambiguity, we also use j-SEGs to denote all the j-th segments of query Q related lists.
References
Kawamoto, H., Kitamura, T.: Similarity of speaker individualities of sentence in ATR speech database set C. In: Proceedings of IEICE Technical Report Speech, pp. 33–34 (2013)
Hadjieleftheriou, M., Chandel, A., Koudas N., et al.: Fast indexes and algorithms for set similarity selection queries. In: Proceeding of IEEE 24th International Conference on Data Engineering (ICDE 2008), pp. 267–276 (2008)
He, B., Ke, F., Fang, R., et al.: Relational joins on graphics processors. In: Proceeding of ACM SIGMOD International Conference on Management of Data, pp. 511–524 (2007)
Shalom, S.A.A., Dash, M., Tue, M.: Efficient K-means clustering using accelerated graphics processors. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 166–175. Springer, Heidelberg (2008). doi:10.1007/978-3-540-85836-2_16
Punronen, S., Terziyan, V.: A similarity evaluation technique for data mining with an ensemble of classifiers. In: Proceeding of IEEE Computer Society, International Workshop on Database & Expert Systems Applications, pp. 1155–1160 (2000)
Kim, J., Vasardani, M., Winter, S.: Similarity matching for integrating spatial information extracted from place descriptions. Int. J. Geogr. Inf. Sci. 1–25 (2016)
Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proceeding of VLDB, pp. 918–929 (2006)
Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceeding of ICDE (2006)
Lin, X., Wang, W.: Set and string similarity queries a survey. Chin. J. Comput. 34(10), 1853–1862 (2012)
Deng, D., Li, G., Feng, J., et al.: A unified framework for approximate dictionary-based entity extraction. VLDB J. 24(1), 143–167 (2014)
Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceeding of SIGMOD 2004, pp. 743–754 (2004)
Li, C., Lu, J., Lu, Y.: Efficient merging and filtering algorithms for approximate string searches. In: Proceeding of ICDE 2008, pp. 257–266 (2008)
Jia, L., Xi, J., Li, M., et al.: ETI: an efficient index for set similarity queries. Front. Comput. Sci. 6(6), 700–712 (2012)
Li, M., Jia, L., You, J., et al.: Fast T-overlap query algorithms using graphics processor units and its applications in web data query. World Wide Web-internet Web Inf. Syst. 18(2), 1–17 (2013)
Tatikonda, S., Junqueira, F., Cambazoglu, B.B., et al.: On efficient posting list intersection with multicore processors. In: Proceeding of ACM SIGIR 2009, pp. 738–739 (2009)
Ding, B., Nig, A.: Fast set intersection in memory. In: Proceeding of the VLDB Endowment 2011, pp. 255–266 (2011)
Ao, N., Zhang, F., Wu, D., et al.: Efficient parallel lists intersection and index compression algorithms using graphics processing units. In: Proceeding of the VLDB Endowment, pp. 470–481 (2011)
Ding, S., He, J., Yan, H.: Using graphics processors for high-performance IR query processing. In: Proceeding of WWW, pp. 1213–1214 (2008)
Programming of shared memory GPUs shared memory systems. http://site.uottawa.ca/~mbolic/ceg4131/CUDA_Report.pdf. Accessed Jan 2016
Wu, D., Zhang, F., Ao, N., et al.: Efficient lists intersection by CPU-GPU cooperative computing. In: Proceeding of 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), pp. 1–8. IEEE (2010)
Pagh, R., Rodler, F.: Cuckoo hashing. J. Algorithms 51(2), 122–144 (2004)
Bay, S., Kibler, D., Pazzani, M., et al.: The UCI KDD archive of large data sets fordata mining research and experimentation. ACM SIGKDD Explor. Newsl. 2(2), 14–18 (2002)
Bayardo, R., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceeding of International Conference on World WideWeb 2007, pp. 71–81 (2007)
Broder, A.Z., Carmel, D., Herscovici, M., et al.: Efficient query evaluation using a two-level retrieval process. In: Proceeding of Twelfth International Conference on Information and Knowledge Management, pp. 426–434. ACM (2003)
Acknowledgment
The research was supported by Grants from the National Natural Science Foundation of China (No. 61562054, 51467007, 61462050), and the Personnel Training Project of Yunnan Province (No. KKSY201603016).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Jia, L., Zhang, Y., Li, M., Ding, J., You, J. (2017). GPU Based Hash Segmentation Index for Fast T-overlap Query. In: Zou, B., Li, M., Wang, H., Song, X., Xie, W., Lu, Z. (eds) Data Science. ICPCSEE 2017. Communications in Computer and Information Science, vol 727. Springer, Singapore. https://doi.org/10.1007/978-981-10-6385-5_4
Download citation
DOI: https://doi.org/10.1007/978-981-10-6385-5_4
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-6384-8
Online ISBN: 978-981-10-6385-5
eBook Packages: Computer ScienceComputer Science (R0)