×

Content and structure coverage: extracting a diverse information subset. (English) Zbl 1446.90092

Summary: Recent years have witnessed a rapid increase in online data volume and the growing challenge of information overload for web use and applications. Thus, information diversity is of great importance to both information service providers and users of search services. Based on a diversity evaluation measure (namely, information coverage), a heuristic method – \(FastCov_{C+S}\)-Select – with corresponding algorithms is designed on the greedy submodular idea. First, we devise the \(Cov_{C+S}\)-Select algorithm, which possesses the characteristic of asymptotic optimality, to optimize information coverage using a strategy in the spirit of simulated annealing. To accelerate the efficiency of \(Cov_{C+S}\)-Select, its fast approximation (i.e., \(FastCov_{C+S}\)-Select) is then developed through a heuristic strategy to downsize the solution space with the properties of information coverage. Furthermore, ample experiments have been conducted to show the effectiveness, efficiency, and parameter robustness of the proposed method, along with comparative analyses revealing the performance’s advantages over other related methods.
The online appendix is available at https://doi.org/10.1287/ijoc.2017.0753.

MSC:

90B40 Search theory
90B18 Communication networks in operations research
94A15 Information theory (general)
90C59 Approximation methods and heuristics in mathematical programming

Software:

Okapi

References:

[1] Agrawal R, Gollapudi S, Halverson A, Ieong S (2009) Diversifying search results. Proc. 2nd ACM Internat. Conf. Web Search Data Mining, Barcelona, Spain, 5-14.Crossref, Google Scholar · doi:10.1145/1498759.1498766
[2] Aliguliyev RM (2009a) Clustering of document collection—A weighting approach. Expert Syst. Appl. 36(4):7904-7916.Crossref, Google Scholar · doi:10.1016/j.eswa.2008.11.017
[3] Aliguliyev RM (2009b) Performance evaluation of density-based clustering methods. Inform. Sci. 179(20):3583-3602.Crossref, Google Scholar · doi:10.1016/j.ins.2009.06.012
[4] Baeza-Yates R, Ribeiro-Neto B (1999) Modern Information Retrieval (Addison-Wesley, New York).Google Scholar
[5] Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J. Mach. Learn. Res. 3(Jan):993-1022.Google Scholar · Zbl 1112.68379
[6] Brynjolfsson E, Hu Y, Smith MD (2003) Consumer surplus in the digital economy: Estimating the value of increased product variety at online booksellers. Management Sci. 49(11):1580-1596.Link, Google Scholar
[7] Carbonell J, Goldstein J (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. Proc. 21st Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval, Melbourne, Australia, 335-336.Crossref, Google Scholar · doi:10.1145/290941.291025
[8] Carpineto C, Mizzaro S, Romano G, Snidero M (2009a) Mobile information retrieval with search results clustering: Prototypes and evaluations. J. Amer. Soc. Inform. Sci. Tech. 60(5):877-895.Crossref, Google Scholar · doi:10.1002/asi.21036
[9] Carpineto C, Osinski S, Romano G, Weiss D (2009b) A survey of Web clustering engines. ACM Comput. Surv. 41(3):1-38.Crossref, Google Scholar · doi:10.1145/1541880.1541884
[10] Carterette B, Chandar P (2009) Probabilistic models of ranking novel documents for faceted topic retrieval. Proc. 18th ACM Conf. Inform. Knowl. Management, Hong Kong, China, 1287-1296.Crossref, Google Scholar · doi:10.1145/1645953.1646116
[11] Černú V (1985) Thermodynamical approach to the traveling salesman problem: An efficient simulation algorithm. J. Optim. Theory Appl. 45(1):41-51.Crossref, Google Scholar · Zbl 0534.90091 · doi:10.1007/BF00940812
[12] Chen H, Karger DR (2006) Less is more: Probabilistic models for retrieving fewer relevant documents. Proc. 29th Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval, Seattle, 429-436.Crossref, Google Scholar · doi:10.1145/1148170.1148245
[13] Clarke CLA, Craswell N, Soboroff I (2010) Overview of the TREC 2009 web track. Proc. 18th Conf. Text Retrieval, Gaithersburg, MD.Google Scholar
[14] Clarke CLA, Kolla M, Cormack GV, Vechtomova O, Ashkan A, Büttcher S, MacKinnon I (2008) Novelty and diversity in information retrieval evaluation. Proc. 31st Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval, Singapore, 659-666.Crossref, Google Scholar · doi:10.1145/1390334.1390446
[15] Cormen TH, Leiserson CE, Rivest RL, Stein C (2009) Introduction To Algorithms, 3rd ed. (MIT Press, Cambridge, MA).Google Scholar · Zbl 1187.68679
[16] De P, Hu Y, Rahman MS (2010) Technology usage and online sales: An empirical study. Management Sci. 56(11):1930-1945.Link, Google Scholar
[17] Dubois D, Prade H (1985) Fuzzy cardinality and the modeling of imprecise quantification. Fuzzy Sets Syst. 16(3):199-230.Crossref, Google Scholar · Zbl 0601.03006 · doi:10.1016/0165-0114(85)90025-9
[18] Endres DM, Schindelin JE (2003) A new metric for probability distributions. IEEE Trans. Inform. Theory 49(6):1858-1860.Crossref, Google Scholar · Zbl 1294.62003 · doi:10.1109/TIT.2003.813506
[19] Fung BCM, Wang K, Ester M (2003) Hierarchical document clustering using frequent itemsets. Proc. SIAM Internat. Conf. Data Mining (2003), San Francisco, CA, 59-70.Crossref, Google Scholar · doi:10.1137/1.9781611972733.6
[20] Grabmeier J, Rudolph A (2002) Techniques of cluster algorithms in data mining. Data Min. Knowl. Discov. 6(4):303-360.Crossref, Google Scholar · doi:10.1023/A:1016308404627
[21] Granville V, Krivanek M, Rasson JP (1994) Simulated annealing: A proof of convergence. IEEE Trans. Pattern Anal. Machine Intell. 16(6):652-656.Crossref, Google Scholar · doi:10.1109/34.295910
[22] Gregor S, Hevner AR (2013) Positioning and presenting design science research for maximum impact. MIS Quart. 37(2):337-355.Crossref, Google Scholar · doi:10.25300/MISQ/2013/37.2.01
[23] Han J, Pei MK (2011) Data Mining: Concepts and Techniques, 3rd ed. (Morgan Kaufmann, New York).Google Scholar
[24] He J, Meij E, Rijke Md (2011) Result diversification based on query-specific cluster ranking. J. Amer. Soc. Inform. Sci. Tech. 62(3):550-571.Google Scholar
[25] Herrera F, Martínez L (2000) A 2-tuple fuzzy linguistic representation model for computing with words. IEEE Trans. Fuzzy Syst. 8(6):746-752.Crossref, Google Scholar · doi:10.1109/91.890332
[26] Hochba DS (1997) Approximation algorithms for NP-hard problems. ACM SIGACT News 28(2):40-52.Crossref, Google Scholar · doi:10.1145/261342.571216
[27] Järvelin K, Kekäläinen J (2002) Cumulated gain-based evaluation of IR techniques. ACM Trans. Inform. Syst. 20(4):422-446.Crossref, Google Scholar · doi:10.1145/582415.582418
[28] Kirkpatrick S, Gelatt CD, Vecchi MP (1983) Optimization by simulated annealing. Sci. 220(4598):671-680.Crossref, Google Scholar · Zbl 1225.90162 · doi:10.1126/science.220.4598.671
[29] Krishnan S, Goldberg K (2015) The minimum conductance dissimilarity cut (MCDC) algorithm to increase novelty and diversity of recommendations. Working paper, Industrial Engineering and Operations Research Department, University of California, Berkeley, CA.Google Scholar
[30] Kullback S, Leibler RA (1951) On information and sufficiency. Ann. Math. Stat. 22(1):79-86.Crossref, Google Scholar · Zbl 0042.38403 · doi:10.1214/aoms/1177729694
[31] Li F, Huang M, Zhu X (2010) Sentiment analysis with global topics and local dependency. Proc. 24th AAAI Conf. Artificial Intelligence (AAAI 2010), Atlanta, 1371-1376.Google Scholar
[32] Liu B (2011) Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2nd ed. (Springer-Verlag, Berlin).Crossref, Google Scholar · Zbl 1231.68007 · doi:10.1007/978-3-642-19460-3
[33] Ma B, Wei Q (2012) Measuring the coverage and redundancy of information search services on e-commerce platforms. Electron. Commer. Res. Appl. 11(6):560-569.Crossref, Google Scholar · doi:10.1016/j.elerap.2012.09.001
[34] MacQueen J (1967) Some methods for classification and analysis of multivariate observations. Le Cam LM, Neyman J, eds. Proc. 5th Berkeley Symposium Math. Stat. Probab. (University of California Press, Berkeley, CA), 281-297.Google Scholar · Zbl 0214.46201
[35] Malik H, Kender J, Fradkin D, Moerchen F (2010) Hierarchical document clustering using local patterns. Data Mining Knowl. Disc. 21(1):153-185.Crossref, Google Scholar · doi:10.1007/s10618-010-0172-z
[36] Manning CD, Raghavan P, Schütze H (2008) Introduction to Information Retrieval (Cambridge University Press, Cambridge, UK).Crossref, Google Scholar · Zbl 1160.68008 · doi:10.1017/CBO9780511809071
[37] Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J. Chem. Phys. 21(6):1087-1092.Crossref, Google Scholar · Zbl 1431.65006 · doi:10.1063/1.1699114
[38] Mulpuru S (2008) The state of retailing online 2008: Merchandising and web optimization report. Report, Forrester Research, Cambridge, MA. Accessed April 12, 2014, https://www.forrester.com/report/The+State+Of+Retailing+Online+2008+Merchandising+And+Web+Optimization+Report/-/E-RES46187.Google Scholar
[39] Nemhauser GL, Wolsey LA, Fisher ML (1978) An analysis of approximations for maximizing submodular set functions. Math. Program. 14(1):265-294.Crossref, Google Scholar · Zbl 0374.90045 · doi:10.1007/BF01588971
[40] Pan F, Wang W, Tung AKH, Yang J (2005) Finding representative set from massive data. Proc. 5th IEEE Internat. Conf. Data Mining, Houston, TX, 338-345.Google Scholar
[41] Qin L, Zhu X (2013) Promoting diversity in recommendation by entropy regularizer. Proc. 23rd Internat. Joint Conf. Artificial Intelligence, Beijing, China, 2698-2704.Google Scholar
[42] Ralescu D (1995) Cardinality, quantifiers, and the aggregation of fuzzy criteria. Fuzzy Sets Systems 69(3):355-365.Crossref, Google Scholar · Zbl 0844.04007 · doi:10.1016/0165-0114(94)00177-9
[43] Robertson SE, Walker S, Jones S, Hancock-Beaulieu M, Gatford M (1994) Okapi at TREC-3. Proc. 3rd Conf. Text Retrieval, Gaithersburg, MD, 109-126.Google Scholar
[44] Santos R, Macdonald C, Ounis I (2010) Selectively diversifying web search results. Proc. 19th ACM Internat. Conf. Inform. Knowl. Management, Toronto, ON, Canada, 1179-1188.Crossref, Google Scholar · doi:10.1145/1871437.1871586
[45] Santos R, Macdonald C, Ounis I (2011) Intent-aware search result diversification. Proc. 34th Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval, Beijing, China, 595-604.Crossref, Google Scholar · doi:10.1145/2009916.2009997
[46] Sen R, King RC, Shaw MJ (2006) Buyers’ choice of online search strategy and its managerial implications. J. Management Inform. Syst. 23(1):211-238.Crossref, Google Scholar · doi:10.2753/MIS0742-1222230107
[47] Shi X, Liang X (2015) Resolving inconsistent ratings and reviews on commercial webs based on support vector machines. Proc. 12th Internat. Conf. Service Syst. Service Management, Guangzhou, China, 1-6.Google Scholar
[48] Silverstein C, Marais H, Henzinger M, Moricz M (1999) Analysis of a very large web search engine query log. SIGIR Forum 33(1):6-12.Crossref, Google Scholar · doi:10.1145/331403.331405
[49] Spink A, Jansen BJ, Wolfram D, Saracevic T (2002) From e-sex to e-commerce: Web search changes. IEEE Comput. 35(3):107-109.Crossref, Google Scholar · doi:10.1109/2.989940
[50] Spink A, Wolfram D, Jansen MBJ, Saracevic T (2001) Searching the web: The public and their queries. J. Amer. Soc. Inform. Sci. Tech. 52(3):226-234.Crossref, Google Scholar · doi:10.1002/1097-4571(2000)9999:9999<::AID-ASI1591>3.0.CO;2-R
[51] Suman B, Kumar P (2006) A survey of simulated annealing as a tool for single and multiobjective optimization. J. Oper. Res. Soc. 57(10):1143-1160.Crossref, Google Scholar · Zbl 1121.90065 · doi:10.1057/palgrave.jors.2602068
[52] Tsur O, Davidov D, Rappoport A (2010) ICWSM-A great catchy name: Semi-supervised recognition of sarcastic sentences in online product reviews. Proc. 4th Internat. AAAI Conf. Weblogs Social Media, Washington, DC, 162-169.Google Scholar
[53] Wang J, Zhu J (2009) Portfolio theory of information retrieval. Proc. 32nd Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval, Boston, 115-122.Crossref, Google Scholar · doi:10.1145/1571941.1571963
[54] Whissell J, Clarke C (2011) Improving document clustering using Okapi BM25 feature weighting. Inform. Retrieval 14(5):466-487.Crossref, Google Scholar · doi:10.1007/s10791-011-9163-y
[55] Xiong H, Wu J, Chen J (2009) K-means clustering versus validation measures: A data-distribution perspective. IEEE Trans. Syst. Man Cybernetics 39(2):318-331.Crossref, Google Scholar · doi:10.1109/TSMCB.2008.2004559
[56] Zhai C, Lafferty J (2006) A risk minimization framework for information retrieval. Inform. Processing Management 42(1):31-55.Crossref, Google Scholar · Zbl 1101.68544 · doi:10.1016/j.ipm.2004.11.003
[57] Zhai C, Cohen WW, Lafferty J (2003) Beyond independent relevance: Methods and evaluation metrics for subtopic retrieval. Proc. 32nd Internat. ACM SIGIR Conf. Res. Development Inform. Retrieval, Toronto, Canada, 10-17.Crossref, Google Scholar · doi:10.1145/860435.860440
[58] Zhao Y, Karypis G (2004) Empirical and theoretical comparisons of selected criterion functions for document clustering. Machine Learn. 55(3):311-331.Crossref, Google Scholar · Zbl 1089.68615 · doi:10.1023/B:MACH.0000027785.44527.d6
[59] Zhao Y, Karypis G (2005) Hierarchical clustering algorithms for document data sets. Data Mining Knowl. Discov. 10(2):141-168.Crossref, Google Scholar · doi:10.1007/s10618-005-0361-3
[60] Zhong S (2006) Semi-supervised model-based document clustering: A comparative study. Machine Learn. 65(1):3-29.Crossref, Google Scholar · Zbl 1470.62101 · doi:10.1007/s10994-006-6540-7
[61] Zhuang J, · doi:10.1145/1390749.1390759
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.