Abstract
The volume of unstructured information on the Internet is constantly increasing, together with the number of websites and the amount of their content. To process this vast amount of information, it is important to distinguish clusters of related webpages. Such clusters are used, for example, for knowledge extraction, named entity recognition, and recommendation algorithms. A variety of applications (such as semantic analysis systems, crawlers, and search engines) use semantic clustering algorithms to recognize thematically connected webpages. Most of these algorithms rely on text analysis of the web documents' content, which leads to certain limitations, such as long processing times, the need for representative text content, and the vagueness of natural language. In this article, we present a framework for unsupervised, domain- and language-independent semantic clustering of a website, which uses the website's internal hypertext structure and does not require text analysis. As a basis, we represent the hypertext structure as a graph and apply known flow simulation clustering algorithms to the graph to produce a set of webpage clusters. We assume that these clusters contain thematically connected webpages. We evaluate our clustering approach on a corpus of real-world webpages and compare it with well-known text document clustering algorithms.
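To illustrate the idea of clustering a website through its link graph, the following is a minimal sketch of a Markov Cluster (MCL) style flow simulation, one well-known member of this family of algorithms. It is not the authors' implementation. The function names, the `links` input format, the treatment of hyperlinks as undirected edges, the added self-loops, and the parameter values are all assumptions made for this example.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic)."""
    n = len(matrix)
    result = [[0.0] * n for _ in range(n)]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            result[i][j] = matrix[i][j] / total if total else 0.0
    return result


def multiply(a, b):
    """Plain square matrix product, sufficient for small graphs."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(links, inflation=2.0, max_iterations=50, tolerance=1e-6):
    """Cluster pages using only the link structure.

    links: dict mapping a page to the pages it links to (hypothetical format).
    Returns a sorted list of clusters, each a sorted list of pages.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)

    # Adjacency matrix; links are treated as undirected (an assumption).
    matrix = [[0.0] * n for _ in range(n)]
    for source, targets in links.items():
        for target in targets:
            i, j = index[source], index[target]
            matrix[i][j] = matrix[j][i] = 1.0
    for i in range(n):
        matrix[i][i] = 1.0  # self-loops, a common MCL convention
    matrix = normalize_columns(matrix)

    for _ in range(max_iterations):
        expanded = multiply(matrix, matrix)           # expansion: flow spreads
        inflated = normalize_columns(                 # inflation: strong flow wins
            [[value ** inflation for value in row] for row in expanded])
        change = max(abs(inflated[i][j] - matrix[i][j])
                     for i in range(n) for j in range(n))
        matrix = inflated
        if change < tolerance:
            break

    # Each row that still carries flow is an attractor; the columns it
    # reaches are the pages of its cluster.
    clusters = set()
    for i in range(n):
        members = frozenset(pages[j] for j in range(n) if matrix[i][j] > 1e-3)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


if __name__ == "__main__":
    # Hypothetical site: two densely linked sections joined by one link.
    site = {
        "/shop": ["/shop/a", "/shop/b"],
        "/shop/a": ["/shop/b"],
        "/shop/b": ["/blog"],
        "/blog": ["/blog/1", "/blog/2"],
        "/blog/1": ["/blog/2"],
    }
    for cluster in markov_cluster(site):
        print(cluster)
```

The `inflation` parameter controls granularity in this sketch: larger values tend to produce more, smaller clusters. No page text is read at any point, which mirrors the text-free property described in the abstract.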
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Salin, V., Slastihina, M., Ermilov, I., Speck, R., Auer, S., Papshev, S. (2015). Semantic Clustering of Website Based on Its Hypertext Structure. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and Semantic Web. KESW 2015. Communications in Computer and Information Science, vol 518. Springer, Cham. https://doi.org/10.1007/978-3-319-24543-0_14
DOI: https://doi.org/10.1007/978-3-319-24543-0_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-24542-3
Online ISBN: 978-3-319-24543-0
eBook Packages: Computer Science, Computer Science (R0)