HinPage: illegal and harmful webpage identification using transductive classification. (English) Zbl 07730531

Deng, Yi (ed.) et al., Information security and cryptology. 18th International conference, Inscrypt 2022, Beijing, China, December 11–13, 2022. Revised selected papers. Cham: Springer. Lect. Notes Comput. Sci. 13837, 373-390 (2023).
Summary: With the growing popularity of the Internet, websites can make significant profit by hosting illegal and harmful content, such as violent or sexual material, illegal gambling, and drug abuse. Such sites are serious threats to a safe and secure Internet, and they are especially harmful to minors. Government agencies, ISPs, network administrators at various levels, and parents have been seeking accurate and robust solutions to block such illegal and harmful webpages. Existing solutions detect inappropriate pages based on content, e.g., using keyword matching or content-based image classification. These detectors can be easily evaded by altering the internal format of texts or images, e.g., by mixing different alphabets. In this paper, we propose to utilize relatively stable features extracted from the relationships between the targeted illegal/harmful webpages to discover and identify illegal webpages. We introduce a new mechanism, namely HinPage, that utilizes such features for the robust identification of PG (pornographic and gambling) pages. HinPage models the candidate PG pages and the resources on the pages with a heterogeneous information network (HIN). A transductive classification algorithm is then applied to the HIN to identify PG pages.
Through experiments on 10,033 candidate PG pages, we demonstrate that HinPage achieves an accuracy of 83.5% on PG page identification. In particular, it is able to identify illegal/harmful PG pages that cannot be recognized by state-of-the-art commercial products.
For the entire collection see [Zbl 1517.94007].
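
Illustration (not part of the paper): the summary does not spell out the HIN schema or the transductive classifier that HinPage uses, so the following Python sketch swaps in a generic label-propagation scheme to show the general idea of classifying pages through the resources they share. The function name `propagate_labels`, the page and resource identifiers, the page-resource-page similarity, and the parameters `alpha` and `iterations` are all assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def propagate_labels(page_resources, seed_labels, alpha=0.8, iterations=50):
    """Score pages by label propagation over shared resources (hypothetical helper).

    page_resources: dict mapping page -> set of resource identifiers on that page
    seed_labels: dict mapping page -> +1.0 (PG) or -1.0 (benign) for labeled pages
    Returns a dict page -> score; positive leans PG, negative leans benign.
    """
    pages = sorted(page_resources)

    # Two-hop path page -> resource -> page: weight = number of shared resources.
    by_resource = defaultdict(set)
    for page, resources in page_resources.items():
        for res in resources:
            by_resource[res].add(page)
    weight = defaultdict(float)
    for sharers in by_resource.values():
        for p in sharers:
            for q in sharers:
                if p != q:
                    weight[(p, q)] += 1.0

    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    degree = defaultdict(float)
    for (p, _q), w in weight.items():
        degree[p] += w
    neighbors = defaultdict(list)
    for (p, q), w in weight.items():
        neighbors[p].append((q, w / sqrt(degree[p] * degree[q])))

    # Iterate F <- alpha * S * F + (1 - alpha) * Y, with Y = 0 for unlabeled pages.
    y = {p: seed_labels.get(p, 0.0) for p in pages}
    f = dict(y)
    for _ in range(iterations):
        f = {
            p: alpha * sum(s * f[q] for q, s in neighbors[p]) + (1 - alpha) * y[p]
            for p in pages
        }
    return f


# Toy data (invented): two labeled seed pages, three unlabeled candidates.
pages = {
    "a.example": {"img1", "js1"},
    "b.example": {"img1", "js1", "img2"},
    "c.example": {"img2"},
    "d.example": {"css9"},
    "e.example": {"css9", "font3"},
}
seeds = {"a.example": 1.0, "d.example": -1.0}
scores = propagate_labels(pages, seeds)
for page in sorted(scores):
    print(page, "PG" if scores[page] > 0 else "benign")
```

Because the scores depend only on which resources pages share, changing the visible text or image encoding of a page would not by itself change its score in this sketch, which is the kind of stability the summary argues for.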

MSC:

68M11 Internet topics

Software:

metapath2vec
