×

The max-out min-in problem: a tool for data analysis. (English) Zbl 1543.90252

Summary: Consider a graph with vertex set \(V\) and non-negative weights on the edges. For every subset of vertices \(S\), define \(\phi (S)\) to be the sum of the weights of edges with one vertex in \(S\) and the other in \(V \setminus S\), minus the sum of the weights of the edges with both vertices in \(S\). We consider the problem of finding \(S \subseteq V\) for which \(\phi (S)\) is maximized. We call this combinatorial optimization problem the max-out min-in problem (MOMIP). In this paper we \((i)\) present a linear 0/1 formulation and a quadratic unconstrained binary optimization formulation for MOMIP; (ii) prove that the problem is NP-hard; (iii) report results of computational experiments on simulated data to compare the performances of the two models; (iv) illustrate the applicability of MOMIP for two different topics in the context of data analysis, namely in the selection of variables in exploratory data analysis and in the identification of clusters in the context of cluster analysis; and (v) introduce a generalization of MOMIP that includes, as particular cases, the well-known weighted maximum cut problem and a novel problem related to independent dominant sets in graphs.

MSC:

90C27 Combinatorial optimization
05C22 Signed and weighted graphs
90C20 Quadratic programming
68Q25 Analysis of algorithms and problem complexity
62H30 Classification and discrimination; cluster analysis (statistical aspects)

Software:

subselect; Gurobi; UCI-ml; R

References:

[1] Cadima, J.; Cerdeira, J. O.; Minhoto, M., Computational aspects of algorithms for variable selection in the context of principal components, Comp. Stat. Data Anal., 47, 225-236 (2004) · Zbl 1429.62216
[2] Cadima, J.; Jolliffe, I. T., Variable selection and the interpretation of principal subspaces, J. Agric. Biol. Environ. Stat., 6, 62-79 (2001)
[3] Cerdeira, J. O.; Silva, P. D.; Cadima, J.; Minhoto, M., subselect: Selecting variable subsets (2020), R package version 0.15.2. https://cran.r-project.org/web/packages/subselect/
[4] Ding, C.H.Q., He, Xiaofeng, Zha, Hongyuan, Gu, Ming, Simon, H.D., 2001. A min-max cut algorithm for graph partitioning and data clustering. In: Proceedings 2001 IEEE International Conference on Data Mining. pp. 107-114.
[5] Dua, D.; Graff, C., UCI Machine Learning Repository (2019), University of California, School of Information and Computer Science: University of California, School of Information and Computer Science Irvine, CA, http://archive.ics.uci.edu/ml
[6] Everitt, B. S.; Landau, S.; Leese, M.; Stahl, D., Cluster Analysis (2011), John Wiley & Sons Ltd: John Wiley & Sons Ltd United Kingdom · Zbl 1274.62003
[7] Fisher, R. A., The use multiple measurements in taxonomic problems, Ann. Eugenics, 7, 179-188 (1936)
[8] Gan, G.; Ma, C.; Wu, J., Data Clustering: Theory, Algorithms, and Applications, ASA-SIAM Series on Statistics and Applied Probability (2007), SIAM: SIAM Philadelphia, ASA, Alexandria, VA · Zbl 1185.68274
[9] Garey, M.; Johnson, D. S., Computers and Intractability: A Guide To the Theory of NP- Completeness (1979), W. H. Freeman & Co, San Francisco · Zbl 0411.68039
[10] Garey, M.; Johnson, D. S.; Stockmeyer, L., Some simplified NP-complete graph problems, Theoret. Comput. Sci., 1, 237-267 (1976) · Zbl 0338.05120
[11] Goddard, W.; Henning, M. A., Independent domination in graphs: A survey and recent results, Discrete Math., 313, 839-854 (2013) · Zbl 1260.05113
[12] Gurobi Optimization, LLC, Gurobi Optimizer Reference Manual (2022), https://www.gurobi.com
[13] Guyon, I., Bennett, K., Cawley, G., Escalante, H.J., Escalera, S., Ho, T.K., Macià, N., Ray, B., Saeed, M., Statnikov, A., Viegas, E., 2015. Design of the 2015 ChaLearn AutoML challenge. In: International Joint Conference on Neural Networks. IJCNN, pp. 1-8.
[14] Jiang, D.; Tang, C.; Zhang, A., Cluster analysis for gene expression data: A survey, (IEEE Transactions on Knowledge and Data Engineering, vol. 16 (2004)), 1370-1386
[15] Jolliffe, I. T., Principal Component Analysis (2002), Springer: Springer New York · Zbl 1011.62064
[16] Jolliffe, I. T.; Cadima, J., Principal component analysis: A review and recent developments, Phil. Trans. R. Soc. A, 374, Article 20150202 pp. (2016) · Zbl 1353.62067
[17] Kochenberger, G.; Hao, J.-K.; Glover, F.; Lewis, M.; Lü, Z.; Wang, H.; Wang, Y., The unconstrained binary quadratic programming problem: A survey, J. Comb. Optim., 28, 58-81 (2014) · Zbl 1303.90066
[18] Pandove, D.; Goel, S.; Rani, R., Systematic review of clustering high-dimensional and large datasets, ACM Trans. Knowl. Discov. Data., 12, Article 16 pp. (2018)
[19] R Core Team, R: A Language and Environment for Statistical Computing (2020), R Foundation for Statistical Computing: R Foundation for Statistical Computing Vienna, Austria, https://www.R-project.org/
[20] Şeker, Oylum; Tanoumand, Neda; Bodur, Merve, Digital Annealer for quadratic unconstrained binary optimization: A comparative performance analysis, Appl. Soft Comput., 127, Article 109367 pp. (2022)
[21] Shah, S. A.; Koltun, V., Robust continuous clustering, Proc. Natl. Acad. Sci. U. S. A., 114, 37, 9814-9819 (2017)
[22] Shylo, V. P.; Shylo, O. V., Solving the max-cut problem by the global equilibrium search, Cybernet. Systems Anal., 46, 5, 744-754 (2010) · Zbl 1305.90356
[23] Somers, K. M., Allometry, isometry and shape in principal component analysis, Syst. Zool., 38, 169-173 (1986)
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.