×

Improving Markov chain Monte Carlo model search for data mining. (English) Zbl 1050.68120

Summary: The motivation of this paper is the application of MCMC model scoring procedures to data mining problems, involving a large number of competing models and other relevant model choice aspects. To achieve this aim we analyze one of the most popular Markov Chain Monte Carlo methods for structural learning in graphical models, namely, the \(\text{MC}^{3}\) algorithm proposed by D. Madigan and J. York [Int. Stat. Rev. 63, 215–232 (1995; Zbl 0834.62003)]. Our aim is to improve their algorithm to make it an effective and reliable tool in the field of data mining. In such context, typically highly dimensional in the number of variables, little can be known a priori and, therefore, a good model search algorithm is crucial.
We present and describe in detail our implementation of the \(\text{MC}^{3}\) algorithm, which provides an efficient general framework for computations with both Directed Acyclic Graphical (DAG) models and Undirected Decomposable Models (UDG). We believe that the possibility of commuting easily between the two classes of models constitutes an important asset in data mining, where an a priori knowledge of causal effects is usually difficult to establish.
Furthermore, in order to improve the \(\text{MC}^{3}\) method we propose provide several graphical monitors which can help extracting results and assessing the goodness of the Markov chain Monte Carlo approximation to the posterior distribution of interest.
We apply our proposed methodology first to the well-known coronary heart disease dataset [D. Edwards and T. Havránek, Biometrika 72, 339–351 (1985; Zbl 0576.62067)]. We then introduce a novel data mining application which concerns market basket analysis.

MSC:

68T05 Learning and adaptive systems in artificial intelligence
68W05 Nonnumerical algorithms
Full Text: DOI