Showing 1–17 of 17 results for author: Boullé, M

  1. arXiv:2409.11100  [pdf, other]

    cs.LG stat.ML

    Fractional Naive Bayes (FNB): non-convex optimization for a parsimonious weighted selective naive Bayes classifier

    Authors: Carine Hue, Marc Boullé

    Abstract: We study supervised classification for datasets with a very large number of input variables. The naïve Bayes classifier is attractive for its simplicity, scalability and effectiveness in many real data applications. When the strong naïve Bayes assumption of conditional independence of the input variables given the target variable is not valid, variable selection and model averaging are two common…

    Submitted 17 September, 2024; originally announced September 2024.

  2. arXiv:2307.16718  [pdf, other]

    cs.LG stat.ML

    An Efficient Shapley Value Computation for the Naive Bayes Classifier

    Authors: Vincent Lemaire, Fabrice Clérot, Marc Boullé

    Abstract: Variable selection or importance measurement of input variables to a machine learning model has become the focus of much research. It is no longer enough to have a good model; one must also explain its decisions. This is why there are so many intelligibility algorithms available today. Among them, Shapley value estimation algorithms are intelligibility methods based on cooperative game theory. In…

    Submitted 31 July, 2023; originally announced July 2023.

    Comments: 15 pages, 3 figures

  3. arXiv:2306.05786  [pdf, other]

    cs.LG

    Two-level histograms for dealing with outliers and heavy tail distributions

    Authors: Marc Boullé

    Abstract: Histograms are among the most popular methods used in exploratory analysis to summarize univariate distributions. In particular, irregular histograms are good non-parametric density estimators that require very few parameters: the number of bins with their lengths and frequencies. Many approaches have been proposed in the literature to infer these parameters, either assuming hypotheses about the u…

    Submitted 9 June, 2023; originally announced June 2023.

    Comments: 30 pages, 47 figures

  4. arXiv:2212.13524  [pdf, other]

    cs.LG math.ST stat.ML

    Fast and fully-automated histograms for large-scale data sets

    Authors: Valentina Zelaya Mendizábal, Marc Boullé, Fabrice Rossi

    Abstract: G-Enum histograms are a new fast and fully automated method for irregular histogram construction. By framing histogram construction as a density estimation problem and its automation as a model selection task, these histograms leverage the Minimum Description Length principle (MDL) to derive two different model selection criteria. Several proven theoretical results about these criteria give insigh…

    Submitted 27 December, 2022; originally announced December 2022.

    Journal ref: Computational Statistics and Data Analysis, 2023, 180, pp.107668

  5. arXiv:2212.11728  [pdf, other]

    cs.LG math.ST stat.ML

    Co-clustering based exploratory analysis of mixed-type data tables

    Authors: Aichetou Bouchareb, Marc Boullé, Fabrice Clérot, Fabrice Rossi

    Abstract: Co-clustering is a class of unsupervised data analysis techniques that extract the existing underlying dependency structure between the instances and variables of a data table as homogeneous blocks. Most of those techniques are limited to variables of the same type. In this paper, we propose a mixed data co-clustering method based on a two-step methodology. In the first step, all the variables are…

    Submitted 22 December, 2022; originally announced December 2022.

    Journal ref: Advances in Knowledge Discovery and Management, 834, Springer International Publishing, pp.23-41, 2019, Studies in Computational Intelligence

  6. arXiv:2212.11725  [pdf, other]

    cs.LG math.ST stat.ML

    Model Based Co-clustering of Mixed Numerical and Binary Data

    Authors: Aichetou Bouchareb, Marc Boullé, Fabrice Clérot, Fabrice Rossi

    Abstract: Co-clustering is a data mining technique used to extract the underlying block structure between the rows and columns of a data matrix. Many approaches have been studied and have shown their capacity to extract such structures in continuous, binary or contingency tables. However, very little work has been done to perform co-clustering on mixed-type data. In this article, we extend the latent block…

    Submitted 22 December, 2022; originally announced December 2022.

    Journal ref: Advances in Knowledge Discovery and Management, 834, Springer International Publishing, pp.3-22, 2019, Studies in Computational Intelligence

  7. Interpretable Feature Construction for Time Series Extrinsic Regression

    Authors: Dominique Gay, Alexis Bondu, Vincent Lemaire, Marc Boullé

    Abstract: Supervised learning of time series data has been extensively studied for the case of a categorical target variable. In some application domains, e.g., energy, environment and health monitoring, the target variable is numerical, and the problem is known as time series extrinsic regression (TSER). In the literature, some well-known time series classifiers have been extended for TSER pr…

    Submitted 15 March, 2021; originally announced March 2021.

  8. arXiv:1902.02056  [pdf, other]

    stat.ML cs.LG

    Un modèle Bayésien de co-clustering de données mixtes

    Authors: Aichetou Bouchareb, Marc Boullé, Fabrice Rossi, Fabrice Clérot

    Abstract: We propose a MAP Bayesian approach to perform and evaluate a co-clustering of mixed-type data tables. The proposed model infers an optimal segmentation of all variables, then performs a co-clustering by minimizing a Bayesian model selection cost function. One advantage of this approach is that it is free of user parameters. Another main advantage is the proposed criterion which gives an exact measure…

    Submitted 6 February, 2019; originally announced February 2019.

    Comments: in French

    Journal ref: Extraction et gestion des connaissances 2018, Jan 2018, Paris, France. Revue des Nouvelles Technologies de l'Information, RNTI-E-34, pp.275-280, 2018, Actes de la 18ème Conférence Internationale Francophone sur l'Extraction et gestion des connaissances (EGC'2018)

  9. arXiv:1608.07929  [pdf, other]

    stat.ML cs.SI physics.soc-ph

    Discovering Patterns in Time-Varying Graphs: A Triclustering Approach

    Authors: Romain Guigourès, Marc Boullé, Fabrice Rossi

    Abstract: This paper introduces a novel technique to track structures in time-varying graphs. The method uses a maximum a posteriori approach for adjusting a three-dimensional co-clustering of the source vertices, the destination vertices and the time, to the data under study, in a way that does not require any hyper-parameter tuning. The three dimensions are simultaneously segmented in order to build clust…

    Submitted 29 August, 2016; originally announced August 2016.

    Comments: Advances in Data Analysis and Classification, Springer Verlag, 2015, Online First

  10. arXiv:1608.05522  [pdf, other]

    cs.IT

    Revisiting enumerative two-part crude MDL for Bernoulli and multinomial distributions (Extended version)

    Authors: Marc Boullé, Fabrice Clérot, Carine Hue

    Abstract: We leverage the Minimum Description Length (MDL) principle as a model selection technique for Bernoulli distributions and compare several types of MDL codes. We first present a simplistic crude two-part MDL code and a Normalized Maximum Likelihood (NML) code. We then focus on the enumerative two-part crude MDL code, suggest a Bayesian interpretation for finite size data samples, and exhibit a stro…

    Submitted 3 October, 2016; v1 submitted 19 August, 2016; originally announced August 2016.

    Comments: 25 pages

    MSC Class: 68P30; ACM Class: E.4

  11. arXiv:1511.01281  [pdf, other]

    stat.ML cs.DB cs.LG

    Co-Clustering Network-Constrained Trajectory Data

    Authors: Mohamed Khalil El Mahrsi, Romain Guigourès, Fabrice Rossi, Marc Boullé

    Abstract: Recently, clustering moving object trajectories has been gaining interest from both the data mining and machine learning communities. This problem, however, has been studied mainly in the setting where moving objects can move freely in Euclidean space. In this paper, we study the problem of clustering trajectories of vehicles whose movement is restricted by the underlying road network.…

    Submitted 4 November, 2015; originally announced November 2015.

    Journal ref: Advances in Knowledge Discovery and Management, 615, Springer International Publishing, pp.19-32, 2015, Studies in Computational Intelligence, 978-3-319-23750-3

  12. A Study of the Spatio-Temporal Correlations in Mobile Calls Networks

    Authors: Romain Guigourès, Marc Boullé, Fabrice Rossi

    Abstract: Over the last few years, the amount of data held by companies has increased significantly. This is why data analysis methods have to evolve to meet new demands. In this article, we introduce a practical analysis of a large database from a telecommunication operator. The problem is to segment a territory and characterize the retrieved areas according to their inhabitants' behavior in terms of mobi…

    Submitted 30 October, 2015; originally announced October 2015.

    Comments: Advances in Knowledge Discovery and Management, 615, Springer International Publishing, pp.3-17, 2015, Studies in Computational Intelligence

  13. arXiv:1508.01340  [pdf, other]

    cs.SI cs.DB stat.ML

    Universal Approximation of Edge Density in Large Graphs

    Authors: Marc Boullé

    Abstract: In this paper, we present a novel way to summarize the structure of large graphs, based on non-parametric estimation of edge density in directed multigraphs. Following a coclustering approach, we use a clustering of the vertices, with a piecewise constant estimation of the density of the edges across the clusters, and address the problem of automatically and reliably inferring the number of clusters…

    Submitted 6 August, 2015; originally announced August 2015.

    ACM Class: H.2.8; I.5.3; G.3

  14. arXiv:1505.01300  [pdf, other]

    cs.DB stat.ML

    Cats & Co: Categorical Time Series Coclustering

    Authors: Dominique Gay, Romain Guigourès, Marc Boullé, Fabrice Clérot

    Abstract: We suggest a novel method of clustering and exploratory analysis of temporal event sequence data (also known as categorical time series) based on three-dimensional data grid models. A data set of temporal event sequences can be represented as a data set of three-dimensional points, where each point is defined by three variables: a sequence identifier, a time value and an event value. Instantiating data…

    Submitted 6 May, 2015; originally announced May 2015.

    ACM Class: H.2.8

  15. arXiv:1503.06060  [pdf, other]

    cs.DB stat.ML

    Country-scale Exploratory Analysis of Call Detail Records through the Lens of Data Grid Models

    Authors: Romain Guigourès, Dominique Gay, Marc Boullé, Fabrice Clérot, Fabrice Rossi

    Abstract: Call Detail Records (CDRs) are data recorded by telecommunications companies, consisting of basic information related to several dimensions of the calls made through the network: the source, destination, date and time of calls. CDR data analysis has received much attention in recent years since it might reveal valuable information about human behavior. It has shown high added value in many a…

    Submitted 20 March, 2015; originally announced March 2015.

    Comments: Submitted to Industrial Track of ECML/PKDD 2015

    MSC Class: stat.ML - Machine Learning

  16. Nonparametric Hierarchical Clustering of Functional Data

    Authors: Marc Boullé, Romain Guigourès, Fabrice Rossi

    Abstract: In this paper, we deal with the problem of curve clustering. We propose a nonparametric method which partitions the curves into clusters and discretizes the dimensions of the curve points into intervals. The cross-product of these partitions forms a data-grid which is obtained using a Bayesian model selection approach while making no assumptions regarding the curves. Finally, a post-processing te…

    Submitted 2 July, 2014; originally announced July 2014.

    Journal ref: Advances in Knowledge Discovery and Management, Guillet, Fabrice and Pinaud, Bruno and Venturini, Gilles and Zighed, Djamel Abdelkader (Ed.) (2014) 15-35

  17. arXiv:1301.2659  [pdf, other]

    cs.LG cs.SI stat.ML

    A Triclustering Approach for Time Evolving Graphs

    Authors: Romain Guigourès, Marc Boullé, Fabrice Rossi

    Abstract: This paper introduces a novel technique to track structures in time-evolving graphs. The method is based on a parameter-free approach for three-dimensional co-clustering of the source vertices, the target vertices and the time. All these features are simultaneously segmented in order to build time segments and clusters of vertices whose edge distributions are similar and evolve in the same way ove…

    Submitted 12 January, 2013; originally announced January 2013.

    Journal ref: Co-clustering and Applications International Conference on Data Mining Workshop, Brussels : Belgium (2012)