Algorithms for data science. (English) Zbl 1367.62005

Cham: Springer (ISBN 978-3-319-45795-6/hbk; 978-3-319-45797-0/ebook). xxiii, 430 p. (2016).
This textbook on practical data analytics unites fundamental principles, algorithms, and data. Algorithms are the keystone of data analytics and the focal point of this textbook. Data science is the discipline that includes data mining as a tool or an important topic. The escalating demand for insights into big data requires a fundamentally new approach to architecture, tools, and practices, which is why the term data science is useful. It underscores the centrality of data in the investigation, because data are a store of potential value in the field of action. The label science invokes certain very real concepts, such as the notion of public knowledge and peer review. From this point of view, data science is not a new idea; it is part of a continuum of serious thinking that dates back hundreds of years. A good example of the results of data science is Benford’s law (see [A. Berger and T. P. Hill, Notices Am. Math. Soc. 64, No. 2, 132–134 (2017; Zbl 1359.60020); An introduction to Benford’s law. Princeton, NJ: Princeton University Press (2015; Zbl 1412.60002)]). In an effort to identify some of the best-known algorithms widely used in the data mining community, the IEEE International Conference on Data Mining (ICDM) identified the top \(10\) algorithms in data mining for presentation at ICDM ’06 in Hong Kong, where a panel announced them and discussed the impact of, and further research on, each of them [X. Wu (ed.) and V. Kumar (ed.), The top ten algorithms in data mining. Papers based on the presentations at the IEEE international conference on data mining (ICDM 2006), Hong Kong, December 18–22, 2006. Boca Raton, FL: CRC Press (2009; Zbl 1179.68129); X. Wu et al., “Top 10 algorithms in data mining”, Knowl. Inf. Syst. 14, 1–37 (2008)]. In the present book, clear and intuitive explanations of the mathematical and statistical foundations make the algorithms transparent.
Most of the algorithms announced by IEEE in 2006 are included. But practical data analytics requires more than just the foundations. Problems and data are enormously variable and only the most elementary algorithms can be used without modification. Programming fluency and experience with real and challenging data are indispensable, and so the reader is immersed in Python and R and real data analysis. By the end of the book, the reader will have gained the ability to adapt algorithms to new problems and carry out innovative analysis.
The book has three parts. (I) Data Reduction: This part begins with the concepts of data reduction, data maps, and information extraction. The second chapter introduces associative statistics, the mathematical foundation of scalable algorithms and distributed computing. Practical aspects of distributed computing are the subject of the Hadoop and MapReduce chapter. (II) Extracting Information from Data: Linear regression and data visualization are the principal topics of Part II. The authors dedicate a chapter to the critical domain of healthcare analytics as an extended example of practical data analytics. The algorithms and analytics will be of great interest to practitioners who want to use the large and unwieldy data sets of the Centers for Disease Control and Prevention’s Behavioral Risk Factor Surveillance System. (III) Predictive Analytics: Two foundational and widely used algorithms, \(k\)-nearest neighbors and naive Bayes, are developed in detail. A chapter is dedicated to forecasting. The last chapter focuses on streaming data; its tutorials use publicly accessible data streams originating from the Twitter API and the NASDAQ stock market.
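To give a flavor of the associative statistics mentioned under Part I, the following is a minimal sketch and is not taken from the book. It assumes that "associative" means a statistic computed separately on the partitions of a data set can be combined, here by simple addition, to give the same statistic for the whole data set. The function names and the toy data are hypothetical and serve only as illustration.

```python
from functools import reduce


def partial_stats(chunk):
    """Return (count, sum, sum of squares) for one partition of the data."""
    return (len(chunk), sum(chunk), sum(x * x for x in chunk))


def combine(a, b):
    """Merge the statistics of two partitions by componentwise addition."""
    return tuple(x + y for x, y in zip(a, b))


def mean_and_variance(stats):
    """Derive the sample mean and sample variance from the combined statistics."""
    n, s, ss = stats
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance


# Hypothetical toy data, split into three partitions as a cluster might do.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
partitions = [data[:3], data[3:5], data[5:]]

# "Map" each partition to its statistics, then "reduce" by combining them.
combined = reduce(combine, map(partial_stats, partitions))
mean, variance = mean_and_variance(combined)
```

Because the combination step is associative and commutative, the partitions can be processed in any order and on different machines, which is the property that frameworks such as MapReduce exploit. The sum-of-squares formula is chosen here for brevity; it can lose precision on large or poorly scaled data.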
In contrast to other studies on the algorithms in data analytics, this book is aimed at upper-division undergraduate and graduate students in mathematics, statistics, and computer science. It is intended for a one- or two-semester course in data analytics and reflects the authors’ research experience in data science and their teaching experience in various areas. Particularly valuable are the case studies and exercises (some of them with solutions). The book is open to a wide audience, as the prerequisites are modest. Students with one or two courses in probability or statistics, an exposure to algebra and calculus, and a programming course will have no difficulties. The core material of every chapter is accessible to anyone with these prerequisites. Each chapter includes exercises of varying levels of difficulty. Chapters expand at the close with innovations of interest to practitioners of data science; this material requires broader prerequisites. The text is eminently suitable for self-study and an exceptional resource for practitioners. The web page https://www.softmathconsultants.com/algorithms-and-data-science/ supports the reader with the data used in the book, tutorials, and exercises.

MSC:

62-01 Introductory exposition (textbooks, tutorial papers, etc.) pertaining to statistics
62R07 Statistical aspects of big data and data science
62P10 Applications of statistics to biology and medical sciences; meta analysis
68T05 Learning and adaptive systems in artificial intelligence
68M14 Distributed systems
62J05 Linear regression; mixed models
62H30 Classification and discrimination; cluster analysis (statistical aspects)
62M20 Inference from stochastic processes and prediction