Document Zbl 0923.62107

Guttiérrez Toscano, P.; Marriott, F. H. C.

Unsupervised classification of chemical compounds. (English) Zbl 0923.62107

J. R. Stat. Soc., Ser. C, Appl. Stat. 48, No. 2, 153-163 (1999).

Summary: Clustering chemical compounds of similar structure is important in the pharmaceutical industry. One way of describing the structure is the chemical ‘fingerprint’. The fingerprint is a string of binary digits, and typical data sets consist of very large numbers of fingerprints; a suitable clustering procedure must take account of the properties of this method of coding and must be able to handle large data sets. This paper describes the analysis of a set of fingerprint data. The analysis was based on an appropriate distance measure derived from the fingerprints, followed by metric scaling into a low dimensional space. An approximation to metric scaling, suitable for very large data sets, was investigated. Cluster analysis using two programs, mclust and AutoClass-C, was carried out on the scaled data.
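
The distance-then-scaling step described in the summary can be illustrated with a minimal sketch. The summary names neither the distance measure nor the exact scaling algorithm, so the choices here are assumptions: the Jaccard (Tanimoto) distance, a common choice for binary fingerprints, and classical metric scaling (principal coordinates analysis) via an eigendecomposition. The function names and the toy fingerprints are hypothetical and do not come from the paper.

```python
import numpy as np


def jaccard_distance_matrix(fps):
    """Pairwise Jaccard (Tanimoto) distances between binary fingerprints.

    Assumption: the paper's actual distance measure is not given in the
    summary; Jaccard is used here only as a plausible stand-in.
    """
    a = np.asarray(fps, dtype=bool).astype(np.int64)
    inter = a @ a.T                      # bits set in both fingerprints
    counts = a.sum(axis=1)               # bits set in each fingerprint
    union = counts[:, None] + counts[None, :] - inter
    # Two all-zero fingerprints have an empty union; treat them as identical.
    sim = np.ones(inter.shape, dtype=float)
    np.divide(inter, union, out=sim, where=union > 0)
    return 1.0 - sim


def classical_mds(dist, n_components=2):
    """Classical metric scaling of a distance matrix into n_components dimensions."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centre the squared distances to obtain an inner-product matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    top_vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(top_vals)


# Toy data: four 6-bit fingerprints forming two obvious groups.
fingerprints = np.array([
    [1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 0],
])

dist = jaccard_distance_matrix(fingerprints)
coords = classical_mds(dist, n_components=2)
```

The rows of `coords` are the low dimensional points that a clustering program would then take as input. The full eigendecomposition costs on the order of n^3 operations for n compounds, which gives some sense of why an approximation to metric scaling is of interest for very large data sets; the approximation investigated in the paper is not reproduced here.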

MSC:

62N99	Survival analysis and censored data
62H30	Classification and discrimination; cluster analysis (statistical aspects)
62P99	Applications of statistics
92C40	Biochemistry, molecular biology
92E10	Molecular structure (graph-theoretic methods, methods of differential topology, etc.)

Keywords:

chemical fingerprint; cluster analysis; Rand index; metric scaling

Software:

mclust

