Efficient Representation of Biochemical Structures for Supervised and Unsupervised Machine Learning Models Using Multi-Sensoric Embeddings

We present an approach for efficiently embedding complex data objects from chem- and bioinformatics, such as graph structures, into Euclidean vector spaces so that these data can be handled by machine learning models. The method is denoted as the sensoric response principle (SRP). It uses a small subset of objects serving as so-called sensors. The computationally demanding dissimilarity calculations, e.g. graph kernel computations, have to be executed only with respect to these sensors, and the resulting response values are used to embed the objects into a Euclidean representation space. Thus, the SRP avoids calculating all pairwise object dissimilarities for the embedding, which is usually computationally costly due to the complex proximity measures in use. In particular, we consider strategies to determine the number of sensors required for an appropriate embedding as well as sensor selection strategies for the SRP. Finally, the quality of the embedding is evaluated with respect to the preservation of the original object relations in the embedding space. The SRP can be used for both unsupervised and supervised machine learning. We demonstrate the ability of the approach for classification learning in the context of an interpretable machine learning classifier.
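
To make the sensor idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each object is embedded as the vector of its dissimilarities to the sensors. The function names `select_sensors` and `srp_embed`, the uniform random sensor selection, and the Jaccard dissimilarity on edge sets (a cheap stand-in for an expensive graph kernel) are all illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def select_sensors(objects: Sequence[T], n_sensors: int, seed: int = 0) -> List[T]:
    """Pick sensors uniformly at random (assumed strategy, chosen for simplicity)."""
    rng = random.Random(seed)
    return rng.sample(list(objects), n_sensors)


def srp_embed(
    objects: Sequence[T],
    sensors: Sequence[T],
    dissimilarity: Callable[[T, T], float],
) -> List[List[float]]:
    """Embed each object as its vector of dissimilarities ("responses") to the sensors.

    Only len(objects) * len(sensors) dissimilarities are evaluated,
    instead of all pairwise object dissimilarities.
    """
    return [[float(dissimilarity(obj, s)) for s in sensors] for obj in objects]


def jaccard_dissimilarity(a: frozenset, b: frozenset) -> float:
    """Toy stand-in for an expensive proximity measure such as a graph kernel."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy data: each graph is represented by its set of edges.
graphs = [
    frozenset({(0, 1), (1, 2)}),
    frozenset({(0, 1), (1, 2), (2, 3)}),
    frozenset({(0, 1)}),
    frozenset({(3, 4), (4, 5)}),
]

sensors = select_sensors(graphs, n_sensors=2)
embedding = srp_embed(graphs, sensors, jaccard_dissimilarity)

for graph, vector in zip(graphs, embedding):
    print(sorted(graph), vector)
```

In this sketch, the dimension of the embedding equals the number of sensors, so choosing that number trades computational cost against how well the original object relations are preserved. The resulting vectors can be passed to any vector-based learner.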
