Speaker: Prof. Dr. Sergei L. Kosakovsky Pond, Temple University, Dep. of Biology, Philadelphia, USA

Title: Beyond software tuning: scaling up comparative coding sequence analysis using approximations and models that adapt their complexity to the data (Presentation in English)
Date: Monday, 28 May 2018, 11:00 a.m.
Location: Carl-Bosch-Auditorium, Studio Villa Bosch, Schloss-Wolfsbrunnenweg 33, 69118 Heidelberg (Studio entrance between Villa Bosch and HITS)
Parking: Parking garage "Unter der Boschwiese" (free of charge)

Abstract:
Genetic sequence data are being generated at an ever-increasing pace, while many analytical techniques that are commonly used to make biologically meaningful inferences on these data are still “stuck” in the “small data” age. For example, a practical upper bound on the number of sequences that can be analyzed with many popular comparative phylogenetic methods is 1000, especially if codon-substitution models are used. These types of models are an essential tool for deciphering the action of natural selection on genetic sequences, and have been used extensively in biomedical and basic science applications, for example to quantify pathogen evolution: drug resistance, zoonotic adaptation, immune escape.

We show how this number can be raised by several orders of magnitude, enabling in-depth study of gene-sized alignments with 10,000 to 100,000 sequences, much more extensive model testing, or the implementation of more realistic models with added complexity. This can be accomplished via an adaptation of machine learning techniques originally developed in the context of large-scale data mining (latent Dirichlet allocation models), and for variable selection.

Specifically, we describe a relatively general approximation technique that limits the number of expensive likelihood function evaluations a priori by discretizing a part of the parameter space to a fixed grid, estimating other parameters using much faster, simpler models, and integrating over the grid using MCMC or a variational Bayes approach. We demonstrate how this technique can achieve 100× or greater speedups for detecting sites subject to positive selection, while improving statistical performance. Other analyses where there are only 2-3 parameters of interest (e.g. detection of directional selection in protein sequences) can be accommodated. When discretization is not appropriate, it is often possible to develop methods that employ variable parametric complexity chosen with an information-theoretic criterion. For example, in the Adaptive Branch Site Random Effects model, we quickly select and apply models of different complexity to different branches in the phylogeny, and deliver statistical performance matching or exceeding best-in-class existing approaches, while running an order of magnitude faster.
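
The grid idea described above can be illustrated with a minimal sketch. It is not the speaker's implementation: the Poisson likelihood is only a cheap stand-in for an expensive codon-model likelihood, the function names (`poisson_pmf`, `grid_weights`) and the toy data are hypothetical, and the grid weights are fitted with a plain EM loop rather than the MCMC or variational Bayes approaches named in the abstract. What it shows is the key saving: the likelihood is evaluated exactly once per site and grid point, and everything afterwards reuses that table.

```python
import math


def poisson_pmf(k, rate):
    """Stand-in for an expensive per-site likelihood (assumption, toy model)."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def grid_weights(site_counts, grid, n_iter=200):
    """Estimate mixture weights over a fixed parameter grid.

    Returns (weights, posteriors), where posteriors[s][g] is the posterior
    probability that site s belongs to grid point g.
    """
    # Step 1: evaluate the likelihood once for every (site, grid point) pair.
    # This table is the only place the "expensive" function is called.
    lik = [[poisson_pmf(k, rate) for rate in grid] for k in site_counts]

    # Step 2: fit the weights over grid points by EM, reusing the table.
    weights = [1.0 / len(grid)] * len(grid)
    for _ in range(n_iter):
        totals = [0.0] * len(grid)
        for row in lik:
            joint = [w * l for w, l in zip(weights, row)]
            norm = sum(joint)
            for g, j in enumerate(joint):
                totals[g] += j / norm
        weights = [t / len(lik) for t in totals]

    # Step 3: per-site posterior over grid points under the fitted weights.
    posteriors = []
    for row in lik:
        joint = [w * l for w, l in zip(weights, row)]
        norm = sum(joint)
        posteriors.append([j / norm for j in joint])
    return weights, posteriors


# Toy data: most sites have low counts, a few have high counts.
site_counts = [0, 0, 1, 0, 9, 10, 1, 0, 11, 0]
grid = [0.5, 2.0, 10.0]  # fixed grid of candidate rates (assumption)
weights, posteriors = grid_weights(site_counts, grid)
```

In this toy setting the cost of the likelihood step grows with the number of sites times the number of grid points and is independent of how many iterations the weight-fitting step needs, which is the property the abstract relies on.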

Curriculum vitae: Please see: http://spond.github.io/CV.js/cv.html

Contact: Benedicta Frech (phone: 06221-533-263)
