Life Science Thrust
Applications in the life sciences are primary
research drivers for the Center for Science of Information. Given
the ever-expanding repository of diverse datasets, the complexity of
the underlying processes, and the importance of spatio-temporal context,
life sciences applications are arguably the most interesting and challenging
test-beds for models and methods. Broadly, the challenges in the life sciences
targeted by the Center fall into four categories.
- Knowledge extraction from data
  - Integrating diverse datasets
  - Defining the granularity of data
  - Statistical methods with regularization
  - Biology-constrained methods
  - Information metrics
  - Dealing with context
- Dealing with noise in data
  - Robustness of knowledge extraction to noise
  - How to deal with missing data?
- Classification of and modularity from data
  - Specification and identification of modules (functional, spatial, temporal, etc.) from data
  - Quantifying information content of modules
  - Quantitative and qualitative comparison of modules
- Dealing with dynamical data
  - How to deal with multivariate and high dimensional time series data?
  - Understanding spatio-temporal information processing in systems
  - Identifying suitable granularity and context for analyzing data
We address a few problems below:
Dynamical Data:
A key problem in the life sciences is the development of models at different
levels of granularity from time series measurements of cellular constituents
such as proteins, nucleic acids, and metabolites, and of phenotypes such as
gene expression profiles, cellular proliferation, and cellular death. From
measurements of cellular components at different time points following
a stimulus, it would be desirable to build a biochemical pathway model.
Such models may be correlative or causal and can contain myriad nodes
and edges. If one considers modules that vary with time,
hypergraphs can be constructed based on a correlation metric or on interaction
data. These hypergraphs provide a glimpse of the dynamics of the system.
However, it would be desirable to convert these hypergraphs into necessary
and sufficient models that quantitatively describe the cellular phenotypes.
This is a major challenge for the Center.
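As an illustration of the correlation-based construction mentioned above, the following is a minimal sketch, not the Center's method. It assumes each measured species is a list of values sampled at the same time points, treats any group of species linked by a strong absolute Pearson correlation within a time window as one hyperedge, and recomputes the groups as the window slides. The function names, the window and step sizes, and the 0.9 threshold are all hypothetical choices made for the example.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    denom = (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5
    return sum(a * b for a, b in zip(dx, dy)) / denom if denom else 0.0


def correlated_groups(series, threshold):
    """Group species linked by |correlation| >= threshold, using union-find.

    Each group with two or more members is treated as one hyperedge.
    """
    parent = {name: name for name in series}

    def find(node):
        # Follow parent links to the group representative, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in combinations(series, 2):
        if abs(pearson(series[a], series[b])) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in series:
        groups.setdefault(find(name), set()).add(name)
    return [frozenset(g) for g in groups.values() if len(g) > 1]


def windowed_hyperedges(series, window, step, threshold=0.9):
    """Return (start_index, hyperedges) for each sliding time window."""
    length = len(next(iter(series.values())))
    result = []
    for start in range(0, length - window + 1, step):
        chunk = {k: v[start:start + window] for k, v in series.items()}
        result.append((start, correlated_groups(chunk, threshold)))
    return result
```

Calling `windowed_hyperedges(series, window=4, step=4)` on a dictionary that maps species names to measurement lists yields one set of hyperedges per window, so changes in group membership between windows give a coarse view of the dynamics. Correlation alone does not establish causality, which is why such hypergraphs are only a starting point for the models described above.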
Many-to-many Networks and Biochemical Pathways:
Shannon's methods deal with point-to-point
interaction or communication. However, all biological systems involve
many-point-to-many-point communication, and there are no algorithms for
understanding the information complexity of such systems. We will develop
methods to address the following questions. What are the minimal networks
that will provide quantitative
information on phenotypes? What is the sensitivity of a given phenotype
to the different connections? Entirely new methods need to be developed to
address this problem.
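For reference, the point-to-point case that Shannon's framework does cover can be sketched in a few lines. The example below estimates the mutual information, in bits, between one discretized input and one discretized output from paired samples, using the plug-in (empirical frequency) estimate. The function name and the assumption that the measurements have already been discretized are illustrative only; the many-to-many setting described above is not addressed by this sketch.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # I(X;Y) = sum over (x, y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) );
    # with counts, the ratio inside the log simplifies to c * n / (c_x * c_y).
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )
```

With few samples the plug-in estimate is biased upward, so in practice it would be paired with a bias correction or a permutation test. Those refinements are omitted here.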
Modularity in Networks:
We will develop algorithms for deciphering modularity in systems.
Among the interaction networks, biologists have painstakingly identified
cliques that are relevant to chosen phenotypes. However, there are
few methods that can predict modules in networks. One goal of the
Center is to identify modules from complex pathways.
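One widely used way to compare candidate modules quantitatively is the Newman-Girvan modularity score Q, which rewards partitions that have more within-module edges than expected from node degrees alone. The sketch below computes Q for an undirected, unweighted network and a given assignment of nodes to modules. It is offered only as an example of such a score, not as the approach the Center will take; the function name and input layout are hypothetical.

```python
from collections import defaultdict


def modularity(edges, membership):
    """Newman-Girvan modularity Q of a partition of an undirected graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    membership: dict mapping each node to its module label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = defaultdict(int)   # edges with both endpoints in the module
    degree = defaultdict(int)     # summed node degree of the module
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            internal[cu] += 1
    # Q = sum over modules of (fraction of internal edges)
    #     minus (fraction of edge endpoints in the module) squared.
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

A partition that places every node in a single module scores zero, while a partition that separates densely connected groups joined by few edges scores higher, so Q can be used to rank alternative module assignments for the same network.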
Genome Encoding and Evolution:
A large fraction of the human genome codes for the control of gene
expression during the lifetime of an organism. Driven by exciting new
technologies, the field of genomics is now beginning to decipher the
language of gene control. This process holds many challenges related to
information theory. At a pragmatic level, it requires the integration of
large amounts of heterogeneous, noisy, and incomplete data, which
nonetheless describe the action of robust networks. There are also
fascinating questions of classification and identification of the
different functional components of the regulatory networks. In addition,
by comparing the genomes of different individuals and different species,
we stand to learn about modes of information transmission across
generations. In many ways the genome is the ultimate information
repository, and using information theory to better understand it is a
major challenge.