### Knowledge Thrust

The Center targets two broad fundamental areas of knowledge management, motivated by three transformative domains.

*Information Science for Collaborative Computing and Inference*: In many applications, high-value data is distributed among parties that share some common goals but also have individual goals. This raises important questions about what data to share, and with whom, to accomplish desired tasks. These issues are particularly important under limited resources such as time, power, and bandwidth, and under other considerations such as privacy and security. We will explore fundamental problems in distributed inference and collaborative computing, and particularly the role of information in these tasks.

Often parties may be reluctant to share information, even though all would gain from computing collaboratively on everyone's data. This reluctance can be quite rational if the drawbacks of revealing one's private, proprietary information, and of losing control over its further dissemination and misuse, outweigh the benefits of sharing. Quantifying the information gained and the private information leaked would enable rational cost-benefit analysis by potential collaborators. In the absence of such quantification, risk aversion dominates, and many potential "win-win" collaborations may not take place. One major challenge in this endeavor is the *impact of time*: the time-value of information versus the time to compute it (e.g., a data disclosure may be harmless if computing the confidential information from the disclosed data takes long enough). A second major challenge is *mitigation*: perturbing the disclosed data to protect private and confidential information without damaging its usefulness for collaborative computing and inference. A third challenge is *quantifying* the mitigation afforded by secure multiparty computation protocols, which make possible "computing with data without knowing it" yet must inherently leak whatever can be inferred from one's own inputs and the computed outputs.

In addition to computing and inference, another fundamental challenge we will explore is how to summarize complex or *high-dimensional* datasets, for example through nonlinear dimensionality reduction and other techniques for making complex datasets easy to interpret (data visualization). This is particularly important in many of the applications that will be investigated (e.g., biology, economics, social networks, environmental modeling).

*Semantic and Goal-Oriented Communication*: One of the goals of the Center is to propose a modern theory that integrates *computing* and *communication* right from the start. Such a theory would attempt to formalize the "problems" that devices attempt to solve by communicating, i.e., the goals of communication. By focusing on these goals, we hope to propose efficiency and reliability measures that allow various solutions to be analyzed rigorously and compared quantitatively.

*Economics and Information Theory*: Much of modern dynamic economic theory formulates models by examining how continuously optimizing agents will interact in markets. This has been important in allowing consistent treatment of economic behavior, but the models postulate continuous optimization, implying very rapid responses to policy changes and to market signals, whereas actual behavior is more sluggish. Approaches to address this (e.g., by postulating "adjustment costs") have an ad hoc flavor and are not grounded in direct microeconomic observations.

The existing "rational expectations" theories with continuous optimization imply infinite mutual information, in Shannon's sense, between the stochastic process for market signals and the stochastic process of a person's actions. At least qualitatively, recognizing that this rate of information flow must be finite explains a broad array of observed facts about economic behavior that have in the past been explained with ad hoc postulates of inertia or adjustment costs. Our work attempts to integrate a formal information-theoretic approach into dynamic economic theory. This seems to be a promising avenue for both explaining observations and improving the formulation of economic policy.
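
As a minimal illustration of the infinite-information claim, consider a standard Gaussian example. The notation here is our own assumption for illustration and is not a model taken from the Center's work. Suppose a market signal $X \sim \mathcal{N}(0, \sigma_x^2)$ and an action $Y = X + \varepsilon$, where the noise $\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$ is independent of $X$. The mutual information between signal and action is

$$
I(X;Y) = \tfrac{1}{2} \log_2\!\left(1 + \frac{\sigma_x^2}{\sigma_\varepsilon^2}\right) \text{ bits},
$$

which diverges as $\sigma_\varepsilon^2 \to 0$, that is, when the action tracks the signal exactly. Capping the information flow at a hypothetical capacity of $\kappa$ bits, $I(X;Y) \le \kappa$, instead forces

$$
\sigma_\varepsilon^2 \ge \frac{\sigma_x^2}{2^{2\kappa} - 1},
$$

so the action necessarily responds to the signal with error. This is one simple way to read the sluggish behavior described above without postulating adjustment costs.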

*Learning and Inference in Networks*: In order to model decision-making and behavior in networks, it is important to be able to efficiently estimate joint distributions over possible *network structures* and accurately assess the significance of *discovered patterns*. For example, one network mining task is to estimate the joint distribution of node attributes (e.g., the political views of users in Facebook) conditioned on the network structure, modeling dependencies among neighboring nodes (e.g., similar political views among friends). The resulting distribution is useful to jointly predict the unknown features of nodes in a network, exploiting dependencies among nodes to improve predictions. While there are some recently developed methods for this problem, little is known about the theoretical foundations of these methods or of the underlying estimation problem. Another fundamental problem is to estimate probability distributions over the *graph structures* themselves. Accurate estimation can improve understanding of the underlying network generation process and is a necessary precursor for anomaly detection in network activity graphs (e.g., intrusion and fraud detection). Current methods result in estimated models that fail to capture the natural variability of real-world social network domains. These and other foundational problems are pursued.

*Environmental Modeling and Statistical Emulation*: Many environmental and climatological processes are studied with the aid of deterministic computer models. The computer model encapsulates knowledge about the evolution of the process over *space and time*, typically through the numerical solution of a system of differential equations. Although such models are typically deterministic, many quantities are not known with certainty, including the value of the output at new input values, and the relationship of the model to true system quantities.
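
To make the idea of uncertainty about "the value of the output at new input values" concrete, the sketch below fits a Gaussian-process emulator to a handful of runs of a toy deterministic model. Everything in it is an assumption chosen for illustration: the `simulator` function is a hypothetical stand-in for an expensive computer model, and the squared-exponential kernel, its length scale, and the design points are arbitrary choices, not the Center's method.

```python
import numpy as np

def simulator(x):
    """Hypothetical stand-in for an expensive deterministic computer model."""
    return np.sin(3.0 * x) + 0.5 * x

def rbf_kernel(a, b, length_scale=0.3, variance=1.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def fit_emulator(x_train, y_train, jitter=1e-8):
    """Factor the training covariance; jitter keeps the Cholesky stable."""
    K = rbf_kernel(x_train, x_train) + jitter * np.eye(len(x_train))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    return L, alpha

def predict(x_new, x_train, L, alpha):
    """Posterior mean and variance of the emulator at new inputs."""
    K_s = rbf_kernel(x_new, x_train)
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = rbf_kernel(x_new, x_new).diagonal() - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)  # clip tiny negative round-off

# A small design of simulator runs (assumed affordable).
x_train = np.linspace(0.0, 2.0, 8)
y_train = simulator(x_train)
L, alpha = fit_emulator(x_train, y_train)

# Inputs where the simulator has not been run; 3.0 lies outside the design.
x_new = np.array([0.5, 1.1, 3.0])
mean, var = predict(x_new, x_train, L, alpha)
for x, m, s2 in zip(x_new, mean, var):
    print(f"x={x:.2f}  emulator mean={m:.3f}  variance={s2:.4f}")
```

The emulator reproduces the simulator at the design points, where the model output is known, and reports larger predictive variance away from them, which is the sense in which a deterministic model's output can still be treated as uncertain.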