\subsection{ Abstract of Optimization Presentation} \begin{slide} \begin{itemize} \item We review problems including TSP, linear programming, and data analysis $(\chi^2)$. \item We discuss relevance of central limit theorem, least squares, maximum likelihood, histograms, and scatterplots. \item We use simple length measurement, as well as full scale physics experiment to discuss $\chi^2$, maximum likelihood, and their contrast. \item Computer implementations are discussed for Nonlinear Minimization, linear programming, and histogramming. \item The foils end with notes for a set of CPS713 projects in this area. \end{itemize} \subsection{ Optimization Examples---I} \begin{slide} \begin{itemize} \item Given $f(\underline{x})$, find minimum value of $f$ as a function of $\underline{x}$---for example, \begin{description} \item{$\bullet$} $f(\underline{x}) = x_1^2 + x_2^2$ \hspace{1em} (continuous variables) minimum \ $x_1 = x_2 = 0$ \end{description} \end{itemize} \vspace{10pt} \begin{center} {\bf Travelling Salesman Problem} \end{center} \begin{itemize} \item Let \begin{eqnarray*} i=1 \ldots N & \hspace{1em} & \mbox{label cities} \\ \eta_{ip} = 1 & \hspace{1em} & \mbox{if traveller is at city $i$ at time step} \\ & & p = 1 \ldots N \\ d_{ij} = & \hspace{1em} & \mbox{distance from city $i$ to city $j$}. \end{eqnarray*} \item Minimize \[ D = \sum_{i=1}^N\ \sum_{j=1}^N\ \sum_{p=1}^{N-1} d_{ij}\ \eta_{ip}\ \eta_{jp+1} \] \end{itemize} This is a (NP Complete) Discrete Variables example. The concept of NP completeness, discussed in detail by computer science, appears of little practical significance. \subsection{ Examples -- II} \begin{slide} \begin{itemize} \item Given a collection of molecules, find its ground state (minimize its energy as a function of atomic positions) and see if it will make a nice drug. \item Given a set of pixels, forming an image (as seen by your eyes), optimize your life based on information---pattern recognition/(artificial) intelligence (e.g., if image contains a lion, run away; while if image contains football thrown by Jim~Kelly, catch it and run to goal line). \item Given a set of ELINT (electronic intelligence $=$ signals as a function of position), decide on best steps to win the battle---spatial reasoning or situation assessment. \item Minimize ${\displaystyle \int} \left( \nabla \phi \right)^2 d^3 x$ which is variational principle that derives Finite Element method \end{itemize} \subsection{ Examples -- III} \begin{slide} \begin{itemize} \item Given position on chess board, decide on best move. Write down a tree \vspace{12pt} \centerline{\psfig{figure=opt.fig1.ps,clip=true,width=3.5in}} \vspace{10pt} \noindent and expand several ``plies'' (two plies---one by a white, and one by block make a move). Assign values to final position. Chosen move is one that leads to highest (best) value of worst possible scenario---minimax problem. \end{itemize} \subsection{ Examples IV -- Linear Programming} \begin{slide} \noindent{\bf Transportation} \begin{itemize} \item Take $M$ warehouses and $N$ retail outlets. Consider one product. We wish to minimize cost of shipping products from warehouses to outlets. \begin{description} \item{$\bullet$} $a_i$ units at warehouse $i$ \item{$\bullet$} $b_j$ units needed at outlet $j$ \item{$\bullet$} $x_{ij}$ units shipped from $i$ to $j$ at cost $c_{ij}$ each \[ \mbox{constraints} \left\{ \begin{array}{ll} {\displaystyle \sum_{j=1}^N}\, x_{ij} & \le a_i \\ & \\ {\displaystyle \sum_{i=1}^M}\, x_{ij} & = b_j \end{array} \right. \] \end{description} \item Minimize ${\displaystyle \sum_{i=1}^N}\ {\displaystyle \sum_{j=1}^M}\ c_{ij}\, x_{ij}$ objective function. \item This is Linear Programming---objective function and constraints are linear. \end{itemize} \subsection{ Optimization Examples -- V} \begin{slide} \noindent{\bf Interpolation} \begin{itemize} \item Given a set of values $y=y_i$ at $x=x_i$, find the best estimate of $y=\tilde{y}$ at $x\ge \tilde{x}$. \vspace{10pt} \centerline{\psfig{figure=opt.fig2.ps,clip=true,width=4.5in}} \vspace{10pt} \item Typically do by assuming $y(x)$ is a polynomial---maybe of degree~3 in this example. \item Note that interpolation, like many optimization problems, is mathematically undefined. One can find polynomials of degree~4 taking any given values of $x_1$, $x_2$, $x_3$, $x_4$, and $\tilde{x}$. We can only determine value at $\tilde{x}$ by $\underline{\rm assuming}$ polynomial of degree~3 (or other limiting assumption). \end{itemize} \subsection{ A Collection of Methods for Optimization} \begin{slide} \begin{tabular}{l} Maximum Likelihood \\ $\chi^2$ \\ Maximum Entropy \\ Combinational Optimization \\ $\alpha$-$\beta$ Pruning \\ Branch and Bound \\ Branch and Cut \\ Linear Programming \\ Nonlinear Programming \\ Integer Programming \\ Heuristic \\ Greedy Algorithm \\ Conjugate Gradient \\ Simulated Annealing \\ Simulated Tempering \\ Neural Networks \\ Deterministic Annealing \\ Genetic Algorithms \end{tabular} We will start with Maximum Likelihood and $\chi^2$ illustrated by data analysis problem \subsection{ Optimization Example -- V: Analysis of Experimental Data} \begin{slide} \begin{itemize} \item Interpolation involves finding function values given function values at other points. This naturally extends to case where given function values have errors. \item We consider the general case where we have a set of observations $x_1$, $x_2 \ldots x_n$ and we wish to deduce something from it. \begin{description} \item{$\bullet$ a)} $x_i$ are separate measurements of the length of a rod which differ due to uncertainities in measurement process (measurer drunk, rod quivering in earthquake). $\underline{\rm Goal}$:\ What is best estimate for length of rod and error therein. \end{description} \end{itemize} \subsection{ Analysis of Experimental Data (cont.)} \begin{slide} \begin{itemize} \begin{description} \item{$\bullet$ b)} $x_i$ are measurements of particle positions in a detector at the proposed SSC---Superconducting Supercollidor $\underline{\rm Goal}$:\ Deduce which is the correct theory of the fundamental interactions. What particles exist? \item{$\bullet$ c)} $x_i$ could be number of people claiming they will vote for Clinton and $i$ could label either geographic position or time sample taken. $\underline{\rm Goal}$:\ Deduce who will win Presidential election and what strategy changes are appropriate to affect result. \end{description} \item In each case, the $x_i\,$'s are related to ``goal'' (which can be qualitative or quantitative) by probability distributions. \end{itemize} \subsection{ Data Analysis Examples---I} \begin{slide} \begin{itemize} \item Consider three examples defined already. Their characteristics are: \begin{description} \item{$\bullet$ (a)} Length of Rod This is simplest case. Here, the length is well defined and uncertainties come from variations in measurement process itself. \item{$\bullet$ (b)} High Energy Physics The events recorded are fundamentally probabilistic because quantum theory describes events this way, $\underline{\rm not}$ deterministically. There are no ``hidden variables,'' such as $\underline{\alpha}$ in example (c). \end{description} \end{itemize} \subsection{ Data Analysis Examples---II} \begin{slide} \noindent $\bullet$ (c)\ Voting \begin{itemize} \item We can associate with the voting process, a function $V$ \begin{eqnarray*} V & = & 0 \qquad \mbox{doesn't vote} \\ & = & 1 \qquad \mbox{votes for Clinton} \\ & = & 2 \qquad \mbox{votes for Bush} \\ & = & 3 \qquad \mbox{votes for Perot, etc.} \end{eqnarray*} \item Each person is described by a set of attributes $\underline{\alpha}$ $\alpha_1 = $ age,\ \ $\alpha_2 = $ income,\ \ $\alpha_3 = $ place they live, etc. with a big enough size vector $\underline{\alpha}$ \[ V \equiv V\left( \underline{\alpha} \right) \] \item Each person $\rightarrow \underline{\alpha}_i$ a particular vector. \item If we were mathematicians, we could find $V\left( \underline{\alpha} \right)$ from survey (determine parameters in it) and so predict election result. \item But society is a ``complex system.'' Impossible to write down some function $V$ even though actions might be described by some sort of function. \begin{center} $\Rightarrow$ use probabilistic approach. \end{center} \end{itemize} \subsection{ The Computational and Numerical Issues} \begin{slide} \begin{description} \item{$\bullet$ 1.} The underlying mathematics is that of random variables. The key basic theorem is the \begin{itemize} \item CENTRAL LIMIT THEOREM---this is at the level of CPS615. \end{itemize} \item{$\bullet$ 2.} There is a set of methods where we believe that we KNOW something precise about associated probability distributions. \begin{itemize} \item LEAST SQUARES METHOD \item MAXIMUM LIKELIHOOD METHOD \end{itemize} \item{$\bullet$ 3.} There are a set of techniques for data exploration where we search a large dataset for expected or unexpected correlations and clusterings which are qualitative and usually not describable by mathematical functions. \begin{itemize} \item HISTOGRAMS \item SCATTER PLOTS \end{itemize} \end{description} \subsection{ SSC Scenario---I} \begin{slide} \begin{itemize} \item {\bf Data Analysis at the ``SSC''} 1.\ A typical problem which is rich enough to exhibit all these issues. \begin{eqnarray*} p\ \ \overline{p}\ \ & \hspace{1em} & \rightarrow N\ \mbox{particles} \\ \ \ \rightarrow\ \ \leftarrow & \hspace{1em} & N = 2 \rightarrow 1000? \end{eqnarray*} \begin{center} \begin{tabular}{lll} $\underline{\rm Particles\ labelled\ by}$ && \\ type, && $\pi$, $k$, $p$, $\overline{p} \ldots$ \\ momentum, && $\underline{p}$, $E$ \\ and energy, mass$^2$ &= & $E^2 - \underline{p}^2$ \\ \phantom{and energy}\ \ \ \ $\uparrow$ \\ \ \ essentially determined && \\ \ \ by type && \\ \end{tabular} \end{center} \item A typical experiment might record 10 million such events on some storage medium (e.g., $\sim 3$ events per second for a year). \item The events might be ``unbiased'', i.e., all possible events recorded or they may be selected by a ``trigger.'' \end{itemize} \subsection{ SSC Scenario II} \begin{slide} \begin{itemize} \item The events are specified by a set of parameters. \item $N = $ number of particles, and for each particle, parameters defining them. This is a vector $\underline{y}$. \item Each event described by $\underline{y}_i$ produced with probability $p\left( \underline{y}_i \right)$. \item Any realistic detector is imperfect. ``Image Processing'' is used to find tracks of charged particles. Magnets used to bend charged particles where size of bend $\Rightarrow$ sign of charge, magnitude of momentum. \end{itemize} \centerline{\psfig{figure=opt.fig3.ps,clip=true,width=4.0in}} \subsection{ Uncertainties in SSC Data Analysis} \begin{slide} \begin{itemize} \item Flaws in Image Processing cf.\ SDI problem of tracking a bunch of missiles ``Mathematically ill-posed problem'' use neural networks, etc. \item Geometric problems $\Rightarrow$ In some regions of momentum $\underline{p}$ particles are ``lost'' with some probability. \item Neutral particles only make tracks in dense material $\Rightarrow$ harder to measure and ``destructive'' \begin{description} \item{$\bullet$} So first measure charged particles \item{$\bullet$} then measure photons $\left( \pi^0 \right)$ \item{$\bullet$} then measure $\mu^{\pm}$ \item{$\bullet$} cannot measure neutrinos \end{description} \end{itemize} \subsection{ SSC Data Analysis Problem Statement} \begin{slide} \begin{itemize} \item What you see $=$ probability from Science convoluted (integrated) with probability from measurement probability from trigger. \item Probability from Science contains functional dependences which are understood together with those that are not. $\Rightarrow$ this is also a ``complex system''. \item The observed features $=$ \begin{tabular}{lcl} $P\left( \underline{y}_i \right) =$ & ${\displaystyle \int}$ & known dependencies \\ \\ &&convoluted with unknown \\ &&dependencies \\ \\ &&together with totally \\ &&unknown probabilities. \end{tabular} \item Can MODEL dependencies and probabilities (use suitably inspired $=$ theoretically motivated ``guesswork'') $\Rightarrow$ PARAMETERIZE ignorance or use approaches which do NOT involve parameterization. \end{itemize} \subsection{ Maximum Likelihood Principle---Example} \begin{slide} \noindent{\bf Rod Quivering in an Earthquake---I} \begin{itemize} \item $x_1$, $x_2 \ldots x_N$ are $N$ measurements of the length of a rod. \item Owing to a Gaussian earthquake (or a Gaussian i.e., normal drunkenness) occurring during measurement, the $x_i$ are scattered about the true value: \item You visit your favorite theorist (or do yourself after sobering up) who $\underline{\rm believes}$ that \item the probability distribution of $x_i$ is Gaussian with mean $\alpha_0 =$ true length and standard deviation $\sigma_0$. \item We may write down the (relative) probability of each measurement $x_i$ as: \[ P_i \left( x_i \right) = c\, \exp \left\lbrace - \frac{1}{2\sigma_0^2} \left( x_i - \alpha_0 \right)^2 \right\rbrace \] \end{itemize} \subsection{ Rod Quivering in an Earthquake--II} \begin{slide} \begin{itemize} \item The Likelihood function for whole experiment of $N$ events is defined as: \begin{eqnarray*} L \left( x_1 \ldots x_N \right) & = & \prod_{i=1}^N c\, \exp \left\lbrace - \frac{1}{2\sigma_0^2} \left( x_i - \alpha_0 \right)^2 \right\rbrace \\ & = & \tilde{c}\, \exp \left\lbrace - \frac{1}{2\sigma_0^2} \sum_i \left( x_i - \alpha_0 \right)^2 \right\rbrace \end{eqnarray*} \item This is---up to a constant---the probability of the $x_i$ $\underline{\rm given}$ the true length $\alpha_0$ and certain theoretical assumptions we represent as $T$. \[ L \left( x_1 \ldots x_N \right) = {\rm Pr} \left\lbrace x_1 \ldots x_n | \alpha_0\,, T \right\rbrace \] in conditional probability notation. \item $T$ is, for instance, the notion that distribution Gaussian and the value of $\sigma_0$ taken (the latter being calculable from the strength of earthquake $\underline{\rm or}$ amount of beer drunk). \end{itemize} \subsection{ Bayes Formula} \begin{slide} \begin{itemize} \item Use of Bayes Formula Now what we really want is the probability distribution of an estimate $\alpha$ of $\alpha_0$. We can formally get this by use of Bayes law: \item First extend $P\left(x_i \right) = \exp \left\lbrace - {1\over 2\sigma_0^2} \left( x_i - \alpha \right)^2 \right\rbrace$ to be defined for any value of $\alpha$: not necessarily $\alpha =$ true value $\alpha_0$. Introduce: \item $P\left( \alpha | x_i\,, T\right) = $ Probability true length is $\alpha$ $\underline{\rm given}$ measurements $x_i$ and predjudice $T$. Then Bayes law of conditional probability is $$ P \left( \alpha | x_i\,, T \right) = P \left( x_i | \alpha\,, T \right) {P(\alpha | T)\over P\left( x_i | T \right)} \eqno{(*)} $$ \item This is the essence of the maximum likelihood principle. $(*)$ converts the likelihood $L\left( x_1 \ldots x_n \right) = P\left( x_i | \alpha\,, T\right)$ which is a function probability of $x_i$ for fixed $\alpha\,, T$ into $P\left( \alpha | x_i\,, T\right)$ which is the desired probability distribution of $\alpha$ for given measurements $x_i$ and predjudice $T$. \end{itemize} \subsection{ Interpretation of Bayes Relation -- I} \begin{slide} \begin{itemize} \item Interpretation of equation $(*)$ on previous page \item $(\alpha)$\ We can ignore $P\left( x_i | T\right)$ because it is independent of $\alpha$ and fixed for our given $x_i$ and $T$. It has the practical point that Likelihood is never normalized sensibly. \item $(\beta)$\ $P(\alpha|T)$ philosophically represents a priori knowledge of $\alpha$ before experiment (measurement of $x_i$) performed. \item This is shaky---not because we don't really know how to define what $|T$ (``given $T$'') means---but, for instance, suppose we wish to parameterize ``complete ignorance'' of $\alpha$ before experiment performed. \end{itemize} \subsection{ Interpretation of Bayes Relation -- II} \begin{slide} \begin{itemize} \item Natural is $P(\alpha | T) =$ constant \item But suppose we put $\beta = \alpha^2$ and then $\beta$ has a priori probability $\varphi (\beta | T)$ \[ \varphi (\beta | T) d\beta = P(\alpha | T) d\alpha \] i.e., $\varphi (\beta | T ) = {1\over 2\sqrt{\beta}} \not=$ constant. \item So the concept of ``complete ignorance'' is not well defined. Jeffrey (a big bad English-Oxford-book) spends 100's of pages trying to avoid this difficulty. His efforts are so much garbage. \item Fortunately, there is no trouble for large $n$ because of both an obvious numerical reason (see next section (i) (d)) and a not so obvious mathematical reason. \end{itemize} \subsection{ Basic Likelihood Formula} \begin{slide} \begin{itemize} \item The Likelihood Maximizes Consider, again, the explicit form of the Likelihood of a rod quivering in the earthquake. \[ L \left( x_1 \ldots x_N \right) = c\, \exp \left\lbrace - \frac{1}{2\sigma^2} \sum_{i=1}^N \left( x_i - \alpha \right)^2 \right\rbrace \] \item According to Bayes Principle (*), we now regard $L$ as a function of $\alpha$ and so rewrite it: \begin{eqnarray*} L (\alpha ) & = & c\, \exp \left\lbrace - \frac{N}{2\sigma^2} \left( \alpha - \frac{1}{N} \sum_{i=1}^n x_i \right)^2 \right. \\ & - & \frac{1}{2\sigma^2} \left\lbrack \left. \sum_i\, x_i^2 - \frac{1}{N} \left( \sum_i\, x_i \right)^2 \right\rbrack \right\rbrace \end{eqnarray*} \end{itemize} \subsection{ Pictorial View of Maximum Likelihood} \begin{slide} \begin{itemize} \item The last term is a constant independent of $\alpha$, and so as a function of $\alpha$, $L$ is just a Gaussian which maximizes at $\alpha = \frac{1}{N} \sum\limits_{i=1}^N x_i$ and has width $\frac{\sigma}{\sqrt{N}}$. \vspace{8pt} \centerline{\psfig{figure=opt.fig4.ps,clip=true,width=4.25in}} \vspace{8pt} \item So as $L = {\rm Pr} \left( \alpha | x_i\,, T\right)$ we say that maximum likelihood estimate of $\alpha$ is $\alpha^* = \frac{1}{N} \sum\limits_{i=1}^N x_i$ and its standard deviation is $\sigma/\sqrt{N}$. \item Note that whatever $P(\alpha )$ you put in the shape of $LP(\alpha | T)$ is essentially independent of $P$ as $N\rightarrow\infty$ and we get a Gaussian peaked at $\alpha^*$ riding on top of constant value $P(\alpha^* | T)$. \end{itemize} \subsection{ General Maximum Likelihood Method} \begin{slide} \begin{itemize} \item The above is glib in many ways---especially in its treatment of errors. We will repair this later with real live theorems. First, we state the obvious generalization. \item The general maximum likelihood method is to put $P(\alpha | T)=1$ in Bayes Principle (*) and form $L(\alpha) = \prod_{i=1}^N P\left( x_i\,, \alpha \right)$ for any measurement $x_i$ whose probability $P\left( x_i\,, \alpha\right)$ depends on value $x_i$ and theoretical parameter(s) $\alpha$. \item The maximum likelihood estimate of $\alpha$ is $\alpha=\alpha^*$ where $L\left( \alpha^* \right)$ is a maximum. The error $\Delta\alpha$ is the standard deviation of the distribution of $L$ as a function of $\alpha$, i.e., \[ L(\alpha ) = \exp \left\lbrace {\rm const.} - \frac{1}{2\Delta\alpha^2} \left( \alpha - \alpha^* \right)^2 \right\rbrace \] \item Note one usually manipulates \[ - \ln L = - \sum_{i=1}^N \ln P \left( \underline{x}_i,\, \underline{\alpha} \right) \] \end{itemize} \subsection{ Lifetime Example for Maximum Likelihood---I} \begin{slide} \begin{itemize} \item Physics texts give the example of determining the lifetime $\tau_0$ of a particle from the observation of decay times $t_1 \ldots t_N$ of $N$ decays. \item The probability distribution is Poisson: \[ p(t) dt = \frac{1}{\tau} \exp \left( - \frac{t}{\tau} \right) dt \] \item Now $dt$ is meaningless---it is independent of $\tau$ and only affects normalization of likelihood. The unknown $P( \alpha | T )$, $P\left( x_i | T \right)$ terms in (B1) made this arbitrary anyway! \item Likelihood $L(\tau ) \propto \frac{1}{\tau} e^{-\frac{t_1}{\tau}} \ldots \frac{1}{\tau} e^{-\frac{t_N}{\tau}}$ or $L(\tau ) = \exp \left\lbrack -\frac{1}{\tau} \sum\limits_{i=1}^N t_i - N \ln \tau \right\rbrack$ where we lazily convert proportionality into an $=$ sign. Maximum is \begin{eqnarray*} {dL\over d\tau} & = & 0 \qquad \mbox{or} \\ \frac{1}{\tau^{*2}} \sum_{i=1}^N t_i - \frac{N}{\tau^*} & = & 0 \\ \underline{\rm or} \qquad \tau^* & = & \frac{1}{N} \sum_{i=1}^N t_i \end{eqnarray*} \end{itemize} \subsection{ Lifetime Example for Maximum Likelihood -- II} \begin{slide} \begin{itemize} \item i.e., the maximum likelihood estimate is just the mean: such a simple result is only true because of particular Poisson example. \item In this special case, we can illustrate that maximum likelihood method does give results which ``converge'' as $N\rightarrow\infty$. \item According to the central limit theorem, the variable $y = \frac{1}{N} \sum\limits_{i=1}^N t_i$ is a sum over independent random variables and has mean $\langle y \rangle = \tau_0$ and standard deviation $\sigma$ \begin{eqnarray*} \sigma^2 & = & \frac{1}{N} \left\lbrace \left\langle t_i^2 \right\rangle - \left\langle t_i \right\rangle^2 \right\rbrace \\ & = & \frac{1}{N} \left\lbrace \frac{1}{\tau_0} \int t^2 e^{- \frac{t}{\tau_0}} dt - \tau_0^2 \right\rbrace \\ \mbox{or}\ \ \sigma & = & \frac{\tau_0}{\sqrt{N}} \end{eqnarray*} \end{itemize} \subsection{ Multiplication of Experiments} \begin{slide} \begin{itemize} \item If we have two experiments that are independent but depend on same theoretical parameter $\alpha$, then \begin{eqnarray*} \bullet\ \ L_1 = & \prod^{N_1}_{i=1} p\left( x_i\,, \alpha \right) & \quad \mbox{is likelihood of} \\ && \quad \mbox{experiment \#1} \\ && \\ \bullet\ \ L_2 = & \prod^{N_2}_{i=1} q\left( y_i\,, \alpha \right) & \quad \mbox{is likelihood of} \\ && \quad \mbox{experiment \#2} \end{eqnarray*} \item and clearly the combined likelihood is \[ L = L_1\ L_2 \] \item i.e., one should multiply likelihoods when combining experiments. We will apply this in the next section ($\chi^2$) to case when each $L_i$ $\underline{\rm is}$ Gaussian in $\alpha$. \end{itemize} \subsection{ General Maximum Likelihood} \begin{slide} \begin{itemize} \item Note that one can prove that Maximum Likelihood is asymptotically the ``best algorithm,'' i.e., in limit $N\rightarrow\infty$ ($N$ total events, there is NO method which will give a smaller standard deviation). \item Also, Maximum Likelihood method will give one the ``right answer,'' i.e., Bias $=$ Correct Estimate---Maximum Likelihood Estimate $\rightarrow 0$ (like $1/N$) as $N\rightarrow\infty$. \end{itemize} \subsection{ $\chi^2$ Method} \begin{slide} \begin{itemize} \item This is standard and fully, correct (best) method for combining results from several independent experiments. It is in fact used (because it's easy conceptually and for computer) even when it's inapplicable (e.g., if there are correlations between supposed ``independent'' random variables or if the basic experiments have too few events to be Gaussian). \item The problem is:\ Given $N$ experimental measurements $x_i$ and errors $\sigma_i$ plus estimates $\xi_i$ of the $x_i$ which are functions of $N$ theoretical parameters. Then the probability of each observation is \[ P_i = \frac{1}{\sqrt{2\pi} \sigma_i} \exp \left\lbrace - \frac{\left( x_i - \xi_i \right)^2}{2\sigma_i^2} \right\rbrace \] \end{itemize} \subsection{ $\chi^2$ Method Derived from Likelihood} \begin{slide} \begin{itemize} \item The Likelihood \[ L = \prod_{i=1}^N P_i \propto \exp \left\lbrace - \sum_{i=1}^N \frac{\left( x_i - \xi_i \right)^2}{2\sigma_i^2} \right\rbrace \] \item so, the principle of maximizing the likelihood is equivalent to minimizing $\chi^2$ where \[ \chi^2 = \sum_{i=1}^N \frac{\left( x_i - \xi_i \right)^2}{\sigma_i^2} \] with respect to $N$ theoretical parameters in $\xi_i$. \item (i)\ Notice that $\chi^2$ is the sum $\sum\limits_{i=1}^N v_i^2$ where $v_i$ are independent Gaussian random variables of unit standard deviation and zero mean. \item We $\underline{\rm can}$ calculate $\chi^2$ distribution exactly, but in practice this is rarely necessary. Thus, we can easily show that---for fixed $\xi_i$---that \begin{eqnarray*} \left\langle \chi^2 \right\rangle & = \phantom{2}N & \\ \sigma_{\chi^2}^2\ \ \ & = 2N & (**) \end{eqnarray*} \item and this is sufficient for large $N$ as central limit theorem says $\chi^2$ will be Gaussian with above mean and standard deviation. \end{itemize} \subsection{ $\chi^2$ Standard Deviation} \begin{slide} \begin{description} \item{(ii)} Now $(**)$ is not quite correct because if we determine the theoretical parameters to minimize $\chi^2$ then implicitly $\xi_i$ are also random variables that are functions of $x_j$. \end{description} \begin{itemize} \item It is not too hard to show that in this case, one should just replace $N$ by $N-n$ (number of degrees of freedom) in $(**)$. \item In any case, $(**)$ is very important because it allows an easy criterion for the goodness of fit. \item Thus, if value of $\chi^2$ is many (say $>2$) standard deviations away from its mean, then the theoretical form is in doubt. \item The usual maximum likelihood method does not have this advantage: \item Remember, $L$ was unnormalized and so $\underline{\rm its}$ value had no significance for the goodness of fit. This is quite a difficult problem. \end{itemize} \subsection{ Scattering Experiment Example} \begin{slide} \noindent{\bf Examples:} \begin{description} \item{1)} A physics experiment measures a scattering cross-section $\sigma$ as a function of angle $\theta$. \vspace{10pt} \centerline{\psfig{figure=opt.fig5.ps,clip=true,width=3.5in}} In the case of decay (of nucleus), we know rigorously that angular distribution $\sigma (\theta )$ is a polynomial in $\cos\theta$ with highest degree $\le 2J$ (spin of nucleus) \end{description} \[ \sigma (\theta ) = a_0 + a_1 \cos \theta + a_2 \cos^2 \theta + a_3 \cos^3 \theta \] \subsection{ Individual Data Points} \centerline{\psfig{figure=opt.fig6.ps,clip=true,width=5.0in}} \begin{itemize} \item There are two typical forms of data---firstly, isolated points. \item The measurements $x_i$'s are cross sections and the theoretical predictions $\xi_i$ are linear in unknown parameters $a_j$. \[ \xi_i = \sum_{m=1}^n c_{im}\, a_m \] \end{itemize} \subsection{ Histogrammed Data} \centerline{\psfig{figure=opt.fig6b.ps,clip=true,width=4.0in}} \begin{itemize} \item Now some experiments measure as suggested in previous picture at a set of discrete angles. \item More common is to measure a more or less complete range of $\theta$ but bin these, so that as in picture, basic measurement is number of events $N_i$ in i'th ``bin.'' \item Here, we have nine bins in $0\le \theta \le 180$. \begin{eqnarray*} x_i & = & N_i \ \ \mbox{(up to constant)} \\ \mbox{error} & \mbox{in} & x_i \ \ \mbox{is}\ \ \sqrt{N_i} \end{eqnarray*} \end{itemize} \subsection{ Scattering Experiment---$\chi^2$ Minimization} \begin{slide} \begin{itemize} \item Under these circumstances, the $\chi^2$ method becomes: \begin{eqnarray*} \mbox{minimize:} & & \sum_{i=1}^N \frac{\left( x_i - \xi_i \right)^2}{\sigma_i^2} \qquad - \left( \chi^2 \right) \\ x_i & = & \ \mbox{const.}\ N_i \\ \sigma_i & = &\ \mbox{const.}\ \sqrt{N_i} \qquad \sigma_i = \frac{x_i}{\sqrt{N_i}} \end{eqnarray*} where $i$th bin has $N_i$ events. Note estimated error comes from measurement itself. \item Minimizing $(\chi^2)$ with respect to parameters $a_m$ gives \begin{eqnarray*} \sum_{i=1}^N \frac{\left( x_i - \xi_i \right)}{\sigma_i^2}\, \frac{\partial\xi_i}{\partial a_m} & = & 0\ \ \ \ \ \hbox{or} \\ \sum_l A_{ml}\ a^*_l & = & b_m \end{eqnarray*} where $^*$ denotes $\chi^2$ (or maximum likelihood) estimate. \end{itemize} \subsection{ $\chi^2$ Formulae} \begin{slide} \begin{itemize} \item The $\chi^2$ matrix \[ a_{ml} = \sum_i \frac{C_{im}\, C_{il}}{\sigma_i^2} \\ \] \item which is a Positive Symmetric Matrix \[ = \sum_i \frac{\left( \partial \xi_i / \partial a_m \right)\, \left( \partial \xi_i / \partial a_l \right)}{\sigma_i^2} \] \item The right hand side \[ b_m = \sum_i \frac{x_i C_{im}}{\sigma_i^2} = \sum_i \frac{x_i}{\sigma_i^2}\, \frac{\partial \xi_i}{\partial a_m} \] \item Therefore, $\underline{a}^* = A^{-1} \underline{b}$ is our best estimate for the parameters $\underline{a}$. \end{itemize} \subsection{ Errors in $\chi^2$} \begin{slide} \begin{itemize} \item We can find errors in $\underline{a}$ in either of two ways: \begin{description} \item{$\bullet$\ 1.} regard $\underline{a}^*$ as a function of $x_i$ and calculate standard deviation from central limit theorem. \item{$\bullet$\ 2.} shape of likelihood \[ L = \exp \left( - \frac{1}{2} \chi^2 \right) \] \end{description} \item We have $-$ as far as shape of $L$ is concerned. $L = \exp$ (Independent of $a$ plus no terms linear in $a-a^*$ due to ``maximum'' condition $-1/2 \left(\underline{a} - \underline{a}^* \right)^T A \left( \underline{a} - \underline{a}^* \right)$ ). \item i.e., this above formula implies that $A$ is error matrix for $a$. \end{itemize} \subsection{ Example II -- Regression Analysis} \begin{slide} \begin{itemize} \item In typical regression analysis the $\xi_i$ are also linear in the unknowns, e.g., \vspace{8pt} \centerline{\psfig{figure=opt.fig7.ps,clip=true,width=4.25in}} \vspace{8pt} \[ \mbox{Sales}(t) = \sum_{i=1}^m x_i P_i (t) \] \item where $x_i$ is unknown and $P_i$ is known expansion function. \vspace{8pt} \begin{tabular}{ll} e.g., & $P_1 (t) = 1$ \\ & $P_2 (t) = (t - 1992)$ \\ & $P_3 (t) = (t - 1992)^2$, etc. \\ \end{tabular} \vspace{8pt} \item $P_k (t)$ isn't linear in $t$, but Sales$(t)$ is linear in $P_k (t)$ as this is what counts! \end{itemize} \subsection{ Example III -- Nuclear Decay} \begin{slide} \begin{itemize} \item However, we consider nuclear decay \vspace{8pt} \centerline{\psfig{figure=opt.fig8.ps,clip=true,width=4.25in}} \vspace{8pt} \item $e_i = $ number of decays observed in time interval $t_0 + (i-1) \delta t$ to $t_0 + i\, \delta t$. \[ \xi_i = c\, \exp \left( - \lambda [ t_o + (i - 1/2 ) ] \delta t \right) \] is a function of two unknowns, $c$ and $\lambda$. \item $\xi$ is linear in $c$ but nonlinear in $\lambda$. \item Note we can analyze nuclear decays in two ways---maximum likelihood or $\chi^2$. Let us compare these. \end{itemize} \subsection{ Maximum Likelihood and $\chi^2$ for Experimental Data Analysis} \begin{description} \item{$\bullet$ a)} {\bf Maximum Likelihood} $i$ labels events \ \ \ \ \ $i=1 \ldots n_{ev}$ \begin{itemize} \item Minimize---$\ln L$ \[ L = \prod^{n_{ev}}_{i=1} \frac{1}{\tau} c \exp \left( - \lambda t_i \right) \] \end{itemize} \item{$\bullet$ b)} $\chi^2$ $i$ labels bins \ \ \ \ \ $\left( i=1 \ldots N_{\rm bin}\right)$ in time \begin{itemize} \item Minimize $\chi^2$ \[ \chi^2 = \sum^{N_{\rm bin}}_{i=1} {\left( N_i - \xi_i \right)^2 \over \sigma_i^2} \] \item where $\sigma_i \sim \sqrt{N_i}$ and $N_i$ is number of events in $i\,$th bin $\left( t_0 + (i-1) \delta t \le t \le t_0 \right)$. \item If $N_i$ large, $N_i$ will be Gaussian, whatever distribution of individual events. \end{itemize} \end{description} \subsection{ When Should You Use $\chi^2$ and When Maximum Likelihood?} \begin{slide} \begin{itemize} \item Which is best? Likelihood method is ``best,'' but $\chi^2$ method is computationally easier as $N_{\rm bin} << N_{\rm ev}$. \item Typically, as number of events in each bin must be large \[ N_{\rm bin} \sim \frac{N_{\rm ev}}{50} \,. \] \item $\chi^2$ method will give an estimate with a larger standard deviation, but the difference will not be large if $\lambda \delta t$ is small, so probabilities $\exp (-\lambda t)$ do not vary across bin. So, can choose correctly \begin{description} \item{$\bullet$} Small total number of events---no good method. \item{$\bullet$} Large number of events--Maximum Likelihood works and is best. \item{$\bullet$} Very large number of events---Maximum Likelihood and $\chi^2$ works, but $\chi^2$ will give a good answer with less computational complexity. \end{description} \end{itemize} \subsection{ Least Squares} \begin{slide} \begin{itemize} \item Generally, we analyze \begin{eqnarray*} \chi^2 & = & \sum_{i=1}^N \left( \frac{e_i - \xi_i}{\sigma_i} \right)^2 \\ & = & \sum_{i=1}^N \hbox{res}_{\,i}^{\,2} \ \ \ \ \ \ \leftarrow\ \mbox{scaled residual} \end{eqnarray*} which shows why $\chi^2$ is also called least squares fit. As in above examples, expand $\xi_i$ about $\underline{x}_0$ \begin{eqnarray*} \xi \left( \underline{x} \right) = \xi \left( \underline{x}_0 \right) & + & \left( \underline{x} - \underline{x}_0 \right)^T \frac{\partial\xi}{\partial\underline{x}} \\ & + & \mbox{Higher order in} \left( \underline{x} - \underline{x}_0 \right) \end{eqnarray*} \item In this approximation, we now get ``standard linear formulae.'' \begin{eqnarray*} \chi^2 \left( \underline{x} \right) = \chi^2 \left( \underline{x}_0 \right) & + & \frac{1}{2} \left( \underline{x} - \underline{x}_0 \right)^T A \left( \underline{x} - \underline{x}_0 \right) \\ & - & \left( \underline{x} - \underline{x}_0 \right)^T \underline{b} + \mbox{corrections} \end{eqnarray*} \item Minimize $\chi^2$ ignoring corrections gives \[ \underline{\underline{A}} \left( \underline{x} - \underline{x}_0 \right) = \underline{b} \] which is same equation we found for iterative solution of P.D.E. {\bf BUT} only exact when $\xi_i$ LINEAR in unknown parameters $\underline{x}$. \end{itemize} \subsection{ Nonlinear $\chi^2$ Minimization} \begin{slide} \begin{itemize} \item $\underline{\underline{A}}$ is symmetric positive definitive \begin{eqnarray*} \xi_i & = & \xi_i \left( \underline{x}_0 \right) + \sum_{d=1}^m t_{id} \left( x - x_0 \right)_d \\ \mbox{where}\ \ \ t_{id} & = & \frac{\partial\xi_i}{\partial x_d} \\ A_{dd'} & = & \sum_i t_{id}\, t_{id'} \end{eqnarray*} \item Basic idea: \begin{description} \item{$\bullet$} Think of initial $\underline{x}_0$ and establish recursive method \[ \underline{\underline{A}} \left( \underline{x}_{k+1} - \underline{x}_k \right) = \underline{b} \] \item{$\bullet$} Calculate $\underline{\underline{A}}$ and $\underline{b}$ using $\underline{x} = \underline{x}_k$. \[ \underline{x}_{k+1} = \underline{x}_k + A^{-1} b \hspace{4em} \mbox{(NL.1)} \] \end{description} \item As $A$ positive definite, inverse always exists. \end{itemize} \subsection{ Validity of Linearization When Does It Work?} \begin{slide} \begin{itemize} \item Why doesn't this work? \vspace{8pt} \centerline{\psfig{figure=opt.fig9.ps,clip=true,width=4.5in}} \vspace{8pt} \item $\lambda_1 > \ldots > \lambda_n$. \item My experience at $n \sim 30$, $\lambda_n \sim 10^{-8} \lambda_1$. But we made approximation and so $\lambda_n$ term swamped by corrections neglected in expanding $\xi_i$ linearly. Equation (NL.2) gives big changes \[ \delta \widetilde{x}_m = \lambda_m^{-1} b_m \hspace{5em} \mbox{(NL.2)} \] Better is to use \[ \delta \widetilde{x}_m = 0 \hspace{7em} \mbox{(NL.3)} \] And, use of (NL.2) usually gives $\chi^2 \left( \underline{x}_{k+1} \right) >> \chi^2 \left( \underline{k}_n \right)$. Not $\chi^2 \left( x_{k+1} \right) < \chi^2 \left( \underline{x}_k \right)$ \end{itemize} \subsection{ Nonlinear Minimization That Works!} \begin{slide} \begin{itemize} \item Some solutions to this problem are \item 1)\ {\bf Conjugate Gradient}, but note matrix no longer sparse and so iterative approach not so natural. \item 2)\ {\bf Marquardt's method} changes $A$ in ad-hoc fashion, solving (NL.1) with the replaced value \[ \underline{\underline{A}} \rightarrow \underline{\underline{A}} + QI \] where $Q$ is Marquardt's Parameter \[ \lambda_i \rightarrow \lambda_i + Q \] \item Suppose, $Q \sim \sqrt{\lambda_1 \lambda_n}$. Then shifts $\delta \tilde{x}_d$ are unchanged for large $\lambda_d$, small for small $\lambda_d$. \item Increase $Q$ if $\chi^2$ increases when you use (NL.1). \item Decrease $Q$ if $\chi^2$ decreases when you use (NL.1). \end{itemize} \subsection{ Linear Programming} \begin{slide} \begin{itemize} \item Minimize $z = \sum_{j=1}^r c_j x_j$ where $x_j \ge 0$ and there are Linear Constraints \[ \sum_{j=1}^r a_{ij}\, x_j \qquad \le , \ge , \hbox{or} = b_i \ \ \ \ \ i=1\ldots m \] \item Introduce new variables to convert {\bf all} constraints to equalities. \item Suppose $i=1$ is $\ge$. Write $i=1$ constraint as \[ x_{r+1} = \sum_{j=1}^r a_{ij} x_j - b_1 \ \ \ \ \ \ \ge 0 \] \begin{eqnarray*} \sum_{j=1}^r a_{ij} x_j - x_{r+1} & = & b_1 \\ a_{1r+1} & = & -1 \\ c_{r+1} & = & 0 \end{eqnarray*} \item So linear programming becomes: \begin{description} \item{$\bullet$} Minimize $z = \underline{c} \underline{x}$ \item{$\bullet$} Subject to $\underline{\underline{A}} \underline{x} = \underline{b}$ as well as that all components of $\underline{x}$ are positive. \end{description} \end{itemize} \subsection{ Convex Regions and Linear Programming} \begin{slide} \centerline{\psfig{figure=opt.fig10.ps,clip=true,width=4.25in}} \vspace{8pt} \begin{itemize} \item Set of $\underline{x}$ satisfying constraints are convex region. \item Convex says that if $\underline{x}_1$ and $\underline{x}_2$ in a region, $\lambda \underline{x}_1 + (1-\lambda ) \underline{x}_2$ in region. \item Extreme point of convex region is one where $$ x_{\rm extreme} \ne \lambda \underline{x}_1 + (1-\lambda ) \underline{x}_2 $$ where $\lambda \ne 0, 1$. \item One can show that the optimal solution (if it exists) is an extreme point of the convex region. \end{itemize} \subsection{ Matrix Formulation of Linear Programming} \begin{slide} \begin{itemize} \item Consider $\underline{\underline{A}} \underline{x} = \underline{b}$. $\underline{\underline{A}}$ is $m\times r$ matrix. Usually $r>> m$. \item Can show that extreme points satisfy $m$\phantom{$n-$} nonzero elements in $\underline{x}$ $n-m$ zero elements in $\underline{x}$ \item Choose columns so non-zero elements are first $m$ entries in $\underline{x}$ \[ \underline{\underline{A}}^{\rm reduced} \underline{x} = \underline{b} \] where $\underline{\underline{A}}^{\rm reduced}$ is $m\times m$. \item Simplex method is a strategy for moving from one solution $\underline{x}^{\rm extreme}$ to another $\underline{x}^{\rm extreme}$ by changing one column of $\underline{\underline{A}}^{\rm reduced}$ to another that is not being used. \item There is another class of methods---associated with Karmarkar at AT and T which are so called interior point methods. That is, you do not go from vertex to vertex of simplex but rather wander around inside simplex. Both simplex and interior point methods are important. \end{itemize} \subsection{ Computational Issues in $\chi^2$ and Maximum Likelihood} \begin{slide} \begin{itemize} \item Take Maximum Likelihood for definiteness \begin{description} \item{$\bullet$\ 1.} Complex nonlinear (in dependence of parameters) function to maximize \item{$\bullet$\ 2.} Each function value \[ -\ln L = - \sum_{i=1}^N \ln p \left( \underline{x}_i , \underline{\alpha} \right) \] (where $\underline{x}_i$ determines events and $\underline{\alpha}$ are unknown parameters) is sum of $N$ independent calculations which is natural form of parallelism \item{$\bullet$\ 3.} Parallelism in optimization (in space of $\underline{\alpha}$) depends on particular minimization approach adopted. Also depends on complexity of calculation \[ \ln p \left( \underline{x}_i , \underline{\alpha} \right) \] \[ \frac{\partial}{\partial\underline{\alpha}} \ln p \left( \underline{x}_i , \underline{\alpha} \right) \] versus complexity of manipulation of $-Ln L \left( \underline{\alpha} \right)$ as a function of $\underline{\alpha}$ \item{$\bullet$\ 4.} $\chi^2$ and Likelihood are similar in aspects of analysis. \end{description} \end{itemize} \subsection{ Matrices in $\chi^2$ Maximum Likelihood} \begin{slide} \begin{itemize} \item The arithmetic in ``$\underline{\alpha}$ world'' is that of full matrices which are $n\times n$ where $n$ is number of parameters. \item In ``linear'' case ($\xi_i$ linear in components of $\underline{\alpha}$, $\chi^2$ quadratic in components), minimization requires single solution of equations \[ \underline{\underline{M}} \underline{\alpha} = \underline{b} \hspace{7em} (**) \] \item In ``nonlinear'' case ($\xi_i$ nonlinear in $\alpha$), one must iteratively solve $(**)$. \item Note that one expands $\xi_i$ up to linear term which gives a quadratic approximations to $\chi^2$. \item Thus, need to calculate \[ \mbox{``value,''}\ \ \frac{\partial}{\partial\alpha_k}\ \ \mbox{``value''} \ \ \ \ \ \mbox{NOT} \ \ \frac{\partial^2}{\partial\alpha_k\,\partial\alpha_j}\ \ \mbox{``value''} \] \item where ``value'' \begin{eqnarray*} & = & \ln p \left( \underline{x}_i \right) \mbox{for Likelihood method} \\ & = & \xi_i \left( \underline{x}_i \right) \mbox{for}\ \chi^2 \mbox{method} \\ \end{eqnarray*} \end{itemize} \subsection{ Basic Software Structure for Derivative Calculation} \begin{slide} Often, values are formed inside loops. Inside loop over events \vspace{8pt} \centerline{\psfig{figure=opt.fig10b.ps,clip=true,width=4.5in}} \vspace{8pt} This code can get rather tedious. {\tt ADIFOR\/} formally differentiates ``existing'' FORTRAN code and uses ``chain rule'' to generate derivative calculations. \subsection{ Computational Complexity} \begin{slide} \begin{itemize} \item Maximum Likelihood VERY VERY expensive as \[ \sum_{i=1}^{\rm \#\ events} \] \item Remember that the number of events could be $10^7$ at ``SSC''. $\chi^2$ is cheaper as \[ \sum_{i=1}^{\rm \#\ experiments\ or\ bins} \] \item Usually, number of experiments (bins) measured in $10\rightarrow 10^3$ range. \end{itemize} \subsection{ Histograms} \begin{slide} \begin{itemize} \item High energy physicists like to find new particles. \item For example, Nobel prize was given for the discover of $\psi$ particle which is formed from ``charm'' and ``anticharm'' quarks. \item Mass of $\varphi = 3.1$ GeV = $\left( p_1 + p_2 \right)^2$ where $\psi$ decays into particle~1 plus particle~2 where particle $i$ has momentum $\underline{p}_i$. \item We measure $p_1$ and $p_2$ as well as lots of other things and plot \vspace{8pt} \centerline{\psfig{figure=opt.fig11.ps,clip=true,width=4.5in}} \vspace{8pt} \item There are an accumulation of events with $\left( p_1 + p_2 \right)^2$ near 3.1~GeV. \end{itemize} \subsection{ Histogram Formalism I} \begin{slide} \begin{itemize} \item We can see with a histogram. Let \[ M_0^2 \le \left( p_1 + p_2 \right)^2 \le M_1^2 \] be a region of interest. \item We have $B$ Bins where each has width $\Delta M^2$ \[ \Delta M^2 = \frac{\left( M_1^2 - M_0^2 \right)}{B} \] \item Let $l\,$'th bin lie between $M_0^2 + \left( l-1 \right) \Delta M^2$ and $M_0^2 + l\Delta M^2$. $l=1\ldots B$. \item Define Random Variable \begin{eqnarray*} \eta_l\ \hbox{(event)} & = & 1 \ \ \mbox{event\ lies\ in}\ l\,\mbox{th\ bin} \\ & = & 0 \ \ \mbox{event\ lies outside}\ l\,\mbox{th\ bin} \end{eqnarray*} \item Let $\langle \eta_l \rangle = \mu_l = $ probability that an event will fall in bin. \end{itemize} \subsection{ Histogram Formalism II} \begin{slide} \begin{itemize} \item The means and standard deviations \end{itemize} \begin{eqnarray*} \langle \eta_l \rangle & = & \left( 1 - \mu_l \right) \cdot {\rm O} + \mu_l \cdot 1 = \mu_l \\ \langle \eta_l^2 \rangle & = & \left( 1 - \mu_l \right) \cdot {\rm O}^2 + \mu_l \cdot 1^2 = \mu_l \\ \langle \eta_l^2 \rangle - \langle \eta_l \rangle^2 & = & \mu_l - \mu_l^2 \hspace{8em} \mbox{(Fulle)} \\ & = & \mu_l \ \ \ \mbox{if}\ \ \ \mu_l << 1 \hspace{3em} \mbox{(Approxe)} \end{eqnarray*} \begin{itemize} \item Standard deviation \[ \sigma_l = \sqrt{\mu_l} = \sqrt{{\rm mean}_{\,l}} \] \item Define \begin{eqnarray*} N_l & = & \sum_{i=1}^{\rm \#\ events} \eta_l \ \ \mbox{(event}\ i) \\ & = & \mbox{\#\ events in}\ l \mbox{th\ bin} \end{eqnarray*} \item The central limit theorem applies to $N_l$ or to \[ f_l = \frac{N_l}{N_{\rm total}} \leftarrow \mbox{Total \# events} \] \item This gives $\langle f_l \rangle = \mu_l$ and $f_l$ is Gaussianly distributed with error \[ \langle \sigma_{\eta_l} \rangle = \frac{\langle \sigma_{\eta l} \rangle}{\sqrt{N_{\rm total}}} \] \end{itemize} \subsection{ Errors of Histograms -- I} \begin{slide} \begin{itemize} \item Note this implies that $N_l$: \begin{description} \item{a)} is GAUSSIAN \item{b)} has Standard deviation \end{description} \[ = \sqrt{N_{\rm total}} \langle \sigma_{\eta_l} \rangle\ =\ \sqrt{N_{\rm total}\ \mbox{mean}\ ( \eta_l )}\ =\ \sqrt{\langle N_l \rangle} \] \item This is a very very important rule in all SAMPLING EXPERIMENTS. e.g., $l=1$ \ \ \ \ vote for Clinton $l=2$ \ \ \ \ vote for Perot, etc. \end{itemize} \subsection{ Errors of Histograms -- II} \begin{slide} \begin{itemize} \item In a poll of 1,000 people, one finds that 25\% are in $l$th bin (i.e., will vote for $xyz$). \begin{eqnarray*} N_l & = & 250 \\ \sqrt{N_l} & \approx & 16 \end{eqnarray*} \item Thus, error in estimate of fraction voting for $xyz$ is $\underline{16\%}$---this is STATISTICAL error only. \item The systematic bias (is ``1,000'' chosen people representative) is a harder problem! \item As $0.25 = \mu_l$ not that small, should reduces statistical error by \[ \sqrt{\frac{0.25 - 0.25^2}{0.25}} = \sqrt{\frac{3}{4}} = 0.86 \] using equations (fulle), not (approxe). \end{itemize} \subsection{ Histograms as Robust Data Exploration} \begin{slide} \begin{itemize} \item Histograms are unbiased ways of analyzing data. One does NOT need to know dependence of data on $\left( p_1 + p_2 \right)^2$. \item General problem is data exploration. We have a set of events in a space of many dimensions \item High Energy Physics has momenta \hspace{8em} $p_1$\ \ $p_2$\ \ $p_3$\ \ $p_4$\ \ $p_5$\ \ $\ldots$ and Particle Type \hspace{2em} 1\ \ \,2\ \ \ 3\ \ \ 4\ \ \ 5\ \ \ \ etc. \vspace{8pt} \begin{center} {\bf Political Polls} \end{center} \item (Unknown) parameters affecting a person's choice for president. \end{itemize} \subsection{ General Histogram} \begin{slide} \begin{enumerate} \item $\underline{\rm Select}$ subspace e.g., all events with $> 30$ particles or all events with a lepton. e.g., all people in phone book whose surname starts with Kr. \item Choose some $d$ functions $f$ (event). \begin{eqnarray*} i & = & 1 \ldots d \\ d & = & 1 \\ f_1 & = & \left( p_1 + p_2 \right)^2 \\ \mbox{or}\ \ \ f_1 & = & \mbox{choice\ for\ President} \end{eqnarray*} \item Set up set of bins---conventionally uniform in possible values for each $f_i$. \end{enumerate} \subsection{ Scatterplots} \begin{slide} \begin{itemize} \item Histogram can be used if $d=1$. \item If $d=2$, we find a scatterplot, such as \vspace{8pt} \centerline{\psfig{figure=opt.fig12.ps,clip=true,width=4.5in}} \item where we set up bins in two-dimensional space. \item e.g., $f_1 =$ Religion and $f_2 =$ \# of $z$'s in name would be correlated (Catholic Church strong in Poland?) \item e.g., $f_2 = $ Maximum transverse momentum and $f_1 = \left( p_1 + p_2 \right)^2$ would be correlated. \item as $\psi$'s tend to be produced in high transverse momentum (violent) collisions. \item $d \ge 3$. The future with virtual reality exploration. \end{itemize} \subsection{ Possible Projects I---Maximum Likelihood} \begin{slide} \begin{description} \item{I)} Produce a general $\chi^2$ or maximum likelihood data analysis system on a parallel machine in either explicit message passing or HPF. Apply to a simple example. You can use Marquardt's method for minimization (see gcf records (my program ``Manxcat'' performed successfully $10^6 \chi^2$ fits), Numerical Recipes). \item{$\bullet$} Basic Structure: \vspace{8pt} \centerline{\psfig{figure=opt.fig13.ps,clip=true,width=3.75in}} \end{description} \subsection{ Possible Projects II} \begin{slide} Separately or as follow-on to I. \begin{description} \item{II)} Perform fit to specific dataset. Simplest: Generate ``fake'' data, e.g., \[ f(x,y) = A \exp \left\lbrack - \frac{(x-\alpha )^2}{2 \gamma^2} - \frac{(y-\beta )^2}{2 \delta^2} \right\rbrack \] \item{$\bullet$} Fixed Parameters are $A$, $\alpha$, $\beta$, $\gamma$, $\delta$ \item{$\bullet$} events---choose $x_i$, $y_i$ randomly. \[ P \left( \underline{x} \right)\ \propto\ F(x, y) \] \item{$\bullet$} $\chi^2$ --- divide two-dimensional plane into subsquares with (see figure) $N_x N_y$ bins. Label bins with $i$. \vspace{8pt} \centerline{\psfig{figure=opt.fig14.ps,clip=true,width=4.0in}} \end{description} \subsection{ Possible Projects II (cont.)} \begin{slide} \begin{itemize} \item Number of events in each bin is \begin{eqnarray*} N_{\rm expected} & \propto & {\displaystyle \int_{\rm bin}} f(x, y) dx dy \\ & \sim & f \left( x_{\rm central} , y_{\rm central} \right) dx dy \\ \\ \mbox{with error} & & \sqrt{N_{\rm expected}} \end{eqnarray*} \item Assume Total number of events large. \item Generate ``fake'' experimental measurement as what you get from Gaussian distribution with \hspace{2em} mean\phantom{r}\ \ $N_{\rm expected}$ \hspace{2em} error\ \ $\sqrt{N_{\rm expected}}$ using chosen values $A$, $\alpha$, $\beta$, $\gamma$, $\delta$. \item Fit to general form and see that you find ``chosen'' values for these five parameters. \end{itemize} \subsection{ More Possible Maximum Likelihood Projects} \begin{slide} \begin{description} \item{III)} Find an interesting data set --- High Energy Physics --- Stock Option Pricing tapes from NPAC Simplest: each event/$\chi^2$ bin on a separate node of a MIMD parallel machine. Complex: allow parallelism in calculation of USER1, as in parallel stock pricing model \item{IV)} Compare $\chi^2$ with maximum likelihood Performance vs. Accuracy Tradeoff \item{V)} Integrate into AVS \item{VI)} Evaluate High Performance Fortran \end{description} \subsection{ Projects---Data Exploration} \begin{slide} \begin{description} \item{I)} Produce a generic histogram/scatterplot/other ``robust'' data analysis methods package on a parallel machine. \item{$\bullet$} This should be structured \vspace{8pt} \centerline{\psfig{figure=opt.fig15.ps,clip=true,width=2.75in}} \vspace{8pt} \item{$\bullet$} A Parallel Histogram is a simple database problem. Replicate in each node and combine at end. \item{$\bullet$} Scatterplots may be too big? \item{$\bullet$} Implement parallelism for histograms and parallelism for event generation. \end{description} \subsection{ More Data Exploration Projects} \begin{slide} \begin{description} \item{II)} Test approach with some examples \begin{description} \item{$\bullet$} ``Fake data'' as with $f(x, y)$ before \item{$\bullet$} Real data as before \item{$\bullet$} Theoretical calculation as in ``Quantum Monte Carlo'' (Cornell expertise) in Chemistry and Physics. \end{description} \item{III)} Evaluate High Performance Fortran \item{IV)} Integrate into AVS \item{V)} Integrate with visual data exploration---user is walking through a multi-dimensional space looking for expected or unexpected structure. \begin{description} \item{$\bullet$} Histograms/Scatterplots are one or two dimensions. There is a natural 3D extension which deserves consideration. \item{$\bullet$} What world to be explored has much higher dimensionality? \end{description} \item{VI)} Use Mathematica as front end to define selection of subspaces. \end{description}