Scoring Rules, Generalized Entropy, and Utility Maximization

Robert Nau
Fuqua School of Business, Duke University
(with Victor Jose and Robert Winkler)

Presentation for IEOR Seminar, Berkeley, October 29, 2006

Overview

Scoring rules are reward functions for defining subjective probabilities and eliciting them in forecasting applications and experimental economics (de Finetti, Brier, Savage, Selten...)

Cross-entropy, or divergence, is a physical measure of information gain in communication theory and machine learning (Shannon, Kullback-Leibler...)

Utility maximization is the decision maker's objective in Bayesian decision theory and game theory (von Neumann & Morgenstern, Savage...)

General connections

Any decision problem under uncertainty may be used to define a scoring rule or measure of divergence between probability distributions. The expected score or divergence is merely the expected-utility gain that results from solving the problem using the decision maker's true probability distribution p rather than some other baseline distribution q.

Specific results

We explore the connections among the best-known parametric families of generalized scoring rules, divergence measures, and utility functions. The expected scores obtained by truthful probability assessors turn out to correspond exactly to well-known generalized divergences. They also correspond exactly to expected-utility gains in financial investment problems with utility functions drawn from the linear-risk-tolerance (a.k.a. HARA) family. These results generalize to incomplete markets via a primal-dual pair of convex programs.

Part 1: Scoring rules

Consider a probability forecast for a discrete event with n possible outcomes (states of the world).

Let e_i = (0, ..., 1, ..., 0) denote the indicator vector for the ith state (where 1 appears in the ith position).
Let p = (p_1, ..., p_n) denote the forecaster's true subjective probability distribution over states.
Let r = (r_1, ..., r_n) denote the forecaster's reported distribution (if different from p).
Let q = (q_1, ..., q_n) denote a baseline (prior) distribution upon which the forecaster seeks to improve.

Definition of a scoring rule

A scoring rule is a function S(r, e_i, q) that determines the forecaster's score (reward) for giving the forecast r, relative to the baseline q, when the ith state is subsequently observed to occur. Let

$$S(\mathbf{r}, \mathbf{p}, \mathbf{q}) = \sum_{i=1}^{n} p_i \, S(\mathbf{r}, \mathbf{e}_i, \mathbf{q})$$

denote the forecaster's expected score for reporting r when her true distribution is p and the baseline distribution is q. Thus, in general, a scoring rule can be expressed as a function of three vector-valued arguments, which is linear in the 2nd argument.

Proper scoring rules

The scoring rule S is [strictly] proper if S(p, p, q) ≥ [>] S(r, p, q) for all r [≠ p], i.e., if the forecaster's expected score is [uniquely] maximized when she reports her true probabilities. Henceforth let S(p, q) ≡ S(p, p, q) denote the forecaster's expected score for a truthful forecast, as a function of p and q. S is [strictly] proper iff S(p, q) is [strictly] convex in p.
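The properness condition can be checked numerically. The following is a minimal sketch, assuming the textbook forms of the quadratic, spherical, and logarithmic rules with the uniform baseline suppressed; the function names and the test distributions are illustrative choices, not taken from the presentation.

```python
import math

def expected_score(rule, r, p):
    """Expected score of report r under true distribution p: sum_i p_i * S(r, e_i)."""
    return sum(p_i * rule(r, i) for i, p_i in enumerate(p))

def quadratic(r, i):
    # S(r, e_i) = -||e_i - r||_2^2 (assumed normalization)
    return -sum(((1.0 if j == i else 0.0) - r_j) ** 2 for j, r_j in enumerate(r))

def spherical(r, i):
    # S(r, e_i) = r_i / ||r||_2
    return r[i] / math.sqrt(sum(r_j ** 2 for r_j in r))

def logarithmic(r, i):
    # S(r, e_i) = ln(r_i)
    return math.log(r[i])

# True distribution and a few untruthful reports (illustrative values).
p = [0.05, 0.25, 0.70]
alternatives = [[1 / 3, 1 / 3, 1 / 3], [0.10, 0.30, 0.60], [0.05, 0.20, 0.75]]

for rule in (quadratic, spherical, logarithmic):
    truthful = expected_score(rule, p, p)
    # Strict properness: every report r != p earns a strictly lower expected score.
    assert all(truthful > expected_score(rule, r, p) for r in alternatives)
```

A finite set of alternative reports cannot prove properness; it only illustrates that truthful reporting wins against each report tried.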

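If S is strictly proper, then it is uniquely determined from its truthful expected-score function S(p, q) = S(p, p, q) by McCarthy's (1956) formula:

$$S(\mathbf{p}, \mathbf{e}_i, \mathbf{q}) = S(\mathbf{p}, \mathbf{q}) + \nabla_{\mathbf{p}} S(\mathbf{p}, \mathbf{q}) \cdot (\mathbf{e}_i - \mathbf{p})$$

Thus, a strictly proper scoring rule is completely characterized by its expected-score function. Henceforth only strictly proper scoring rules will be considered, and it will be assumed that r = p.

Standard scoring rules

The three most commonly used scoring rules all assume a uniform baseline distribution (q_i = 1/n), which will be temporarily suppressed.

Quadratic scoring rule:

$$S(\mathbf{p}, \mathbf{e}_i) = -\lVert \mathbf{e}_i - \mathbf{p} \rVert_2^2 = 2p_i - \sum_{j} p_j^2 - 1$$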

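Spherical scoring rule:

$$S(\mathbf{p}, \mathbf{e}_i) = \frac{p_i}{\lVert \mathbf{p} \rVert_2}$$

Logarithmic scoring rule:

$$S(\mathbf{p}, \mathbf{e}_i) = \ln p_i$$

History of standard scoring rules

The quadratic scoring rule was introduced by de Finetti (1937, 1974) to define subjective probability; later used by Brier (1950) as a tool for evaluating and paying weather forecasters; more recently advocated by Selten (1998) for paying subjects in economic experiments. The spherical and logarithmic rules were introduced by I.J. Good (1971), who also noted that the spherical and quadratic rules could be generalized to positive exponents other than 2, leading to...

Generalized scoring rules (uniform q)

Power scoring rule (→ quadratic at β = 2):

$$S^{P}_{\beta}(\mathbf{p}, \mathbf{e}_i) = \frac{p_i^{\beta-1} - 1}{\beta - 1} - \frac{\sum_j p_j^{\beta} - 1}{\beta}$$

Pseudospherical scoring rule (→ spherical at β = 2):

$$S^{S}_{\beta}(\mathbf{p}, \mathbf{e}_i) = \frac{1}{\beta - 1}\left[\left(\frac{p_i}{\lVert \mathbf{p} \rVert_{\beta}}\right)^{\beta-1} - 1\right], \qquad \lVert \mathbf{p} \rVert_{\beta} = \Big(\sum_j p_j^{\beta}\Big)^{1/\beta}$$

Both rules → rescaled logarithmic rule at β = 1.

Weighted scoring rules (arbitrary q)

Our first contribution is to merely point out that the power and pseudospherical rules can be weighted by an arbitrary baseline distribution q and scaled so as to be valid for all real β.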

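Under the weighted rules, the score is zero in all states iff p = q, and the expected score is positive iff p ≠ q. Thus, the weighted rules measure the value added of p over q as seen from the forecaster's perspective.

Weighted power scoring rule:

$$S^{P}_{\beta}(\mathbf{p}, \mathbf{e}_i, \mathbf{q}) = \frac{(p_i/q_i)^{\beta-1} - 1}{\beta - 1} - \frac{E_{\mathbf{p}}[(\mathbf{p}/\mathbf{q})^{\beta-1}] - 1}{\beta}$$

Weighted pseudospherical scoring rule:

$$S^{S}_{\beta}(\mathbf{p}, \mathbf{e}_i, \mathbf{q}) = \frac{1}{\beta - 1}\left[\frac{(p_i/q_i)^{\beta-1}}{\big(E_{\mathbf{p}}[(\mathbf{p}/\mathbf{q})^{\beta-1}]\big)^{(\beta-1)/\beta}} - 1\right]$$

where $E_{\mathbf{p}}[(\mathbf{p}/\mathbf{q})^{\beta-1}] = \sum_j p_j (p_j/q_j)^{\beta-1}$.

Properties of weighted scoring rules

Both rules are strictly proper for all real β. Both rules → weighted logarithmic rule ln(p_i/q_i) at β = 1.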
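The weighted rules and their stated properties can be sketched in a few lines. This is an illustrative implementation, assuming the weighted power and pseudospherical forms written in terms of the moment E_p[(p/q)^(β−1)]; the function names are hypothetical, and the cases β = 0 and β = 1 are left to limits rather than handled explicitly.

```python
import math

def moment(p, q, beta):
    """E_p[(p/q)^(beta-1)] = sum_i p_i * (p_i/q_i)^(beta-1)."""
    return sum(pi * (pi / qi) ** (beta - 1) for pi, qi in zip(p, q))

def weighted_power_score(p, q, beta, i):
    """Weighted power score in state i (assumed form; beta not in {0, 1})."""
    m = moment(p, q, beta)
    return ((p[i] / q[i]) ** (beta - 1) - 1) / (beta - 1) - (m - 1) / beta

def weighted_pseudospherical_score(p, q, beta, i):
    """Weighted pseudospherical score in state i (assumed form; beta not in {0, 1})."""
    m = moment(p, q, beta)
    return ((p[i] / q[i]) ** (beta - 1) / m ** ((beta - 1) / beta) - 1) / (beta - 1)

p = [0.05, 0.25, 0.70]
q = [1 / 3, 1 / 3, 1 / 3]

# Scores vanish in every state when the forecast equals the baseline.
zero_scores = [weighted_power_score(q, q, 2.0, i) for i in range(3)]

# Near beta = 1 both rules approach the weighted logarithmic score ln(p_i/q_i).
near_one = 1.0 + 1e-6
log_scores = [math.log(pi / qi) for pi, qi in zip(p, q)]
power_near_one = [weighted_power_score(p, q, near_one, i) for i in range(3)]
pseudo_near_one = [weighted_pseudospherical_score(p, q, near_one, i) for i in range(3)]
```

Evaluating at β slightly above 1 stands in for the limit; a production version would branch to the logarithmic rule at β = 1 instead.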

For the same p, q, and β, the vector of weighted power scores is an affine transformation of the vector of weighted pseudospherical scores, since both are affine functions of (p_i/q_i)^(β−1). However, the two rules present different incentives for information-gathering and honest reporting. The special cases β = 0 and β = ½ have interesting properties but have not been previously studied.

Special cases of weighted scores

Weighted expected score functions

Weighted power expected score:

$$S^{P}_{\beta}(\mathbf{p}, \mathbf{q}) = \frac{E_{\mathbf{p}}[(\mathbf{p}/\mathbf{q})^{\beta-1}] - 1}{\beta(\beta - 1)}, \qquad E_{\mathbf{p}}[(\mathbf{p}/\mathbf{q})^{\beta-1}] = \sum_j p_j (p_j/q_j)^{\beta-1}$$

Weighted pseudospherical expected score:

$$S^{S}_{\beta}(\mathbf{p}, \mathbf{q}) = \frac{1}{\beta - 1}\left[\Big(\sum_j p_j (p_j/q_j)^{\beta-1}\Big)^{1/\beta} - 1\right]$$

Special cases of expected scores (Power, Pseudospherical)

[Figure 1. Weighted power score vs. beta (uniform q), with curves for State 1 (p=0.05), State 2 (p=0.25), and State 3 (p=0.70).]

Behavior of the weighted power score for n = 3. For fixed p and q, the scores diverge as β → ±∞. For β << 0 [β >> 2] only the lowest [highest] ...

[Figure 2. Weighted pseudospherical score vs. beta (uniform q), with curves for State 1 (p=0.05), State 2 (p=0.25), and State 3 (p=0.70).]

By comparison, the weighted pseudospherical scores approach fixed limits as β → ±∞. Again, for β << 0 [β >> 2] only the lowest [highest] ...

[Figure 3. Expected scores vs. beta (p=0.05, 0.25, 0.70, uniform q), with curves for Pseudospherical and Power.]

The corresponding expected scores vs. β are equal at β = 1, where both rules converge to the weighted logarithmic scoring rule, but elsewhere the weighted power expected score is strictly larger.

Part 2. Entropy

In statistical physics, the entropy of a system with n possible internal states having probability distribution p is defined (up to a multiplicative constant) by

$$H(\mathbf{p}) = -\sum_{i=1}^{n} p_i \ln p_i$$

In communication theory, the negative entropy −H(p) is the self-information of an event from a stationary random process with distribution p, measured in terms of the average number of bits required to optimally encode it (Shannon 1948).

The KL divergence

The cross-entropy, or Kullback-Leibler divergence, between two distributions p and q measures the expected information gain (reduction in average number of bits per event) due to replacing the wrong distribution q with the right distribution p:

$$D_{KL}(\mathbf{p} \,\Vert\, \mathbf{q}) = \sum_{i=1}^{n} p_i \ln \frac{p_i}{q_i}$$

Properties of the KL divergence

Additivity with respect to independent partitions of the state space:

$$D_{KL}(\mathbf{p}_A \times \mathbf{p}_B \,\Vert\, \mathbf{q}_A \times \mathbf{q}_B) = D_{KL}(\mathbf{p}_A \,\Vert\, \mathbf{q}_A) + D_{KL}(\mathbf{p}_B \,\Vert\, \mathbf{q}_B)$$

Thus, if A and B are independent events whose initial distributions q_A and q_B are respectively updated to p_A and p_B, the total expected information gain in their product space is the sum of the separate expected information gains, as measured by their KL divergences.

Recursivity with respect to the splitting of events:

$$D_{KL}(p_1, \ldots, p_n \,\Vert\, q_1, \ldots, q_n) = D_{KL}(p_1 + p_2, p_3, \ldots, p_n \,\Vert\, q_1 + q_2, q_3, \ldots, q_n) + (p_1 + p_2)\, D_{KL}\!\left(\tfrac{p_1}{p_1 + p_2}, \tfrac{p_2}{p_1 + p_2} \,\Big\Vert\, \tfrac{q_1}{q_1 + q_2}, \tfrac{q_2}{q_1 + q_2}\right)$$

Thus, the total expected information gain does not depend on whether the true state is resolved all at once or via a sequential splitting of events.
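Both properties are easy to confirm numerically. The sketch below assumes the standard definition of the KL divergence with natural logarithms; the helper names and the example distributions are illustrative only.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence sum_i p_i * ln(p_i / q_i), skipping zero-probability states."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def product(a, b):
    """Joint distribution of two independent events."""
    return [ai * bj for ai in a for bj in b]

# Additivity: the divergence over a product space is the sum of the marginal divergences.
pA, qA = [0.2, 0.8], [0.5, 0.5]
pB, qB = [0.1, 0.3, 0.6], [1 / 3, 1 / 3, 1 / 3]
joint_gain = kl(product(pA, pB), product(qA, qB))
separate_gain = kl(pA, qA) + kl(pB, qB)

# Recursivity: merge states 1 and 2, then resolve the split within the merged event.
p, q = [0.05, 0.25, 0.70], [1 / 3, 1 / 3, 1 / 3]
p12, q12 = p[0] + p[1], q[0] + q[1]
coarse_gain = kl([p12, p[2]], [q12, q[2]])
split_gain = kl([p[0] / p12, p[1] / p12], [q[0] / q12, q[1] / q12])
sequential_gain = coarse_gain + p12 * split_gain
direct_gain = kl(p, q)
```

The split term is weighted by p_1 + p_2 because the second-stage gain is realized only when the merged event occurs.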

Other divergence/distance measures

The Chi-square divergence (Pearson 1900) is used by frequentist statisticians to measure goodness of fit:

$$\chi^2(\mathbf{p} \,\Vert\, \mathbf{q}) = \sum_{i=1}^{n} \frac{(p_i - q_i)^2}{q_i}$$

The Hellinger distance is a symmetric measure of distance between two distributions that is popular in machine learning applications (normalization conventions vary):

$$D_{H}(\mathbf{p}, \mathbf{q}) = \Big(\sum_{i=1}^{n} \big(\sqrt{p_i} - \sqrt{q_i}\big)^2\Big)^{1/2}$$

Onward to generalized divergence...

The properties of additivity and recursivity can be considered as axioms for a measure of expected information gain which imply the KL divergence. However, weaker axioms of pseudoadditivity and pseudorecursivity lead to parametric families of generalized divergence. These generalized divergences interpolate and extrapolate beyond the KL divergence, the Chi-square divergence, and the Hellinger distance.

Power divergence

The directed divergence of order β, a.k.a. the power divergence, was proposed by Havrda & Charvát (1967) and further elaborated by Rathie & Kannappan (1972), Cressie & Read (1980), and Haussler & Opper (1997), among others:

$$D^{P}_{\beta}(\mathbf{p} \,\Vert\, \mathbf{q}) = \frac{\sum_i p_i (p_i/q_i)^{\beta-1} - 1}{\beta(\beta - 1)}$$

It is pseudoadditive and pseudorecursive for all β, and it coincides with the KL divergence at β = 1.

It is identical to the weighted power expected score, hence the power divergence is the implicit information measure behind the weighted power scoring rule.

Pseudospherical divergence

An alternative generalized entropy was introduced by Arimoto (1971) and further studied by Sharma & Mittal (1975), Boekee & Van der Lubbe (1980) and Lavenda & Dunning-Davies (2003), for β > 1:

$$H^{S}_{\beta}(\mathbf{p}) = \frac{\beta}{\beta - 1}\left[1 - \Big(\sum_i p_i^{\beta}\Big)^{1/\beta}\right]$$

The corresponding divergence, which we call the pseudospherical divergence, is obtained by introducing a baseline distribution q and dividing out the unnecessary β in the numerator:

$$D^{S}_{\beta}(\mathbf{p} \,\Vert\, \mathbf{q}) = \frac{1}{\beta - 1}\left[\Big(\sum_i p_i (p_i/q_i)^{\beta-1}\Big)^{1/\beta} - 1\right]$$

Properties of the pseudospherical divergence

It is defined for all real β (not merely β > 1). It is pseudoadditive but generally not pseudorecursive. It is identical to the weighted pseudospherical expected score, hence the pseudospherical divergence is the implicit information measure behind the weighted pseudospherical scoring rule.

Interesting special cases

The power and pseudospherical divergences both coincide with the KL divergence at β = 1. At β = 0, β = ½, and β = 2 they are linearly (or at least monotonically) related to the reverse KL divergence, the squared Hellinger distance, and the Chi-square divergence, respectively.

Where we've gotten so far...

There are two parametric families of weighted, strictly proper scoring rules which correspond exactly to two well-known families of generalized divergence, each of which has a full spectrum of possibilities (−∞ < β < ∞). But what is the decision-theoretic significance of these quantities? What are some guidelines for choosing between the two families and their parameters?
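The special cases of the two divergence families mentioned above can be checked numerically. The sketch below assumes the power divergence (E − 1)/(β(β − 1)) and the pseudospherical divergence (E^(1/β) − 1)/(β − 1), with E = Σ p_i (p_i/q_i)^(β−1), and takes the squared Hellinger distance to be Σ(√p_i − √q_i)² without a factor of ½; the function names and the specific constants in the comparisons follow from those assumed forms rather than from the presentation.

```python
import math

def moment(p, q, beta):
    """E = sum_i p_i * (p_i/q_i)^(beta-1)."""
    return sum(pi * (pi / qi) ** (beta - 1) for pi, qi in zip(p, q))

def power_divergence(p, q, beta):
    """Assumed form (E - 1) / (beta * (beta - 1)); beta not in {0, 1}."""
    return (moment(p, q, beta) - 1) / (beta * (beta - 1))

def pseudospherical_divergence(p, q, beta):
    """Assumed form (E^(1/beta) - 1) / (beta - 1); beta not in {0, 1}."""
    return (moment(p, q, beta) ** (1 / beta) - 1) / (beta - 1)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = [0.05, 0.25, 0.70]
q = [1 / 3, 1 / 3, 1 / 3]
chi2 = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))
hellinger_sq = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
eps = 1e-6  # stands in for the limits beta -> 0 and beta -> 1

checks = {
    "power, beta=2 vs chi2/2": (power_divergence(p, q, 2.0), chi2 / 2),
    "pseudo, beta=2 vs sqrt(1+chi2)-1": (pseudospherical_divergence(p, q, 2.0), math.sqrt(1 + chi2) - 1),
    "power, beta=1/2 vs 2*Hellinger^2": (power_divergence(p, q, 0.5), 2 * hellinger_sq),
    "power, beta->1 vs KL(p||q)": (power_divergence(p, q, 1 + eps), kl(p, q)),
    "power, beta->0 vs KL(q||p)": (power_divergence(p, q, eps), kl(q, p)),
    "pseudo, beta->0 vs 1-exp(-KL(q||p))": (pseudospherical_divergence(p, q, eps), 1 - math.exp(-kl(q, p))),
}

# The power divergence should dominate the pseudospherical divergence at every beta tried.
betas = [-2.0, -1.0, -0.5, 0.25, 0.5, 0.75, 1.5, 2.0, 3.0]
dominance = [power_divergence(p, q, b) >= pseudospherical_divergence(p, q, b) for b in betas]
```

At β = 2 the power divergence is linear in the Chi-square divergence while the pseudospherical divergence is only monotone in it, which illustrates the "linearly (or at least monotonically) related" wording.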

Part 3. Financial decisions under uncertainty with linear risk tolerance

Suppose that an investor with subjective probability distribution p and utility function u bets or trades optimally against a risk-neutral opponent or contingent claim market with distribution q. For any risk-averse utility function, the investor's gain in expected utility yields an economic measure of the divergence between p and q. In particular, suppose the investor's utility function belongs to the linear risk tolerance (HARA) family, i.e., the family of generalized exponential, logarithmic, and power utility functions.

Risk aversion and risk tolerance

Let y denote gain or loss relative to a (riskless) status quo wealth position, and let u(y) denote the utility of y. The monetary quantity τ(y) ≡ −u′(y)/u″(y) is the investor's local risk tolerance at y (the reciprocal of the Pratt-Arrow measure of local risk aversion). The usual decision-analytic rule of thumb is as follows: an investor with current wealth y and local risk tolerance τ(y) is roughly indifferent to accepting a 50-50 gamble between the wealth positions y + τ(y) and y − ½τ(y), i.e., indifferent to gaining τ(y) or losing ½τ(y) with equal probability.
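The rule of thumb can be illustrated with an exponential utility, whose risk tolerance is the same at every wealth level. This is a sketch under two assumptions that are not from the presentation: the utility is u(y) = 1 − exp(−y/τ), and the rule is taken in its common form (gain τ or lose τ/2 with equal probability). The function name and the value of τ are illustrative.

```python
import math

def certainty_equivalent(outcomes, probs, tau):
    """Certainty equivalent of a gamble under u(y) = 1 - exp(-y/tau)."""
    expected_utility = sum(pr * (1 - math.exp(-y / tau)) for y, pr in zip(outcomes, probs))
    # Invert u: the sure amount whose utility equals the gamble's expected utility.
    return -tau * math.log(1 - expected_utility)

tau = 100.0

# Gain tau or lose tau/2 with equal probability: nearly worthless to this investor.
ce_rule_of_thumb = certainty_equivalent([tau, -tau / 2], [0.5, 0.5], tau)

# Gain tau or lose tau with equal probability: clearly unattractive.
ce_symmetric = certainty_equivalent([tau, -tau], [0.5, 0.5], tau)
```

The first certainty equivalent is within one percent of τ from zero, while the symmetric gamble is worth a substantial negative amount, which is why the loss side of the rule is half the gain side.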

Linear risk tolerance (LRT) utility

The most commonly used utility functions in decision analysis and financial economics have the property of linear risk tolerance, i.e., τ(y) = α + βy, where α > 0 and β is the risk tolerance coefficient. If the unit of money is normalized so that the risk tolerance equals 1 at y = 0 (status quo wealth), then α = 1, and the utility function is u(y) = g_β(y), where:

$$g_{\beta}(y) = \frac{(1 + \beta y)^{(\beta-1)/\beta} - 1}{\beta - 1}$$

Special cases of normalized LRT utility

β = −1 (quadratic): g_β(y) = y − y²/2
β = 0 (exponential): g_β(y) = 1 − exp(−y)
β = 1 (logarithmic): g_β(y) = ln(1 + y)
β = 2 (square-root): g_β(y) = (1 + 2y)^(1/2) − 1

Qualitative properties of LRT utility

g_β(0) = 0 and g′_β(0) = 1 for all β: the functions {g_β(y)} are mutually tangent with dollar-utile parity at y = 0.
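The normalized LRT family can be sketched directly, together with a numerical check that its risk tolerance is 1 + βy. The closed form used here, g_β(y) = ((1 + βy)^((β−1)/β) − 1)/(β − 1) with its exponential and logarithmic limits, is an assumed reconstruction; the function names and the finite-difference step are illustrative.

```python
import math

def lrt_utility(y, beta):
    """Normalized LRT utility g_beta(y); requires 1 + beta*y > 0."""
    if beta == 0:
        return 1.0 - math.exp(-y)   # exponential limit
    if beta == 1:
        return math.log(1.0 + y)    # logarithmic limit
    return ((1.0 + beta * y) ** ((beta - 1.0) / beta) - 1.0) / (beta - 1.0)

def risk_tolerance(y, beta, h=1e-4):
    """Numerical -g'(y)/g''(y) using central finite differences."""
    g = lambda t: lrt_utility(t, beta)
    first = (g(y + h) - g(y - h)) / (2 * h)
    second = (g(y + h) - 2 * g(y) + g(y - h)) / h ** 2
    return -first / second

def slope_at_zero(beta, h=1e-6):
    """Numerical g'(0); dollar-utile parity says this is 1 for every beta."""
    return (lrt_utility(h, beta) - lrt_utility(-h, beta)) / (2 * h)

# Risk tolerance at y = 0.3 for the four named special cases; expected 1 + beta * 0.3.
tolerances = {beta: risk_tolerance(0.3, beta) for beta in (-1, 0, 1, 2)}
```

Exact comparisons of beta with 0 and 1 are adequate for a sketch; values very close to those points would need a series expansion to stay numerically stable.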

[Figure 4. Normalized LRT utility functions (beta = risk tolerance coefficient), with curves for beta = -1 (quadratic), beta = 0 (exponential), beta = 1 (logarithmic), and beta = 2 (square-root).]

The investor's decision model

Model Y: the investor seeks the payoff vector y that maximizes her own LRT expected utility under her distribution p subject to not decreasing the opponent's linear expected utility (i.e., expected value) under his distribution q. The investor's reward in state i is her own ex post utility payoff g_β(y_i).

A modified decision model

Modified Model Y: the investor seeks the payoff vector y that maximizes the sum of her own LRT expected utility under her distribution p and the opponent's linear expected utility (expected value) under his distribution q. The investor's reward in state i is her own ex post utility payoff g_β(y_i) plus the opponent's ex ante expected monetary payoff.

Main result

1. In the solution of Model Y, the investor's utility payoff in state i is the weighted pseudospherical score, whose expected value is the pseudospherical divergence.
2. In the solution of the modified Model Y, the investor's utility payoff in state i is the weighted power score, whose expected value is the power divergence.
3. For any p, q, and β, the weighted power expected score (power divergence) is greater than or equal to the weighted pseudospherical expected score (pseudospherical divergence).

Observations

Insofar as Model Y is a more realistic investment problem than the modified Model Y, the pseudospherical divergence appears to be more economically meaningful than the power divergence. The same results are obtained if the investor is endowed with linear utility while the opponent is risk averse with risk tolerance coefficient 1 − β. Both of these problems involve non-decreasing risk tolerance on the part of the more-risk-averse agent only if β is between 0 and 1.

Extension to incomplete markets

Suppose the investor faces an incomplete market in which asset prices are supported by a convex set of risk neutral distributions. Let Q denote the matrix whose rows are the extreme points of the set of risk neutral distributions. Then the investor seeks the payoff vector y that maximizes her own LRT expected utility under her distribution p subject to the constraint Qy ≤ 0.
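The main result for the complete-market problems can be verified from first-order conditions. The sketch below assumes the normalized LRT utility g_β(y) = ((1 + βy)^((β−1)/β) − 1)/(β − 1) and the weighted score formulas expressed through E = Σ p_i (p_i/q_i)^(β−1); the closed-form optimal payoffs are derived under those assumptions, and all function names are hypothetical.

```python
def moment(p, q, beta):
    """E = sum_i p_i * (p_i/q_i)^(beta-1)."""
    return sum(pi * (pi / qi) ** (beta - 1) for pi, qi in zip(p, q))

def g(y, beta):
    """Normalized LRT utility (assumed form), for beta not in {0, 1}."""
    return ((1 + beta * y) ** ((beta - 1) / beta) - 1) / (beta - 1)

def solve_constrained(p, q, beta):
    """Payoffs maximizing E_p[g(y)] subject to E_q[y] <= 0, from first-order conditions."""
    m = moment(p, q, beta)
    return [((pi / qi) ** beta / m - 1) / beta for pi, qi in zip(p, q)]

def solve_modified(p, q, beta):
    """Payoffs maximizing E_p[g(y)] - E_q[y], from first-order conditions."""
    return [((pi / qi) ** beta - 1) / beta for pi, qi in zip(p, q)]

p, q, beta = [0.05, 0.25, 0.70], [1 / 3, 1 / 3, 1 / 3], 0.5
m = moment(p, q, beta)

# Constrained problem: the ex post utility payoff should equal the weighted pseudospherical score.
y = solve_constrained(p, q, beta)
constrained_rewards = [g(yi, beta) for yi in y]
pseudo_scores = [((pi / qi) ** (beta - 1) / m ** ((beta - 1) / beta) - 1) / (beta - 1)
                 for pi, qi in zip(p, q)]
budget = sum(qi * yi for qi, yi in zip(q, y))  # opponent's expected loss, should be zero

# Modified problem: utility payoff plus the opponent's expected gain should equal the power score.
y_mod = solve_modified(p, q, beta)
opponent_gain = -sum(qi * yi for qi, yi in zip(q, y_mod))
modified_rewards = [g(yi, beta) + opponent_gain for yi in y_mod]
power_scores = [((pi / qi) ** (beta - 1) - 1) / (beta - 1) - (m - 1) / beta
                for pi, qi in zip(p, q)]
```

Because g_β is concave, the first-order conditions identify the maximum; a general-purpose convex solver would be needed only for the incomplete-market version with the constraint Qy ≤ 0.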

This is a convex optimization problem whose dual is to find the risk neutral distribution in the convex hull of the rows of Q that minimizes the pseudospherical divergence from p.

Details of duality relationship

Let z denote a vector of non-negative weights, summing to 1, for the k rows of Q. Then z^T Q is a supporting risk neutral distribution in the convex hull of the rows of Q, and the primal-dual pair of optimization problems is as follows:

$$\text{Primal:} \quad \max_{\mathbf{y}} \; \sum_{i=1}^{n} p_i \, g_{\beta}(y_i) \quad \text{subject to} \quad \mathbf{Q}\mathbf{y} \le \mathbf{0}$$

$$\text{Dual:} \quad \min_{\mathbf{z}} \; D^{S}_{\beta}(\mathbf{p} \,\Vert\, \mathbf{z}^{\mathsf{T}}\mathbf{Q}) \quad \text{subject to} \quad \mathbf{z} \ge \mathbf{0}, \; \mathbf{1}^{\mathsf{T}}\mathbf{z} = 1$$

where g_β is the normalized LRT utility function and D^S_β is the pseudospherical divergence.

Conclusions

The commonly used power & pseudospherical scoring rules can be improved by incorporating a not-necessarily-uniform baseline distribution. The resulting weighted expected scores are equal to well-known generalized divergences. The weighted pseudospherical scoring rule and its divergence have a more natural utility-theoretic interpretation than the weighted power versions. Values of β between 0 and 1 appear to be the most interesting, and the cases β = 0 and β = ½ have been so far under-explored.