ESRC Census Development Programme
Identifying the cash-rich and the cash-poor: Lessons from the Census Rehearsal
Dr Paul Williamson
Department of Geography

Why is income so important?
  Arguably the most direct measure of utility [Yes, I know - Money can't buy happiness]
  Helps target Neighbourhood renewal
  Helps with planning [of houses, shops, leisure facilities]
  Consumer marketing
  Tax-benefit analysis
  Most requested addition to 2001 Census

The 2001 Census

Geography of income: Other sources of data on income
  Benefits data
  Government surveys (e.g. GHS, LFS, FES, FRS, NES)
  Commercially-held data [Postcode sector and postcode unit estimates]
  The Census Rehearsal (1999)

Key Objectives
  Evaluation of:
    Extant methods for small-area income estimation
    New approaches
    Utility of non-census information (e.g. council tax; house price; benefits data)
    [Methods of imputing income band means]

Definition of income
  Income or wealth?
  Gross or net income?
  Pre or post housing costs?
  Adult or Household?
  Household?
    Total
    Equivalised [Per capita / OECD / McClements]
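
The difference between total and equivalised household income can be made concrete with a minimal sketch. The slide names the per capita, OECD and McClements scales but gives no weights, so the weights below are an assumption: they are those commonly quoted for the original OECD scale (1.0 for the first adult, 0.7 for each further adult, 0.5 for each child). The function names are hypothetical, and the McClements scale is not covered.

```python
def per_capita_income(total_income, n_adults, n_children):
    """Total household income shared equally among all household members."""
    members = n_adults + n_children
    if members < 1:
        raise ValueError("household must have at least one member")
    return total_income / members


def oecd_equivalised_income(total_income, n_adults, n_children,
                            extra_adult_weight=0.7, child_weight=0.5):
    """Total household income divided by an OECD-style equivalence scale.

    Assumed weights (not given in the slides): 1.0 for the first adult,
    0.7 for each additional adult, 0.5 for each child.
    """
    if n_adults < 1:
        raise ValueError("household must contain at least one adult")
    scale = 1.0 + extra_adult_weight * (n_adults - 1) + child_weight * n_children
    return total_income / scale


if __name__ == "__main__":
    # Illustrative household: two adults, two children, 480 per week in total.
    print(per_capita_income(480, 2, 2))
    print(round(oecd_equivalised_income(480, 2, 2), 2))
```

Passing `extra_adult_weight=0.5, child_weight=0.3` gives the weights usually quoted for the modified OECD scale; which variant, if any, the analysis used is not stated in the slides.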

Surrogates
  Univariate
    % unemployed
    % 2+ car households
    % residents in Social Classes I + II
    % owner-occupation
  Multivariate (deprivation indices)
    Carstairs [unemployment; overcrowding; not owning car; head in Social Class IV or V]
    Townsend [unemployment; overcrowding; not owning home; not owning car]
    Breadline [not owning car; not owning house; lone parenthood; Social Class IV or V; illness; unemployment]
    DTLR Index of Multiple Deprivation 2000
    Green (Wealth) [owning 2+ cars; NS-SEC I or II; high qualifications]
  Geodemographic
    SuperProfiles
    MOSAIC
    GB Profiles
  Model
    Individual income
      Dale [SOC2000; economic activity; age; sex; region]
      Lee [SOC2000; economic activity]
      Regression (individual and/or ecological)
    Household income
      Regression (household and/or ecological)
      Bramley & Smart (household composition; earners; tenure; area level deprivation)

The 1999 Census Rehearsal
  Key features
    Full census questionnaire + INCOME
    Large achieved sample
      c. 65,000 households
      c. 140,000 individuals
    Spatially contiguous
    Clustered sampling strategy:
      7 part districts [excluding NI]
      38 wards
      650 EDs

Potential problems
  Non-response rate
    overall (~50%)
    income (~15%)
    other variables (5-20%)
  Full responses for ~55% of achieved sample [individuals and households]
  Non-response bias

Income Band
  (£ per week)      0      <60    60-119  120-199  200-299  300-479  480+   Total   N
  All               20.8   13.2   20.5    15.5     13.3     11.3     5.5    100.0   125138
  No data missing   16.9   11.0   18.7    16.1     15.9     14.6     6.8    100.0   67283

Social Class (1991)
                    All      No data missing
  None              28.2     25.4
  I                 4.0      4.9
  II                18.9     21.4
  III(N)            17.5     19.3
  III(M)            11.5     10.8
  IV                14.7     13.7
  V                 5.0      4.3
  Army              0.2      0.2
  Total             100.0    100.0
  N                 117010   67283

Correlation coefficient
  Indicators (calculated for Rehearsal sub-set 1991 Enumeration districts); columns: Original, Ideal
    Townsend index                0.82   0.79
    % households with
      No car
      2+ cars
    % households
      Owner-occupied
      Social rented
      Detached
      Detached or semi
      Flats
    % of economically active
      Unemployed
      Social Class I+II
  Coefficients for the indicators after the Townsend index (in source order):
    0.89 0.87 0.86 0.83 0.90 0.94 0.98 0.93 0.92 0.87 0.92 0.97 0.94 0.92
    0.57 0.58 0.55 0.56

Banding of income question
What is your total current gross income from all sources?
  Per week (£)     or   Per year (approximately, £)
  Nil                   Nil
  Less than 60          Less than 3,000
  60 to 119             3,000 to 5,999
  120 to 199            6,000 to 9,999
  200 to 299            10,000 to 14,999
  300 to 479            15,000 to 24,999
  480 or more           25,000 or more

  Only 10% of adults in top band
    but problem compounded when individual incomes aggregated to estimate household income
  band mid-point
  band mean
    value of band means area sensitive?

Income band mean, by Council Tax band
  Income band (£)   National Average   Band A   Band H
  0                 0                  0        0
  1-60              34                 35       27
  61-120            91                 93       86
  121-200           156                155      156
  201-300           245                241      242
  301-480           375                364      391
  481+              765                652      1353
  Source: FRS 1998/9 (Crown Copyright)

Digression: modelling income band means
  Alternative modelling strategies include:
    National mean
    Sub-group mean (e.g. by council tax band)
    Statistical distributions (log-normal; Pareto)
    New variant of log-normal approach with addition of modelled median etc.

Results
  For all bands, sub-group mean best
  if possible
  For closed bands, national mean is next best
  For open (top) band, new proposed log-normal approach is best, particularly where there is evidence of strong spatial clustering

Results of modelling top income band
Sample size 1000
  Imputation methods: Sub-group mean; National mean; Log-normal (%); Pareto (2-point); Pareto (x-point)
  Columns: Error and Error sq., averaged over socio-economic strata (cars; council tax) and over spatial strata (region)
  Values (in source order):
    14 7437 -5 5062
    7 -53 51933 7373
    12 12847 5820 -18
    149 59261 157 35923
    211 83351 301 100911
  [Same ranking applies for sample sizes from 15000 to 150]

Results of modelling top income band
Sample size 20
  Imputation methods: Sub-group mean; National mean; Log-normal (%); Pareto (2-point); Pareto (x-point)
  Columns: Error and Error sq., averaged over socio-economic strata (cars; council tax) and over spatial strata (region)
  Values (in source order):
    9 175501 -2 347994
    6 352234 -101 232754
    30 208267 41 375803
    185 260801 221 443589
    252 286535 324 486772

Spatial scale
  At what scale does income vary most?

MAUP
  1991 vs 1998/9 boundaries
  Zones with <10 households or 25 residents excluded from analysis

SOC 2000 / NS-SEC
  Lack of alternative SOC2000 coded data
  Therefore have to use Census Rehearsal data
  Use partitioned data to avoid unduly
  advantaging SOC2000-based approaches

Results

[Figure: Census Rehearsal Income Distribution. Vertical axis ticks 0 to 30; horizontal axis: Annual Gross Income (£000s): Nil, <3, 3-5, 6-9, 10-14, 15-24, 25+]

Heterogeneity rules OK!
  At ward level the % of household reps. in top income-band averaged 9.1% but ranged from 2.8% to 21.6%
  89% of EDs contained one or more household reps. in top income-band, i.e. in top income-decile of the population

[Figure: Income distribution of household representative (Person 1 on Census Rehearsal form). Panels: All EDs; EDs in lowest income quintile; EDs in second income quintile; EDs in middle income quintile; EDs in fourth income quintile; EDs in top income quintile. Horizontal axis: Income bracket (£000 p.a.): Nil, <3, 3-5, 6-9, 10-14, 15-24, 25+]

Missing data
  Missing data have minimal impact on results
    From Raw to Ideal data, most correlations change by <0.02
    Very few values change by >0.05
    Exception is NS-SEC 8 [by definition!]
  Correlations lower for Ideal than Raw
    Surrogates calculated directly from Rehearsal circumvent response bias?

Scale
  Higher correlations at higher geographies
  District effect small but significant
  BUT none of the districts are in SE England

Overfitting
  No significant impact

MAUP
  Correlations vary by up to 0.1 between alternative boundaries at same spatial scale
  BUT no detectable effect on rankings of surrogate income measures

Adult income (r²)
  Surrogate                Ward   ED     Postcode
  Univariate
    NS-SEC I+II            0.81   0.81   0.64
  Multivariate
    Townsend               0.36   0.46   0.38
    Green (wealth)         0.57   0.55   0.50
  Geodemographic
    PCA_96                 na     0.82   0.69
    Voas                   0.83   0.59   0.48
  Model
    Dale                   0.91   0.89   0.90
    Lee                    0.90   0.87   0.88
    Voas (individual)      0.91   0.80   0.83
  [See final slide for definition of surrogates]

Regression model (adults)
  Age, Age², sex, ethnicity, marital status
  Type and tenure of dwelling
  Qualifications
  Economic (in)activity and health
  Mean SOC2000 and SIC2000 income
  Supervisory status
  District of residence
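
To illustrate the kind of regression listed above, the following is a minimal sketch rather than the model actually fitted. The definitions at the end of the deck give the dependent variable as INCOME0.5, read here as the square root of income, which is an assumption. The sketch uses ordinary least squares via NumPy, only two illustrative predictors (age and age squared), and made-up data; the function names are hypothetical.

```python
import numpy as np


def fit_sqrt_income_model(predictors, income):
    """Ordinary least squares fit of sqrt(income) on the predictors.

    Returns the coefficient vector, intercept first.
    """
    design = np.column_stack([np.ones(len(predictors)), predictors])
    coef, *_ = np.linalg.lstsq(design, np.sqrt(income), rcond=None)
    return coef


def predict_income(coef, predictors):
    """Predict on the square-root scale, then square to return to income."""
    design = np.column_stack([np.ones(len(predictors)), predictors])
    return (design @ coef) ** 2


if __name__ == "__main__":
    # Made-up illustration: weekly income rising, then flattening, with age.
    ages = np.arange(20.0, 60.0)
    predictors = np.column_stack([ages, ages ** 2])
    income = (2.0 + 0.8 * ages - 0.009 * ages ** 2) ** 2
    coef = fit_sqrt_income_model(predictors, income)
    print(predict_income(coef, predictors[:3]))
```

Squaring a prediction made on the square-root scale tends to understate the mean when residuals are noisy; whether and how the original analysis corrected for this is not stated in the slides.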

Caveats
  Best performing surrogates in danger of over-fitting?
    For Dale, Lee and Voas, mean occupational income calculated directly from Census Rehearsal dataset (no other SOC2000 sources available at time of analysis)
  BUT
    No significant difference if SOC minor or unit codes used
    No significant difference if data partitioned

Household income (r²)
  Surrogate                Ward   ED     Postcode
  Univariate
    NS-SEC I+II            0.82   0.81   0.64
  Multivariate
    Townsend               0.48   0.46   0.44
    Green (wealth)         0.61   0.50   0.56
  Geodemographic
    PCA_96                 na     0.81   0.67
    Voas                   0.81   0.60   0.48
  Model
    Dale                   0.90   0.85   0.86
    Lee                    0.87   0.83   0.83
    Voas (household)       0.76   0.74   0.74
  [See final slide for definition of surrogates]

Accuracy
  For many purposes relative, rather than absolute, accuracy is most important
    ranking

[Figure: three scatter plots against observed mean individual income (£/week, 0 to 600).
  a) NS-SEC based income surrogate [NSSEC12]: % of economically active in NSSEC 1+2 (0% to 100%)
  b) Regression based estimate [VOASIND]: predicted mean individual income (£/week, 0 to 600)
  c) Sub-group mean based estimate [LEEINCM]: predicted mean individual income (£/week, 0 to 600)]

% ranked in same decile as income
  Surrogate/Estimate   % NSSEC 1+2   Individual regression   Sub-group mean   Ecological regression
  Decile               [NSSEC12]     [VOASIND]               [LEEINCM]
  1 [low income]       71            66                      74               80
  2                    46            34                      40               52
  3                    32            40                      35               43
  4                    32            26                      37               40
  5                    25            34                      39               37
  6                    17            28                      45               30
  7                    26            28                      43               31
  8                    23            35                      48               46
  9                    28            51                      57               60
  10 [high income]     55            77                      82               82
  Overall              36            42                      50               46
  Within 1 decile      82            84                      89               92

Other data sources
  < 1% of unexplained spatial variation in income attributable to area level effects
  House price has no significant impact
    could be due to data problems
  Council tax band has small but significant effect [for areas of enumeration district size and below]
  Lack of utility counter-intuitive?
    current value
    purchase price
    purchase income
    current income

Conclusions (I)
  Best approaches capture 80-90% of spatial variation in income, even for smallest spatial units
  But considerable within-area heterogeneity
  Best approaches are regression or sub-group mean based
  Conventional deprivation indices a poor second to % social class / NS-SEC I+II

Conclusions (II)
  Geodemographic classifications at best perform as well as % NS-SEC I+II, and perform best for areas of ward size and above
  Qualified support for use of statistical distributions in modelling top income band means

Implications
  Moral for marketers: Target people, not places
  Moral for policy makers: Deprivation indices not the best proxy for income
  ONS ward income estimates (based on ecological regression) likely to perform well

Longer term
  Consider external correlates (e.g. IMD 2000; benefits data)
  Lobby for Census Office to create small-area income estimate by
    imputing income on Census microdata
    including non-census information (?)

Acknowledgements
  House price data were taken from the Experian Limited Postal Sector Data, ESRC/JISC Agreement. Grateful thanks are due to the Census Custodians of England, Wales and Scotland for granting permission to access the Census Rehearsal dataset. A debt of gratitude is also owed to a number of people at the Office for National Statistics, in particular Keith Whitfield and Philip Clarke. Finally, thanks are due to David Voas for undertaking some of the preparatory work for this project. All analyses and conclusions remain the sole responsibility of Dr Paul Williamson.

Definitions (I)
  NS-SEC I+II: % persons aged 16-74 in NS-SEC I or II
  Townsend: Multiple deprivation indicator based on % economically active unemployed; % overcrowded households; % households with no car; and % of households not owner-occupied
  Green (Wealth): Affluence indicator based on % households with 2+ cars; % persons aged 16-74 in NS-SEC I; and % adults with high educational qualifications
  PCA_96: Geodemographic classification based on principal components analysis of 20 normalised census variables; individuals in each of 96 area types assumed to have mean income of all persons in area type
  Voas: Alternative geodemographic classification, in which five census variables are divided into above or below median and one variable into thirds, with all cross-tabulated to give a total of 96 discrete area types

Definitions (II)
  Dale: Income imputed given mean income for population sub-group defined by sex, SOC 2000 minor group, economic activity (missing; employed full-time; employed part-time; self-employed; other) and age (missing; 0-15; 16-19; 20-29; 30-49; 50+) [maximum of 4860 valid sub-groups]
  Lee: Income imputed given mean income for population sub-group defined by SOC 2000 minor group and economic activity (child; not applicable; employed full-time; employed part-time; self-employed; unemployed; retired; other inactive) [maximum of 649 valid sub-groups]

Definitions (III)
  Voas (individual): Regression model for adult income (children assumed to have 0 income); INCOME^0.5 predicted given: mean income by SOC2000 unit; mean income by industry category; age; age²; residents; residents²; rooms; and cars; plus dummy variables for sex, white, full-time student, married, single/widowed/divorced, long-term ill, no qualifications, GCSE or equivalent, A levels or equivalent, undergraduate degree or equivalent, employed full-time, employed part-time, self-employed, unemployed, retired, permanently sick, other economically inactive excluding pensioners and students, semi-detached, terrace, flat, caravan, privately rented, social rented, employed manager or supervisor, and district of residence
  Voas (household): Regression model for total household income; HHINC^0.5 predicted given same set of predictors as for Voas (individual), but based only upon head of household's characteristics
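
The sub-group mean approach in the Dale and Lee definitions can be sketched in a few lines. This is an illustration under stated assumptions, not the original implementation: sub-groups are keyed on just two fields (occupation group and economic activity, as in the Lee definition), the records and codes are made up, and the fallback value used for sub-groups with no observed income is a hypothetical choice that the slides do not specify.

```python
from collections import defaultdict


def subgroup_means(records):
    """Mean income per (occupation group, economic activity) sub-group.

    records: iterable of (occupation_group, activity, income) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])  # sub-group -> [sum, count]
    for occupation, activity, income in records:
        entry = totals[(occupation, activity)]
        entry[0] += income
        entry[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def impute_area_mean(people, means, fallback):
    """Mean imputed income for an area.

    people: list of (occupation_group, activity) tuples for area residents.
    fallback: income assigned when a sub-group was never observed
              (an assumption made for this sketch).
    """
    imputed = [means.get(person, fallback) for person in people]
    return sum(imputed) / len(imputed)


if __name__ == "__main__":
    # Made-up survey records with known income.
    survey = [("411", "full-time", 300.0),
              ("411", "full-time", 340.0),
              ("922", "part-time", 90.0)]
    means = subgroup_means(survey)
    # Made-up area whose residents have no recorded income.
    area = [("411", "full-time"), ("922", "part-time"), ("999", "full-time")]
    print(round(impute_area_mean(area, means, fallback=200.0), 2))
```

The caveat raised earlier in the deck applies to any such sketch: if the sub-group means are calculated from the same dataset they are evaluated against, performance may be flattered, which is why the analysis also checked partitioned data.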