The U.S. Census Bureau Adopts Differential Privacy

John M. Abowd
Chief Scientist and Associate Director for Research and Methodology
U.S. Census Bureau
2018 International Methodology Symposium
Ottawa, Ontario, Canada
November 9, 2018

Acknowledgments and Disclaimer
  The opinions expressed in this talk are my own and not necessarily those of the U.S. Census Bureau.
  The application to the Census Bureau's 2020 publication system incorporates work by Daniel Kifer (Scientific Lead), Simson Garfinkel (Senior Computer Scientist for Confidentiality and Data Access), Tamara Adams, Robert Ashmead, Michael Bentley, Stephen Clark, Aref Dajani, Jason Devine, Nathan Goldschlag, Michael Hay, Cynthia Hollingsworth, Meriton Ibrahami, Michael Ikeda, Philip Leclerc, Ashwin Machanavajjhala, Christian Martindale, Gerome Miklau, Brett Moran, Edward Porter, Sarah Powazek, Anne Ross, William Sexton, and Lars Vilhuber [link to the September 2018 Census Scientific Advisory Committee presentation].
  Parts of this talk were supported by the National Science Foundation, the Sloan Foundation, and the Census Bureau (before and after my appointment started).

Disclosure Avoidance for the 2010 Census

This is the official form for all the people at this address. It is quick and easy, and your answers are protected by law.

2010 Census of Population and Housing
Basic results from the 2010 Census:
  Total population              308,745,538
  Household population          300,758,215
  Group quarters population       7,987,323
  Households                    116,716,292

2010 Census Person-Level Database Schema
  Variable                                    Distinct values
  Habitable blocks                                 10,620,683
  Habitable tracts                                     73,768
  Sex                                                       2
  Age                                                     115
  Race/Ethnicity (OMB Categories)                         126
  Race/Ethnicity (SF2 Categories)                         600
  Relationship to person 1                                 17
  National histogram cells (OMB Categories)           492,660
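
The national histogram cell count in the schema table appears to be the product of the distinct values for sex, age, OMB race/ethnicity, and relationship to person 1. The short check below only verifies that arithmetic; the dictionary keys are illustrative names, not Census Bureau variable names.

```python
import math

# Distinct values per variable, copied from the schema table above.
# The key names are hypothetical labels chosen for this sketch.
distinct_values = {
    "sex": 2,
    "age": 115,
    "race_ethnicity_omb": 126,
    "relationship_to_person_1": 17,
}

# Every combination of one value per variable is one histogram cell.
national_cells = math.prod(distinct_values.values())

print(f"{national_cells:,}")  # 492,660
```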

2010 Census: Summary of Publications (approximate counts)
  Publication                             Released counts (including zeros)
  PL94-171 Redistricting                                      2,771,998,263
  Balance of Summary File 1                                   2,806,899,669
  Summary File 2                                              2,093,683,376
  Public-use micro data sample                                   30,874,554
  Lower bound on published statistics                         7,703,455,862
  Statistics/person                                                      25

The 2000 and 2010 Disclosure Avoidance System operated as a privacy filter:
[Figure: processing flow; in the original slide, blue marks public data and red marks confidential data]
  Raw data from respondents: Decennial Response File
  Selection & unduplication: Census Unedited File
  Edits, imputations: Census Edited File
  Confidentiality edits (household swapping), tabulation recodes: Hundred-percent Detail File
  Pre-specified tabular summaries: PL94-171, SF1, SF2 (SF3, SF4, in 2000)
  Special tabulations and post-census research

The protection system relied on swapping households:
[Figure: households swapped between Town 1 and Town 2 within State X]
Advantages of swapping:
  Easy to understand
  Does not affect state counts if swaps are within a state
  Can be run state-by-state
  Operation is invisible to rest of Census processing
Disadvantages:
  Does not consider or protect against database reconstruction attacks
  Does not provide formal privacy guarantees
  Swap rate and details of swapping must remain secret
  Privacy guarantee based on the lack of external data

Database Reconstruction

(Dinur and Nissim 2003): Database Reconstruction
  A statistical database can be reconstructed with a relatively small number of random queries
  Previous work showed that query privacy could only be assured:
    By tracking every query
    Even then, it is exponentially hard
  This work proposes a generalized solution by adding noise

The database reconstruction theorem is the death knell for traditional data publication systems from confidential sources, including those used by national statistical offices.

Reconstruction Equation System
  Collect more than 5 billion statistics from official 2010 Census tables.
  From the sample space at the block and tract level (2 x 115 x 2 x 63 = 28,980), write the linear equations for each sample statistic, including zeros.

Properties of the Reconstruction
  We created equations to solve for the underlying population for every census tract and block
  The solution:
    Can't be overdetermined (known to come from a real person table)
    Usually underdetermined: most blocks have many solutions
    All solutions share some exact images
      For example, block and voting-age variables are the same in every solution
  Full details will be released soon
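
The properties listed above can be illustrated on a toy block. The sketch below is not the Census Bureau's reconstruction code: it assumes a hypothetical schema of only 8 cells (sex x voting age x two invented race codes) and swaps in brute-force enumeration for the linear-equation solver the slides describe. It tabulates a made-up confidential block, then lists every set of person records consistent with the published tables.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

# Hypothetical toy schema, far smaller than the real 2 x 115 x 2 x 63 space.
SEX = ("F", "M")
VOTING_AGE = (False, True)
RACE = ("A", "B")
CELLS = list(product(SEX, VOTING_AGE, RACE))  # 8 possible person records


def tabulate(records):
    """Return the 'published' tables for a block, zeros included."""
    sex_by_age = Counter((sex, adult) for sex, adult, _ in records)
    race = Counter(r for _, _, r in records)
    return (
        tuple(sex_by_age[(s, a)] for s in SEX for a in VOTING_AGE),
        tuple(race[r] for r in RACE),
    )


def reconstruct(published, block_size):
    """Enumerate every multiset of records that reproduces the tables."""
    return [
        candidate
        for candidate in combinations_with_replacement(CELLS, block_size)
        if tabulate(candidate) == published
    ]


# Invented confidential records for a three-person block.
confidential = [("F", True, "A"), ("M", True, "B"), ("M", False, "B")]
published = tabulate(confidential)
solutions = reconstruct(published, len(confidential))

# Project each solution onto (sex, voting age) to see what all solutions share.
shared = {tuple(sorted((s, a) for s, a, _ in sol)) for sol in solutions}

print(len(solutions))  # 3: the block is underdetermined
print(len(shared))     # 1: sex and voting age agree in every solution
```

In this toy block the tables admit three solutions that differ only in which person carries race code "A"; sex and voting age are recovered exactly, which mirrors the "usually underdetermined, but sharing exact images" behavior described above.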

Results Using the Published 2010 Census
  1. The tabulation variables from the confidential hundred-percent detail micro-data file can be reconstructed quite accurately from PL94 + balance of SF1
  2. While there is a vulnerability, the risk of re-identification was small
  3. Experiments are at the person level, not household
  4. Experiments led to the declaration that reconstruction of Title 13-sensitive data is an issue, no longer a risk
  5. This provides strong motivation for the adoption of differential privacy for the 2018 End-to-End Census Test and 2020 Census

Formal Privacy

(Dwork, McSherry, Nissim & Smith 2006): Differential Privacy
  Definition of a criterion that constrains algorithms that process the confidential data
  A generic approach for protecting privacy by adding random noise
  Key features:
    lower bound for the amount of noise that needs to be added
    upper bound for privacy loss
    mechanisms are composable

The DP tradeoff: Accuracy vs. Privacy-loss
[Figure: estimated production technology and estimated marginal social benefit curve; x-axis: privacy-loss (ε); y-axis: data accuracy. Marginal social cost: slope of the estimated production technology. Marginal social benefit: slope of the estimated social benefit curve. Social optimum: marginal social benefit = marginal social cost. Source: Abowd and Schmutte (forthcoming, American Economic Review).]

OnTheMap was the Census Bureau's first product to use Differential Privacy
  This work (Machanavajjhala et al. 2008) was done at Cornell University while Abowd and Vilhuber were on IPA assignments to the Census Bureau. Gehrke is now Technical Fellow at Microsoft. Kifer is now the scientific lead on the 2020 DAS. Machanavajjhala is now a contractual collaborator on the 2020 DAS. Vilhuber is now on IPA assignment to the Census Bureau.

OnTheMap was much easier to make formally private than the decennial census
OnTheMap:
  New data product designed from the ground up to be formally private
  Users were willing to accept data with noise
  Added noise was quantified for users
  Developed by a relatively small, highly-trained team that became comfortable with formal privacy requirements
Decennial Census:
  Data product started in 1790 with many legacy data users
  Most users were not aware that noise had been added with swapping
  Swap rate held confidential
  Huge team, with wide levels of experience and many with no prior experience with formal privacy methods

Formal Privacy and the 2020 Census

In 2017, the Census Bureau announced that it would use differential privacy for the 2020 Census.
Differential privacy provides:
  Provable bounds on the maximum privacy loss
  Algorithms that allow policy makers to manage the trade-off between accuracy and privacy
The graph on this slide is not hypothetical; it was computed from real population data from the 1940 Census using the table specifications for current redistricting data
Pre-Decisional

Consider a census block
  As collected:
                   Male   Female
    Age < 18          4        4
    Age >= 18         4        4
  As reported (more accurate sex distribution):
                   Male   Female
    Age < 18          3        2
    Age >= 18         5        6
  As reported (more accurate age distribution):
                   Male   Female
    Age < 18          6        2
    Age >= 18         3        5
  [Slide annotation: High privacy loss]

There was no off-the-shelf system for applying differential privacy to a national census
We had to create a new system that:
  Produced higher-quality statistics at more densely populated geographies
  Produced consistent tables
We created a new differential privacy algorithm and system that:
  Produces statistics controlling accuracy from the top-down
    E.g., national level -> state level -> county level -> tract level -> block level
  Creates privatized micro-data that can be used for any tabulation without additional privacy loss
  Fits into the decennial census production system

We have created a Disclosure Avoidance System that drops into the 2020 Census production system
[Figure: processing flow from the Decennial Response File to the Census Unedited File to the Census Edited File, through the Global Confidentiality Protection Process (the Disclosure Avoidance System, governed by the privacy-loss budget and accuracy decisions), to the Microdata Detail File, and then to pre-specified tabular summaries (PL94-171, SF1, SF2) and special tabulations and post-census research]

The disclosure avoidance system uses differential privacy to defend against a reconstruction attack
Differential privacy provides:
  Provable bounds on the accuracy of the best possible database reconstruction given the released tabulations
  Algorithms that allow policy makers to decide the trade-off between accuracy and privacy loss
Reconstruction is still possible, but the data intruder doesn't get the confidential data! (gets the safe microdata instead)
Final privacy-loss budget determined by Data Stewardship Executive Policy Committee (DSEP) with recommendation from Disclosure Review Board (DRB)

The Disclosure Avoidance System Relies on Injecting Noise with Formal Privacy Rules
[Figure: Global Confidentiality Protection Process, Disclosure Avoidance System]
Advantages of noise injection with differential privacy:
  Privacy operations are closed under composition
  Privacy guarantees are robust to post-processing
  Privacy guarantees are future-proof
  Privacy guarantees are provable and tunable
  Privacy guarantees are public and explainable
  Protects against database reconstruction attacks
Disadvantages:
  Entire country must be processed at once for best accuracy
  Every use of the private data must be tallied in the privacy-loss budget

The Challenge of Creating the 2020 DAS
Business processes:
  All desired tabulations must be known in advance
  All uses of confidential data must be tracked and accounted for
  Data quality checks on tables cannot be done by looking at raw data
Communications strategy:
  Differential privacy is not widely known or understood outside academia
  Most data users expect the same accuracy regardless of the level of detail
  In 2000 and 2010 we used swapping with an undisclosed swap rate
    The Census Bureau did not quantify the error rate

And we can make the DAS public!
  Open source system
    Source code published on GitHub
  All parameters public including global ε
  Testable with data from 1940 Census

Demonstration of the Actual 2020 Disclosure Avoidance System Using Public 1940 Census Data
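
The budget rule stated in the preceding slides (every use of the confidential data is tallied against a public, global ε) can be sketched in a few lines. This is a toy illustration only, not the DAS source code and not the 1940 demonstration: it assumes the textbook Laplace mechanism from Dwork et al. (2006) and simple sequential composition, and the class and function names are hypothetical.

```python
import random


class PrivacyBudget:
    """Hypothetical accountant: tallies epsilon under sequential composition."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:
            raise RuntimeError("privacy-loss budget exhausted")
        self.spent += epsilon


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is distributed as Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, budget, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon."""
    budget.spend(epsilon)  # tally the use before touching the data
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(2020)
budget = PrivacyBudget(total_epsilon=1.0)

# Two releases, each charged half of the global budget (invented counts).
state_total = noisy_count(1000, 0.5, budget, rng)
county_total = noisy_count(250, 0.5, budget, rng)

print(budget.spent)  # 1.0: the budget is now fully used
```

A third call to `noisy_count` with this `budget` would raise `RuntimeError`, which is the point of the accounting: once the global ε is spent, no further statistic can be released from the confidential data. A smaller ε per query means a larger noise scale, which is the accuracy side of the trade-off.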

Managing the Tradeoff

You know what I look like already

How to Think about the Social Choice Problem
  The marginal social benefit is the sum of all persons' willingness-to-pay for data accuracy with increased privacy loss
  The marginal rate of transformation is the slope of the privacy-loss v. accuracy graphs we have been examining
  This is exactly the same problem being addressed by Google in RAPPOR, Apple in iOS 11, and Microsoft in Windows 10 telemetry
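
The systems named in the last bullet build on randomized response. As a rough illustration of how that mechanism trades privacy loss for accuracy, the sketch below implements plain binary randomized response; it is not RAPPOR's actual protocol (which is considerably more elaborate), and the function names and simulated population are assumptions made for this example.

```python
import math
import random


def randomized_response(truth, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return truth if rng.random() < p_truth else not truth


def estimate_proportion(reports, epsilon):
    """Unbiased estimate of the true share of True answers from noisy reports."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = share * p + (1 - share) * (1 - p); solve for share.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
epsilon = 1.0

# Simulated population in which 30% of the true answers are True.
truths = [i < 3000 for i in range(10_000)]
reports = [randomized_response(t, epsilon, rng) for t in truths]

estimate = estimate_proportion(reports, epsilon)
print(round(estimate, 3))
```

Each respondent's report satisfies ε-differential privacy on its own, because the ratio of the two reporting probabilities is e^ε. The price is that the noise is added per person, so the aggregate estimate is less accurate at a given ε than a mechanism that adds noise once to the tabulated count.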

Production Possibilities for Privacy-loss v. Accuracy Tradeoff
[Figure: production possibilities frontier with the estimated marginal social benefit curve; x-axis: privacy-loss budget; y-axis: data accuracy. Social optimum: MSB = MSC.]

But the Choice Problem for Redistricting Tabulations Is More Challenging
  In the redistricting application, the fitness-for-use is based on
    Supreme Court one-person one-vote decision (All legislative districts must have approximately equal populations; there is judicially approved variation)
    Is statistical disclosure limitation a statistical method (permitted by Utah v. Evans) or sampling (prohibited by the Census Act, confirmed in Commerce v. House of Representatives)?
    Voting Rights Act, Section 2: requires majority-minority districts at all levels, when certain criteria are met
  The privacy interest is based on
    Title 13 requirement not to publish exact identifying information
    The public policy implications of uses of detailed race and ethnicity

Production Possibilities for Alternative Mechanisms
[Figure: data accuracy (0.0 to 1.0) against privacy-loss budget (0.0 to 6.0) for three mechanisms: the proposed 2020 Census differential privacy implementation with use-case based accuracy improvements; a simple differential privacy implementation with no accuracy improvements; and randomized response, the method used by Google, Apple and Microsoft.]

Production Possibilities for Alternative Mechanisms
[Figure: the same three curves, annotated "Where social scientists act like MSC = MSB" and "Where computer scientists act like MSC = MSB".]

Production Possibilities for Alternative Mechanisms
[Figure: production possibilities with estimated marginal social benefit curves, one more accuracy favoring and one more privacy favoring; x-axis: privacy-loss budget; y-axis: data accuracy. Social optima (MSB = MSC): blue tangency (3.5, 94%); green tangency (1.0, 60%).]

Thank you
johnabowd.com

References

Abowd, John M. and Ian M. Schmutte. 2017. Revisiting the economics of privacy: Population statistics and confidentiality protection as public goods. Labor Dynamics Institute, Cornell University, at https://digitalcommons.ilr.cornell.edu/ldi/37/

Abowd, John M. and Ian M. Schmutte. Forthcoming. An Economic Analysis of Privacy Protection and Statistical Accuracy as Social Choices. American Economic Review, at https://arxiv.org/abs/1808.06303

Apple, Inc. 2016. Apple previews iOS 10, the biggest iOS release ever. Press Release (June 13). URL=http://www.apple.com/newsroom/2016/06/apple-previews-ios-10-biggest-ios-release-ever.html.

Ding, Bolin, Janardhan Kulkarni, and Sergey Yekhanin. 2017. Collecting Telemetry Data Privately. NIPS 2017.

Dinur, Irit and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS '03). ACM, New York, NY, USA, 202-210. DOI: 10.1145/773153.773173.

Dwork, Cynthia. 2006. Differential Privacy. 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP 2006), Springer Verlag, 4052, 1-12, ISBN: 3-540-35907-9.

Dwork, Cynthia, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In Halevi, S. & Rabin, T. (Eds.), Theory of Cryptography: Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006, Proceedings. Springer Berlin Heidelberg, 265-284. DOI: 10.1007/11681878_14.

Dwork, Cynthia, Frank McSherry and Kunal Talwar. 2007. The price of privacy and the limits of LP decoding. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing (STOC '07). ACM, New York, NY, USA, 85-94. DOI: 10.1145/1250790.1250804.

Dwork, Cynthia and Moni Naor. 2010. On the Difficulties of Disclosure Prevention in Statistical Databases or The Case for Differential Privacy. Journal of Privacy and Confidentiality: Vol. 2: Iss. 1, Article 8. Available at: http://repository.cmu.edu/jpc/vol2/iss1/8.

Dwork, Cynthia and Aaron Roth. 2014. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science. Vol. 9, Nos. 3-4, 211-407. DOI: 10.1561/0400000042.

Erlingsson, Úlfar, Vasyl Pihur and Aleksandra Korolova. 2014. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security (CCS '14). ACM, New York, NY, USA, 1054-1067. DOI: 10.1145/2660267.2660348.

Garfinkel, Simson L., John M. Abowd and Sarah Powazek. 2018. Issues Encountered Deploying Differential Privacy. WPES 2018, https://arxiv.org/abs/1809.02201.

Kifer, Daniel and Ashwin Machanavajjhala. 2011. No free lunch in data privacy. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data (SIGMOD '11). ACM, New York, NY, USA, 193-204. DOI: 10.1145/1989323.1989345.

Machanavajjhala, Ashwin, Daniel Kifer, John M. Abowd, Johannes Gehrke, and Lars Vilhuber. 2008. Privacy: Theory Meets Practice on the Map. International Conference on Data Engineering (ICDE) 2008: 277-286. DOI: 10.1109/ICDE.2008.4497436.