Practical Data Privacy by Katharine Jarmul
Author: Katharine Jarmul
Language: eng
Format: epub, mobi
Publisher: O'Reilly Media, Inc.
Published: 2023-07-24T16:00:00+00:00
Understanding Differential Privacy
To help you better understand how differential privacy actually works with your data, let's walk through a few implementation examples.
We'll start by analyzing a real-world use case, and then build our own mechanism and consider its privacy guarantees.
Differential Privacy in Practice: Anonymizing the US Census
The Constitution of the United States of America calls for a full census of all persons every ten years. This count is used for numerous significant decisions, including representation in Congress, federal funding, and monetary support for state initiatives. Getting it right (ensuring that everyone is counted, but only once) requires a huge effort. The privacy implications are equally significant.
In the past, the US Census Bureau has used a variety of obfuscation methods to ensure "anonymization" of the results. These included combinations of aggregation and a method called shuffling (or sometimes scrambling), which took census block data and shuffled the households so that the census blocks were mixed with one another. Because these methods retained a lot of information about individuals, they allowed private information to leak in ways the original Census workers did not anticipate.
To determine the potential for outsiders to re-identify households in the released data, the Census Bureau ran several attacks on the 2010 Census results. By reconstructing combinations of age, gender, and race/ethnicity and combining them with a readily available external source, they were able to correctly re-identify records with 38% accuracy. Such external sources often have complete identity information (names, contact information, and other details); one could be a consumer database, a voting or driving record database (for adults), or even an insurance database. For smaller census blocks, they were able to perform this re-identification with much higher success. With data ubiquitously available for free or at low prices, or when performed by a company with broad access to consumer and household data, such as a large e-commerce provider, this type of attack is not only feasible; it is actively used for direct marketing campaigns and targeted advertising.
How exactly did these reconstruction attacks work? The Bureau literally built a system of equations from the published statistics and used a solver to determine potential candidates (more details in this article). From those candidates, they were able to deduce the most probable ones by linking this information with a few consumer databases or another dataset acquired via a data breach. Although there were plenty of false positives, the attack proved even more effective in less populated regions (up to 72%).
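To make the "system of equations plus solver" idea concrete, here is a minimal sketch. It is not the Census Bureau's actual attack: the published statistics are invented for a hypothetical three-person block, and a brute-force search stands in for a real constraint solver. All names (PUBLISHED, consistent, reconstruct) are illustrative assumptions.

```python
from itertools import combinations_with_replacement
from statistics import median

# Hypothetical published statistics for one tiny block (invented for illustration).
PUBLISHED = {
    "count": 3,            # number of residents
    "mean_age": 30,        # mean age of all residents
    "median_age": 30,      # median age of all residents
    "minors": 1,           # residents under 18
    "mean_adult_age": 40,  # mean age of residents aged 18 or older
}


def consistent(ages, stats):
    """Check whether a candidate tuple of ages satisfies every published statistic."""
    adults = [a for a in ages if a >= 18]
    minors = len(ages) - len(adults)
    return (
        sum(ages) == stats["mean_age"] * stats["count"]
        and median(ages) == stats["median_age"]
        and minors == stats["minors"]
        and len(adults) > 0
        and sum(adults) == stats["mean_adult_age"] * len(adults)
    )


def reconstruct(stats, max_age=100):
    """Enumerate every sorted combination of ages and keep the consistent ones."""
    return [
        ages
        for ages in combinations_with_replacement(range(max_age + 1), stats["count"])
        if consistent(ages, stats)
    ]


candidates = reconstruct(PUBLISHED)
print(candidates)  # [(10, 30, 50)]
```

In this toy block, five innocuous-looking aggregates leave exactly one possible set of ages. Each additional published table adds constraints, which is why small blocks are especially exposed, and why the remaining step of linking candidates to an external database can succeed.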
As a consequence, the US Census Bureau decided to use differential privacy for the 2020 Census. The task at hand: could they create a data workflow that produced differentially private census results for 308 million persons while remaining usable for the critical tasks requiring accurate results?
They refined their privacy parameters using example data and prior census responses, determining exactly what noise measurements and distributions fit their needs. They worked diligently to determine the balance of privacy to utility, ensuring the most accuracy while still guaranteeing basic privacy protections. They built up infrastructure in Apache Spark to run through, aggregate and finalize all results, which are available on the US Census homepage.
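The privacy-versus-utility balance described above can be illustrated with a small sketch. This is not the Census Bureau's production algorithm; it assumes the textbook Laplace mechanism applied to a single count with sensitivity 1, and the function names and the example count of 1000 are hypothetical. It is also not hardened for production use (for example, it ignores floating-point issues).

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return the count plus Laplace noise with scale sensitivity / epsilon."""
    scale = sensitivity / epsilon
    return true_count + laplace_noise(scale, rng)


rng = random.Random(0)  # fixed seed only so the sketch is repeatable
for epsilon in (0.1, 1.0, 10.0):
    samples = [noisy_count(1000, epsilon, rng) for _ in range(5)]
    print(f"epsilon={epsilon}: {[round(s, 1) for s in samples]}")
```

Smaller values of epsilon give a stronger privacy guarantee but spread the released counts further from the true value; larger values do the opposite. Choosing the parameter for each statistic is the kind of tuning the paragraph above describes.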
