Data Science Ethics - Lecture 5 - Privacy in Data Preprocessing and Modeling
Differential Privacy
▪ What’s the issue again?
▪ Reporting on data from the students
• Study 1: “In March 2024, there were 86 students taking the DSE course,
10 of whom were on financial aid”
➔ No personal information revealed
• Study 2: “In April 2024, there were 85 students taking the DSE course,
9 of whom were on financial aid”
➔ No personal information revealed
[Diagram: dataset D → Algorithm → Result, with the Curator and an Outside Observer (marked ×)]
Differential Privacy
▪ Some examples of analysis with sensitive data
▪ Participating in a survey about our class, where you would answer negatively
Prob(your exam will be more difficult after the survey without your data) = 1%
e^ε ≈ 1 + ε for small ε; with ε = 0.01:
Prob(your exam will be more difficult after the survey with your data) ≤ 1.01%
▪ This does not mean the exam will actually be more difficult: just that the increase in
risk from including your data is limited.
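A minimal sketch of the arithmetic behind this guarantee, assuming the ε and the baseline probability from the example above:

```python
import math

# Privacy loss parameter and baseline probability (values taken from the example above).
eps = 0.01
p_without = 0.01   # Prob(exam more difficult | survey run without your data) = 1%

# epsilon-differential privacy bounds the probability when your data is included:
#   P(with your data) <= exp(eps) * P(without your data)
p_with_bound = math.exp(eps) * p_without
print(f"exp(eps) = {math.exp(eps):.4f}  (≈ 1 + eps for small eps)")
print(f"Bound with your data: {p_with_bound:.4%}")  # ≈ 1.01%
```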
Privacy loss parameter ε
▪ Privacy loss parameter ε
▪ Smaller means more privacy (but less accuracy)
▪ If ε = 0
• P(M(D)) = P(M(D’))
• Total privacy, only noise
▪ Rule of thumb
• ε between 0.001 and 1
• Proper ε depends on dataset size and Prob(M(D)). Intuitively: the larger the dataset,
the less the impact of a single data instance.
• “almost no utility is expected from datasets containing 1/ε or fewer records.”
(Nissim et al., 2018)
➢ ε = 0.001 ➔ requires a dataset of at least 1000 records
➢ ε = 0.01 ➔ requires a dataset of at least 100 records
In the context of differential privacy, Laplace noise is added to the results of a function
to obscure the contributions of individual data points, helping to protect the privacy of
sensitive information while still allowing for meaningful analysis of the dataset. The
amount of noise added is typically calibrated based on the sensitivity of the function
(i.e., how much a single individual's data can change the output) and a privacy
parameter, ensuring that the results remain statistically valid while safeguarding
individual privacy.
[Diagram: query “How many students are on financial aid in the course DSE?”; dataset D
→ Algorithm → Result]
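A minimal sketch of the Laplace mechanism for such a counting query; the function name, the ε value and the example count are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    """Release a count with Laplace noise scaled to sensitivity / epsilon."""
    # A counting query has sensitivity 1: adding or removing one person
    # changes the count by at most 1.
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Hypothetical example: 10 students on financial aid, epsilon = 0.5
np.random.seed(0)
print(round(laplace_count(10, epsilon=0.5), 1))  # a noisy value near 10
```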
Differential Privacy
▪ Counting
• How many students disliked the course?
• Add Lap(1/ε) noise ➔ ε-differential privacy
• Running the analysis twice gives two different answers, but accuracy bounds exist (given ε)
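A short sketch of such an accuracy bound for Lap(1/ε) noise; the ε and confidence level below are assumed for illustration:

```python
import math

eps, delta = 0.5, 0.05   # assumed privacy parameter and failure probability
# For Laplace noise with scale b = 1/eps: P(|noise| > t) = exp(-t / b),
# so with probability 1 - delta the error is at most (1/eps) * ln(1/delta).
error_bound = (1 / eps) * math.log(1 / delta)
print(f"With probability {1 - delta:.0%}, the noisy count is within ±{error_bound:.1f} of the true count")
```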
Differential Privacy
▪ Reporting on data from the students
• Study 1: “In March 2024, there were 86 students taking the DSE course, 10 of
whom were on financial aid”
➔ No personal information revealed
• Study 2: “In April 2024, there were 85 students taking the DSE course, 9 of
whom were on financial aid”
➔ No personal information revealed
• Combined with background knowledge, this becomes troublesome!
➢ Student Sam knows that student Tim dropped out of the class between the two studies,
so Sam now knows that Tim was on financial aid.
▪ Differential privacy
• Study 1: “In March 2024, there were 86 students taking the DSE course,
approximately 11 of whom were on financial aid”
• Study 2: “In April 2024, there were 85 students taking the DSE course,
approximately 8 of whom were on financial aid”
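A small simulation sketch of why the noisy releases blunt this differencing attack; the ε and random seed are assumed, the counts are those from the example:

```python
import numpy as np

np.random.seed(1)
eps = 0.5
march_aid, april_aid = 10, 9   # true counts of students on financial aid

noisy_march = march_aid + np.random.laplace(scale=1 / eps)
noisy_april = april_aid + np.random.laplace(scale=1 / eps)

# Exact counts: the difference of 1 reveals that the student who left was on aid.
# Noisy counts: the difference is dominated by noise and no longer identifies Tim.
print(f"Exact difference: {march_aid - april_aid}")
print(f"Noisy difference: {noisy_march - noisy_april:.1f}")
```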
Differential Privacy
▪ Properties
• Quantification of privacy loss
• Compositional: allows complex differentially private techniques to be built
• Immune to post-processing: the guarantee holds no matter the additional data,
technology or computational power used
• Transparent: you can reveal which procedure and parameters you used (otherwise
there is uncertainty about, for example, the accuracy of the results)
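A minimal sketch of what compositionality means in practice; the per-query budgets are assumed for illustration:

```python
# Sequential composition: running several eps_i-differentially-private analyses
# on the same data is (sum of eps_i)-differentially private, so privacy loss adds up.
query_epsilons = [0.1, 0.1, 0.3]   # budgets spent by three hypothetical queries
total_epsilon = sum(query_epsilons)
print(f"Total privacy loss after composition: eps = {total_epsilon}")  # 0.5
```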
Differential Privacy
▪ Assumption 2: trusted data curator
▪ Needed?
• Local (decentralized) differential privacy
➢ Don’t trust the curator ➔ noise added before recording the answer
➢ Don’t trust the outside observers
• Centralized differential privacy
➢ Trust the data curator ➔ noise added after recording the answer
➢ Don’t trust the outside observers
[Diagram: dataset D → Algorithm → Result, with untrusted-party markers (×) on the Curator
and the Outside Observer]
Differential Privacy
▪ Two types of differential privacy
▪ Randomized Response: local
• For example, I want to know how many students liked the course,
so I ask: “Did you like the class?”
➢ Flip a coin:
▪ if heads: write the real answer
▪ if tails: flip again; if heads: write the real answer, if tails: write the inverse
▪ 75% of answers are correct, but every respondent has deniability
• If we know 1/3 did not like the class
➢ What fraction of negative answers shows up in the randomized response result? 5/12
Why? If p is the fraction of true positive answers, we expect to observe a positive answer
▪ 1 out of 4 times: when a negative answer has been flipped
▪ 3 out of 4 times: when it is a real positive answer
➔ ¼ × (1 − p) + ¾ × p; with p = 2/3 this gives 7/12 positive, hence 5/12 negative answers
➢ Averaged over large numbers the estimate will be quite accurate, while each respondent
keeps plausible deniability
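A small simulation sketch of this randomized response scheme; the sample size, seed and true proportion are assumed for illustration:

```python
import numpy as np

np.random.seed(42)
n = 10_000
true_like = np.random.rand(n) < 2 / 3      # 2/3 truly liked the class, 1/3 did not

first = np.random.rand(n) < 0.5            # first coin: heads -> report the truth
second = np.random.rand(n) < 0.5           # second coin: heads -> truth, tails -> inverse
reported = np.where(first, true_like, np.where(second, true_like, ~true_like))

# Each answer is truthful with probability 3/4, so observed_yes = 1/4 + p/2.
observed_yes = reported.mean()             # expected ≈ 7/12, i.e. 5/12 negative answers
estimated_p = 2 * (observed_yes - 0.25)    # unbiased estimate of the true "yes" rate
print(f"Observed 'yes' fraction:   {observed_yes:.3f} (expected ≈ {7/12:.3f})")
print(f"Estimated true 'yes' rate: {estimated_p:.3f} (true ≈ {2/3:.3f})")
```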
Differential Privacy
▪ Two types of differential privacy
▪ Centralized vs decentralized?
• Choice based on risk of hacking, leaks and subpoenas.
Differential Privacy
▪ Differential Privacy (Cynthia Dwork et al. 2006)
• Advantages
➢ Privacy is preserved even if the results are linked with additional data
➢ The learnt patterns do generalize
• Challenges
➢ Properties can still be inferred, with or without an individual's data instance
(e.g. Facebook traits or salary)
➢ Secrets about a large group in the dataset can still be learnt (e.g. Strava)
Data Preprocessing for privacy
▪ How to measure level of anonymity?
• K-anonymity of dataset: issues
• Differential privacy of algorithm: gold standard
▪ Several preprocessing methods to obtain privacy (see the sketch after this list)
• Grouping instances: Aggregation
• Grouping variable values: Discretisation and Generalisation
• Suppressing: replace certain values by *
• Adding noise
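A minimal sketch of generalisation and suppression on a toy table; the column names and values are illustrative assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "age":     [23, 27, 31, 45, 52],
    "zipcode": ["3001", "3010", "3000", "2600", "2610"],
    "aid":     [True, False, True, False, True],
})

# Generalisation / discretisation: replace exact ages with coarse age bands.
df["age"] = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=["<30", "30-50", "50+"])

# Suppression: replace the last digits of the zipcode by *.
df["zipcode"] = df["zipcode"].str[:2] + "**"

print(df)
```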
▪ Continuum!
[Continuum from most privacy (left) to most utility (right): decentralized differential
privacy → centralized differential privacy (low ε → high ε) → t-closeness (low t → high t)
→ l-diversity (high l → low l) → k-anonymity (high k → low k) → removing personal
identifiers]
Conclusion
▪ Differential privacy
▪ Dream of analysing data while enabling privacy
▪ Privacy as a matter of accumulated risk, not binary
▪ With a clear parameter ε that quantifies privacy loss: the
additional risk to an individual resulting from the use of their data.
▪ Always bounded, mathematically provable.
Differential Privacy
▪ Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith (2006)
Presentation and Paper Ideas
▪ Differential privacy in action
(use by Apple, Google, LinkedIn, etc.)
▪ The trade-offs in value
▪ Practical implementation and examples of
k-anonymity/l-diversity/t-closeness