
Data Science & Ethics

Lecture 5

Data Preprocessing and Modeling: Privacy

Prof. David Martens


david.martens@uantwerp.be
www.applieddatamining.com
@ApplDataMining
AI Ethics in the News
Differential Privacy
▪ Cynthia Dwork et al. (2006)
▪ Data Analysis/Modeling (centralized)
Data Preprocessing (local)
▪ How to analyse data while preserving privacy?
Census data, surveys, etc.
▪ Goal: allow social scientists to share useful statistics about
sensitive datasets.
• How many people in Belgium have HIV?
• How many students have financial aid?

Differential Privacy
▪ What’s the issue again?
▪ Reporting on data from the students
• Study 1: “In March 2024, there were 86 students taking the DSE
course, 10 of which were on financial aid”
➔ No personal information revealed
• Study 2: “In April 2024, there were 85 students taking the
DSE course, 9 of which were on financial aid”
➔ No personal information revealed

• Combined with background knowledge: troublesome!


➢ Student Sam knows that student Tim dropped out of the class in
March, so Sam now knows that Tim was on financial aid.
▪ Similarly with census data, reporting covid cases per zip code,
etc.
Example inspired by Wood et al. (2018)
Differential Privacy
▪ Differential privacy: a property of an algorithm (formal definition below):

▪ Note that for small ε, e^ε ≈ 1 + ε

▪ Whether you are in the dataset on covid numbers / financial aid
or not has little impact on the reported numbers
[and hence on your privacy]
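The formal condition referenced on this slide (the formula itself did not survive extraction; this is the standard ε-differential privacy definition of Dwork et al., 2006): an algorithm M is ε-differentially private if, for all datasets D and D′ differing in a single record, and for every set of outcomes S,

    \Pr[\, M(D) \in S \,] \;\le\; e^{\varepsilon} \cdot \Pr[\, M(D') \in S \,]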
Differential Privacy
▪ Differential privacy: a property of an algorithm:

• ε: depends on the algorithm and requirements


➢ Stronger privacy for smaller ε

Strong mathematical definition of privacy
of an algorithm (Hsu et al., 2014)
Differential Privacy
▪ Two assumptions (for now)
1. Single count query (“how many”)
2. Trusted curator (don’t trust the outside observer)

[Diagram: dataset D → Algorithm → Result, released by the trusted Curator to the untrusted Outside Observer]
Differential Privacy
▪ Some examples of analysis with sensitive data
▪ Participating in a survey on our class, where you would answer negatively
Prob(your exam will be more difficult after survey without your data) = 1%
e^ε ≈ 1 + ε for small ε; with ε = 0.01:
Prob(your exam will be more difficult after survey with your data) = 1.01%

▪ Participating in a medical study looking at cancer and smoking, where you
might fear the insurance premium goes up as you are a smoker
Prob(higher insurance premium after study without your data) = 2%
e^ε ≈ 1 + ε for small ε; with ε = 0.01:
Prob(higher insurance premium after study with your data) = 2.02%

▪ This does not mean the premium will go up or the exam will be more difficult:
just that any increase in risk due to your data is limited.

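Spelling out the arithmetic behind these numbers (a worked application of the ε-differential privacy bound; the 1% and 2% baselines are the slide's illustrative figures):

    \Pr[\text{bad outcome with your data}] \;\le\; e^{\varepsilon} \cdot \Pr[\text{bad outcome without your data}]
    e^{0.01} \approx 1.01: \qquad 1\% \times 1.01 \approx 1.01\%, \qquad 2\% \times 1.01 \approx 2.02\%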
Privacy loss parameter ε
▪ Privacy loss parameter ε
▪ Smaller means more privacy (but less accuracy)
▪ If ε = 0
• P(M(D)) = P(M(D’))
• Total privacy, only noise

▪ Rule of thumb
• ε between 0.001 and 1
• Proper ε depends on dataset size and Prob(M(D)). Intuitively: the larger the dataset,
the less the impact of a single data instance.
• “almost no utility is expected from datasets containing 1/ε or fewer records.”
(Nissim et al., 2018)
➢ ε = 0.001 ➔ requires datasets of at least 1000 records
➢ ε = 0.01 ➔ requires datasets of at least 100 records

Nissim et al. (2018); Wood et al. (2018)


Differential Privacy
▪ Definition: an algorithm is differentially private if:
• The outcome will remain “largely” the same whether you participate in the dataset
or not
• It gives roughly the same privacy to X whether X is in the data or not

▪ How to make a counting query ε-differentially private?


• Add Laplace noise

Laplace noise is random noise drawn from a Laplace distribution, often used in
differential privacy and statistical modeling. In differential privacy, Laplace noise
is added to the result of a function to obscure the contributions of individual data
points, protecting sensitive information while still allowing meaningful analysis of
the dataset. The amount of noise is calibrated to the sensitivity of the function
(i.e., how much a single individual's data can change the output) and the privacy
parameter, so that the results remain statistically useful while safeguarding
individual privacy.

[Diagram: query “How many students are on financial aid in the course DSE?” → dataset D → Algorithm → Result]

“There were 85 students taking the DSE course,
approximately 10 of which were on financial aid.”
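A minimal sketch of this Laplace mechanism for a counting query (a counting query has sensitivity 1, so noise with scale 1/ε suffices; the function name and the example numbers are illustrative, not from the slides):

    import numpy as np

    def noisy_count(true_count: int, epsilon: float, rng=None) -> float:
        """Return an epsilon-differentially private answer to a counting query."""
        rng = np.random.default_rng() if rng is None else rng
        # A counting query has sensitivity 1: one person changes the count by at most 1,
        # so adding Laplace noise with scale 1/epsilon gives epsilon-differential privacy.
        return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

    # "How many students are on financial aid in the course DSE?"
    print(noisy_count(10, epsilon=0.5))   # e.g. 11.3 -> report "approximately 11"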
Differential Privacy
▪ Counting
• How many students disliked the course?
• Add Lap(1/ε) noise ➔ ε-differential privacy
• Run analysis twice: two different answers, but there exist accuracy bounds (given ε)

Revisit link with size of dataset:


• Small dataset of 10 records, and the simple
sum function. Adding the noise from the
distribution with ε = 0.01 will lead to very
inaccurate estimates of the sum, where the
noise accounts for most of the answer.
• For a very large dataset with millions of
records, the effect on the accuracy of the
answer is smaller.
• Differential privacy will hence require
increasing the minimal dataset size needed
to provide accurate results.

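A rough illustration of the dataset-size point above (the counts and ε are illustrative): with ε = 0.01 the Laplace noise has scale 1/ε = 100, which swamps a sum over 10 records but is negligible next to millions of records.

    import numpy as np

    rng = np.random.default_rng(0)
    epsilon = 0.01
    for true_count in (10, 5_000_000):
        noisy = true_count + rng.laplace(scale=1.0 / epsilon)
        relative_error = abs(noisy - true_count) / true_count
        print(true_count, round(noisy, 1), relative_error)   # tiny dataset: huge relative error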
Differential Privacy
▪ Reporting on data from the students
• Study 1: “In March 2024, there were 86 students taking the DSE course, 10 of
which were on financial aid”
➔ No personal information revealed
• Study 2: “In April 2024, there were 85 students taking the DSE course, 9 of
which were on financial aid”
➔ No personal information revealed
• Combined with background knowledge: troublesome!
➢ Student Sam knows that student Tim dropped out of the class in March, so Sam now
knows that Tim was on financial aid.

▪ Differential privacy
• Study 1: “In March 2024, there were 86 students taking the DSE course,
approximately 11 of which were on financial aid”
• Study 2: “In April 2024, there were 85 students taking the DSE course,
approximately 8 of which were on financial aid”

Example inspired by Nissim et al. (2018), Differential privacy: A primer for a non-technical audience
Differential Privacy
▪ Assumption 1: Single Count Query. Needed?
▪ What if we answer the same question over and over again?
▪ Compositional
• Every analysis has some privacy leakage, which accumulates elegantly over more analyses
• Diff. privacy still holds: for two studies with ε1 and ε2, the combination satisfies ε = ε1 + ε2
• Can create complex algorithms. Differentially private algorithms exist for linear
regression, clustering, classification, etc.

▪ ε as privacy budget (Nissim et al., 2018)


• How much privacy an analysis may use
• How much the risk to an individual’s privacy may increase
• More analyses implies less “budget” for each

Nissim et al. (2018) Differential privacy: A primer for a non-technical audience
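A minimal sketch of treating ε as a budget under sequential composition (the class and its interface are illustrative, not from the lecture or any specific library):

    import numpy as np

    class PrivacyBudget:
        """Track a total privacy budget epsilon across repeated noisy counting queries."""
        def __init__(self, total_epsilon: float):
            self.remaining = total_epsilon
            self.rng = np.random.default_rng()

        def noisy_count(self, true_count: int, epsilon: float) -> float:
            # Sequential composition: each analysis spends part of the overall budget.
            if epsilon > self.remaining:
                raise RuntimeError("privacy budget exhausted")
            self.remaining -= epsilon
            return true_count + self.rng.laplace(scale=1.0 / epsilon)

    budget = PrivacyBudget(total_epsilon=0.1)
    print(budget.noisy_count(10, epsilon=0.05))   # study 1
    print(budget.noisy_count(9, epsilon=0.05))    # study 2: budget now fully spent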


Differential Privacy
▪ Promises:
• No matter the attack, computing power or additional data: the outcome with or
without your data is similar
• No one can learn “much” about you because of your data, while analysis is still
allowed
• Guessing whether your data is in the dataset or not is not much better than a
random guess

▪ Does not promise:


• No secrets will be revealed (secrets can be revealed even without participating)
➢ A study finds that a professor in Antwerp makes X €
➢ Even if you did not participate: secret revealed
• Absolute privacy: there is always some privacy loss ε

Differential Privacy
▪ Properties
• Quantification of privacy loss
• Compositional: allows for complex differentially private techniques
• Immune to post-processing: the guarantee holds no matter the data, technology or
computing power
• Transparent: you can reveal which procedure and parameters you used (otherwise there is
uncertainty about, for example, the accuracy of the results)

• Gold standard of anonymization

Differential Privacy
▪ Assumption 2: trusted data curator
▪ Needed?
• Local (decentralized) differential privacy
➢ Don’t trust the curator ➔ Noise added before recording answer
Don’t trust the outside observers
• Centralized differential privacy
➢ Trust the data curator ➔ Noise added after recording answer
Don’t trust the outside observers

[Diagram: dataset D → Algorithm → Result, with noise added either before the answer reaches the Curator (local) or by the Curator before release to the Outside Observer (centralized)]
Differential Privacy
▪ Two types of differential privacy
▪ Randomized Response: local
• For example, I want to know how many students liked the course,
so I ask: “Did you like the class?”
➢ Flip a coin:
▪ if heads: write the real answer
▪ if tails: flip again; if heads: write the real answer, if tails: write the inverse
▪ 75% correct, but total deniability
• If we know 1/3 did not like the class
➢ What fraction of “did not like” answers do we expect in the randomized responses? 5/12
Why? If p is the fraction of true positive answers (here p = 2/3), we expect to observe a
positive answer
▪ 1 out of 4 times when it was flipped from a negative answer
▪ 3 out of 4 times when it is a real positive answer
➔ ¼ × (1 − p) + ¾ × p = 7/12 positive, hence 5/12 negative
➢ Averaged over large numbers this will be quite accurate, while preserving plausible
deniability
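A minimal sketch of this randomized-response scheme and the de-biasing step (sample sizes and variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def randomized_response(true_answer: bool) -> bool:
        # First coin: heads -> real answer; tails -> second coin decides
        # between the real answer (heads) and the inverse (tails).
        if rng.random() < 0.5:
            return true_answer
        return true_answer if rng.random() < 0.5 else not true_answer

    # Suppose 1/3 of 300 students truly disliked the class (True = "liked").
    true_answers = [True] * 200 + [False] * 100
    reported = np.array([randomized_response(a) for a in true_answers])

    observed_yes = reported.mean()            # expected: 1/4*(1-p) + 3/4*p = 7/12 for p = 2/3
    estimated_p = 2 * (observed_yes - 0.25)   # invert the formula to recover p
    print(1 - observed_yes)                   # fraction answering "No": about 5/12
    print(1 - estimated_p)                    # estimated true dislike rate: about 1/3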
Differential Privacy
▪ Two types of differential privacy
▪ Centralized vs decentralized?
• Choice based on risk of hacking, leaks and subpoenas.

Differential Privacy
▪ Differential Privacy (Cynthia Dwork et al. 2006)
• Advantages
➢ Even if linked with additional data, privacy still holds
➢ The learnt patterns do generalize
• Challenges
➢ Properties can still be inferred, with or without your data instance (e.g. Facebook traits or salary)
➢ Secrets can still be learnt from a large group in the dataset (e.g. Strava)

▪ Large use cases


• Google: usage statistics about Chrome malware (2014), decentralized
• Apple: usage statistics about iPhone (2016)
• US Census (2020), centralized

Data Preprocessing for privacy
▪ How to measure level of anonymity?
• K-anonymity of dataset: issues
• Differential privacy of algorithm: gold standard
▪ Several preprocessing methods to get privacy
• Grouping instances: Aggregation
• Grouping variable values: Discretisation and Generalisation
• Suppressing: replace certain values by *
• Adding noise
▪ Continuum!
[Figure: privacy ←→ utility continuum]
Most privacy → most utility: diff. privacy decentralized, diff. privacy centralized
(low ε → high ε), t-closeness (low t → high t), l-diversity (high l → low l),
k-anonymity (high k → low k), removing personal identifiers
Conclusion
▪ Differential privacy
▪ The dream of analysing data while preserving privacy
▪ Privacy as a matter of accumulated risk, not binary
▪ With a clear parameter ε that quantifies privacy loss: the
additional risk to an individual resulting from the use of their data.
▪ Always bounded, mathematically provable.

Differential Privacy
▪ Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith (2006)

2016 Test of Time Award, 2017 Gödel Prize

Presentation and Paper Ideas
▪ Differential privacy in action
(use by Apple, Google, LinkedIn, etc.)
▪ The trade-offs in value
▪ Practical implementation and examples of
k-anonymity/l-diversity/t-closeness

