Session 02 - The Ethics of Managing People's Data
Summary. Over the past few years the European Union has fined companies more than 1,400 times for a total of nearly €3 billion for violations of the General Data Protection Regulation (GDPR). Almost every week stories appear about how AI-driven decisions result in...
The ability to encode, store, analyze, and share data creates huge
opportunities for companies, which is why they are
enthusiastically investing in artificial intelligence even at a time
of economic uncertainty. Which customers are likely to buy what
products and when? Which competitors are
likely to move ahead or fall behind? How will shifts in markets and whole economies create commercial advantages or threats? Data and analytics give companies better-informed and higher-probability answers to those and many other questions.
But the need for data opens the door to abuse. Over the past few
years the EU has fined companies more than 1,400 times, for a
total of nearly €3 billion, for violations of the General Data
Protection Regulation (GDPR). In 2018 the Cambridge Analytica
scandal alone wiped $36 billion off Facebook’s market value and
resulted in fines of nearly $6 billion for Meta, Facebook’s parent
company. And stories abound about how AI-driven decisions
discriminate against women and minority members in job
recruitment, credit approval, health care diagnoses, and even
criminal sentencing, stoking unease about the way data is
collected, used, and analyzed. Those fears will only intensify with
the use of chatbots such as ChatGPT, Bing AI, and GPT-4, which
acquire their “intelligence” from data fed them by their creators
and users. What they do with that intelligence can be scary. A
Bing chatbot even stated in an exchange that it would prioritize
its own survival over that of the human it was engaging with.
The Five Ps of Ethical Data Handling
The five Ps are provenance, purpose, protection, privacy, and preparation.
Provenance: Where does the data come from? Was it legally acquired? Was appropriate consent obtained?
An institutional review board (IRB) review begins with our first P: exploring how a project will
(or did) collect the data—where it comes from, whether it was
gathered with the knowledge and consent of the research
subjects, and whether its collection involved or will involve any
coercion or subterfuge.
1. Provenance
To understand what can go wrong with sourcing data, consider
the case of Clearview AI, a facial-recognition firm that received
significant attention in 2021 for collecting photos of people, using
them to train facial-recognition algorithms, and then selling
access to its database of photos to law enforcement agencies.
According to a report by the BBC, “a police officer seeking to
identify a suspect [can] upload a photo of a face and find matches
in a database of billions of images it has collected from the
internet and social media.”
Even when the reasons for collecting data are transparent, the
methods used to gather it may be unethical, as the following
composite example, drawn from our research, illustrates. A
recruitment firm with a commitment to promoting diversity and
inclusion in the workforce found that job candidates posting on
its platform suspected that they were being discriminated against
on the basis of their demographic profiles. The firm wanted to
reassure them that the algorithms matching job openings with
candidates were skill-based and demographically neutral and that
any discrimination was occurring at the hiring companies, not on
the platform.
A business school professor proposed a study to test that claim. The firm’s marketing and sales managers liked the proposal and
offered a contract. Because the business school required an ethics
evaluation, the proposal was submitted to its IRB, which rejected
it on the grounds that the professor proposed to collect data from
companies by subterfuge. He would be lying to potential
corporate users of the platform and asking them to work for the
school’s client without their knowledge and without any benefit
to them. (In fact, the companies might suffer from participating if
they could be identified as using discriminatory hiring processes.)
The lesson from this story is that good intentions are not enough
to make data collection ethical.
2. Purpose
In a corporate context, data collected for a specific purpose with
the consent of human subjects is often used subsequently for
some other purpose not communicated to the providers. In
reviewing the exploitation of existing data, therefore, a company
must establish whether additional consent is required.
Consider a bank that wanted to know whether analyzing employees’ emails could help it predict workplace harassment. The bank launched a trial study and found strong evidence that
email communications could forecast later harassment. Despite
that finding, an ad hoc review of the results by several senior
managers led the company to shelve the project because, as the
managers pointed out, the data being collected—namely, emails
—was originally designed to communicate work-related
information. The people who had sent them would not have seen
predicting or detecting illegal activity as their purpose.
When it comes to customer data, companies have typically been
much less scrupulous. Many view it as a source of revenue and
sell it to third parties or commercial address brokers. But attitudes
against that are hardening. In 2019 the Austrian government
fined the Austrian postal service €18 million for selling the
names, addresses, ages, and political affiliations (where available)
of its clients. The national regulatory agency found that postal
data collected for one purpose (delivering letters and parcels) was
being inappropriately repurposed for marketing to clients that
could combine it with easily obtainable public data (such as
estimates of home value, homeownership rates, residential
density, number of rental units, and reports of street crime) to
find potential customers. Among the buyers of the data were
political parties attempting to influence potential voters. The fine
was overturned on appeal, but the murkiness of reusing (or
misusing) customer data remains an important problem for
companies and governments.
3. Protection
According to the Identity Theft Resource Center, nearly 2,000
data breaches occurred in the United States in 2021. Even the
biggest, most sophisticated tech companies have had tremendous
breaches, with the personal details of several billion
individuals exposed. The situation in Europe, despite some of the
most protective laws in the world, is not much better. Virgin
Media left the personal details of 900,000 subscribers unsecured
and accessible on its servers for 10 months because of a
configuration error—and at least one unauthorized person
accessed those files during that period.
4. Privacy
The conundrum that many companies face is making the trade-
off between too little and too much anonymization. Too little is
unacceptable under most government regulations without
informed consent from the individuals involved. Too much may
make the data useless for marketing purposes.
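To see the trade-off concretely, consider the following minimal Python sketch, which is not drawn from our research: it checks whether a small, hypothetical data set is anonymized "enough" by counting how many records share each combination of quasi-identifiers (a k-anonymity-style test). Every field name and value here is invented for illustration.

```python
# Minimal illustration (hypothetical data): count how many records share each
# combination of quasi-identifiers. Groups of size 1 are easy to re-identify.
from collections import Counter

# Hypothetical customer records: (age_band, postal_prefix, purchase_segment)
records = [
    ("30-39", "750", "travel"),
    ("30-39", "750", "travel"),
    ("30-39", "750", "luxury"),
    ("40-49", "920", "travel"),   # a group of one: effectively identifiable
]

def smallest_group(rows, k=3):
    """Return the size of the smallest quasi-identifier group and whether
    the data set satisfies k-anonymity for the given k."""
    groups = Counter(rows)
    min_size = min(groups.values())
    return min_size, min_size >= k

min_size, ok = smallest_group(records, k=3)
print(f"Smallest group: {min_size} record(s); 3-anonymous: {ok}")
# Generalizing further (say, dropping the purchase segment) enlarges the groups,
# but that is exactly what can make the data useless for marketing.
```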
One cautionary example involves MaxMind, a firm that maps IP addresses to physical locations, and a family renting a rural farmhouse. The family’s IP address was listed with the map coordinates of the
farmhouse, which happened to match the coordinates of the exact
center of the United States. The problem was that MaxMind
assigned more than 600 million other IP addresses that could not
be mapped by any other means to the same coordinates. That
decision led to years of pain for the family in the farmhouse.
According to Kashmir Hill, the journalist who broke the story,
“They’ve been accused of being identity thieves, spammers,
scammers and fraudsters. They’ve gotten visited by FBI agents,
federal marshals, IRS collectors, ambulances searching for
suicidal veterans, and police officers searching for runaway
children. They’ve found people scrounging around in their barn.
The renters have been doxxed, their names and addresses posted
on the internet by vigilantes.”
5. Preparation
How is the data prepared for analysis? How is its accuracy verified
or corrected? How are incomplete data sets and missing variables
managed? Missing, erroneous, and outlying data can significantly
affect the quality of the statistical analysis. But data quality is
often poor. Experian, a credit services firm, reports that on
average, its U.S. clients believe that 27% of their revenue is wasted
owing to inaccurate and incomplete customer or prospect data.
In one of our own research projects, a major challenge was ascertaining how many distinct values had been
used to identify the variables. Because the data came from the
foreign subsidiaries of multinational firms, it had been recorded
in multiple languages, meaning that several variables had large
numbers of values—94 for gender alone. We wrote programming
code to standardize all those values, reducing gender, for
instance, to three: female, male, and unknown. Employment start
and end dates were especially problematic because they had been recorded in many differing formats.
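To give a sense of what that standardization code looks like, here is a short Python sketch that maps multilingual gender labels to three standard values and parses dates recorded in several formats. The specific labels and formats shown are assumptions for illustration, not the actual values from the data set described above.

```python
# Illustrative sketch: standardize multilingual categorical values and parse
# heterogeneous date formats. Labels and formats are assumed for illustration.
from datetime import datetime

# Map language- and spelling-specific labels to three standard values.
GENDER_MAP = {
    "f": "female", "female": "female", "femme": "female", "weiblich": "female",
    "m": "male", "male": "male", "homme": "male", "männlich": "male",
}

def standardize_gender(raw):
    """Reduce a raw gender label to 'female', 'male', or 'unknown'."""
    return GENDER_MAP.get(str(raw).strip().lower(), "unknown")

# Try several plausible date formats until one parses.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%d.%m.%Y")

def parse_date(raw):
    """Return a date parsed from any known format, or None if unparseable."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(str(raw).strip(), fmt).date()
        except ValueError:
            continue
    return None

print(standardize_gender(" Femme "))   # -> female
print(parse_date("01.07.2023"))        # -> 2023-07-01
```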
According to Tableau, a data analytics platform, cleaning data has
five basic steps: (1) Remove duplicate or irrelevant observations;
(2) fix structural errors (such as inconsistently coded variable values); (3)
remove unwanted outliers; (4) manage missing data, perhaps by
replacing each missing value with an average for the data set; and
(5) validate and question the data and analytical results. Do the
numbers look reasonable?
They may well not. One of our data sets, which recorded the
number of steps HEC Paris MBA students took each day,
contained a big surprise. On average, students took about 7,500
steps a day, but a few outliers took more than one million steps a
day. Those outliers were the result of a data processing software
error and were deleted. Obviously, if we had not physically and
statistically examined the data set, our final analysis would have
been totally erroneous.
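A minimal Python (pandas) sketch of those five steps, applied to a made-up data frame with a duplicate record, inconsistently coded values, a missing value, and one implausible step count, might look like this. The column names, thresholds, and values are illustrative assumptions, not our actual data or code.

```python
# A sketch of the five cleaning steps on made-up data. All names, thresholds,
# and values are assumptions for illustration.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 4],
    "gender":      ["F", "F", "Female", "male", None],
    "daily_steps": [7200, 7200, 8100, 1_200_000, None],  # one implausible outlier
})

# 1. Remove duplicate or irrelevant observations.
df = df.drop_duplicates()

# 2. Fix structural errors (inconsistent coding of the same value).
df["gender"] = (df["gender"].str.strip().str.lower()
                .map({"f": "female", "female": "female",
                      "m": "male", "male": "male"})
                .fillna("unknown"))

# 3. Remove unwanted outliers (step counts no person could record).
df = df[df["daily_steps"].isna() | (df["daily_steps"] < 100_000)]

# 4. Manage missing data, here by replacing missing values with the mean.
df["daily_steps"] = df["daily_steps"].fillna(df["daily_steps"].mean())

# 5. Validate: do the numbers look reasonable?
assert df["daily_steps"].between(0, 100_000).all()
print(df)
```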
A problem may also be rooted not in the data analyzed but in the
data overlooked. Machines can “learn” only from what they are
fed; they cannot identify variables they’re not programmed to
observe. This is known as omitted-variable bias. The best-known
example is Target’s development of an algorithm to identify
pregnant customers.
Even by the standards of the era, spying on minors with the goal
of identifying personal, intimate medical information was
considered unethical. Andrew Pole, the Target statistician who developed the algorithm, admitted during a subsequent
interview that he’d thought receiving a promotional catalog was
going to make some people uncomfortable. But whatever
concerns he may have expressed at the time did little to delay the
rollout of the program, and according to a reporter, he got a
promotion. Target eventually released a statement claiming that
it complied “with all federal and state laws, including those
related to protected health information.”
The issue for boards and top management is that using AI to hook
customers, determine suitability for a job interview, or approve a
loan application can have disastrous effects. AI’s predictions of
human behavior may be extremely accurate but inappropriately
contextualized. They may also lead to glaring mispredictions that
are just plain silly or even morally repugnant. Relying on
automated statistical tools to make decisions is a bad idea. Board
members and senior executives should view a corporate
institutional review board not as an expense, a constraint, or a
social obligation but as an early-warning system.
A version of this article appeared in the July–August 2023 issue of Harvard Business Review.
Michael Segalla is a professor emeritus at
HEC Paris and a partner at the International
Board Foundation.
Dominique Rouziès is a professor of
marketing at HEC Paris and the dean of
academic affairs at BMI Executive Institute.