
Scientific Accuracy


Pattern recognition and forensic identification:
The presumption of scientific accuracy and other falsehoods
IR Coyle, D Field and P Wenderoth*
Decision-making in forensic contexts where patterns (such as fingerprints)
are compared involves processes of perception and cognition which are
notoriously fallible in many circumstances. The known or potential rate of
error in those scientific methods of forensic identification which have long
been accepted by the courts is often higher than would generally be
perceived, despite the presumption of accuracy of such techniques. In this
article, the authors argue that errors arising from perceptual and cognitive
errors in such forensic identification evidence are overwhelmingly due to the
misuse and profound lack of understanding of basic epistemological and
statistical principles. To avoid miscarriages of justice, these principles need to
be understood and safeguards employed so that the legal process is not
contaminated by pseudoscience.

INTRODUCTION
Few things can be more damning to the prospects of a defendant than the unqualified pronouncement
of an authoritative expert that there is forensic evidence directly linking the accused to a crime scene.
Before the First Fleet sailed into Sydney Harbour, Lord Mansfield issued the following warning, the
basic thrust of which is still apposite today:
The fact that an expert witness has impressive scientific qualifications does not by that fact alone make
his opinion on matters of human nature and behaviour within the limits of normality any more helpful
than that of the jurors themselves. But there is a danger that they may think it does.1

Forensic evidence comes in many forms. To name but a few tools of trade of forensic scientists,
this may involve comparison of latent fingerprints found at the crime scene with exemplar prints either
obtained pursuant to a forensic order or extant in some database; comparison of bite marks on a victim
with orthodontic analysis; DNA evidence; hair and fibre matching; or other emerging techniques based
on the anthropometric or biomedical characteristics of humans.
Whatever their genesis, all of these techniques have one thing in common: the potential for
human error. The decision that evidence found at a crime scene can be matched to a particular suspect
is made by a human. While mathematical algorithms processed with computational power almost
beyond comprehension may reduce to a manageable number the comparisons that need to be made,
the ultimate decision is always made by a human.
This startlingly simple observation has far-reaching consequences that have not been fully
appreciated or accommodated by the legal system. Decision-making, whether in a forensic context or
otherwise, is as much a function of the processes of the human mind as of the technology on which
such decisions are founded. It is as meaningless to try to isolate the two as it is to try to determine
what caused a motor vehicle accident by only considering the vehicles involved and the traffic
conditions, whilst ignoring the decisions made by the drivers.

* Ian Coyle: Visiting Professorial Fellow, Forensic Psychologist, Forensic Ergonomist and Forensic Psychopharmacologist, Bond
University Centre for Forensic Excellence; Principal Consultant Safetysearch Forensic Consultants, Gold Coast, Queensland.
David Field: Associate Professor of Law, Director, Bond University Centre for Forensic Excellence. Peter Wenderoth: Professor
of Psychology, Macquarie University. The authors would like to thank Professor Don Thomson and the Honourable Tim
Carmody for their helpful comments on the manuscript.
1 Folkes v Chadd (1782) 99 ER 589.

(2009) 33 Crim LJ 214

Decision-making in forensic contexts, in which patterns of various types are compared, involves
processes of perception and cognition. Perception and cognition are notoriously fallible in many
circumstances. This has long been recognised with respect to the identification of an accused person as
the perpetrator of a crime by way of eyewitness testimony, and legal safeguards have evolved to limit
the consequences of error when such evidence is adduced, albeit that these safeguards are often
useless.2
It is therefore curious that such safeguards are, in the main, conspicuously absent when
forensic experts give evidence of their perception that patterns they have observed in evidence (whether they be
fingerprints, bite marks, or DNA) match the characteristics of a suspect. This appears to be due in
large measure to the misuse and profound lack of understanding of basic epistemological and
statistical principles not only among lawyers, but also among scientists and other forensic experts.
This has led, inter alia, to the legal doctrine of the presumption of scientific accuracy.
When a scientific instrument may be said to belong to:
[a] class of instruments of a scientific or technical character, which by general experience [are] known
to be trustworthy, and are so notorious that the court requires no evidence to the effect that they do fall
into such class, before allowing the presumption in question to operate with regard to readings made
thereon,3

a court will, at common law, be entitled to take what is called judicial notice of its reliability. This
means that the results or readings which are derived from such an instrument, may be relied on in
evidence when this is relevant to the outcome of a case.
This doctrine is founded on the notion that some tests are so notorious in their accuracy that it
would require statistical improbability on a vast scale for them to be wrong in any particular case.
Unfortunately, the same presumption of accuracy seems to have become applied to so-called scientific
testing procedures which rely not upon the use of instruments, but the bare application of human
judgment, albeit skilled and experienced judgment.
Whatever may be presumed in law, errors in decision-making in tests undoubtedly do occur.
These errors are common in the area of forensic identification, and are so ubiquitous as to require a
fundamental reassessment of the way such evidence is received by the courts. A convenient place to
start when considering this proposition is with biometric identification since this has the longest
history.

BIOMETRIC IDENTIFICATION

Biometric measurement goes back to the 19th century, when Alphonse Bertillon categorised a series of
measurements of the human body (forearm length, hand width etc). These were used to describe an
individual. Literally tens of thousands of such measurements were collected in England, America and
France and used to obtain convictions, typically of habitual criminals. Despite misgivings, the most
significant of which was the report of the Royal Commission in 1898 in England, the system
continued to be used. Then there occurred the case of Mr Will West.
In 1903, Mr West was incarcerated in Leavenworth prison, Kansas, in the United States. His
Bertillon measurements were taken and were identical, based on 15 matching points of comparison, to
another inmate who had been admitted to Leavenworth two years earlier and was still there. This was
the genesis of the requirement to have 16 matches or points of comparison that migrated, through a
process of unscientific osmosis, to the system of fingerprint matching.
Fingerprint examination replaced the Bertillon system during the early part of the 20th century.
Since then, the criteria for obtaining a match have varied. In most of Europe, a fingerprint match
requires 16 points; in Greece it is 10; and in Turkey eight points are required. In the United States and
Australia, no specific criteria are used: the analyst simply forms an opinion that a latent print and an
exemplar print (ie one taken from the suspect) match.
There is no empirical or statistical basis for these thresholds: none whatsoever. As Thompson and
Cole noted:
Latent Print Examiners (LPEs) have no scientific basis to estimate the probability of a random match
between two impressions, and they present no statistics in connection with their testimony. If they find
sufficient consistent detail they simply declare a positive identification or individualization, claiming the
potential donor for the mark has been reduced to one and only one area of friction ridge skin in the
world to the exclusion of all other friction ridge skin in the world.4

2 Coyle IR, Field D and Miller G, "The Blindness of the Eye-witness" (2008) 82 ALJ 471.
3 Porter v Kolodzeij [1962] VR 75 at 78.

And so LPEs in Australia and North America simply state that the latent print matched or did not
match the exemplar print and ignore the vexatious issue that they might not be correct 100% of the
time when making such judgments. LPEs routinely assert that latent and exemplar prints can be
matched by properly trained examiners with no realistic chance of incorrect matching (ie making a
false positive error). This is a comforting thought for an accused. It is also wrong.
Errors in fingerprint analysis have been known since the 1920s. Typically these have been
shrugged off as being due to poor training, poor supervision or difficulties in matching poor quality
latent impressions with exemplars in a database. The cases of Brandon Mayfield in the United States
and Shirley McKie in Scotland have conclusively demonstrated that these arguments will not fly.5 In
both of these cases, the most experienced LPEs in North America and the United Kingdom
conclusively and comprehensively identified the wrong person.
What processes of decision-making are involved when a forensic scientist compares latent and
exemplar evidence of whatever type? The short answer is that no one knows. Because forensic
scientists making decisions as to the similarity or otherwise of such evidence produce almost no
documentation, it is very difficult (if not impossible) to determine, post facto, what led them to their
conclusion.
It seems clear from the fact that there is no standard in North America and Australia vis-à-vis the
number of indicia that must match before a positive identification is called that LPEs must consider
the overall pattern of the latent print as well as individual minutiae, but how this is done is not clear.
Some clues can be gleaned from computer algorithms based on multivariate statistical techniques such
as factor analysis and principal component analysis,6 but we have no idea if humans adopt a
decision-making heuristic based on these or similar approaches in forensic pattern matching. It is
entirely possible that the ultimate decision-making heuristic can be reduced to "X looks like Y".
It is arguable that forensic scientists who make a decision based on such a simplistic
decision-making heuristic, without any other explicitly reasoned argument, have made their decision on the
basis of degrees of consistency between the crime scene and exemplar evidence. The concept of
degrees of consistency is inherent in verbal descriptors such as "unable to exclude", "matches to a
reasonable degree of medical certainty" etc, typically employed by expert witnesses to describe the
congruence between latent and exemplar forensic identification evidence.
In the context of eyewitness identification evidence, the High Court has explicitly rejected the
notion of degrees of consistency. In Martin v Osborne, Dixon J, as he then was, observed:
If an issue is to be proved by circumstantial evidence, facts subsidiary to or connected with the main
fact must be established from which the conclusion follows as a rational inference. In the inculpation of
an accused person the evidentiary circumstances must bear no other explanation.7

In Plomp v The Queen, Dixon CJ, citing this observation, acknowledged the difficulty in stating
this rule, which he opined has not been overcome by employing the expression "more consistent", as
if there were degrees of consistency.8 This line of reasoning was affirmed in Pitkin v The Queen,
where Deane J (Toohey and McHugh JJ concurring), observed:
There are not, as Dixon CJ observed, degrees of consistency and, if a reasonable jury ought to have
found that an inference or hypothesis consistent with innocence was open on the evidence, then it ought
to have given the appellant the benefit of the doubt necessarily created by that circumstance.9

4 Thompson WC and Cole SA, "Psychological Aspects of Forensic Identification Evidence" in Costanzo M, Krauss D and
Pezdek K (eds), Expert Psychological Testimony for the Courts (Routledge, New York, 2006) p 38.
5 Thompson WC and Cole SA, "Lessons from the Brandon Mayfield Case" (2005) 29 The Champion 32.
6 Jolliffe IT, Principal Component Analysis (Springer, New York, 1986).
7 Martin v Osborne (1936) 55 CLR 367 at 375.
8 Plomp v The Queen (1963) 110 CLR 234 at 243.

This case involved the eyewitness identification of the defendant on the basis that a witness, having
observed the theft of a lady's bag, at a subsequent photo-identification selected a photograph and
stated that: "This looks like the person that I seen take the lady's handbag." In quashing the original
verdict, their Honours observed:
Under our system of administering criminal justice, a person is not to be convicted of a serious crime on
the sole basis of a verbal ambiguity.10

To reiterate, much forensic identification evidence deals with degrees of consistency, albeit not
expressed in these terms, and ambiguities are inherent in many of the verbal descriptors used to anchor
forensic identification evidence. Equally, it is the authors' view that scientific ambiguity of a high
order is a prominent feature of forensic identification evidence. These ambiguities arise from different
sources.

SOURCES AND TYPES OF ERROR IN FORENSIC IDENTIFICATION

Potential sources of errors in declaring that items have a common source include fraud, incompetence,
instrumentation and technological errors11 (although these are dependent ultimately on human errors)
and fundamental methodological errors that are inherent in the field in question.12 Although fraud,
egregious incompetence and technological errors are of enormous concern in forensic evidence,13
these issues are not considered here. Rather, this article is concerned with fundamental
epistemological issues arising from psychological processes inextricably linked to the function of the
mind that are relevant to all types of forensic identification evidence.
There are two types of errors an individual can make in arriving at a decision, whether this
involves forensic identification or any other sort of decision: a false positive and a false negative.
Within the context of this article, a false positive error (also referred to as a Type 1 error or false
alarm) involves the incorrect acceptance of a hypothesis that two objects match. A false negative error
(also referred to as a Type 2 error or miss) involves the incorrect rejection of a hypothesis that two
objects match.
These types of errors, and their associated error rates, are related. They also have different
consequences in different fields. For example, a biometric scanning system based on iris scans that had
a high false negative rate would be unacceptable as a screening system for admission to an
unremarkable commercial building, even if it had a zero false positive error rate, since many
customers would be denied access. However, the same system might be considered appropriate for a
sensitive military installation.
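
The two error types, and the trade-off between them, can be made concrete with a small sketch. The similarity scores, ground-truth labels, thresholds and function names below are invented purely for illustration; they are not drawn from any study cited in this article.

```python
from dataclasses import dataclass


@dataclass
class DecisionCounts:
    true_positives: int = 0    # declared a match; items really share a source
    false_positives: int = 0   # declared a match; items do not share a source (Type 1)
    true_negatives: int = 0    # declared no match; items do not share a source
    false_negatives: int = 0   # declared no match; items really share a source (Type 2)


def tally(trials, threshold):
    """Count the four possible outcomes when a match is declared at or above `threshold`."""
    counts = DecisionCounts()
    for score, same_source in trials:
        declared_match = score >= threshold
        if declared_match and same_source:
            counts.true_positives += 1
        elif declared_match:
            counts.false_positives += 1
        elif same_source:
            counts.false_negatives += 1
        else:
            counts.true_negatives += 1
    return counts


def false_positive_rate(c):
    return c.false_positives / (c.false_positives + c.true_negatives)


def false_negative_rate(c):
    return c.false_negatives / (c.false_negatives + c.true_positives)


# Hypothetical (similarity score, truly same source?) pairs.
TRIALS = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True), (0.60, True),
    (0.55, False), (0.40, False), (0.30, True), (0.20, False), (0.10, False),
]

for threshold in (0.5, 0.85):
    c = tally(TRIALS, threshold)
    print(threshold, false_positive_rate(c), false_negative_rate(c))
```

With these invented figures, raising the threshold removes the false positives but increases the number of misses, which is the relationship between the two error rates described above.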
A sophisticated and pervasive methodology, Receiver Operating Characteristic (ROC)14 Analysis,
has been developed from the simple premise that there are four fundamental decisions that can be
made in matching or identifying objects: false positive/true positive and false negative/true negative.15
ROC analysis enables the accuracy of decision-making to be determined, as well as the effects of
response bias on accuracy, in a plethora of fields. In the context of forensic identification evidence,
accuracy, defined as the area under the ROC curve, measures the observer's ability to correctly match
field and exemplar evidence.

9 Pitkin v The Queen (1995) 80 A Crim R 302 at 306; 69 ALJR 612.
10 Pitkin v The Queen (1995) 80 A Crim R 302 at 306; 69 ALJR 612.
11 An example here would be the failure of a facial recognition algorithm to provide a matching facial image from a database to
enable comparison with a suspect.
12 Dror IE and Charlton D, "Why Experts Make Errors" (2006) 56(4) Journal of Forensic Identification 600; Coyle IR, Field D
and Starmer G, "An Inconvenient Truth: Errors in Breath Alcohol Analysis Arising from Statistical Uncertainty" (2009) 41(2)
Australian Journal of Forensic Sciences 1.
13 Field D, Coyle IR, Starmer GA, Miller G and Wilson P, "Trust Me I'm an Expert" (2009) 41(2) Australian Journal of
Forensic Sciences 113.
14 Developed in the physical sciences, this technique was originally called Signal Detection Theory (SDT), which it still is, but
ROC is also used interchangeably with SDT. To avoid confusion, the term ROC is used throughout this article.
15 Green DM and Swets JA, Signal Detection Theory and Psychophysics (John Wiley & Sons, New York, 1966).

The ROC curve combines the concepts of sensitivity (the true positive fraction) and specificity
(the true negative fraction) into a single measure of accuracy. The false positive fraction is the
complement of specificity (1 - specificity). The area under the curve (AUC) is defined as the diagnostic
accuracy of a test or methodology. It ranges from 0 to 1.0 (perfect); an area of 0.5 indicates that the
observers are guessing. According to Swets,16 AUC values above 0.9 indicate "high accuracy", values of 0.7-0.9
indicate "useful for some purposes", and values of 0.5-0.7 indicate "poor accuracy".17 ROC analysis is
routinely used to determine drug safety/efficacy, detection of threats in a military environment,
fundamental studies in perception, and assessment of the utility of psychometric tests. Relatively
recently it has begun to be applied to forensic identification.18
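
As a rough illustration of how an AUC figure is obtained, the sketch below uses the rank-based formulation, under which the area under the ROC curve equals the probability that a randomly chosen true match receives a higher score than a randomly chosen non-match (ties counting as one half). The scores and function names are hypothetical, and the label returned for values below 0.5 is an assumption added for completeness; it is not one of the bands attributed to Swets above.

```python
def area_under_curve(match_scores, nonmatch_scores):
    """AUC as the fraction of (match, non-match) pairs ranked in the right order."""
    wins = 0.0
    for m in match_scores:
        for n in nonmatch_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5   # a tie is no better than a coin toss
    return wins / (len(match_scores) * len(nonmatch_scores))


def swets_band(auc):
    """Verbal bands for AUC as summarised in the text."""
    if auc > 0.9:
        return "high accuracy"
    if auc >= 0.7:
        return "useful for some purposes"
    if auc >= 0.5:
        return "poor accuracy"
    return "worse than chance"  # assumption: not a band named in the text


# Hypothetical examiner scores for pairs that truly match and pairs that do not.
MATCH_SCORES = [0.95, 0.90, 0.75, 0.60, 0.30]
NONMATCH_SCORES = [0.80, 0.55, 0.40, 0.20, 0.10]

auc = area_under_curve(MATCH_SCORES, NONMATCH_SCORES)
print(auc, swets_band(auc))
```

Here 20 of the 25 possible pairs are ranked correctly, so the invented examiner falls in the middle band.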
While it is inherent in ROC analysis that accuracy, sensitivity and specificity are interrelated, from
a legal perspective, the probability of a false positive is of the utmost concern since it can result in an
innocent party being found guilty. Of course, focusing on this type of error leaves open the prospect
that more false negative errors will occur, with the result that some villains may escape justice, but it
is axiomatic in our legal system that this is as it should be.
Bearing this in mind, it is obvious that the probative value of any forensic procedure which
purports to be able to identify a suspect by matching impressions found at a crime scene with
exemplar impressions is restricted by the prejudicial probability that a false positive has occurred. Or
rather, it should be obvious, but scientists involved in forensic identification have largely ignored the
effects of false positives on the probative/prejudicial value of their evidence. Even worse, in many
cases, they blithely refuse to accept that false positives exist at all in their specific domain.
And even if they do acknowledge this inconvenient truth, many forensic scientists produce statistical
arguments which purport to render this problem insignificant. These arguments are specious.

USING STATISTICS LIKE A DRUNKARD USES A LAMP POST; FOR SUPPORT NOT ILLUMINATION


Curiously, expert LPE witnesses are banned from using probabilities in their testimonies in many
jurisdictions. Thompson and Cole point out that:
[a] 1979 Resolution of the International Association for Identification, the main professional
organization for LPEs in North America, stated, "Any member, officer, or certified latent print examiner
who provides oral or written reports, or gives testimony of possible or probable, or likely friction ridge
identification shall be deemed to be engaged in conduct unbecoming such member, officer or certified
latent print examiner."19

Although this rule had its origins in noble intentions (being designed to encourage LPEs to only
give evidence when they are convinced of the accuracy of their conclusion) it has had an inimical
effect. Results are often expressed in terms that imply scientific certainty (ie a probability of 1.0 or
100%). This is indefensible, since it implies that it is possible to prove a hypothesis whereas, as a
matter of logic, it is only possible to disprove a null hypothesis.20 The problem does not end there.

16 Swets JA, "Measuring the Accuracy of Diagnostic Systems" (1988) 240 Science 1285.
17 Technically, an AUC of 0.5 cannot indicate a ROC curve since this is only a chance response and thus there is no response
curve per se. However, for practical purposes in the context of the arguments espoused herein this can be ignored and an AUC
of 0.5 may be considered equivalent to a diagnostic accuracy of 50%.
18 Phillips VL, Saks MJ and Peterson JL, "The Application of Signal Detection Theory to Decision-making in Forensic Science"
(2001) 46(2) Journal of Forensic Sciences 294.
19 Thompson and Cole, n 4, pp 45-46.
20 The null hypothesis is a hypothesis of no difference; it is formulated for the express purpose of being capable of rejection. If
rejected, the alternative hypothesis may be accepted. Suppose, eg that one wished to test the hypothesis that all humans born
with hands had five fingers. The only way this could be completely proven would be to observe every human on the planet.
However, one could disprove the opposite or null hypothesis, to a specific degree of certainty or probability, by observing a
sufficiently large sample of humans, which would then imply that the original hypothesis was correct. The degree of certainty in

While some forensic disciplines use qualitative assessments of certainty such as "source
attributable to reasonable medical certainty",21 practitioners in others assert that, when comparing two
items, they can identify characteristics or patterns of characteristics that are unique.
That is, they claim to have narrowed down the source of potential donors of the evidence found at a crime
scene to one, and only one, individual or object. This is often referred to as "individualisation". This is
so for fingerprint analysis, firearm and tool mark analysis,22 and barefoot morphology, the examination
of the impressions of the weight bearing areas of the human foot.23 The degrees of certainty are
indicated by verbal descriptors. As an example, the degrees of certainty, which are misleadingly
referred to as "confidence intervals" by practitioners of barefoot morphology, are as follows:
Insufficient Detail: when there is not enough detail or clarity.
Support: agreement or disagreement of details, such as the overall size, the location of the toe pads, but lack of sufficient quantity and/or clarity.
Strong Support: agreement or disagreement of all the detail, such as overall size, shape and location of the toe pads, contour of the metatarsal ridge, and the contour of the ball of the foot, but with a lack of sharp detail.
Did Make: agreement of all detail, such as the overall size, shape and location of the toe pads, contour of the metatarsal ridge and the contour of the ball of the foot, sharp edge detail, in combination with random accidental characteristics (damage to the foot, flexion creases etc).
Did Not Make: contains clear detail that shows without doubt that the impression was not made by the individual in question.
While there is a strong logical argument that the "Did Not Make" category has a scientific basis,
since one missing element, or an additional one such as polydactyly, can disprove a hypothesis, the
category "Did Make" is also absolute. This, for the reasons advanced earlier, cannot be supported from
a logical or statistical perspective.
To compound this error, the other qualitative categories are, in reports submitted to the court by
practitioners of this technique, set out in a continuous scale whereby the difference between each
category is indicated as being identical. That is, the qualitative categories are assumed to be scored
using an interval scale of measurement.24 This is mathematically impossible, since many of the
individual elements of the footprint impression that are aggregated, through some unexplained mental
process, are not measured on an interval scale. Rather, the crime scene impressions may be simply
referred to as "large", "curved" etc. It is mathematically impossible to sum these individual
components of the pattern, or indeed to perform the other mathematical operations required to result in
equal differences between the various categories used to define the "confidence intervals" that are
employed in this technique. Indeed, the term "confidence intervals" is nonsensical because it implies a
standardised normal distribution, but the mathematical assumptions underlying this distribution are
violated when making comparisons on the basis of "greater", "curved", "present/non present".
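
The difficulty with treating ranked categories as if they were equally spaced can be illustrated with a short sketch. The two sets of ratings and both numeric codings below are hypothetical; the point is only that any arithmetic on such categories depends on an arbitrary choice of spacing that the categories themselves do not supply.

```python
from statistics import mean

# Two codings that both respect the same ranking of the categories.
# Neither is "the" scoring used by practitioners; both are assumptions.
EQUAL_STEPS = {"Insufficient Detail": 0, "Support": 1, "Strong Support": 2, "Did Make": 3}
UNEQUAL_STEPS = {"Insufficient Detail": 0, "Support": 1, "Strong Support": 2, "Did Make": 10}


def mean_score(ratings, coding):
    """Average the numeric codes assigned to a list of category labels."""
    return mean(coding[r] for r in ratings)


# Hypothetical ratings from two sets of comparisons.
SET_A = ["Support", "Support", "Did Make"]
SET_B = ["Strong Support", "Strong Support", "Strong Support"]

for name, coding in (("equal steps", EQUAL_STEPS), ("unequal steps", UNEQUAL_STEPS)):
    print(name, mean_score(SET_A, coding), mean_score(SET_B, coding))
```

Under the equally spaced coding set B has the higher average; under the unequally spaced coding set A does. Because the ordering of the averages changes with the coding, summing or averaging such categories conveys no information unless equal intervals can be independently justified.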

20 (continued) such a process is never absolute; it may approach a probability of 1.0 but it can never attain this level of certainty unless all
members of a class (in this case, humans) are observed. In fact, this example is not far fetched or fanciful since polydactyly
(having more than five digits on either hand or foot) clearly exists, albeit that it is rare.
21 The American Board of Forensic Odontology has promulgated this definition.
22 See Thompson and Cole, n 4, pp 44-45.
23 Yamashita AB, "Forensic Barefoot Morphology Comparison" (2007) 49 Canadian Journal of Criminology and Criminal
Justice 647; Kennedy RB, Pressman IS, Chen S, Petersen PH and Pressman AE, "Statistical Analysis of Barefoot Impressions"
(2003) 48(1) Journal of Forensic Sciences 55.
24 The four scales of measurement, in ascending order of sophistication, are: nominal, ordinal, interval and ratio. In a nominal
scale, one item is simply different from another, eg male or female. In an ordinal scale, one item may be said to be lesser or
greater than another but not by a defined amount, eg a sergeant has a higher rank than a private. In an interval scale, one item
can be defined as being greater or lesser than another by a defined amount, eg 101 km/h is greater than 100 km/h by the same
amount as 51 km/h is greater than 50 km/h. A ratio scale has the same properties as an interval scale with the exception that
it incorporates a true zero point as its origin. The mathematical operations that are admissible increase with the increasing
sophistication of measurement scales.

In short, the use of the term "confidence interval" and the graphical presentation of the categories
used to support or disconfirm the hypothesis that a particular individual did make the footprint
pattern are not only nonsensical, but they give the grossly misleading impression of scientific accuracy
when none necessarily exists. Similar problems exist with other qualitative scales routinely used in
forensic identification.
What of the probabilities routinely quoted by experts when presenting the results of their
analysis? Often these refer to comparisons between exemplar specimens such as fingerprints or
weight-bearing patterns of the feet. For example, suppose that the probability of a random match
between individuals based on the exemplar studies of Kennedy and colleagues in barefoot morphology
is, as they claim, less than one in a hundred million.25 Then suppose that the false positive rate for any
particular examiner based on repeated trials is, say, 5% when comparing samples found at a crime scene
with exemplar prints. The combined probability of error is, to a close approximation, the sum of these
error rates, ie the additive law of probabilities applies: here about 5%, almost all of it attributable to
the examiner. As Koehler26 noted, if experts make false positive errors when
comparing impressions found in situ with exemplar impressions, then that is the rate-limiting factor
which determines and controls the match report. Studies of jurors in the United States, however, show
that this basic law of probability is not understood; they give more weight to the low probability of
random matches, which is dwarfed by the false positive rate.27
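
The arithmetic behind this point can be sketched using the two figures from the example above (a random match probability of one in a hundred million and an examiner false positive rate of 5%). The function name is hypothetical, and the calculation assumes that the two sources of error are independent, which is an assumption made for illustration rather than a finding of any cited study.

```python
def combined_false_match_probability(random_match_probability, examiner_false_positive_rate):
    """Probability that a reported match is wrong through either route.

    Assumes the coincidental match and the examiner error are independent,
    so P(A or B) = P(A) + P(B) - P(A) * P(B).
    """
    p = random_match_probability
    q = examiner_false_positive_rate
    return p + q - p * q


RANDOM_MATCH_PROBABILITY = 1 / 100_000_000   # figure claimed in the exemplar study
EXAMINER_FALSE_POSITIVE_RATE = 0.05          # illustrative 5% rate from the text

combined = combined_false_match_probability(
    RANDOM_MATCH_PROBABILITY, EXAMINER_FALSE_POSITIVE_RATE
)
simple_sum = RANDOM_MATCH_PROBABILITY + EXAMINER_FALSE_POSITIVE_RATE

print(f"combined probability of error: {combined:.10f}")
print(f"simple sum of the two rates:   {simple_sum:.10f}")
print(f"share due to the examiner:     {EXAMINER_FALSE_POSITIVE_RATE / combined:.8f}")
```

On these figures the simple sum and the exact value differ only in the tenth decimal place, and virtually all of the combined error comes from the examiner's false positive rate rather than from the random match probability.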
While there is good reason to be deeply pessimistic about the actual rate of false positive errors in
forensic identification, there is even more reason to be highly sceptical about the claims advanced by
experts of various persuasions that their discipline does not suffer from this defect. For example,
LPEs, whilst grudgingly accepting that some fingerprint identifications have been wrong, argue that
after nearly a century of adversarial challenge there have been relatively few false-positive errors
exposed regarding latent fingerprint identification. Does this prove that false positive errors are few
and far between in LPE forensic evidence presented to courts? No. It merely demonstrates the
unlikelihood of such errors being exposed. In fact, when fingerprint comparison is presumed to be
accurate unless proved otherwise, experience has shown that the process of rebuttal can be fraught
with difficulty.
While there have been few properly controlled studies that provide objective evidence as to the
false positive rate among LPEs, one such study is both noteworthy and disturbing. Dror and
Charlton28 presented expert LPEs from across the world with latent and exemplar prints taken from
actual criminal cases. Half of the prints had been categorised as individualisations whilst the other half
had been excluded. Using a within-subjects experimental design, the same prints were presented to the
same experts, many years after they had originally assessed them. Two-thirds of the experts made
inconsistent decisions; ie they disagreed with themselves.
The percentage of inconsistent decisions ranged from 12% when there was no overtly potentially
biasing contextual information, to 16.6% when contextually biasing information (such as telling the
participants that "the suspect was in police custody at the time of the crime" or "the suspect confessed to
the crime") was provided. The false positive rate (as determined by a rejection of an initial
individualisation which was subsequently disconfirmed) was 10.4%. This study is particularly
noteworthy since the experimental design employed negates the arguments routinely advanced to
support the contention that false positives in fingerprint identification are only possible if the
examiners are not properly trained and not subjected to ongoing proficiency evaluation.

25 Kennedy et al, n 23 at 62.
26 Koehler JJ, "Fingerprint Error Rates and Proficiency Test: What Are They and Why Do They Matter?" (2008) 59 Hastings
Law Journal 101.
27 Koehler JJ, Chia A and Lindsey JS, "The Random Match Probability (RMP) in DNA Evidence: Irrelevant and Prejudicial?"
(1995) 35 Jurimetrics 201.
28 Dror and Charlton, n 12.

Another study of false positive rates using ROC analysis was conducted with 32 forensic
odontologists who were asked to determine whether or not four sets of photographs actually
represented bite marks, and how certain they were, based on comparison of dental casts used in actual
cases, that each set of teeth had made each bite mark.29 ROC analysis resulted in an accuracy score
(AUC) of 0.86 with a 95% confidence interval ranging from 0.83 to 0.91. In other words, the average
error rate was 14%. While the false positive rate was 0.02 or lower for ratings of "probable" to
"reasonable medical certainty", the results of this study need to be interpreted with considerable caution
owing to the small number of comparisons used and the decision to declare whether a false positive
had occurred on the basis of the evidence given by the original examining dentist in a court case.
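To make the accuracy score concrete, the following is a minimal sketch of how an area under the ROC curve (AUC) can be computed from certainty ratings. The ratings are invented for illustration and are not data from the bite mark study; the function name auc_from_ratings is likewise hypothetical. The AUC is taken here as the proportion of (true match, true non-match) pairs in which the true match received the higher rating, with ties counted as one half.

```python
from itertools import product


def auc_from_ratings(match_ratings, nonmatch_ratings):
    """AUC as the probability that a true match outranks a true non-match."""
    wins = 0.0
    for m, n in product(match_ratings, nonmatch_ratings):
        if m > n:
            wins += 1.0   # true match rated higher: correct ordering
        elif m == n:
            wins += 0.5   # tie: counted as half a correct ordering
    return wins / (len(match_ratings) * len(nonmatch_ratings))


# Hypothetical certainty ratings on a scale of 1 (certainly not) to 5 (certain).
true_matches = [5, 4, 4, 3, 5, 2]
true_nonmatches = [1, 2, 3, 1, 2, 4]

auc = auc_from_ratings(true_matches, true_nonmatches)
print(f"AUC = {auc:.3f}")
```

On this measure a value of 0.5 corresponds to guessing and a value of 1.0 to perfect discrimination.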
An obvious factor limiting the accuracy of comparison of evidence obtained at a crime scene with
exemplar evidence is the often incomplete nature of the former. For example, with respect to
fingerprint evidence, it is usually considered that there are 80 minutiae that can be compared between
exemplar samples; with barefoot morphology it is argued that there are 200 different measurements of
the weight bearing patterns of the feet that can be measured.30 Usually not all of these indicia can be
reliably observed from crime scene evidence. Accordingly, the comparison of such evidence with
exemplar specimens is based on a subset of the potentially available comparisons. This has profound
consequences when considering the claims for probability of matches derived from multivariate
statistical models on exemplar evidence.
Usually these models are based on principal component analysis or factor analysis;31 these
multivariate statistical techniques enable the underlying components or factors to be identified from
the multitude of possible combinations of individual indicia. These factors are a mathematical
construct which, in essence, groups together measurements of individual indicia. A prosaic example is
the use of factor analysis to determine underlying personality factors on the basis of responses to a
large number of individual questions such as is commonly used in psychometric testing.32
Insofar as all the underlying factors derived from these analyses are mathematically
independent,33 the probability of, say, five such factors all agreeing can be determined by application
of the multiplicative law of probabilities. This is the method used to calculate the probability of
getting heads 10 times in a row by tossing an evenly weighted coin. However, applying these
statistical techniques is fraught with difficulties when comparing crime scene and exemplar evidence,
because only a subset of the indicia is usually available from the crime scene.
This is analogous to trying to ascertain an individual's underlying personality factors when they have
only completed 20% of a personality questionnaire.
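The multiplicative law referred to above can be illustrated with the coin-tossing example. This is a sketch only; the function name prob_all_agree is hypothetical, and the calculation is valid only where the events are genuinely independent.

```python
from fractions import Fraction


def prob_all_agree(probabilities):
    """Multiply the probabilities of independent events (multiplicative law)."""
    result = Fraction(1)
    for p in probabilities:
        result *= Fraction(p)
    return result


# Ten tosses of an evenly weighted coin, each with probability 1/2 of heads.
ten_heads = prob_all_agree([Fraction(1, 2)] * 10)
print(ten_heads, float(ten_heads))
```

The result is 1 in 1,024. If the events are not independent, simple multiplication of this kind will misstate the joint probability.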
Faced with this difficulty, forensic identification often proceeds from a far less rigorous
foundation. Consider the 16-point comparison standard for LPE adopted in most of Europe. Applying
simple combinatorial mathematics, it may be determined that there are 3,160 possible combinations of
comparisons between any two minutiae based on a total number of 80 minutiae found in exemplar
fingerprints. In the face of this fact, what is the probative value in stating that 16 points of comparison
match?
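The combinatorial figure quoted above can be checked directly. The sketch below assumes, as the text does, a total of 80 minutiae; the count of distinct 16-point subsets is an additional illustration and is not a figure given in the text.

```python
import math

TOTAL_MINUTIAE = 80  # figure given in the text for exemplar fingerprints

# Number of ways of choosing 2 minutiae from 80.
pairs = math.comb(TOTAL_MINUTIAE, 2)

# Illustration only: number of distinct sets of 16 minutiae drawn from 80.
sixteen_point_sets = math.comb(TOTAL_MINUTIAE, 16)

print(pairs)
print(f"{sixteen_point_sets:.3e}")
```

The first value reproduces the 3,160 combinations referred to in the text.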
29 Arheart KL and Pretty IA, Results of the 4th ABFO Bitemark Workshop 1999 (2001) 124 Forensic Science International
104.
30 Kennedy et al, n 23.
31 Principal component analysis and factor analysis are based on common underlying mathematical premises such that discussion
of the two techniques is often conflated in statistical texts.
32 Cattell RB and Krug SE, The Number of Factors in the 16PF: A Review of the Evidence with Special Emphasis on
Methodological Problems (1986) 46 Educational and Psychological Measurement 509.
33 This assumes that the final factor rotation is orthogonal (ie mathematically independent).

There is, fortunately, a relatively simple means of providing statistical guidance to those charged
with assessing these sorts of situations, although to the authors' knowledge it has never been used in
any forensic identification case. Applying the cumulative binomial distribution, and assuming that the
examiner is only guessing when making a decision (ie a probability of 0.5 of a correct match on each of
the 80 possible comparisons), the probability of matching 16 or fewer
comparisons in fingerprint analysis is 2.93E-08. Counter-intuitively, in this example the cumulative
probability goes down as the accuracy of each individual comparison goes up; with a probability
of 0.9 of making a correct match for each point of comparison made, the cumulative probability of
getting 16 or fewer matches is 5.13E-49, ie 48 zeros after the decimal point followed by 513.34 This is
scarcely impressive.
In other forensic identification disciplines, the situation is much worse. Returning to the example
of barefoot morphology, the cumulative probability of only getting 10 or fewer matches out of a
possible 200 points of comparison is 7.87E-175 when the probability of making each comparison
accurately is 0.9. This latter example is used since it is precisely this example that has been advanced
in cases involving forensic identification via barefoot morphology.
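The cumulative binomial probabilities discussed above can be reproduced with a short calculation. The sketch below is illustrative only: the function name binomial_cdf is hypothetical, and treating "only guessing" as a probability of 0.5 across 80 comparisons is an assumption, adopted because it reproduces the quoted figure to the precision given. Exact rational arithmetic is used so that the very small values are not lost to rounding.

```python
from fractions import Fraction
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed exactly with rationals."""
    p = Fraction(p)
    q = 1 - p
    return sum(comb(n, i) * p**i * q**(n - i) for i in range(k + 1))


# Fingerprints: 16 or fewer matches out of 80 comparisons.
guessing = binomial_cdf(16, 80, Fraction(1, 2))    # assumed p = 0.5
accurate = binomial_cdf(16, 80, Fraction(9, 10))   # p = 0.9 as in the text

# Barefoot morphology: 10 or fewer matches out of 200 comparisons, p = 0.9.
barefoot = binomial_cdf(10, 200, Fraction(9, 10))

for label, value in [("fingerprints, p=0.5", guessing),
                     ("fingerprints, p=0.9", accurate),
                     ("barefoot, p=0.9", barefoot)]:
    print(f"{label}: {float(value):.2e}")
```

Under these assumptions the three values agree with the figures quoted in the text to within rounding of the final digit.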
What is so different between forensic identification evidence and other evidence that enables this
minute probability to be regarded as indicating "Strong Support" for the hypothesis that the crime
scene evidence matches the exemplar evidence obtained from a suspect? Clearly, other factors such as
the overall pattern must be considered by analysts but, in the absence of all the data used in deriving
principal component/factor analysis models, this cannot be done except intuitively and, as has been
previously pointed out, there is no information on how this intuitive matching of patterns is done. Nor,
in the absence of ROC analysis, is there anything to suggest that this process, per se, has anything like
the same accuracy as comparisons between exemplar evidence.
One thing is, however, made abundantly clear from the statistical analyses set out herein. There is
no scientific foundation to support the claim that categorisations such as "Did Not Make", "Strong
Support" and the like are validated by evidence based on the very small number of comparisons
between elements or minutiae or patterns of whatever type typically cited in forensic identification
reports.

OBSERVER EFFECTS
Sir George Jessel in Lord Abinger v Ashton, commenting on expert evidence, expressed the following
view:
An expert is not like an ordinary witness, who hopes to get his expenses, but he is employed and paid
in the sense of gain, being employed by the person who calls him. Now it is natural that his mind,
however honest he may be, should be biased in favour of the person employing him, and accordingly
we do find such bias ... Undoubtedly there is a natural bias to do something serviceable for those who
employ you and adequately remunerate you.35

Irrespective of financial gain, other psychological processes inevitably play a large part in the
decisions made by expert witnesses of whatever persuasion, as was obliquely recognised in this
opinion by reference to doing "something serviceable". These processes have long been recognised
and are repeatedly demonstrated.
The placebo effect is perhaps the best known example of the way in which observers interact
with the object, process or person being observed to affect the observation. Even animals are affected
by observers' expectations and change their behaviours accordingly, as was demonstrated by the case
of Clever Hans, an Arabian stallion in the late 19th and early 20th century who could, so it
appeared, count, add and subtract,36 and he could still perform these feats when his trainer was absent.
While Hans was undoubtedly clever, he could not count; rather, he responded to subtle changes in the
facial expressions of those observing him as he was tapping out his responses to mathematical
problems with his hoof. If this result has been seen to occur with animals, it is also likely to occur
with that most suggestible of animals: humans.
34 See http://www.statrek.com/Tables/Binomial.apsx.
35 Lord Abinger v Ashton (1873) 17 LR Eq 358 at 374 (emphasis added).
36 Sebeok TA, The Clever Hans Phenomenon: Communication with Horses, Whales, Apes and People (New York Academy of
Sciences, 1970).
37 Risinger DM, Saks M, Thompson WC and Rosenthal R, The Daubert/Kumho Implications of Observer Effects in Forensic
Science: Hidden Problems of Expectation and Suggestion (2002) 90(1) California Law Review 1.

In an exhaustive and seminal review, Risinger and colleagues37 have iterated these and other
problems of expectation and suggestion in forensic science. These include observer effects,
decision thresholds, anchoring effects, role effects, conformity effects and experimenter
effects. These problems are insidious. They are, in many respects, more troublesome than fraudulent
conduct, since they can lead otherwise competent and honest scientists into offering sincere
conclusions that are inaccurate. While detailed analysis of all these factors is beyond the scope of this
article, it is apposite to consider these effects as they apply to perception and cognition under the
conditions of ambiguity and subjectivity that apply in many cases of forensic identification.
Under conditions of ambiguity and subjectivity, it has been repeatedly demonstrated that decision
thresholds change so that, in response to identical stimuli, response biases arising from expectancies or
reinforcing effects can unconsciously affect accuracy of prediction. Suppose that observers have to
look at a dark visual field and determine whether a very dim light, at or below their visual threshold,
is on or off. Then suppose that when they are correct in the sense of obtaining a true positive (ie when
they say "yes" when a light is on) they are rewarded with, say, $10. This is a classic experiment used
in perceptual psychology; overwhelmingly, observers respond to this sort of situation by saying "yes"
more frequently, but their perceptual capacity remains unaltered. They have simply become more
willing to say "yes" because of reinforcement contingencies.
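The distinction between perceptual capacity and willingness to say "yes" can be sketched with a simple signal detection model. What follows is an illustration under stated assumptions, not a description of any particular experiment: it assumes an equal-variance Gaussian model, and the sensitivity value (d') and the two decision criteria are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def response_rates(d_prime, criterion):
    """Hit and false-alarm rates under an equal-variance Gaussian model."""
    hit_rate = 1 - STANDARD_NORMAL.cdf(criterion - d_prime)
    false_alarm_rate = 1 - STANDARD_NORMAL.cdf(criterion)
    return hit_rate, false_alarm_rate


def sensitivity(hit_rate, false_alarm_rate):
    """Recover d' from observed rates: z(hits) minus z(false alarms)."""
    return (STANDARD_NORMAL.inv_cdf(hit_rate)
            - STANDARD_NORMAL.inv_cdf(false_alarm_rate))


D_PRIME = 1.0  # hypothetical perceptual capacity, held fixed throughout

# A lower criterion models an observer who has become more willing to say "yes".
for label, criterion in [("neutral", 0.5), ("rewarded for hits", 0.0)]:
    hits, false_alarms = response_rates(D_PRIME, criterion)
    print(f"{label}: hits={hits:.2f}, false alarms={false_alarms:.2f}, "
          f"d'={sensitivity(hits, false_alarms):.2f}")
```

In this model, lowering the criterion raises the rate of true positives and the rate of false positives together, while the recovered sensitivity is unchanged.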
Reinforcement comes in many non-monetary forms, as the case of Clever Hans demonstrates. In
a forensic context, these may include praise from an investigator for getting it right or relief from
anxiety/pressure by getting it done; in the larger context, a forensic scientist may be regarded as a
subject in an experiment, the setting of which is a forensic laboratory. As Risinger et al noted:
The beliefs and expectancies of superiors, coworkers and external personnel are manifest in their
behaviour toward the forensic scientist subject, in turn affecting the behaviour of these subjects –
their observations, recordings, computations and interpretations – not to mention the additional impact
role and conformity effects may have.38

The weight of research and epistemological principles render it impractical if not immoral to
ignore these and other elements of the observer effects. Referring to the mandate arising from Kumho
Tire Co v Carmichael,39 to evaluate the reliability of expert opinion evidence whenever its factual
basis, data, principles, methods or their application are called sufficiently into question, Risinger et al
put it thus:
For what could more centrally call into question the methodology by which a particular conclusion was
reached than the uncontrolled presence of the precursors of the various observer effects, which render it
impossible to say with confidence whether or not the conclusion is merely an artefact of these
conditions?40

A similar implied warning against the dangers of simply accepting what an expert has to say,
without more closely examining what has led them to say it, was issued by Heydon JA (as he then
was) in Makita (Australia) Pty Ltd v Sprowles:
[T]he expert's evidence must explain how the field of specialised knowledge in which the witness is
expert by reason of training, study or experience, and on which the opinion is wholly or substantially
based, applies to the facts assumed or observed so as to produce the opinion propounded.41

This arguably applies not just to the specialised knowledge which the expert possesses, but also to the
process by which it has been applied to the facts of the case in hand, in order to leave the court in no
doubt that no errors (human, statistical or methodological) have been allowed to compromise the
conclusion(s) reached.42
38 Risinger et al, n 37 at 21.
39 Kumho Tire Co v Carmichael 526 US 137 (1999) at 149.
40 Risinger et al, n 37 at 54.
41 Makita (Australia) Pty Ltd v Sprowles (2001) 52 NSWLR 705 at 744.
42 For a recent example of two experts duelling it out over their respective choices of statistical models for the creation of a
population database from which certain DNA comparisons could be made in an issue involving paternity, see Bropho v The
State of Western Australia [No 2] [2009] WASCA 94.


CONCLUSIONS
With the foregoing arguments in mind, it is but a small step to regard a courtroom as the setting for an
experiment in which the forensic scientist subject is reinforced, both subtly and overtly, on the basis
of the opinions proffered. The consequences that proceed from this recognition are unambiguous.
Unreliable forensic identification evidence must not be allowed to contaminate criminal trials, since to
do so will make it more likely that sloppy, invalid and epistemologically flimsy expert evidence will
be nurtured by the legal system. Somewhat bleakly, Edmond et al have argued that, in the Australian
context:
extant admissibility jurisprudence and traditional safeguards associated with expert opinion evidence
and the adversarial system might not adequately protect those accused of committing criminal acts
when they are confronted with incriminating expert identification evidence.43

This is well put. In R v Tang, a case concerning the admissibility of forensic identification
evidence based on facial mapping, Spigelman CJ referred to the issue of the reliability of such
evidence as extraneous.44 However, the concept of reliability of scientific evidence has attracted
more direct judicial comment in a number of appeals and inquiries.
Over a quarter of a century ago, in a case in which the essential issue was whether or not a
psychologist might testify as to the likely emotional effect on a young man of being told by his
girlfriend that she had been unfaithful to him, and that her expected child was not his, the English
appeal judge Lawton LJ observed:
Before a court can assess the value of an opinion it must know the facts upon which it is based ... It is
wrong to leave the other side to elicit the facts by cross-examination.45

This elementary precaution is, it is submitted, even more appropriate when considering the scientific
principles upon which such an opinion is based. It has long been held that the simplest way to destroy
an opponent's expert evidence is to destroy the factual platform upon which it is constructed. Why
should that not also be true of the epistemological underpinnings or the mathematical and statistical
constructs which have led to that same opinion?
There has been no shortage of similarly expressed sentiments from Australian judges. In R v
Anderson, Winneke P, in the Victorian Court of Appeal, pointed out that:
an opinion is only as good as the factual or scientific basis upon which it is expressed; and if no such
basis is given or, if given, can be seen to be speculative or irrelevant to the opinion expressed, then the
opinion will be worthless.46

His Honour might equally have added bogus or flawed statistical techniques to the list of factors
which can render so-called expert evidence worthless. The argument for rejecting unreliable
methodology is perhaps more obvious when the specialist area in which it is being employed is itself
somewhat marginal in terms of its acceptance by mainstream science, but logically and ethically it
ought to be equally applicable (and is perhaps in more urgent need of reconsideration) in those areas
of forensic science such as fingerprinting and blood grouping which have long since passed into the
daylight of judicial acceptance.
43 Edmond G, Biber K, Kemp R and Porter G, Law's Looking Glass: Expert Identification Evidence Derived from
Photographic and Video Images (2009) 20(3) Current Issues in Criminal Justice 338 at 338.
44 R v Tang (2006) 65 NSWLR 681 at [137]; 161 A Crim R 377.
45 R v Turner [1975] QB 834 at 840.
46 R v Anderson (2000) 1 VR 1 at 25; 111 A Crim R 19.
47 In R v Carroll (1985) 19 A Crim R 410.

One of the bluntest judicial warnings against being blinded by pseudo-science was issued in the
context of a newly emerging, and somewhat suspect, branch of forensic investigation, namely forensic
odontology. Undeterred by the Queensland Court of Appeal's trenchant rejection47 of bite mark
comparisons as a respected method of identifying a criminal offender, the Northern Territory DPP tried
again in Lewis v The Queen,48 only to have such evidence rejected as subjective. During the course
of this process, Maurice J confirmed the thesis which underlies this article when he opined that:
The inability to articulate the principal tenets that need to be understood, to describe in ordinary
language the methods used and the reasons that point to a particular conclusion, these are the hallmarks
of unreliable science and the not-so-qualified expert.49

This point was more recently and cogently affirmed in a Canadian appellate case concerning the
admissibility of barefoot morphology evidence. In R v Dimitrov, the court noted that novel scientific
theories or techniques are subject to special scrutiny50 and that the burden is on the party putting
forth the expert to establish its reliability on the balance of probabilities. Rejecting the admissibility
of barefoot morphology evidence on the grounds of significant issues as to the reliability of the
evidence,51 the court referred to earlier decisions of the Canadian Supreme Court and commented as
follows:
In R v J-LJ (2000) 148 CCC (3d) 487 (SCC) at paras 34 and 28 respectively, the Supreme Court of
Canada noted that the admissibility of expert evidence is highly case specific and that the trial judge is
to take seriously the role of gatekeeper. The court set out the following factors that should be
considered in determining the threshold reliability: (1) whether the theory or technique can be and has
been tested; (2) whether the theory or technique has been subjected to peer review and publication; (3)
the known or potential rate of error or the existence of standards; and (4) whether the theory or
technique used has been generally accepted within the scientific community.52

It is not only with respect to novel methods of forensic identification that the known or potential
rate of error, from all sources, must be considered when determining the admissibility or otherwise of
expert evidence; it is for all methods of forensic identification, including those that are presumed to be
accurate at law. As has been demonstrated above, the known or potential rate of error in those
scientific methods of forensic identification long accepted by the courts is often higher than
practitioners of various methods accept. Further, the statistical basis for the verbal descriptors used to
define degrees of consistency or confidence is startlingly weak; for the overwhelming majority of
forensic evidence identification techniques, it is non-existent or mathematically invalid. To reiterate, it
is mathematically invalid to refer to categories based on verbal descriptors such as "strongly
supports", "supports" and the like as if the difference between these categories is equal when the
underlying data is based on a nominal scale of measurement. In addition, there is often no statistical
justification for forensic scientists asserting that their evidence provides "strong support" for the claim
that latent and exemplar evidence matches; rather, these claims are usually ex cathedra statements. To
paraphrase the observations of Dixon CJ in Plomp v The Queen: degrees of consistency or confidence
that are not founded on a scientific basis are not consistent, nor do they afford any reasonable degree
of confidence on which to base a decision.53
It is trite to state that, before admitting scientific evidence, the courts must be satisfied that such
evidence is, indeed, scientific. The problem is making this decision in the face of confident assertions
from experts that are often unwarranted. From a scientific perspective, one obvious way of dealing
with this problem is to ask fundamental questions concerning the scientific method employed by
forensic experts, of whatever persuasion. As a simple prophylactic to unwarranted confidence, if not
outright ignorance, forensic experts should be challenged with the following questions:
(1) What is the diagnostic accuracy of the procedure or method when comparing latent and exemplar
specimens?
(2) What is the false positive error rate of the procedure or method when comparing latent and
exemplar specimens?
(3) What statistical bases are there for verbal anchors used to describe the confidence of the expert
vis-à-vis diagnostic accuracy?
(4) What is the test-retest error rate between and within experts when comparing latent and exemplar
specimens under conditions of double blind testing?
It is predicted that the answers to these obvious questions, which are fundamental to establishing the
epistemological basis for forensic identification evidence, will prove illuminating, if not alarming, to
judicial gatekeepers.

48 Lewis v The Queen (1987) 88 FLR 104; 29 A Crim R 267.
49 Lewis v The Queen (1987) 88 FLR 104 at 124 (emphasis added); 29 A Crim R 267.
50 R v Dimitrov (2003) 68 OR (3d) 641 at [37].
51 R v Dimitrov (2003) 68 OR (3d) 641 at [56].
52 R v Dimitrov (2003) 68 OR (3d) 641 at [38] (emphasis added).
53 Plomp v The Queen (1963) 110 CLR 234 at 243.
