MLO, a right CC, and a right MLO. We also select approximately 5% of the samples for label refinement by a single reader, i.e., a radiologist who is a breast specialist. Half of these refined samples are used for Ds, and the rest for a validation set. As an in-house test set, we have collected 986 cases from another institution in South Korea; the same radiologist labeled this test set. To compare our method fairly with others, we have also collected an external test set of 8,206 cases from a large hospital in the US. For this set, we extracted the density grade for each case from the CRF field and use it as the label. Table 1 summarizes our datasets.
4.1.2 Baseline
For the classifier fc, we adopt ResNet-18 and modify it to produce a 4-dimensional softmax output. We train with the SGD optimizer and a learning rate of 0.1. The model takes a single mammogram as input, and the predictions from the four views are averaged into a case-level prediction. Mammograms are decoded using the window center and width embedded in the DICOM header.
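The view-to-case aggregation above can be sketched as follows; the function name and view ordering are our own illustration, not part of the paper's implementation:

```python
def case_level_prediction(view_probs):
    """Average four per-view softmax vectors into one case-level prediction.

    view_probs: four 4-class softmax vectors, one per view
    (e.g., L-CC, L-MLO, R-CC, R-MLO); the ordering is illustrative only.
    Returns the averaged probabilities and the argmax grade index.
    """
    n_views = len(view_probs)
    case_probs = [sum(v[k] for v in view_probs) / n_views for k in range(4)]
    grade = max(range(4), key=lambda k: case_probs[k])
    return case_probs, grade

# Toy example: three views favor grade b (index 1), one favors grade c.
probs, grade = case_level_prediction([
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.3, 0.5, 0.1],
    [0.1, 0.7, 0.1, 0.1],
])  # grade == 1 (category b)
```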
To check the sanity of our baseline network, we compare it with other neural-network methods, [7] and [14]. The comparison is necessarily indirect, since all reported scores in other works are obtained with different configurations and private datasets. In [7] and [14], the training split and the test split are collected at the same site, while our training and test splits come from different sites. To keep the experimental configuration as similar as possible, we follow the experimental settings of [7] and [14]: we split our main dataset into three parts, using Dr for training, Ds for validation, and the validation set of the original split as the test set.
Following [7] and [14], we track two metrics: 4-class accuracy and 2-class (fatty vs. dense) accuracy. Class-wise averaged accuracy is used for this sanity check. Those papers report approximately 77% 4-class accuracy and 87% 2-class accuracy, while our baseline achieves 74% and 89%. Our baseline is thus inferior in 4-class accuracy but superior in 2-class accuracy. Interestingly, [7] and [14] also report almost identical scores to each other on their in-house test sets. Putting these results together, we conclude that our model is comparable in accuracy to other works. This implies that, when generalization over inter-reader variance is not considered, the capability of classifying breast density is almost the same even across models trained with entirely different datasets and hyper-parameters. Note that the above accuracy scores are class-wise averaged, to make them comparable with other works, while all other accuracy scores reported throughout this paper are instance-wise.
4.1.3 Metrics
We use 4-way classification accuracy, as it has been the common metric in previous works. Unfortunately, class-averaged accuracy scores may be unreliable on our test set, which suffers from a class-imbalance problem: it contains only 9 samples of category a. Under a class-averaged metric, for example, a sample of category a contributes 455/9 ≈ 50.56 times more to the score than a sample of category c. Instead of class-averaged accuracy, we therefore use instance-wise average accuracy to relax this problem.
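The difference between the two metrics can be seen on an imbalanced toy set; this is a minimal sketch, and the function names are ours:

```python
def instance_accuracy(y_true, y_pred):
    """Plain accuracy: fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def class_averaged_accuracy(y_true, y_pred, num_classes=4):
    """Mean of per-class accuracies over the classes present in y_true."""
    per_class = []
    for c in range(num_classes):
        idx = [i for i, t in enumerate(y_true) if t == c]
        if idx:
            per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)

# Toy imbalanced set: 1 sample of class 0, 9 samples of class 2.
y_true = [0] + [2] * 9
y_pred = [1] + [2] * 9   # only the single rare-class sample is wrong
instance_accuracy(y_true, y_pred)        # 0.9
class_averaged_accuracy(y_true, y_pred)  # 0.5 -- the rare class dominates
```

A single misclassified rare-class sample halves the class-averaged score while costing only one tenth of the instance-wise score, which is exactly the distortion described above.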
Moreover, the accuracy metric itself becomes inaccurate once the inter-reader variance problem is taken into account, because breast density prediction is not a typical classification task. Whatever grading criterion is used to choose a discrete category, some samples remain ambiguous between two adjacent grades, since the density grade is a discretization of a continuous density score whose underlying physical quantity is the proportion of parenchyma in a breast. In addition, there is an ordinal relation between breast density labels, which makes the accuracy metric even less suitable: grade a is closer to b than to c or d.
To take these issues into account, we propose a new metric for breast density estimation algorithms, called density-AUC (dAUC). This metric aggregates AUC scores between the density predictions of a model and binarized breast density categories. Since the labels in AUC must be binary (negative or positive), the breast density labels y ∈ {a, b, c, d} are split in three ways: [a vs. b, c, d], [a, b vs. c, d], and [a, b, c vs. d]. As a result, we obtain three different label sets for a given dataset, i.e., three sub-problems for dAUC. Samples whose breast density category is on the left side are assigned to the negative class (0), and those on the right side to the positive class (1).
Meanwhile, in addition to label binarization, the network predictions need to be reduced for each sub-problem, since AUC requires a single real-valued score per sample. The proposed model outputs a length-4 vector whose components represent the probabilities of the classes (a, b, c, d). We sum the probabilities of the positive categories in each sub-problem. For instance, when measuring the AUC of [a, b vs. c, d], the sample score is defined as ŷc + ŷd, where ŷc and ŷd are the corresponding elements of the softmax output ŷ. This score satisfies our implicit assumption: a lower value for a fatty breast and a higher value for a dense breast. The final dAUC score is calculated by averaging over the three sub-problems.
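A minimal sketch of dAUC under these definitions (grades encoded as integers 0–3 for a–d; the Mann–Whitney AUC helper and the function names are our own, not the paper's):

```python
def binary_auc(labels, scores):
    """Mann-Whitney AUC: fraction of (positive, negative) pairs ranked
    correctly, with half credit for ties."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def density_auc(grades, probs):
    """dAUC: average AUC over the three ordinal binarizations
    a|bcd, ab|cd, abc|d. grades: ints in {0, 1, 2, 3} for {a, b, c, d};
    probs: 4-element softmax vectors, one per sample."""
    aucs = []
    for cut in (1, 2, 3):
        labels = [int(g >= cut) for g in grades]   # right side -> positive
        scores = [sum(p[cut:]) for p in probs]     # sum of positive-class probs
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)
```

For a model whose scores perfectly order the four grades, each of the three sub-problem AUCs is 1.0, so dAUC is 1.0.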
Note that dAUC is intended as a complement to the accuracy metric, not a substitute; accuracy is also tracked as an important metric. Producing density score