Photometric Transformer Networks and Label Adjustment
for Breast Density Prediction
Jaehwan Lee
Lunit Inc.
Seoul, Republic of Korea
jhlee@lunit.io
Donggeun Yoo
Lunit Inc.
Seoul, Republic of Korea
dgyoo@lunit.io
Hyo-Eun Kim
Lunit Inc.
Seoul, Republic of Korea
hekim@lunit.io
Abstract
Grading breast density is highly sensitive to the normalization settings of a digital mammogram, as the density is tightly correlated with the distribution of pixel intensity. Also, the grade varies across readers due to uncertain grading criteria. These issues are inherent in the density assessment of digital mammography. They are problematic when designing a computer-aided prediction model for breast density, and they become worse if the data comes from multiple sites. In this paper, we propose two novel deep learning techniques for breast density prediction: 1) photometric transformation, which adaptively normalizes the input mammograms, and 2) label distillation, which adjusts the labels using the model's own predictions. The photometric transformer network predicts optimal parameters for photometric transformation on the fly, learned jointly with the main prediction network. The label distillation, a type of pseudo-label technique, is intended to mitigate the grading variation. We experimentally show that the proposed methods are beneficial for breast density prediction, resulting in significant performance improvements over various previous approaches.
1. Introduction
Breasts can be categorized as dense or fatty by the portion of parenchyma they contain. A fatty breast is mostly composed of fat tissue, whereas a dense breast has more dense tissue that shows dense parenchymal patterns on mammograms. Readers should be more careful when dealing with mammograms with dense parenchymal patterns, since suspicious malignant lesions can be hidden, resulting in false negatives [6]. It has also been reported that a dense breast carries a higher-than-average risk of breast cancer [1]. For this reason, BI-RADS [11], the standard protocol for breast imaging, guides interpreting readers to report the density category as an essential field of the case report form (CRF). In the BI-RADS taxonomy, breast density is categorized into four grades, a, b, c, and d, meaning “almost entirely fatty”, “scattered areas of fibro-glandular tissue”, “heterogeneously dense”, and “extremely dense”, respectively.
Based on collected mammograms and the density categories in their CRFs, it is straightforward to regard density prediction as a classification task. However, breast density prediction is not a typical classification task. The BI-RADS criteria for breast density are 1) the portion of parenchyma within a breast, which is a discretization of a continuous score, and 2) specific dense parenchymal patterns in parts of the image, determined by the reader. Thus, the density labels in a training dataset carry inter-reader biases.
Intensity normalization of mammograms is an important factor when grading breast density, since the mammographic parenchymal pattern is highly correlated with the pixel intensity. However, the intensity distribution of the parenchyma and the fat tissue varies across imaging device vendors as well as hospitals. To compensate for these variations, readers often manually adjust the contrast of each mammogram to determine the grade properly.
In this paper, we propose two methods that tackle the problems caused by normalization and inter-reader grading variance. The first is a learnable normalization module, called the photometric transformer network (PTN), that predicts the normalization parameters of an input mammogram. It connects seamlessly to the main prediction network, so that the optimal normalization and the density grade can be learned jointly. The second is a label distillation method, a type of pseudo-label technique, that takes the grading variation into consideration.
Our experiments show that the two proposed methods help to improve performance, especially in multi-site configurations. Our final model outperforms other publicly available previous models on a test set with neutral configurations. Experimental results show that the proposed method improves the accuracy and dAUC (a novel evaluation metric for density grading) from 55% to 79% and from 0.9204 to 0.9663, respectively. The proposed method also outperforms previous state-of-the-art methods in an evaluation on external test data collected from a separate institution for a fair comparison between similar approaches.

Figure 1. An overview of our model architecture. Before a mammogram is input to the classifier network, its intensity distribution is transformed using the parameter set S produced by the transformer network.
2. Related works
With the drastic advance of deep learning, breast density prediction based on deep neural networks has also been introduced recently. [4] applied unsupervised feature learning based on an auto-encoder to predict breast density. [7, 9, 14] employed convolutional neural networks (CNNs) trained with a cross-entropy loss for breast density prediction. Motivated by these approaches, we also cast breast density prediction as a CNN-based classification task, but address two practical problems caused by the multi-site configuration.
From the perspective of dynamically estimating parameters appropriate for a target task, our PTN is similar to the spatial transformer network [3]. The spatial transformer network predicts appropriate geometric transformation parameters, while our PTN tries to find a set of photometric transformation parameters that is optimal for breast density prediction.
The proposed label distillation is motivated by pseudo-labeling techniques devised especially for handling label noise [8, 12]. In [8], an auxiliary network trained on a small set of clean examples is used to predict pseudo-labels, in addition to the main network trained on a large set of examples with the given pseudo-labels. Similarly, in [12], a sub-network jointly optimized with the main network tries to find appropriate pseudo-labels. Our approach is distinct from [8, 12] in that pseudo-labels are given only to selected samples and are applied iteratively to prevent distillation of model bias.
3. Methods
A density estimator f is a neural network that predicts the breast density y ∈ {a, b, c, d} from an input mammogram x ∈ ℝ^{H×W}. The input x is normalized by the PTN, denoted by f_n, and the classifier f_c estimates the density from the normalized input as

$$ \hat{y} = f_c(f_n(x; \theta_n); \theta_c), \quad (1) $$
where f_n and f_c are parameterized by θ_n and θ_c, respectively. Our goal is to learn the parameters θ = θ_n ∪ θ_c with our dataset D = {(x_i, y_i) | i = 1, ..., N}:

$$ \theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{(x,y) \in D} \mathcal{L}(\hat{y}, y), \quad (2) $$
where L is the loss function. To estimate θ successfully, we propose the photometric transformer module f_n in Section 3.1 and a distillation method to handle the label grading-variance problem in Section 3.2.
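For concreteness, here is a minimal PyTorch-style sketch of this joint optimization, with toy stand-ins for f_n and f_c (the real architectures appear in Sections 3.1 and 4.1.2; this is our illustration, not the released code):

import torch
import torch.nn as nn

# Toy stand-ins for f_n (the PTN) and f_c (the classifier).
ptn = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1))           # f_n(.; θn)
classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                           nn.Flatten(), nn.Linear(1, 4))    # f_c(.; θc)
criterion = nn.CrossEntropyLoss()                            # L in Eq. (2)

# θ = θn ∪ θc: both modules are optimized jointly.
optimizer = torch.optim.SGD(
    list(ptn.parameters()) + list(classifier.parameters()), lr=0.1)

x = torch.randn(2, 1, 64, 64)   # a toy batch of mammograms
y = torch.tensor([0, 2])        # density grades a..d encoded as 0..3

logits = classifier(ptn(x))     # Eq. (1): y_hat = f_c(f_n(x; θn); θc)
loss = criterion(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()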
3.1. Photometric transformer networks
The module f_n normalizes an input x with a function h. The function h is determined by a parameter set S, which is predicted by a CNN g from the input x. For a pixel intensity x(i, j) at location (i, j), this can be expressed as

$$ x_n(i, j) = h(x(i, j), S), \quad \text{where } S = g(x; \theta_n), \quad (3) $$

as illustrated on the left of Figure 1.

We now introduce a function h that works well for breast density prediction. Let us assume the intensity range of interest is [u, v)¹. We split the range into K sub-intervals T_k = [u + t(k − 1), u + tk), where t = (v − u)/K and k = 1, ..., K. Then, the function h is defined as
$$ h(x(i,j), S) =
\begin{cases}
u + s_0\,(x(i,j) - u), & x(i,j) \in (-\infty, u) \\
u + \sum_{l=1}^{k-1} s_l\, t + s_k\,(x(i,j) - \min(T_k)), & x(i,j) \in T_k \\
u + \sum_{l=1}^{K} s_l\, t + s_{K+1}\,(x(i,j) - v), & x(i,j) \in [v, \infty)
\end{cases} \quad (4) $$

¹Generally, [u, v) is determined by the window center and width values in the standard DICOM protocol.

where S = {s_0, ..., s_{K+1}} and min(T_k) is the minimum value of the interval T_k. Each component of S can be interpreted as the slope of the corresponding line segment. Figure 2 illustrates this function.

Figure 2. Example graph of the proposed function h (shown with slopes s_0, ..., s_4, i.e., K = 3). The domain of h is divided into K + 2 intervals, and the slope of each interval is defined by a parameter generated by the network g.
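The three cases of Eq. (4) can be folded into a single cumulative-clamp expression, which is convenient for a differentiable implementation. The PyTorch sketch below uses that rewriting; it is our illustration under stated assumptions, not the authors' released code, and it applies one slope vector s to a whole tensor (in the full model, g predicts a separate S per image):

import torch

def photometric_transform(x, s, u=0.0, v=1.0):
    """Piecewise-linear transform h of Eq. (4), vectorized over pixels.

    x: tensor of pixel intensities (any shape).
    s: tensor of K+2 slopes [s_0, ..., s_{K+1}] predicted by g.
    [u, v): intensity range of interest (e.g., from the DICOM window).
    """
    K = s.numel() - 2
    t = (v - u) / K                                       # sub-interval width
    out = u + s[0] * torch.clamp(x - u, max=0.0)          # region x < u
    for k in range(1, K + 1):                             # region T_k
        out = out + s[k] * torch.clamp(x - u - (k - 1) * t, 0.0, t)
    out = out + s[K + 1] * torch.clamp(x - v, min=0.0)    # region x >= v
    return out

# Example: the transform is the identity when all slopes are 1.
x = torch.rand(3, 3)
s = torch.ones(12)        # K = 10 gives K + 2 slopes
assert torch.allclose(photometric_transform(x, s), x, atol=1e-6)

Since every output pixel is an affine function of the slopes, gradients flow back to g through S, which is what allows f_n and f_c to be trained jointly.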
The function h is continuous but can fluctuate if part of S is negative. To make h an increasing function, we add a hinge regularization term to the cross-entropy loss L_CE. The loss function in Equation (2) is finally defined as

$$ \mathcal{L}(\hat{y}, y) = \mathcal{L}_{CE}(\hat{y}, y) + \lambda \sum_{i=0}^{K+1} \big( \max(-s_i, -\epsilon) + \epsilon \big), \quad (5) $$

where ε is a small positive constant and λ is a scaling constant. Each summand equals max(ε − s_i, 0), so the term penalizes any slope below ε and vanishes otherwise. We have empirically found that adding this regularization term yields better performance.
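As a sketch, the regularized loss of Eq. (5) might be implemented as follows; lam and eps are hypothetical values, since the paper does not report the constants it used:

import torch
import torch.nn.functional as F

def ptn_loss(logits, target, s, lam=1.0, eps=1e-3):
    """Cross-entropy plus the hinge regularizer of Eq. (5).

    Each hinge term equals max(eps - s_i, 0): zero once the slope s_i
    reaches eps, growing linearly as s_i falls below it, which pushes h
    toward an increasing function. lam and eps are assumed constants.
    """
    ce = F.cross_entropy(logits, target)
    hinge = (torch.clamp(-s, min=-eps) + eps).sum()  # Σ (max(-s_i, -ε) + ε)
    return ce + lam * hinge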
3.2. Label distillation
We propose a novel approach to train a model in the presence of inter-reader (i.e., inter-labeler) grading variance. The dataset D is divided into D_s, which is labeled by a single reader, and the rest, D_r. D_s is a small set that is free from the variance, while D_r is a large set subject to the variance of arbitrary readers.
The proposed method consists of three stages. In the first stage, training is performed with D_r, as in a typical machine-learning pipeline. In the second stage, the model is fine-tuned with D_s. This transfer-learning strategy reduces the inter-reader variance while utilizing the general mid-level image representations learned from the large number of samples in D_r.

In the third stage, the labels in D_r are replaced by pseudo-labels generated by the model produced in the second stage. The new labels {ŷ | x ∈ D_r} follow grading criteria closer to those of the single-reader labels in D_s. This procedure, from the first to the third stage, is repeated until the model converges.
Pseudo-labeling distills not only the grading criterion used in labeling D_s, but also the knowledge of the model, as
Algorithm 1 Label distillation
  θ := arg min_θ (1/N) Σ_{(x,y)∈D} L(f(x; θ), y)
  Split D into D_s and D_r
  while not converged do
      θ := arg min_θ (1/N) Σ_{(x,y)∈D_s} L(f(x; θ), y)
      for (x, y) ∈ D_r do
          if KLD(y, f(x; θ)) is in the top γ% then
              y := αy + (1 − α) · f(x; θ)
          end if
      end for
      D_train := D_s ∪ D_r
      θ := arg min_θ (1/N) Σ_{(x,y)∈D_train} L(f(x; θ), y)
  end while
  return θ
Table 1. Dataset configurations

Datasets \ Grades        a       b        c       d     Total
Training set Dr      1,395   6,905   33,282   4,773    46,355
Training set Ds         72     391      428     255     1,146
Validation set          78     373      421     275     1,147
Test set                 9     280      455     242       986
External test set      852   3,130    3,634     590     8,206
argued in [2]. Unfortunately, hard samples that are wrongly labeled by the model fitted to D_s distill inaccurate knowledge.
Our method tries to filter out the hard samples to prevent conveying this inaccurate knowledge. To this end, we measure the divergence KL(y, ŷ) between the one-hot encoded label y and the prediction ŷ for each sample x ∈ D_r. We empirically found that hard samples tend to have relatively small divergence values compared with inaccurately labeled samples. We therefore select the top γ percent of samples by KL(y, ŷ), which filters out the hard samples. For each selected sample, we then update y with ŷ by the blending operation

$$ y := \alpha y + (1 - \alpha)\,\hat{y}, \quad (6) $$
where α is a constant blending factor. We then continue
training with the updated Dr. This procedure is repeated
until the model converges.
The whole procedure of the proposed method is con-
cretely described in Algorithm 1.
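A condensed sketch of the relabeling step of Algorithm 1 is given below, assuming a trained model and D_r stored as a list of (image, label) pairs; the surrounding training stages on D, D_s, and D_s ∪ D_r are omitted, and the helper name is our own:

import torch
import torch.nn.functional as F

def relabel_step(model, Dr, alpha=0.5, gamma=0.25):
    """One relabeling pass of Algorithm 1 (a sketch, not released code).

    Dr holds (image, label) pairs with labels as 4-dim probability
    vectors (initially one-hot). Only the top gamma fraction of samples
    by KL(y, y_hat) -- the likely mislabeled ones -- are blended with
    the prediction; low-divergence "hard" samples are left untouched.
    """
    model.eval()
    with torch.no_grad():
        preds = [F.softmax(model(x.unsqueeze(0)), dim=-1).squeeze(0)
                 for x, _ in Dr]
    eps = 1e-8  # numerical stability of the log
    klds = torch.stack([(y * ((y + eps) / (p + eps)).log()).sum()
                        for (_, y), p in zip(Dr, preds)])
    k = max(1, int(gamma * len(Dr)))
    for i in torch.topk(klds, k).indices.tolist():
        x, y = Dr[i]
        Dr[i] = (x, alpha * y + (1 - alpha) * preds[i])  # Eq. (6)
    return Dr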
4. Evaluation
4.1. Experimental setup
4.1.1 Datasets
We have collected 48,648 cases of Asian women from 5 separate hospitals in South Korea. Each case comprises four mammograms with different views: left CC, left MLO, right CC, and right MLO. We also select approximately 5% of the samples and have their labels refined by a single reader, i.e., a radiologist specializing in breast imaging. Half of these samples are used for D_s, and the rest for a validation set. As an in-house test set, we have collected 986 cases from another institution in South Korea; the same radiologist labeled this test set. To fairly compare ours with other methods, we have collected another test set (the external test set), comprising 8,206 cases, from a large hospital in the US. We extract the density grade for each case from the CRF field and use it as the label. Table 1 summarizes our datasets.
4.1.2 Baseline
For the classifier f_c, we adopt ResNet-18 and make it produce a 4-dimensional softmax output. We use the SGD optimizer with a learning rate of 0.1. The model takes a single mammogram as input, and the four predictions from the four views are averaged into a case-level prediction. We decode mammograms using the window center and width embedded in the DICOM header.
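A minimal sketch of this baseline and the case-level view averaging; feeding grayscale mammograms as 3-channel tensors and the helper name are our assumptions:

import torch
import torchvision

# Baseline classifier f_c: ResNet-18 with a 4-way softmax head.
model = torchvision.models.resnet18(num_classes=4)

def case_level_prediction(views):
    """Average the softmax outputs of the four views (LCC, LMLO, RCC, RMLO)."""
    model.eval()
    with torch.no_grad():
        probs = [torch.softmax(model(v.unsqueeze(0)), dim=-1) for v in views]
    return torch.stack(probs).mean(dim=0)   # case-level 4-class probability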
To check the sanity of our baseline network, we compare it with the other neural-network methods [7] and [14]. A roundabout comparison is unavoidable, since all scores reported in other works were obtained with different configurations on private datasets. In [7] and [14], the training and test splits were collected at the same site, while our training and test splits come from different sites. To keep the experimental configuration as similar as possible, we follow the experimental settings of [7] and [14]: we split our main dataset into three parts, using D_r for training, D_s for validation, and the validation set of our original split as the test set.

We track the same two metrics as [7] and [14]: 4-class accuracy and 2-class (fatty vs. dense) accuracy. Class-averaged accuracy is used in this sanity check. Both papers report approximately 77% 4-class accuracy and 87% 2-class accuracy, while our baseline reports 74% and 89%. Our baseline is inferior to the other models in 4-class accuracy but superior in 2-class accuracy. Interestingly, [7] and [14] also report almost the same scores as each other on their in-house test sets. Putting these results together, we conclude that our baseline has accuracy similar to other works: when generalization across inter-reader variance is not considered, models trained on entirely different datasets with different hyper-parameters show almost the same capability for classifying breast density. Note that the accuracy scores above are class-averaged, to be comparable with other works, whereas all other accuracy scores reported in this paper are instance-wise.
4.1.3 Metrics
We use 4-way classification accuracy, as it has been the common metric in previous works. Unfortunately, class-averaged accuracy scores may be misleading on our test set, which suffers from a class-imbalance problem: there are only 9 samples of category a. Under class-averaged accuracy, for example, a sample of category a contributes 455/9 ≈ 50.6 times more to the score than a sample of category c. We therefore use instance-wise accuracy instead of class-averaged accuracy to relax this problem.
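The difference between the two metrics is easy to state in code; the following sketch is only for illustration:

import numpy as np

def instance_accuracy(y_true, y_pred):
    # Every sample carries equal weight.
    return float(np.mean(y_true == y_pred))

def class_averaged_accuracy(y_true, y_pred):
    # Mean of per-class recalls: a class with 9 samples weighs as much as
    # one with 455, so each of its samples counts ~50x more.
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))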
Moreover, the accuracy metric itself becomes problematic once the inter-reader variance is taken into account, because breast density prediction is not a typical classification task. Whatever grading criterion is used to choose a discrete category, some samples remain ambiguous between two adjacent grades, since the density grade is a discretization of a continuous density score whose actual physical quantity is the portion of parenchyma in a breast. In addition, there is an ordinal relation between the labels, which the accuracy metric ignores; for example, grade a is closer to b than to c or d.
To take these issues into account, we propose a new metric called density-AUC (dAUC), designed for breast density estimation algorithms. The metric aggregates AUC scores between a model's density predictions and binarized breast density categories. Since AUC requires binary labels (negative or positive), the breast density labels y ∈ {a, b, c, d} are split in three ways: [a vs. b, c, d], [a, b vs. c, d], and [a, b, c vs. d]. As a result, we obtain three label sets, i.e., three sub-problems, for a given dataset. Samples with the left-side density categories are assigned to the negative class (0), and those with the right-side categories to the positive class (1).

In addition to label binarization, the network's predictions need to be reduced to a single real-valued score for each sub-problem. The output of the proposed model is a vector of length 4 whose components represent the probabilities of the classes (a, b, c, d). We sum the probabilities of the positive categories in each sub-problem. For instance, when measuring the AUC of [a, b vs. c, d], the sample score is defined as ŷ_c + ŷ_d, where ŷ_c and ŷ_d are the corresponding elements of the softmax output ŷ. This score satisfies our implicit assumption: a lower value for a fatty breast and a higher value for a dense breast. The final dAUC score is the average over the three sub-problems.
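A compact sketch of the dAUC computation, using scikit-learn's AUC implementation (the function name and array layout are our own):

import numpy as np
from sklearn.metrics import roc_auc_score

def dauc(y_true, y_prob):
    """dAUC over the three ordinal splits of grades {a, b, c, d} = {0..3}.

    y_true: (N,) integer grades; y_prob: (N, 4) softmax outputs.
    For each split, the sample score is the summed probability of the
    positive (denser) categories, e.g. y_c + y_d for [a, b vs. c, d].
    """
    aucs = []
    for cut in (1, 2, 3):                      # [a|bcd], [ab|cd], [abc|d]
        labels = (y_true >= cut).astype(int)   # denser side -> positive (1)
        scores = y_prob[:, cut:].sum(axis=1)   # summed positive probabilities
        aucs.append(roc_auc_score(labels, scores))
    return float(np.mean(aucs))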
Note that dAUC is considered a complement to the accuracy metric, not a substitute. Accuracy is still tracked as an important metric, since producing density scores that obey radiologists' conventions also matters in practical use.

Figure 3. Original mammograms (1), (3) with densities a and d are normalized to (2), (4) by the PTN. For both samples, the labels are corrected to b.
4.2. Results and analysis
4.2.1 Photometric transformer networks
We use 6 convolution layers for the transformer network g and set K to 10. Each mammogram is resized to one-third of its original size. We also use instance normalization [13] after each convolution, rather than batch normalization; we empirically found that images transformed by the PTN are better normalized when the PTN contains instance normalization.
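The paper fixes only the layer count, K = 10, and the use of instance normalization, so the sketch below fills in hypothetical channel widths, kernel sizes, and a pooling head:

import torch.nn as nn

K = 10  # number of sub-intervals within [u, v)

def conv_block(cin, cout):
    # Convolution followed by instance normalization [13].
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.InstanceNorm2d(cout), nn.ReLU())

# The transformer network g: 6 convolution layers ending in K + 2 slopes.
g = nn.Sequential(
    conv_block(1, 16), conv_block(16, 32), conv_block(32, 32),
    conv_block(32, 64), conv_block(64, 64), conv_block(64, 64),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, K + 2),     # S = {s_0, ..., s_{K+1}}
)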
The standard deviation of the image-level mean pixel values in the validation set is 0.6850. Once the PTN normalizes the validation images, the standard deviation is significantly reduced to 0.2249. This verifies that the PTN suppresses inter-image intensity variation and makes the image intensities more consistent. Fig. 3 shows some normalization examples.
The upper part of Table 2 shows the results of the normalization methods. CLAHE [10] is selected as a representative static normalization approach. CLAHE improves on the baseline, but the PTN shows better performance.
4.2.2 Label distillation
For label distillation, we choose the best PTN model as our baseline. The pre-trained parameters of the first two layers are fixed, and the rest are trained with a learning rate of 0.01. We set α and γ to 0.5 and 0.25, respectively. The lower part of Table 2 shows the results.
Hard labeling, which directly uses the predictions of the PTN model as pseudo-labels, shows a higher accuracy score than the baseline but a lower dAUC score. [12] is another approach, which uses soft pseudo-labels. To make [12] fairly comparable to our method, we fine-tune with D_s at each epoch before assigning pseudo-labels. This variant improves both accuracy and dAUC, but the gains are not significant. In contrast, our label distillation method yields clear performance gains over the baseline.
4.3. Comparison with others
We conduct another experiment with different settings to compare our models fairly with other works. Instead of the test set used in Section 4.2, the external test set is used for this experiment (see Section 4.1.1 for details). Two of our models are evaluated: the baseline, and the model with the proposed PTN and label distillation applied. Each model is selected by the median accuracy on the in-house validation set among five trials. As external algorithms, we select LIBRA [5], an open-source density predictor, and previous works [7, 14] that have released their model parameters publicly.

The results are shown in Table 3. Although our model is trained on data from a different race, our best model achieves the best performance by large margins in all metrics.
5. Conclusion
In this paper, we have proposed two methods for the breast density prediction problem: PTN and label distillation. These two methods address the input and label issues of the task, respectively. For further research, a strict validation of how suitable the dAUC metric is for breast density tasks is needed. Additionally, since our approach is not limited to a specific task, it should be examined in a broader view and applied to various medical imaging problems.

Table 2. Breast density estimation performance comparison between methods. The mean and standard deviation of 5 trials are reported as mean (std).

                            Validation                      Test
Methods                     Accuracy       dAUC             Accuracy       dAUC
Baseline                    .7015 (.0179)  .9595 (.0153)    .5452 (.1078)  .9204 (.0207)
CLAHE [10]                  .7374 (.0291)  .9654 (.0154)    .7163 (.0341)  .9357 (.0128)
PTN                         .7479 (.0229)  .9755 (.0013)    .7509 (.0103)  .9518 (.0079)
PTN²                        .7512 (.0109)  .9757 (.0014)    .7431 (.0046)  .9470 (.0045)
PTN + hard labeling         .7367 (.0150)  .9715 (.0026)    .7671 (.0039)  .9392 (.0155)
PTN + [12]                  .7428 (.0126)  .9745 (.0018)    .7650 (.0128)  .9482 (.0067)
PTN + [12] in Ds            .7576 (.0118)  .9776 (.0015)    .7743 (.0029)  .9442 (.0029)
PTN + label distillation    .8073 (.0043)  .9808 (.0009)    .7941 (.0060)  .9663 (.0033)

² The 3 median results of the above PTN trials, in terms of validation accuracy.
Table 3. Breast density estimation performance comparison between methods on the external test set.

Methods       Accuracy  dAUC
LIBRA³        -         .8877
[7]           .5860     .9275
[14]          .5419     .8424
Our baseline  .4246     .9185
Ours          .7257     .9481

³ LIBRA produces a percent density value directly, thus only dAUC is reported.
References

[1] N. F. Boyd, H. Guo, L. J. Martin, L. Sun, J. Stone, E. Fishell, R. A. Jong, G. Hislop, A. Chiarelli, S. Minkin, and M. J. Yaffe. Mammographic Density and the Risk and Detection of Breast Cancer. New England Journal of Medicine, 356(3):227-236, Jan 2007.

[2] G. Hinton, O. Vinyals, and J. Dean. Distilling the Knowledge in a Neural Network. In NIPS Deep Learning and Representation Learning Workshop, 2015.

[3] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial Transformer Networks. In Advances in Neural Information Processing Systems 28 (NIPS), pages 2017-2025, Jun 2015.

[4] M. Kallenberg, K. Petersen, M. Nielsen, A. Y. Ng, P. Diao, C. Igel, C. M. Vachon, K. Holland, R. R. Winkel, N. Karssemeijer, and M. Lillholm. Unsupervised Deep Learning Applied to Breast Density Segmentation and Mammographic Risk Scoring. IEEE Transactions on Medical Imaging, 35(5):1322-1331, May 2016.

[5] B. M. Keller, D. L. Nathan, Y. Wang, Y. Zheng, J. C. Gee, E. F. Conant, and D. Kontos. Estimation of breast percent density in raw and processed full field digital mammography images via adaptive fuzzy c-means clustering and support vector machine segmentation. Medical Physics, 39(8):4903-17, Aug 2012.

[6] K. Kerlikowske, D. Grady, J. Barclay, E. A. Sickles, and V. Ernster. Effect of Age, Breast Density, and Family History on the Sensitivity of First Screening Mammography. The Journal of the American Medical Association, 276(1):33, Jul 1996.

[7] C. D. Lehman, A. Yala, T. Schuster, B. Dontchos, M. Bahl, K. Swanson, and R. Barzilay. Mammographic Breast Density Assessment Using Deep Learning: Clinical Implementation. Radiology, 290(1):52-58, Jan 2019.

[8] Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L. J. Li. Learning from Noisy Labels with Distillation. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1928-1936, Oct 2017.

[9] A. A. Mohamed, W. A. Berg, H. Peng, Y. Luo, R. C. Jankowitz, and S. Wu. A deep learning method for classifying mammographic breast density categories. Medical Physics, 45(1):314-321, Jan 2018.

[10] S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. ter Haar Romeny, J. B. Zimmerman, and K. Zuiderveld. Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing, 39(3):355-368, Sep 1987.

[11] E. A. Sickles, C. J. D'Orsi, L. W. Bassett, et al. ACR BI-RADS® Atlas, Breast Imaging Reporting and Data System. American College of Radiology, Reston, US, 5th edition, 2013.

[12] D. Tanaka, D. Ikami, T. Yamasaki, and K. Aizawa. Joint Optimization Framework for Learning with Noisy Labels. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5552-5560, Jun 2018.

[13] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance Normalization: The Missing Ingredient for Fast Stylization. arXiv preprint, Jul 2016.

[14] N. Wu, K. J. Geras, Y. Shen, J. Su, S. G. Kim, E. Kim, S. Wolfson, L. Moy, and K. Cho. Breast Density Classification with Deep Convolutional Neural Networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6682-6686, Apr 2018.