Multiple-instance Learning from Triplet Comparison Bags

Published: 12 February 2024, ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 4
    Abstract

    Multiple-instance learning (MIL) addresses the problem where training instances are grouped into bags and a binary (positive or negative) label is provided for each bag. Most existing MIL studies require fully labeled bags to train an effective classifier, yet such data can be hard to collect in many real-world scenarios due to the high cost of the labeling process. Fortunately, unlike fully labeled data, triplet comparison data can be collected in a more accurate and human-friendly way. Therefore, in this article, we investigate for the first time MIL from only triplet comparison bags, where a triplet (Xa, Xb, Xc) carries the weak supervision that bag Xa is more similar to Xb than to Xc. To solve this problem, we propose to train a bag-level classifier within the empirical risk minimization framework and theoretically provide a generalization error bound. We also show that a convex formulation can be obtained only when specific convex binary losses, such as the square loss and the double hinge loss, are used. Extensive experiments validate that our proposed method significantly outperforms other baselines.
    Appendices

    A Generation Process of Triplet Comparison Bags

    Recall the assumption that the three bags in a triplet are sampled independently. Therefore, for a triplet \((X_{a}, X_{b}, X_{c})\) , if the first bag is more similar to the second bag than to the third bag, the bag labels \((Y_a,Y_b,Y_c)\) can only be one of the following cases:
    \(\begin{eqnarray*} \nonumber \mathcal {Y}_{1} = \lbrace (+1,+1,+1),(+1,+1,-1),(+1,-1,-1),(-1,+1,+1),(-1,-1,+1),(-1,-1,-1)\rbrace . \end{eqnarray*}\)
    Otherwise, the first bag is more similar to the third bag than to the second bag, and in this case, \((Y_a,Y_b,Y_c)\) must be one of the following cases:
    \(\begin{eqnarray*} \nonumber \mathcal {Y}_{2} = \lbrace (+1,-1,+1),(-1,+1,-1)\rbrace . \end{eqnarray*}\)
    According to the above label sets \(\mathcal {Y}_{1}\) and \(\mathcal {Y}_{2}\) , we can collect two distinct types of datasets as follows:
    \(\begin{eqnarray*} \nonumber \mathcal {D}_{1} = \lbrace (X_{a},X_{b},X_{c})|(Y_a,Y_b,Y_c)\in \mathcal {Y}_{1}\rbrace , \quad \mathcal {D}_{2} = \lbrace (X_{a},X_{b},X_{c})|(Y_a,Y_b,Y_c)\in \mathcal {Y}_{2}\rbrace . \end{eqnarray*}\)
    The two types of datasets \(\mathcal {D}_{1}\) and \(\mathcal {D}_{2}\) can be considered to be generated from the following underlying distributions:
    \(\begin{eqnarray*} \nonumber p_{1}(X_{a},X_{b},X_{c}) &= \frac{p(X_{a},X_{b},X_{c}, (Y_a,Y_b,Y_c)\in \mathcal {Y}_{1})}{\theta _{T}}, \\ \nonumber p_{2}(X_{a},X_{b},X_{c}) &= \theta _{+}p_{+}(X_{a})p_{-}(X_{b})p_{+}(X_{c}) + \theta _{-}p_{-}(X_{a})p_{+}(X_{b})p_{-}(X_{c}), \end{eqnarray*}\)
    where \(\theta _{T} = 1- \theta _{+}\theta _{-}\) , \(\theta _{+} = p(y=+1)\) , \(\theta _{-} = p(y=-1)\) , \(p_{+}(X)=p(X|y=+1)\) , and \(p_{-}(X)=p(X|y=-1)\) . Then, we have
    \(\begin{eqnarray*} \nonumber \mathcal {D}_1 = \lbrace (X_{1,a},X_{1,b},X_{1,c})\rbrace ^{m_1}\sim p_{1}(X_{a},X_{b},X_{c}), \quad \mathcal {D}_2 = \lbrace (X_{2,a},X_{2,b},X_{2,c})\rbrace ^{m_2}\sim p_{2}(X_{a},X_{b},X_{c}). \end{eqnarray*}\)
    Furthermore, we denote the pointwise data collected from \(\mathcal {D}_1\) and \(\mathcal {D}_2\) by ignoring the triplet comparison relation as \(\mathcal {D}_{1,a} = \lbrace X_{1,a}\rbrace ^{m_{1}}\) , \(\mathcal {D}_{1,b} = \lbrace X_{1,b}\rbrace ^{m_{1}}\) , \(\mathcal {D}_{1,c} = \lbrace X_{1,c}\rbrace ^{m_{1}}\) , \(\mathcal {D}_{2,a} = \lbrace X_{2,a}\rbrace ^{m_{2}}\) , \(\mathcal {D}_{2,b} = \lbrace X_{2,b}\rbrace ^{m_{2}}\) and \(\mathcal {D}_{2,c} = \lbrace X_{2,c}\rbrace ^{m_{2}}\) . From Theorem 1 in Cui et al. [12], samples in \(\mathcal {D}_{1,a}\) , \(\mathcal {D}_{1,c}\) , \(\mathcal {D}_{2,a}\) and \(\mathcal {D}_{2,c}\) are independently drawn from
    \(\begin{eqnarray*} \nonumber \tilde{p}_{1}(X) = \theta _{+}p_{+}(X) + \theta _{-}p_{-}(X), \end{eqnarray*}\)
    samples in \(\mathcal {D}_{1,b}\) are independently drawn from
    \(\begin{eqnarray*} \nonumber \tilde{p}_{2}(X) = \frac{(\theta _{+}^3+2\theta _{+}^2\theta _{-})p_{+}(X) + (2\theta _{+}\theta _{-}^2+ \theta _{-}^3)p_{-}(X)}{\theta _{T}}, \end{eqnarray*}\)
    and samples in \(\mathcal {D}_{2,b}\) are independently drawn from
    \(\begin{eqnarray*} \nonumber \tilde{p}_{3}(X) = \theta _{-}p_{+}(X) + \theta _{+}p_{-}(X). \end{eqnarray*}\)
    These results indicate that, from triplet comparison data, we can essentially obtain samples drawn independently from three different distributions. We then denote the three aggregated datasets (from the respective distributions) as
    \(\begin{eqnarray*} \nonumber \tilde{\mathcal {D}}_{1} =\lbrace \tilde{X}_{i}^{1}\rbrace ^{n_1}_{i=1}= \mathcal {D}_{1,a}\cup \mathcal {D}_{1,c}\cup \mathcal {D}_{2,a}\cup \mathcal {D}_{2,c}, \quad \tilde{\mathcal {D}}_{2} =\lbrace \tilde{X}_{i}^{2}\rbrace ^{n_2}_{i=1} = \mathcal {D}_{1,b}, \quad \tilde{\mathcal {D}}_{3} =\lbrace \tilde{X}_{i}^{3}\rbrace ^{n_3}_{i=1} = \mathcal {D}_{2,b}, \end{eqnarray*}\)
    where
    \(\begin{eqnarray*} \nonumber \tilde{\mathcal {D}}_{1} \sim \tilde{p}_{1}(X), \quad \tilde{\mathcal {D}}_{2} \sim \tilde{p}_{2}(X), \quad \tilde{\mathcal {D}}_{3} \sim \tilde{p}_{3}(X). \end{eqnarray*}\)
    Letting \(C = \frac{\theta _{+}^3+2\theta _{+}^2\theta _{-}}{\theta _{T}}\) and \(D = \frac{2\theta _{+}\theta _{-}^2+ \theta _{-}^3}{\theta _{T}}\) , we can express the relationship among these densities as
    \(\begin{eqnarray*} \nonumber \begin{bmatrix} \tilde{p}_{1}(X)\\ \tilde{p}_{2}(X)\\ \tilde{p}_{3}(X) \end{bmatrix} = \begin{bmatrix} \theta _{+} &\theta _{-}\\ C & D\\ \theta _{-} &\theta _{+} \end{bmatrix}\begin{bmatrix} p_{+}(X)\\ p_{-}(X) \end{bmatrix}. \end{eqnarray*}\)
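    To make the generation process concrete, the following minimal simulation sketch (ours, not from the paper; the class prior \(\theta _{+}=0.6\) and the sample size are arbitrary assumptions) draws i.i.d. label triples, splits them into \(\mathcal {D}_{1}\) and \(\mathcal {D}_{2}\) according to whether the labels fall in \(\mathcal {Y}_{1}\) or \(\mathcal {Y}_{2}\) , aggregates them into \(\tilde{\mathcal {D}}_{1}\) , \(\tilde{\mathcal {D}}_{2}\) , and \(\tilde{\mathcal {D}}_{3}\) , and checks that the empirical positive-class proportions match the coefficients \(\theta _{+}\) , \(C\) , and \(\theta _{-}\) of \(p_{+}(X)\) in the matrix above.

    import numpy as np

    # Minimal sketch (not the authors' code): simulate the label-level generation
    # process of Appendix A and check the positive-class weights of p~1, p~2, p~3.
    rng = np.random.default_rng(0)
    theta_pos = 0.6                      # assumed class prior theta_+
    theta_neg = 1.0 - theta_pos          # theta_-
    theta_T = 1.0 - theta_pos * theta_neg

    # Draw i.i.d. label triples (Y_a, Y_b, Y_c).
    m = 200_000
    labels = rng.choice([+1, -1], size=(m, 3), p=[theta_pos, theta_neg])

    # A triple lies in Y_2 exactly when it has the pattern (+1,-1,+1) or (-1,+1,-1).
    in_Y2 = (labels[:, 0] == labels[:, 2]) & (labels[:, 0] != labels[:, 1])
    D1, D2 = labels[~in_Y2], labels[in_Y2]

    # Aggregate as in Appendix A: D~1 = first/third bags of both datasets,
    # D~2 = middle bags of D_1, D~3 = middle bags of D_2.
    D_tilde1 = np.concatenate([D1[:, 0], D1[:, 2], D2[:, 0], D2[:, 2]])
    D_tilde2, D_tilde3 = D1[:, 1], D2[:, 1]

    # Theoretical positive-class weights of p~1, p~2, p~3.
    C = (theta_pos**3 + 2 * theta_pos**2 * theta_neg) / theta_T
    print("p~1:", (D_tilde1 == 1).mean(), "vs", theta_pos)
    print("p~2:", (D_tilde2 == 1).mean(), "vs", C)
    print("p~3:", (D_tilde3 == 1).mean(), "vs", theta_neg)

    Because each bag is drawn from \(p_{+}(X)\) or \(p_{-}(X)\) according to its label, matching these label proportions is equivalent to matching the mixture weights of the three densities.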

    B Proof of Theorem 1

    Recall that by using a loss function that satisfies the linear-odd condition, \(\widehat{R}_{\mathrm{Trip}}(g)\) can also be represented as
    \(\begin{align*} \widehat{R}_{\mathrm{Trip}}(g) =&\,\, \frac{1}{n_1}\sum \limits _{i=1}^{n_1}\Big ((\lambda _1+\lambda _2)\ell _+\left(g\left(X_i^1\right)\right)+\lambda _2g(X_i^1)\Big) +\frac{1}{n_2}\sum \limits _{i=1}^{n_2}\Big ((\lambda _3+\lambda _4)\ell _+\left(g\left(X_i^2\right)\right)+\lambda _4 g\left(X_i^2\right)\!\Big)\\ &+\frac{1}{n_3}\sum \limits _{i=1}^{n_3}\Big ((\lambda _5+\lambda _6)\ell _-\left(g\left(X_i^3\right)\right)-\lambda _5 g\left(X_i^3\right)\!\Big). \end{align*}\)
    In this way, we can represent \({R}_{\mathrm{Trip}}(g)\) as
    \(\begin{align*} {R}_{\mathrm{Trip}}(g) =&\,\, \mathbb {E}_{\widetilde{p}_{1}(X)}\Big [ (\lambda _1+\lambda _2)\ell _+(g(X^1))+\lambda _2 g(X^1)\Big ] +\mathbb {E}_{\widetilde{p}_{2}(X)}\Big [ (\lambda _3+\lambda _4)\ell _+(g(X^2))+\lambda _4 g(X^2)\Big ] \\ &+\mathbb {E}_{\widetilde{p}_{3}(X)}\Big [ (\lambda _5+\lambda _6)\ell _-(g(X^3))-\lambda _5 g(X^3)\Big ], \end{align*}\)
    where we assume that the collected data \(\lbrace X^1_{i}\rbrace _{i=1}^{n_1}\) , \(\lbrace X^2_{i}\rbrace _{i=1}^{n_2}\) , and \(\lbrace X^3_{i}\rbrace _{i=1}^{n_3}\) are sampled independently from \(\widetilde{p}_{1}(X)\) , \(\widetilde{p}_{2}(X)\) , and \(\widetilde{p}_{3}(X)\) , respectively. Let us further introduce
    \(\begin{eqnarray*} \nonumber \widehat{R}_{1}(g) =\frac{1}{n_1}\sum \limits _{i=1}^{n_1}\left((\lambda _1+\lambda _2)\ell _+\left(g\left(X_i^1\right)\!\right) +\lambda _2g\left(X_i^1\right)\!\right)\!, \quad R_{1}(g) =\mathbb {E}_{\widetilde{p}_{1}(X)}\Big [(\lambda _1+\lambda _2)\ell _+(g(X^1))+\lambda _2 g(X^1)\Big ],\\ \nonumber \widehat{R}_{2}(g) = \frac{1}{n_2}\sum \limits _{i=1}^{n_2}\left((\lambda _3+\lambda _4)\ell _+\left(g\left(X_i^2\right)\!\right)+\lambda _4 g\left(X_i^2\right)\!\right)\!, \quad R_{2}(g) =\mathbb {E}_{\widetilde{p}_{2}(X)}\Big [ (\lambda _3+\lambda _4)\ell _+(g(X^2))+\lambda _4 g(X^2)\Big ],\\ \nonumber \widehat{R}_{3}(g) = \frac{1}{n_3}\sum \limits _{i=1}^{n_3}\left((\lambda _5+\lambda _6)\ell _-\left(g\left(X_i^3\right)\right)-\lambda _5 g\left(X_i^3\right)\!\right)\!, \quad R_{3}(g) =\mathbb {E}_{\widetilde{p}_{3}(X)}\Big [ (\lambda _5+\lambda _6)\ell _-(g(X^3))-\lambda _5 g(X^3)\Big ]. \end{eqnarray*}\)
    In this way, we have
    \(\begin{eqnarray*} \nonumber \widehat{R}_{\mathrm{Trip}}(g) =\widehat{R}_{1}(g) + \widehat{R}_{2}(g) + \widehat{R}_{3}(g),\quad {R}_{\mathrm{Trip}}(g) = R_{1}(g) + R_{2}(g) + R_{3}(g). \end{eqnarray*}\)
    Thus,
    \(\begin{eqnarray*} \nonumber \sup _{g\in \mathcal {G}}\left|{R}_{\mathrm{Trip}}(g)-\widehat{R}_{\mathrm{Trip}}(g)\right|\le \sup _{g\in \mathcal {G}}\left|{R}_{1}(g)-\widehat{R}_{1}(g)\right|+\sup _{g\in \mathcal {G}}\left|{R}_{2}(g)-\widehat{R}_{2}(g)\right| +\sup _{g\in \mathcal {G}}\left|{R}_{3}(g)-\widehat{R}_{3}(g)\right|. \end{eqnarray*}\)
    Hence, the problem becomes how to find an upper bound for each term on the right-hand side of the inequality.
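    As a side note, the decomposition \(\widehat{R}_{\mathrm{Trip}}(g) =\widehat{R}_{1}(g) + \widehat{R}_{2}(g) + \widehat{R}_{3}(g)\) is straightforward to compute once \(g\) has been evaluated on the three aggregated datasets. The sketch below (an assumed interface, not the authors' implementation; the coefficients and the hinge-type surrogate are placeholders for illustration) mirrors the three empirical terms defined above.

    import numpy as np

    # Empirical risk decomposition R^_Trip(g) = R^_1(g) + R^_2(g) + R^_3(g).
    # scores_k holds g(X) on the aggregated dataset D~_k, lam = (lambda_1, ..., lambda_6),
    # and ell_plus / ell_minus are the surrogate losses l_+ and l_- from the main text.
    def risk_trip(scores_1, scores_2, scores_3, lam, ell_plus, ell_minus):
        l1, l2, l3, l4, l5, l6 = lam
        r1 = np.mean((l1 + l2) * ell_plus(scores_1) + l2 * scores_1)   # R^_1(g)
        r2 = np.mean((l3 + l4) * ell_plus(scores_2) + l4 * scores_2)   # R^_2(g)
        r3 = np.mean((l5 + l6) * ell_minus(scores_3) - l5 * scores_3)  # R^_3(g)
        return r1 + r2 + r3

    # Illustration only: random scores, placeholder coefficients, hinge-type surrogate.
    rng = np.random.default_rng(1)
    g1, g2, g3 = (rng.normal(size=n) for n in (100, 80, 90))
    lam = (0.5, 0.1, 0.3, 0.2, 0.4, 0.6)
    hinge = lambda z: np.maximum(0.0, 1.0 - z)
    print(risk_trip(g1, g2, g3, lam, hinge, lambda z: hinge(-z)))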
    Lemma 1.
    With the introduced definitions and conditions in Theorem 1, for any \(\delta \gt 0\) , with probability at least \(1-\delta\) , we have
    \(\begin{eqnarray*} \nonumber \sup _{g\in \mathcal {G}}\left| R_{1}(g)-\widehat{R}_{1}(g)\right| \le & (\left|\lambda _1\right|+\left|\lambda _2\right|)\left(\frac{2C_{\mathcal {G}}}{\sqrt {n_1}}+C_{\boldsymbol {w}}C_{\boldsymbol {\phi }}\sqrt {\frac{\log \frac{2}{\delta }}{2n_1}}\right)\!. \end{eqnarray*}\)
    Proof.
    First, it is easy to verify that the double hinge loss \(\ell _{\mathrm{DH}}\) is 1-Lipschitz. If an example in \(\widehat{R}_{1}(g)\) is replaced by another arbitrary example, then the change of \(\sup _{g\in \mathcal {G}}\big (R_{1}(g)-\widehat{R}_{1}(g)\big)\) is no greater than \((\left|\lambda _1\right|+\left|\lambda _2\right|)C_{\boldsymbol {w}}C_{\boldsymbol {\phi }}/n_{1}\) . Then, by applying McDiarmid’s inequality [28], for any \(\delta \gt 0\) , with probability at least \(1-\frac{\delta }{2}\) ,
    \(\begin{eqnarray*} \nonumber \sup _{g\in \mathcal {G}}\big (R_{1}(g)-\widehat{R}_{1}(g)\big) &\le \mathbb {E}\Big [\sup _{g\in \mathcal {G}}\big (R_{1}(g)-\widehat{R}_{1}(g)\big)\Big ]+ (\left|\lambda _1\right|+\left|\lambda _2\right|)C_{\boldsymbol {w}}C_{\boldsymbol {\phi }}\sqrt {\frac{\log \frac{2}{\delta }}{2n_1}}. \end{eqnarray*}\)
    Moreover, it is routine [30] to show that
    \(\begin{eqnarray*} \nonumber \mathbb {E}\Big [\sup _{g\in \mathcal {G}}\big (R_{1}(g)-\widehat{R}_{1}(g)\big)\Big ]\le 2(\left|\lambda _1\right|+\left|\lambda _2\right|)\mathfrak {R}_{n_1}(\mathcal {G}), \end{eqnarray*}\)
    where we have used Talagrand’s lemma (Lemma 4.2 in Mohri et al. [30]), i.e., \(\mathfrak {R}_{n}(\ell \circ \mathcal {G})\le \rho \mathfrak {R}_n(\mathcal {G})\) if \(\ell\) is a \(\rho\) -Lipschitz loss function. By further applying \(\mathfrak {R}_n(\mathcal {G})\le C_{\mathcal {G}}/\sqrt {n}\) , we have
    \(\begin{eqnarray*} \nonumber \sup _{g\in \mathcal {G}}\big (R_{1}(g)-\widehat{R}_{1}(g)\big) \le & (\left|\lambda _1\right|+\left|\lambda _2\right|)\left(\frac{2C_{\mathcal {G}}}{\sqrt {n_1}}+C_{\boldsymbol {w}}C_{\boldsymbol {\phi }}\sqrt {\frac{\log \frac{2}{\delta }}{2n_1}}\right)\!. \end{eqnarray*}\)
    By further taking into account the other side \(\sup _{g\in \mathcal {G}}\big (\widehat{R}_{1}(g)-R_{1}(g)\big)\) , we have for any \(\delta \gt 0\) , with probability at least \(1-\delta\) ,
    \(\begin{eqnarray*} \nonumber \sup _{g\in \mathcal {G}}\left|R_{1}(g)-\widehat{R}_{1}(g)\right| \le & (\left|\lambda _1\right|+\left|\lambda _2\right|)\left(\frac{2C_{\mathcal {G}}}{\sqrt {n_1}}+C_{\boldsymbol {w}}C_{\boldsymbol {\phi }}\sqrt {\frac{\log \frac{2}{\delta }}{2n_1}}\right)\!, \end{eqnarray*}\)
    which completes the proof of Lemma 1. □
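    To make the rate explicit, the following small sketch (with hypothetical values for the constants \(\lambda _1\) , \(\lambda _2\) , \(C_{\mathcal {G}}\) , \(C_{\boldsymbol {w}}\) , and \(C_{\boldsymbol {\phi }}\) ) evaluates the right-hand side of Lemma 1, showing its \(O(1/\sqrt {n_1})\) decay.

    import numpy as np

    # Right-hand side of Lemma 1 with hypothetical constants, as a function of n_1.
    def lemma1_bound(n1, delta, lam1, lam2, C_G, C_w, C_phi):
        coeff = abs(lam1) + abs(lam2)
        return coeff * (2 * C_G / np.sqrt(n1)
                        + C_w * C_phi * np.sqrt(np.log(2 / delta) / (2 * n1)))

    for n1 in (100, 1_000, 10_000):
        print(n1, lemma1_bound(n1, delta=0.05, lam1=0.5, lam2=0.1,
                               C_G=1.0, C_w=1.0, C_phi=1.0))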
    Lemma 2.
    With the introduced definitions and conditions in Theorem 1, for any \(\delta \gt 0\) , with probability at least \(1-\delta\) , we have
    \(\begin{eqnarray*} \nonumber \sup _{g\in \mathcal {G}}\left| R_{2}(g)-\widehat{R}_{2}(g)\right| \le & (\left|\lambda _3\right|+\left|\lambda _4\right|)\left(\frac{2C_{\mathcal {G}}}{\sqrt {n_2}}+C_{\boldsymbol {w}}C_{\boldsymbol {\phi }}\sqrt {\frac{\log \frac{2}{\delta }}{2n_2}}\right)\!. \end{eqnarray*}\)
    Lemma 3.
    With the introduced definitions and conditions in Theorem 1, for any \(\delta \gt 0\) , with probability at least \(1-\delta\) , we have
    \(\begin{eqnarray*} \nonumber \sup _{g\in \mathcal {G}}\left| R_{3}(g)-\widehat{R}_{3}(g)\right| \le & (\left|\lambda _5\right|+\left|\lambda _6\right|)\left(\frac{2C_{\mathcal {G}}}{\sqrt {n_3}}+C_{\boldsymbol {w}}C_{\boldsymbol {\phi }}\sqrt {\frac{\log \frac{2}{\delta }}{2n_3}}\right)\!. \end{eqnarray*}\)
    Lemmas 2 and 3 can be proved in the same way as Lemma 1; hence, we omit their proofs. By combining Lemmas 1, 2, and 3, Theorem 1 is immediately proved. \(\Box\)
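    The two properties of the double hinge loss used above can also be checked numerically. The sketch below (ours; it assumes the double hinge loss \(\ell _{\mathrm{DH}}(z)=\max (-z,\max (0,\tfrac{1}{2}-\tfrac{1}{2}z))\) of du Plessis et al. [14]) verifies that the loss is 1-Lipschitz and satisfies the linear-odd condition \(\ell (z)-\ell (-z)=-z\) , which is what allows \(\widehat{R}_{\mathrm{Trip}}(g)\) to be rewritten as at the start of this appendix.

    import numpy as np

    # Double hinge loss l_DH(z) = max(-z, max(0, 1/2 - z/2)).
    def double_hinge(z):
        z = np.asarray(z, dtype=float)
        return np.maximum(-z, np.maximum(0.0, 0.5 - 0.5 * z))

    z = np.linspace(-5.0, 5.0, 10_001)

    # Linear-odd condition: l(z) - l(-z) + z should vanish identically.
    assert np.allclose(double_hinge(z) - double_hinge(-z) + z, 0.0)

    # 1-Lipschitz: finite-difference slopes never exceed 1 in magnitude.
    slopes = np.diff(double_hinge(z)) / np.diff(z)
    assert np.all(np.abs(slopes) <= 1.0 + 1e-12)
    print("double hinge loss: linear-odd and 1-Lipschitz checks passed")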

    References

    [1]
    Jaume Amores. 2013. Multiple instance classification: Review, taxonomy, and comparative study. Artific. Intell. 201 (2013), 81–105.
    [2]
    Martin S. Andersen, Joachim Dahl, and Lieven Vandenberghe. 2013. CVXOPT: Python software for convex optimization. Retrieved from https://cvxopt.org
    [3]
    Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. 2002. Support vector machines for multiple-instance learning. In Proceedings of the NeurIPS. 577–584.
    [4]
    Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. 2009. Visual tracking with online multiple instance learning. In Proceedings of the CVPR. 983–990.
    [5]
    Han Bao, Gang Niu, and Masashi Sugiyama. 2018. Classification from pairwise similarity and unlabeled data. In Proceedings of the ICML. 452–461.
    [6]
    Han Bao, Tomoya Sakai, Issei Sato, and Masashi Sugiyama. 2018. Convex formulation of multiple instance learning from positive and unlabeled bags. Neural Netw. 105 (2018), 132–141.
    [7]
    Peter L. Bartlett and Shahar Mendelson. 2002. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res. 3, 11 (2002), 463–482.
    [8]
    Yuzhou Cao, Lei Feng, Yitian Xu, Bo An, Gang Niu, and Masashi Sugiyama. 2021. Learning from similarity-confidence data. In Proceedings of the ICML. 1272–1282.
    [9]
    Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. 2018. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recogn. 77 (2018), 329–353.
    [10]
    Yixin Chen, Jinbo Bi, and James Ze Wang. 2006. MILES: Multiple-instance learning via embedded instance selection. IEEE Trans. Pattern Anal. Mach. Intell. 28, 12 (2006), 1931–1947.
    [11]
    Tarin Clanuwat, Mikel Bober-Irizar, Asanobu Kitamoto, Alex Lamb, Kazuaki Yamamoto, and David Ha. 2018. Deep learning for classical Japanese literature. arXiv:1812.01718.
    [12]
    Zhenghang Cui, Nontawat Charoenphakdee, Issei Sato, and Masashi Sugiyama. 2020. Classification from triplet comparison data. Neural Comput. 32, 3 (2020), 659–681.
    [13]
    Thomas G. Dietterich, Richard H. Lathrop, and Tomás Lozano-Pérez. 1997. Solving the multiple instance problem with axis-parallel rectangles. Artific. Intell. 89, 1-2 (1997), 31–71.
    [14]
    Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. 2015. Convex formulation for learning from positive and unlabeled data. In Proceedings of the ICML. 1386–1394.
    [15]
    Lei Feng, Senlin Shu, Yuzhou Cao, Lue Tao, Hongxin Wei, Tao Xiang, Bo An, and Gang Niu. 2021. Multiple-instance learning from similar and dissimilar bags. In Proceedings of the KDD. 374–382.
    [16]
    James Richard Foulds and Eibe Frank. 2010. A review of multi-instance learning assumptions. Knowl. Eng. Rev. 25 (2010), 1–25.
    [17]
    Thomas Gärtner, Peter A. Flach, Adam Kowalczyk, and Alexander J. Smola. 2002. Multi-instance kernels. In Proceedings of the ICML. 179–186.
    [18]
    Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Proceedings of the NeurIPS.
    [19]
    Sheng-Jun Huang, Wei Gao, and Zhi-Hua Zhou. 2018. Fast multi-instance multi-label learning. IEEE Trans. Pattern Anal. Mach. Intell. 41, 11 (2018), 2614–2627.
    [20]
    Maximilian Ilse, Jakub Tomczak, and Max Welling. 2018. Attention-based deep multiple instance learning. In Proceedings of the ICML. PMLR, 2127–2136.
    [21]
    Takashi Ishida, Gang Niu, and Masashi Sugiyama. 2018. Binary classification for positive-confidence data. In Proceedings of the NeurIPS. 5917–5928.
    [22]
    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
    [23]
    Christian Leistner, Amir Saffari, and Horst Bischof. 2010. MIForests: Multiple-instance learning with randomized trees. In Proceedings of the ECCV. 29–42.
    [24]
    Xin-Chun Li, De-Chuan Zhan, Jia-Qi Yang, and Yi Shi. 2021. Deep multiple instance selection. Sci. China Info. Sci. 64 (2021), 1–15.
    [25]
    Dong Liang, Xinbo Gao, Wen Lu, and Jie Li. 2021. Deep blind image quality assessment based on multiple instance regression. Neurocomputing 431 (2021), 78–89.
    [26]
    Nan Lu, Gang Niu, Aditya K. Menon, and Masashi Sugiyama. 2019. On the minimal supervision for training any binary classifier from only unlabeled data. In Proceedings of the ICLR.
    [27]
    James MacQueen et al. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the BSMSP. 281–297.
    [28]
    Colin McDiarmid. 1989. On the method of bounded differences. Surveys Combinator. 141, 1 (1989), 148–188.
    [29]
    Shahar Mendelson. 2008. Lower bounds for the empirical minimization algorithm. IEEE Trans. Info. Theory 54, 8 (2008), 3797–3803.
    [30]
    Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. 2012. Foundations of Machine Learning. MIT Press, Cambridge, MA.
    [31]
    Gang Niu, Marthinus Christoffel du Plessis, Tomoya Sakai, Yao Ma, and Masashi Sugiyama. 2016. Theoretical comparisons of positive-unlabeled learning against positive-negative learning. In Proceedings of the NeurIPS. 1199–1207.
    [32]
    Soumya Ray and Mark Craven. 2005. Learning statistical models for annotating proteins with function information using biomedical text. BMC Bioinform. 6, Suppl. 1 (2005), S18.
    [33]
    Soumya Ray and Mark Craven. 2005. Supervised versus multiple instance learning: An empirical comparison. In Proceedings of the ICML. ACM, 697–704.
    [34]
    Tomoya Sakai, Marthinus Christoffel du Plessis, Gang Niu, and Masashi Sugiyama. 2017. Semi-supervised classification based on classification from positive and unlabeled data. In Proceedings of the ICML. 2998–3006.
    [35]
    Florian Schroff, Dmitry Kalenichenko, and James Philbin. 2015. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the CVPR. 815–823.
    [36]
    Takuya Shimada, Han Bao, Issei Sato, and Masashi Sugiyama. 2020. Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. Neural Comput. 33, 5 (2020), 1234–1268.
    [37]
    Qingping Tao, Stephen Scott, N. V. Vinodchandran, and Thomas Takeo Osugi. 2004. SVM-based generalized multiple-instance learning via approximate box counting. In Proceedings of the ICML. 101.
    [38]
    Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schrödl et al. 2001. Constrained k-means clustering with background knowledge. In Proceedings of the ICML. 577–584.
    [39]
    Zhuang Wang, Vladan Radosavljevic, Bo Han, Zoran Obradovic, and Slobodan Vucetic. 2008. Aerosol optical depth prediction from satellite observations by multiple instance regression. In Proceedings of the ICDM. SIAM, 165–176.
    [40]
    Hong-Xin Wei, Lei Feng, Xiang-Yu Chen, and Bo An. 2020. Combating noisy labels by agreement: A joint training method with co-regularization. In Proceedings of the CVPR. 13726–13735.
    [41]
    Han Xiao, Kashif Rasul, and Roland Vollgraf. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.
    [42]
    Ling Xiao, Renfa Li, Juan Luo, et al. 2006. Sensor localization based on nonmetric multidimensional scaling. STRESS 2, 1 (2006).
    [43]
    Xin Xu and Eibe Frank. 2004. Logistic regression and boosting for labeled bags of instances. In Proceedings of the PAKDD. Springer, 272–281.
    [44]
    Cha Zhang and Paul Viola. 2007. Multiple-instance pruning for learning efficient cascade detectors. In Proceedings of the NeurIPS. 1681–1688.
    [45]
    Min-Ling Zhang, Fei Yu, and Cai-Zhi Tang. 2017. Disambiguation-free partial label learning. IEEE Trans. Knowl. Data Eng. 29, 10 (2017), 2155–2167.
    [46]
    Qi Zhang and Sally A. Goldman. 2001. EM-DD: An improved multiple-instance learning technique. In Proceedings of the NeurIPS. 1073–1080.
    [47]
    Teng Zhang and Hai Jin. 2020. Optimal margin distribution machine for multi-instance learning. In Proceedings of the IJCAI. 2383–2389.
    [48]
    Weijia Zhang, Xuanhui Zhang, Min-Ling Zhang et al. 2022. Multi-instance causal representation learning for instance label prediction and out-of-distribution generalization. In Proceedings of the NeurIPS. 34940–34953.
    [49]
    Zhi-Li Zhang and Min-Ling Zhang. 2006. Multi-instance multi-label learning with application to scene classification. In Proceedings of the NeurIPS.
    [50]
    Zhi-Hua Zhou. 2018. A brief introduction to weakly supervised learning. Natl. Sci. Rev. 5, 1 (2018), 44–53.
    [51]
    Zhi-Hua Zhou, Yu-Yin Sun, and Yu-Feng Li. 2009. Multi-instance learning by treating instances as non-iid samples. In Proceedings of the ICML. 1249–1256.
    [52]
    Xiaojin Zhu and Andrew B. Goldberg. 2009. Introduction to semi-supervised learning. Synth. Lect. Artific. Intell. Mach. Learn. 3, 1 (2009), 1–130.
