Abstract
The appropriate understanding and fast processing of lengthy legal documents are computationally challenging problems. Designing efficient automatic summarization techniques can be key to dealing with such issues. Extractive summarization is one of the most popular approaches for forming summaries of such lengthy documents, via the selection of summary-relevant sentences. An effective application of this approach requires appropriate scoring of sentences, which helps identify the more informative and essential sentences in the document. In this work, a novel sentence scoring approach, DCESumm, is proposed, which combines supervised sentence-level summary relevance prediction with unsupervised clustering-based document-level score enhancement. Experimental results on two legal document summarization datasets, BillSum and Forum of Information Retrieval Evaluation (FIRE), reveal that the proposed approach achieves significant improvements over the current state-of-the-art approaches. More specifically, it achieves ROUGE F1-score improvements of 1-6% and 6-12% on the BillSum and FIRE test sets, respectively. Such impressive summarization results suggest the usefulness of the proposed approach in finding the gist of a lengthy legal document, thereby providing crucial assistance to legal practitioners.
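To make the two-stage scoring idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes precomputed sentence embeddings and supervised relevance scores are already available, uses scikit-learn KMeans as a stand-in for the deep clustering component, and combines the two signals with a weighting parameter alpha that is purely an assumption for illustration.

```python
# Minimal sketch of a DCESumm-style scoring pipeline (illustrative only).
# Assumptions: precomputed sentence embeddings and supervised relevance
# scores are available; KMeans stands in for the deep clustering component.
import numpy as np
from sklearn.cluster import KMeans

def cluster_enhanced_scores(embeddings, relevance_scores, n_clusters=5, alpha=0.5):
    """Combine supervised sentence relevance with a clustering-based
    document-level score (closeness to the sentence's cluster centroid)."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = kmeans.fit_predict(embeddings)
    # Document-level signal: proximity of each sentence to its cluster centroid.
    dists = np.linalg.norm(embeddings - kmeans.cluster_centers_[labels], axis=1)
    cluster_scores = 1.0 / (1.0 + dists)      # closer to centroid -> higher score
    cluster_scores /= cluster_scores.max()    # normalise to [0, 1]
    # Final score: weighted combination of the two signals (alpha is assumed here).
    return alpha * np.asarray(relevance_scores) + (1 - alpha) * cluster_scores
```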
Appendix A. Qualitative analysis
A.1 Best and worst sample predictions

Table 14 shows the sentence scores obtained for the sample with the lowest ROUGE score among all samples in the US Test data; specifically, it lists the scores of the top 15% of sentences. These sentences are then sorted in their order of appearance in the original document to form the summary. The scores indicate that not all of the selected sentences score highly enough to merit inclusion in the summary. A similar trend is observed for the US Test and CA Test datasets, as shown in Tables 15 and 16.

A.2 Sample predicted summaries after postprocessing

Table 17 shows the predicted summaries of the worst sample after the postprocessing step, which demonstrates that postprocessing can substantially improve the quality of the predicted summaries. Table 18 reports the ROUGE scores for the samples that obtained the minimum ROUGE scores. From this table, we see that postprocessing, which includes picking the top one or two scoring sentences, improves the quality of the summary and hence the ROUGE scores.
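The selection and postprocessing steps described above can be sketched as follows. This is an illustrative reading only: the 15% ratio and the top-one-or-two rule come from the text, while the low-score threshold and the exact fallback condition are assumptions made for the example.

```python
# Illustrative sketch of summary formation with postprocessing (assumed details:
# the low_score_threshold and the fallback condition are not specified in the text).
def form_summary(sentences, scores, ratio=0.15, top_k=2, low_score_threshold=0.5):
    """Select the top-scoring fraction of sentences, then fall back to the
    top one or two sentences when most selected scores are low, and finally
    restore the original document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    n_select = max(1, round(len(sentences) * ratio))
    selected = ranked[:n_select]
    # Postprocessing: if most selected sentences score poorly, keep only
    # the top one or two scoring sentences.
    if sum(scores[i] < low_score_threshold for i in selected) > n_select // 2:
        selected = ranked[:top_k]
    # Sort the selected sentences back into their original document order.
    return " ".join(sentences[i] for i in sorted(selected))
```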