Implementation of NLP-based Automatic Text Summarization Using spaCy
Introduction
Method
Unsupervised Extractive Methods
The next step is to use n-grams to select features and to weight them with boolean
weights (BOOL), term frequency (TF), inverse document frequency (IDF), or TF-IDF.
KMeans is then applied to cluster the sentences. KMeans is an iterative process
that assigns each vector to the nearest centroid (the mean of the values in the
cluster) and then recomputes the centroids. In the proposed method, the first
sentence is taken as the baseline, and the similarity between sentences is
computed using Euclidean distance. After clustering into K clusters, the sentence
closest to each centroid (i.e., the most representative sentence) is selected. The
proposed method achieves better results than earlier approaches.
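A minimal sketch of this clustering step is given below, using scikit-learn rather than the authors' code; the function name summarize_kmeans, the n-gram range, and the TF-IDF settings are illustrative assumptions.

```python
# Sketch of KMeans-based extractive summarization (not the authors' code).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def summarize_kmeans(sentences, k=3):
    # Feature step: weight word n-grams (here unigrams + bigrams) with TF-IDF.
    X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(sentences).toarray()

    # KMeans iteratively assigns each sentence vector to the nearest
    # centroid and recomputes centroids until convergence.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

    # From each cluster, keep the sentence closest to the centroid
    # by Euclidean distance, i.e. the most representative sentence.
    picked = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        picked.append(members[np.argmin(dists)])

    # Return the picks in their original document order.
    return [sentences[i] for i in sorted(picked)]
```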
Preprocessing is carried out first. Then a graph representation of the text is
created, with nouns as nodes and non-noun words as edges. Special nodes "S#" and
"E#" mark the beginning and end of each sentence. The weight of each node is the
number of times it occurs. When choosing sentences, it is assumed that each noun
represents a different topic. First, the most common words and phrases are found
and a list of selected nodes and edges is made. To be selected, a source or
destination node must score higher than the average score over all nodes; to
select an edge, both its source and target nodes must be selected. If the
candidate summary (the summary generated by the algorithm) exceeds the
user-specified limit, the candidate sentences are scored and ranked in ascending
order. KMeans clustering is then applied to group similar sentences, and the top
sentence from each cluster is selected to create the final summary.
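The sketch below illustrates the noun-graph construction in a simplified form, assuming spaCy (with the en_core_web_sm model) and networkx; the node and edge selection rules are reduced to the above-average-weight test, and all names are illustrative.

```python
# Simplified sketch of the noun-graph method (details are assumptions).
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def build_noun_graph(text):
    graph = nx.Graph()
    for sent in nlp(text).sents:
        # "S#" and "E#" mark the beginning and end of each sentence.
        nouns = [t.lemma_ for t in sent if t.pos_ in ("NOUN", "PROPN")]
        nodes = ["S#"] + nouns + ["E#"]
        for node in nodes:
            # Node weight counts how many times the noun occurs.
            graph.add_node(node)
            graph.nodes[node]["weight"] = graph.nodes[node].get("weight", 0) + 1
        # Non-noun words between nouns are abstracted into weighted edges.
        for a, b in zip(nodes, nodes[1:]):
            prev = graph.get_edge_data(a, b, default={"weight": 0})["weight"]
            graph.add_edge(a, b, weight=prev + 1)
    return graph

g = build_noun_graph("Dogs chase cats. Cats avoid dogs in the yard.")
avg = sum(d["weight"] for _, d in g.nodes(data=True)) / g.number_of_nodes()
# Candidate source/destination nodes must score above the average weight.
selected = [n for n, d in g.nodes(data=True) if d["weight"] > avg]
```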
Ozsoy & Alpaslan introduced Latent Semantic Analysis (LSA) for text
summarization. This is an algebraic-statistical method for finding hidden
semantic patterns between words and sentences. An input matrix is created to
represent the text: rows represent words, columns represent sentences, and each
cell holds the TF-IDF value of the word. Singular Value Decomposition (SVD) is
used to model the relationships between words and sentences. The result of the
SVD is used to select sentences with the cross method: the sentence with the
longest vector is selected.
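A minimal sketch of the SVD step is shown below, assuming numpy and scikit-learn; the full cross method is simplified here to picking the single sentence with the longest vector, as described above.

```python
# Sketch of LSA sentence selection (simplified from the cross method).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def lsa_pick(sentences, topics=2):
    # Input matrix: rows are words, columns are sentences, cells are TF-IDF.
    A = TfidfVectorizer().fit_transform(sentences).T.toarray()

    # SVD factors A into word-topic (U), topic-strength (S),
    # and topic-sentence (Vt) matrices.
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    # Length of each sentence vector over the top topics.
    lengths = np.linalg.norm(S[:topics, None] * Vt[:topics], axis=0)
    return sentences[int(np.argmax(lengths))]
```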
Discussion
The study of natural language processing (NLP) dates from the 1950s. NLP
understands and processes language using sentence grammar, ontologies,
language models, parse trees, and similar methods. NLG (Natural Language
Generation) does the opposite and generates natural language from a machine
representation. NLP/NLG-based summaries are sometimes referred to as
"semantics-based or ontology-based" (Allahyari et al., 2017) rather than
"knowledge-based."
Before the advent of deep learning systems, NLP- and ontology-based solutions
were the most common ways to do abstractive summarization. For example,
sentences can be merged by mechanical conjunction rules. This type of
abstraction is primarily grammatical and may not integrate the ideas of the
document. Think of it as "abstraction lite."
Some researchers have combined NLP with deep learning, "encoding" linguistic
information such as Part-of-Speech (POS) tags and Named Entity Recognition
(NER) tags as lexical features within an encoder-decoder neural network
(Zhou, Yang, Wei, Tan & Bao, 2017). I agree with Allahyari et al. (2017) that
"the step to building a more accurate summarization system is to combine the
summarization method with knowledge-based and semantics-based ontology-based
summarization." The trend visible in the comparison matrix is away from NLP
and towards deep learning.
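As a rough illustration (not the cited authors' implementation), the POS and NER tags that such hybrid systems encode can be extracted with spaCy; the en_core_web_sm model is an assumption here.

```python
# Extracting the lexical features (POS and entity tags) that hybrid
# encoder-decoder systems feed in alongside word embeddings.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple acquired a London startup for $1 billion.")

# One (token, POS tag, entity tag) triple per token; "O" marks non-entities.
features = [(t.text, t.pos_, t.ent_type_ or "O") for t in doc]
```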
B. spaCy
The project architecture is shown in Figure 3 above. As the figure shows, the
text document is first uploaded to the application. The document is then
preprocessed, which includes removing stop words and punctuation and computing
word and sentence frequencies. Finally, the text summary is created.
Document pre-processing
Due to the excess of information sources in today's world, the input documents
we receive may not be in clean English format and may contain noise. Noise
includes various special characters, unwanted spaces, newlines, full stops, and
more. Therefore, the following tasks are performed on the input file to keep
only the useful parts of the document; a minimal sketch follows the list.
Step 1: All line breaks are removed.
Step 2: All brackets and special characters are removed.
Step 3: All commas, extra spaces, and repeated sentences are removed.
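A minimal regex sketch of these three steps; the exact character classes are assumptions, since the paper does not list the patterns it removes.

```python
# Sketch of pre-processing Steps 1-3 (assumed character classes;
# duplicate-sentence removal is omitted for brevity).
import re

def clean(text):
    text = text.replace("\n", " ")                # Step 1: line breaks
    text = re.sub(r"[\[\]<>{}]|\d+", " ", text)   # Step 2: brackets, stray numbers
    text = text.replace(",", " ")                 # Step 3: commas
    return re.sub(r"\s{2,}", " ", text).strip()   # Step 3: extra spaces
```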
In this step, all stop words are removed from the input according to its
language. Stop words do not provide reliable information about a particular
context; words like "is", "am", and "who" carry little meaning on their own.
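A sketch of this step using spaCy's built-in English stop list, assuming the en_core_web_sm model:

```python
# Keep only content words by dropping stop words and punctuation.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence that contains stop words.")
content_words = [t.text for t in doc if not t.is_stop and not t.is_punct]
```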
Tokenization
Next, sentences are split into individual words. This tokenization step is
performed as part of the pipelined NLP process. It is useful at two levels,
word level and sentence level: the first is standard word tokenization, which
returns the set of words in a given sentence.
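Both levels are available from a spaCy pipeline; a minimal sketch, again assuming the en_core_web_sm model:

```python
# Word-level and sentence-level tokenization with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("spaCy splits text. Each sentence becomes a span of tokens.")

sentences = list(doc.sents)      # sentence-level tokens (spans)
words = [t.text for t in doc]    # word-level tokens
```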
We need a way to measure the value of the text in the document. The following
calculations are performed to extract the key sentences from the document, as
shown in Figures 4 and 5; a sketch of these steps follows the list.
Step 1: The frequency of every word in the preprocessed text is calculated.
Step 2: The weight of each word is calculated by dividing its frequency by the
maximum frequency.
Step 3: All key sentences in the given input are reviewed.
Step 4: Each sentence's score is calculated by adding the weighted frequencies
of the words it contains.
Step 5: The sentence list is sorted in descending order of score.
Step 6: The top "n" sentences are retrieved from the list as the summary.
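The sketch below implements Steps 1 through 6 with spaCy; the variable names are illustrative and the en_core_web_sm model is an assumption.

```python
# Sketch of the frequency-based scoring pipeline (Steps 1-6).
from heapq import nlargest
import spacy

nlp = spacy.load("en_core_web_sm")

def summarize(text, n=3):
    doc = nlp(text)

    # Step 1: frequency of every content word in the preprocessed text.
    freq = {}
    for t in doc:
        if not t.is_stop and not t.is_punct and not t.is_space:
            freq[t.lower_] = freq.get(t.lower_, 0) + 1

    # Step 2: word weight = frequency / maximum frequency.
    max_f = max(freq.values())
    weights = {w: f / max_f for w, f in freq.items()}

    # Steps 3-4: score each sentence by summing its words' weights.
    scores = {sent: sum(weights.get(t.lower_, 0) for t in sent)
              for sent in doc.sents}

    # Steps 5-6: rank by score and keep the top "n" sentences,
    # restoring their original document order.
    best = nlargest(n, scores, key=scores.get)
    return " ".join(s.text for s in sorted(best, key=lambda s: s.start))
```

For example, calling summarize(document_text, n=5) returns the five highest-scoring sentences, in document order, as the summary.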
Figure 4: Preprocessing
Conclusion
Acknowledgments
I would like to thank all the teaching and technical faculty and the supporting
staff of the Department of Computer Science and Engineering, East West
Institute of Technology, Bengaluru, for their valuable suggestions and support.
References
Sagar, Y. S., & Achyutha Prasad, N. (2018). CHARM: A cost-efficient multi-cloud
data hosting scheme with high availability. International Journal for
Technological Research in Engineering, 5(10). ISSN (Online): 2347-4718.
Suryasa, I. W., Rodríguez-Gámez, M., & Koldoris, T. (2022). Post-pandemic health
and its sustainability: Educational situation. International Journal of Health
Sciences, 6(1), i-v. https://doi.org/10.53730/ijhs.v6n1.5949
Wan, X. (2010). Towards a unified approach to simultaneous single-document and
multi-document summarizations. In Proceedings of the 23rd International
Conference on Computational Linguistics, pages 1137-1145. Association for
Computational Linguistics.
Shinghal, U., Mowdhgalya, Y. A. V., Tiwari, V., & Achyutha Prasad, N. (2020).
Centaur - A self-driving car. International Journal of Computer Trends and
Technology, 68(4), 129-131.
Shinghal, U., Mowdhgalya, Y. A. V., Tiwari, V., & Achyutha Prasad, N. (2020).
Home automation using HTTP and MQTT server. International Journal of Computer
Trends and Technology, 68(4), 126-128.
Verma, P., Pal, S., & Om, H. (2019). A comparative analysis on Hindi and
English extractive text summarization. ACM Transactions on Asian and
Low-Resource Language Information Processing (TALLIP), 18(3):30.
Widyaningrum, I., Wibisono, N., & Kusumawati, A. H. (2020). Effect of
extraction method on antimicrobial activity against Staphylococcus aureus of
tapak liman (Elephantopus scaber L.) leaves. International Journal of Health &
Medical Sciences, 3(1), 105-110. https://doi.org/10.31295/ijhms.v3n1.181