Content Significance Distribution of Sub-Text Blocks in Articles and Its Application to Article-Organization Assessment,arXiv - CS - Computer Vision and Pattern Recognition



arXiv - CS - Computer Vision and Pattern Recognition. Pub Date: 2023-11-03, DOI: arxiv-2311.01673
You Zhou, Jie Wang

We explore how to capture the significance of a sub-text block in an article and how it may be used for text mining tasks. A sub-text block is a sub-sequence of sentences in the article. We formulate the notion of content significance distribution (CSD) of sub-text blocks, referred to as CSD of the first kind and denoted by CSD-1. In particular, we leverage Hugging Face's SentenceTransformer to generate contextual sentence embeddings, and use MoverScore over text embeddings to measure how similar a sub-text block is to the entire text. To overcome the exponential blowup in the number of sub-text blocks, we present an approximation algorithm and show that the approximated CSD-1 is almost identical to the exact CSD-1. Under this approximation, we show that the average and median CSD-1s for news, scholarly research, argument, and narrative articles share the same pattern. We also show that, under a certain linear transformation, the complement of the cumulative distribution function of the beta distribution with certain values of $\alpha$ and $\beta$ resembles a CSD-1 curve. We then use CSD-1s to extract linguistic features to train an SVC classifier for assessing how well an article is organized. Through experiments, we show that this method achieves high accuracy for assessing student essays. Moreover, we study the CSD of sentence locations, referred to as CSD of the second kind and denoted by CSD-2, and show that average CSD-2s for different types of articles possess distinctive patterns, which either conform to common perceptions of article structures or correct those perceptions with minor deviations.
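To make the exponential blowup and the idea of an approximation concrete, here is a minimal Python sketch. It is not the authors' algorithm, and several pieces are assumptions: the abstract does not give the formal definition of CSD-1, so the sketch takes one plausible reading (for each block size k, the score of the best-matching block of k sentences); hand-written toy vectors stand in for SentenceTransformer embeddings; cosine similarity of mean vectors stands in for MoverScore; and the greedy search is a generic heuristic chosen only for illustration. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; a stand-in (assumption) for MoverScore."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def block_score(sentence_vecs, indices):
    """Similarity of the block formed by `indices` to the whole text."""
    block = [sentence_vecs[i] for i in indices]
    return cosine(mean_vector(block), mean_vector(sentence_vecs))


def exact_curve(sentence_vecs):
    """Best score per block size k, enumerating all C(n, k) blocks.

    Across all k this visits 2^n - 1 blocks, which is the blowup
    the abstract refers to.
    """
    n = len(sentence_vecs)
    return [
        max(block_score(sentence_vecs, idx)
            for idx in combinations(range(n), k))
        for k in range(1, n + 1)
    ]


def greedy_curve(sentence_vecs):
    """Illustrative approximation (assumption): grow one block greedily.

    Each step adds the sentence that most improves the score, so only
    O(n^2) blocks are scored instead of 2^n - 1.
    """
    remaining = set(range(len(sentence_vecs)))
    chosen, curve = [], []
    while remaining:
        best = max(sorted(remaining),
                   key=lambda i: block_score(sentence_vecs, chosen + [i]))
        chosen.append(best)
        remaining.remove(best)
        curve.append(block_score(sentence_vecs, chosen))
    return curve


# Toy 3-dimensional "sentence embeddings" (made-up values).
TOY_VECS = [
    [1.0, 0.0, 0.2],
    [0.9, 0.1, 0.0],
    [0.1, 1.0, 0.0],
    [0.0, 0.8, 0.3],
    [0.5, 0.5, 0.5],
]

if __name__ == "__main__":
    print("exact :", [round(s, 3) for s in exact_curve(TOY_VECS)])
    print("greedy:", [round(s, 3) for s in greedy_curve(TOY_VECS)])
```

By construction the greedy curve can never exceed the exact curve at any block size, and the two agree at k = 1 and k = n. How close they are in between depends on the data; the abstract's claim of near-identical curves applies to the authors' own approximation, not to this sketch.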


Updated: 2023-11-06
