On The Discoursive Structure of Computer Graphics Research Papers
categories, while AZ-II covers aspects of knowledge claims that permeate across several CoreSC concepts.

Corpora annotated with Argumentative Zoning-II (Teufel et al., 2009) and Core Scientific Concepts (Liakata et al., 2010) have been exploited to build automatic rhetorical sentence classifiers.
Figure 2: Description of the 5 categories of our Simplified Discourse Annotation Scheme
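As a compact reference, the scheme of Figure 2 can be sketched as a small mapping from top-level categories to subcategories, together with the collapsing rule later applied in Section 7.2. This is an illustrative sketch only; the underscored label names are our own notation, not part of the scheme.

```python
# The 5 top-level categories and 3 subcategories of the Simplified
# Discourse Annotation Scheme (see Figure 2), plus the two service
# labels used during annotation (Unspecified, Sentence).
SCHEME = {
    "Approach": [],
    "Background": [],
    "Challenge": ["Challenge_Goal", "Challenge_Hypothesis"],
    "Outcome": ["Outcome_Contribution"],
    "FutureWork": [],
}
SERVICE_LABELS = ["Unspecified", "Sentence"]

def collapse(label):
    """Map a subcategory to its parent category, or return the
    category itself; service labels map to None."""
    for parent, subs in SCHEME.items():
        if label == parent or label in subs:
            return parent
    return None
```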
5 The Annotation Process

5.1 Annotators

The annotators are not domain experts. Two of them are computationally oriented linguists, and the third is both a linguist and the developer of the annotation scheme. Each of them annotated the whole set of documents; the annotation outcome is therefore a collection of 40 papers whose sentences have been annotated by all three annotators. The categories associated with each sentence by each annotator are then merged to create the Gold Standard version of the corpus.

5.2 Annotation Task

The 40 documents selected for our corpus were provided to each annotator to start the sentence annotation process. All the annotators used GATE v.7.1 as the annotation tool, with a customized view in which a window shows the ready-to-annotate documents, segmented into sentences. Their task is to select a sentence and choose the appropriate category from a pop-up list.

Each sentence of each document of the Corpus is classified as belonging to one of the categories: Approach, Background, Challenge, Challenge Goal, Challenge Hypothesis, FutureWork, Outcome or Outcome Contribution. Sentences were classified as Unspecified when the identification of the category was not possible (for example, metadiscourse or acknowledgements), or as Sentence when the selected text was affected by segmentation or character encoding problems (for example, when a footnote appears incorrectly in the text flow).

5.3 Annotation Support

In order to ensure the quality of the annotation, the annotators were provided with the following support: an introductory training session, a visual schema of the proposed discourse structure, guidelines for the annotation, and a series of conflict resolution criteria and recommendations. Moreover, two follow-up conflict-resolution meetings were scheduled to perform inter-annotator agreement checks during the first stages of the annotation process.

5.4 Annotation Workflow

After the training session, the annotators were encouraged to test the tool and try the schema with a couple of documents before the annotation task really started. Once the process was triggered, two conflict resolution meetings were scheduled: after the annotation of the first 5 papers, and after the subsequent 10 papers. Agreement was measured at these two milestones in order to detect deviations at an early stage. The articles were sorted by subject to facilitate the annotators' comprehension of the text, as articles concerning the same subject
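The merge step mentioned in Section 5.1 (and detailed in Section 6.1) can be sketched as follows; the function name is illustrative, not part of our tooling:

```python
def gold_label(a1, a2, designer):
    """Gold Standard category for one sentence from three annotations.

    If at least two annotators agree, their category wins; otherwise
    the category chosen by the scheme designer (the third annotator)
    is used. Note that if a1 disagrees with a2, any remaining majority
    necessarily involves the designer, so returning the designer's
    label covers both the majority and the tie-break cases.
    """
    return a1 if a1 == a2 else designer
```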
deal with similar concepts and terminology.

Category               Annotated Sent.      %
Approach                 5,038            46.70
Background               1,760            16.32
Challenge                  351             3.25
Challenge Goal              91             0.84
Challenge Hypothesis         7             0.06
FutureWork                 136             1.26
Outcome                  1,175            10.89
Outcome Contribution       219             2.03
Unspecified                759             7.04
Sentence                 1,253            11.61
Total                   10,789           100

Table 1: Number/Percentage of sentences per category

Work      κ      N      n    k    Domain
Liakata   0.57     255  11   9    Biochem.
Liakata   0.50   5,022  11   9    Biochem.
Teufel    0.71   3,745  15   3    Chemistry
Teufel    0.65   1,629  15   3    Comp. Ling.
Teufel    0.71   3,420   7   3    Comp. Ling.

Table 2: Summary of κ values in previous works (N = #sentences, n = #categories, k = #annotators)

6 Annotation Results

6.1 Annotated corpus description

The Corpus includes 10,789 sentences, with an average of 269.7 sentences per document. We are currently defining the best approach to make the Corpus annotations available to the research community, since most of its 40 documents are protected by copyright.

The Gold Standard was built with the following criterion for each sentence: if all annotators, or two of them, assigned the same category to the sentence, it was included in the Gold Standard with that category; otherwise, the category selected by the annotator who designed the scheme was used. Table 1 details the number of sentences of each category in the Gold Standard version of the annotated corpus, and its percentage with respect to the total number of annotated sentences in the whole corpus.

6.2 Inter-annotator Agreement Values

We used Cohen's κ (Cohen, 1960) to measure inter-annotator agreement. Cohen's κ is an extensively adopted agreement measure, previously exploited in several other annotation efforts, including the corpora created by Liakata and Teufel introduced above.

Depending on how documents are combined, there are several options for calculating agreement measures over a corpus. Micro averaging essentially treats the corpus as one large document, whereas macro averaging calculates κ on a per-document basis and then averages the results; macro averaging therefore increases the weight of shorter documents.

In our corpus, the κ value of inter-annotator agreement, averaged over all annotator pairs and considering the 5 categories and 3 subcategories of our Simplified Annotation Schema (see Figure 1), is 0.6567 for the macro average and 0.6667 for the micro average. If we consider only the 5 top categories of our Simplified Annotation Schema, inter-annotator agreement grows: the macro average becomes 0.674 and the micro average 0.6823. In both cases the micro average is slightly greater than the macro average, since documents with a number of sentences below the mean (269.7 sentences per document) tend to have low κ values, negatively affecting the macro-averaged computation of κ.

These κ values are comparable to those achieved by Teufel for 1,629 sentences in the domain of Computational Linguistics, with an annotation scheme of 15 categories and 3 annotators (see Table 2). The micro average κ reaches the cut-off point of 0.67, over which agreement is considered difficult to reach in linguistic annotation tasks (Teufel, 2010).

The agreement measures at the two milestones show the evolution of inter-annotator agreement throughout the annotation process: Cohen's κ is substantially stable between two of the annotators, while the third annotator improved his agreement with the other two very quickly in the first 5 documents and remained stable after the second milestone. In particular, the annotator with the lowest agreement in the initial stage increased his agreement with the other two annotators respectively from 0.59 for the first 5 documents to 0.68 for the last 25 documents, and from 0.56 to 0.66.
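The two averaging strategies for κ can be sketched in a few lines of pure Python. The toy corpus below is invented for illustration: a longer document with good agreement and a short one with poor agreement, reproducing the effect reported in Section 6.2 (micro average above macro average).

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_chance = sum(ca[c] * cb[c] for c in ca) / (n * n)
    return (p_observed - p_chance) / (1 - p_chance)

# Toy two-document corpus (invented labels, for illustration only).
doc1_a = ["Background", "Background", "Approach", "Approach", "Outcome"]
doc1_b = ["Background", "Challenge", "Approach", "Approach", "Outcome"]
doc2_a = ["Approach", "Outcome", "Background"]
doc2_b = ["Approach", "Background", "Outcome"]

# Micro average: pool all sentences as if the corpus were one document.
micro = cohen_kappa(doc1_a + doc2_a, doc1_b + doc2_b)

# Macro average: kappa per document, then the mean over documents,
# which gives the short, low-agreement document the same weight as
# the longer one and therefore drags the average down.
macro = (cohen_kappa(doc1_a, doc1_b) + cohen_kappa(doc2_a, doc2_b)) / 2
```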
An analysis of the sentence distribution according to their agreement degree gives the following values: totally agreed sentences (65.09%), partially agreed sentences (31.24%) and totally disagreed sentences (3.66%).

Not all the categories are equally distributed, as each of them has its own characteristics in terms of number of sentences, ambiguity, or conflicts with other categories.

Background and Approach, the most highly represented categories, are highly reliable. In fact, more than 45% of the sentences of the corpus were tagged as Approach or Background with agreement by all three annotators. If we also take into account the sentences with partial agreement (two annotators agreed), then more than 60% of the sentences in the Gold Standard version of our annotated corpus are classified as Approach or Background.

FutureWork and Outcome are quite reliable, although the ratio of totally agreed to partially agreed sentences is considerably higher in FutureWork than in Outcome (3.3 vs 0.9). This is because, although FutureWork sentences (1.3%) are much fewer than Outcome sentences (10.9%), they are much more easily recognized, as they include specific lexical clues (for further research, in future investigation, more research is needed in, it could be interesting to, a better understanding, etc.).

Clearly, Challenge is the category with the highest proportion of total disagreement. This category, which tends to appear at the beginning of a scientific paper, shows more than any other the author's skills in writing and synthesis and their ability to communicate the scope of the challenge they are presenting. Authors must provide a context and outline the situation in order to attract the attention of the reader, who must understand the goal and complexity of the research.

When studying the relation between the number of sentences of a category and the annotation match between annotators, the data reveal that the observed agreement among annotator pairs varies considerably with the relative frequency of the annotation classes in the Corpus. Agreement improves as the number of sentences of the category increases, getting close to 0.80 for the most frequent categories.

Figure 3: Box plots that show the distribution of the sentences of the 5 main categories of the Scientific Discourse Annotated Corpus

6.3 Discoursive Structure Analysis

The box plots of the 5 main categories (Fig. 3) give a clear picture of the discursive structure of an average scientific paper in the Computer Graphics domain. The 5 main categories show a neat layout of the main zones (inside the box) of the argumentative structure distributed along the article. Even if one can find all types of sentences along the whole document, the central 50% of each category is clearly limited to a zone with little overlap with the others. When searching for information about one of these categories, a reader or researcher will find the central 50% of the sentences of each category in the following article-length ranges: Challenge between 3% and 23%, Background between 11% and 29%, Approach between 35% and 70%, Outcome between 70% and 92%, and FutureWork between 88% and 97%.

The identification of these ranges will allow readers, scientists, search engines, etc. to focus their exploration on a specific area of the article.
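The ranges above lend themselves to a simple reader-side lookup: given a sentence's relative position in an article, return the categories whose central zone covers it. This is an illustrative helper built from the reported numbers, not part of the paper's tooling.

```python
# Central-50% zone per category, as fractions of the article length
# (the ranges reported for the Computer Graphics corpus).
ZONES = {
    "Challenge":  (0.03, 0.23),
    "Background": (0.11, 0.29),
    "Approach":   (0.35, 0.70),
    "Outcome":    (0.70, 0.92),
    "FutureWork": (0.88, 0.97),
}

def likely_categories(position):
    """Categories whose central zone covers a relative position in [0, 1]."""
    return [c for c, (lo, hi) in ZONES.items() if lo <= position <= hi]
```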
7 Automatic sentence classification: initial experiments

Recently, several approaches to the automatic classification of the discursive function of textual excerpts from research papers have been proposed (Merity et al., 2009; Liakata et al., 2012; Guo et al., 2013). We present our initial experiments of automatic sentence classification on our Corpus. We describe the set of features we use to model and characterize the contents of each sentence, in order to enable the execution of proper classification algorithms. In particular, relying on these features, we compare the performance of two classifiers: Logistic Regression (Wright, 1995) and Support Vector Machine (Suykens and Vandewalle, 1999).

7.1 Description of sentence features

In order to support the extraction of the features that characterize each sentence, we mine its contents by means of a pipeline of natural language processing tools, properly customized to deal with several peculiarities of scientific texts. As a consequence, we are able to automatically extract from each sentence:

• inline citation markers, like (AuthorA et al., 2010) or [11];

• inline citation spans, i.e. text spans made of one or more contiguous inline citation markers. Examples of inline citation spans including one inline citation marker are (ALL2011) or [11]; examples of text spans including more than one inline citation marker are [10, 12] or (AuthorA and AuthorB, 2010; AuthorC, 2014);

• for each inline citation span, whether or not it has a syntactic role. For instance, in the sentence "[11, 12] demonstrate the theorem", the inline citation span [11, 12] has a syntactic role, since it is the subject of the sentence. In the sentence "We exploited the ABA method [14]", the inline citation span [14] has no syntactic role.

We process each sentence with the MATE dependency parser (Bohnet, 2010) to determine its syntactic structure. A customized version of the parser properly deals with the presence of inline citations: inline citation spans are excluded from the dependency tree if they have no syntactic function in the sentence where they appear. After dependency parsing, it is possible to identify the tokens of each sentence together with their part-of-speech tags and syntactic relations. We then extract:

• unigrams, bigrams and trigrams built from the lemmas of each sentence, lowercased and without stop-words; we include only unigrams, bigrams and trigrams with a corpus frequency equal to or greater than 4;

• depth of the dependency tree and number of its edges by edge type;

• dependency tree tokens with a corpus frequency equal to or greater than 4. Each dependency tree token is the concatenation of three parts: the kind of dependency relation, the lowercased lemma of the source, and the lowercased lemma of the target of the dependency relation. For instance, one of the dependency tree tokens of the sentence "We demonstrate the theorem" is SBJ we demonstrate, because "we" is the subject (SBJ) of the verb "demonstrate";

• number of inline citation markers;

• number of inline citation spans that include two or more contiguous inline citation markers;

• number of citations with a syntactic role;

• position of the sentence in the document, obtained by dividing the document into 10 unequal segments (the Loc. feature in (Teufel, 1999));

• position of the sentence in the section, obtained by dividing the section into 7 unequal slices (the Struct-1 feature in (Teufel, 1999));

• category of the previous sentence; in our experiments we use the gold-standard category of the previous sentence.
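The lexical part of the feature set above can be sketched as follows. This is a simplified illustration: lemmatization and the actual stop-word list come from the NLP pipeline and are assumed here as inputs.

```python
from collections import Counter

def sentence_ngrams(lemmas, stopwords):
    """Uni-, bi- and trigrams over the lowercased lemmas of one
    sentence, with stop-words removed."""
    toks = [l.lower() for l in lemmas if l.lower() not in stopwords]
    grams = []
    for n in (1, 2, 3):
        grams += ["_".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]
    return grams

def frequent_vocabulary(corpus_grams, min_count=4):
    """Keep only n-grams whose corpus frequency is >= min_count,
    mirroring the frequency threshold used for the features."""
    counts = Counter(g for sent in corpus_grams for g in sent)
    return {g for g, c in counts.items() if c >= min_count}
```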
7.2 Classification experiments

By relying on the features just described, we compare the sentence classification performance of two classifiers: Logistic Regression and Support Vector Machine with a linear kernel. From our corpus we consider the set of 8,777 sentences that have been manually associated with one of the 5 high-level classes of our scientific discourse annotation schema (see Figure 1): Background, Challenge, Approach, Outcome, and Future Work. We collapse the subcategories Hypothesis and Goal into the parent category Challenge, and the subcategory Contribution into the parent category Outcome. We perform a 10-fold cross validation of the two classification algorithms over the collection of 8,777 sentences. The results are shown in Table 3.

Category      Logistic Regression   SVM
Approach            0.876          0.851
Background          0.778          0.735
Challenge           0.466          0.430
Future Work         0.675          0.496
Outcome             0.679          0.623
Avg. F1             0.801          0.764

Table 3: F1 scores of Logistic Regression and SVM, 10-fold cross validation over 8,777 manually classified sentences
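The per-category F1 scores of Table 3 follow the standard one-vs-rest computation, which can be sketched as below (toy labels for illustration):

```python
def f1_per_category(gold, pred):
    """One-vs-rest F1 for each category over parallel label lists."""
    scores = {}
    for c in sorted(set(gold) | set(pred)):
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
    return scores
```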
The Logistic Regression classifier outperforms the SVM one both globally and in each single category. We can note that, in general, the F1 score obtained in each category decreases as the number of training instances does. This trend is not confirmed by the category Future Work. The corpus includes 136 sentences that belong to the category Future Work, considerably fewer than the 449 examples of Challenge sentences and the 1,175 examples of Outcome sentences. Nevertheless, the Logistic Regression F1 score of the category Future Work (0.675) is almost equal to that of the category Outcome (0.679) and considerably higher than the F1 score of the category Challenge (0.466). This happens because some linguistic features that characterize Future Work sentences are strongly distinctive of this class. For instance, the use of the future verb tense, as well as words like plan, future, venue, etc., consistently contribute to automatically distinguishing Future Work sentences, even if we have few training examples in our corpus.

8 Conclusions and Future Work

We have developed an annotation scheme for scientific discourse, adapted to a previously unexplored domain, Computer Graphics. We relied on the 5 categories and 3 subcategories of our annotation schema to manually annotate the sentences of a scientific discourse corpus made of 40 papers.

We have observed that the larger categories (in terms of number of sentences) - Approach, Background and Outcome - are highly predictable, while Challenge, which corresponds mainly to the introductory part of the scientific discourse, is more heterogeneous and highly dependent on the author's style. Sentences classified as FutureWork have special lexical characteristics, as confirmed by the results of our automatic classification experiments. We have also characterized specific zones for each of the 5 categories, thus contributing to a deeper knowledge of the internal structure of scientific discourse in Computer Graphics.

In the future, we plan to focus on the characterization of other peculiarities of scientific text, including citations, thus properly extending our annotation schema. We are also confident that our Simplified Annotation Scheme will be suitable for other domains, and we are planning to verify this. A two-layered annotation scheme could then be applicable to most domains, the first layer being coarse-grained and general, and a second layer being finer-grained and domain-dependent for certain categories.

As future avenues of research concerning automatic sentence classification, we are planning to carry out more extensive experiments and evaluations, by increasing the set of features that describe each sentence, evaluating the contributions of single features, and considering new classification algorithms.

Acknowledgments

The research leading to these results has received funding from the European Project Dr. Inventor (FP7-ICT-2013.8.1, grant agreement no. 611383).
References

Bernd Bohnet. 2010. Very high accuracy and fast dependency parsing is not a contradiction. Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics.

Paolo Ciccarese, Elizabeth Wu, Gwen Wong, Marco Ocana, June Kinoshita, Alan Ruttenberg, and Tim Clark. 2008. The SWAN biomedical discourse ontology. Journal of Biomedical Informatics, 41(5):739–751.

J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, 20(1):37–46.

Anita de Waard, Paul Buitelaar, and Thomas Eigner. 2009. Identifying the Epistemic Value of Discourse Segments in Biology Texts. Proceedings of the Eighth International Conference on Computational Semantics (IWCS-8 '09), 351–354, Stroudsburg, PA, USA. Association for Computational Linguistics.

Eugene Garfield. 1965. Can Citation Indexing Be Automated? Statistical Association Methods for Mechanized Documentation, Symposium Proceedings, National Bureau of Standards Miscellaneous Publication 269, 189–192. Prentice-Hall, Englewood Cliffs, NJ.

Claire Grover, Ben Hachey, and Ian Hughson. 2004. The HOLJ Corpus: supporting summarisation of legal texts. Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora (LINC-04), Geneva, Switzerland.

Yufan Guo, Anna Korhonen, Maria Liakata, Ilona Silins, Lin Sun, and Ulla Stenius. 2010. Identifying the Information Structure of Scientific Abstracts: An Investigation of Three Different Schemes. Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, 99–107, Uppsala, Sweden. Association for Computational Linguistics.

Yufan Guo, Ilona Silins, Ulla Stenius, and Anna Korhonen. 2013. Active learning-based information structure analysis of full scientific articles and two applications for biomedical literature review. Bioinformatics, 29(11):1440–1447.

Kenji Hirohata, Naoaki Okazaki, Sophia Ananiadou, and Mitsuru Ishizuka. 2008. Identifying sections in scientific abstracts using conditional random fields. In Proceedings of IJCNLP 2008, 381–388.

Maria Liakata, Shyamasree Saha, Simon Dobnik, Colin Batchelor, and Dietrich Rebholz-Schuhmann. 2012. Automatic recognition of conceptualization zones in scientific articles and two life science applications. Bioinformatics, 28(7):991–1000.

Maria Liakata and Larisa Soldatova. 2008. Guidelines for the annotation of general scientific concepts. Aberystwyth University, JISC Project Report, http://ie-repository.jisc.ac.uk/88.

Maria Liakata, Simone Teufel, Advaith Siddharthan, and Colin Batchelor. 2010. Corpora for the Conceptualisation and Zoning of Scientific Papers. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).

Jimmy Lin, Damianos Karakos, Dina Demner-Fushman, and Sanjeev Khudanpur. 2006. Generative Content Models for Structural Analysis of Medical Abstracts. In Proceedings of the Workshop on Linking Natural Language Processing and Biology: Towards Deeper Biological Literature Analysis (BioNLP '06), 65–72, Stroudsburg, PA, USA. Association for Computational Linguistics.

Stephen Merity, Tara Murphy, and James R. Curran. 2009. Accurate argumentative zoning with maximum entropy models. Proceedings of the 2009 Workshop on Text and Citation Analysis for Scholarly Digital Libraries. Association for Computational Linguistics.

Yoko Mizuta and Nigel Collier. 2004. Annotation scheme for a rhetorical analysis of biology articles. Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004), 1737–1740, Lisbon, Portugal. European Language Resources Association (ELRA).

Yoko Mizuta, Anna Korhonen, Tony Mullen, and Nigel Collier. 2006. Zone analysis in biology articles as a basis for information extraction. International Journal of Medical Informatics, 75(6):468–487.

Raheel Nawaz, Paul Thompson, John McNaught, and Sophia Ananiadou. 2010. Meta-Knowledge Annotation of Bio-Events. Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).

Patrick Ruch, Celia Boyer, Christine Chichester, Imad Tbahriti, Antoine Geissbühler, Paul Fabry, Julien Gobeill, Violaine Pillet, Dietrich Rebholz-Schuhmann, Christian Lovis, and Anne-Lise Veuthey. 2007. Using argumentation to extract key sentences from biomedical abstracts. International Journal of Medical Informatics, 76(2-3):195–200.

Hagit Shatkay, Fengxia Pan, Andrey Rzhetsky, and W. John Wilbur. 2008. Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users. Bioinformatics, 24(18):2086–2093.

Larisa N. Soldatova and Ross King. 2006. An ontology of scientific experiments. Journal of the Royal Society Interface, 3(11):795–803.

Ina Spiegel-Rösing. 1977. Science Studies: Bibliometric and Content Analysis. Social Studies of Science, 7(1):97–113.

Johan A.K. Suykens and Joos Vandewalle. 1999. Least squares support vector machine classifiers. Neural Processing Letters, 9(3):293–300.

Simone Teufel. 1999. Argumentative Zoning: Information Extraction from Scientific Text. School of Cognitive Science, University of Edinburgh, UK.

Simone Teufel. 2010. The Structure of Scientific Articles: Applications to Citation Indexing and Summarization. CSLI Publications (CSLI Studies in Computational Linguistics), Stanford, CA.

Simone Teufel and Marc Moens. 2002. Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status. Computational Linguistics, 28(4):409–445.

Simone Teufel, Advaith Siddharthan, and Colin Batchelor. 2009. Towards Discipline-independent Argumentative Zoning: Evidence from Chemistry and Computational Linguistics. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP '09), Volume 3, 1493–1502, Singapore. Association for Computational Linguistics.

Paul Thompson, Syed A. Iqbal, John McNaught, and Sophia Ananiadou. 2009. Construction of an annotated corpus to support biomedical information extraction. BMC Bioinformatics, 10:349.

Melvin Weinstock. 1971. Citation indexes. Encyclopedia of Library and Information Science, 5:16–40. Marcel Dekker, Inc., New York.

Elizabeth White, K. Bretonnel Cohen, and Larry Hunter. 2011. Hypothesis and Evidence Extraction from Full-Text Scientific Journal Articles. Proceedings of the BioNLP 2011 Workshop, 134–135, Portland, Oregon, USA. Association for Computational Linguistics.

W. John Wilbur, Andrey Rzhetsky, and Hagit Shatkay. 2006. New directions in biomedical text annotation: definitions, guidelines and corpus construction. BMC Bioinformatics, 7:356.

Raymond E. Wright. 1995. Logistic regression.