DOI: 10.1145/2232817.2232872
research-article

Web-based citation parsing, correction and augmentation

Published: 10 June 2012

Abstract

Given the considerable value of citation metadata, many methods have been proposed to automate Citation Metadata Extraction (CME). Existing methods rely primarily on content analysis of the citation text, but the results of such content-based methods are often unreliable. Moreover, the extracted citation metadata is only a small part of the relevant metadata spread across the Internet. In contrast to content-based CME methods, this paper proposes a Web-based CME approach and a citation-enriching system called BibAll, which corrects the parsing results of content-based CME methods and augments citation metadata by leveraging relevant bibliographic data from digital repositories and cited-by publications on the Web. BibAll consists of four main components: citation parsing, Web-based bibliographic data retrieval, irrelevant bibliographic data filtering, and relevant bibliographic data integration. The system has been tested on the publicly available FLUX-CIM dataset. Experimental results show that BibAll significantly improves citation parsing accuracy and augments the metadata of the original citation.
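
The four components named in the abstract can be pictured as a small pipeline. The sketch below is illustrative only and is not BibAll's implementation: the abstract does not say how each stage works, so the regex-based parser, the in-memory `REPOSITORY` standing in for Web retrieval, the Jaccard title similarity with a 0.6 threshold, and the per-field majority vote are all assumptions, and every function name and sample record is hypothetical.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def parse_citation(text):
    """Stage 1 (assumed): naive content-based parse of 'Authors. Year. Title. Venue'."""
    record = {}
    year = re.search(r"\b(?:19|20)\d{2}\b", text)
    if year:
        record["year"] = year.group(0)
    parts = [p.strip() for p in text.split(". ") if p.strip()]
    if len(parts) >= 3:
        record["author"], record["title"] = parts[0], parts[2]
    return record


def retrieve(record, repository):
    """Stage 2 (assumed): stand-in for Web retrieval; any shared title token matches."""
    query = tokens(record.get("title", ""))
    return [c for c in repository if query & tokens(c.get("title", ""))]


def filter_relevant(record, candidates, threshold=0.6):
    """Stage 3 (assumed): keep candidates whose title Jaccard similarity >= threshold."""
    query = tokens(record.get("title", ""))
    kept = []
    for cand in candidates:
        other = tokens(cand.get("title", ""))
        union = query | other
        if union and len(query & other) / len(union) >= threshold:
            kept.append(cand)
    return kept


def integrate(record, relevant):
    """Stage 4 (assumed): per-field majority vote over the parse and relevant records."""
    sources = [record, *relevant]
    fields = sorted({key for src in sources for key in src})
    merged = {}
    for field in fields:
        votes = Counter(src[field] for src in sources if field in src)
        merged[field] = votes.most_common(1)[0][0]
    return merged


def enrich(text, repository):
    """Run the four stages in order on one citation string."""
    record = parse_citation(text)
    candidates = retrieve(record, repository)
    return integrate(record, filter_relevant(record, candidates))


# Hypothetical data: the citation string carries a wrong year and no page range.
CITATION = "Doe, J. 2001. Parsing citations with toy rules. In Proc. Example Conf."
REPOSITORY = [
    {"author": "Doe, J", "title": "Parsing citations with toy rules",
     "year": "2010", "pages": "1--10"},
    {"author": "Doe, J", "title": "Parsing Citations with Toy Rules", "year": "2010"},
    {"author": "Roe, R", "title": "Toy story of unrelated rules", "year": "1999"},
]

if __name__ == "__main__":
    print(enrich(CITATION, REPOSITORY))
```

In this toy run the unrelated third record is retrieved but then filtered out, the two agreeing records outvote the parsed year (correction), and the `pages` field missing from the citation string is filled in (augmentation).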


    Published In

    JCDL '12: Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
    June 2012
    458 pages
    ISBN:9781450311540
    DOI:10.1145/2232817
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. automatic metadata generation
    2. citation extraction and augmentation
    3. digital libraries
    4. web-based extraction

    Qualifiers

    • Research-article

    Conference

    JCDL '12
    Acceptance Rates

    Overall Acceptance Rate 415 of 1,482 submissions, 28%

    Cited By

    • (2023) Machine Learning Approaches for Entity Extraction from Citation Strings. Decision Intelligence, 287-297. DOI: 10.1007/978-981-99-5997-6_25. Online publication date: 25-Nov-2023
    • (2015) Scholarly Document Information Extraction using Extensible Features for Efficient Higher Order Semi-CRFs. Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries, 61-64. DOI: 10.1145/2756406.2756946. Online publication date: 21-Jun-2015
    • (2015) Two-Tier Machine Learning Using Conditional Random Fields with Constraints. Knowledge Discovery, Knowledge Engineering and Knowledge Management, 80-95. DOI: 10.1007/978-3-662-46549-3_6. Online publication date: 25-Apr-2015
    • (2014) A Web Service for Scholarly Big Data Information Extraction. Proceedings of the 2014 IEEE International Conference on Web Services, 105-112. DOI: 10.1109/ICWS.2014.27. Online publication date: 27-Jun-2014
    • (2014) Extracting bibliographical data for PDF documents with HMM and external resources. Program 48:3, 293-313. DOI: 10.1108/PROG-12-2011-0059. Online publication date: Jul-2014
    • (2013) Extracting and matching authors and affiliations in scholarly documents. Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries, 219-228. DOI: 10.1145/2467696.2467703. Online publication date: 22-Jul-2013
