SciPlore’s technologies, Citation-based Plagiarism Detection and Citation Proximity Analysis, depend on the availability of bibliographic metadata. Author names, title information, references and citations must be accessible and, ideally, error-free. We improved existing tools and developed our own to extract all the required information from PDF files.

Header Data Extraction Framework

To obtain general article metadata, such as title [1], authors, affiliations, journal and DOI, from PDF documents, we reviewed the most promising tools available for this task and found that each tool has its own strengths and weaknesses [2].

Instead of picking a single tool for the entire task, we developed a framework that selects and combines the best metadata extraction tools for the individual tasks.

The framework takes PDF documents as input and returns the extracted metadata as a unified data structure. Because the execution of each tool is handled by a module of the framework, individual tools can be changed or substituted easily. Currently, we are working on using the framework to construct a hybrid approach that combines the best results yielded by the different extraction tools.
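
The module idea can be illustrated with a minimal Python sketch. It is not the framework's actual code: the names HeaderMetadata, ExtractionFramework, register and prefer are hypothetical, the two stub tools return made-up values instead of calling real extractors, and the per-field preference list is only one possible way to build a hybrid result.

```python
from dataclasses import dataclass, field, fields
from typing import Callable, Dict, List, Optional


@dataclass
class HeaderMetadata:
    """Unified record returned regardless of which tool produced it."""
    title: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    affiliations: List[str] = field(default_factory=list)
    journal: Optional[str] = None
    doi: Optional[str] = None


# A module wraps one extraction tool: PDF path in, unified record out.
ExtractorModule = Callable[[str], HeaderMetadata]


class ExtractionFramework:
    """Hypothetical framework: runs every module and merges the results."""

    def __init__(self) -> None:
        self._modules: Dict[str, ExtractorModule] = {}
        self._preferred: Dict[str, List[str]] = {}

    def register(self, name: str, module: ExtractorModule) -> None:
        # Registering under an existing name substitutes the tool.
        self._modules[name] = module

    def prefer(self, field_name: str, module_names: List[str]) -> None:
        # Order in which modules are trusted for one metadata field.
        self._preferred[field_name] = module_names

    def extract(self, pdf_path: str) -> HeaderMetadata:
        results = {name: run(pdf_path) for name, run in self._modules.items()}
        merged = HeaderMetadata()
        for f in fields(HeaderMetadata):
            # Fall back to registration order if no preference was set.
            order = self._preferred.get(f.name, list(results))
            for name in order:
                value = getattr(results[name], f.name) if name in results else None
                if value:  # first non-empty value wins
                    setattr(merged, f.name, value)
                    break
        return merged


def stub_tool_a(pdf_path: str) -> HeaderMetadata:
    # Stand-in for a tool assumed to be strong on titles (made-up data).
    return HeaderMetadata(title="An Example Title", authors=["A. Author"])


def stub_tool_b(pdf_path: str) -> HeaderMetadata:
    # Stand-in for a tool assumed to be strong on authors (made-up data).
    return HeaderMetadata(title="an example titl",
                          authors=["A. Author", "B. Author"],
                          doi="10.0000/example")


framework = ExtractionFramework()
framework.register("tool_a", stub_tool_a)
framework.register("tool_b", stub_tool_b)
framework.prefer("title", ["tool_a", "tool_b"])
framework.prefer("authors", ["tool_b", "tool_a"])

record = framework.extract("paper.pdf")
print(record)
```

In this sketch, swapping a tool means registering a different callable under the same name; the caller still receives the same HeaderMetadata structure.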

Advanced Automated Citation Extraction

Citation-based Plagiarism Detection and Citation Proximity Analysis require accurate information on citation position, i.e., the location of each citation in the full text. In our review of available citation extraction tools, we found that none of them allows for a sophisticated position analysis.

We chose to enhance existing Open Source tools with methods that identify the position of citations at the character, sentence and section level of the text. We started with an enhanced version of the Open Source tool ParsCit, since it yielded very good parsing results. In the future, we intend to improve more tools in a similar manner.
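
To make the three position levels concrete, the following Python sketch locates citation markers in plain text. It does not reflect how ParsCit or our enhanced version works internally. It assumes numeric markers such as [3], sections separated by blank lines, and sentences ending in ".", "!" or "?" (so abbreviations like "et al." would be miscounted). The names CitationPosition and locate_citations are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List

# Simplifying assumptions of this sketch (see the text above).
CITATION = re.compile(r"\[(\d+)\]")            # numeric markers like [3]
SENTENCE_END = re.compile(r"[.!?](?=\s|$)")    # naive sentence boundary
SECTION_BREAK = re.compile(r"\n\s*\n")         # blank line between sections


@dataclass
class CitationPosition:
    marker: str          # the marker as it appears in the text, e.g. "[1]"
    char_offset: int     # 0-based character offset in the full text
    sentence_index: int  # 0-based sentence number within the document
    section_index: int   # 0-based section number within the document


def locate_citations(full_text: str) -> List[CitationPosition]:
    positions = []
    for match in CITATION.finditer(full_text):
        before = full_text[:match.start()]
        positions.append(CitationPosition(
            marker=match.group(0),
            char_offset=match.start(),
            # Count the boundaries that occur before the marker.
            sentence_index=len(SENTENCE_END.findall(before)),
            section_index=len(SECTION_BREAK.findall(before)),
        ))
    return positions


text = ("Intro text cites early work [1]. It also builds on [2].\n\n"
        "Methods follow [1] closely.")

for position in locate_citations(text):
    print(position)
```

With positions like these, two citations can be compared by whether they share a sentence or a section, or by their character distance.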

Related Publications

[1] J. Beel, B. Gipp, A. Shaker, and N. Friedrich, “SciPlore Xtract: Extracting Titles from Scientific PDF Documents by Analyzing Style Information (Font Size),” in Research and Advanced Technology for Digital Libraries: Proceedings of the 14th European Conference on Digital Libraries (ECDL’10), M. Lalmas, J. Jose, A. Rauber, F. Sebastiani, and I. Frommholz, Eds., Lecture Notes in Computer Science (LNCS), vol. 6273. Glasgow, UK: Springer, Sep. 2010. doi: 10.1007/978-3-642-15464-5_45
[2] M. Lipinski, K. Yao, C. Breitinger, J. Beel, and B. Gipp, “Evaluation of Header Metadata Extraction Approaches and Tools for Scientific PDF Documents,” in Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL). Indianapolis, IN, USA: ACM, Jul. 22-26, 2013. doi: 10.1145/2467696.2467753