Analyzing Non-Textual Content Elements to Detect Academic Plagiarism by Norman Meuschke

Author: Norman Meuschke
Language: eng
Format: epub
ISBN: 9783658420628
Publisher: Springer Fachmedien Wiesbaden


5.3 Evaluation Dataset

To evaluate math-based detection methods and compare their effectiveness to citation-based and text-based methods, we created a new dataset because no existing dataset offers mathematical content. See Section 2.5.2 for summaries of existing datasets. Figure 5.3 illustrates the process for creating the dataset.

We selected 10 publications as test cases, which we refer to as C1…C10. We limited the selection to 10 cases for four reasons. First, we chose cases from research fields within our area of expertise to enable us to assess the relevance of identified similarities. Second, we chose the cases most representative of the types of mathematical similarity we observed. Third, our preprocessing of documents required manual checks of automatically extracted mathematical content, as we explain in Section 5.3.1. The effort required for this step prevented us from converting more cases. Fourth, we restricted the test cases to disciplines covered by the NTCIR-11 MathIR Task dataset [9]. Appendix B in the electronic supplementary material describes the test cases.

We used the topically related NTCIR-11 MathIR Task dataset to create a reference collection. The NTCIR dataset includes about 60 million formulae from 105,120 scientific publications in computer science, mathematics, physics, and statistics. The dataset creators retrieved the publications from the arXiv [93] preprint repository in LaTeX format. We embedded the confirmed source documents for each of the test cases into the NTCIR dataset.
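The embedding step can be pictured with a minimal sketch. It is an illustration under assumptions, not the procedure used to build the dataset: the `Document` class, the `build_reference_collection` function, and all document identifiers are hypothetical names introduced here. The sketch adds the confirmed sources of each test case to a background collection and records which documents a detection method should retrieve for that case.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class Document:
    """Hypothetical minimal document record: an identifier plus its formulae."""
    doc_id: str
    formulae: Tuple[str, ...] = ()


def build_reference_collection(
    background: Iterable[Document],
    confirmed_sources: Dict[str, Iterable[Document]],
) -> Tuple[Dict[str, Document], Dict[str, Set[str]]]:
    """Embed confirmed source documents into a background collection.

    Returns the merged collection (keyed by document id) and a ground-truth
    map from test case id to the ids of its confirmed sources.
    """
    collection = {doc.doc_id: doc for doc in background}
    ground_truth: Dict[str, Set[str]] = {}
    for case_id, sources in confirmed_sources.items():
        source_ids = set()
        for doc in sources:
            # Keep the existing entry if the source is already in the collection.
            collection.setdefault(doc.doc_id, doc)
            source_ids.add(doc.doc_id)
        ground_truth[case_id] = source_ids
    return collection, ground_truth


# Toy data: two background documents and one test case with two sources,
# one of which is already part of the background collection.
background_docs = [Document("bg-1", ("a^2+b^2=c^2",)), Document("bg-2")]
sources_by_case = {"C1": [Document("src-1", ("E=mc^2",)), Document("bg-1")]}

collection, ground_truth = build_reference_collection(background_docs, sources_by_case)
```

Keeping the ground-truth map separate from the collection lets the retrieval step treat all documents uniformly, while the evaluation step can still check whether the confirmed sources of a test case appear among the retrieved results.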

Figure 5.3 Creation of the evaluation dataset


