JETIR2306482

© 2023 JETIR June 2023, Volume 10, Issue 6 www.jetir.org (ISSN-2349-5162)
Plagiarism Detection For Content

Sukrit Kumar Sahu and Vinayak Singh
Under the guidance of Faculty
Mr Advin Manhar
Amity School of Engineering and Technology,
Amity University Chhattisgarh
Abstract : Plagiarism is a growing problem. It is commonly understood as literary theft and academic dishonesty, and it should
be avoided at all costs. This paper presents a survey of plagiarism detection tools and outlines the characteristics common to
several detection systems.
I. INTRODUCTION
Plagiarism, typically characterized as literary theft, is one of the growing difficulties facing publishers, researchers, and
educational institutions worldwide. It entails presenting someone else's ideas, papers, code, photos, etc. as one's own
creations. It is a dishonest act in academia and literature and must therefore be avoided.
In academic settings, plagiarism in papers occurs most frequently among students, from undergraduates to postgraduates, who
modify original work and present it as their own. As a result, the student's capacity to evaluate their own performance is
compromised. It is therefore necessary to identify plagiarism in documents at an early stage.
Systems that may be used for this purpose include:
1. Web enabled systems
2. Stand-alone systems
I.I.1. Web enabled systems

Web-enabled systems are more commonly used because they make searching the internet for plagiarized material easier and
more reliable.
I.I.2. Stand-alone systems

These are detection systems that are installed and run locally on the user's computer.
I.II. Plagiarism in Programming Code

Various methods have been developed to detect plagiarism in source code written in C, C++, or Java, and they are used to
compare source code across these programming languages. However, they take more time to identify sophisticated
modifications to the code. The structure-based method is the best strategy for identifying plagiarism in computer
code. To find similarities, this technique employs string matching and tokenization.
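The combination of tokenization and string matching can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed tools: the keyword set, the regular expressions, and the function names `tokenize` and `token_similarity` are assumptions chosen for illustration, and the standard-library `difflib.SequenceMatcher` stands in for the matching step.

```python
import re
from difflib import SequenceMatcher

# Small illustrative keyword set (an assumption; a real tool would use the full language grammar).
KEYWORDS = {"int", "float", "for", "while", "if", "else", "return", "void"}

# Identifiers/keywords, integer literals, or any single punctuation character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")


def tokenize(source: str) -> list[str]:
    """Turn C-like source text into a normalized token sequence."""
    # Remove block comments and line comments before tokenizing.
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", " ", source)
    tokens = []
    for tok in TOKEN_RE.findall(source):
        if tok in KEYWORDS:
            tokens.append(tok)        # keep keywords as they are
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")       # every identifier becomes the same token
        elif tok[0].isdigit():
            tokens.append("NUM")      # every number becomes the same token
        else:
            tokens.append(tok)        # punctuation and operators
    return tokens


def token_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the token sequences of two programs."""
    matcher = SequenceMatcher(None, tokenize(a), tokenize(b), autojunk=False)
    return matcher.ratio()


original = "int sum(int a, int b) { return a + b; } // add"
renamed = "int total(int x, int y) { /* add */ return x + y; }"
print(token_similarity(original, renamed))
```

Because identifiers are collapsed into a single token, renaming variables or editing comments does not change the token sequence, which is the kind of superficial modification a structure-based comparison is meant to see through.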
II. LITERATURE REVIEW

A survey on plagiarism conducted by the University of California, Berkeley, revealed that its prevalence grew to 74.4% over the
four-year period from 1993 to 1997. Other research also found that high school students make up over
90% of the population. As a result, plagiarism can be divided into several categories.
JETIR2306482 Journal of Emerging Technologies and Innovative Research (JETIR) www.jetir.org e395
The feature-based techniques [1][3] use software metrics to create a feature vector from the input program, which can be
mapped to a point in an n-dimensional Cartesian space.
The distance between the points establishes how similar the two programs are to one another. Lutz Prechelt et al. [8] have noted
that feature-based solutions do not take into account important structural information of the programs and that adding more
metrics for comparison does not increase accuracy.
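The feature-based idea can be sketched in a few lines of Python. The four metrics below are hypothetical stand-ins chosen for illustration, not the metrics used in [1] or [3], and the function names are assumptions.

```python
import math


def feature_vector(source: str) -> tuple[float, ...]:
    """Map a program to a point in 4-dimensional space (hypothetical metrics)."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    words = source.split()
    return (
        float(len(lines)),         # non-blank lines
        float(len(words)),         # whitespace-separated words
        float(len(set(words))),    # distinct words
        float(source.count("(")),  # rough proxy for calls and grouped expressions
    )


def distance(p: tuple[float, ...], q: tuple[float, ...]) -> float:
    """Euclidean distance between two feature vectors; smaller means more similar."""
    return math.dist(p, q)


prog_a = "a = 1\nb = 2"
prog_b = "b = 2\na = 1"   # same statements, different order
print(distance(feature_vector(prog_a), feature_vector(prog_b)))
```

The two snippets map to the same point even though their statement order differs, which illustrates the loss of structural information noted above.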
Alan Parker et al. [4] give an overview of the algorithms that can be used to identify plagiarism. The algorithm that is
the subject of their study is based on string comparison: it removes comments, white space, and blank spaces, compares the
strings, and keeps track of the proportion of characters that are the same.
The authors also discussed six different levels of plagiarism and provided examples for each level. These algorithms
were created using Halstead's metrics theories, which have a close relationship with software metrics.
Although this is an older work, it found that the detection of textual plagiarism could be automated, saving time and resources.
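A minimal sketch of such a string-comparison check is given below. It is not the algorithm from [4]: the C-style comment patterns, the use of Python's `difflib.SequenceMatcher` to find matching characters, and the names `strip_noise` and `percent_same` are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def strip_noise(source: str) -> str:
    """Remove C-style comments and all white space from a program text."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    source = re.sub(r"//[^\n]*", "", source)
    return re.sub(r"\s+", "", source)


def percent_same(a: str, b: str) -> float:
    """Percentage of characters of the longer cleaned text that also match the other."""
    a, b = strip_noise(a), strip_noise(b)
    if not a and not b:
        return 100.0  # two empty texts are treated as identical
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / max(len(a), len(b))


print(percent_same("x = 1; // set x", "x   =\n1;"))
```

Character-level comparison of this kind catches copies that differ only in comments and layout, but not copies in which identifiers have been renamed.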
Plagiarism in student assignments has caused many problems for assessors, so Michael J. Wise [5]
devised a method known as YAP3, the third version of YAP, which generally operates in two phases. It eliminates comments and
string constants from the program, converts uppercase letters to lowercase, maps synonyms to a common form, rearranges the
functions into their calling order, and deletes tokens that are not reserved words.
The report also emphasizes Running-Karp-Rabin Greedy-String-Tiling (RKR-GST), a detection algorithm developed after
YAP and other detection systems were studied. The technique can also be used to find transposed subsequences. The report also
discusses the use of YAP on English texts, which was a success.
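The tiling idea behind RKR-GST can be sketched as follows. This is a simplified, brute-force greedy string tiling in Python that omits the Karp-Rabin hashing used to speed up the search; the function names, the `min_match` parameter, and the coverage-based similarity formula are assumptions for illustration and are not taken from [5].

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return non-overlapping tiles (start_a, start_b, length) common to a and b."""
    if min_match < 1:
        raise ValueError("min_match must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        # Find all maximal matches of unmarked tokens with the current longest length.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches = [(i, j, k)]
                    max_len = k
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            break
        for i, j, k in matches:
            # Skip matches that overlap a tile placed earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for off in range(k):
                marked_a[i + off] = True
                marked_b[j + off] = True
            tiles.append((i, j, k))
    return tiles


def tiling_similarity(a, b, min_match=3):
    """Fraction of both sequences covered by tiles, in [0, 1]."""
    if not a and not b:
        return 1.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b))


first = list("ABCDEFGH")
second = list("EFGHABCD")   # the two halves are transposed
print(greedy_string_tiling(first, second))
print(tiling_similarity(first, second))
```

Because tiles may be matched in any order, the transposed halves in the example are both found, which is the property that lets the technique detect reordered blocks of code or text.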
III. METHODOLOGY TYPES

Because identifying plagiarism manually is challenging, detection must be automated to be effective. There are various methods
and approaches that can be used to accomplish this, such as:
• Document comparison algorithms
• Crawlers for website data searches
• Techniques utilizing linguistic frameworks, among others
IV. PROPOSED METHODOLOGY

The proposed methodology relies on techniques utilizing linguistic frameworks.
Although many data mining tasks follow traditional approaches, evaluating data using a hypothesis-based approach provides a
framework for implementing flexible, data-driven techniques that assist pattern detection algorithms. In essence, there are two
categories of data mining approaches, each of which takes a different approach to building models or identifying patterns.
The following methodology is suggested for this purpose:
A. Collection of assignments: All assignments or documents are collected in electronic format so that plagiarism can be
detected efficiently.
B. Pre-processing: Pre-processing is a major step in which all the assignments are converted into an appropriate format.
All the assignments collected must be in the same format. Numbers, figure values, pictures, and anything else that is not
made up of the letters a-z should be excluded from the documents.
C. Classification: Text classification is performed to split each sentence into its individual words. With
the help of this step, the key words of a sentence can be found.
D. Text analysis: The data is then passed through a text analysis step. This step can be repeated as needed, and
different text analysis techniques can be used according to the nature of the text and the aims of the institute.
E. Processing and analyzing the tri-grams: Every sequence of three successive words in a line is treated as a tri-gram. The
tri-grams from the collection of assignments are then grouped into clusters.
F. Similarity measures: The sequences of tri-grams created from the processed documents are then compared
using sequence comparison methods.
G. Clustering the plagiarized data: Clusters are created from similar tri-grams in order to calculate the similarity score.
Clustering simplifies the calculations and accelerates the process.
H. Similarity score: The similarity score is calculated from the clusters of similar tri-grams and expressed as a
percentage. A higher percentage indicates greater similarity.
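Steps B, E, F, and H can be illustrated with a minimal Python sketch. Several choices here are assumptions rather than part of the proposed methodology: tri-grams are taken over the whole document instead of line by line, clustering is reduced to a set intersection of identical tri-grams, and the score is the Jaccard overlap of the two tri-gram sets. The function names are hypothetical.

```python
import re


def preprocess(text: str) -> list[str]:
    """Step B: lowercase the text and keep only words made of the letters a-z."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop digits, punctuation, symbols
    return text.split()


def trigrams(words: list[str]) -> set[tuple[str, ...]]:
    """Step E: every sequence of three successive words."""
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def similarity_percent(doc_a: str, doc_b: str) -> float:
    """Steps F and H: shared tri-grams as a percentage of all distinct tri-grams."""
    tri_a = trigrams(preprocess(doc_a))
    tri_b = trigrams(preprocess(doc_b))
    if not tri_a or not tri_b:
        return 0.0  # documents shorter than three words cannot be compared
    shared = tri_a & tri_b
    return 100.0 * len(shared) / len(tri_a | tri_b)


doc_1 = "The quick brown fox jumps over the lazy dog."
doc_2 = "the quick brown fox sleeps under the lazy dog"
print(f"{similarity_percent(doc_1, doc_2):.1f}%")
```

In the example, three of the eleven distinct tri-grams are shared, so the score is roughly 27%. An institute would choose its own threshold above which a pair of assignments is flagged for manual review.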
V. RESULTS & CONCLUSIONS

This paper presented a summary of a survey on plagiarism detection software. Plagiarism remains a problem for universities,
professors, policy-makers, and students as the internet develops and the need for information grows. The need for plagiarism
detection systems has therefore become a crucial issue. The use of plagiarism detection systems in online learning improves
academic integrity and can also successfully lower the incidence of plagiarism.
According to other study papers and data from other sources, we obtain 98% correct results by applying techniques that
make use of linguistic frameworks.
VI. REFERENCES
[1] Keerthana T V, Pushti Dixit, Rhuthu Hegde, Sonali S K, Prameetha Pai, “A Literature Review on Plagiarism Detection
in Computer Programming Assignments”, IRJET, Volume 9, Issue 3, March 2022, Pages 1-4.
[2] Joseph L.F. De Kerf, “APL and Halstead's theory of software metrics”, APL '81: Proceedings of the international conference
on APL, October 1981, Pages 89–93.
[3] John L Donaldson, Ann Marie Lancaster and Paula H Sposato, “A plagiarism detection system”, SIGCSE '81: Proceedings of
the twelfth SIGCSE technical symposium on Computer science education, February 1981, Pages 21–25.
[4] Alan Parker and James O. Hamblen, “Computer Algorithms for Plagiarism Detection”, IEEE Transactions On Education, Vol.
32, No. 2. May 1989.
[5] Michael J Wise, “YAP3: improved detection of similarities in computer program and other texts”, SIGCSE '96: Proceedings of
the twenty-seventh SIGCSE technical symposium on Computer science education, March 1996, Pages 130–134.
[6] Saul Schleimer, Daniel S. Wilkerson and Alex Aiken, “Winnowing: Local Algorithms for Document Fingerprinting”, SIGMOD
2003, June 9-12, 2003, San Diego, CA. Copyright 2003 ACM 1-58113-634-X/03/06.
[7] Richard M. Karp and Michael O. Rabin, “Efficient randomized pattern-matching algorithms”, Published in: IBM Journal of
Research and Development (Volume: 31, Issue: 2, March 1987), Page(s): 249 - 260.
[8] Lutz Prechelt and Guido Malpohl, “Finding Plagiarisms among a Set of Programs with JPlag”, March 2003, Journal Of Universal
Computer Science 8(11).
[9] Sven Meyer zu Eissen and Benno Stein, “Intrinsic Plagiarism Detection”, M. Lalmas et al. (Eds.): ECIR 2006, LNCS 3936, pp.
565–569, 2006.
[10] Liang Zhang, Yue-ting Zhuang and Zhen-ming Yuan, “A Program Plagiarism Detection Model Based on Information Distance
and Clustering”, Published in: The 2007 International Conference on Intelligent Pervasive Computing (IPC 2007), Date Added
to IEEE Xplore: 22 January 2008,Print ISBN:978-0-7695-3006-2.
