research-article

SemCluster: clustering of imperative programming assignments based on quantitative semantic features

Published: 08 June 2019
    Abstract

    A fundamental challenge in automated reasoning about programming assignments at scale is clustering student submissions based on their underlying algorithms. State-of-the-art clustering techniques are sensitive to control structure variations, cannot cluster buggy solutions with similar correct solutions, and either require expensive pair-wise program analyses or training efforts. We propose a novel technique that can cluster small imperative programs based on their algorithmic essence: (A) how the input space is partitioned into equivalence classes and (B) how the problem is uniquely addressed within individual equivalence classes. We capture these algorithmic aspects as two quantitative semantic program features that are merged into a program's vector representation. Programs are then clustered using their vector representations. The computation of our first semantic feature leverages model counting to identify the number of inputs belonging to an input equivalence class. The computation of our second semantic feature abstracts the program's data flow by tracking the number of occurrences of a unique pair of consecutive values of a variable during its lifetime. The comprehensive evaluation of our tool SemCluster on benchmarks drawn from solutions to small programming assignments shows that SemCluster (1) generates far fewer clusters than other clustering techniques, (2) precisely identifies distinct solution strategies, and (3) boosts the performance of clustering-based program repair, all within a reasonable amount of time.
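
    The two features described in the abstract can be illustrated with a small sketch. This is not SemCluster's implementation: brute-force enumeration over a tiny input domain stands in for model counting, the path labels stand in for input equivalence classes, and every name below (`max3_nested`, `max3_running`, `input_class_sizes`, `consecutive_value_pairs`, `feature_vector`) is hypothetical.

```python
from collections import Counter
from itertools import product


def max3_nested(a, b, c):
    """Hypothetical submission 1: nested ifs. Returns (result, path label)."""
    if a >= b:
        if a >= c:
            return a, "a>=b & a>=c"
        return c, "a>=b & a<c"
    if b >= c:
        return b, "a<b & b>=c"
    return c, "a<b & b<c"


def max3_running(a, b, c):
    """Hypothetical submission 2: running maximum with two sequential ifs."""
    m = a
    took_b = b > m
    if took_b:
        m = b
    took_c = c > m
    if took_c:
        m = c
    return m, (took_b, took_c)


def input_class_sizes(program, domain, arity):
    """Count how many inputs from domain**arity follow each path.

    Assumption: enumeration replaces model counting, so this only
    works for very small domains.
    """
    sizes = Counter()
    for args in product(domain, repeat=arity):
        _, path = program(*args)
        sizes[path] += 1
    return sizes


def consecutive_value_pairs(trace):
    """Count each pair of consecutive values in one variable's value trace."""
    return Counter(zip(trace, trace[1:]))


def feature_vector(program, domain, arity):
    """Order-independent vector: the sorted input-class sizes."""
    return sorted(input_class_sizes(program, domain, arity).values())


if __name__ == "__main__":
    # Both submissions split the 27 inputs over {0, 1, 2}^3 the same way.
    print(feature_vector(max3_nested, range(3), 3))   # [1, 4, 8, 14]
    print(feature_vector(max3_running, range(3), 3))  # [1, 4, 8, 14]
    # Value trace of a single variable over its lifetime.
    print(consecutive_value_pairs([0, 1, 1, 2, 1, 1]))
```

    In this toy setting the two submissions differ in control structure but yield the same class sizes, which is the kind of agreement a clustering step over such vectors could exploit. How SemCluster actually merges the two features into one vector is not specified in the abstract, so the sketch keeps them separate.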

        Published In

        PLDI 2019: Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation
        June 2019
        1162 pages
        ISBN:9781450367127
        DOI:10.1145/3314221
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. Program analysis
        2. Program clustering
        3. Quantitative reasoning

        Qualifiers

        • Research-article

        Conference

        PLDI '19

        Acceptance Rates

        Overall Acceptance Rate 406 of 2,067 submissions, 20%
