Automatic summarization of source code for novice programmers.

College of Liberal Arts

Degree Name

Bachelor in Arts


Part-of-speech tagging is a well-established problem in computer science. A wide variety of taggers exist that have been trained on English text to extract this information automatically. However, these taggers are traditionally applied only to conventional written English, such as news articles, and many are evaluated on the Wall Street Journal corpus, which consists of such articles. Natural language artifacts also appear in software source code, for example in method names. This thesis proposes a methodology for comparing these taggers on source code artifacts and evaluating their overall accuracy. It also presents a potential application of part-of-speech tagging to source code: a summarization tool for novice programmers is developed, and the thesis shows how the tool could use linguistic information extracted from method names to generate better, more detailed summaries. Such summaries would help beginning programmers learn to read and work with code written by others, which is a major component of learning to program given the collaborative nature of many modern software projects. By generating summaries automatically, the tool makes production-level source code less daunting and easier for a novice to approach and understand.
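As a rough illustration of how natural language can be recovered from a method name, the sketch below splits an identifier into words that an English tagger could then process. It is not the thesis's tool: the function names `split_identifier` and `describe_method` are hypothetical, and the verb-first assumption (the first word names the action) is a simplification standing in for a real part-of-speech tagger.

```python
import re


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for chunk in name.split("_"):
        # Match acronyms (XML), capitalized or lowercase words (File, get),
        # and digit runs, in that order of preference.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", chunk))
    return [w.lower() for w in words]


def describe_method(name):
    """Build a one-line description from a method name.

    Assumption: the first word is the action and the rest is its target.
    A real system would use part-of-speech tags instead of word position.
    """
    words = split_identifier(name)
    if not words:
        return "No words found in the method name."
    action, target = words[0], " ".join(words[1:])
    if not target:
        return f"The name suggests the action '{action}'."
    return f"The name suggests the action '{action}' applied to '{target}'."


if __name__ == "__main__":
    for method in ["getUserName", "parseXMLFile", "close"]:
        print(method, "->", describe_method(method))
```

The word list returned by `split_identifier` is the kind of input a tagger trained on English text would receive. Method names are short and often lack the sentence context such taggers expect, which is one reason their accuracy on source code needs to be evaluated separately.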