Codebook: Discovering and Exploiting Relationships in
Software Repositories
Andrew Begel
Khoo Yit Phang
Thomas Zimmermann
Microsoft Research
Redmond, WA, USA
andrew.begel@microsoft.com
University of Maryland
College Park, MD, USA
khooyp@cs.umd.edu
Microsoft Research
Redmond, WA, USA
tzimmer@microsoft.com
ABSTRACT
Large-scale software engineering requires communication and collaboration to successfully build and ship products. We conducted
a survey with Microsoft engineers on inter-team coordination and
found that the most impactful problems concerned finding and keeping track of other engineers. Since engineers are connected by their
shared work, a tool that discovers connections in their work-related
repositories can help.
Here we describe the Codebook framework for mining software
repositories. It is flexible enough to address all of the problems
identified by our survey with a single data structure (graph of people and artifacts) and a single algorithm (regular language reachability). Codebook handles a larger variety of problems than prior
work, analyzes more kinds of work artifacts, and can be customized
by and for end-users. To evaluate our framework’s flexibility, we
built two applications, Hoozizat and Deep Intellisense. We evaluated these applications with engineers to show effectiveness in addressing multiple inter-team coordination problems.
Categories and Subject Descriptors:
D.2.9 [Software Engineering]: Management—productivity H.5.2
[Information Systems]: User Interfaces—User-centered design
General Terms: Management, Human Factors
Keywords: Knowledge management, Social networking, Mining
software repositories, Inter-team coordination, Regular expression,
Regular language reachability
1.
INTRODUCTION
Coordination between software teams is a persistent problem in
software engineering. Teams are dependent on one another for
code, APIs, features, schedules, bugs and documentation [7], and
require frequent and effective communication and cooperation to
accomplish their tasks [13]. Unfortunately, poor execution in these
areas is often a cause of inter-team conflict. The industry’s move
towards distributed development and increased use of technology-mediated communication only exacerbates the problems [15].
ICSE ’10, May 2-8 2010, Cape Town, South Africa
Copyright 2010 ACM 978-1-60558-719-6/10/05 ...$10.00.
We conducted a survey at Microsoft to learn about coordination problems between software development teams. We asked
survey respondents (software developers, testers, and
program managers) to prioritize 31 different information needs
around inter-team coordination. The results, which we present in
this paper (Section 2), show that engineers want new solutions for
finding people, discovering and tracking dependencies, learning
about the status of work items, and learning the rationale behind
changes. What is most interesting about this list is that the majority of indicated needs are about discovering, meeting, and
keeping track of people, not just code.
Software engineers are connected to one another in many ways,
directly through face-to-face interactions and communication technologies, and indirectly through their shared work artifacts, which
are stored and maintained in software repositories. Tools that discover connections inside these repositories can help address many
of the engineers’ coordination needs. This is an approach used
by many applications in the field of mining software repositories
(MSR) [14]. MSR applications typically address one particular
information need at a time, for example, assigning developers to
bugs [3], detecting duplicate bugs [27], or recommending related
changes [32]; for more applications, see Kagdi et al.’s survey [19].
Most applications read data from only one or two software repositories and are built on very different infrastructures. These characteristics make it difficult to address multiple information needs
with a single tool or framework.
Previously, in our ICSE NIER paper, we proposed the idea of
Codebook as a social networking web service that helps engineers
to find and maintain connections with colleagues [6]. In this paper, we contribute the design, implementation, and evaluation of
the Codebook framework. On top of this, we have built several
applications which address the information needs identified by our
survey.
Codebook discovers transitive relationships between people, code,
bugs, test cases, specifications, and other related artifacts by mining any kind of software repository (Section 3). It extensively
supports multiple information needs with one data structure (a directed graph) and one algorithm (regular language reachability).
Codebook is designed to be customizable by local domain experts,
who have the most accurate knowledge about their teams’ information needs and software development practices. These experts
codify their knowledge into regular expressions that describe paths
through the nodes and edges in the graph. Codebook takes care of
the processing and optimization that is needed for efficient crawling, analyzing and querying of the data, even for information that
is indirectly linked across repositories, a task which is inadequately
addressed in prior work.
The following examples illustrate how a domain expert can write
paths to help teammates discover who works most closely together:
1. Which developer owns some piece of code?
Regular expression: Person Commits Changeset Modifies
FileRevision Modifies Code
2. Which program manager wrote the specification for that
code?
Regular expression: Code MentionedBy WordDocument
AuthoredBy Person
3. Which program managers and developers on the team
work together (combines 1 and 2)?
Regular expression: Person Commits Changeset Modifies
FileRevision Modifies Code MentionedBy WordDocument
AuthoredBy Person
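As a minimal sketch of how such paths select pairs of endpoints, the following Python fragment evaluates the three expressions above over a toy graph. The node names, the EDGES list, and the endpoints helper are hypothetical illustrations and not part of Codebook. The sketch handles only plain label sequences and omits the loops and alternation that real path expressions allow.

```python
from collections import defaultdict

# Hypothetical toy data: node name -> node type.
NODE_TYPE = {
    "dave": "Person", "pam": "Person", "cs42": "Changeset",
    "square.cs@3": "FileRevision", "Square": "Code", "spec.docx": "WordDocument",
}
# Labeled, directed edges (source, label, target), also hypothetical.
EDGES = [
    ("dave", "Commits", "cs42"),
    ("cs42", "Modifies", "square.cs@3"),
    ("square.cs@3", "Modifies", "Square"),
    ("Square", "MentionedBy", "spec.docx"),
    ("spec.docx", "AuthoredBy", "pam"),
]

def endpoints(path):
    """Return the (start, end) pairs joined by a plain label sequence."""
    labels = path.split()
    node_labels, edge_labels = labels[0::2], labels[1::2]
    out = defaultdict(list)
    for src, label, dst in EDGES:
        out[(src, label)].append(dst)
    pairs = set()
    for start, kind in NODE_TYPE.items():
        if kind != node_labels[0]:
            continue
        frontier = {start}
        # Alternate: follow one edge label, then check the node label reached.
        for edge_label, node_label in zip(edge_labels, node_labels[1:]):
            frontier = {dst for node in frontier
                        for dst in out[(node, edge_label)]
                        if NODE_TYPE[dst] == node_label}
        pairs.update((start, end) for end in frontier)
    return pairs

OWNS = "Person Commits Changeset Modifies FileRevision Modifies Code"
SPEC = "Code MentionedBy WordDocument AuthoredBy Person"
WORKS_WITH = OWNS + " MentionedBy WordDocument AuthoredBy Person"

print(endpoints(OWNS))        # developer-to-code pairs
print(endpoints(WORKS_WITH))  # developer-to-program-manager pairs
```

Only the endpoints are returned, which mirrors the way the path results are described as pairs of people and artifacts.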
The results of computing these paths are pairs of people, code,
bugs, test cases, specs, etc., and are revealed to front-end applications via web services. To evaluate the flexibility of our framework
to address the multiple information needs identified by our survey,
we have built two applications on top of Codebook. Both were designed and evaluated in consultation with the Microsoft software
engineers the tools were created to help.
Our first application is Hoozizat (Section 4.1), which addresses
four of the top ten information needs identified in our survey. Hoozizat is a web-based search portal that helps engineers find their
counterparts who are responsible for a particular feature, API, product or service. Given some search terms, Hoozizat returns a set of
related people, work items, code and files from the repositories.
Next to each result is a second, shorter list of the engineers Codebook has found to be associates for people in the results, or owners
for other items.
The second application is Deep Intellisense (Section 4.2), which
addresses another information need from the top ten list. Deep
Intellisense was first built [16] on a prior implementation [31] of
Codebook. It is a Visual Studio add-in that shows a complete history of events for any program identifier that the user clicks on in
the editor, including code changes, filed bugs, and forum discussions developers had about the code in question.
In addition, we specify two more applications addressing three
more of the top ten information needs (Section 5). Of the remaining
two needs, one is addressed in previous work [25], and one we keep
for our future work.
1.1
Contributions
This paper makes the following contributions:
∙ The results of a survey of inter-team coordination needs for
a variety of software team roles. (Section 2)
∙ A novel, flexible, customizable framework for mining software repositories, which can support multiple applications
with a single data structure and a single algorithm. (Section 3)
∙ A demonstration of the framework’s flexibility: two applications
which address five of the top ten information needs reported by our survey (Section 4),
and specifications of two additional applications which address three more of the top ten needs. (Section 5)
∙ An evaluation by Microsoft engineers of the usefulness of our applications in satisfying inter-team coordination information needs.
(Section 4)
2.
INTER-TEAM COORDINATION SURVEY
We conducted an anonymous web-based survey in order to learn
how software engineers prioritize 31 different information needs
about inter-team coordination. In June 2009 inside Microsoft Corporation, we sent an email invitation to 1,000 developers, testers,
and program managers (consisting of a 3% random sample of employees in each job role). Respondents were offered a chance to
win a single $250 gift certificate as incentive for completing the
survey. 11% of the invitees responded to the survey.
The survey was divided into a demographic section and a section that asked respondents to check any number of 31 inter-team
coordination information needs derived from previous studies [23,
21, 6, 16] and interviews with software engineers (Section 4.1).
These needs were organized into eight categories: change notification, finding dependents, finding other people, finding artifacts,
awareness of other teams, artifact history, work planning, and social networking. Respondents were asked to “pick the tasks that are
most important to you, and where if you had a new tool that could
make this task easier, it would have a big positive impact on your
work day.” Respondents chose an average of 12.5 tasks (SD = 5.5).
The ten most indicated coordination information needs amongst
Microsoft software engineers are listed below, along with the percentage of respondents who indicated that response and a reference
to where they are addressed in this paper.
1. Given a feature, API, product or service, finding out who the
most relevant engineers (developers, testers, program managers, operations, leads, etc.) are in order to contact them.
(83%)
→ Hoozizat (Section 4.1)
2. Finding an expert to talk to who knows a lot about a feature,
API, product or service. (67%)
→ Related work [25]
3. Given a feature, API, product or service from another team,
getting a list of servers, directories and repositories where the
related code, bug reports, work items, specifications, etc. are
located. (64%)
→ Future work
4. Finding out why a recent change was made, e.g., the related
bug report/work item, specifications, or conversation threads
in discussion lists. (62%)
→ Deep Intellisense (Section 4.2)
5. Being notified that there is a recent change that affects my
code or work items. (60%)
→ “Anxious for Awareness” (Section 5.2)
6. Finding out who might be affected by a change I make to my
code/API. (57%)
→ “Who is using our code?” (Section 5.1)
7. Finding out who owns some code or has ever worked on it in
the past. (56%)
→ Hoozizat (Section 4.1)
8. Finding out who owns a specification or knows the most
about it. (56%)
→ Hoozizat (Section 4.1)
9. Finding out which teams own the feature, product or service
I or my team depend on. (53%)
→ Hoozizat (Section 4.1)
10. Finding out everyone outside my team who depends on my
feature, API, product or service. (50%)
→ “Who is using our code?” (Section 5.1)
Notice that the top two needs are about finding the people responsible for or knowledgeable about a feature, API, product or service.
Five more of the top ten needs are also about finding people (6–10)! This may be to report a bug, to get programming advice, to
learn about a scheduling change, to request a new feature, or perhaps to ask for a code review. Through interviews with thirteen engineers having various job roles at Microsoft, we found that most
ask their friends or colleagues to direct them to the people they are
looking for. If their friends do not know the answer, they usually
know someone else who may. This process of asking friends often
resembles a game of six degrees of separation.
[Figure 1 diagram. Labeled components: Crawlers (AD, TFS, Sharepoint, Discussion) over repositories (AD, TFS, Sharepoint, mailing lists); Recognizers (Code, Work item, People); SQL Server; Path Analyzer; Path Queries (Associate, Contexts, and Ownership paths); Web Service; Web Portal; App Plugins (Visual Studio, Outlook, Your App).]
Figure 1: The architecture of the Codebook framework. Any
number of repositories may be crawled, analyzed and stored
as a graph in a SQL Server database. A set of paths created
by domain experts is compiled, uploaded into the system, and
used to discover the paths that exist in the graph. Front end
applications then use web services to query the database for
relevant people, artifacts, and paths that answer their end users’
information needs.
The people they seek can be found in the electronic repositories
used by software engineers in their daily work practice. Much of
that information, however, is hidden inside of these repositories and
difficult to discover behind opaque query interfaces. Even after
information is dug up, correlating it with information from other
repositories is quite difficult [16].
Our goal for Codebook is to build a framework general enough
to answer inter-team coordination information needs directly, without requiring users to conduct in-depth investigations of raw data
sources on their own. In the next section, we describe the Codebook
framework and discuss our solution for binding raw information together to discover and exploit the connections found within.
3.
THE CODEBOOK ARCHITECTURE
The key data structure behind Codebook is a graph of typed
nodes, which represent repository objects (such as people, changesets, work items, files, and source code), and typed edges, which label the relationships of the nodes to one another (such as commits,
bug assignments, caller/callee, use/def, textual allusions).¹ Codebook’s algorithm then walks the graph from one set of interesting
objects to another via the relationship edges and discovers which
objects are ultimately transitively connected to each other.
To ensure scalability, it is important not to compute transitive
closure blindly (which would take $O(n^3)$ time and produce many
useless results), but instead to focus on useful paths between
pairs of nodes in the graph. A path through the graph recognized
by a regular expression is useful if, by computing the path and
its endpoint nodes, it answers a question posed by a domain expert.
Paths are defined by regular expressions whose alphabet is composed
of the node and edge labels in the graph. We achieve scalability by using an $O(n^2)$ all-pairs regular language reachability
algorithm combined with an optimization specific to software engineering that drastically reduces the number of nodes considered
by the algorithm.
¹ This graph model was initially proposed by Venolia in a workshop
paper [31]. Codebook is built on a direct descendant of that model.
Several steps are involved in creating a Codebook graph useful for answering end-users’ questions about their team’s software
development activities. First, as illustrated in Figure 1, a set of
crawlers mine objects from various software repositories (for example, a revision control repository, a work item database and an
employee directory) and store them in the database as nodes in the
graph. Relationships between these objects are derived from structure, metadata, or textual allusions and stored in the database as
edges in the graph. Paths defined by regular expressions are written
by domain experts and compiled by Codebook into state machines
and uploaded into the database. The Path Analyzer runs a regular language reachability algorithm on the database to compute and
store the start and end nodes for each path recognized by the regular expressions. To support search applications, a full text index
is created for each object in the graph and its surrounding context
objects (as defined by another set of regular expression paths). Finally, Codebook’s data is exposed to applications via web services,
enabling many different front ends to answer end-user questions
with facts from Codebook.
In the next few sections, we elaborate on these steps to explain
the architecture and design rationale for the various parts of Codebook.
3.1
Crawlers
Codebook is designed to be an extensible system, consisting of
a family of crawlers for different kinds of repositories and various types of data. As part of the prototypes we have built, Codebook can index source code repositories (Microsoft Visual Studio
Team Foundation Server (TFS) and a Microsoft internal source
code repository server), work item databases (TFS and a Microsoft
internal work item server), employee directories (Active Directory),
emails from public mailing lists (Outlook and Exchange), source
code assemblies (using .NET Reflection on DLL assemblies), and
web sites (Sharepoint and discussion forums).
All nodes and edges, which represent objects and the relationships between them, are stored in the database with a start date,
end date, and last modified date, and are uniquely named by their URI.
Each object additionally contains a bag of words used for the search
index, consisting of a concatenation of several strings of metadata
specific to each object type. Relationships are stored unidirectionally, but paths may be defined to traverse edges in the forward or
backward direction.
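The storage model just described can be pictured with a small in-memory sketch. The class and field names below are our own assumptions for illustration (the actual store is a SQL Server database), and using None as the end date of an object that still exists is likewise an assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

@dataclass
class Node:
    uri: str                      # unique name of the object
    kind: str                     # node type, e.g. "Person" or "Changeset"
    start: date
    end: Optional[date]           # assumption: None while the object still exists
    last_modified: date
    words: list = field(default_factory=list)   # bag of words for the search index

@dataclass
class Edge:
    uri: str
    label: str                    # edge type, e.g. "Commits"
    source: str                   # URI of the node the edge leaves
    target: str                   # URI of the node the edge enters
    start: date
    end: Optional[date]
    last_modified: date

class Graph:
    def __init__(self):
        self.nodes, self.edges = {}, []

    def add(self, item):
        if isinstance(item, Node):
            self.nodes[item.uri] = item
        else:
            self.edges.append(item)

    def neighbors(self, uri, label, backward=False):
        """Follow stored edges forward, or against their stored direction."""
        if backward:
            return [e.source for e in self.edges if e.label == label and e.target == uri]
        return [e.target for e in self.edges if e.label == label and e.source == uri]

day = date(2009, 6, 1)
g = Graph()
g.add(Node("person:dave", "Person", day, None, day, ["dave", "developer"]))
g.add(Node("changeset:42", "Changeset", day, None, day, ["fix", "square"]))
g.add(Edge("edge:1", "Commits", "person:dave", "changeset:42", day, None, day))
```

Each relationship is stored once, and the backward flag lets a path traverse it in either direction without a second stored edge.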
3.1.1
Source Code Repository Crawler
The source code repository crawlers start at the first checkin and
proceed until the most recent checkin. For each checkin, the list
of changed files is enumerated, and each file’s differences are analyzed. Before and after snapshots of the edited files are parsed
with a code analyzer and compared. Whenever the differences overlap a source code element, that element is considered to have been
changed by the checkin. We do not currently track code that is
renamed or moved between files.
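The overlap test can be sketched as an interval intersection over line numbers. The tuple layouts and the changed_elements helper are hypothetical, and we assume that both the parsed elements and the diff hunks are available as inclusive line ranges.

```python
def changed_elements(elements, changed_ranges):
    """Return names of elements whose line span overlaps any changed range.

    elements: list of (name, first_line, last_line), inclusive.
    changed_ranges: list of (first_line, last_line), inclusive.
    """
    changed = []
    for name, first, last in elements:
        # Two inclusive ranges overlap when each one starts before the other ends.
        if any(first <= c_last and c_first <= last
               for c_first, c_last in changed_ranges):
            changed.append(name)
    return changed

# Hypothetical parse result for one file revision.
ELEMENTS = [("Square", 10, 14), ("Cube", 16, 22), ("Main", 30, 40)]
print(changed_elements(ELEMENTS, [(13, 17)]))
```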
Our Codebook prototype can analyze C, C++, C#, and VBScript
source code. All of the symbols (i.e., identifiers) contained within,
including inside method bodies and field initializers, are stored as
nodes in a database table, with metadata columns for symbol name,
fully qualified name, kind (e.g., class, field, method, operator, etc.),
programming language, and nesting depth. A distinct bag of words
is also stored for the bodies of the definitions of symbols, such as
class definitions and method definitions, to enable scoped searches.
Simple relationships between source code symbols, such as “LexicallyEnclosed” (lexical enclosure), “Superclass” (superclass and
subclass links), “Calls” (method calls), “Assigns” (variable assignment), “Names” (labels), “Parameter” (parameter of a method or
generic type), and “References” (appears in an expression) are stored
as edges in the graph.
Relationships requiring name resolution are not directly connected. Without perfect build- or run-time information and fully
linked DLLs, it is not possible to uniquely link callers and callees,
or uses to their definitions. Instead, Codebook creates an intermediate non-qualified SourceCode Identifier node, and connects it to
the incoming and outgoing links. This gives the added benefit that
when a new definition of a method is found, Codebook does not
have to add edges from all the callers of the methods with the same
name to the new definition; Codebook merely connects the new
definition to the already existing SourceCode Identifier node.
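A small sketch may make the benefit of the intermediate node concrete. The IdentifierIndex class, the "id:" prefix, and the qualified names are hypothetical. The edge labels "Calls" and "Names" follow the relationship names listed above.

```python
class IdentifierIndex:
    """Toy edge store in which callers and definitions meet at an identifier node."""

    def __init__(self):
        self.edges = []   # (source, label, target) triples

    @staticmethod
    def identifier(name):
        return f"id:{name}"   # the non-qualified identifier node

    def add_call(self, caller, callee_name):
        self.edges.append((caller, "Calls", self.identifier(callee_name)))

    def add_definition(self, qualified_name):
        short = qualified_name.rsplit(".", 1)[-1]
        # One new edge per definition, however many callers already exist.
        self.edges.append((qualified_name, "Names", self.identifier(short)))

    def candidate_callees(self, caller):
        ids = {t for s, label, t in self.edges if s == caller and label == "Calls"}
        return sorted(s for s, label, t in self.edges if label == "Names" and t in ids)

idx = IdentifierIndex()
idx.add_call("App.Main", "Square")
idx.add_call("App.Test", "Square")
idx.add_definition("Math.Square")
idx.add_definition("Geometry.Square")   # adds exactly one edge
print(idx.candidate_callees("App.Main"))
```

The lookup returns every definition sharing the name, which reflects the imprecision that the paragraph above attributes to missing build-time information.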
An additional crawler cracks open .NET assemblies and uses
the .NET reflection API to read out all of the source code symbols in each DLL. The advantage of reading DLLs is that within-DLL linking is already resolved, making it possible to resolve some
caller/callee and def/use relationships more precisely than when
reading source code alone.
3.1.2
Employee Directory Crawler
A company’s employee directory is crawled as people’s names or
email addresses are found in other crawlers. Each person is looked
up in the directory and their name, email, title, role, department,
office address, phone number, picture, and manager are stored in
the Codebook database. Each person’s manager is looked up as
well, to create a subgraph of “Manages” relationships, all the way
to the root of the management hierarchy.
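The walk up the management hierarchy can be sketched as a simple loop. The DIRECTORY mapping and the crawl_management_chain function are hypothetical stand-ins for a real directory lookup.

```python
# Hypothetical directory: person -> manager (None at the root).
DIRECTORY = {"dave": "mona", "pam": "mona", "mona": "ceo", "ceo": None}

def crawl_management_chain(person, directory, edges):
    """Add (manager, "Manages", report) edges from person up to the root."""
    while True:
        manager = directory.get(person)
        # Stop at the root, or once the chain above is already in the graph.
        if manager is None or (manager, "Manages", person) in edges:
            break
        edges.add((manager, "Manages", person))
        person = manager
    return edges

edges = set()
crawl_management_chain("dave", DIRECTORY, edges)
crawl_management_chain("pam", DIRECTORY, edges)
print(sorted(edges))
```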
3.1.3
Work Item Crawler
The work item database crawler begins at the first work item and
proceeds to the most recently created work item. Since each work
item may have been revised multiple times, each revision is processed separately. All work items in TFS consist of a title, a description, and a set of people who have “changed” it. The rest of
the work item consists of a property bag of fields and values which
should be stored in the metadata for a work item. These fields are
defined by a process template custom to the organization which deployed the TFS repository, meaning that work item crawling must
be configured by domain experts in order to understand what the
fields mean. Even discovering who a bug is assigned to requires
understanding the process template definition. In the repository
we studied, this is in the field labeled “System.AssignedTo” whose
value can be any string, not just a person.
Our prototype repository uses the Microsoft Process Template
(shipped with TFS) which supports six types of work items (Value
Proposition, Feature Group, Feature, Deliverable, Task, and Bug),
each of which defines its own custom fields. Despite the presence
of a template, a team using it can put any data they want into the
fields (subject to very loose constraints). This requires that Codebook be further customized by each individual team. For example, the process template suggests the use of “System.AreaPath” to
specify the component of a work item, but our team uses the field to
specify the milestone for which this work item is active. They use a
custom field, “Custom.01”, to indicate the work item’s component.
To analyze relationships in work items, Codebook must determine which fields have people, code, other work items, URLs, test
cases, or a pointer to any other object inside. Codebook lets domain experts define a configuration file to specify which fields of
the process template should be analyzed, and what data types each
is likely to have for their team. For each field where the type is
known, for example, a person field like “System.ClosedBy”, Codebook looks in the employee database repository for a person matching that name or email address and creates a relationship edge between the work item and that person with the “ClosedBy” label.
For fields where the type is unknown, or those which may hold natural language (like the title or description field), a set of regular
expressions for each object type is run over the text. If a word is
found that looks like a source code identifier (e.g. “AnnotateStringWithImage”), a “Mentions” edge is created in the Codebook graph
between the work item and a non-qualified SourceCode Identifier
node. We do not connect the work item directly to the SourceCode because it may not yet have been discovered by the Source
Code Crawler. When new Source Code nodes are created, they are
connected, if possible, to the appropriate preexisting Source Code
Identifier node using a “Names” edge.
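The following sketch shows one way a per-team configuration and a text recognizer could cooperate. The FIELD_CONFIG layout, the CamelCase pattern, the "id:" and "person:" prefixes, and the sample work item are all assumptions made for illustration. Only the field names and the example identifier come from the text above.

```python
import re

# Hypothetical per-team configuration: field name -> (data type, edge label).
FIELD_CONFIG = {
    "System.AssignedTo": ("person", "AssignedTo"),
    "System.ClosedBy": ("person", "ClosedBy"),
}
# Assumed recognizer: a word made of at least two capitalized parts.
IDENTIFIER = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")

def work_item_edges(item_id, fields, directory):
    """Derive graph edges from one work item revision."""
    edges = []
    for name, value in fields.items():
        kind, label = FIELD_CONFIG.get(name, ("text", None))
        if kind == "person":
            # The field may hold any string, so link only known people.
            if value in directory:
                edges.append((item_id, label, directory[value]))
        else:
            # Unknown or free-text field: look for identifier-like words.
            for word in IDENTIFIER.findall(value):
                edges.append((item_id, "Mentions", f"id:{word}"))
    return edges

FIELDS = {
    "System.AssignedTo": "dave",
    "System.ClosedBy": "Triage Team",
    "System.Title": "Crash in AnnotateStringWithImage when image is null",
}
print(work_item_edges("bug673", FIELDS, {"dave": "person:dave"}))
```

The "Mentions" edge targets the identifier node rather than a source code node, so it remains valid whether or not the code has been crawled yet.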
3.1.4
Textual Allusions
At Microsoft, a person’s email address is often used to name the
person in natural language documents, such as emails or bug reports. Whenever a short word of up to eight characters is seen in a document, Codebook looks up the word as an email address in the
employee directory. If found, Codebook links that person to the
object where the word was found with a “Mentions” relationship.
Textual allusions like this sometimes result in false positives.
For example, an employee named “William Jones” may have the
email address “will”, which unfortunately, is a common English
word that shows up in many emails and bug reports, not just in
those that refer to William Jones. To address these overzealous
connections, each Codebook graph edge has a Confidence field (a
floating point number ranging from 0.0 to 1.0), that indicates the
likely accuracy of the edge. Structurally-defined edges (such as
lexical enclosure or bug assignment) receive a 1.0 confidence score,
while other edges that derive from using regular expressions or linguistic analysis to discover email addresses or source code symbols
in natural language text, receive a lower score.
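A sketch of alias matching with confidence scores follows. The 1.0 score for structural edges is taken from the text. The 0.5 score for textual edges, the second alias, and the helper names are placeholders of our own.

```python
import re

STRUCTURAL = 1.0   # from the text: structurally-defined edges
TEXTUAL = 0.5      # assumed placeholder for the "lower score"

# "will" is the example from the text; "dsmith" is hypothetical.
ALIASES = {"will": "William Jones", "dsmith": "Dana Smith"}

def mention_edges(doc_id, text, aliases):
    """Link a document to every person whose short alias appears in its text."""
    words = set(re.findall(r"\b[a-z]{1,8}\b", text.lower()))
    return sorted((doc_id, "Mentions", aliases[w], TEXTUAL)
                  for w in words if w in aliases)

def confident(edges, threshold):
    """Keep only edges whose confidence reaches the threshold."""
    return [e for e in edges if e[3] >= threshold]

edges = mention_edges("bug1", "This fix will ship Monday; ask dsmith about it.", ALIASES)
edges.append(("bug1", "AssignedTo", "Dana Smith", STRUCTURAL))
print(confident(edges, 0.9))
```

The word "will" in the sample text produces exactly the kind of false positive described above, and a consumer that needs higher precision can filter it out by thresholding on confidence.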
3.1.5
Other Crawlers
Many useful connections can be derived from public mailing lists
and discussion forums, inferring both affinity groups as well as expertise. Each message is crawled in chronological order, processing
the sender and receivers of the message, as well as running regular
expressions over the text to find textual allusions to other objects.
Web sites, such as Sharepoint repositories, can be crawled to find
documents relevant to software development. For example, many
teams at Microsoft store their specifications, meeting notes, marketing information, and legal documents in Sharepoint. The titles
and contents of these documents are mined and stored in the Codebook graph and linked to their authors (who, at Microsoft, are likely to
be the program managers in charge of that feature). In addition, specification documents are often constructed from templates which indicate which developer and tester will be working on the feature.
Codebook’s text recognizers can be customized to read that section
of the document to identify the owners of the feature. The rest of
the document usually contains the names of classes, methods and
fields, which can be connected to the source code that eventually is
written to implement the feature.
3.2
Graph Paths
A graph of related objects is the central component of many applications. For example, Facebook is centered around a graph of
people who are declared to be “friends” with one another. Facebook’s
graph is simple; each node is a person, and each edge is
labeled “friend”. Codebook’s graph is more complex: there are
9 node types and 18 edge types, for a total of 29 possible triples
(most are shown in Figure 2).
Due to the complexity of the underlying data that Codebook is
representing, two objects may be related to one another even if they
are not directly connected in the graph. For example, to find the
program manager responsible for the Square method in Figure 2
(follow the bold nodes and edges), one needs to look for any specification documents that contain the signature of the Square method,
and find its author, which in this case turns out to be Pam the Program Manager. We can further learn that Pam works with Dave the
Developer because the Square identifier that Pam’s specification
points to was named by a method checked in by Dave. In addition,
Pam created Bug #673 which is assigned to Dave. Bug #673 also
includes a stack trace mentioning the implementation of the Square
method written by Dave.
Though these domain-specific connections hop across many edges
in the graph, they can be described succinctly by regular expressions over the node and edge labels between the paths’ endpoints.
One of the key contributions of our work is to recognize that many
applications previously implemented in one-off data mining software can be represented by regular expressions in this graph.
The paths above can be written as regular expressions, starting
with “Person Authors SpecificationDocument Mentions SourceCodeIdentifier NamedBy SourceCodeMethod.” To connect that method to
Dave, we add “SourceCodeMethod ModifiedBy FileRevision ModifiedBy Changeset CommittedBy Person.” Another way to connect Pam to Dave is via Bug #673: “Person Created WorkItem AssignedTo Person.”
The alphabet of our regular expressions consists of the node and edge labels from the graph. Sequences of these labels can include optional
elements (?), loops (+, *), alternation (|) and grouping ((...)).
After each edge label, the author can write a label-specific suffix
token (e.g., ModifiedBy or ContainedWithin) to indicate the direction of the relationship in the regular expression. For example, in
the regular expression “Person ManagedBy Person”, the person on
the right is the manager of the person on the left.
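To illustrate the syntax only, a path expression over labels can be translated into an ordinary regular expression over a space-terminated label string. This is a sketch that uses Python's re module as a stand-in. It checks whether a given label sequence is accepted, and it is not how Codebook compiles paths into state machines. The looped management-chain expression is our own example.

```python
import re

OPERATORS = set("()|?*+")

def compile_path(expr):
    """Translate a label-level path expression into a Python regex."""
    tokens = re.findall(r"\w+|[()|?*+]", expr)
    # Each label becomes one atomic unit ending in a space, so that
    # operators apply to whole labels rather than to characters.
    parts = [tok if tok in OPERATORS else f"(?:{re.escape(tok)} )"
             for tok in tokens]
    return re.compile("".join(parts))

def accepts(pattern, labels):
    """True if the whole label sequence is in the path's language."""
    return pattern.fullmatch("".join(label + " " for label in labels)) is not None

# A person followed by one or more steps up the management chain.
CHAIN = compile_path("Person (ManagedBy Person)+")
print(accepts(CHAIN, ["Person", "ManagedBy", "Person", "ManagedBy", "Person"]))
```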
Domain experts can both read and write these regular expressions based on their knowledge of the software development team’s
work practices and procedures. The paths of activity in a team
where engineers work closely together in feature crews (a trio of
a developer, tester and program manager) will look different from those in
an Agile team that has no distinction between developer and tester
and whose developers often pair up with new partners each day.
In addition, since groups that employ the same process templates
in their work item databases do not utilize the fields of these templates in the same way, having knowledge of a particular team’s
practices is crucial to understanding how their work is represented
electronically [4].
Computing paths using regular language reachability provides
only the endpoints of the accepted paths. Thus, one can answer the
question “is there any path between A and B?”, but not enumerate
those paths (of which there may be an infinite number due to using loops in a regular expression). Adding in more discriminatory
power to report any single path between two nodes requires a more
complex algorithm, such as all-pairs shortest path constrained by a
regular language, but this raises the complexity of the algorithm to
$O(|V|^3 |S|)$, which is impractical for large graphs of the sort Codebook creates. To compensate for our algorithm’s inability to return
an exact set of paths, we have found it is useful to create many
short regular expressions with descriptive names. For example, the
regular expression that connects a bug to a piece of code may be
called “BugToCodeViaStackTraceInReport.” This is almost always
good enough for an end-user to believe the connection is real, and
if desired, discover the exact path through inspection, now that he
knows it exists.
Our implementation of regular language reachability has been
executed on graphs of up to 100,000 nodes and 100,000 edges. A
53-state state machine takes about 50 minutes to compute on a dual
core Intel Xeon E5450 virtual machine with 2 GB RAM, and 1
GB available to SQL Server 2008 SP1, running in Windows Server
2008 SP2 with Hyper-V.
3.3
Regular Language Reachability
Once regular expressions have been defined, Codebook computes the set of paths in the graph that conform to each regular expression. We use a modification of breadth-first search constrained
by the regular expressions, an algorithm known as regular language reachability. This algorithm runs in $O((|V| + |E|)|S|)$ time
for a single origin, and in $O(|V|(|V| + |E|)|S|)$ time for all origins, where $|V|$ and $|E|$ are the numbers of nodes and edges in the graph and $|S|$ is the number of states in the compiled state machine. Codebook graphs have a power-law edge distribution: a
few nodes have many edges and the rest have few, with a
long tail [31]. In all of the graphs we have seen, $|E|$ is within
two to three times $|V|$; thus we surmise that the time complexity is $O(|V|^2 |S|)$ for all-pairs regular language reachability.
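The search can be sketched as a breadth-first traversal over pairs of (graph node, automaton state). The code below is a minimal illustration under our own assumptions: the state machine is written by hand as a transition table (DELTA) for the expression "Person (ManagedBy Person)+", and the toy graph and all function names are hypothetical.

```python
from collections import deque

def reachable(origin, node_type, out_edges, delta, start, accepting):
    """Nodes that end an accepted path beginning at origin."""
    def step(states, label):
        return {t for s in states for t in delta.get((s, label), ())}

    # Consume the origin's own node label first.
    seen = {(origin, s) for s in step({start}, node_type[origin])}
    queue = deque(seen)
    found = set()
    while queue:
        node, state = queue.popleft()
        if state in accepting:
            found.add(node)
        for label, target in out_edges.get(node, ()):
            # Consume the edge label, then the label of the node it enters.
            for nxt in step(step({state}, label), node_type[target]):
                if (target, nxt) not in seen:
                    seen.add((target, nxt))
                    queue.append((target, nxt))
    return found

def all_pairs(node_type, out_edges, delta, start, accepting):
    """Run the single-origin search from every node."""
    return {(a, b) for a in node_type
            for b in reachable(a, node_type, out_edges, delta, start, accepting)}

# Hypothetical toy graph and hand-built automaton; state 3 is accepting.
NODE_TYPE = {"dave": "Person", "mona": "Person", "ceo": "Person"}
OUT_EDGES = {"dave": [("ManagedBy", "mona")], "mona": [("ManagedBy", "ceo")]}
DELTA = {(0, "Person"): {1}, (1, "ManagedBy"): {2},
         (2, "Person"): {3}, (3, "ManagedBy"): {2}}

print(sorted(all_pairs(NODE_TYPE, OUT_EDGES, DELTA, 0, {3})))
```

Each (node, state) pair is enqueued at most once, which is where the per-origin bound proportional to the graph size times the number of states comes from. As in the text, only endpoints are reported, not the paths themselves.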
3.3.1
Optimization
The graph described above actually contains 200,000 nodes and
350,000 edges, but we have optimized much of it away because it
does not contribute to any “useful” paths. The crux of the optimization lies in pruning the SourceCode Identifier nodes.
For every source code symbol definition, the source code crawler
creates two nodes, a Definition node, and an Identifier node. The
Identifier node is used to connect caller/callee, def/use chains, and
“Mentions” relationships between text and code, in the face of imperfect name resolution (especially for code found in text fields).
There are two cases where the Identifier node (and its adjacent
edges) are not useful for path computations. An Identifier node can
exist without a definition, if a textual allusion to that identifier was
made in a source code comment, work item or email, but the identifier was never realized in source code. In this case the Identifier
was a mistake, and should be pruned. Second, if an Identifier node
is created for a definition, but there are no “Calls,” “References,” or
“Mentions” links to it, then the node and its edges should be pruned
as well.
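The two pruning rules can be sketched in Python as follows. This is an illustrative sketch rather than the Codebook crawler's code: the edge-list representation, the node names, and the use of a “NamedBy” edge to tie an Identifier to its Definition are assumptions.

```python
LINK_LABELS = {"Calls", "References", "Mentions"}
DEFINITION_LABEL = "NamedBy"  # assumed label tying an Identifier to its Definition


def prune_identifiers(kinds, edges):
    """Drop Identifier nodes that cannot contribute to a useful path.

    kinds maps a node to its kind; edges is a list of
    (source, label, target) triples.
    """
    identifiers = {n for n, kind in kinds.items() if kind == "Identifier"}
    defined, linked = set(), set()
    for source, label, target in edges:
        for endpoint in (source, target):
            if endpoint in identifiers:
                if label == DEFINITION_LABEL:
                    defined.add(endpoint)
                elif label in LINK_LABELS:
                    linked.add(endpoint)
    # Rule 1: no definition. Rule 2: defined, but nothing links to it.
    doomed = identifiers - (defined & linked)
    kept_kinds = {n: k for n, k in kinds.items() if n not in doomed}
    kept_edges = [e for e in edges if e[0] not in doomed and e[2] not in doomed]
    return kept_kinds, kept_edges


# Hypothetical toy data: "Ghost" is only alluded to in a bug report, and
# "Unused" is defined but never called, referenced, or mentioned.
kinds = {
    "id:Square": "Identifier", "def:Square": "Definition",
    "id:Ghost": "Identifier",
    "id:Unused": "Identifier", "def:Unused": "Definition",
    "Bug #673": "WorkItem",
}
edges = [
    ("id:Square", "NamedBy", "def:Square"),
    ("Bug #673", "Mentions", "id:Square"),
    ("Bug #673", "Mentions", "id:Ghost"),
    ("id:Unused", "NamedBy", "def:Unused"),
]
kept_kinds, kept_edges = prune_identifiers(kinds, edges)
```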
Performing this pruning results in a 65-75% reduction in the
number of graph nodes and edges used in the algorithm. We calculated this reduction on each month of data entered into our Codebook prototype, and for each month the reduction in nodes and
edges was roughly constant (+/- 5% on edges and +/- 10% on
nodes). A 3/4 reduction in ∣𝑉 ∣ and ∣𝐸∣ results in a 10-14x speedup
in the running time of our algorithm.
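As a rough consistency check, the measured speedup can be compared with what the quadratic estimate predicts. The sketch below assumes that running time scales as ∣𝑉∣²∣𝑆∣, that ∣𝐸∣ shrinks in proportion to ∣𝑉∣, and that ∣𝑆∣ is unchanged; the function name is ours.

```python
def predicted_speedup(reduction):
    """Speedup predicted by a |V|^2 cost model after removing a fraction of nodes."""
    remaining = 1.0 - reduction
    return 1.0 / remaining ** 2


low = predicted_speedup(0.65)   # about 8.2x
high = predicted_speedup(0.75)  # 16x
```

Under these assumptions a 65-75% reduction predicts a speedup between roughly 8x and 16x, a range that contains the 10-14x we measured.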
3.4
Search
A fundamental part of the Codebook architecture is search. Abstractly, a search takes a set of keywords and returns a ranked list
of Codebook nodes whose metadata best matches the keywords.
We initially employed a simple search algorithm, TF-IDF (term
frequency-inverse document frequency) on the node metadata’s text.
However, this offered poor results, since it was not possible to
search for a function by the author’s name, or find a person who
worked with someone else who was not in his management chain.
Fortunately, the Codebook graph is much like the web in which
the link structure is semantically meaningful. Thus, a better search
algorithm could take advantage of this to improve the search results. Unlike the web, however, Codebook graphs do not have
anchor text that describes the target of a link between two nodes,
depriving us of useful context to improve search accuracy. In addition, the immediate neighbors of a node, while structurally valuable
for understanding the process of the team’s development tools, do
not often contribute useful contextual meaning to the node.
[Figure 2 graph. Node types: Person, Changeset, FileRevision, File, Folder, Work Item, Test Result, SourceCode. Edge labels: Committer, Author, Modifies, Contains, Created, Assigned To, Closed, Mentions, LexicallyEnclosed, Superclass, Calls.]
Figure 2: A canonical graph of possible relationships between objects in Codebook. “Mentions” edge labels are written in italics to
indicate that they may derive from an incorrect inference. The bolded paths help explain a scenario written in the prose.
We can use path regular expressions to find other nodes in the
graph whose metadata can be used to substitute for the anchor text
and immediate context we lack. Using 25 more path regular expressions, Codebook enumerates the relevant anchor nodes for each
node in the graph. For example, anchor nodes include the owner of
a piece of code, the person responsible for tracking a work item
task, the filename where a particular source code symbol is defined, the specification document that describes a piece of code,
etc. Nodes with a high degree of “anchor” edges convey authority
about a node the way that a high-degree hub does in the web.
Codebook uses SQL Server’s full-text search algorithm [2] on
the node and anchor meta-data to come up with a list of results.
The calculation gives us a score that we can combine with domain-specific knowledge that we learned from our interviews with developers to derive a ranking function. For example, symbol definitions are ranked higher than symbols that appear as references.
Open work items are ranked higher than closed work items. People
who are individual contributors rank higher than managers, since
they are more likely the ones who have done the work and therefore know the most about it. Edges with lower confidence (such
as connecting a bug description using the word ’will’ to the Person William Jones) contribute to a lower ranking for the nodes they
connect to. The specific values for these rankings can be tuned
manually by the team’s domain expert to conform to their software
development processes.
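One way to picture such a ranking function is as a product of the full-text score and a few tunable multipliers. The sketch below is illustrative only: the `Hit` fields and every numeric weight are hypothetical placeholders, not the values or the formula used in Codebook.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    text_score: float              # score reported by the full-text engine
    is_definition: bool = False    # symbol definition rather than a reference
    is_open_work_item: bool = False
    is_manager: bool = False
    edge_confidence: float = 1.0   # 0..1, lower for doubtful "Mentions" edges


# Hypothetical weights that a team's domain expert would tune.
WEIGHTS = {"definition": 1.5, "open_work_item": 1.25, "manager": 0.8}


def rank_score(hit, weights=WEIGHTS):
    """Combine the text score with domain-specific adjustments."""
    score = hit.text_score * hit.edge_confidence
    if hit.is_definition:
        score *= weights["definition"]
    if hit.is_open_work_item:
        score *= weights["open_work_item"]
    if hit.is_manager:
        score *= weights["manager"]
    return score


def rank(hits, weights=WEIGHTS):
    """Return hits ordered from most to least relevant."""
    return sorted(hits, key=lambda h: rank_score(h, weights), reverse=True)


hits = [
    Hit("William Jones (matched on 'will')", 1.0, edge_confidence=0.3),
    Hit("reference to Square", 1.0),
    Hit("manager", 1.0, is_manager=True),
    Hit("definition of Square", 1.0, is_definition=True),
]
ordered = [h.name for h in rank(hits)]
```

Keeping the weights in a single table is what would let a domain expert adjust them without touching the scoring code.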
3.5
Web Services
The final component of the Codebook architecture is a set of web
services that expose the graph, the data contained within, and the
computed paths to front-end clients. All nodes are referred to by
URI, and their metadata can be fetched on demand. To connect
a new application to the Codebook web service, the application
developer uploads his path regular expressions to the system, where
they are compiled and computed periodically, perhaps once per day.
The developer then queries Codebook to retrieve the computed data
in the form of tuples of node URIs. The developer can render his
application in any form, such as a web form, a Sharepoint web part,
an Outlook plugin, or a Visual Studio client add-in.
3.6
Statistics
Our Codebook repository was created from six months of development time for a medium-sized team at Microsoft. We mined their
TFS source code and work item repositories, and mined employee
information from the Microsoft Active Directory. The repositories
we read from contained around 42 GB of data, resulting in a Codebook database of 3.5 GB. It contains 420 people, 19,000 source
code definitions, 9,700 files (including source code files), and 3,400
work items. The graph for this data contains 200,000 nodes and
350,000 edges and scales linearly with the age of the repository.
The numbers reported throughout the paper are based on this database.
4.
CURRENT APPLICATIONS
Codebook was designed to be flexible enough to satisfy a large
variety of information needs using a single abstraction. In this section, we demonstrate this flexibility by describing two applications,
Hoozizat and Deep Intellisense, that we built on top of Codebook.
4.1
Hoozizat: Finding People with Codebook
In the survey described in Section 2, four information needs (1,
7–9) concerned finding the people who own and are responsible
for a feature, API, product or service. We built Hoozizat, a web
search portal on top of Codebook, specifically to help engineers
find these feature owners. Hoozizat was built in consultation with
six engineers at Microsoft; we interviewed them prior to building
our application to find out how inter-team coordination needs arise
in their work. A typical scenario that we discovered from our interviews is the following:
Xin, the software developer, has found a bug in his code.
Only Xin didn’t do anything to cause the bug, other than to
update a library he was using that was written by another
team within his software company. He is pretty sure the bug
is caused by some change to this library, but does not know
whether the bug is due to his own misconceptions in using
the code, or a bug in the library itself. A typical software
developer looks up the problem on an intranet or web search
engine to find the answer, but in this case, since the code is
not public, and the product has not yet been shipped, there is
nothing to find.
Xin would like to find someone from the library team who
can look at his code and tell him what is wrong with it. If
it is a bug in the library’s implementation, he would like to
tell someone on that team to file a bug. If instead, it is a bug
in the specification, he needs to find the person on the team
responsible for managing the library’s specification to report
the problem.
Finding the right person to answer these questions involves searching through various intranet web portals such as Sharepoint, intranet search, full-text search over the codebase, and search over
the company employee directory. From our interviews, we found
that although one or more of these portals may point to the right
answer, it is time-consuming to search multiple repositories and
sift through each result list. Furthermore, to find a person that can
answer Xin’s questions, he would have to delve into each result to
locate a related person. We found that engineers would typically
spend no more than 10 minutes trying to search these web portals
before giving up.
More commonly, we found that engineers such as Xin would ask
their colleagues or managers if they knew a person they should talk
to. When they did not, they might direct him to another person
who might know. This process may repeat several times, just like a
game of six degrees of separation. Xin may eventually find the right
person, as long as he is motivated to put in the time and legwork.
While this process is inefficient, our interviews suggest that it is
much easier to find a person by asking friends than by searching
through web portals.
We surmise that there are two ways we can greatly improve the
search experience: first, we should search across multiple repositories at once, so that Xin would only have to query one database
to find relevant answers; second, we should return not just a list
of artifacts in the results, but also people that can answer Xin’s
questions, so Xin would not have to dig into each result to find the
related person.
4.1.1
Finding the Answer with Codebook
Codebook’s design is ideal for addressing this scenario. First, it
crawls multiple repositories and can perform searches across all of
them simultaneously. Second, Codebook can return the people Xin
can talk to, not just their artifacts, by using regular language paths
to describe relationships from artifacts to people. With Hoozizat,
we use Codebook’s built-in search, described in Section 3.4, to find
a list of matching people and artifacts based on a text query.
Artifact Ownership. For artifacts returned in the search results,
we augment each artifact result with a list of owners, so a user such
as Xin can quickly find a person who can answer questions about
that artifact. We define owners to be simply people who have made
changes or have been assigned to an artifact:
∙ File ModifiedBy FileRevision ModifiedBy Changeset CommittedBy Person
∙ SourceCode ModifiedBy FileRevision ModifiedBy Changeset CommittedBy Person
∙ WorkItem ((Mentions ∣ . . . ∣ DuplicateOf) WorkItem)∗ (AssignedTo ∣ CreatedBy ∣ . . . ∣ ResolvedBy ∣ ClosedBy) Person
(Here “. . .” represents additional edge labels.)
Figure 3: Screenshot of Hoozizat search results. Each column
shows a different type of result, from left to right: people, work
items, source code (partly hidden), and files (not shown). Small
photos next to each result show the associates for people or
owners for artifacts; hovering the mouse cursor over the photos
shows a tooltip with contact information for that person.
Associates. For people returned in the search results, we augment each person result with a list of associates, as this may help a
user such as Xin determine the team that person belongs to, or perhaps discover another person he might know personally. We define
associates to be people who work closely together, and from examining our Codebook repository and working with domain experts,
we derived a total of 13 regular expressions for discovering them.
Some relationships are simple, e.g., Alice and Bob are associates
if they modified the same artifact, or more precisely, if Alice committed a changeset that modified a piece of source code that was also modified by
another changeset committed by Bob:
∙ Person Commits Changeset Modifies FileRevision Modifies
SourceCode ModifiedBy FileRevision ModifiedBy ChangeSet
CommittedBy Person
Other relationships are more complex, e.g., Alice and Bob are associates if Alice created a work item that may be a duplicate of one
or more work items that mention source code that has been edited
by Bob:
∙ Person Created WorkItem (DuplicateOf WorkItem)∗ Mentions
SourceCodeIdentifier NamedBy SourceCode ModifiedBy
FileRevision ModifiedBy Changeset CommittedBy Person
Note that in the above examples, the path regular expressions are
almost literal translations of their descriptions. Based on our experiences and interviews with domain experts, we believe it should be
easy for a domain expert to describe such relationships.
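To make the notation concrete, the following Python sketch translates a path regular expression into an ordinary regular expression over space-terminated tokens and checks whether one concrete path conforms to it. It is a teaching aid under stated assumptions, not Codebook's compiler: it only validates a given sequence of node types and edge labels, it does not search a graph, and the function names are ours. Elided “. . .” alternatives are silently ignored by the tokenizer.

```python
import re

# Identifiers (node types and edge labels) or one of the operators ( ) | * ?
TOKEN = re.compile(r"[A-Za-z]+|[()|*?]")


def compile_path_expression(expression):
    """Translate a path regular expression into a Python regex."""
    # Normalize the typeset operators used in the paper to ASCII.
    expression = expression.replace("∣", "|").replace("∗", "*")
    parts = []
    for token in TOKEN.findall(expression):
        if token[0].isalpha():
            # Each identifier must match as a whole, space-terminated token.
            parts.append(f"(?:{token} )")
        else:
            parts.append(token)
    return re.compile("".join(parts))


def conforms(pattern, path):
    """Check whether a sequence of types and labels matches the pattern."""
    return pattern.fullmatch("".join(f"{step} " for step in path)) is not None


associates = compile_path_expression(
    "Person Created WorkItem (DuplicateOf WorkItem)∗ Mentions "
    "SourceCodeIdentifier NamedBy SourceCode ModifiedBy "
    "FileRevision ModifiedBy Changeset CommittedBy Person"
)

tail = ["Mentions", "SourceCodeIdentifier", "NamedBy", "SourceCode",
        "ModifiedBy", "FileRevision", "ModifiedBy", "Changeset",
        "CommittedBy", "Person"]
direct = ["Person", "Created", "WorkItem"] + tail
via_duplicates = (["Person", "Created", "WorkItem"]
                  + ["DuplicateOf", "WorkItem"] * 2 + tail)
broken = ["Person", "Created", "WorkItem"] + tail[1:]  # "Mentions" edge missing

results = [conforms(associates, p) for p in (direct, via_duplicates, broken)]
```

The loop `(DuplicateOf WorkItem)∗` accepts any number of duplicate hops, which is why both the direct path and the path through two duplicates conform.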
4.1.2
Presenting Search Results with People
We present the search results in a web-based interface shown in
Figure 3. The first column shows people results, while the remaining columns show other artifact types: work items, source code
(partly hidden), and files (not shown).
Next to each search result, we show a small list of photos corresponding to the associates for people results, and owners for other
artifacts. Hovering the mouse cursor over a photo brings up a
tooltip with contact information for that person including action
items to send an email or an IM to that person. This allows the user
to quickly identify the people most closely related to the search results and establish communication with that person. Additionally,
hovering over a photo also highlights all occurrences of that person
in the search result, allowing the user to discover other people or
artifacts to which that person may be related.
4.1.3
Evaluation
To evaluate the correctness of the Hoozizat application, we interviewed five engineers whose data is contained within our prototype
Codebook repository. We continued our interview with these five,
and an additional nine, engineers to evaluate the utility and design
of the user interface.
During each interview, we explained the Codebook project and
demonstrated the functionality of the Hoozizat interface using a
variety of searches that, as we had learned during testing, returned
varied results. We asked each of the five stakeholders
to type in searches for their name, some function names from their
code, and some keywords from their features and bugs. Participants
each did 5-6 searches, and in each case, pointed out to us that the
people in the result list were their colleagues in development or
program management (none of their searches resulted in any testers
being returned). One said “All of these people work near me in
the same hallway.” They indicated that all of the code they saw
was part of their project and was the code they themselves wrote.
Likewise, for the work items (features and bugs), they told us that
they were indeed assigned to work on those features, or were in fact
the owners of the work items. This shows that Codebook was able
to return results that were recognizably correct to people whose
data was in the system.
While the interviewees said that no results were missing from
Codebook’s result set, four out of the five interviewees indicated
that several people and work items returned from their Codebook
searches should not have been in the list. These false positives
showed up due to inaccurate textual allusions between English language words and people’s email addresses (e.g., William Jones’
email address is “will,” and he showed up in a lot of searches).
We have used the incidence of these false positives to improve our
search ranking function and penalize links from common English
words to people or source code identifiers.
We learned more about the team’s development practices from
talking to the stakeholders. One program manager explained to us
how we could tell from who was assigned to a feature where it was
in the process of being implemented. A work item of type feature
with a program manager (PM) was likely just an idea that got cut.
A feature with a PM and a developer meant that it was being implemented. A feature with a PM, developer and a tester meant that
the feature was complete, and in testing and stabilization.
Hoozizat’s interface was well received. All 14 engineers found
the ability to search across multiple repositories and artifact types
to be very useful. Six asked why these particular results were chosen to be returned. They wanted to see not just that two items were
related to one another, but the path that connected them. Everyone
liked that Hoozizat shows the associates and owners for each result,
noting that it resembles social networking applications like Facebook which are popular today. One developer said, “You feel like
you’re alone coding in your office, but now with Codebook when
something happens to the code or bugs you’ll feel less bored.” A
manager commented on cutting out the middleman in his quest for
answers, “the more you can help me get my job done without talking to [too many] people, the faster I can go.” Nine out of 14 of the
engineers expressed a remarkable amount of devotion to Hoozizat
at the end of their interviews. One engineer said to us, “I don’t
know how to live without this.”
Each of the engineers had different opinions about how to rank
the search results from each category, though generally people from
the same team explained similar ranking beliefs. This reinforces
our belief that tools for software engineers must be customizable by
domain experts and end-users if they are to be successfully adopted.
4.2
Deep Intellisense
Deep Intellisense [16] is a Visual Studio add-in to aid code investigation, which was ranked fourth on our survey of information
needs. When the user clicks on any source code symbol in the editor, it displays a reverse chronologically sorted list of events showing everything that has happened to that source code symbol in the
development history, including code changes, work items and messages from discussion forums that refer to it. These are discovered
by the following paths, all starting from SourceCode nodes:
∙ SourceCode ModifiedBy FileRevision ModifiedBy Changeset
∙ SourceCode MentionedBy SourceCodeIdentifier MentionedBy
Changeset
∙ SourceCode MentionedBy SourceCodeIdentifier MentionedBy
WorkItem
∙ SourceCode MentionedBy SourceCodeIdentifier MentionedBy
DiscussionForumPost
Deep Intellisense also displays the people associated with each
artifact, including their role and contact information. Additional
information about this scenario can be found in our MSR paper [16].
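The event list can be pictured as the union of the artifacts found by each path, ordered newest first. The sketch below is an assumption-laden illustration: the dates and the forum post are invented, and removing duplicates when the same artifact is found by two paths is our assumption rather than documented behavior.

```python
from datetime import date


def history(events_by_path):
    """Merge (date, artifact) events from several paths, newest first."""
    merged = {}
    for events in events_by_path.values():
        for when, artifact in events:
            # Assumption: an artifact found via two paths is listed once.
            merged.setdefault(artifact, when)
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results for one source code symbol.
events_by_path = {
    "ModifiedBy Changeset": [(date(2009, 3, 2), "Changeset #45")],
    "MentionedBy Changeset": [(date(2009, 3, 2), "Changeset #45")],
    "MentionedBy WorkItem": [(date(2009, 2, 10), "Bug #673")],
    "MentionedBy DiscussionForumPost": [(date(2009, 4, 1), "Forum post")],
}
timeline = [artifact for artifact, _ in history(events_by_path)]
```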
Like the process we undertook with Hoozizat, Deep Intellisense
was designed with participation from five developers and testers at
Microsoft, who were interviewed to understand their work practices and information needs around code investigation, and to give
us feedback on mockups of our user interface.
Deep Intellisense was prototyped on three large projects (CKS,
Rawr, and AjaxControlToolkit) from Microsoft’s open-source repository, CodePlex.com, and demoed to software developers at Microsoft’s Professional Developer Conference in September 2008.
Feedback was universally positive, with most participants eager to
see the feature deployed on their own company’s software projects.
5.
OTHER APPLICATIONS
In addition to our current applications, Codebook can be used
to build many other applications. In this section, we describe two
other Codebook applications, “Who is using our code?” and “Anxious for Awareness,” and the path regular expressions required to
implement them.
5.1
Who is Using Our Code?
In a company that produces both applications and frameworks, a
framework team may not be aware of every other individual or team
who is using their framework. This makes it difficult to notify dependents when breaking changes must be made (information needs
#6 and #10 on our inter-team coordination survey). Codebook can
be used to mitigate this issue by discovering everyone who may
be affected by breaking changes, e.g., by discovering when a person (in the team) edited some source code which is called by code
edited by another person (outside the team):
∙ Person Committed ChangeSet Modifies FileRevision
Modifies SourceCode CalledBy SourceCode NamedBy
SourceCode ModifiedBy FileRevision ModifiedBy
ChangeSet CommittedBy Person
Once these paths are computed, Codebook can easily filter the
results of this regular expression so that the person at the beginning of the path is inside the team and the person at the
end of the path is outside that team. The user interface
would also provide action items, such as a link for contacting the
owners of all calling methods in order to inform them of breaking
changes.
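The filtering step is simple once the endpoint pairs are available. In this sketch the pairs, the team roster, and the function name are hypothetical; Codebook itself returns tuples of node URIs.

```python
def cross_team_dependents(person_pairs, team):
    """Keep pairs whose first person is on the team and whose last is not."""
    return sorted(
        {(insider, outsider)
         for insider, outsider in person_pairs
         if insider in team and outsider not in team}
    )


# Hypothetical endpoints of computed paths: (framework editor, caller's editor).
person_pairs = [
    ("Alice", "Bob"),   # both on the framework team
    ("Alice", "Xin"),   # Xin's code calls code that Alice edited
    ("Xin", "Alice"),   # path starts outside the team
    ("Bob", "Xin"),
    ("Alice", "Xin"),   # duplicate discovered via another method
]
framework_team = {"Alice", "Bob"}
to_notify = cross_team_dependents(person_pairs, framework_team)
```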
5.2
Anxious for Awareness
When teams collaborate as part of a large project, a member of
one team will often assign a work item to a member of another
team. Tracking the status of work items assigned across teams is
frustrating because the teams’ independent work is not transparent
to each other (information need #5 from our survey). The work
item can be delayed due to poor communication, differing priorities, or forgotten altogether because no one advocates for it [7].
Codebook can help increase transparency between teams by discovering people who have referred to the work item from another
work item, people who have worked on code mentioned by related
work items, or source code changed by related work items:
∙ WorkItem ( (Mentions ∣ . . . ∣ DuplicateOf) WorkItem)∗
(AssignedTo ∣ CreatedBy ∣ . . . ∣ ResolvedBy ∣ ClosedBy) Person
∙ WorkItem ( (Mentions ∣ . . . ∣ DuplicateOf) WorkItem)∗
Mentions (SourceCode ModifiedBy FileRevision ModifiedBy)?
Changeset CommittedBy Person
∙ WorkItem ( (Mentions ∣ . . . ∣ DuplicateOf) WorkItem)∗
Mentions Changeset Modifies FileRevision
Modifies SourceCode
Once the work item has been assigned, one could follow a newsfeed of the assignee’s activities and watch his progress on the work
item. Browsing the assignee’s team’s newsfeed could provide context about the team’s changing deadlines and priorities.
6.
RELATED WORK
We first describe our own related work which motivated and led
to the Codebook framework. Other related work falls into the fields of the
Semantic Web and software engineering.
Codebook is not the first work in the software engineering field
to mine software repositories and to query graphs. However, Codebook is the first work to combine multiple repositories within one
graph to support multiple applications, while still allowing powerful analyses (multi-hop relationships), customizability, and extensibility. Other related work has satisfied some, but not all of these
criteria.
6.1
Own work
The Bridge was our team’s first prototype of a graph consisting of
people, code, bugs and emails derived by crawling software development repositories. The Bridge exposed its data via a strongly
typed API which made access to values in the graph straightforward, but incremental changes to the schema difficult. Codebook
has inherited the Bridge’s ability to scale to large repositories (millions of nodes and edges), but has been modified to enable easier
access to the underlying data and greater customizability. End-user
application methods can now be implemented directly via regular
expression paths.
Deep Intellisense, described in Section 4.2, was originally built
on the Bridge. During this effort, we learned that applications
must be customized to the distinct needs of each development role
(developer, tester, manager). In addition, scoping information to
source code symbols rather than files was an important way to
match the developer’s tasks. Both insights have been taken into
account when designing Codebook. Our experience from building
Deep Intellisense was one source of inspiration to use regular language reachability as Codebook’s core analysis, to make it easier
to customize and build a wider variety of applications.
6.2
Semantic Web
The Semantic Web uses RDF triples [24] to describe the semantics
of documents, people, or any type of object accessible by a URI.
RDF triples are clauses of the form <Subject,Verb,Object> which
form a graph of nodes (subject and object) and edges (verbs). A
SQL-like query language called SPARQL [26] is used to look up
nodes and edges in the RDF graph. Several extensions to SPARQL,
such as PSPARQL [1] and SPARQLeR [22], have been proposed
and developed to support resolving paths through an RDF graph.
Codebook takes a similar approach towards the use of regular expressions (REs) to define paths through its graph. CSPARQL [1],
an extension of PSPARQL, adds constraints to the REs, giving the
ability to specify a type for the node and a constraint on its value.
Codebook does not yet support user-customizable constraints.
Kiefer, Bernstein and Tappolet use RDF and an extension to
SPARQL to discover patterns of similarity in software repositories [20]. This approach is similar in concept to our own, but does
not take advantage of paths through graphs to discover relationships. With their use of an in-memory data structure, the current
implementation of their approach has limited scalability, and their
performance is, in their own words, “not satisfactory.”
Hyland-Wood, Carrington and Kaplan [18] propose to use RDF
and SPARQL as a mechanism to discover single-hop relationships
in a graph derived from software maintenance information. They
implemented a proof of concept for an example consisting of only
two object-oriented classes. In contrast, Codebook scales to very
large software projects and exploits multi-hop relationships.
6.3
Software Engineering
In their Hipikat tool [11], Cubranic and Murphy used a graph of
change tasks, file versions, people, messages, and documents to
recommend related software artifacts by following a single relationship in the graph. Alex Tarvo used a similar graph of bug reports and file versions in his BCT tool [29] to predict software regression. The Fran tool by Saul, Filkov, Devanbu, and Bird walks
call graphs to find related functions, but only two steps [28]. In
contrast to all these works, Codebook supports multi-hop queries
(i.e., more than two steps) and more than one application through its
use of regular language reachability. In the Codebook framework,
applications can also be easily customized to the specific needs of
development teams.
Grok and other Prolog-like languages [17] support pattern discovery using relational algebras defined over graphs of tuples. The
graphs, which derive solely from source code analysis, have been
mined for design patterns, architectural and protocol compliance,
and change impact. While Codebook’s REs have less power than
relational algebras, we have found them adequate to describe all
of our applications’ information needs. In addition, REs are less
complex, ensuring that pattern creation and comprehension are as
accessible as possible to our end-users.
DebugAdvisor [5] is a tool that supports debugging activities
by allowing free-text queries over an index of structured software
repository data. First a “fat query” is analyzed and turned into a
hierarchical tagged structure of bags of words. Each bag is used
to query an associated repository, returning a graph of likely nodes
in the index. The links in the graph are then analyzed, resulting
in a single ranked list of result nodes. In short, DebugAdvisor
uses a graph to combine and rank results from several different
types of data. In contrast, Codebook traverses a graph to find exact
matches to queries written as path regular expressions. The main
focus of DebugAdvisor is debugging, while Codebook’s focus is on
improving inter-team coordination by supporting multiple applications with a single framework.
Many applications can be built on top of our Codebook framework. For example, the Augur tool by de Souza et al. [12] provides
visualizations of developer activities during the software development lifecycle. Among other visualizations, de Souza et al. show
in a “social” call graph how developers are related to one another
through the code that they wrote and call. The Ariadne tool by
Trainer et al. [30] displays a similar social call graph. In Codebook, a possible regular expression to compute a social call graph
is Person Modifies Code Calls Code ModifiedBy Person.
The Expertise Browser by Mockus and Herbsleb [25] addresses
information need #2 from our survey. It can also be built on top
of Codebook (Code ModifiedBy Person). The analysis of sociotechnical congruence [10, 9, 8] is supported by Codebook as well:
for the technical dimension, recorded dependencies help to identify coordination needs (e.g., Code Calls Code); for the social dimension, coordination activities are recorded directly (e.g., Person
Modifies WorkItem ModifiedBy Person).
7.
CONCLUSION
In this paper, we address the problem of inter-team coordination with Codebook, a framework for connecting engineers and
their work artifacts together. We motivated our work with a survey of software engineers at Microsoft who helped us prioritize the
most important information needs around coordination that should
be addressed by new tools. We designed our framework around a
single data structure (a directed graph) which captures the relationships between people, code, bugs, specifications, and other work
artifacts that are mined from any number of software repositories.
We discover transitive connections using a single algorithm (regular language reachability), which, with our optimizations, scales
to large graphs, and enables the customization of Codebook by domain experts to fulfill the information needs of their teams, on the
data that their teams have recorded in the repositories. In the future, we plan to augment our regular language reachability algorithm with the ability to compute a total weight per path, and to
combine these weights to make stronger inferences about the veracity of connections between people and artifacts in the graph. We
built two front-end applications, Hoozizat and Deep Intellisense, to
demonstrate the effectiveness of our framework, and have plans to
build several more.
Using our Codebook framework, software engineers no longer
have to dig through repositories or pester their colleagues to discover, track and maintain connections to other people and their associated work artifacts. It is an important step toward addressing
the challenges of inter-team coordination.
8.
REFERENCES
[1] F. Alkhateeb. Querying RDF(S) with Regular Expressions. PhD
thesis, Joseph Fourier University of Grenoble, June 2008.
[2] M. C. Andrew Cencini. SQL Server 2005 full-text search: Internals
and enhancements.
http://msdn.microsoft.com/en-us/library/ms345119(SQL.90).aspx.
[3] J. Anvik, L. Hiew, and G. C. Murphy. Who should fix this bug? In
Proceedings of ICSE, pages 361–370, 2006.
[4] J. Aranda and G. Venolia. The secret life of bugs: Going past the
errors and omissions in software repositories. In Proceedings of
ICSE, pages 298–308, 2009.
[5] B. Ashok, J. Joy, H. Liang, S. Rajamani, G. Srinivasa, and
V. Vangala. DebugAdvisor: A recommender system for debugging. In
Proceedings of ESEC/FSE ’09, August 2009.
[6] A. Begel and R. DeLine. Codebook: Social networking over code. In
Proceedings of ICSE, NIER Track, 2009.
[7] A. Begel, N. Nagappan, C. Poile, and L. Layman. Coordination in
large-scale software teams. In Proceedings of CHASE, pages 1–7,
2009.
[8] M. Cataldo, D. Damian, P. Devanbu, S. Easterbrook, J. Herbsleb, and
A. Mockus. 2nd international workshop on socio-technical
congruence, May 2009.
[9] M. Cataldo, J. D. Herbsleb, and K. M. Carley. Socio-technical
congruence: a framework for assessing the impact of technical and
work dependencies on software development productivity. In
Proceedings of ESEM, pages 2–11, 2008.
[10] M. Cataldo, P. A. Wagstrom, J. D. Herbsleb, and K. M. Carley.
Identification of coordination requirements: implications for the
design of collaboration and awareness tools. In Proceedings of
CSCW, pages 353–362, 2006.
[11] D. Cubranic, G. C. Murphy, J. Singer, and K. S. Booth. Hipikat: A
project memory for software development. IEEE TSE, 31(6):446–465, 2005.
[12] C. de Souza, J. Froehlich, and P. Dourish. Seeking the source:
software source code as a social and technical artifact. In
Proceedings of GROUP, pages 197–206, 2005.
[13] C. R. B. de Souza and D. F. Redmiles. An empirical study of
software developers’ management of dependencies and changes. In
Proceedings of ICSE, pages 241–250, 2008.
[14] A. E. Hassan. The road ahead for mining software repositories. In
Proceedings of ICSM, FoSM track, pages 48–57, 2008.
[15] P. Hinds and C. McGrath. Structures that work: social structure,
work structure and coordination ease in geographically distributed
teams. In Proceedings of CSCW, pages 343–352, 2006.
[16] R. Holmes and A. Begel. Deep Intellisense: a tool for rehydrating
evaporated information. In Proceedings of MSR, pages 23–26, 2008.
[17] R. C. Holt. Grokking software architecture. In Proceedings of
WCRE, pages 5–14, 2008.
[18] D. Hyland-Wood, D. Carrington, and S. Kaplan. Toward a software
maintenance methodology using semantic web techniques. In
Proceedings of SOFTWARE-EVOLVABILITY, pages 23–30, 2006.
[19] H. H. Kagdi, M. L. Collard, and J. I. Maletic. A survey and
taxonomy of approaches for mining software repositories in the
context of software evolution. Journal of Software Maintenance,
19(2):77–131, 2007.
[20] C. Kiefer, A. Bernstein, and J. Tappolet. Mining software repositories
with iSPARQL and a software evolution ontology. In Proceedings of
MSR, page 10, 2007.
[21] A. J. Ko, R. DeLine, and G. Venolia. Information needs in collocated
software development teams. In Proceedings of ICSE, pages
344–353, 2007.
[22] K. Kochut and M. Janik. SPARQLeR: Extended SPARQL for semantic
association discovery. In Proceedings of ESWC, pages 145–159,
2007.
[23] T. D. LaToza, G. Venolia, and R. DeLine. Maintaining mental
models: a study of developer work habits. In Proceedings of ICSE,
pages 492–501, 2006.
[24] F. Manola and E. Miller. RDF primer.
http://www.w3.org/TR/REC-rdf-syntax/, February 2004.
[25] A. Mockus and J. D. Herbsleb. Expertise browser: a quantitative
approach to identifying expertise. In Proceedings of ICSE, pages
503–512, 2002.
[26] E. Prud’hommeaux and A. Seaborne. SPARQL query language for
RDF. http://www.w3.org/TR/rdf-sparql-query/, January 2008.
[27] P. Runeson, M. Alexandersson, and O. Nyholm. Detection of
duplicate defect reports using natural language processing. In
Proceedings of ICSE, pages 499–510, 2007.
[28] Z. M. Saul, V. Filkov, P. Devanbu, and C. Bird. Recommending
random walks. In Proceedings of ESEC-FSE, pages 15–24, 2007.
[29] A. Tarvo. Mining software history to improve software maintenance
quality: A case study. IEEE Software, 26(1):34–40, 2009.
[30] E. Trainer, S. Quirk, C. de Souza, and D. Redmiles. Bridging the gap
between technical and social dependencies with Ariadne. In
Proceedings of eTX at OOPSLA, pages 26–30, 2005.
[31] G. Venolia. Textual allusions to artifacts in software-related
repositories. In Proceedings of MSR, pages 151–154, 2006.
[32] T. Zimmermann, P. Weißgerber, S. Diehl, and A. Zeller. Mining
version histories to guide software changes. IEEE TSE,
31(6):429–445, 2005.