It’s been about 90 days since we added a GitHub CoPilot license for every member of my software development team. When CoPilot was first introduced, I wanted to assess its impact on the team. Would it yield more or better code? Would the engineering team be happier or more productive with this novel tool?
During the past 90 days I collected a few metrics, unbeknownst to the team, to reduce any bias. The data, which is anonymized here, was my attempt to measure the impact of CoPilot across three categories: productivity, quality, and adoption.
Productivity
To assess the productivity impact of CoPilot, I looked at both commit activity and overall changes to the code base, measured in net lines modified. I collected the same data for the same 90-day periods: March through May 2023 and March through May 2024.
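As a rough illustration of how such LOC data can be gathered (this is an assumed sketch, not the actual tooling I used; the `loc_changes` helper name is mine), adds and deletes for a window can be parsed from `git log --numstat` output:

```python
def loc_changes(numstat: str) -> dict:
    """Aggregate adds/deletes/net from `git log --numstat` output.

    Produce the input with e.g.:
        git log --since=2023-03-01 --until=2023-05-31 --numstat
    Each numstat line is "<added> TAB <deleted> TAB <path>"; binary
    files report "-" for both counts and are skipped here.
    """
    adds = deletes = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # commit headers, messages, blank lines
        a, d, _path = parts
        if a == "-" or d == "-":
            continue  # binary file
        adds += int(a)
        deletes += int(d)
    return {"adds": adds, "deletes": deletes, "net": adds - deletes}
```

Running this once per 90-day window yields the adds, deletes, and net figures compared below.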
The chart below shows adds, deletes, and net changes to the code base over that same 90-day window in 2023 and 2024. The 2024 period (red bars) shows higher activity than the same window in 2023 (blue bars). The 2024 period was more prolific across every measure of lines of code (LOC) changed: adds, deletes, and consequently net changes.
The LOC data above is in aggregate and doesn’t account for changes in the size of the engineering organization. Perhaps the increased activity in LOC is because the number of contributors to the code base increased between 2023 and 2024.
The next chart looks at LOC changes normalized by the number of contributors within the respective 90-day windows. Again, 2024 (red bars) is healthier than 2023: more code was modified per contributor during the same 90-day window in 2024 than in 2023.
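To sketch the per-contributor normalization (again an assumption about tooling, not the scripts I actually ran; the helper name is mine), author names can be interleaved with numstat data via `git log --format=%aN --numstat`:

```python
from collections import defaultdict


def net_loc_per_author(log: str) -> dict:
    """Net LOC per author, parsed from `git log --format=%aN --numstat`.

    Simplified parsing: any non-empty line without two tabs is treated
    as an author name; numstat lines are "<added> TAB <deleted> TAB <path>".
    """
    totals: dict = defaultdict(int)
    author = None
    for line in log.splitlines():
        parts = line.split("\t")
        if len(parts) == 3 and author is not None:
            a, d, _path = parts
            if a != "-" and d != "-":  # skip binary files
                totals[author] += int(a) - int(d)
        elif line.strip():
            author = line.strip()
            totals.setdefault(author, 0)
    return dict(totals)
```

Dividing the aggregate net LOC by the number of keys in the result gives the normalized per-contributor figure charted above.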
Next, I looked at commit activity during that same period across both years. This time, I limited the data to a subset of the organization: engineers who were active in both 2023 and 2024. This eliminated recent hires, who might still be ramping up, and so on.
The chart below shows aggregate commit activity for that cohort across both years. The 2024 data looks better: the total volume of commits is higher and the variance (standard deviation) is lower, even though the average commit volume per engineer in 2024 is lower than in 2023.
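The cohort comparison boils down to three numbers per year; a minimal sketch of that computation (the helper name and input shape are mine):

```python
from statistics import mean, pstdev


def cohort_stats(commits_per_engineer: dict) -> dict:
    """Total, average, and spread of commit counts for a fixed cohort."""
    counts = list(commits_per_engineer.values())
    return {
        "total": sum(counts),
        "avg": mean(counts),
        "stdev": pstdev(counts),  # population std dev across the cohort
    }
```

Computing this for the 2023 and 2024 windows over the same cohort gives the totals, averages, and standard deviations compared above.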
If we zoom in to the individual level, we see a different story. First, most developers had more commits in 2024 than in 2023, although there are some exceptions. Second, several team members became significantly more productive in 2024 than in 2023.
Putting all of our productivity data together, one can make the following observation: the same 90-day window in 2024 was more productive than its 2023 equivalent, both in aggregate and within the individual cohort.
Maybe this is the CoPilot impact?
Quality
I’ll move quickly through this section and won’t be able to share much data, but suffice it to say that quality, across several metrics (code coverage, regressions, outages), was better in 2024 than in 2023. Unlike productivity data, quality cannot be assessed in a discrete manner: investments made in testing or otherwise many months ago accrue to today’s code base. This means that a well-tested and healthy code base is (usually) a function of aggregate, historical investments rather than a point-in-time, discrete event.
So now we know the code base is healthier in 2024 than 2023 and that developers are more productive in that same time period. CoPilot must be magical. Or is it?
CoPilot Adoption
There could be numerous reasons for the observed increase in productivity and quality between 2023 and 2024. Perhaps the engineering team is getting better at its craft. Perhaps they are gelling better as a team. Perhaps the work they tackled in 2023 was much harder than in 2024. There could be many more reasons.
Remember that I wasn’t trying to find the variables that resulted in this increase in productivity or quality; instead, I was trying to find out whether CoPilot had an impact. What better way to answer this question than to collect CoPilot usage data from the team? If the tool wasn’t widely adopted, then surely the increase is due to some other variables. If, on the other hand, CoPilot was widely adopted, then one might argue that it had an impact, although it would still be hard to quantify.
It turns out that the majority of the team (75%) does not use CoPilot, so the increase in both productivity and quality must be due to other factors. Below is some commentary from members of the team on their CoPilot experience:
Co-pilot is marginally useful … when writing tests for very “typical” patterns, it saves me an hour once in a while … so I have a key-bind to toggle it on and off, because it’s just obnoxiously useless when it is out of its depth
Good for parsing, but will probably stop using it once I’m working on something less repetitive
Super repetitive stuff it tends to get right which is nice
Often the beginning can be a good guess but the rest is garbage
In short, our experience shows that GitHub CoPilot is marginally useful at best, and even then it appears to be good only at very simple software development tasks. Is it worth the incremental price of $10-$39 per developer per month? In my opinion, it is not.
This isn’t just a function of its limited utility, but also a reflection of its cost relative to other dev tools. Consider that Jira’s upper-end pricing is ~$13/user/month and Snyk’s is ~$25/user/month, and you can see that GitHub CoPilot is up there in terms of pricing with little to show in terms of value.
Excellent write-up, Karim! Much needed analysis, imho.
Wanted to ask a couple of follow-up questions:
1. If GHC was priced lower than Jira would you consider keeping it? I hear you that calculating hard (and even soft) ROI is challenging
2. What's your impression of the GHC product development velocity, is it getting better fast vis a vis product marketing?
3. To whom would you recommend it?