Last week Microsoft reported its FY24 Q2 earnings, and as is customary the company followed that with a live earnings call. It was during that earnings call that Microsoft disclosed the following.
GitHub revenue accelerated to over 40% year over year, driven by all our platform growth and adoption of GitHub Copilot, the world's most widely deployed AI developer tool. We now have over 1.3 million paid GitHub Copilot subscribers, up 30% quarter over quarter, and more than 50,000 organizations use GitHub Copilot business to supercharge the productivity of their developers from digital natives like Etsy and HelloFresh to leading enterprises like Autodesk, Dell Technologies, and Goldman Sachs - Source: MSFT Q2 2024 Earnings Call Transcript
I had two reactions to the above.
First, this shows the tremendous advantage that large incumbents have over new and smaller competitors, namely a very large distribution channel. Microsoft is able to take a product like GitHub Copilot, drop it into its existing customer ecosystem, and very quickly create a roughly $156M business (1.3 million subscribers at $10 per developer per month works out to about $156M a year).
My second reaction was to question the efficacy of the investment all these 50K organizations made. Copilot, as Microsoft describes it, is meant to “supercharge the productivity of developers”. But is it actually doing that? And more importantly, how can you show that it is?
Lessons from AI in healthcare
To try and answer this question, I reflected on my previous employer: Kheiron Medical Technologies. Kheiron develops AI products that help detect cancers. Their flagship product, Mia, can be used to read mammograms much like a human radiologist does.
Mia supports radiologists in making the most critical breast-screening decision – to recall or not to recall. You can use our solution as an independent reader as well as a concurrent reader. Watch the video to see how Mia helps radiologists more accurately and more rapidly reach a diagnosis. Source: Kheiron
In most countries (not the US) breast-screening readings are conducted by two (sometimes more) independent readers. If the readers disagree on their decision, a third reader becomes the arbitrator. Mia is clinically approved in some countries to be one of the two independent readers.
Before healthcare organizations adopt a product like Mia they need to see evidence of clinical efficacy and economic value. Said otherwise, these organizations want to see data showing that Mia can increase their productivity without reducing the level of care they provide.
The economic value is easy to assess. Mia can replace one of the human radiologists in a double-reader regime, so the productivity gains are anywhere from 33% to 50%. Any hospital CFO will love these figures.
However, economic value alone is insufficient. Mia needs to prove that it is actually good at its job. Hence products like Mia go through rigorous clinical trials.
The prospective evidence, published under the title "Prospective implementation of AI-assisted screen reading to improve early detection of breast cancer" in Nature Medicine, found that the tool, called Mia, could significantly increase the early detection of breast cancers in a European health care setting by up to 13%. Source: MedicalXpress.
And now, your Chief Medical Officer will love these figures too.
Back to software
“But that’s healthcare, not software” I hear you say. True, but the software we develop using tools like Copilot might end up being used in a pacemaker, or in the copilot of a Boeing jet. Then there’s also the sheer curiosity of trying to answer the question of how effective tools like Copilot are at increasing developer productivity.
GitHub and Microsoft published some data on this topic, including this paper: The Impact of AI on Developer Productivity: Evidence from GitHub Copilot.
The authors conduct an experiment in which they ask a group of developers to write an HTTP server in JavaScript. The treatment group had access to Copilot to help them with the exercise. Unsurprisingly, the treatment group completed the exercise 55.8% faster.
Is that, though, a task representative of what a software engineer spends time on? How often does your average software developer create an HTTP server in JavaScript? Said otherwise, is Copilot’s ability to increase developer productivity on this task generalizable to programming at large? I’m not so sure.
I then started to contemplate how I would answer these questions within my engineering organization. If I could, I would run the following experiment.
The baseline
First, I would select a cohort of engineers who have been at the company for at least one year. This reduces any bias from looking at data from relatively new engineers who might still be ramping up. For this cohort I would collect historical data from the previous 12 months, measuring both productivity and quality. The metrics might include the following:
Total commits
Lines of code changed
Change failure rate
Deployment frequency
The first two data points (as horrible as they are) give me some semblance of productivity, or the volume of code these developers generated over that 12-month period. The latter two are quality indicators. Recall, I want to see evidence of both more code and higher (or equal) quality.
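To make this concrete, here is a minimal sketch of how the first two metrics could be pulled from a git repository. This is not any official tooling, just an illustration: the repository path and time window are assumptions, and change failure rate and deployment frequency live in your CI/CD or incident-tracking system rather than in git.

```python
# Hypothetical sketch: per-engineer commit count and lines changed over a
# 12-month window, derived from `git log --numstat`. Change failure rate and
# deployment frequency are not in git and would come from the deployment
# pipeline or incident tracker instead.
import subprocess
from collections import defaultdict

def baseline_volume(repo_path: str, since: str = "12 months ago"):
    """Return {author: commit_count} and {author: lines_changed}."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--numstat", "--format=@%ae"],   # prefix each commit with '@<author email>'
        capture_output=True, text=True, check=True,
    ).stdout

    commits = defaultdict(int)
    lines_changed = defaultdict(int)
    author = None
    for line in log.splitlines():
        if line.startswith("@"):          # start of a new commit
            author = line[1:]
            commits[author] += 1
        elif line.strip() and author:     # numstat line: "added<TAB>deleted<TAB>path"
            added, deleted = line.split("\t")[:2]
            if added != "-":              # '-' marks binary files
                lines_changed[author] += int(added) + int(deleted)
    return commits, lines_changed
```

Run against the repositories the cohort works in, this gives the two volume numbers for the baseline window; running it again a year later gives the treatment numbers.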
The treatment
I would then give that same cohort access to Copilot and have them use it for 12 months, at the end of which I would collect the same metrics as above and compare them to the baseline figures.
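Here is a hedged sketch of the comparison itself, using made-up engineers and numbers: treat each engineer as their own control and look at the per-engineer percent change for each metric, rather than a single aggregate, so one prolific outlier does not drown out everyone else.

```python
# Hypothetical before/after comparison for one metric (here: total commits).
# The engineer names and figures are invented purely for illustration.
from statistics import mean

def percent_change(baseline: dict, treatment: dict):
    """Per-engineer percent change plus the cohort average."""
    changes = {
        eng: (treatment[eng] - baseline[eng]) / baseline[eng] * 100
        for eng in baseline
        if eng in treatment and baseline[eng] > 0
    }
    return changes, mean(changes.values())

before = {"alice": 240, "bob": 310, "carol": 180}   # baseline year
after = {"alice": 290, "bob": 305, "carol": 220}    # Copilot year
per_engineer, cohort_avg = percent_change(before, after)
print(per_engineer)
print(f"average change: {cohort_avg:.1f}%")
```

The same comparison would be run for the quality metrics; the hope is that the volume metrics move up while change failure rate stays flat or improves.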
There are many flaws with this simple experiment. The first is comparing apples to oranges. The code the cohort worked on last year might be very different from the code they will work on next year. Maybe last year’s work was highly complex and next year’s will not be as challenging.
Then there is bias. If the cohort is aware that they are being measured, they might game the system. For example, instead of pushing N lines of code in 1 commit they can do it in N commits, which inflates the commit volume. Similarly, the LOC count can be inflated by adding more comments to the codebase. These (vanity) metrics can be easily gamed :)
I’ve written before about my firm belief in AI and in centaur-like working relationships between human and machine.
I still hold that belief, but part of me wants to see evidence of the efficacy of these centaur-like interactions within the software engineering domain. It just sounds like a grand experiment to run, especially in a domain - software - that has had a significant impact on humanity. Alternatively, we can just look at Reddit for guidance on that question.
Or maybe I am overthinking this and should just pay $10/developer/month and call it good.
Karim. Long-time follower, first-time writer.
I can say from personal experience that Copilot increases productivity by a significant amount. I use it for my Ruby and Golang programming every day.
While how much Copilot can help me with complex code logic remains to be seen, there is a lot of grunt, repetitive work that a programmer does in their programming day. Copilot is pretty good at picking up patterns from the surrounding code and running with them. It saves me a ton of time. It lets me program at the speed of thought, which is hard to quantify but is a tremendous boost when you are focusing on complex logic and letting Copilot write the grunt code.
I’ll send some code samples later to make the point a bit clearer.
re: the experiment. Would you also want to look at the overall latency/throughput of the dev process?
I can see how developer build time goes down: write code and test it locally. After that, bottlenecks may occur at the code review step and post-merge (as committed code moves up to test/prod instances).