The exciting new world of AI-generated code (Copilot)

A few days ago, GitHub released a brand new tool called Copilot, and I am very excited about what it may lead to, which is why this post exists. I have not received an invite, so I have yet to try it out for myself; these are just my thoughts based on what I’ve seen.

For the uninitiated, Copilot is, in short, an AI-driven code auto-completer. GitHub has fed a large amount of open source code into an AI, which has produced an algorithm that predicts what the developer wants to achieve (note that this is an extremely simplified explanation). What we type is fed into this algorithm, and it spits out its best suggestion for what we’re trying to achieve. I can’t say how well it deals with complex situations, and how much it understands of the surrounding architecture is unknown. The examples I’ve seen have been very isolated and not domain-specific. That said, Copilot is impressive, and I’m excited about how it might change the way we work.
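As a rough picture of the interaction (a hypothetical illustration in Python, not actual Copilot output), the developer types a signature and perhaps a docstring, and the tool suggests the rest:

    # The developer types the signature and a docstring...
    def celsius_to_fahrenheit(celsius: float) -> float:
        """Convert a temperature from Celsius to Fahrenheit."""
        # ...and a Copilot-style tool suggests the body from that context:
        return celsius * 9 / 5 + 32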

Let’s be clear: I don’t think the current version will change our industry. It seems to deal with simple scenarios, and its consideration of the domain appears limited at best. I am excited about the potential of this technology, and I see some fascinating use-cases on the horizon.

AI-driven unit test generation for legacy applications

As Michael Feathers states in Working Effectively with Legacy Code: “Legacy code is code without tests”, which is a definition I agree with. Considering that Copilot can already generate tests, it would be exciting to see whether it would be possible to take a legacy application and generate a whole suite of tests for it. Such technology could make it possible to go from 0% test coverage to 100% within a few hours of computation!

There will be many downsides to this approach, and it is not one that companies should be aiming for. The tool will probably not refactor the source code in any reasonable way to make it more testable; instead, it will generate tests that are naive and difficult to read. The names of these tests will most likely be questionable at best. I am by no means saying these will be great tests, because they won’t be. Generating tests for a system that was not written to be testable will not turn out pretty, but at least the system will now be tested.
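To make that concrete, here is a sketch (in Python, with entirely hypothetical names) of what such naively generated characterisation tests might look like: a plain record of whatever the code happens to do today, correct or not, rather than a well-named specification of intent.

    # A hypothetical legacy function with no tests (a stand-in for real legacy code).
    def calculate_price(quantity, tier, rush_order):
        price = quantity * 40
        if tier == "GOLD":
            price = price * 0.85
        if rush_order:
            price = price + 25.5
        return price

    # Naively generated tests: each one just pins down whatever the function
    # returns today, with names that say nothing about intent.
    def test_calculate_price_case_1():
        assert calculate_price(3, "GOLD", True) == 127.5

    def test_calculate_price_case_2():
        assert calculate_price(0, "", False) == 0

    def test_calculate_price_case_3():
        # Even surprising behaviour (a negative price) gets locked in as "expected".
        assert calculate_price(-1, "GOLD", True) == -8.5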

The big question is this: are naive and clunky tests worse than no tests at all, especially when they get us to 100% coverage? Even if the tests are difficult to work with, they will let the developer know when some behaviour has changed, and they provide value in that sense. The developer will know when they have accidentally broken functionality that previously existed, which is one of the more challenging parts of working with legacy systems.

In many situations, I believe AI-generated tests might be less painful than no test coverage whatsoever.

AI-driven unit test generation for legacy teams

What is a “legacy team”? In my experience, it is a team that produces legacy code and simply does not evolve with the times. These teams exist, and they’re often not aware of it, or if they are, they make up excuses as to why they cannot change. They use excuses such as “not having enough time” to do essential quality assurance on the code they produce, or they claim they cannot push back on deadlines because management won’t allow it (tl;dr: I have opinions on this).

Unfortunately, I know for a fact that there are plenty of these teams out there. I’ve seen them myself, I’ve talked to friends about them, and I’ve read about them online. They can be found in most organizations, and they can reflect the developer culture in a company. However these teams came to be, I believe that Copilot-like technologies can help get them to a better place.

Most professional developers write tests as they write code. They might not follow TDD, but they value the existence of tests and are happy to write them. Legacy teams tend not to hold the same values. They might want to write tests, but they use excuses such as not having enough time, or they think tests don’t provide enough value. I beg to differ, but that is a discussion for another day. What AI-driven test generation can do is enable an entirely new workflow:

  1. Write implementation
  2. Test manually
  3. Generate tests
  4. Push the code

We may even be able to auto-generate the tests when merging to master.

We want unit tests to document behaviour, and part of that is that we want the tests to let us know when we have changed something. Many legacy teams don’t have any form of automated testing and rely solely on manual testing. This approach cannot remove the need for manual testing, as the generated tests would only document the behaviour that is currently present (be it correct or wrong). The AI has no clue whether the code makes sense and doesn’t understand how the changes will impact the users, but it can document what the code currently does by generating tests.

At this point, you might be asking whether this approach is worth it if we still have to test manually. Great question! I’m glad you asked. The reason is that the tests also document the behaviour of the code for future developers. In this workflow, the tests can be seen as a developer’s signature that solidifies that “this is how it is supposed to work”. The tests become a warning for developers to watch out for as they develop, and when they break a test, they must decide whether the change in behaviour is correct or not.

In this process, one wouldn’t bother with changing tests. One would delete failing tests, but only after understanding why they fail and agreeing that the changed behaviour is correct. After all, in this flow, we don’t care about maintaining tests. We only care about being notified that something has changed. When done, we would test manually and generate tests again.

I would never encourage this approach. It completely ruins the other benefits tests give us. Writing tests forces us to write testable code, and testable code is also easy to work with. The code tends to follow SOLID naturally, and it tends to be more readable. Hand-written tests will also reflect the domain better and emphasise what is essential, while generated tests will treat everything as equally important. Generated tests will be harder to read, and the original intent will be harder to grasp. While I might not consider this a good approach, it might be a much better approach for teams that wouldn’t write tests in the first place.

Using this approach to generate tests is, in my opinion, an anti-pattern and an abuse of the tool. In this case, it would cover up poor development practices and a lack of consideration when writing code. However, it might be less of an anti-pattern than not writing tests at all.

TDD on steroids

So far, I’ve talked about generating tests, but Copilot seems much more focused on generating code, which I also believe can be a great foundation for increasing the efficiency of TDD even further. This is the idea that got me really excited.

This is a common flow of TDD:

  1. Write a failing unit test.
  2. Write code that makes the test pass.
  3. Refactor.
  4. Repeat until feature is complete.

(Different people might have slightly different steps, but this is roughly what most variations boil down to.)

By following these steps, you tend to end up with very clean and well-tested code. It is the way I prefer to develop, and I personally find it very enjoyable.
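As a tiny, hypothetical illustration of one cycle in Python: the test is written first and fails, then just enough code is written to make it pass, and then we refactor before repeating.

    # Step 1: a failing test, written before the implementation exists.
    def test_slugify_replaces_spaces_with_dashes():
        assert slugify("Hello World") == "hello-world"

    # Step 2: the simplest implementation that makes the test pass.
    def slugify(text: str) -> str:
        return text.lower().replace(" ", "-")

    # Step 3: refactor if needed, then repeat with the next failing test.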

So how would Copilot be able to impact TDD? Let's consider this:

  1. Write a failing unit test.
  2. Generate code that makes the tests pass.
  3. Refactor if needed.
  4. Repeat until the feature is complete.

What if we could simply write tests, and then an AI would generate code that makes those tests pass? This goes in the opposite direction of the legacy-team approach, but it is one that I believe to be much more powerful.

AI that generates code based on tests will have much more information to work with. The AI can verify whether the generated code is correct or not, and therefore can iterate through solutions until it finds something that works. The AI has multiple data points (tests) where inputs and results are different. It has a clear goal to make those tests pass, and if it does, it has succeeded. The big challenge here would be to write tests that would meaningfully describe what the implementation needs to do.
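The “iterate through solutions until something works” idea can be sketched as a simple generate-and-check loop. This is a minimal sketch, not how Copilot works today; `propose` stands in for whatever hypothetical model produces candidate code, and the tests act as the acceptance criterion.

    import subprocess
    import tempfile
    from pathlib import Path
    from typing import Callable, Optional

    def generate_until_green(
        tests: str,
        propose: Callable[[str, int], str],  # hypothetical: returns candidate source code
        max_attempts: int = 20,
    ) -> Optional[str]:
        """Ask for candidate implementations until the test suite passes or we give up."""
        for attempt in range(max_attempts):
            candidate = propose(tests, attempt)
            with tempfile.TemporaryDirectory() as workdir:
                # The tests are assumed to import from a module named `implementation`.
                Path(workdir, "implementation.py").write_text(candidate)
                Path(workdir, "test_implementation.py").write_text(tests)
                result = subprocess.run(["pytest", workdir], capture_output=True)
            if result.returncode == 0:  # every test passed: "good enough" by this definition
                return candidate
        return None  # no candidate satisfied the tests within the attempt budget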

Another benefit of generating code based on tests would be that the AI won’t need domain knowledge to achieve good results, as that knowledge already exists in the tests. Granted, the AI would only do a good job if you have good tests. In this case, I’d remind people of the universal law that is “shit in, shit out”. AIs can do great things, but they can’t do magic.

The significant danger here is that we start ignoring the actual implementation in favour of the tests. In many scenarios, this might be appropriate, but the AI might decide to generate code that is incredibly inefficient or broken in some other way. Unless we have some protection against such things, we must still ensure that we read and understand the generated code (which we should do anyway in code reviews).

This approach could be taken even further towards BDD, where we write specifications with given inputs and expected outputs, and an AI generates whatever it takes to achieve that goal. I’m not sure I would jump on such a solution, but it is an intriguing thought.
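As a hypothetical illustration (in Python, with a made-up Checkout class that does not exist yet and is exactly what the AI would be asked to produce), such a behaviour-level specification might read as given/when/then:

    # A behaviour-level specification written as given/when/then.
    # `Checkout` is hypothetical; the scenario names the inputs and the
    # expected outcome, and the generator would fill in everything else.
    def test_bulk_discount_applied_for_three_or_more_items():
        # Given a checkout with three items at 10.00 each
        checkout = Checkout(items=[10.00, 10.00, 10.00])
        # When the total is calculated
        total = checkout.total()
        # Then a 10% bulk discount is applied
        assert total == 27.00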

I don’t expect this to be reliable anytime soon, but it might be the path that software development takes, and I would not be surprised if it turned out to be the future. I’m already generating classes, functions, etc., from my IDE when writing code. While these are elementary forms of code generation, I can see that these basic tools help a lot, and I’d argue that TDD would be difficult to justify without the power of modern IDEs. If we add even more powerful code generation, to the point where it can guarantee code that makes your tests pass, then TDD becomes even more competitive.

Final thoughts

I don’t see AI-generated code taking developers’ jobs anytime soon. The AI would have to understand the domain and the users, which requires a general-purpose AI. We have attempted to make general-purpose AIs since the 70s (if not earlier), and we’re still unsure what it would take, or even whether it is possible, to create one. Ever since those days, we’ve been 5-10 years away, and I reckon we’ll be 5-10 years away for quite some time. This should not take away from how impressive Copilot looks and how Copilot (or technologies like it) might change the way we craft applications. While they’re just tools, they are compelling tools that might change our profession entirely.

Copilot is only the first iteration of what might become an entire category of tools, and it might take multiple iterations before we get to a point where it changes the way we develop software. As it stands, Copilot doesn’t even guarantee that the code it generates is valid and compilable, so it clearly has a long way to go, but I’m very excited about what future iterations might bring.
