Accelerating software engineering with AI
In the last year, AI coding assistants have become increasingly prevalent in software development workflows, prompting many organizations to assess their utility. We've heard multiple CTOs ask, "How should you use AI coding assistants with your engineering team?" Unfortunately, there isn't a simple, one-size-fits-all answer. After attending the T3 Software Engineering Leadership Summit in NYC, where 50 engineering leaders from companies like The New York Times, Betterment, MoMA and Block gathered, it became clear that while the use of tools like GitHub Copilot, Cursor and Zed is pervasive, where and how they're applied varies significantly across engineering teams.
What the Research Says
Before diving into our takeaways from T3, it's essential to first review what academic research says about AI's impact on coding productivity. Earlier this year, we published a meta-analysis of the research, which included a study from GitHub citing a 55% productivity improvement. A more recent study by Princeton, MIT, UPenn, and Microsoft revealed a more modest 26% average increase in pull requests when coding assistants were used.
For junior developers, the productivity boost ranged from 27% to 39%, while senior developers saw a more modest increase of 8% to 13%. This data reinforces prior research that indicates AI assistants provide disproportionate benefits to less-experienced developers compared to senior engineers.
Insights from the T3 Summit
At T3, engineering leaders were unanimous in their usage of AI coding assistants. But what was striking was how drastically the extent and manner of their usage differed. Some teams chose to use AI sparingly, whereas other teams were writing the majority of their code with assistants. One team insisted their developers write their own code, but chose to use AI to write tests. Another team did the complete opposite, and focused their engineering time on writing test code. They then used those tests to validate the code that the AI had written.
Which of these approaches is right for your team will depend heavily on your unique context. We've distilled our thinking into a simple framework that can guide your decision-making.
Key Considerations for Engineering Leaders
Personal and Cultural Risk Tolerance
It quickly became apparent during the discussions that the personal risk tolerance of the engineering leader colored their entire team's approach to using AI. For example, engineering leaders with a "move fast and break things" attitude were far more willing to trust AI-generated code. Others who described themselves as generally skeptical of new technology approached AI in a similarly cautious manner, only using its suggestions in very specific scenarios. When they did use AI, they insisted on reviewing every line of code in depth. Mark Zuckerberg captured these contrasting approaches well in a recent interview about Meta's engineering culture: "There's a certain personality that goes with taking your stuff and putting it out there before it's fully polished. I'm not saying that our strategy or approach on this is the only one that works. I think in a lot of ways we're like the opposite of Apple. Clearly, their stuff has worked well too. They take this approach that's like, 'We're going to take a long time, we're going to polish it, and we're going to put it out.' And maybe for the stuff that they're doing that works, maybe that just fits with their culture."
Neither one is necessarily superior to the other. But you should be honest with yourself about the kind of leader you are and the kind of organization you lead when considering where and how you should deploy AI coding assistants.
Maturity of the Product and Business
Another critical determinant of AI adoption was the stage of the product life cycle the engineering team was building for. Teams tasked with improving mature applications with thousands or millions of users and paying customers were understandably far more conservative in their willingness to ship code written by an AI. On the other hand, teams building for early-stage, pre-scale startups or innovation teams building proof-of-concept initiatives were far more willing to embrace coding assistants.
Surprisingly, we did not observe a strong correlation between industry vertical and tolerance for AI assistance. Engineering teams operating within highly regulated industries like healthcare, where the cost of being wrong can literally mean someone's life, seemed to be using AI assistants as much as companies within arguably lower-risk verticals like media and entertainment.
Representation within Public Repos
Most engineering leaders realize that a Large Language Model (LLM) is only as good as the data set that it's trained on. The foundation models powering these coding assistants are effectively trained on whatever code is available on the public Internet. Some languages, like Python or Java, are extremely prevalent on platforms like GitHub or Stack Overflow. Others, like Rust or Lisp, are far less common. Since the model has many examples in its training data of the former, coding assistants tend to handle those needs quite well. For the latter, the tools may not perform as reliably.
You'll need to experiment with the specific coding languages, frameworks and problem space that your team uses, but part of your usage of AI coding assistants will likely depend on how well-represented they are in the public domain.
Team Experience and Composition
As mentioned earlier, multiple research efforts have concluded that junior developers derive more benefits from using AI assistants than their senior counterparts. The anecdotal evidence we've gathered corroborates this finding. LLMs are effectively a compression of the Internet, and are most likely to suggest code that appears frequently within their training data. The suggested code is therefore, by definition, an average or mediocre solution. You would expect your senior engineers to perform above average. They hold a higher standard for code quality and are often more set in their ways. As a result, they tend to scrutinize AI-generated code more thoroughly and will rewrite it more frequently. AI can assist in writing code, but senior engineers emphasize that understanding the code's functionality and correctness remains crucial, particularly during code reviews.
You'll almost certainly get more value out of AI coding assistants if your engineering team skews toward the less experienced side. It's worth noting that "experienced" in this case is relative to the specific engineering task at hand. For example, your team of experienced full stack web engineers may be complete novices when it comes to building native mobile applications. In this hypothetical scenario, you would expect a tool like Cursor to be less useful while building a web-based tool, yet invaluable for creating an iOS app.
Other best practices
There were a couple of other topics that came up at T3 that should be top of mind. After you’ve decided how your team will use AI coding assistants, use these guidelines as you update your processes.
Make time to understand the code
Regardless of whether AI wrote your team’s code, it’s critical that these tools don’t excuse your engineers from deeply understanding their code. Developers should still be able to explain what their code is doing and why it’s the correct solution during code reviews. This level of comprehension ensures that AI-generated code is not just functional but also aligned with best practices, performance considerations, and the overall architecture of the project. Encouraging this critical thinking during code reviews will naturally lead to more thoughtful prompting strategies, allowing engineers to leverage AI more effectively as a tool rather than a crutch.
This isn't a new issue per se. Long before generative AI, weaker developers would sometimes copy and paste snippets from Stack Overflow without fully understanding how they worked or how they fit into their codebase. The rise of AI coding assistants simply exacerbates this poor behavior. By prioritizing the understanding of code over mere output, you'll ensure your teams use AI properly as the support system that it is.
Consider multiple measures of productivity
While productivity gains were noted by several teams at T3, it’s essential to define what “productivity” means in this context. Some teams focused on sprint velocity, while the aforementioned academic study used pull requests as the KPI. Neither of these capture the full story in a satisfactory manner. While AI tools may increase the quantity of code written, engineering leaders also care deeply about nuances that aren’t reflected such as code quality, performance and long-term maintainability.
Unfortunately, there’s no silver bullet here. At the recent Developer Productivity Engineering Summit, thought leaders from Google indicated there is no perfect model for measuring developer productivity. Instead, they have resorted to using multiple measurements and data points to draw their conclusions.
AI coding assistants are becoming increasingly embedded in engineering workflows, but their optimal use depends heavily on team dynamics, project maturity, and the specific coding languages involved. Leaders must assess their own risk tolerance, team composition, and the nature of their projects to determine how best to leverage these tools. While AI assistants can enhance productivity, especially for less experienced developers, fostering a deep understanding of the code remains crucial to ensure both functionality and long-term maintainability. By thoughtfully integrating AI into development processes, teams can maximize its benefits while mitigating potential risks.
Mitigating the Risks of AI
Despite the constant presence of AI in the media, trust in AI companies has been declining. The US public viewed AI companies neutrally in 2019; that perspective has since moved into distinctly distrustful territory, dropping a full 15 percentage points by 2024.
Anecdotally, we hear similar concerns expressed by both business leaders and front-line employees. The former worry they don't have the appropriate governance in place for the technology, while the latter fear AI is coming for their jobs.
For business leaders, we see primarily four categories of risk to consider when deploying AI: inaccuracies in AI output, legal and data privacy risks, the risk of AI bias, and misuse through negligence or outright bad actors. As enterprises look to AI to drive decision-making and operational efficiencies, understanding and mitigating these risks becomes paramount:
Risk #1: Accuracy risks
The incredible power and the Achilles’ heel of generative AI are one and the same. It’s precisely the probabilistic approach that allows LLMs to produce human-like content. But that same fuzzy logic by definition will never be 100% accurate - just like no human is ever perfect. Practically, this means you should deploy AI under the assumption that it will be wrong some percentage of the time.
Air Canada provided a very public example of this. In 2022, its AI chatbot incorrectly informed a passenger that he was eligible for a discount. When the passenger went to redeem that discount, Air Canada shockingly argued that it wasn't responsible for any hallucinations its chatbot might have had and refused to honor it. The incident then turned into a PR debacle for the airline.
Whether the AI is wrong 10% of the time, 5% or even 1% depends on the specific implementation, but regardless it’s critical to take steps to mitigate those inevitable moments of failure. Here are a few for you to explore:
Supplement AI with ground truth
One particularly promising approach to improving AI accuracy is to deploy your app using a popular architecture called Retrieval Augmented Generation, or RAG for short. In simple terms, it entails creating a database that contains the knowledge you want your AI to accurately represent. Depending on your use case, this knowledge might include your customer service documents, or your company’s HR policies, or your catalog of product details. When a user asks your AI app a question, it first searches this database for any relevant ground truth, retrieves the appropriate info and appends the info as part of the prompt to the LLM.
Since your AI is provided with not only the user’s original instructions, but also any relevant context that we know to be factually correct, the responses it provides become far, far more accurate. This technique has proven to be incredibly effective at driving down hallucinations, with various development frameworks emerging to make implementation relatively easy.
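To make this concrete, here is a minimal sketch of the RAG pattern in Python. It assumes you already have some vector store with a similarity-search method and the OpenAI Python SDK installed; the store interface and helper names here are purely illustrative, not a specific product's API.

```python
# Minimal RAG sketch (illustrative only). Assumes an existing vector store
# object exposing a similarity-search method, plus the OpenAI Python SDK.
# The store interface and field names below are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_with_rag(question: str, vector_store) -> str:
    # 1. Retrieve the most relevant "ground truth" passages for the question.
    passages = vector_store.similarity_search(question, k=3)  # hypothetical API
    context = "\n\n".join(p.text for p in passages)           # hypothetical field

    # 2. Append the retrieved context to the prompt so the model answers
    #    from known-correct material rather than from memory alone.
    prompt = (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

In practice, frameworks handle the retrieval plumbing for you, but the core idea is exactly this: ground the model in your own documents before it answers.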
Keep a human-in-the-loop
Although RAG can significantly reduce hallucinations, it will never eliminate them entirely. We always recommend that companies deploy their AI tools in such a way that a human can pass final judgment on the output before its use. For example, this could include reading the AI-generated article before it's published, or reviewing an AI image before it's incorporated into an ad. It takes far less time for a person to review something than to create it, so the operational savings are largely preserved while you mitigate the risk of AI hallucinations.
In most businesses, a manager would expect to spend some portion of their time reviewing the work of an entry-level employee. The same mental model applies here.
Take responsibility when AI is wrong
When AI inevitably makes that mistake (because that's how AI works), and the human-in-the-loop inevitably fails to catch it (they're only human after all), be prepared as a business to take responsibility. Customers expect businesses to own their mistakes, whether that's an error on their website, a sales representative quoting the wrong price, or a chatbot saying the wrong thing. The cost for Air Canada to take responsibility for the hallucinated discount would have been a mere $645. The reputational harm from a disgruntled customer is immeasurably more costly.
Companies deploying AI should gauge both how often their AI is wrong and what each mistake costs, and bake that into their financials from the beginning. Within manufacturing, factories carefully measure the number of defects per million and expect to absorb the cost of those defects. Credit card companies anticipate that a portion of their customer base will fail to pay off their balances, and structure their products accordingly. AI usage should be no different.
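As a back-of-the-envelope illustration, with entirely hypothetical figures, the budgeting math is simple; the point is to treat AI errors the same way a factory treats defects:

```python
# Back-of-the-envelope budgeting for AI errors (all figures hypothetical).
monthly_interactions = 50_000   # AI-handled customer interactions per month
error_rate = 0.02               # assume the AI is wrong 2% of the time
caught_by_review = 0.90         # share of errors a human reviewer catches
avg_cost_per_error = 120.00     # average make-good cost per uncaught error ($)

uncaught_errors = monthly_interactions * error_rate * (1 - caught_by_review)
expected_monthly_cost = uncaught_errors * avg_cost_per_error

print(f"Expected uncaught errors per month: {uncaught_errors:.0f}")   # 100
print(f"Budget for make-goods: ${expected_monthly_cost:,.2f}")         # $12,000.00
```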
Avoid customer-facing applications to start
The worst kind of error a business can make is one that negatively impacts its customers. Given the inherent risk of error with AI and its novelty as a technology, some companies are prioritizing internal operational use cases for AI tooling over customer-facing applications. We've observed this anecdotally with our clients, and it's also reflected in recent survey data published by Andreessen Horowitz.
While we firmly believe there are significant opportunities to create value through customer-facing AI applications, if the cost of being wrong is particularly high or your company is still getting comfortable with the technology, it makes sense to start internally.
Risk #2: Legal & data privacy risks
The advent of generative AI has caused a wave of both legal and privacy concerns over what data is being used to train these large language models. Many media companies have expressed concern that their copyrighted content was used as training data without their permission and therefore illegally. This has led to multiple lawsuits against foundation model providers, such as the one from the NY Times.
Business leaders have been understandably skittish about adopting AI and inadvertently infringing on copyright. However, the model providers have nearly all responded by offering a form of “copyright shield”. For example, OpenAI explicitly states in their business terms that they will defend and indemnify users if a third party claims IP infringement. Some companies are trying to differentiate their models through copyright. For example, Adobe is positioning Firefly as a commercially safe alternative to Midjourney, having been trained on licensed photos from Adobe Stock vs scraped from the internet. We recommend clients review the indemnification clauses of model providers before using them, but rest assured the most popular providers are at this point commercially safe for enterprise use.
On the privacy side, businesses may be concerned that an AI provider will use their inputs into the AI as future training data. This would be especially problematic if sensitive, proprietary or personally identifiable data was then outputted by the AI to external users. We recommend clients pay close attention to the terms of use for the type of license you buy from the model providers. For example, OpenAI explicitly commits to not using your data, inputs and outputs for training models for ChatGPT Team and ChatGPT Enterprise. Contrast that with the privacy language for personal accounts, where OpenAI reserves the right to train their models on user provided content.
Some leaders have rolled out policies that allow the use of AI tools, but ask that employees refrain from providing the AI with confidential corporate information. These types of policies are cumbersome, confuse your teams and are nigh impossible to enforce. We believe a simple, clear policy is a better approach. Either get the appropriate license that provides data privacy (usually for Enterprise accounts) or don’t use those tools at all.
Risk #3: Risk of bias in AI
The unfortunate reality is that AI is often biased. This is because AI is trained on large swathes of the internet, which itself reflects the biases of humans and is not representative. Take language for example. 55.6% of the internet is written in English, despite the fact that native English speakers only account for 4.7% of the global population.
If you factor in secondary English speakers, that figure only goes up to 18.8%. Once you realize that this is how the underlying training data skews, it becomes unsurprising that the most popular LLMs handle English better than other languages. The same bias can be observed beyond spoken languages. For instance, within coding use cases most LLMs are more proficient with Python than with a less common language like Rust. One study from MIT found that three computer vision gender classification systems were significantly less accurate for darker-skinned females (up to a 34.7% error rate) compared to lighter-skinned males (a 0.8% error rate). This was attributed to training datasets that were overwhelmingly composed of lighter-skinned subjects.
The challenge for most leaders deploying AI for corporate applications is that they will almost certainly use off-the-shelf models like GPT-4, Claude or Gemini, where the exact training data used and the degree to which it is fair will not be clear. Another difficulty is that different organizations and individuals will have different definitions of fairness. However, you can still mitigate bias to an extent.
First, create a clear definition of fairness for your company and for your particular application of AI. Perhaps you’re primarily concerned with racial and ethnic fairness. Or perhaps you’re concerned with gender or age bias. Or perhaps you want to ensure your AI has sufficient non-English language coverage.
Once you have a measurable definition, you can set up an evaluation framework to periodically test your AI for bias. There are both proprietary and open source tools that can help with this, such as FairLearn from Microsoft and Fiddler.
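As one illustration, here is a minimal sketch of a bias audit using the open source Fairlearn library. The data below is placeholder data standing in for your AI's actual decisions; you would swap in the metric and the sensitive attribute that match your own definition of fairness.

```python
# Minimal bias-audit sketch using Fairlearn (placeholder data).
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, demographic_parity_difference

# y_true: ground-truth labels, y_pred: your AI system's decisions,
# group: the sensitive attribute you care about (e.g. language, gender, age band).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group  = ["en", "en", "en", "es", "es", "es", "es", "en"]

# Accuracy broken down per group: large gaps between groups are a red flag.
frame = MetricFrame(metrics=accuracy_score, y_true=y_true, y_pred=y_pred,
                    sensitive_features=group)
print(frame.by_group)

# Demographic parity difference: 0 means both groups receive positive
# predictions at the same rate; larger values indicate more disparity.
print(demographic_parity_difference(y_true, y_pred, sensitive_features=group))
```

Run periodically against a representative sample of real traffic, a report like this gives you an early warning when your AI starts treating one group measurably worse than another.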
Finally, you can inspect your own data for bias. While you won’t have access to the training data already incorporated into an LLM, in many instances you’ll supplement with additional training data to improve accuracy for your specific use case. Having balance within the data set that you control can help reduce any inherent bias.
Risk #4: Risk of negligence or bad actors
There have been numerous instances already of people being overly reliant on AI or not bothering to double-check the work it produces. In New York, there was a high-profile case in which two lawyers were sanctioned for submitting a legal brief that contained six fake citations that ChatGPT had hallucinated. In the scientific journal Frontiers, an author published a paper showing a rat with impossibly large genitalia, a figure (amongst others) that was fabricated by Midjourney. What's amazing in this instance is that the paper made it past an editor and two peer reviewers prior to publication. Whether the lawyers or authors in these instances were deliberately trying to mislead through AI or were simply negligent is beside the point. For a business leader, the outcome is the same.
There are a few techniques you can deploy to minimize this very real risk. The most potent technique was already mentioned earlier - keep a human in the loop. For particularly sensitive use cases, you may want to consider adding a second reviewer to catch anything the first reviewer may have missed. There are also tools that can be deployed to identify an over-reliance on AI. For example, Copyleaks and GPTZero are two popular tools that identify when AI has been used to generate a piece of content. In extreme instances where you’re concerned about truly malicious usage of AI, a common technique used within cybersecurity is to deploy red teams. With origins in war gaming, red teams are tasked with discovering ways to exploit the AI application and produce undesirable outcomes. Once these vulnerabilities have been identified, your teams can then develop appropriate countermeasures.
When integrating AI into business operations, it's crucial to acknowledge and address the inherent risks that accompany this technology. However, once understood, these challenges can be effectively mitigated through a variety of tactics. By establishing comprehensive governance frameworks and fostering a culture of accountability, businesses can safely and responsibly leverage AI to drive innovation and operational efficiency.
Selecting the right generative AI model.
It seems like every other day there are new AI models being released. While we’ve spoken to many CEOs and tech leaders who are looking to deploy AI within their organizations, many are overwhelmed by the sheer number of options now available. As of February 2024, Hugging Face had nearly 500K different AI models listed. In truth, most of the models out there are good enough for the majority of applications. We believe that most decision makers can radically simplify this decision by focusing on two factors: licensing and size.
Licensing
Like most enterprise software, Large Language Models (LLMs) are now offered in two primary types of licenses: proprietary and open source. Each path offers distinct advantages:
Proprietary Models: Speed, Scalability, Simplicity
Proprietary LLMs, such as GPT-4 or Claude 2, are built to be as turn-key as possible. These models have APIs that are well-documented, and come with tooling to facilitate deployment. This means your team can get up and running far more quickly, as they won’t need to deal with details like hosting, moderation guardrails, security and observability. Scaling up your AI application is also straightforward, as you simply pay more as you use it more. Finally, the complexity of the model itself is abstracted behind an API, reducing the need for extensive technical expertise.
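To illustrate just how turn-key the proprietary route is, here is roughly what a complete call to a hosted model looks like using OpenAI's Python SDK. This is a sketch; it assumes you have an API key set in your environment.

```python
# Sketch of a complete call to a hosted, proprietary model via OpenAI's
# Python SDK. Hosting, scaling, and security are handled by the provider;
# you supply only the prompt and your API key (read from OPENAI_API_KEY).
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Summarize our Q3 churn drivers in three bullets."}],
)
print(response.choices[0].message.content)
```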
That said, you inevitably sacrifice customizability and control. While you’ll be able to modify the architecture surrounding the model and take advantage of techniques like Retrieval Augmented Generation, you won’t be able to fine tune the model itself. Your application is also beholden to the model provider, who may choose to modify the service or pricing without your consent.
Open Source Models: Flexibility, Control, Community
Open source LLMs such as Llama 2 or BLOOM represent a more bespoke approach. One key advantage is flexibility, as businesses can modify and adapt these models to their specific requirements. You can tune and tweak these models as you see fit, adjusting not only the output but also optimizing response times. The second major advantage is the ability to mitigate data and privacy concerns. Since you’re the one operating the model, you maintain control over where the data sits and how it is or isn’t consumed.
While you benefit from the collective expertise of the community developing these open source models, the usage of open source LLMs generally requires deeper technical expertise. You will also need to commit to setting up, developing and maintaining the model itself in addition to your application.
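For comparison, here is a sketch of running an open source model locally with the Hugging Face transformers library. The specific checkpoint is just one example of many, and you would need sufficient GPU memory (and the accelerate package for automatic device placement) to run it.

```python
# Sketch of running an open source model locally with Hugging Face
# transformers. You own the infrastructure, so prompts and outputs never
# leave your environment -- but you also own hosting, scaling, and updates.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # one example of an open checkpoint
    device_map="auto",                           # place layers on available hardware
)
output = generator(
    "Summarize our Q3 churn drivers in three bullets.",
    max_new_tokens=200,
)
print(output[0]["generated_text"])
```

The two snippets do the same job, which is precisely the point: the proprietary version hides the infrastructure behind an API call, while the open source version puts the model, and all the operational responsibility, in your hands.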
Size
Generally speaking, the larger the model, the more capable it will be. The size of a model is typically measured by the number of parameters it contains. This latest wave of generative AI has been driven by large language models because they contain far more parameters as a result of having far more compute and training data.
Within the realm of large language models, there's still a large spread in terms of size. To give you a sense, Microsoft's Phi-2 (with 2.7 billion parameters) and Mistral's 7B model (containing 7 billion parameters) are on the smaller end. As size continues to scale, you'll see models like Llama 2 (70B parameters) and Falcon 180B (180B parameters). At the very top sits OpenAI's GPT-4, which is estimated to contain an astounding 1.76 trillion parameters.
Bigger, however, is not always better. First of all, the largest models may be overkill for your needs. Many of the largest models are multi-modal, in that they're designed to handle multiple data types (e.g., text, images, video and audio). Many AI applications will not need that full spectrum of capability, so why pay for it? Second, the larger models tend to have longer inference times - that is, the lag from when you input instructions to when the model returns a response. It's well understood that page load times have a significant impact on the performance of web applications like e-commerce. Many AI applications will similarly benefit from faster inference times. Finally, smaller models are faster and cheaper to train on your proprietary data sets compared to larger models.
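If inference time matters for your application, it's worth measuring it directly rather than guessing. Here is a simple sketch of how you might compare latency across candidate models, assuming you've wrapped each one behind a hypothetical generate(prompt) callable:

```python
# Simple latency-comparison sketch. Assumes each candidate model is wrapped
# behind a generate(prompt) -> str callable (hypothetical wrappers); only the
# model call itself is timed.
import statistics
import time


def measure_latency(generate, prompt: str, runs: int = 20) -> float:
    """Return the median seconds per response for a candidate model."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)                      # the call being evaluated
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Usage (hypothetical wrappers for two candidate models):
# print(measure_latency(small_model.generate, "Draft a product FAQ answer."))
# print(measure_latency(large_model.generate, "Draft a product FAQ answer."))
```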
What does this all mean in practice?
The right model choice will depend heavily on how far along you are in your AI journey, and the needs of your specific use case. At Eskridge, we believe that you should always start small with a pilot or proof-of-concept. The best models for pilots are the large, proprietary ones like GPT-4. Your team will be able to stand something up more quickly and cost-effectively, and because they'll be using a more powerful model, the outputs will be more likely to meet expectations and drive positive ROI. Finally, you most likely won't need significant customization at this stage.
Once you've established the value of your AI pilot and start to scale it to production, we believe the timing is right to consider an open source model to optimize your application. At a higher scale, the performance nuances of your AI will matter more. For instance, the cost per inference isn't material for a few uses a day, but as the number of inferences grows to the thousands per day the cost will add up. Or you may start to see the limits of the response quality you can get from an off-the-shelf model and wish to explore how much better the output could be from a model tuned on your own data sets. In these situations, you'll want to use the smallest open source model that gives you comparable results, for all the reasons outlined above. Many modern applications are built in a modular enough fashion that you can take a more bespoke route to AI down the road, once the associated costs become clearly justified.
Of course, there are exceptions to every rule. There are certain niche use cases where it would make sense to use an open source or small model right out of the gate. For one, if your use case requires handling extremely sensitive data, you could decide that even the limited exposure of that data to third parties during a pilot is simply unacceptable. This is a common scenario in verticals like healthcare or defense. Second, your application may require running the model on local hardware like a phone or IoT device. In these instances, proprietary models by definition won't be an option, and you won't have sufficient compute to run the larger open source models. Finally, if your use case requires a very specific type of output, you may need to fine-tune the model upfront, in which case you'll find yourself reaching for open source.
We do expect the practical differences between open/closed and large/small models to diminish over time. Just as the proprietary players are slowly opening up their walled gardens to give customers finer-grained control, new AI infrastructure companies like Baseten are emerging to make it much easier to deploy open source models. But for the foreseeable future, you'll want to pilot with large proprietary models like OpenAI's GPT-4 and over time migrate to a custom implementation of a smaller, open source model like Mistral 7B.
Benchmarks for AI-driven productivity gains.
2023 has seen an explosion of interest in generative AI. Venture capitalists are investing in it, big tech companies are hyping it and companies are rushing to pilot it. However, there have been precious few case studies that provide hard data on the impact of AI on operational efficiency. So how are business leaders to know what type of productivity gains to expect when deploying AI?
Fortunately, academia has leaned into this question and published a number of working papers this year studying the impact of AI on different types of work tasks. Below are the key learnings from a few of the most interesting papers, as well as some of the major implications for business leaders considering deploying AI within their companies.
Writing tasks
MIT published research on the impact of AI on business writing. The study consisted of 444 college-educated professionals from a variety of backgrounds (marketers, grant writers, data analysts, consultants, HR, managers). They were asked to write two pieces of content for work, such as press releases, short reports, analysis plans and delicate emails. Everyone wrote the first piece on their own; for the second piece, the treatment group used ChatGPT to help. The time to complete the task was 37% less for the treatment group compared to the control group. Not only did AI speed up the work, it also improved its quality. Evaluators were asked to score the writing on a scale of 1-7, and the average grade improved by 15% when using AI, from an initial score of 4 to a final score of 4.6.
Strategy tasks
In September, Harvard Business School shared the results of their collaboration with BCG. They gave 758 consultants each a set of 18 realistic consulting tasks to gauge the impact of AI on productivity. Consultants using AI completed 12.2% more tasks on average and completed tasks 25% more quickly. Like the aforementioned writing study, AI usage also boosted the quality of their output. Consultants using AI produced 40% higher quality work compared to those consultants who didn’t use AI.
Coding tasks
Microsoft and GitHub partnered with MIT to study the impact of Copilot, an AI programming tool, on developer productivity. They recruited 95 professional programmers and asked them to write an HTTP server in JavaScript. The 45 developers in the treatment group used GitHub Copilot, and the 50 developers in the control group did not. The average completion time for the AI-using group was 71.2 minutes, compared to 160.9 minutes for the group without AI, which means that AI drove a 55.8% reduction in task completion time.
Customer support tasks
Finally, the National Bureau of Economic Research conducted a study of the impact of AI by examining 3 million support conversations held by 5,179 customer support agents, of which 1.2 million conversations were held after AI was introduced. The study found that customer support agents using AI resolved 14% more issues per hour. These gains came from a reduction in the time it takes for an agent to handle an individual chat, an increase in the number of chats an agent can handle per hour and an increase in the share of chats that are successfully resolved.
As you can see, the studies report productivity improvements ranging anywhere from 14% to 55% depending on the task and the specific metric used. However, we recommend you think of these figures as an upper bound on the expected impact when deploying AI within your company. First of all, these studies were conducted in strictly controlled environments. The only time measured was that spent doing the test task, like writing or coding. In the real world, your employees will have to attend meetings, respond to emails and engage in a whole host of other typical business activities that aren't captured in these studies.
Second, many mid-market business leaders lack the resources to accurately measure internal productivity, and therefore will have to rely in part on anecdotal evidence. But the GitHub Copilot study shows the challenges with this approach: the participating engineers underestimated the impact of AI, self-reporting a 35% increase in productivity compared to the 55.8% increase actually measured. Beyond providing a starting point for evaluating AI and the potential productivity gains it can unlock, what else should a mid-market CEO take away from these studies?
Not every task is suitable for AI
These studies were built around tasks that the latest generation of AIs are particularly well-suited for - tasks such as writing, coding and conversing. It's important to remember that there are many use cases where generative AI commonly hallucinates or is otherwise poorly suited. For example, ChatGPT can be terrible at solving simple arithmetic problems. And many developers I’ve spoken to report that while Copilot is helpful for common and simple coding tasks, it struggles with more niche problems.
In the BCG research above, some of the 18 tasks used in the study were chosen because they were deemed beyond the current capabilities of AI. And when consultants used AI to support those tasks, they were 19% less likely to produce correct solutions compared to consultants without AI. Part of the risk in deploying AI is that even when it produces something incorrect, the output will often look plausible. Business leaders should carefully consider the tasks they wish to deploy AI against, as applying it in the wrong scenarios will lead to much worse outcomes than if you didn’t use it at all.
AI provides greater benefits for less-skilled workers
While the research confirms that AI can drive productivity improvement, all of the above studies also indicate that the gains were distributed unevenly across the test population. Specifically, less skilled workers benefit more from using AI, compared to highly skilled workers. In the business writing study, the writers who scored the lowest on the initial task saw the greatest degree of improvement when using AI, compared to those who initially scored higher.
In the BCG study, consultants who performed below average saw a 43% quality improvement when using AI. But consultants who were above average only saw a 17% improvement in quality.
The GitHub study specifically notes that "developers with less programming experience benefited the most," and even in the customer support research there was a more pronounced improvement amongst less-skilled and less-experienced workers. Newer agents using AI saw a 35% increase in the number of issues resolved per hour - much higher than the average improvement of 14%.
This suggests that companies that are more heavily reliant on less skilled or experienced workers would benefit more from deploying AI. Conversely, companies with a highly skilled workforce should expect to see lower productivity improvements from leveraging AI with their current talent. And any company expecting to hire rapidly, whether due to high turnover or high growth, should consider using AI to quickly bring new employees up to speed and close any skill gaps.
Key drivers of AI advancement.
In recent conversations with the owner of a video production company, the topic of creative AI tools came up. This leader had been ruminating on how recent AI developments might impact the future direction of his firm. It's a subject I'd previously encountered while experimenting with generative AI during my time at Amazon Ads. The burning question on everyone's mind, from this business owner to individual producers, is whether and when AI will be capable of fully automating video production. The answer isn't clear, but in this article I'll provide some context that will help you gauge the progress of AI and interpret its evolution.
There are really three main drivers of AI improvement. The first is the availability of computing power, the second is the explosion of data and the third is increasingly sophisticated algorithms.
Driver #1: Availability of Computing Power
The latest generative AI technologies utilize something called large language models. They're called "large" models because of the vast number of parameters they must learn during training. In general, the more parameters within a model, the better it performs. For instance, GPT-4 supposedly uses an astonishing 1.7 trillion parameters. The amount of computing power required to train a model like this is staggering, and has only recently become available. This is because over the past 50 years, the number of transistors we can fit on a chip has doubled approximately every two years - a phenomenon known as Moore's Law.
While this trend has slowed slightly in the past decade, I expect our hardware will continue to improve for the foreseeable future. And if you believe we’ll have faster and faster chips, then you can reasonably expect that we'll have the ability to train larger, better performing models over time.
Driver #2: Explosion of data
Where the improvements to computing power have followed a steady, predictable curve, the amount of data that we collectively produce has grown explosively. Hal Varian, Chief Economist at Google, helps put this into perspective: "Between the dawn of civilization and 2003, we only created five exabytes; now we're creating that amount every two days. By 2020, that figure is predicted to sit at 53 zettabytes [e.g. 53 trillion gigabytes] — an increase of 50 times."
The vast majority of this data is unstructured, and only a portion of the data produced each year is actually stored and carried forward into the following year. But even with those caveats, the growth of data has been incredible. It’s precisely these immense volumes of data, coupled with greater computing power, that have made it possible to train large language models like ChatGPT and Bard on such a wide range of topics.
Driver #3: Increasingly sophisticated algorithms
While the growth of computing power and data follows a relatively predictable pattern, the sophistication of our AI algorithms is anything but, and is therefore the driver to watch most closely. The typical pattern throughout history is that a researcher makes a new AI architectural discovery that unlocks a major leap forward in performance. The industry rushes to build upon this fundamental breakthrough, but at some point performance begins to plateau and an AI winter inevitably follows.
The current hype around generative AI can be traced back to the discovery of the transformer architecture, first proposed by Google researchers in 2017. Along the way, there have been many other model architectures that drove AI progress, from generative adversarial networks in 2014 and the resurgence of recurrent neural networks in the late 2000s, all the way back to expert systems introduced in 1965. As you can see, the major breakthroughs occur rather sporadically and contribute the greatest uncertainty to the future of AI - even for the field's luminaries. Marvin Minsky, the founder of MIT's AI laboratory, famously told Life Magazine back in 1970 that "from 3 to 8 years we will have a machine with the general intelligence of an average human being."
Speech recognition as a case study
In order to better understand the development of generative AI, it may be helpful to trace the development of a different AI application, that of speech recognition (also known as speech-to-text). Speech recognition has quickly become a ubiquitous feature in mobile devices and other smart devices. Part of its prevalence now is that speech recognition has finally become “good enough.”
A crisp definition of “good enough” is important to evaluate the potential impact of AI. A good benchmark is to compare the quality of the AI’s work against the quality produced by an average human. For self-driving cars, an important consideration is safety. In Tesla’s Q4 2022 report, drivers using Autopilot recorded one crash for every 4.85 million miles driven, compared to the most recent 2021 data provided by NHTSA showing an average of one crash every 652,000 miles. In the context of speech recognition, the average human is around 95% accurate when it comes to recognizing what someone else is saying.
Although speech recognition is pervasive now, it's actually been under research and development in some fashion since Bell Laboratories first designed the "Audrey" system in 1952. Thanks in part to DARPA research funding, major improvements were made in the 1970s and beyond. However, even as recently as 2001 the state of the art was only capable of 80% accuracy - well short of what's suitable for mainstream applications. It took another decade and a half before Google managed to achieve human parity of 95% accuracy in 2017.
Coming full circle back to video production, one of the most impressive startups applying AI to video is Runway. When you look at the output of their Gen-2 model, it's clearly not ready for prime time yet - the frame rate is choppy and there are noticeable artifacts. At the same time, the output is starting to feel like real video. When you trace Runway's history, the startup was founded in 2018 around building video editing tools for TikTok and YouTube creators. It wasn't until 2022 that they made a breakthrough in generative AI, partnering with Stability AI to release Stable Diffusion, one of the leading diffusion models for image generation. They built Runway's Gen-1 model for video generation in Feb 2023, followed in June with Gen-2. Even Runway's brief history traces a similar pattern of algorithm advancement: years of incremental improvement until they made the leap into diffusion-based generative architectures in 2022.
This post has been a long-winded way of saying I don't know when generative AI will become good enough to fully automate creative work like video production. The technology's progress depends on factors like computing power, the proliferation of data and algorithmic advancements. Given the unpredictable nature of algorithmic innovation, it could happen next year or it could be a decade or more into the future. Regardless of the specific timing, the astute business leader won't wait for that potential future. Instead, they'll begin reinventing their workflows and creative processes around AI technology today.
Why it’s the right time to explore AI.
Every day there are multiple news articles about how AI is changing the world. Some pundits caution that the latest wave of innovations will lead to doom and gloom. Others express a far more optimistic outlook. But nearly everyone is talking about it in some fashion.
When I speak with business leaders and executives, they express a mixture of interest, excitement, and fear. They are incredibly excited about the potential value that AI could create within their businesses. But they also express concern that their business may be left behind if they don’t adapt quickly enough and are afraid that they lack sufficient understanding of the technology. This makes sense. It’s a technical and complex area, and the raging hype around AI creates more noise than signal.
Fortune 500 executives can mitigate this with high priced consultants or in-house teams dedicated to innovation and emerging technologies. Amongst small and medium-sized business leaders however, these fears are particularly acute. They must deal with dozens of other pressing demands on their attention - rising cost of goods, contracting consumer spending, labor challenges, channel saturation and growing global competition to name a few. They realize AI has the potential to unlock incredible value for their customers and shareholders but lack the time and expertise to know where to begin.
Andrew Ng, arguably the godfather of AI, observed that the first wave of AI innovation required large data science and research teams. Therefore, only companies working on the most lucrative applications at scale could afford to build those capabilities. However, over the last few years a second wave of innovation around data science tools and open source models has made AI far more accessible to technology teams beyond the hallowed halls of Google DeepMind and Meta Research. As a result, it has only recently become financially feasible to apply AI to the long tail of use cases.
We’re excited to be founding Eskridge to serve this nascent market. AI is a fascinating technology, one that we believe will transform mid-market companies. We want to be a part of this revolution, shepherding clients into this incredible new era.
We plan to relentlessly focus on applied AI. Every technology service provider, system integrator and CX agency claims to offer some form of AI capabilities. These larger, more sprawling service providers tend to carry overhead costs that price them beyond the reach of small-to-medium business owners. And they can only develop expertise so quickly when AI is ancillary to their core business. This is a complex space that continues to evolve at a rapid pace, carries the potential to unlock unprecedented value and ultimately warrants full, undivided attention. At Eskridge, we're looking forward to doing precisely that.