Mindshifts in Software Engineering

An oral history of the first contact between humanity and an alien intelligence


💻 Mindshifts in Software Engineering

If there is a generative AI person in your circle of friends, they’ve probably felt that they were going slightly crazy over the last year. Only a few generative AI systems have truly succeeded in production, and they’ve been hard to build. This wonderful paper from Microsoft-GitHub, the first organization outside OpenAI to unveil a large-scale system, makes the challenges and opportunities clear:

  • Time-consuming process of trial and error

    • “Early days, we just wrote a bunch of crap to see if it worked. Experimenting is the most time-consuming [thing] if you don’t have the right tools. We need to build better tools”

  • Wrangling prompt output

    • “It would make up objects that didn’t conform to that JSON schema, and we’d have to figure out what to do with that”

    • “if the model is kind of inherently predisposed to respond with a certain type of data, we don’t try to force it to give us something else because that seems to yield a higher error rate” - on eventually accepting that file trees would be better generated as ASCII output and then parsed

  • Prompt management

    • It’s “a mistake doing too much with one prompt”

    • “So we end up with a library of prompts and things like that.”

  • Every test is a flaky test

    • “that’s why we run each test 10 times”

    • “If you do it for one scenario no guarantee it will work for another scenario”

    • “[manually curated spreadsheets with hundreds of] input/output examples” - on how they managed testing

  • Creating benchmarks and reaching testing adequacy

    • “especially for more qualitative output than quantitative, it might just be humans in the loop saying yes or no [but] the hardest parts are testing and benchmarks [still]”

    • “most of these, like each of these tests, would probably cost 1-2 cents to run, but once you end up with a lot of them, that will start adding up anyway”

    • “Where is that line that clarifies we’re achieving the correct result without overspending resources and capital to attain perfection?”

  • Safety and privacy

    • “We have telemetry, but we can’t see user prompts, only what runs in the back end, like what skills get used. For example, we know the explain skill is most used but not what the user asked to explain.”

    • “telemetry will not be sufficient; we need a better idea to see what’s being generated.”

  • Mindshifts in software engineering

    • “So, for someone coming into it, they have to come into it with an open mind, in a way, they kind of need to throw away everything that they’ve learned and rethink it. You cannot expect deterministic responses, and that’s terrifying to a lot of people. There is no 100% right answer. You might change a single word in a prompt, and the entire experience could be wrong. The idea of testing is not what you thought it was. There is no, like, this is always 100% going to return that yes, that test passed. 100% is not possible anymore”
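
To make the "run each test 10 times" idea concrete, here is a minimal Python sketch of what such a check might look like. It is not the setup described in the paper: the required keys, the `make_fake_model` stand-in for a real model call, and the cost figures are all assumptions made up for illustration. The point is only that a test becomes a pass rate over repeated runs instead of a single yes or no, and that the bill scales with examples times runs.

```python
import json
import random

# Hypothetical schema for illustration: the keys we expect in each response.
REQUIRED_KEYS = {"name", "children"}


def conforms(raw: str) -> bool:
    """Return True if raw parses as a JSON object containing the required keys."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()


def pass_rate(generate, prompt: str, runs: int = 10) -> float:
    """Call generate(prompt) `runs` times and return the fraction that conform."""
    passed = sum(conforms(generate(prompt)) for _ in range(runs))
    return passed / runs


def make_fake_model(failure_rate: float, seed: int = 0):
    """Stand-in for a real model call (an assumption, not a real API).

    Returns malformed JSON with probability `failure_rate`.
    """
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        if rng.random() < failure_rate:
            return '{"name": "src"'  # truncated, so it fails to parse
        return json.dumps({"name": "src", "children": []})

    return generate


def estimated_cost(num_tests: int, runs: int, cost_per_run: float) -> float:
    """Rough cost in dollars of one full pass over the test suite."""
    return num_tests * runs * cost_per_run


if __name__ == "__main__":
    model = make_fake_model(failure_rate=0.2)
    rate = pass_rate(model, "List the files in this repo as JSON")
    # The threshold is a judgment call: 100% is not a realistic bar here.
    print("pass rate:", rate, "ok" if rate >= 0.7 else "needs work")
    print("cost per full pass ($):", estimated_cost(300, 10, 0.015))
```

With the assumed numbers, 300 spreadsheet examples run 10 times each at 1.5 cents per call come to roughly $45 per full pass, which is the "that will start adding up" problem in miniature.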

The whole paper is interesting as an oral history of the first contact between humanity and an alien intelligence. Hat tip @vboykis
