If we grant the hypothesis of the talk – that LLM prompts/replies can now or soon be considered natural language VMs [NLVMs] – then I believe a primary concern for software engineers and business stakeholders alike is the formal verification of the executable outputs of those VMs.
Two questions follow from this:
- If a primary concern is to determine whether an NLVM is doing exactly what we think it will do, how do we apply tools like SAT solvers, dependent type/category theory, linear resource management, etc. to NLVMs?
- Anecdotally, I’ve found that GPT models have a very hard time “understanding” type systems and, for that matter, proper abstraction boundaries. Does this mean we need non-NLVM systems as intermediates between NLVM inputs and outputs? Where do those demarcations lie?
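To make the "non-NLVM intermediate" idea concrete, here is a minimal sketch in Python: a deterministic verification harness that sits between the NLVM's output (a candidate implementation) and anything downstream, accepting the candidate only if it satisfies a formal spec. Everything here is hypothetical – `verify_sort`, the sorting spec, and the stand-in `llm_emitted` function are illustrative names, and exhaustive checking over small inputs stands in for a real SAT solver or dependently typed checker.

```python
from itertools import permutations
from typing import Callable, List

def verify_sort(candidate: Callable[[List[int]], List[int]], max_n: int = 5) -> bool:
    """Check a candidate sort function against the formal spec of sorting:
    the output must equal the sorted permutation of the input.
    Exhaustive over all permutations of 0..n-1 for n <= max_n – a toy
    stand-in for a SAT solver or type checker acting as the intermediate."""
    for n in range(max_n + 1):
        for inp in permutations(range(n)):
            if candidate(list(inp)) != sorted(inp):
                return False
    return True

# Hypothetical NLVM-emitted implementation (stand-in for model output).
llm_emitted = lambda xs: sorted(xs)

# A correct candidate passes; a buggy one (e.g. the identity function) fails.
print(verify_sort(llm_emitted))        # True
print(verify_sort(lambda xs: xs))      # False
```

The point of the sketch is the demarcation: the NLVM's natural-language side ends at the candidate artifact, and a conventional, mechanically checkable system decides acceptance.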