Alan Schuchman (Cairncross & Hempelmann in Seattle, USA) has offered us a report on his tests of Casetext’s CoCounsel AI (our correspondence is attached). At the same time, a group of academics released a paper last week titled “Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence” (attached).
The conclusions of the two trials are strikingly similar, diverging only in their closing speculations. Here are some key points:
What we did
Research Paper: In this paper … we assess the accuracy of responses generated by LLMs on thousands of tax law inquiries across experimental setups … along with various [answer-]retrieval techniques (with comparisons made across different retrieval methods). We conduct these experiments across LLMs, from smaller and weaker models, up through the largest state-of-the-art model, OpenAI’s GPT-4.… Our total sample across the experiments contains 28,700 answers.
Cairncross: Over the course of my seven-day demo period I was … able to run 20 research memo requests [and, after the demo period, we are now using it on a $50-per-use basis]. You type in your inquiry … and within 10 minutes you have hours-worth of research summarized in memo form…. [W]e verified. I ran research memos for questions I already had human research on and … ran memos for multiple other attorneys with current research needs and asked for feedback on the results.
Results: “positive with caveats”
Research Paper: As evidenced by Figure 3, GPT-4 [but not smaller LLMs] can leverage relevant legal text and examples of the question-and-answer task to “reason” and come to a correct answer a large proportion of the time on difficult tax questions.… [However] even our best current models underperform a professional tax lawyer, who would be expected to answer these questions with near-perfect accuracy. Moreover, answering clear-cut legal questions is only a small part of the work of a practising lawyer.
Cairncross: … the [CoCounsel] results came up with the same cases [as the human researchers] as well as additional helpful cases not identified by the human researchers … Across the board the feedback was positive with caveats. There was … only [one] blatant error I am aware of, but there were often a lot of irrelevant results mixed in with the useful/relevant stuff.
Conclusion: “a very useful tool”
Research Paper: The most capable model, GPT-4, with both prompting enhancements [] and the most relevant [] legal text input into the prompt, can perform extremely well, far better than any other setup in the experiments.… there is no strong reason to believe that LLMs could not eventually accomplish a wide range of legal tasks with greater performance…. Our work [demonstrates] the emergence of tax law understanding, which occurs once the LLM is of sufficient underlying general capability and is adequately prompted to elicit “reasoning” behaviour.
Cairncross: … The bottom-line conclusion of our group was that this is a very useful tool. … [It] allows you to spend the hours you would have spent just finding a fraction of the results going through a much larger universe of relevant results (while ignoring those you can tell are irrelevant) to pull the most helpful results, including some you may not have found the old-fashioned way when you decided you had enough to quit your Westlaw efforts. I think there is no doubt that the $50 cost for a research memo more than pays for itself in the time saved in finding the same relevant cases.… The amount of human research hours that are saved by that $50 makes the calculus a no-brainer even on a pay per use basis. HAVING SAID THAT, WE STILL VERIFY EVERYTHING.
Speculations
Research Paper: Extrapolating these capabilities forward, LLMs being able to “understand” law would affect law-making and necessitate changes to legal services regulation…. including regulations about the unauthorised practice of law…
Cairncross: I don’t see a timetable for ending verification and frankly not sure that would ever happen…. That’s just my two cents given where the tech is now. That could change as this is all changing quickly.