That said, Harden’s essay sparks a good conversation with an excellent question:
What’s still unclear is whether or not we’d allow, as a profession, someone to use AI to get legal help. If we’re not at that point yet, then at what point would we be ok with that? And to figure out whether or not we’re at that point, I think we need some type of Turing-style test that we can point to. This is kind of the point with the bar exam and lawyers - like it or loathe it, it’s a metric that we can point to and say “this person should be allowed to practice law because they got a score of ___.” Otherwise we’re just moving goalposts around.
But Harden’s answer stops short of charting a course, so let’s try to do better.
The core problem is that comparing AI to “random human help” sets the bar so low that nearly any semi-coherent chatbot can clear it, and it leaves users with the same recourse for reckless legal advice they’d have against a random stranger: none.
Lawyers are held to a higher standard. We’re licensed precisely so that we can lose our license if we screw up, which is often the end of a lawyer’s career. Benchmarks of “normal quality” mean nothing without consequences when reliability fails.
Further, Harden never decides who he’s addressing. When discussing confidentiality, he hints at regulatory concerns. His anecdotes about LiveHelp volunteers suggest an audience of legal-aid designers. Yet the rhetorical nods to clever branding (“alliteration is nice”) feel aimed at legal tech founders. Each group asks different questions. Regulators require explicit thresholds: error-rate ceilings, jurisdictional accuracy bands, disclosure protocols. Legal-aid designers, looking out for their operations and clients, want to know about integration costs, content governance, and volunteer displacement. Legal tech product managers need acceptance criteria they can translate into feature backlogs and QA benchmarks. The Bartleby Baseline satisfies none of them.
Another version of this idea might replace the Bartleby question with tiered service claims, each carrying predictable triggers when the service fails, governance that scales with risk, and consequences for breach (a rough sketch in code follows the list):
Community Parity – General-purpose AI without legal guardrails (i.e., chatbots that don’t refuse legal questions) must match median non-lawyer help (the floor). Providers self-declare benchmark results, file them with a public registry, and publish mandatory transparency reports. Consequences for “malpractice” could include public delisting (i.e., mandatory legal guardrails), corrective notices to users, and modest administrative fines.
Professional Safety – AI must be comparable to licensed human practitioners on common discrete issues. Providers must undergo a conformity assessment by an accredited body (perhaps following the EU AI Act model) and hold professional-indemnity insurance sized by usage volume. A retained interaction log (with privacy safeguards) becomes the factual record for (possibly automated) post-incident analysis of “malpractice”, addressing the evidentiary gap that shields bad legal systems today. Consequences for the provider would include suspension of the conformity assessment and compensation paid out by the insurer.
Critical Reliance & Continuous Disclosure – AI must meet or exceed regulated professional standards in high-risk contexts, and providers must publish errors and remediation timelines continuously. Obligations could be analogous to bank-capital rules: statutory licensing of the provider, periodic audits, and a reserve fund sized by usage volume. Consequences could mirror medical-device law: license revocation, personal liability for officers, and potential criminal penalties for willful deception.
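To make the “build targets” a little more concrete, here is a minimal sketch of the three tiers as a machine-readable spec. The tier names and obligations come from the list above; every threshold number, field name, and the check_claim helper are hypothetical placeholders for illustration, not proposed standards.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    """One rung of the tiered-claims ladder described above."""
    name: str
    claim: str                   # what the provider is allowed to advertise
    min_benchmark_score: float   # hypothetical pass threshold on an agreed eval suite
    governance: list[str]        # instruments the provider must have in place
    consequences: list[str]      # pre-defined penalties when the claim is breached

TIERS = [
    Tier(
        name="Community Parity",
        claim="Matches median non-lawyer help",
        min_benchmark_score=0.60,  # placeholder number, not a proposed standard
        governance=["self-declared benchmarks in a public registry",
                    "mandatory transparency reports"],
        consequences=["public delisting / mandatory guardrails",
                      "corrective notices to users",
                      "administrative fines"],
    ),
    Tier(
        name="Professional Safety",
        claim="Comparable to licensed practitioners on common discrete issues",
        min_benchmark_score=0.85,  # placeholder
        governance=["accredited conformity assessment",
                    "professional-indemnity insurance sized by usage volume",
                    "retained interaction log with privacy safeguards"],
        consequences=["suspension of conformity assessment",
                      "insurer-paid compensation"],
    ),
    Tier(
        name="Critical Reliance & Continuous Disclosure",
        claim="Meets or exceeds regulated professional standards in high-risk contexts",
        min_benchmark_score=0.95,  # placeholder
        governance=["statutory license", "periodic audits",
                    "reserve fund sized by usage volume",
                    "continuous error and remediation disclosure"],
        consequences=["license revocation", "personal liability for officers",
                      "criminal penalties for willful deception"],
    ),
]

def check_claim(declared_tier: str, benchmark_score: float) -> dict:
    """Return the obligations a provider takes on by marketing at a given tier,
    and whether its latest benchmark score clears that tier's threshold."""
    tier = next(t for t in TIERS if t.name == declared_tier)
    return {
        "claim": tier.claim,
        "meets_threshold": benchmark_score >= tier.min_benchmark_score,
        "required_governance": tier.governance,
        "penalties_on_breach": tier.consequences,
    }

# Example: a provider markets at the middle tier with a score below the placeholder threshold.
print(check_claim("Professional Safety", benchmark_score=0.80))
```

Nothing in the sketch is load-bearing; the point is only that tiered claims translate directly into the kind of acceptance criteria and QA gates that product teams, regulators, and legal-aid providers could actually encode and audit.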
Framing the answer to the original question in tiers supplies regulators with threshold logic, product teams with build targets, and legal-aid providers (and their users) with transparent guarantees.
Benchmarks alone are toothless if malpractice merely means “try harder next release”. By attaching each tier to explicit governance instruments and pre-defined penalties, we can turn abstract reliability scores into enforceable obligations and service expectations. AI providers that wish to market at the Professional Safety or Critical Reliance tiers will then need to accept the same career-ending consequences that lawyers face.