The dust has hardly formed, much less settled, when it comes to AI-powered text-to-image generation. Yet the result is already clear: a tidal wave of crummy images. There is some quality in the mix, to be sure, but not nearly enough to justify the damage done to the signal-to-noise ratio – for every artist who benefits from a Midjourney-generated album cover, there are fifty people duped by a Midjourney-generated deepfake. And in a world where declining signal-to-noise ratios are the root cause of so many ills (think scientific research, journalism, government accountability), this is not good.
It’s now necessary to view all images with suspicion. (This has admittedly long been the case, but the increasing incidence of deepfakes warrants a proportional increase in vigilance, which, apart from being simply unpleasant, is cognitively taxing.) Constant suspicion - or failing that, frequent misdirection - seems a high price to pay for a digital bauble that no one asked for and that, as yet, offers little in the way of upside. Hopefully - or perhaps more aptly, prayerfully - the cost-to-benefit ratio will soon enter saner territory.
But in the meantime, we should be aware of a new phenomenon in the generative AI world: AI-powered text-to-CAD generation. The premise is similar to that of text-to-image programs, except that instead of an image, the program returns a 3D CAD model.
A few definitions are in order here. First, Computer-Aided Design (CAD) refers to software tools wherein users create digital models of physical objects - things like cups, cars, and bridges. (Models in the context of CAD have nothing to do with deep learning models; a Toyota Camry ≠ a recurrent neural network.) Also, CAD is important; try to think of the last time you were not within sight of a CAD-designed object.
Definitions behind us, let’s turn now to the big players who want in on the text-to-CAD world: Autodesk (CLIP-Forge), Google (DreamFusion), OpenAI (Point-E), and NVIDIA (Magic3D). Examples of each are shown below:
The presence of major players has not deterred startups, which as of early 2023 were popping up at a rate of nearly one a month; among them, CSM and Sloyd are perhaps the most promising.
In addition, there are a number of fantastic tools that might be termed 2.5-D, as their output is somewhere between 2- and 3-D. The idea with these is that the user uploads an image, and the AI then makes a good guess as to how the image would look in 3D.
Open source animation and modeling platform Blender is, unsurprisingly, a leader in this space. And the CAD modeling software Rhino now has plugins such as SurfaceRelief and Ambrosinus Toolkit which do a great job of generating 3D depth maps from plain images.
All of this, it should first be said, is exciting and cool and novel. As a CAD designer myself, I eagerly anticipate the potential benefits. And engineers, 3D printing hobbyists, and video game designers, among many others, likewise stand to benefit.
However, text-to-CAD has real downsides, many of them severe. A brief listing might include:
- Opening the door to the mass creation of weapons and of racist or otherwise objectionable material
- Unleashing a tidal wave of crummy models, which then go on to pollute model repos
- Violating the rights of content creators, whose work is copyrighted
- Digital colonialism: amplifying very-online western design at the expense of non-western design traditions
In any event, text-to-CAD is coming whether we want it or not. But, thankfully, there are a number of steps technologists can take to improve their programs’ output and reduce its negative impacts. We’ve identified three key areas where such programs can level up: dataset curation, a pattern language for usability, and filtering.
To our knowledge, these areas remain largely unexplored in the text-to-CAD context. The idea of a pattern language for usability will receive special attention, given its potential to dramatically improve output. Notably, this potential isn’t limited to CAD; it can improve outcomes in most generative AI domains, such as text and image.
Dataset Curation
Passive Curation
While not all approaches to text-to-CAD rely on a training set of 3D models (Google’s DreamFusion is one exception), curating a model dataset is still the most common approach. The key here, it scarcely bears mentioning, is to curate an awesome set of models for training.
And the key to doing that is twofold. First, technologists ought to avoid the obvious model sources: Thingiverse, Cults3D, MyMiniFactory. While high-quality models are present there (mine among them ;), the vast majority are junk. (The Reddit thread ‘Why is Thingiverse so shit?’ is one of many that speak to this problem.) Second, super high-quality model repos should be sought out. (Scan the World is perhaps the world's best.)
Next, model sources can be weighted according to quality. Master of Fine Arts (MFA) students would likely jump at the chance to do this kind of labeling - and, due to the inequities of the labor market, for peanuts.
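As a minimal sketch of how such weights might be used downstream, the snippet below samples training models in proportion to source quality. The source names and weight values are illustrative assumptions, not recommendations:

```python
import random

# Hypothetical per-source quality weights (0 to 1), e.g. as assigned
# by MFA-student labelers. The names and numbers are illustrative only.
SOURCE_WEIGHTS = {
    "scan_the_world": 0.95,
    "museum_scans": 0.85,
    "thingiverse": 0.15,
}

def sample_training_models(models, k):
    """Sample k models for training, favoring higher-quality sources.

    `models` is a list of (model_id, source_name) tuples.
    """
    weights = [SOURCE_WEIGHTS.get(source, 0.1) for _, source in models]
    return random.choices(models, weights=weights, k=k)

corpus = [("benchy_remix", "thingiverse"), ("discobolus", "scan_the_world")]
print(sample_training_models(corpus, k=1))
```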
Active Curation
Curation can and should take a more active role. Many museums, private collections, and design firms would gladly have their industrial design collections 3D scanned. Plus, in addition to producing a rich corpus, scanning would create a robust record of our all-too-fragile culture.
Data Enrichment
In the process of creating a high quality corpus, technologists must think hard about what they want the data to do. At first glance, the main use case might seem to be ‘empowering managers at hardware companies to move a few sliders that output blueprints for a desired product, which can then be manufactured’. If the failure-rich history of mass customization is any guide, however, this approach is likely to flounder.
A more effective use case, in our view, would be ‘empowering domain experts - people like industrial designers at product design firms - to prompt engineer until they get a suitable output, which they then fine-tune to completion’.
Such a use case would require a number of things that are perhaps non-obvious at first glance. For example, domain experts need to be able to upload images of reference products, as in Midjourney, which they then tag according to their target attributes - style, material, kinetics, etc. It might be tempting to adopt a faceting approach here, where experts select dropdowns for style type, material type, etc. But experience suggests that enriching datasets so as to create attribute buckets is a bad idea. This manual approach was favored by the music streaming service Pandora, which was ultimately steamrolled by Spotify and its neural-net-driven recommendations.
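To make the contrast concrete, here is a minimal sketch of freeform attribute tagging; the `ReferenceImage` type and the tag names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class ReferenceImage:
    """A reference product uploaded by a domain expert.

    Tags are freeform text for the model to embed, rather than picks
    from fixed facet dropdowns ("style type", "material type").
    """
    path: str
    tags: dict = field(default_factory=dict)

ref = ReferenceImage(
    path="references/laptop_stand.jpg",
    tags={
        "style": "minimal, wedge profile",
        "material": "anodized aluminum",
        "kinetics": "static, no moving parts",
    },
)
print(ref.tags["style"])
```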
Takeaways
Rigorous dataset curation is an area where (with a few exceptions) little has been done and, hence, much is to be gained. This should be a prime target for companies and entrepreneurs seeking a competitive advantage in the text-to-CAD wars. A large, enriched dataset is hard to make and hard to imitate - the best kind of moat.
On a less corporatist note, thoughtful dataset curation is the ideal way to drive the creation of products that are beautiful. Reflecting the priorities of their creators, generative AI tools to date have been, to put it lightly, taste-agnostic. But we ought to take a stand for the importance of beauty. We ought to care about whether what we bring into this world will enchant users and stand the test of time. We ought to push back against the mediocre products being heaped onto mediocre bandwagons.
If beauty as an end in itself is insufficient for some, perhaps they will be persuaded by two data points: sustainability and profit.
The most iconic products of the past hundred years - the Eames chairs, Leica cameras, Vespa scooters - are treasured by their users. Vibrant fandoms restore them, sell them, and continue to use them. Perhaps the intricacy of their design required 20% more emissions than rival products of their day. No matter. That their lifespans are measured in quarter centuries and not in years means that they led to less consumption and fewer emissions.
As for profit, it’s no secret that beautiful products command a price premium. iPhone specs have never been clearly superior to Samsung’s. Yet Apple can charge 25% more than Samsung. The adorable Fiat 500 subcompact gets worse gas mileage than an F-150. No matter. Fiat wagered, correctly, that yuppies would gladly pay an extra $5K for cuteness.
A Pattern Language for Usability
Overview
Pattern languages were pioneered in the 1970s by the polymath Christopher Alexander. A pattern language is a mutually-reinforcing set of patterns, each of which describes a design problem and its solution. While Alexander’s first pattern language targeted architecture, pattern languages have been profitably applied to many domains (most famously programming) and stand to be at least as useful in the domain of generative design.
In the context of text-to-CAD, a pattern language would consist of a set of patterns; for example, one for moving parts, one for hinges (a subset of moving parts, hence one layer of abstraction down), and one for friction hinges (another layer of abstraction down). The format for a friction hinge pattern might look like this:
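A minimal machine-readable sketch, loosely following Alexander’s context-problem-solution format; all field names and values here are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Pattern:
    """One node in the pattern language's directed graph."""
    name: str
    parent: str            # one layer of abstraction up
    context: str           # where in a design this pattern applies
    problem: str           # the recurring design problem
    solution: str          # the general form of the solution
    parameters: dict = field(default_factory=dict)  # tunable outputs

friction_hinge = Pattern(
    name="Friction Hinge",
    parent="Hinged Mechanisms",
    context="Lids and screens that must hold any angle the user sets.",
    problem="A free-swinging hinge lets a screen flop open or slam shut.",
    solution="Size a constant-torque friction element to the lid's weight "
             "and lever arm, with margin for wear over the product's life.",
    parameters={"torque_nm": 0.35, "hinge_width_mm": 8.0},
)
```

The `parent` field is what links each pattern into the hierarchy described next.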
In common with natural language, pattern languages comprise a vocabulary (the set of design solutions), syntax (where a solution fits into the language), and grammar (rules for which patterns may solve a problem). Note that the above pattern ‘Friction Hinge’ is one node in a hierarchical network, which can be visualized by a directed network graph.
Embodied in these patterns would be best practices with respect to design fundamentals - human factors, functionality, aesthetics, etc. The output of pattern-guided programs would thereby be more usable, more understandable (avoiding the black box problem), and easier to fine-tune.
Crucially, unless text-to-CAD programs account for design fundamentals, their output will amount to little more than junk. Better nothing at all than a text-to-CAD-generated laptop whose screen doesn’t stay upright.
Perhaps the most important of all these fundamentals - and the most difficult to account for - is design for human factors. To get a useful product, the number of human factors considerations verges on the infinite. The AI must recognize and design around pinch points, finger entrapment, ill-placed sharp edges, ergonomic proportions, etc.
Implementation
Let’s look at a practical example. Suppose Jane is an industrial designer at Design Studio ABC, which has a commission to design a futuristic gaming laptop. The state of the art now would be for Jane to turn to a CAD program like Fusion 360, enter Fusion’s generative design workspace, and spend the rest of the week (or month) working with her team to specify all relevant constraints: loads, conditions, objectives, material properties, etc.
But however powerful Fusion’s generative design workspace is (and we know from experience that it’s powerful), it can never get around one key fact: the user must have lots of domain expertise, CAD ability, and time.
A more pleasant user experience would be to simply prompt a text-to-CAD program until its output meets one’s requirements. Such a pattern-centric workflow might look like the following:
Jane prompts her text-to-CAD program: “Show me some examples of a futuristic gaming laptop. Use for inspiration the form factor of the TOMO laptop stand and the surface texture of a king cobra”.
The program outputs six concept images, each informed by patterns such as “Keyboard Layout”, “Hinged Mechanisms”, and “Port Layout for Consumer Electronics”.
She replies “Give me some variations of image 2. Make the screen more restrained and the keyboard more textured.”
Jane: “I like the third one. What parameters do we have on that one?”
The system, drawing on the ‘Solution’ fields of the patterns it finds most relevant, lists 20 parameters - length, width, monitor height, key density, etc.
Jane notes that the hinge type is not specified, so she types “add a hinge type parameter to that list and output the CAD model”.
She opens the model in Fusion 360 and is pleased to see that an appropriate friction hinge has been added. Since the hinge comes parameterized, she increases its width, knowing that Studio ABC’s client will want the screen to hold up to a lot of abuse.
Jane continues making adjustments until she’s fully satisfied with the form and function. This done, she can pass it off to her colleague Joe, a mechanical engineer, who will inspect it to see which custom components might be replaced by stock versions.
In the end, management at Studio ABC is happy because the laptop design process went from an average of six months to just one. They are doubly pleased because, thanks to parameterization, any revisions requested by their client can be quickly satisfied without a redesign.
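No such product exists yet, but the parameter-centric steps in this walkthrough can be sketched in code. Everything below - the `Concept` type, the parameter names, the export method - is hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A stand-in for what a text-to-CAD service might return: a model
    whose pattern-derived parameters remain editable after generation."""
    prompt: str
    parameters: dict = field(default_factory=dict)

    def export_step(self, path: str) -> None:
        # A real service would serialize boundary-representation geometry;
        # this stub just reports where the file would be written.
        print(f"Writing parameterized model to {path}")

concept = Concept(
    prompt="futuristic gaming laptop, variation 3",
    parameters={"length_mm": 360.0, "monitor_height_mm": 210.0,
                "key_density_per_row": 14},
)

# Jane's tweaks before opening the model in Fusion 360.
concept.parameters["hinge_type"] = "friction"
concept.parameters["hinge_width_mm"] = 12.0  # widen the hinge for durability
concept.export_step("gaming_laptop_v1.step")
```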
Thorough Filtering
As AI ethicist Irene Solaiman recently argued in a pointed interview, generative AI is sorely in need of thorough guardrails. Even with the benefit of a pattern language approach, there’s nothing inherent in generative AI to prevent generation of undesirable output. This is where guardrails come in.
We need to be capable of detecting and denying prompts that request weapons, gore, child sexual abuse material (CSAM), and other objectionable content. Technologists wary of lawsuits might add to this list products under copyright. But if experience is any guide, objectionable prompts are likely to make up a significant portion of queries.
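As a deliberately bare-bones illustration of a first line of defense (a production system would pair trained classifiers with human review, not a keyword list):

```python
# Illustrative only: real guardrails would use learned classifiers over
# prompts and generated geometry, not a static keyword denylist.
DENYLIST = {"receiver", "suppressor", "sear"}  # placeholder entries

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt may proceed to generation."""
    tokens = set(prompt.lower().split())
    return not (tokens & DENYLIST)

assert screen_prompt("a friction hinge for a laptop lid")
assert not screen_prompt("a drop-in auto sear")
```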
Alas, once text-to-CAD models get open-sourced or leaked, many of these queries will be satisfied without compunction. (And if the saga of Defense Distributed has taught us anything, it’s that the genie will never go back into the bottle; thanks to a recent ruling in Texas, it’s now legal for an American to download the blueprints for an AR-15, 3D print the rifle, and then - should he feel threatened - shoot someone with it.)
In addition, we need widely-shared performance benchmarks, analogous to those that have cropped up around LLMs. After all, if you can’t measure it, you can’t improve it.
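A text-to-CAD benchmark might pair shared prompts with programmatic checks on the returned geometry. A minimal sketch, in which the checks are placeholders for real tests run in a CAD kernel:

```python
def is_watertight(model) -> bool:
    return True  # placeholder: a real check would query a CAD kernel

def has_no_sharp_edges(model) -> bool:
    return True  # placeholder: a real check would inspect edge fillets

BENCHMARK = [
    ("a 14-inch laptop shell with friction hinge", [is_watertight]),
    ("an ergonomic desk lamp", [is_watertight, has_no_sharp_edges]),
]

def score(generate) -> float:
    """Fraction of checks passed across all benchmark prompts."""
    results = [check(generate(prompt))
               for prompt, checks in BENCHMARK
               for check in checks]
    return sum(results) / len(results)

print(score(lambda prompt: None))  # a trivial generator scores 1.0 here
```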
____
In conclusion, the emergence of AI-powered text-to-CAD generation presents both risks and opportunities, and the ratio between them is still very much undecided. Low-quality CAD models and toxic content are just two of the issues that require immediate attention.
There are several neglected areas where technologists might profitably train their attention. Dataset curation is crucial: we need to track down high-quality models from high-quality sources, and explore alternatives such as the scanning of industrial design collections. A pattern language for usability could provide a powerful framework for incorporating design best practices, and for generating CAD model parameters that can be fine-tuned until a model meets the requirements of its use case. Finally, thorough filtering techniques must be developed to prevent the generation of dangerous content.
We hope the ideas presented here will help technologists avoid the pitfalls that have plagued generative AI to date, and also enhance the ability of text-to-CAD to deliver delightful models that benefit the many people who will soon be turning to them.
Authors
Reggie Raye is a teaching artist with a background in industrial design and fabrication. He is the founder of design studio TOMO.
K. Alexandria Bond, PhD is a neuroscientist focusing on the rules driving learning dynamics. She studied cognitive computational neuroscience at Carnegie Mellon. She currently develops machine learning methods for precision diagnosis of psychiatric conditions at Yale.
Citation
For attribution in academic contexts or books, please cite this work as
Reggie Raye and K. Alexandria Bond, "Text-to-CAD: Risks and Opportunities", The Gradient, 2023.
Bibtex citation:
@article{raye2023texttocad,
author = {Raye, Reggie and Bond, K. Alexandria},
title = {Text-to-CAD: Risks and Opportunities},
journal = {The Gradient},
year = {2023},
howpublished = {\url{https://thegradient.pub/text-to-cad}},
}