May 15, 2026

Turning a terrible phone pic into a 3D asset for 55¢

Wherein we stitch together Nano Banana 2, Hunyuan 3D 3.1 Pro, compress the mesh – and find good value for money, most of the time.

By Fred Jonsson · Tumbric

Cutting-edge technology has always been about knowing what is possible today and what might come next. Scavenging for building blocks is a core engineering activity, but there are a lot of things you won't find in papers or launch posts: does it work on my inputs? How big are the outputs? How long will it take? How cheap can I get it? For that, you have to try things.

I'm far from a 3D expert, but it's been hard not to notice the progress of 3D generation from text and images in the AI/ML space. Some exciting developments:

Microsoft TRELLIS.2 (December 16, 2025 paper; code) is an open-source 4B image-to-3D model for high-resolution PBR-textured assets. The official repo says local inference needs Linux and an NVIDIA GPU with at least 24GB of memory, verified on A100 and H100.
Hunyuan 3D 3.1 Pro is a proprietary model from Tencent that accepts single-image and multiview inputs, including front, back, left, right, top, bottom, left-front, and right-front references.
Hyper3D Rodin Gen-2 is another proprietary model from Deemos technologies, cheaper than Hunyuan 3D 3.1 Pro but still very good. Its API supports text-to-3D and image-to-3D from one or more images, with export formats including GLB, the binary glTF format commonly used for web 3D assets.

Although the classical applications for 3D were largely around game props, the ease of creating 3D assets is opening possibilities in more product surfaces: commerce product pages with 3D models and AR, virtual try-on, and game and UGC asset generation. If parallax managed to take over the web, it wouldn't be surprising if generated 3D became more commonplace once anyone can make assets cheaply.

For this use case, the recent proliferation of multiview image-to-3D models is a big enabler. A multiview model can take multiple images as reference, which gives more control over generation than asking a model to infer every unseen side from one image. This is beneficial because we can synthesize high-quality inputs to the model, and allow the 3D model to focus on 3D modelling rather than inventing details of the objects from world knowledge.

So let's see how we can use the latest models to turn terrible photos into 3D artifacts.

Photo → synthesized views → 3D generator

To turn pictures into 3D artifacts, we first generate a cheap multiview set: front, back, left, and right. Those panels can be inspected before a mesh provider runs, and the same panels can feed more than one provider.

We use Nano Banana 2 (Gemini 3.1 Flash Image) to produce a single image that contains all four perspectives. It took a bit of prompting to get this working well, particularly to get it to consistently arrange the panels into a square grid. The model occasionally struggles with physics reasoning when showing objects from multiple angles, but it generally does a really good job for about 15 cents per 4K composite image before prompt and thinking tokens. (On my tiny and informal test set for this task, the results were visually comparable to Nano Banana Pro, which was more expensive and quite a lot slower in my tests.)

Cost footnote: as of May 16, 2026, the 55¢ headline is $0.40 for one Rodin Gen-2 mesh call plus Gemini 3.1 Flash Image Preview's documented 4K output price, 2520 image tokens × $60 / 1,000,000 = $0.1512, rounded to the nearest 5¢.
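The footnote's arithmetic can be reproduced in a few lines. This is only a sketch that reuses the figures quoted above; the function and constant names are my own, and prompt and thinking tokens are left out, as in the footnote.

```python
from decimal import Decimal, ROUND_HALF_UP

# Figures quoted in the cost footnote (as of May 16, 2026).
MESH_CALL_USD = Decimal("0.40")         # one Rodin Gen-2 mesh call
IMAGE_TOKENS_4K = 2520                  # output tokens for one 4K image
USD_PER_MILLION_TOKENS = Decimal("60")  # documented image output price


def pipeline_cost_usd() -> Decimal:
    """Mesh call plus one 4K composite image, excluding prompt/thinking tokens."""
    image_cost = IMAGE_TOKENS_4K * USD_PER_MILLION_TOKENS / Decimal(1_000_000)
    return MESH_CALL_USD + image_cost


def round_to_nearest_5_cents(usd: Decimal) -> int:
    """Return the amount in whole cents, rounded to the nearest multiple of 5."""
    cents = usd * 100
    return int((cents / 5).quantize(Decimal("1"), rounding=ROUND_HALF_UP)) * 5


if __name__ == "__main__":
    total = pipeline_cost_usd()
    print(f"exact: ${total}, headline: {round_to_nearest_5_cents(total)}¢")
```

Using `Decimal` rather than floats keeps the cents exact, which matters once the same calculation is summed over a batch of assets.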

After generating the multiview images, we crop them and feed them into the 3D generator. Since Gemini is very promptable and has tremendous world knowledge, this synthesis step gives us a lot of control to make sure that we have good inputs for the 3D generator, and it also lets us apply stylistic control if we want to (e.g., if we wanted the objects in Minecraft style).
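The cropping step might look roughly like the sketch below. It assumes a 2×2 composite whose panels read front, back, left, right from left to right and top to bottom; that ordering and the helper name `grid_crop_boxes` are illustrative assumptions, not the layout the pipeline actually used.

```python
from typing import Dict, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels

# Assumed panel order, reading the grid left to right, top to bottom.
VIEW_ORDER = ("front", "back", "left", "right")


def grid_crop_boxes(
    width: int,
    height: int,
    rows: int = 2,
    cols: int = 2,
    views: Sequence[str] = VIEW_ORDER,
) -> Dict[str, Box]:
    """Split a composite image of the given size into one crop box per view."""
    if len(views) != rows * cols:
        raise ValueError("need exactly one view name per grid cell")
    boxes: Dict[str, Box] = {}
    for index, view in enumerate(views):
        row, col = divmod(index, cols)
        # Integer arithmetic on cell edges so adjacent boxes share borders
        # and the last row/column absorbs any leftover pixels.
        left = col * width // cols
        upper = row * height // rows
        right = (col + 1) * width // cols
        lower = (row + 1) * height // rows
        boxes[view] = (left, upper, right, lower)
    return boxes
```

Each box is a `(left, upper, right, lower)` tuple, which is the form Pillow's `Image.crop` accepts, so with Pillow installed the panels could be cut with `image.crop(boxes["front"])` and so on.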

Rodin Gen-2 vs. Hunyuan 3D 3.1 Pro

Rodin was the cheaper default and worked for many samples. Hunyuan 3D 3.1 Pro generally rendered with higher fidelity but was ~60% more expensive.

Takeaways

This was a small-scale and informal experiment, but there were still some useful takeaways:

Synthetic multiview images worked well. Generating front, back, left, and right views before mesh generation gave the pipeline control over unseen sides and created an inspectable artifact before the expensive step. It also creates consistency from the input image to the final artifact, meaning that we can potentially use model routing to switch between different 3D models but still have a consistent and controlled generation style.
More views were not better. The Hunyuan 3D model can take in more than four views. I experimented with this but couldn't get it working much better than four images, and synthesizing more than four images at a time made Gemini's outputs less reliable.
The GLBs needed a bit of compression to be web-suitable. Raw outputs were roughly in the 10 MB range for Rodin and the 50 MB range for Hunyuan, but it was easy to bring that down to 0.5-3 MB through compression, without much noticeable degradation. Some models allow generating smaller artifacts by reducing the number of generated faces, but I didn't have much success bringing the sizes down significantly this way.
Agent-driven evals and vision help a lot. In the early stages of building a pipeline, nothing beats looking at the data. I often prefer agent-produced HTML reports to custom eval tooling at this stage: they make it easy to test combinations of prompts and 3D models without building a full automated eval stack or working out how to handle GLB files in Jupyter notebooks. Many combinations did not work for this use case, and reducing that search space before adding rigid evaluation saved time. I particularly like working with the Codex app for visual data, because you can see the model test its scripts by inspecting generated outputs. It can't really judge them like a human, but it can tell if they're obviously wrong.
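Before compressing a GLB, it helps to see where the bytes are. The sketch below reads the GLB container layout from the glTF 2.0 specification (a 12-byte header followed by length-prefixed JSON and BIN chunks) using only the standard library. It is an inspection helper I am adding for illustration, not the compression step itself, which isn't detailed here; the function name and the fields it reports are my own choices.

```python
import json
import struct
from typing import Dict

GLB_MAGIC = 0x46546C67   # ASCII "glTF", little-endian
CHUNK_JSON = 0x4E4F534A  # ASCII "JSON"
CHUNK_BIN = 0x004E4942   # ASCII "BIN\0"


def glb_summary(data: bytes) -> Dict[str, int]:
    """Report chunk sizes and a few asset counts for a GLB byte string."""
    magic, version, total_length = struct.unpack_from("<III", data, 0)
    if magic != GLB_MAGIC:
        raise ValueError("not a GLB file")
    summary = {
        "version": version,
        "total_bytes": total_length,
        "json_bytes": 0,
        "bin_bytes": 0,
        "meshes": 0,
        "images": 0,
    }
    offset = 12  # first chunk starts right after the header
    while offset + 8 <= min(total_length, len(data)):
        chunk_length, chunk_type = struct.unpack_from("<II", data, offset)
        payload = data[offset + 8 : offset + 8 + chunk_length]
        if chunk_type == CHUNK_JSON:
            summary["json_bytes"] = chunk_length
            doc = json.loads(payload.decode("utf-8"))
            summary["meshes"] = len(doc.get("meshes", []))
            summary["images"] = len(doc.get("images", []))
        elif chunk_type == CHUNK_BIN:
            # Geometry buffers and embedded textures live here.
            summary["bin_bytes"] = chunk_length
        offset += 8 + chunk_length
    return summary
```

Usage would be along the lines of `glb_summary(open("asset.glb", "rb").read())`, where `asset.glb` is a placeholder path. A BIN chunk that dwarfs everything else suggests geometry and textures are the place to look for savings.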

The future of 3D

We can expect generation quality to go up and cost to come down. Here's what I'm looking out for:

Small open-weights models. In my tests, there was a decent quality gap between 3D generation models that cost 7¢/image (via GPU inference on L40S) and SOTA models that cost upwards of 65¢/image. When that gap closes, serious experimentation is on the table for more players. At the moment, these models seem most likely to come out of Meta or Microsoft. (I would also include Tencent, whose Hunyuan3D-2mv is a good open-weights model, but its license appears to ban showing model outputs in the UK, EU, and South Korea.)
Animated 3D. The most interesting uses of 3D probably involve some form of animation over static objects: imagine rendering a website with a retro computer where the keys click and the mouse moves as you type. Now imagine you could create this lightweight animation from only a picture of an old Apple II and a prompt about how it ought to be animated. As much as world models might enable similar experiences without 3D rendering, I expect them to be costly and limited to commercial hardware for a long time.
Parametric generation. The idea of generating thousands of polygons is not necessarily efficient for all use cases where 3D might be desirable, especially when you want to represent kinetics. Similar to SVG for vector images, many objects could benefit from a parameterized representation, which can faithfully represent both topology and interactions and can be much more efficient. This could be true for an illustrated keyboard or Simon Willison's cartoon pelican riding a bicycle benchmark.
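To make the parametric idea concrete, here is a toy sketch of the illustrated-keyboard example: five integers describe the whole key grid, including how far a key travels when pressed, and the layout is expanded on demand. The `KeyboardParams` type and its fields are hypothetical and not taken from any existing format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class KeyboardParams:
    """Hypothetical parametric description of a keyboard's key grid."""
    rows: int
    cols: int
    key_size: int  # edge length of a key cap, arbitrary units
    gap: int       # spacing between neighbouring keys
    travel: int    # how far a key moves down when pressed


def key_origins(p: KeyboardParams) -> List[Tuple[int, int]]:
    """Expand the parameters into the (x, y) origin of every key, row by row."""
    pitch = p.key_size + p.gap
    return [(col * pitch, row * pitch) for row in range(p.rows) for col in range(p.cols)]


def key_height(p: KeyboardParams, resting_height: int, pressed: bool) -> int:
    """The interaction is part of the representation: pressed keys sit lower."""
    return resting_height - p.travel if pressed else resting_height


def box_mesh_vertex_count(p: KeyboardParams) -> int:
    """Vertices needed if each key were stored as a plain box (8 corners)."""
    return p.rows * p.cols * 8
```

For a 4×12 grid, the five parameters stand in for 384 explicit box vertices, and a real generated mesh with rounded key caps would need far more. The key-press motion also comes for free instead of being animated per vertex.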

About Tumbric

Tumbric is a product lab and consultancy that helps with this kind of work: prototypes to see what's possible, experiments to decide where to invest, and production-ready AI pipelines to land the impact. tumbric.com fred@tumbric.com