EngineeringApril 2026

Dominic Phillips

Software Developer

MedGemma and the Local Models Moment

I spent Saturday morning running a medical multimodal model on my laptop. Pointed it at a chest X-ray, waited about thirty seconds, and got back a plausible preliminary read. I am not making a clinical claim from one run. The part I kept coming back to was the boring one: no network call. No vendor GPU log. The image never left the machine.

The Saturday run

I pulled MedGemma 4B down from Hugging Face. Roughly 3GB in quantized form. Loaded it into Ollama, handed it an anonymized chest radiograph, and asked for a structured impression. No API key. No terms of service describing how a patient image would be retained, logged, or used for training. Just a model running against an image on the same machine.

I kept thinking about that. The model is small enough, fast enough, and good enough that the whole inference path fits inside a network boundary a clinic already controls. A lot of portfolio company AI work is this repetitive, privacy-heavy stuff. Until recently I would have routed much of it through a hosted API without much debate. That assumption feels weaker now.

What Google shipped

Google released MedGemma as part of the Gemma family, with 4B and 27B variants tuned for medical work, plus MedSigLIP as the image encoder path for classification and retrieval.

MedSigLIP. A lightweight medical image-text encoder for classification, zero-shot labeling, retrieval, and other image-based tasks that do not need a generated free-text answer.

MedGemma 4B (multimodal). Reads both text and medical images. Designed for chest X-ray interpretation, longitudinal radiograph tracking, anatomical feature localization, and classification across radiology, pathology, dermatology, and ophthalmology. Small enough to run on a laptop with integrated graphics.

MedGemma 27B (text). Heavier reasoning for clinical question answering, report generation, and structured extraction from lab results. Runs comfortably on a workstation GPU or a modestly equipped on-prem server.

MedGemma 27B (multimodal). The larger text-and-image variant is the one to test when a single model needs to handle medical images, medical text, medical records, and FHIR -structured EHR workflows.

The release page uses a phrase I usually distrust, but here it is literal.

“MedGemma can be adapted by developers for clinical workflows and used as a privacy-preserving tool within agentic systems.”
Google DeepMind, MedGemma release page

I usually read “privacy-preserving” as a phrase someone had to put in a deck. Here it describes the deployment: run the model inside your infrastructure, give your agents access to it, and leave PHI where it already lives. That’s where we have been pushing portfolio companies anyway. The difference is that the model vendor is now shipping for that posture instead of treating it like an edge case.

Google also lists seven concrete use cases the models are optimized for: high-dimensional imaging from CT, MRI, and whole slide histopathology; longitudinal chest X-ray tracking; anatomical localization; structured extraction from medical lab reports; image classification across radiology, pathology, dermatology, and ophthalmology; report generation; and clinical question answering. That list is mostly the annoying work inside a healthcare operation: repetitive, privacy-sensitive, and awkward to send through a generic hosted API.

The selector is the work

One model doesn’t have to win everything. The useful thing in Google’s lineup is that the routing is legible. Labels and retrieval go to the image encoder. Prose, structured reasoning, and reports go to a MedGemma generator. Cases that mix images, text, and FHIR-shaped records go on the 27B multimodal shortlist.

Model Selection

Choose by workload, not by model size

ModalityI want to...Suggested model

Image

Classify or retrieve medical images

Radiology, skin, pathology, or ophthalmology image collections.

MedSigLIPImage encoderHardware: Jetson Orin Nano Super

ImageText

Fine-tune generative imaging workflows

Question answering, reporting, or other image-to-text applications.

MedGemma 4BCompact multimodalHardware: RTX workstation; Jetson edge deploy

ImageText

Build for on-device or low-compute settings

Use the smaller generative model when footprint is the constraint.

MedGemma 4BCompact multimodalHardware: Jetson Orin Nano Super

Text

Get the strongest text-only baseline

Clinical QA, summarization, extraction, and reasoning without images.

MedGemma 27B TextText specialistHardware: 24GB+ workstation GPU

ImageText

Use one model for text and imaging

A single route for mixed medical text, records, and image tasks.

MedGemma 27B MultimodalUnified modelHardware: 48GB+ GPU server

ImageText

Handle complex multimodal reasoning

Higher-capacity reasoning when the case crosses modalities.

MedGemma 27B MultimodalUnified modelHardware: 48GB+ GPU server

FHIR

Interpret FHIR-based health records

Text-based EHR data with structure the model must preserve.

MedGemma 27B MultimodalFHIR-awareHardware: 48GB+ GPU server

Google's model guidance splits image-only encoder tasks from generative text and multimodal workflows. Start with the workload, then choose the model.

Google Health AI Developer Foundations

I would build around this part. Not one medical model behind every button. A small router. Image search to MedSigLIP. Local generative image work to MedGemma 4B. Text-only reasoning to 27B text. FHIR-heavy multimodal work to the larger multimodal model. Much less grand than a giant clinical copilot. Also probably more useful.

Google is also clear about the limit. From the release page: “The outputs generated by these models are not intended to directly inform clinical diagnosis, patient management decisions, treatment recommendations, or any other direct clinical practice applications. All model outputs should be considered preliminary and require independent verification, clinical correlation, and further investigation.” Good. That’s how every medical AI tool should be treated right now.

EHR Navigator agent using MedGemma with a FHIR store — Three Google-built reference demos showing what MedGemma can drive: an EHR Navigator agent over a FHIR store, a radiology explainer that annotates X-ray findings in plain language, and a learning companion for trainees.
Google DeepMind

MedGemma generating a structured radiology impression from a CT slice — Three Google-built reference demos showing what MedGemma can drive: an EHR Navigator agent over a FHIR store, a radiology explainer that annotates X-ray findings in plain language, and a learning companion for trainees.
Google DeepMind

The boundary

For the past three years, most healthcare AI deployments I have seen have the same shape. The EMR or workflow tool calls OpenAI, Anthropic, or Google. A BAA sits around the call. PHI leaves your network, runs through someone else’s GPUs, and comes back. Sometimes that is fine. Sometimes the security review turns into a long thread no one wants to be in.

With local inference, that middle hop is gone. No PHI leaves. No BAA with the model vendor, because the vendor is not in the path. The audit trail can live beside the EMR audit trail. That’s a smaller surface to explain.

A few practical things follow.

Latency collapses. A round trip to a hosted API runs 300 to 900ms on a good day for a small prompt. Local inference on an M-series laptop returns the first token in under 100ms. In a scribe or intake workflow, that is the difference between feeling instant and feeling like another form.

Per-token cost stops being the meter. High-volume workflows like coding QA, chart review, intake triage, and document classification stop accumulating a bill that scales linearly with patient volume. The capex is a GPU you already have or a single workstation. After that, usage is mostly electricity.

Reliability stops depending on someone else’s uptime. Clinics lose internet more often than people outside healthcare realize. A local model keeps working when the fiber cut down the street takes out the hosted endpoint your scribe relies on.

On-Device Inference

The Stack Inside the Perimeter

Data Never Leaves Premises

Edge

Clinic Device

M-series laptop or on-prem NPU

Model

MedGemma 4B

Quantized weights, ~3GB VRAM

Tools

FHIR-Native Tools

Structured calls into EMR schema

System

Clinic EMR

Existing system of record

No external API callsNo PHI egressAudit trail stays on-prem

With a local model, the model call stays inside the clinic network. The hosted API disappears from the path.

A week earlier

MedGemma got me to write this down because I had just seen the same pattern somewhere else.

On April 16, Simon Willison reported that Qwen3.6 35B A3B, a 20.9GB quantized model running locally on his M5 MacBook Pro through LM Studio, beat Claude Opus 4.7 on his long-running pelican-on-a-bicycle SVG test. Even with thinking level cranked to max, Opus could not render a correct bicycle frame. The flamingo-on-a-unicycle variant went the same way.

Yes, the test is silly. I still pay attention to it, because Simon has been running it long enough that the results tend to track real model utility. He is careful with his conclusions: he is not claiming the 21GB file is genuinely more useful than Anthropic’s newest release across the board. He is reporting that on one specific, non-trivial task, on his laptop, it was.

Two data points are not a law. But I have a hard time ignoring the pair: a 4B medical model writing a defensible X-ray impression on a laptop, and a 35B general model beating a hosted frontier API on a weird rendering test on consumer silicon. My old default was simple: serious healthcare AI lives behind a hosted API. That default is getting harder to keep.

And to be clear: I still love you, OpenAI and Anthropic. My Claude subscription is not going anywhere. Frontier hosted models still do plenty of things no 4B model can. I’m just less willing to send every task there by default.

Capability vs Footprint

Local Just Crossed the Line

Runs on Your Laptop

Cloud Only

Utility Threshold

Gemma 3 4B

3 GB

MedGemma 4B

3 GB

MedGemma 27B

~18 GB

Qwen3.6 35B A3B

21 GB quant

Claude Opus 4.7

Hosted

Four of the five models above now run on hardware a clinic already owns.

Four of the five models above now run on hardware a clinic may already own. The laptop and workstation column is no longer just demos.

Notes from the toolchain

If you have not run a local model in the last six months, the developer experience has changed more than the public scoring discourse has. A few concrete things we have learned putting this into portfolio company workflows.

Quantization has mostly stopped being scary. 4-bit via GGUF or MLX cuts memory by roughly 4x with accuracy loss that, for the kinds of structured tasks we run (extraction, summarization, classification), we cannot reliably tell from the full-precision version. A 27B model that would not fit on a workstation in its native form sits comfortably in 18GB of VRAM at Q4. MedGemma 4B quantized is about 3GB, which is why it runs on my old laptop and not just the current-gen one.

The runtime layer is the part that surprised me. llama.cpp handles most CPU and GPU targets cleanly. Apple Silicon gets first-class treatment through MLX, which is what makes the M-series laptop workflow feel fast instead of academic. Ryzen AI PCs now have AMD Lemonade doing the same job on the Windows / NPU side. Ollama and LM Studio wrap all of this with OpenAI-compatible HTTP endpoints, so your existing code that calls chat.completions.create drops in with one changed line.

Running a model locally in 2026 is weirdly normal. Not research-project normal. Normal-software normal. ollama pull medgemma:4b and you are talking to the thing. The integration cost I was budgeting for a year ago turned out to be almost nothing.

Tool calling works. Both MedGemma and Qwen support structured output and function calling via standard JSON schema. The same pattern we wrote about in Building the Bridges Your AI Needs. The MCP servers we originally built for hosted models dropped into the local setup with almost no changes.

The weak spots are real. Long-context reasoning beyond 32K tokens still favors hosted frontier models. Complex agentic workflows that stitch five or six tool calls into one session are still better on a Claude or GPT-class model. Anything that depends on a very recent training cutoff is frontier-only by definition. So I’m not thinking replacement. I’m thinking routing. Let the local model handle the boring private bulk. Spend hosted tokens when the request actually needs them.

What I would change now

We spend most of our time inside the AI infrastructure of portfolio companies: scribes, coding engines, prior auth agents, scheduling, patient messaging. Here is how I would change the build plan now.

1. Default to hybrid

Anything new we design now assumes two inference targets: a local model for the structured, repetitive, privacy-heavy bulk, and a hosted frontier model for the hard reasoning. The interesting part is the router between them. A year ago this felt early. Now it changes cost, compliance, and reliability.

2. Redraw the PHI map

If you operate a multi-site healthcare business, your CIO or CISO has a diagram somewhere showing which workflows send PHI to which vendor under which BAA. Local inference lets you cross a lot of lines off that diagram. In diligence, fewer PHI egress paths make the security review shorter.

3. Reprice the AI line item

High-volume document workflows have been quietly accruing API spend that scales with patient volume. Move the same workflow onto a local model and variable cost drops to electricity. For a multi-site group doing seven-figure annual encounters, the number is not a rounding error. Model it before next year’s budget closes, not after.

4. Start building on-prem inference ops

The useful skills shift a little. Prompt engineering still helps. Running a reliable inference stack helps more. GPU capacity planning, quantization selection, observability, batching, update discipline. The teams we pushed earliest into this are now faster at shipping AI features than teams still routing everything through hosted APIs.

5. Actually test MedGemma

If you have clinical imaging, medical-text extraction, or clinical QA that needs to stay inside your four walls, test it. It won’t be the best fit for every task. It doesn’t need to. Medical tuning, FHIR alignment, and on-prem deployment make it different enough from a general hosted model that it deserves a real validation cycle.

Where I landed

Three years ago I assumed serious healthcare AI would live behind hosted APIs for a long time. I don’t anymore. A 4B medical model runs on a laptop. A 35B general model can beat a hosted frontier API on a concrete task on consumer hardware. That doesn’t make local the whole stack. It makes it part of the stack.

The practical upside is boring and good: less PHI moving around, lower variable cost, fewer hard dependencies on someone else’s uptime. That’s enough for me.

If you want to pressure-test which parts of your healthcare AI workload could move local, get in touch. We do this work inside portfolio companies.

Cade Newsletter

Research that moves before the market does.

Original analysis on healthcare strategy, AI adoption, and market dynamics. Delivered when we publish.

No spam. Unsubscribe anytime.