The Translation Bill for Bulk Chinese Research is Manageable

Lessons Learned from using LLMs for Bulk Chinese Data Extraction and Technical Translation

Jun 23, 2026

Local open-weight models vs GPT, with 95% confidence intervals

How to read it: left is faithful translation quality against human-curated gold; right is technical-detail extraction accuracy. Error bars are 95% confidence intervals. What to see: DeepSeek-V4-Flash joins GPT-5.5’s tier on both tracks, Qwen3.6-35B-A3B bf16 thinking reaches the GPT-5.5 extraction tier, and the practical recommendation still depends on serving reliability.

Results up front

I have spent the last few weeks paying for frontier model tokens to read Chinese financial and technical filings.

Not for literary translation.

For the data that actually matters to the questions I want to answer: semiconductor disclosures, subsidy language, data-center capacity, R&D spend, government grants, risk factors, customer concentration, and the data’s units that become load-bearing when you publish.

Those LLM tokens got more expensive than I wanted to achieve the fidelity I needed.

More importantly, it risked changing how I research answers. When every Chinese excerpt feels like a token bill, you start sampling carefully. That is the wrong instinct. For this work I want to read broadly, translate everything that might matter, extract the numbers, and reserve expensive review for the passages that carry a claim. I was searching for needles in the haystack so I needed to be able to automate going through it haystack by haystack, hay strand by hay strand, until I found the needles or be relatively confident those haystacks didn’t bear needles.

So I ran the experiment I should have run earlier.

The question was narrow:

Can a locally hosted open-weight model replace frontier LLM for technical Chinese filing work?

The answer, as of mid June 2026, is yes.

With one important qualification: this does not replace skilled Chinese translators where judgment matters. It replaces the broad, expensive Frontier AI pass. Human translation effort is scarce and valuable. I would not spend it grading a model bake-off unless the review itself had direct strategic value. I would spend it on the handful of passages where a real research claim depends on nuance.

That is the whole point.

The benchmark result is real. DeepSeek-V4-Flash tied GPT-5.5 on human-gold translation chrF++ (65.1 vs 65.0) and on technical-detail extraction (85.2% vs 85.7%), while significantly beating GPT-4.1 on both tracks. It also produced zero hallucinations in the judge run.

The workflow changes from “pay a Frontier AI to read everything” to:

Bulk technical translation: use Qwen3.6-35B-A3B bf16 direct. Do not turn on thinking mode by default.

Unit-heavy financial extraction: use Qwen3.6-35B-A3B bf16 thinking, Google Gemma 4 31B IT bf16, or NVIDIA Nemotron-3 Super 120B-A12B-NVFP4. DeepSeek-V4-Flash also passes the stress test, but I would not make it the default until I can serve the model without layers of kludges. Do not use Qwen direct for this job.

Structured JSON extraction: use Qwen3.6-35B-A3B bf16 thinking with strict unit instructions.

Publication-critical material: use GPT-5.5 audit (I had dedicated my Anthropic token budget to Fable while it was enabled), and human review if the Chinese excerpts are strategically important to geopolitical decision making.

Sensitive working notes and large source corpora: run local first. Do not upload unnecessarily to an API.

Benchmark curiosity: DeepSeek-V4-Flash is the strongest all-around local result in this run, including the unit stress test. I am not making it the default until vllm “just works”.

The salient point here is that I was becoming reticent to spend a few hundred dollars on frontier LLM tokens to achieve the necessary fidelity for each high-confidence assessment. I already have a sunk personal cost of high-end AI GPUs. By reframing my mental workflow from “is it worth spending a few hundred bucks on something that will probably fail” to “these tokens are almost free” I could take on more project risk which led to some very interesting findings.

Put through the entrepreneur’s lens, converting a recurring operating cost to a higher sunk cost let me optimize the human aspects of a workflow. Money well spent.

The result that changed the recommendation

The most important finding was not the headline leaderboard. It was the unit stress test.

Chinese filing work is often not hard because the sentence is hard. It is hard because 亿元 means hundred-million yuan, 万元 means ten-thousand yuan, 千元 means thousand yuan, and a missed table heading can move a number by a power of ten.

The model does not need haiku.

It needs unit discipline.

So I added a stress test: 77 objective targets from my work with Chinese annual reports and prospectuses. The past projects had already cross checked these transcribed numbers and unit conversion against other (audited) filings. That’s a validated test suite.

How to read it: 77 objective prospectus/annual-report targets; 95% confidence intervals; dashed line is GPT-5.5. What to see: Qwen3.6-35B-A3B bf16 direct collapses. Gemma 4 31B IT bf16 and Nemotron-3 Super 120B-A12B-NVFP4 lead at 96.1%. GPT-5.5, Qwen3.6-35B-A3B bf16 thinking, MiniMax-M2.7-NVFP4, and DeepSeek-V4-Flash all land at 94.8% with overlapping confidence intervals.

The result:

Google Gemma 4 31B IT bf16, 32.7B checkpoint parameters: 96.1%
NVIDIA Nemotron-3 Super 120B-A12B-NVFP4, MoE, 120B total / 12B active: 96.1%
OpenAI GPT-5.5: 94.8%
Qwen3.6-35B-A3B bf16 thinking, MoE, 35B total / 3B active: 94.8%
MiniMax-M2.7-NVFP4, MoE, about 116B checkpoint parameters: 94.8%
DeepSeek-V4-Flash mixed fp4/fp8, MoE, 284B-class: 94.8%
OpenAI gpt-oss-120B MXFP4, MoE, 120B, 128 experts / top-4: 93.5%
OpenAI GPT-4.1: 92.2%
Z.ai GLM-4.7-Flash bf16 thinking, MoE-lite, 64 experts / top-4: 89.6%
Z.ai GLM-4.7-Flash bf16 direct, MoE-lite, 64 experts / top-4: 83.1%
Mistral Small 4 119B 2603 NVFP4: 77.9%
Qwen3.6-35B-A3B bf16 direct, MoE, 35B total / 3B active: 64.9%

DeepSeek’s late stress-test run matters because it did not merely win the broad leaderboard and then fail the unit trap. On this 77-target prospectus stress set, it scored 94.8%, with a 95% CI of [89.6, 98.7], matching GPT-5.5, Qwen3.6-35B-A3B bf16 thinking, and MiniMax-M2.7-NVFP4. It also had 0.0% scale errors. That makes DeepSeek part of the same unit-discipline tier, even though I still would not make it the default workflow model until the serving stack just works.

That changed the operating doctrine.

Qwen3.6-35B-A3B bf16 direct (non-thinking) is my default translator.

It is not my extractor for unit-heavy filings.

In one example, Hygon disclosed R&D spending as:

研发投入 280,977.56 万元

The requested answer was in 亿元. That’s a unit which means hundred-million yuan. The right answer is 28.10.

Qwen3.6-35B-A3B bf16 direct returned 2,809.78.

Qwen3.6-35B-A3B bf16 thinking returned 28.10.

Same model. Same excerpt. One setting changed.

That is not a style problem. That is a hundred-fold error. And if you’ve read my 10 part series on China as an AI Great Power, you might remember that one of the hardest parts of analyzing Chinese data is keeping the numbers’ units correct. The smallest error is at least a 10x error.

So the rule is simple: translate direct; extract unit-heavy numbers with thinking mode or with a model that already has strong unit discipline.

What I tested

This was not a generic translation benchmark. I used the artifacts from my recent research projects: Hunting Huawei’s Hidden Fab, Mapping China’s AI Fab Industry, Mapping China’s AI Model Builders, Mapping Chinese Data Centers, and an exploration of Chinese state subsidies which I haven’t published yet. These are the kinds of Chinese excerpts I was already paying models to process while doing real work. And more importantly, I have been cross-checking numbers from alternative sources and modalities so I find myself with a really effective translation evaluation set for the problems I care about.

The shorthand matters, so here is what the local names mean in this article:

Qwen direct / Qwen thinking means Qwen3.6-35B-A3B bf16, MoE, 35B total / 3B active parameters, run in direct or thinking mode.

Gemma means Google Gemma 4 31B IT bf16, 31B class / 32.7B checkpoint parameters.

Nemotron means NVIDIA Nemotron-3 Super 120B-A12B-NVFP4, MoE, 120B total / 12B active parameters, 512 routed experts, top-22 routing, mixed NVFP4/FP8.

MiniMax means MiniMax-M2.7-NVFP4, MoE, about 116B checkpoint parameters, 256 experts, top-8 routing, NVFP4.

GPT-OSS means OpenAI gpt-oss-120B MXFP4, MoE, 120B class, 128 experts, top-4 routing.

GLM means Z.ai GLM-4.7-Flash bf16, MoE-lite, 64 routed experts, top-4 routing; substitute for the larger GLM-4.7-NVFP4 target.

DeepSeek means DeepSeek-V4-Flash base mixed fp4/fp8, a 284B-class MoE, served through the prebuilt vLLM-lucifer docker stack. It is included in both the main leaderboard and the 77-target prospectus stress test, where it matches GPT-5.5 at 94.8% with zero scale errors.

Mistral means Mistral Small 4 119B 2603 NVFP4, 119B class, NVFP4A16 group-16 quantization.

Two tracks:

Track A: faithful translation. Chinese quote in, faithful English out. No summary, no commentary, no invented connective tissue. Preserve the numbers, units, names, and technical language. The track has 180 translation pairs: 75 human-curated gold pairs and 105 GPT-4.1-referenced pairs.

Track B: technical-detail extraction. Chinese filing excerpt in, structured English values out: revenue, grants, percentages, PUE, megawatts, cabinet counts, subsidy categories, R&D spend. The track has 439 records and 1,534 scoreable targets.

Then came the 77-target unit stress test above.

That last test is why the article is about Chinese filing work, not just translation.

The leaderboard, corrected

Here are the core results.

Leaderboard: Chinese filing translation and extraction

How to read it: translation is chrF++ (a translation similarity metric) on human-gold references; extraction is target-level accuracy; judge is GPT-5.5 scored on a 1 to 5 scale; latency is mean seconds per item. What to see: DeepSeek reaches GPT-5.5’s tier on both objective tracks, Qwen3.6 thinking reaches the extraction tier with much smaller operational complexity, and GPT-5.5 remains the best judged publication translator.

The raw point estimate says Qwen3.6-35B-A3B bf16 thinking scores 86.5% on extraction and GPT-5.5 scores 85.7%. The first draft would have said Qwen3.6-35B-A3B “beats” GPT-5.5.

That is a little too strong. I care most about data extraction fidelity, but I do care about translation especially for aggregating sentiment analysis.

After bootstrapping the comparisons, Qwen3.6-35B-A3B bf16 thinking vs GPT-5.5 on extraction is a statistical tie: +0.8 percentage points, 95% confidence interval [-0.7, +2.6]. MiniMax-M2.7-NVFP4 vs GPT-5.5 is also a tie.

DeepSeek-V4-Flash is the new wrinkle. On human-gold chrF++, DeepSeek scores 65.1 and GPT-5.5 scores 65.0: a tie, +0.09, 95% CI [-1.3, +1.5]. On extraction, DeepSeek scores 85.2% and GPT-5.5 scores 85.7%: also a tie, GPT-5.5 +0.46 points, 95% CI [-1.1, +2.0]. DeepSeek significantly beats GPT-4.1 on both human-gold chrF++ (+2.9, 95% CI [+1.3, +4.5]) and extraction (+2.5 points, 95% CI [+0.4, +4.9]).

The durable claims are these:

DeepSeek-V4-Flash ties GPT-5.5 on both main objective tracks and significantly beats GPT-4.1 on both. That is the strongest all-around local benchmark result.
Qwen3.6-35B-A3B bf16 thinking significantly beats GPT-4.1 on data extraction: +3.8 points, 95% CI [+1.7, +6.2].
NVIDIA Nemotron-3 Super 120B-A12B-NVFP4 significantly beats GPT-4.1 on human-gold chrF++ translation: +2.4, 95% CI [+0.9, +4.0].
Thinking genuinely improves Qwen3.6-35B-A3B data extraction: +2.9 points, 95% CI [+1.8, +4.0].
GPT-5.5 remains the best judged publication translator: it leads the GPT-5.5 judge overall, and significantly beats Qwen3.6-35B-A3B bf16 thinking on that judge, +0.11, 95% CI [+0.04, +0.19]. DeepSeek’s objective translation score is excellent, but its fluency score is lower than GPT-5.5’s.

The honest headline is not “DeepSeek replaces GPT-5.5.”

It is: local open-weight models reach GPT-5.5’s tier on the main objective tracks, reach or clear GPT-4.1 for this workflow, and leave GPT-5.5 as the final prose-audit ceiling.

That is enough to change how I work. This surprises many people: I don’t codify Claude or Codex skills very often (a skill is a specific technique you encode into your AI agent). I typically manage hierarchies of agents like I manage organizations of people. I give them objectives, requirements, metrics, and expectations; I provide rigorous oversight; and I don’t care how they tactically or technically achieve my objectives. My workflow results in the agents building one-time-use tools bespoke to the problem we are addressing. But in this case, I had Claude and Codex both codify a translation skill for each agent. Pro tip: you can just tell your agent this URL and tell it to create a translation skill based on the blog post.

What the examples show

For ordinary technical disclosure, Qwen3.6-35B-A3B bf16 direct is good enough and far cheaper at scale if you have suitable GPUs already.

Chinese disclosure:

如果我们因出口管制而无法采购足够数量的先进半导体芯片，我们可能面临重大挑战，包括产能受限、成本上升及效能下降。

Reference:

If we are unable to procure a sufficient quantity of advanced semiconductor chips due to export controls, we may face significant challenges, including capacity constraints, rising costs, and degraded performance.

Qwen3.6-35B-A3B bf16 direct:

If we are unable to procure sufficient quantities of advanced semiconductor chips due to export controls, we may face significant challenges, including constrained production capacity, increased costs, and reduced performance.

The load-bearing terms are there: export controls, advanced semiconductor chips, constrained capacity, costs, performance. I would not spend frontier tokens on this.

For close prose, GPT-5.5 still has the edge. It tends to preserve disclosure register more precisely, while local models sometimes paraphrase into a phrase that is accurate but less filing-native. That matters when a sentence is going into an article as evidence.

This is why GPT-5.5 stays in the workflow.

Not everywhere. Just where the language needs to be precise. That’s precisely where I would have spent favor tokens on a human translator when I was working national security mission.

Thinking mode is not a religion

Thinking mode is a knob. Use it where it buys accuracy.

How to read it: direct vs thinking for Qwen3.6-35B-A3B bf16 and Z.ai GLM-4.7-Flash bf16. Left: extraction accuracy. Middle: translation chrF++ against human gold. Right: output tokens per item on a log scale. What to see: thinking helps extraction, barely moves translation, and costs roughly 25x the tokens.

The direct/thinking trade-off in numbers:

Qwen3.6-35B-A3B bf16 direct: extraction 83.6%; translation chrF++ 62.9; 96 output tokens per item; 2.5 s mean latency.

Qwen3.6-35B-A3B bf16 thinking: extraction 86.5%; translation chrF++ 63.6; 2,606 output tokens per item; 63.3 s mean latency.

GLM-4.7-Flash bf16 direct: extraction 61.5%; translation chrF++ 62.0; 85 output tokens per item; 2.7 s mean latency.

GLM-4.7-Flash bf16 thinking: extraction 84.4%; translation chrF++ 61.8; 2,167 output tokens per item; 134.0 s mean latency.

For translation, thinking is mostly wasted latency.

For extraction, thinking can be decisive. GLM-4.7-Flash bf16 direct is bad at extraction; GLM-4.7-Flash bf16 thinking is competitive. Qwen3.6-35B-A3B bf16 direct looks acceptable on broad extraction, then fails the unit stress test. Qwen3.6-35B-A3B bf16 thinking fixes the exact problem.

So the rule is simple: Translate direct. Extract with thinking when units matter.

How each model fits the workflow

The leaderboard tells you who won. The failure modes tell you what to do with the model.

DeepSeek-V4-Flash base mixed fp4/fp8, 284B-class MoE: strongest all-around benchmark result; ties GPT-5.5 on human-gold chrF++ and extraction; matches GPT-5.5 on the 77-target unit stress test at 94.8%; zero hallucination in the judge run. Do not make it the default yet unless you are deliberately doing serving-stack research or the stack stabelizes.

Qwen3.6-35B-A3B bf16 direct, MoE 35B / 3B active: good translation; unsafe on unit-heavy extraction. Use it as the default bulk translator.

Qwen3.6-35B-A3B bf16 thinking, MoE 35B / 3B active: strong extraction; 0.0% hallucination in the judge run; slow. Use it as the default structured extractor.

Google Gemma 4 31B IT bf16, 32.7B checkpoint parameters: strong all-around; 96.1% on the unit stress test. Use it as the best fallback and possible direct-mode extractor.

NVIDIA Nemotron-3 Super 120B-A12B-NVFP4, MoE 120B / 12B active: high chrF++; literal/verbose prose. Use it for analyst drafts; it is weaker as publication prose.

NVIDIA MiniMax-M2.7-NVFP4, MoE about 116B checkpoint parameters: strong extraction; slow and two-GPU. Useful if already serving, but not the default.

OpenAI gpt-oss-120B MXFP4, MoE 120B, 128 experts / top-4: solid but not the best at anything here. Viable backup.

Z.ai GLM-4.7-Flash bf16 direct, MoE-lite 64 experts / top-4: very weak extraction. Avoid for extraction.

Z.ai GLM-4.7-Flash bf16 thinking, MoE-lite 64 experts / top-4: extraction rescued, but very slow. Use only if already committed to GLM.

Mistral Small 4 119B 2603 NVFP4: too many wrong values and scale errors. Avoid for structured data.

GPT-5.5: best translator, strong extractor. Use for final audit.

How I would keep this from lying to me

This workflow is not “let a local model write the research.”

It is a controlled pipeline:

Translate broadly. Use Qwen3.6-35B-A3B bf16 direct for the first pass over Chinese filings.
Extract with schemas. Use Qwen3.6-35B-A3B bf16 thinking, Google Gemma 4 31B IT bf16, or NVIDIA Nemotron-3 Super 120B-A12B-NVFP4 for structured JSON extraction.
Validate units deterministically. Any value in 亿元, 万元, 千元, 元, or % gets checked before it enters a claim ledger.
Detect scale errors. Values off by powers of ten get flagged automatically.
Audit only what matters. GPT-5.5 checks the passages that carry the argument.
Use human translation where it is valuable. If a passage is strategically or geopolitically important and nuance matters, that is where scarce human review belongs.

The point is not to remove human judgment. The point is to spend human judgment where it matters.

The cost answer

The cost comparison uses a pricing snapshot: GPT-5.5 at $5 per 1M input tokens and $30 per 1M output tokens, local electricity at roughly $0.15/kWh on one RTX PRO 6000-class GPU (which is a sunk cost). Token prices change. Recalculate this based on your power costs and today’s token costs if you’re making decisions based on it.

The exact dollars are less important than the order of magnitude:

This benchmark, 619 excerpts: frontier API proxy $6.75; local electricity proxy $0.02; about 340x difference.

10,000 excerpts: frontier API proxy about $109; local electricity proxy about $0.31; about 350x difference.

100,000 excerpts: frontier API proxy about $1,090; local electricity proxy about $3.12; about 350x difference.

Do not read that as a cloud economics paper. Hardware is not free. Engineering time is not free. Local serving breaks in annoying ways.

DeepSeek is the warning label on that sentence. I am glad I got the score. I am not pretending a model is cheap to operate when it only runs on a pile of infrastructure exceptions (today).

Read it as the workflow answer.

At frontier-token prices, you ask “which excerpts deserve translation?”

With a local model, you ask “what did I miss?” and “did I remember to turn the air conditioner on in my office before it gets uncomfortably warm?”

“What did I miss?” is the better research question.

There is also a control benefit. Running locally lets me process large working corpora, intermediate notes, and sensitive leads without sending everything to an API. Cost is the obvious reason to do this. Control is the quieter reason for those projects I choose not to publish.

Reproduce the practical workflow

Use Claude or Codex or your coding agent of choice:

> Download the largest quant of Qwen 3.6 that will fit on my GPU. Download and configure the latest version of vLLM to serve up Qwen 3.6. Test it. Then write and install a user-level skill to run and use Qwen for Chinese translation and data extraction based on the blog post at https://www.mfrantzen.com/p/the-translation-bill-for-bulk-chinese

Notice the wording: I would ask the agent to stand up Qwen, not DeepSeek. DeepSeek is the most interesting benchmark result. Qwen is the model I would currently put into a repeatable translation workflow.

The two prompts for translation and data extraction deliberately boring.

Translation: expert Chinese-to-English translator for technical and financial disclosure; preserve every term, number, unit, and named entity; do not omit, summarize, or add; output only English.

Extraction: meticulous bilingual financial-filing extractor; extract only values explicitly present; use null when absent; fields ending _yi are in 亿元; convert 千元, 万元, 元, and percentages explicitly; return only JSON.

The boring prompt is part of the result. Fancy prompting was not the point. Unit discipline was.

Caveats that matter

First, the translation judge is GPT-5.5. That creates a possible family bias toward GPT-style translations. I mitigated that with model-independent metrics, human-curated references already present in the source projects, numeric extraction targets, and bootstrap confidence intervals.

Second, I did not ask a human translator to grade this benchmark. That was deliberate. Skilled Chinese translators are valuable. I would not spend a favor token on a translator validating a model bake-off unless the review itself had direct strategic or geopolitical value.

Third, many references are GPT outputs. Track A includes 105 GPT-4.1-referenced pairs and 75 human-curated pairs. That is why the human-gold subset is the primary translation metric. Self-agreement is not quality.

Fourth, extraction accuracy scores only reconstructable targets: 1,534 out of 2,405 field targets in the main Track B set. That is the right choice because the model should not be punished for failing to recover a value absent from the excerpt. But it is not full end-to-end recall.

Fifth, this is a deterministic engineering benchmark, not a translation shared task. The corpus is real. The results are useful. They are not universal.

Sixth, the results are about the hardware I have: 2x RTX PRO 6000 Max-Q Blackwell, vLLM, quantized deployments where needed. DeepSeek-V4-Flash did run, but only through the base mixed fp4/fp8 checkpoint in a mess of kludges. I count its benchmark result. I do not count it as my operational default (yet).

Bottom line

Local open-weight models are good enough now for the technical Chinese filing work I was paying frontier GPT for.

DeepSeek shows how high the ceiling has moved. Qwen shows what I’m actually using right now.

It’s good enough to change my workflow to look at some much larger problems than I thought could be addressed.

Mike Frantzen

Discussion about this post

Ready for more?