Backtranslation for Data Augmentation in NLP
Expand your ML training data with backtranslation. Learn how to augment text datasets while preserving meaning and label accuracy.
Notes from the paper “Backtranslation and paraphrasing in the LLM era? Comparing data augmentation methods for emotion classification.”
Problem
Machine learning models need large amounts of training data to perform well. When data is scarce, models tend to overfit: they memorize specific patterns in the training set instead of learning to generalize. The challenge is to augment the data while maintaining its fidelity.
Fidelity is the degree to which something is accurately represented. In the context of data augmentation, it means preserving the integrity and meaning of the original data while introducing variations.
Solution
Backtranslation is a data augmentation technique for expanding limited training data while maintaining its fidelity. It works by translating existing examples into another language and then translating them back into the original language. The round trip introduces variations in phrasing while preserving the original meaning.
For example, consider this English sentence that goes through backtranslation via Arabic:
“The quick brown fox jumps over the lazy dog.”
“الثعلب البني السريع يقفز فوق الكلب الكسول.”
“The fast brown fox leaps over the lazy dog.”
The backtranslated sentence retains the original meaning but uses different wording (“quick” → “fast”, “jumps” → “leaps”), which increases the diversity of the dataset.
Code Example
Let’s implement backtranslation using TypeScript and the DeepAgents framework.
First, set up the imports and define the input text we want to augment:
import { agent, generate, user } from "@deepagents/agent";
import { groq } from "@ai-sdk/groq";
import z from "zod";
const input = `The customer service was excellent and the staff were very helpful.`;
Step 1: Translate to Arabic
Create an agent that translates English text to Arabic:
const toArabicAgent = agent({
model: groq("gpt-oss-20b"),
output: z.object({
translation: z.string().describe("Arabic translation of the input"),
}),
prompt: `Translate the following English text to Arabic.`,
});
const { experimental_output: arabic } = await generate(toArabicAgent, [
user(input),
]);
// Result: "كانت خدمة العملاء ممتازة وكان الموظفون متعاونين للغاية."Step 2: Translate back to English — Create another agent that translates the Arabic text back to English:
const toEnglishAgent = agent({
model: groq("gpt-oss-20b"),
output: z.object({
translation: z.string().describe("English translation of the input"),
}),
prompt: `Translate the following Arabic text to English.`,
});
const { experimental_output: backTranslated } = await generate(toEnglishAgent, [
user(arabic.translation),
]);
// Result: "The customer service was outstanding and the employees were very cooperative."Result: Notice how “excellent” became “outstanding”, “staff” became “employees”, and “helpful” became “cooperative” — natural variations that preserve meaning while creating diversity in the dataset.
You can verify fidelity using semantic similarity. Cosine similarity measures the cosine of the angle between two embedding vectors; a value close to 1 means the texts have similar meaning:
import { cosineSimilarity, embedMany } from "ai";
import { openai } from "@ai-sdk/openai";
const { embeddings } = await embedMany({
model: openai.embedding("text-embedding-3-small"),
values: [input, backTranslated.translation],
});
const similarity = cosineSimilarity(embeddings[0], embeddings[1]);
// Result: 0.94 (high similarity = meaning preserved)
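In practice you can turn this check into a filter: keep a backtranslated example only if its similarity to the source text stays above a threshold. The sketch below continues from the variables computed above; the 0.85 cutoff and the isFaithful name are illustrative assumptions, not values from the paper.
// Keep the augmented example only if it stays close in meaning to the source.
// The 0.85 threshold is an illustrative assumption; tune it on your own data.
const SIMILARITY_THRESHOLD = 0.85;
const isFaithful = similarity >= SIMILARITY_THRESHOLD;
if (isFaithful) {
  // Add the backtranslated text to the training set with the original label.
  console.log("Keeping augmented example:", backTranslated.translation);
}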
Key Takeaways
- Enhances dataset diversity - Creates natural variations of your training data
- Improves model generalization - Models learn to understand different phrasings of the same concept
- Reduces overfitting - More diverse data prevents models from memorizing specific patterns
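To apply this to a whole labeled dataset (for example, the emotion classification setting from the paper), you can wrap the two agents in a helper and reuse each original label for its backtranslated copy. This is a minimal sketch that assumes the toArabicAgent and toEnglishAgent defined above; the LabeledExample shape and function names are illustrative, not part of the DeepAgents API.
import { generate, user } from "@deepagents/agent";

type LabeledExample = { text: string; label: string };

// Backtranslate a single text: English -> Arabic -> English,
// reusing the agents defined in the steps above.
async function backtranslate(text: string): Promise<string> {
  const { experimental_output: arabic } = await generate(toArabicAgent, [user(text)]);
  const { experimental_output: english } = await generate(toEnglishAgent, [
    user(arabic.translation),
  ]);
  return english.translation;
}

// Double the dataset: each original example gets a backtranslated copy
// that keeps the original label.
async function augment(dataset: LabeledExample[]): Promise<LabeledExample[]> {
  const augmented: LabeledExample[] = [];
  for (const example of dataset) {
    augmented.push(example);
    augmented.push({
      text: await backtranslate(example.text),
      label: example.label,
    });
  }
  return augmented;
}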