I was thinking about “The Boy Who Wasn’t Unalived” recently. It’s a great project! They’re rewriting the Harry Potter books with Gen Z slang.

But there’s no way they’re going to complete it: only a few chapters are done, scattered across three different books. In fact, the translation of Goblet of Fire (or, “Harry Potter and the Chug Jug of Fiya”) consists of a single line:

“WHY IS YOUR GAMER TAG IN THE CHUG JUG, HARRY?” Dumbledore asked, not pressed at all.

Look, this is brilliant.

The only way this could see completion is by training a style-transfer model and automating the translation. This is a sequence learning task, hence the title. T5 or GPT-2 could do the trick.

The hard part is structuring the texts so that equivalent lines are next to one another.

Here’s my process for curating the dataset and my results from training a model.

Preprocessing

We’re going to use the first chapter of the original book here and the Gen Z translation here.

Although we’ll align most of these sentences by hand, we can make our lives easier by making the newline characters more consistent. Ideally, we’ll have one sentence per line. Here’s my approach (steps 1-4 are rolled into a single sketch after the list):

  1. Replace all unicode characters with their closest ASCII equivalent (see: unidecode)
  2. Replace all mid-sentence line breaks (text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)) and collapse the doubled spaces that leaves behind (text = text.replace("  ", " "))
  3. Replace multiple newlines with a single newline (text = re.sub(r"\n+", "\n", text))
  4. Segment sentences using spaCy (nlp = spacy.load("en_core_web_sm"); sents = [sent.text for sent in nlp(txt).sents])
  5. Export both texts as columns in a CSV
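Putting steps 1-4 together, a minimal sketch (the input filenames are my own placeholders; load the two chapter texts from wherever you saved them):

import re

import spacy
from unidecode import unidecode

nlp = spacy.load("en_core_web_sm")

def text_to_sentences(text):
    text = unidecode(text)                        # 1. fold unicode to ASCII
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)  # 2. join mid-sentence line breaks
    text = text.replace("  ", " ")                #    ...and collapse doubled spaces
    text = re.sub(r"\n+", "\n", text)             # 3. squash repeated newlines
    return [sent.text.strip() for sent in nlp(text).sents]  # 4. one sentence per item

sc_jkr = text_to_sentences(open("sorcerers_stone_ch1.txt").read())
sc_gen_z = text_to_sentences(open("gen_z_ch1.txt").read())

Step 5 pads the shorter sentence list so the columns line up, then writes them out side by side: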

import pandas as pd

if len(sc_jkr) < len(sc_gen_z):
    sc_jkr = sc_jkr + [""] * (len(sc_gen_z) - len(sc_jkr))
elif len(sc_gen_z) < len(sc_jkr):
    sc_gen_z = sc_gen_z + [""] * (len(sc_jkr) - len(sc_gen_z))

# index=False so a spurious index column doesn't come back when we re-read the CSV
pd.DataFrame().assign(sc_jkr=sc_jkr, sc_gen_z=sc_gen_z) \
    .to_csv("parallel_sorcerers_stone.csv", index=False)

Manual curation

At this point, the low-hanging fruit is done. But, looking at the text, there isn’t a one-to-one correspondence between equivalent lines in the dataset. We need to do this bit manually. Luckily, spreadsheet software works pretty well for this!

The trick is to find bits of text that are clearly translations of one another, add empty rows above and below them so that those bits line up, and work backwards. The whole process takes about 30 minutes for the first chapter.

Conversion to a reasonable format

Pull the aligned spreadsheet back into Python and convert it to JSONL:

sorcerers_stone_parallel_df = pd.read_csv(
    "parallel_sorcerers_stone.csv")
sorcerers_stone_parallel_df.to_json(
    "parallel_sorcerers_stone.jsonl", orient="records", lines=True)

It’s a bad idea to keep this in CSV: these texts are full of commas, and stray commas confuse some parsers. I like JSONL because it’s good for streaming and the types are unambiguous.
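For concreteness, each line of the JSONL file is one aligned pair (this record reappears in the tokenizer check below):

{"sc_jkr": "He eyed them angrily as he passed.", "sc_gen_z": "He took a beeg look at them with unhappy as he passed."}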

Here are my results.

Training a model

Here are some notes on training the style-transfer model.

As a final quality check, I removed any sentence pair that did not change under translation. For instance, “Dumbledore nodded glumly” is identical in both texts. If I added pairs like that to the training data, I’d be concerned that the model would learn to leave the text untouched. So, I removed any pair where the Levenshtein distance between the two sentences was 5 or less.

!pip install --quiet python-Levenshtein
from Levenshtein import distance as lev

# Case-insensitive edit distance between each aligned pair.
l = df.apply(
    lambda row: lev(row.sc_gen_z.lower(), row.sc_jkr.lower()),
    axis=1,
)
df = df[l > 5]  # keep only pairs that actually differ

For modeling, I started with T5, because I knew it was a transformer model used for summarization and machine translation. That means the decoder understands English and decodes causally in English. (This wouldn’t be the case in an English-to-French model, where the encoder reads English and the decoder decodes in French.)
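For reference, a T5 fine-tune frames each pair as encoder input and decoder target, roughly like this (the task prefix is my own invention; T5 just needs something consistent):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

t5_tokenizer = AutoTokenizer.from_pretrained("t5-base")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

def t5_preprocess(example):
    # Encoder input: the original sentence, behind a task prefix.
    inputs = t5_tokenizer(
        "translate JKR to Gen Z: " + example["sc_jkr"],
        truncation=True,
    )
    # Decoder target: the Gen Z translation.
    inputs["labels"] = t5_tokenizer(
        example["sc_gen_z"], truncation=True
    )["input_ids"]
    return inputs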

However, despite my quality check, the model only ever decoded by copying the input. For instance, I always got an output of “Dumbledore nodded glumly” for the input “Dumbledore nodded glumly”.

So, I moved on to GPT-2-large. That’s the biggest model that fits into GPU memory on Colab Pro without quantization. The trick is to mark where the input ends and the output begins with separator tokens.

from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL = "gpt2-large"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# GPT2's tokenizer has no pad token; reuse EOS so batches can be padded.
tokenizer.pad_token = tokenizer.eos_token

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,  # 👈 causal, not de-noising
)

# Plain separator strings: the BPE tokenizer splits them into subwords,
# and we recover them with a string split at generation time.
SEP, END = "<|sep|>", "<|endofsent|>"

def preprocess_function(example):
    # One training example: original <|sep|> translation <|endofsent|>
    text = example["sc_jkr"] + SEP \
         + example["sc_gen_z"] + END
    return tokenizer(text, truncation=True)

dataset = (
    Dataset.from_pandas(df)
    .map(preprocess_function)
    .shuffle(seed=42)
    .train_test_split(test_size=0.05)
)

tokenizer.decode(dataset["train"][0]["input_ids"])
# 'He eyed them angrily as he passed.<|sep|>He took a beeg look at them with unhappy as he passed.<|endofsent|>'

That does it for structuring the dataset.

Now, to train it:

training_args = TrainingArguments(
    learning_rate=5e-5,  # 5e-05 is the default
    run_name="genz-harry-potter",
    output_dir="./results",
    save_total_limit=3,
    num_train_epochs=10.0,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()
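Before looking at outputs in bulk, a quick smoke test (a sketch: greedy decoding, with the same prompt format used for training):

prompt = "He eyed them angrily as he passed." + SEP
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    pad_token_id=tokenizer.eos_token_id,  # silence the missing-pad warning
)
decoded = tokenizer.decode(output_ids[0])
print(decoded.split(SEP)[1].split(END)[0])  # keep just the translation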

This time, the results are much more promising. Here are the best, most cherry-picked outputs.

JKR: If the motorcycle was huge, it was nothing to the man sitting astride it.
GenZ: However, when the motorcycle was tiny, the man sitting at the top of the selfie totem pole had everything you’d want in a bikinis man: high-tech specs, hyped up motorcycle, and lots of crotch selfie swag.

JKR: His blue eyes were light, bright, and sparkling behind half-moon spectacles and his nose was very long and crooked, as though it had been broken at least twice.
GenZ: His bae’s eyes were penetrating, bright, and starry behind his specs and his bae’s nose was long and sassy, like a funky bee.

JKR: He looked simply too big to be allowed, and so wild - long tangles of bushy black hair and beard hid most of his face, he had hands the size of trash can lids, and his feet in their leather boots were like baby dolphins.
GenZ: He looked like he was killed in the line of duty (a big dick) and had some major BDE (big dick energy).

Next steps

This is a good first step. At my current pace, I could align everything in 2-3 hours. More data would improve the model, but it’s already pretty good! For example, it knows which characters are which: it consistently translates “Professor McGonagall” into “Prof McG”. It also comes up with some funny and ridiculous translations.

The best move for translating the whole first book is to create a tool for curating outputs. I’d segment the entire book into sentences using spaCy and run the model against each sentence. Then, I’d display those outputs as options in a Streamlit app.

Something like:

import streamlit as st
from transformers import pipeline
import spacy


@st.cache
def get_sorcerers_stone():
    nlp = spacy.load("en_core_web_sm")
    with open("sorcerers_stone.txt") as f:
        doc = nlp(f.read())
    return [sent.text for sent in doc.sents]

if "sentences_translated" not in st.session_state:
    st.session_state["sentences_translated"] = []

next_line = get_sorcerers_stone()[len(st.session_state["sentences_translated"])]

pipe = pipeline("text-generation", "gen_z_translator", return_full_text=False)

# Sample ten candidates; each element of the batched output is a
# one-item list of generations for the corresponding prompt.
translations = [
    out[0]["generated_text"].split("<|endofsent|>")[0]
    for out
    in pipe([next_line + "<|sep|>"] * 10, batch_size=10, do_sample=True)
    if "<|endofsent|>" in out[0]["generated_text"]
]

option = st.selectbox(
    "Which translation should we use?",
    translations
)

# Only advance to the next sentence once the user confirms a choice.
if st.button("Accept"):
    st.session_state["sentences_translated"].append(option)

with open("output.txt", "w") as f:
    for translation in st.session_state["sentences_translated"]:
        f.write(translation + "\n")

If you had ten outputs per sentence, at least one of those outputs should work. You could even train the model further based on those choices, which would be easier than aligning the sentences by hand.
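Persisting each accepted pair in the training format would make that loop cheap. A sketch (the filename is my own placeholder; next_line and option come from the app above):

import json

# Append the accepted pair as a new JSONL record for future fine-tuning.
with open("curated_pairs.jsonl", "a") as f:
    f.write(json.dumps({"sc_jkr": next_line, "sc_gen_z": option}) + "\n")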

It’s amazing, the tools we have as machine learning engineers. I freaking love the future 🚀