You will find the 🇫🇷 French version of this article here
Reading and writing have never been much of a problem for me, but when it comes to speaking, I realize my pronunciation is not always clear.
ChatGPT is great for text, but current generative models are not designed to evaluate pronunciation.
So I asked myself: what if I built my own pronunciation coach?
A tool that:
- listens to a recording of my voice,
- compares it to a reference pronunciation,
- and shows me which sounds to correct.

That’s the project I’ll detail here, while also breaking down the AI building blocks: embeddings, distances, DTW, phonemes, and visemes.
A spoken word is not a sequence of letters, but a sound wave that varies over time:
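You can visualize this with a few lines of Python; a minimal sketch, where the file names are placeholders for two recordings of the same word:

import librosa
import matplotlib.pyplot as plt

# Placeholder files: two recordings of the same word, resampled to 16 kHz
for path in ["reference.wav", "attempt.wav"]:
    waveform, _ = librosa.load(path, sr=16000)
    plt.plot(waveform, alpha=0.6, label=path)
plt.xlabel("sample")
plt.ylabel("amplitude")
plt.legend()
plt.show()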
Even two people pronouncing exactly the same word will never have identical curves.
The central question becomes: how do you compare two audios that do not align perfectly?
Here’s an overview of the architecture I set up: record the audio, turn it into Wav2Vec2 embeddings, align my attempt with the reference using DTW, score the distance, and show visemes for the sounds to fix.
Don’t worry, we’ll break down all these concepts step by step.
A computer does not understand sound. It only manipulates vectors of numbers.
A vector is just a list of numbers, like [0.2, -0.7, 1.1]. With Wav2Vec2, every few milliseconds of audio are encoded into 768 numbers describing timbre, energy, articulation, and so on.
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-large-960h")

def extract_embeddings(audio_waveform, sampling_rate=16000):
    # Normalize the waveform and pack it into a (1, num_samples) tensor
    inputs = processor(audio_waveform, sampling_rate=sampling_rate,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        # One 768-dimensional vector per ~20 ms frame of audio
        features = model(inputs.input_values).last_hidden_state
    return features.squeeze(0).numpy()
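Here is a usage sketch, assuming a 16 kHz recording saved as hello.wav (a placeholder name):

import librosa

# Placeholder file; Wav2Vec2 expects 16 kHz mono audio
waveform, sr = librosa.load("hello.wav", sr=16000)
embeddings = extract_embeddings(waveform, sampling_rate=sr)
print(embeddings.shape)  # (num_frames, 768): one vector per ~20 ms of audio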
In short: audio goes in, a sequence of 768-dimensional vectors comes out.
Once two vectors are extracted, you need to measure their proximity.
The basic tool: Euclidean distance.
A simple example between [1, 2] and [4, 6]:
√((4-1)² + (6-2)²) = √(9 + 16) = √25 = 5
With audio embeddings, it’s the same principle, but in a 768-dimensional space.
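As a quick sanity check, the same computation in NumPy works unchanged whatever the number of dimensions:

import numpy as np

a = np.array([1, 2])
b = np.array([4, 6])
print(np.linalg.norm(a - b))  # 5.0, and the formula is identical in 768 dimensions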
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

def compare_pronunciation(expected, actual):
    # Turn each recording into a sequence of embedding vectors
    # (get_phoneme_embeddings wraps the extraction shown above)
    expected_seq = get_phoneme_embeddings(expected)
    actual_seq = get_phoneme_embeddings(actual)
    # Sum of frame-by-frame Euclidean distances along the best DTW alignment
    distance, _ = fastdtw(expected_seq, actual_seq, dist=euclidean)
    return distance
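In use, it boils down to one call (the two waveforms here are placeholders for a reference recording and my own attempt):

score = compare_pronunciation(reference_waveform, my_waveform)
print(score)  # the lower the distance, the closer my pronunciation is to the reference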
Problem: a word can last 0.5 seconds for me and 0.8 seconds in the reference.
If you compare them frame by frame, nothing lines up: frame 10 of one recording does not correspond to frame 10 of the other, and the naive comparison fails.
The solution: Dynamic Time Warping (DTW).
This algorithm aligns two sequences of different speeds by “stretching” or “compressing” time to match similar parts.
import numpy as np
from fastdtw import fastdtw
from scipy.spatial.distance import euclidean

def align_sequences_dtw(seq1, seq2):
    # path is the list of (i, j) index pairs that best align the two sequences
    distance, path = fastdtw(seq1, seq2, dist=euclidean)
    aligned1, aligned2 = [], []
    for i, j in path:
        # Keep the first component of each frame (e.g. to plot the aligned curves)
        aligned1.append(seq1[i][0])
        aligned2.append(seq2[j][0])
    return np.array(aligned1), np.array(aligned2)
Result: a robust comparison, even when pacing differs.
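A toy example makes this concrete: the second sequence below is just a slowed-down take of the first (the values are arbitrary), yet after warping both curves have the same length and can be compared point by point.

# Toy 1-D "curves": the second is a slowed-down take of the first
quick = np.array([[1.0], [2.0], [3.0], [2.0], [1.0]])
slow = np.array([[1.0], [1.0], [2.0], [2.0], [3.0], [3.0], [2.0], [1.0]])

a, b = align_sequences_dtw(quick, slow)
print(len(a) == len(b))  # True: equal lengths once aligned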
Some sounds are hard to distinguish by ear.
Example: “think” vs “sink” (subtle difference between /θ/ and /s/).
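One way to check such pairs in text is a grapheme-to-phoneme tool. A sketch with the phonemizer library (one option among others; it requires the espeak-ng backend installed):

from phonemizer import phonemize

# Requires the espeak-ng backend installed on the system
for word in ["think", "sink"]:
    print(word, "->", phonemize(word, language="en-us", backend="espeak"))
# Typical output: think -> θɪŋk, sink -> sɪŋk (only the first phoneme differs)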
In my prototype, when a word is mispronounced, I can click on it and see a mouth animation showing the correct articulation.
👉 Learning becomes more concrete: I both hear and see what to correct.
(Microsoft documentation on visemes)
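For reference, the Azure Speech SDK exposes these viseme IDs as synthesis events; a minimal sketch, assuming a valid subscription key and region:

import azure.cognitiveservices.speech as speechsdk

# Placeholders: use your own Azure subscription key and region
speech_config = speechsdk.SpeechConfig(subscription="SPEECH_KEY", region="SPEECH_REGION")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

# Each event carries a viseme ID (a mouth shape) and its audio offset (100 ns ticks)
synthesizer.viseme_received.connect(
    lambda evt: print(f"viseme {evt.viseme_id} at {evt.audio_offset / 10_000:.0f} ms"))
synthesizer.speak_text_async("think").get()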
Let’s be honest:
I haven’t reinvented Duolingo. But I have built a home-made coach that helps me improve my speaking.
The strength comes from the combination of embeddings, DTW alignment, and viseme animations.
A mix of machine learning, math, and pedagogy, all serving a very concrete goal: speaking English better.
👉 Source code is available on GitHub.
Feedback or contributions are welcome.