Tags: javascript tensorflowjs ml jekyll dark-mode
This post explains the idea conceptually and technically. The implementation is inspired by Charlie Gerard's dark-mode clap extension, which uses TensorFlow.js and a Teachable Machine audio model to detect claps.
I already had dark mode on this site using DarkReader. The manual flow was simple:
user clicks dark mode button
-> JavaScript calls DarkReader.enable() or DarkReader.disable()
-> dark-mode state is saved in localStorage
The new goal was:
user clicks clap listener button
-> browser asks for microphone permission
-> browser listens to microphone audio
-> TensorFlow.js model predicts whether the sound is a clap
-> if prediction is "Clap", call the same dark-mode toggle function
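There are two important design choices: the clap listener reuses the existing dark-mode toggle instead of implementing its own, and the model runs in the browser rather than on a server.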
That keeps the feature small. The ML part only answers one question: "Was that sound a clap?"
One tempting idea is to send microphone audio to a serverless function and run a model there. That is unnecessary for this feature.
A clap toggle needs low latency. If every short audio window had to travel to a server and back, the interaction would feel slow. It would also create an avoidable privacy problem: users do not expect a personal blog to stream microphone audio to a backend.
The better architecture is:
microphone audio
-> browser audio APIs
-> TensorFlow.js model in browser
-> local prediction
-> local UI action
The browser downloads the model files, but the actual inference happens locally.
TensorFlow.js can do many things, but in this feature it is only doing inference.
Training and inference are different phases:
training:
many labeled audio examples
-> model learns patterns
-> model files are saved
inference:
new microphone audio
-> load saved model
-> output probabilities for known labels
We are not training a model every time the page loads. That would be slow and unnecessary. We are loading a model that was trained earlier.
Conceptually, the model receives a transformed representation of sound and returns scores:
{
  "Background Noise": 0.03,
  "Clap": 0.91,
  "Unknown": 0.06
}
Then the app logic is ordinary JavaScript:
if (prediction.label === "Clap" && prediction.score >= 0.75) {
  window.toggleDarkMode();
}
The machine learning model does not know what "dark mode" is. It only classifies audio. The web app decides what to do with the classification.
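To connect the score object above to the `prediction.label` check, here is a minimal sketch. The helper name `decideAction` and its return shape are my own illustration, not part of TensorFlow.js or the site code.

```javascript
// Hypothetical helper: picks the top label from a { label: score } object
// and decides whether the app should toggle dark mode.
function decideAction(scoresByLabel, threshold = 0.75) {
  let best = { label: null, score: -Infinity };
  for (const [label, score] of Object.entries(scoresByLabel)) {
    if (score > best.score) {
      best = { label, score };
    }
  }
  // The model only classifies; this line is where the app assigns meaning.
  const shouldToggle = best.label === "Clap" && best.score >= threshold;
  return { label: best.label, score: best.score, shouldToggle };
}

const decision = decideAction({
  "Background Noise": 0.03,
  "Clap": 0.91,
  "Unknown": 0.06
});
// With these example scores, decision.shouldToggle is true,
// so the app would go on to call window.toggleDarkMode().
```

Keeping the decision in a pure function like this makes the threshold easy to test without a microphone.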
The inspiration project uses this model URL:
const modelUrl = "https://teachablemachine.withgoogle.com/models/GWAYbcqlE/";
This is not an API endpoint where we upload audio and get a result back. It is a folder containing model assets exported by Google's Teachable Machine.
A TensorFlow.js audio model normally has files like:
model.json
metadata.json
weights.bin
The JavaScript code builds URLs from the base path:
const checkpointUrl = modelUrl + "model.json";
const metadataUrl = modelUrl + "metadata.json";
Then TensorFlow.js downloads those files into the browser and runs the model locally.
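Plain string concatenation only works when the base path ends with a slash. As a small defensive sketch (the helper name `buildModelUrls` is hypothetical, not something the inspiration project defines), the two URLs could be built like this:

```javascript
// Hypothetical helper: builds the two asset URLs from a model folder URL,
// tolerating a missing trailing slash.
function buildModelUrls(baseUrl) {
  const base = baseUrl.endsWith("/") ? baseUrl : baseUrl + "/";
  return {
    checkpointUrl: base + "model.json",
    metadataUrl: base + "metadata.json"
  };
}

const urls = buildModelUrls("https://teachablemachine.withgoogle.com/models/GWAYbcqlE");
// urls.checkpointUrl ends with "/GWAYbcqlE/model.json"
// urls.metadataUrl ends with "/GWAYbcqlE/metadata.json"
```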
So the reason to use https://teachablemachine.withgoogle.com/models/GWAYbcqlE/ was practical: it already contains a pretrained clap detector. That lets us implement and understand the full application flow before training our own model.
This is a good learning sequence: first get the full flow working with a pretrained model, then train a custom one.
If the model disappears or its labels change, the feature can break. For a serious version, I would train my own model and host the exported model files inside this site.
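One cheap guard against a changed model is to check its labels before listening. This is a sketch under my own assumptions: `findMissingLabels` is a hypothetical helper, and in the real page the label list would come from the loaded recognizer rather than a hard-coded array.

```javascript
// Hypothetical helper: returns the required labels the model does not provide.
function findMissingLabels(modelLabels, requiredLabels) {
  return requiredLabels.filter((label) => !modelLabels.includes(label));
}

// Example label list; on the page this would be read from the loaded model.
const modelLabels = ["Background Noise", "Clap", "Unknown"];
const missing = findMissingLabels(modelLabels, ["Clap"]);

if (missing.length > 0) {
  // Fail visibly instead of listening forever for a label that never appears.
  console.warn("Clap listener disabled; model is missing labels:", missing);
}
```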
A browser does not pass "sound" to the model as a human concept. It passes numbers.
The rough pipeline is:
microphone waveform
-> short audio frames
-> frequency features
-> neural network
-> class probabilities
Raw microphone audio is a waveform: amplitude over time. A clap has a sharp transient: a sudden burst of energy. But robust detection is harder than checking volume, because a dropped object or cough can also create a spike.
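To see why volume alone is not enough, here is a naive loudness detector. It is a sketch only: the frames are synthetic numbers I made up, not real recordings, and the threshold is arbitrary.

```javascript
// Root-mean-square level of one audio frame (samples in the range -1..1).
function rms(frame) {
  let sumSquares = 0;
  for (const sample of frame) {
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Naive detector: anything loud counts as a clap.
function isLoud(frame, threshold = 0.3) {
  return rms(frame) > threshold;
}

// Synthetic frames for illustration only.
const quiet = new Float32Array(256).fill(0.01);
const clapLike = Float32Array.from({ length: 256 }, (_, i) => (i < 64 ? 0.9 : 0.05));
const thudLike = Float32Array.from({ length: 256 }, (_, i) => 0.6 * Math.sin(i / 8));

const verdicts = {
  quiet: isLoud(quiet),
  clapLike: isLoud(clapLike),
  thudLike: isLoud(thudLike)
};
// Both loud frames pass the check, so loudness cannot tell them apart.
```

A classifier trained on labeled examples has a chance of separating the two loud cases; a volume threshold does not.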
Audio ML models usually transform audio into frequency-domain features. In TensorFlow.js speech commands, the recognizer can use BROWSER_FFT.
FFT means Fast Fourier Transform. It converts a short window of audio from "how loud was the signal over time?" into "which frequencies were present, and how strong were they?"
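The idea can be shown with a naive discrete Fourier transform. This is not how the speech commands library computes its features; it is a slow, illustrative version that produces the same kind of "strength per frequency bin" output as an FFT.

```javascript
// Naive DFT magnitudes, O(n^2). An FFT computes the same values much faster.
function dftMagnitudes(samples) {
  const n = samples.length;
  const magnitudes = [];
  // Only the first half of the bins is unique for real-valued input.
  for (let k = 0; k < n / 2; k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (2 * Math.PI * k * t) / n;
      re += samples[t] * Math.cos(angle);
      im -= samples[t] * Math.sin(angle);
    }
    magnitudes.push(Math.hypot(re, im));
  }
  return magnitudes;
}

// A pure tone that completes exactly 4 cycles within the 64-sample window.
const n = 64;
const tone = Array.from({ length: n }, (_, t) => Math.sin((2 * Math.PI * 4 * t) / n));

const spectrum = dftMagnitudes(tone);
const peakBin = spectrum.indexOf(Math.max(...spectrum));
// peakBin is 4: the energy lands in the bin that matches the tone's frequency.
```

In the time domain the tone is just a wiggle; in the frequency domain it is a single clear peak, which is a far easier pattern for a classifier to learn from.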
That gives the model a better representation than raw volume:
time-domain waveform
-> FFT / spectrogram-like features
-> classifier
The classifier has learned patterns from examples. During training, it saw examples labeled something like:
Clap
Background Noise
Unknown
During inference, it outputs the probability that the current audio window belongs to each label.
The page loads three important scripts:
<script defer src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.3.2/dist/tf.min.js"></script>
<script defer src="https://cdn.jsdelivr.net/npm/@tensorflow-models/speech-commands@0.5.4/dist/speech-commands.min.js"></script>
<script defer src="/js/dark_clap.js"></script>
The first script is TensorFlow.js. The second script is the speech commands helper library. The third script is my site-specific code.
The model is created like this:
const recognizer = speechCommands.create(
  "BROWSER_FFT", // use the browser's FFT for audio features
  undefined,     // vocabulary: not needed when loading a custom model
  checkpointUrl,
  metadataUrl
);
await recognizer.ensureModelLoaded();
The important pieces:
BROWSER_FFT tells the recognizer to use browser FFT audio features.
checkpointUrl points to model.json.
metadataUrl points to metadata.json.
ensureModelLoaded() downloads and initializes the model before listening.
Once the model is loaded, the recognizer starts listening:
recognizer.listen(function(result) {
  const prediction = getPrediction(labels, result.scores);
  if (prediction.label === "Clap" && prediction.score >= 0.75) {
    window.toggleDarkMode();
  }
}, {
  includeSpectrogram: false,
  probabilityThreshold: 0.75,
  invokeCallbackOnNoiseAndUnknown: true,
  overlapFactor: 0.5
});
result.scores is an array of probabilities. The labels come from the model metadata:
const labels = recognizer.wordLabels();
The code finds the highest scoring label:
function getPrediction(labels, scores) {
  let bestIndex = 0;
  for (let i = 1; i < scores.length; i++) {
    if (scores[i] > scores[bestIndex]) {
      bestIndex = i;
    }
  }
  return {
    label: labels[bestIndex],
    score: scores[bestIndex]
  };
}
Then it checks whether that label is Clap.
Audio classifiers run continuously while listening. A single clap can be detected across more than one overlapping audio window.
Without protection, one clap might toggle dark mode on and then immediately off.
So the listener uses a cooldown:
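// Shared state, declared once outside the listen callback.
const clapCooldownMs = 1200;
let lastClapAt = 0;

// Inside the listen callback, after a "Clap" prediction:
const now = Date.now();
if (now - lastClapAt > clapCooldownMs) {
  lastClapAt = now;
  window.toggleDarkMode();
}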
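The same cooldown can be packaged as a small closure so that it is testable without waiting in real time. This is a sketch: `createCooldownGate` and the injectable `now` parameter are my own additions for illustration.

```javascript
// Hypothetical helper: returns a function that answers "may this event pass?"
// and accepts at most one event per cooldown window.
function createCooldownGate(cooldownMs, now = Date.now) {
  let lastAcceptedAt = -Infinity;
  return function tryAccept() {
    const current = now();
    if (current - lastAcceptedAt > cooldownMs) {
      lastAcceptedAt = current;
      return true;
    }
    return false;
  };
}

// Demonstration with a fake clock instead of real time.
let fakeTime = 0;
const acceptClap = createCooldownGate(1200, () => fakeTime);

const accepted = [];
for (const t of [0, 300, 1200, 1201, 3000]) {
  fakeTime = t;
  accepted.push(acceptClap());
}
// accepted is [true, false, false, true, true]:
// detections at 300 ms and 1200 ms fall inside the first clap's cooldown.
```

In the listener, the callback would call `window.toggleDarkMode()` only when `acceptClap()` returns true.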
This is not an ML concept. It is interaction design. The model gives predictions; the product code still has to decide how to handle them.
The existing dark mode wrapper had internal enable(), disable(), and darkmode() functions. To let the clap listener reuse the same behavior, I exposed one public function:
window.toggleDarkMode = darkmode;
That gives this structure:
manual button click
-> window.toggleDarkMode()
clap prediction
-> window.toggleDarkMode()
Both paths update the same localStorage key and the same DarkReader state. This matters. If clap mode had its own dark-mode logic, the two controls could drift out of sync.
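A minimal sketch of that single source of truth follows. The names here are assumptions for illustration: `createThemeController` is hypothetical, the storage key `"dark-mode"` is a placeholder rather than the key the site actually uses, and the DarkReader and storage objects are passed in so the example runs with stand-ins.

```javascript
// Hypothetical controller: one toggle function owns both the DarkReader call
// and the persisted state, so every caller stays in sync.
function createThemeController({ darkReader, storage, storageKey = "dark-mode" }) {
  function isEnabled() {
    return storage.getItem(storageKey) === "true";
  }

  function toggle() {
    const next = !isEnabled();
    if (next) {
      darkReader.enable();
    } else {
      darkReader.disable();
    }
    storage.setItem(storageKey, String(next));
    return next;
  }

  return { isEnabled, toggle };
}

// Stand-ins for DarkReader and localStorage so the sketch runs anywhere.
const calls = [];
const fakeDarkReader = {
  enable: () => calls.push("enable"),
  disable: () => calls.push("disable")
};
const memory = new Map();
const fakeStorage = {
  getItem: (key) => (memory.has(key) ? memory.get(key) : null),
  setItem: (key, value) => memory.set(key, value)
};

const theme = createThemeController({ darkReader: fakeDarkReader, storage: fakeStorage });
theme.toggle(); // as if the manual button was clicked
theme.toggle(); // as if a clap was detected
// calls is ["enable", "disable"]: both paths went through the same function.
```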
The first version was technically interesting but not great UX. A clap icon beside a moon icon is ambiguous. Also, users should not be surprised by a microphone permission prompt.
The improved version uses two separate controls:
moon/sun button
-> toggles the theme
clap listener button
-> starts or stops microphone listening
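The start/stop control could be sketched as a small wrapper around the recognizer. Assumptions: `createClapListener` is a hypothetical name, and `recognizer` is a speech commands recognizer whose model has already been loaded. The wrapper relies only on `wordLabels()`, `isListening()`, `listen()`, and `stopListening()`.

```javascript
// Sketch only: wraps an already-loaded recognizer in start/stop/toggle controls.
function createClapListener(recognizer, onClap, threshold = 0.75) {
  const labels = recognizer.wordLabels();

  // Find the highest-scoring label and react only to a confident "Clap".
  function handleResult(result) {
    let best = 0;
    for (let i = 1; i < result.scores.length; i++) {
      if (result.scores[i] > result.scores[best]) {
        best = i;
      }
    }
    if (labels[best] === "Clap" && result.scores[best] >= threshold) {
      onClap();
    }
  }

  async function start() {
    if (recognizer.isListening()) return;
    await recognizer.listen(handleResult, {
      probabilityThreshold: threshold,
      invokeCallbackOnNoiseAndUnknown: true,
      overlapFactor: 0.5
    });
  }

  async function stop() {
    if (recognizer.isListening()) {
      await recognizer.stopListening();
    }
  }

  // Resolves to true if the recognizer is listening after the toggle.
  async function toggle() {
    if (recognizer.isListening()) {
      await stop();
    } else {
      await start();
    }
    return recognizer.isListening();
  }

  return { start, stop, toggle };
}
```

The clap listener button's click handler would then call `toggle()`, while the moon/sun button never touches the microphone at all.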
On first use, the site explains what is about to happen:
window.confirm(
  "This uses your microphone locally to detect claps and toggle dark mode. Audio is not sent to this site. Start listening?"
);
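If the explanation should appear only on first use, the answer has to be remembered somewhere. The sketch below is one possible approach, not necessarily what the site does: `ensureMicConsent` and the storage key `"clap-listener-consent"` are hypothetical, and the storage and confirm function are injected so the logic can run outside a browser.

```javascript
// Hypothetical helper: asks once, remembers a "yes", and asks again after a "no".
function ensureMicConsent({ storage, confirmFn, storageKey = "clap-listener-consent" }) {
  if (storage.getItem(storageKey) === "granted") {
    return true;
  }
  const accepted = confirmFn(
    "This uses your microphone locally to detect claps and toggle dark mode. " +
    "Audio is not sent to this site. Start listening?"
  );
  if (accepted) {
    storage.setItem(storageKey, "granted");
  }
  return accepted;
}

// In the browser this could be called as:
//   ensureMicConsent({ storage: window.localStorage, confirmFn: (msg) => window.confirm(msg) })
```

Note that this only covers the site's own explanation. The browser still shows its own microphone permission prompt according to its own rules.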
This is not just politeness. It clarifies the privacy boundary: the microphone is used locally in the browser, and audio is not sent to the site.
The controls are also real buttons, not links:
<button type="button"
        class="DarkReader_Button theme-control"
        aria-label="Toggle dark mode"
        aria-pressed="false">
  <span id="icon-dark" aria-hidden="true">☾</span>
</button>
<button type="button"
        class="ClapDark_Button theme-control"
        aria-label="Start clap listener for dark mode"
        aria-pressed="false">
  <span id="icon-clap-dark" aria-hidden="true">👏</span>
</button>
That gives better keyboard behavior and better assistive technology semantics.
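The `aria-pressed` attribute only helps if it follows the real state. Here is a tiny sketch of keeping it in sync; `syncPressedState` is a hypothetical helper name, and it relies only on the standard `setAttribute` method.

```javascript
// Hypothetical helper: mirrors a boolean state into aria-pressed.
function syncPressedState(button, isPressed) {
  button.setAttribute("aria-pressed", String(isPressed));
}

// In the browser this might be wired up as:
//   const clapButton = document.querySelector(".ClapDark_Button");
//   syncPressedState(clapButton, true); // after listening starts
```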
Using the pretrained model is fine for learning, but it is not ideal long term.
The next step is to train my own model:
Record audio examples labeled Clap.
Record audio examples labeled Background Noise.
Store model.json, metadata.json, and weights inside this repository.
Change modelUrl from the public Teachable Machine URL to a local path.
For example:
const modelUrl = "/models/clap-dark-mode/";
At that point the site would not depend on someone else's model staying online.
The main lesson is that TensorFlow.js lets a static website run pretrained ML models directly in the browser.
The full architecture is:
trained Teachable Machine model
-> exported as TensorFlow.js files
-> browser downloads model files
-> microphone audio becomes FFT features
-> model predicts audio label
-> app code checks label and confidence
-> app toggles dark mode
The ML model is only one component. A usable feature also needs state management, cooldowns, permission handling, accessibility, and clear UI language.
That separation is the key concept:
model responsibility:
classify sound
application responsibility:
decide what classification means for the user interface
Once that boundary is clear, replacing the model is straightforward. The rest of the site does not care whether the classifier came from Teachable Machine, a custom TensorFlow.js model, or some future audio model. It only needs a label and a confidence score.