Room Occupancy Detection on Microcontrollers with TinyML and XGBoost
- Rej Šafranko
- Jul 25
- 6 min read
TinyML enables running powerful machine learning models directly on microcontrollers, providing fast and energy-efficient inference at the edge. In this case study, we walk through selected steps in building and deploying a room occupancy detection model using IoT sensor data and an XGBoost classifier, highlighting a few key aspects of the pipeline.

Overview
In this showcase, we use a research dataset containing sensor readings that capture key environmental factors such as humidity, temperature, light intensity, CO₂ levels, and humidity ratio. The goal is to predict room occupancy, a binary variable indicating whether a room is occupied or unoccupied, based on these measurements.
This post highlights selected components of the machine learning development process:
- Cleaning real-world sensor data
- Feature selection using ANOVA
- Model hyperparameter tuning with Bayesian optimization
- Exporting the model to C code for microcontroller deployment
Cleaning real-world sensor data
To ensure high data quality, it is essential to remove outliers from the dataset, as they can negatively impact model learning and overall performance. We used the Shapiro-Wilk test to determine which features follow an approximately Gaussian distribution and which do not, so we can apply appropriate outlier removal methods for each.

- For features that follow an approximately Gaussian distribution, we apply z-score filtering, removing data points more than three standard deviations from the mean.
- For non-Gaussian features, we use the Interquartile Range (IQR) method, which is robust to skewed data and does not rely on distribution assumptions.
import numpy as np
import pandas as pd
from scipy.stats import shapiro, zscore
# Example dataframe `df` with numeric feature columns (target excluded).
# 1. Identify normal vs non-normal features using Shapiro-Wilk test.
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()
normal_features = []
non_normal_features = []
for col in numeric_cols:
    stat, p = shapiro(df[col].dropna())
    if p > 0.05:  # p > 0.05 suggests the data is approximately normal.
        normal_features.append(col)
    else:
        non_normal_features.append(col)

# 2. Z-score filtering for normal features.
if normal_features:
    df_normal = df[normal_features]
    z_scores = np.abs(zscore(df_normal))
    mask_normal = (z_scores < 3).all(axis=1)  # Keep rows where all normal features are within 3 standard deviations.
else:
    mask_normal = pd.Series([True] * len(df), index=df.index)

# 3. IQR filtering for non-normal features.
if non_normal_features:
    df_non_normal = df[non_normal_features]
    Q1 = df_non_normal.quantile(0.25)
    Q3 = df_non_normal.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    mask_non_normal = ~((df_non_normal < lower_bound) | (df_non_normal > upper_bound)).any(axis=1)  # True for rows with no outliers in non-normal features.
else:
    mask_non_normal = pd.Series([True] * len(df), index=df.index)

# 4. Combine masks to keep rows without outliers in either group.
final_mask = mask_normal & mask_non_normal
df_clean = df.loc[final_mask].reset_index(drop=True)
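A quick sanity check on how many rows the filtering removed helps confirm the cleaning step did not discard too much data, for example:
# Report how many rows were dropped as outliers.
removed = len(df) - len(df_clean)
print(f"Removed {removed} of {len(df)} rows ({removed / len(df):.1%})")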
Note: Outlier removal was performed on the entire dataset before splitting to ensure clean, consistent data for both training and testing.
Feature selection using ANOVA
ANOVA (Analysis of Variance) is a statistical method used to determine whether there are significant differences between the means of two or more groups. In feature selection for classification, ANOVA helps assess if the distribution of a numerical feature differs significantly across the target variable’s categories.
Put simply, ANOVA tests whether the variation between occupancy classes (such as occupied versus unoccupied) is greater than the variation within each class for a given feature. A higher F-score indicates that the feature’s values change significantly with occupancy state, making it a strong candidate for prediction. In this case, the ANOVA F-test (implemented as f_classif in sklearn) helps us identify sensor readings that vary meaningfully between occupied and unoccupied rooms.
from sklearn.feature_selection import SelectKBest, f_classif
import pandas as pd
# Perform univariate ANOVA F-test on the training data.
selector = SelectKBest(score_func=f_classif, k="all")
selector.fit(X_train, y_train)
# Retrieve F-scores and corresponding p-values for each feature.
f_scores = selector.scores_
p_values = selector.pvalues_
# Create a summary DataFrame of ANOVA results.
anova_results = pd.DataFrame({
    "Feature": X_train.columns,
    "F-Score": f_scores,
    "p-Value": p_values,
}).sort_values(by="F-Score", ascending=False)
This helps us identify which sensor measurements, such as Light and COâ‚‚, are most relevant for predicting room occupancy.
| Feature | F-Score | p-Value |
| --- | --- | --- |
| Light | 89008.23 | 0.000 |
| CO₂ | 9538.07 | 0.000 |
| Temperature | 6850.1 | 0.000 |
| HumidityRatio | 467.13 | 5.04e-102 |
| Humidity | 5.67 | 1.73e-02 |
- Light has the highest F-score, indicating it is the most important feature for distinguishing occupancy states. This aligns with intuition, as room lighting often changes when a room is occupied.
- CO₂ and Temperature also have very high F-scores, reflecting strong relationships with occupancy, likely driven by human presence and activity.
- HumidityRatio shows moderate importance, with a statistically significant p-value.
- Humidity has the lowest F-score and, while statistically significant, likely contributes less to model performance.
Selecting the most relevant features helps reduce dimensionality, improving both model interpretability and the model’s ability to generalize to new data.
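As an illustration, the weakest feature can be dropped before training. A minimal sketch, assuming the dataset stores CO₂ under the column name "CO2" and keeping the dataset's original column order:
# Keep the four strongest features from the ANOVA ranking, dropping Humidity.
selected_features = ["Temperature", "Light", "CO2", "HumidityRatio"]
X_train = X_train[selected_features]
X_test = X_test[selected_features]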
Model Hyperparameter Tuning with Bayesian Optimization
To achieve optimal performance, we use Bayesian optimization via Optuna to fine-tune our XGBoost model's hyperparameters. Unlike traditional grid or random search, Bayesian optimization leverages past evaluations to intelligently explore the hyperparameter space. This results in faster convergence and requires fewer trials.
We tune the following hyperparameters:
- n_estimators: Number of trees in the ensemble. More trees can improve accuracy but increase training time and the risk of overfitting.
- max_depth: Maximum depth of each decision tree. Deeper trees model more complex patterns but may overfit.
- learning_rate: Also called eta. Controls the contribution of each boosting iteration. Lower values slow learning but can improve generalization.
- subsample: Fraction of training samples randomly selected for each tree. Values less than 1 add randomness to reduce overfitting.
- colsample_bytree: Fraction of features randomly selected for each tree, also adding randomness.
- gamma: Minimum loss reduction required to make a split. Larger values make the algorithm more conservative.
- reg_alpha: L1 regularization on leaf weights, encouraging sparsity.
- reg_lambda: L2 regularization on leaf weights, controlling model complexity.
These hyperparameters collectively control model complexity and the balance between bias and variance. Bayesian optimization efficiently searches this space to find the best combination for our problem.
import optuna
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
def objective(trial):
    param = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.6, 1.0),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
        "gamma": trial.suggest_float("gamma", 0, 0.5),
        "reg_alpha": trial.suggest_float("reg_alpha", 0, 0.1),
        "reg_lambda": trial.suggest_float("reg_lambda", 1, 3),
    }
    model = XGBClassifier(**param, eval_metric="logloss", random_state=42)
    score = cross_val_score(model, X_train_scaled, y_train, cv=3, scoring="accuracy").mean()
    return score
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50, n_jobs=-1)
# Train the final model using the best hyperparameters.
best_model = XGBClassifier(**study.best_params, eval_metric="logloss", random_state=42)
best_model.fit(X_train_scaled, y_train)
# Make predictions on the test set.
y_pred = best_model.predict(X_test_scaled)
Note: Feature scaling is an important preprocessing step before model training. While it is not shown here for brevity, the model was trained on scaled data (X_train_scaled, X_test_scaled).
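For reference, here is a minimal sketch of that scaling step, assuming scikit-learn's StandardScaler fit on the training split only:
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# The per-feature means and scales are the constants the firmware will need later.
print(scaler.mean_, scaler.scale_)
The fitted means and standard deviations are the same constants that get hard-coded into the microcontroller sketch in the deployment section below.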
We use cross-validation during tuning to reliably estimate generalization performance and avoid overfitting. After finding the best hyperparameters, we retrain the model on the full training dataset and evaluate it on a separate test set for an unbiased assessment.
The final XGBoost model achieved strong performance on the test set with an accuracy of 0.99. The confusion matrix shows:
- Unoccupied rooms: 2,945 correctly identified, only 12 misclassified as occupied.
- Occupied rooms: 739 correctly identified, 12 misclassified as unoccupied.

This balance between false positives and false negatives suggests the model is not biased toward either class. Importantly, the low false negative rate (just 12 missed occupancy events) indicates the model is reliable for detecting presence, a critical requirement in real-world IoT deployments where missed occupancy can cause energy waste or safety risks.
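For completeness, these figures can be reproduced from the predictions above with scikit-learn's standard metrics:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Evaluate predictions against the held-out test labels.
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))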
The model size is only 360.75 KB, making it highly suitable for deployment on memory-constrained microcontrollers in TinyML applications.
Exporting the model to C for microcontroller deployment
To deploy the trained XGBoost model on resource-constrained microcontrollers, we convert it to C code using the m2cgen library. This generates a standalone C header file that can be directly integrated into embedded firmware, enabling on-device inference without external dependencies.
import m2cgen

# Export the tuned model trained earlier.
code = m2cgen.export_to_c(best_model)
with open("../models/XGBClassifier.h", "w") as file:
    file.write(code)
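Assuming the 360.75 KB figure quoted earlier refers to this exported header, its footprint can be verified directly:
import os
# Report the size of the generated C header in kilobytes.
size_kb = os.path.getsize("../models/XGBClassifier.h") / 1024
print(f"Exported model size: {size_kb:.2f} KB")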
The generated XGBClassifier.h file includes the model logic as a standalone C function, facilitating easy integration into microcontroller firmware.
Here is an example of how to use the exported model in an Arduino sketch:
#include <Arduino.h>
#include "XGBClassifier.h" // The exported XGBoost model header.

// Means and stds from the training set (copy from your training scaler).
// Feature order must match training: Temperature, Light, CO2, HumidityRatio.
const double feature_means[] = {22.47, 308.1, 650.8, 0.0124}; // example values
const double feature_stds[] = {2.15, 120.6, 105.3, 0.0018};   // example values

void scale_features(double* input, double* output, int len) {
  for (int i = 0; i < len; i++) {
    output[i] = (input[i] - feature_means[i]) / feature_stds[i];
  }
}

void setup() {
  Serial.begin(115200);
  Serial.println("Starting inference...");

  // Example raw sensor readings (replace with actual sensor inputs).
  double temperature = 23.5;
  double light = 300.0;
  double co2 = 600.0;
  double humidityRatio = 0.012;

  double raw_features[] = {temperature, light, co2, humidityRatio};
  double scaled_features[4];
  scale_features(raw_features, scaled_features, 4);

  // Run inference using the exported model function.
  double prediction = score(scaled_features);

  Serial.print("Occupancy prediction: ");
  Serial.println(prediction > 0.5 ? 1 : 0); // Threshold the positive-class score at 0.5.
}

void loop() {
  // No repeated actions needed.
}
Conclusion
The machine learning development journey is complex, involving many techniques and steps beyond those covered here. In this post, we highlighted a few key aspects, demonstrating how thoughtful methods can produce effective and efficient models ready for real-world deployment.