Smart devices and stress level detection: balancing machine learning model complexity and device intrusiveness

02.03.2026

Mental health disorders have become leading contributors to disability worldwide, and stress is a relevant risk factor linked to depression, chronic disease, and reduced well-being. Public health prevention and monitoring strategies underscore the need to treat stress as a targeted outcome to be reduced. Prior work has examined the potential of machine learning (ML) and deep learning (DL) approaches for stress detection by assessing the trade-offs among model complexity, prediction accuracy, and device intrusiveness. However, few studies have evaluated which type of approach performs best at stress-level prediction.

Bello-Orgaz et al. (2026) addressed this gap by evaluating whether more complex DL models improve stress-level prediction over classical ML models. The study compared classical ML models and transformer-based architectures under varying levels of device intrusiveness and validation schemes, using physiological and behavioral data collected from wearable devices and smartphones.

The authors used publicly available multimodal datasets (e.g., WESAD and StudentLife) containing physiological and smartphone-derived behavioral data. Stress classification was evaluated under different levels of device intrusiveness: a) low-intrusion aggregated features (e.g., from mobile phones); b) medium-intrusion wearable-derived features (e.g., from smartwatches and wristbands). Additionally, two validation strategies were used: a) stratified split (within-sample validation); b) Leave-One-Subject-Out (LOSO; cross-subject generalization).
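The difference between the two validation strategies can be illustrated with a minimal sketch in plain Python. This is not the authors' pipeline: the function names, the toy subjects, and the labels are all hypothetical, and a real experiment would more likely rely on a library splitter.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


def leave_one_subject_out(subjects):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        yield held_out, train, test


# Hypothetical toy data: 3 subjects, 4 samples each; 0 = calm, 1 = stressed.
subjects = ["S1"] * 4 + ["S2"] * 4 + ["S3"] * 4
labels = [0, 0, 1, 1] * 3

# Within-sample validation: subjects are ignored, only class balance is preserved.
train_idx, test_idx = stratified_split(labels, test_fraction=0.5)

# Cross-subject validation: each fold tests on a subject never seen in training.
loso_folds = list(leave_one_subject_out(subjects))
```

In the stratified split, the same subject can contribute samples to both the training and the test side, which is why it measures known-user performance. Under LOSO, the held-out subject never appears in training, so the score reflects generalization to unseen people.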

Multiple classical ML models (e.g., KNN, Random Forest, Gradient Boosting, XGBoost) and DL approaches (e.g., MLP, transformers, TabPFN) were compared to assess the quality of stress-level predictions in terms of model complexity, data complexity, and data collection intrusiveness. All experiments were repeated 10 times to ensure robustness.

Performance metrics between ML and DL models for low and medium-intrusion wearables

With low-intrusion features, under stratified validation, classical ML models slightly outperformed DL models (KNN, an ML model: F1 = 77.2%; MLP, a DL model: F1 = 73.7%). However, performance dropped dramatically under LOSO validation for both model families (maximum F1 = 34.4%), indicating poor cross-subject generalization. These findings suggest that low-intrusion features may capture subject-specific patterns rather than generalizable stress indicators.

With medium-intrusion wearable features, performance improved substantially. In the stratified split, TabPFN (a DL model) achieved an F1-score above 98.8%, outperforming classical ML models and indicating excellent within-sample stress-level classification. Cross-subject generalization also improved markedly under LOSO validation, especially for the Gradient Boosting Classifier (GBC), an ML model (F1 = 82.5%), suggesting that medium-intrusion physiological features provide more robust population-level stress markers.
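For readers less familiar with the metric, the F1-scores reported above combine precision and recall into a single number per class. The sketch below computes a macro-averaged F1 in plain Python. The macro averaging is an assumption (this summary does not state which averaging scheme the study used), and the labels and predictions are hypothetical toy values, not data from the study.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct predictions for this class: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical toy example: 0 = calm, 1 = stressed.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]
score = macro_f1(y_true, y_pred)  # each class has precision = recall = 2/3
```

In a repeated-runs setup such as the one described, a score like this would typically be computed once per repetition and then averaged.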

Fine-tuning stress-level detection models in real-world contexts

The main findings of this study highlight a key trade-off among model complexity, intrusiveness, and generalizability across ML and DL models. While DL models can outperform classical ML under certain conditions, increased architectural complexity does not always yield meaningful gains, particularly with low-intrusion data. Classical ML models provide strong performance with simpler setups, whereas transformer-based architectures and advanced DL models excel in known-user conditions.

The marked performance drop in LOSO experiments under low intrusion emphasizes the challenge of inter-subject variability in stress detection. Conversely, medium-intrusion wearable features substantially mitigate this issue, suggesting that physiological measures offer more stable cross-individual indicators.

The choice of validation strategy therefore becomes important for stress-level classification in real-world scenarios: within-sample validation may overestimate real-world performance, whereas LOSO offers a more stringent test of generalizability.

In conclusion, the authors affirm that stress-level detection using wearable and mobile data is feasible with high accuracy, particularly when moderate intrusiveness is acceptable. The computational efficiency of ML models and the potential of DL approaches in specific contexts may provide practical guidance for designing scalable, trustworthy stress detection systems. Such systems should balance accuracy, usability, privacy, and ethical responsibility.

Read the full article

What are your thoughts on ML and DL models in stress-level detection? Is this an opportunity to advance the early detection of mental health problems? Read the full study at the link and let us know your comments!
