Science

Voice Biomarker

Voice production is a complex neuromuscular coordination process. Air is pushed out of the lungs towards the vocal folds through the coordinated action of the diaphragm, the abdominal and chest muscles, and the rib cage. The vocal folds vibrate and modulate the airflow through the glottis, producing voiced sound. This voiced sound then travels through the vocal tract, where it is selectively amplified or attenuated at different frequencies. Prior clinical research has shown that mental health disorders such as depression affect the voice production process; for example, the voice of a depressed person has been described as slow, monotonous, and disfluent, with elevated jitter and shimmer. Such characteristics (features or representations) in the voice, so-called voice biomarkers, can be used to assess or diagnose a condition. Voice Health Tech has developed cutting-edge AI technology for depression assessment using voice biomarkers.
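As a rough illustration of what jitter and shimmer measure (this is not our production pipeline, which uses cycle-accurate acoustic analysis), the sketch below approximates both from a frame-level pitch contour using the open-source librosa library:

```python
# Illustrative approximation of two classic voice biomarkers, jitter and
# shimmer, from a frame-level f0 contour. Clinical tools (e.g. Praat)
# compute these per glottal cycle; this frame-based version is a sketch.
import numpy as np
import librosa

def jitter_shimmer(path: str) -> tuple[float, float]:
    y, sr = librosa.load(path, sr=16000)

    # Frame-level fundamental frequency (f0) via probabilistic YIN.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    periods = 1.0 / f0[voiced_flag]  # glottal period estimates (seconds)

    # Jitter: mean absolute difference between consecutive periods,
    # normalized by the mean period (cycle-to-cycle pitch instability).
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    # Shimmer: the same idea applied to frame-level amplitude (RMS),
    # capturing cycle-to-cycle loudness instability.
    rms = librosa.feature.rms(y=y)[0]
    rms = rms[rms > 1e-6]
    shimmer = np.mean(np.abs(np.diff(rms))) / np.mean(rms)
    return float(jitter), float(shimmer)
```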

Multi-Dimensional Voice Data from Real Patients (DSM-5)

To ensure high model accuracy, we follow the gold standard when collecting training data. Our multi-center research was designed and led by Peking University Sixth Hospital, one of the leading mental health institutes in China. Patients were diagnosed and recruited by psychiatrists from six mental health hospitals across the country, following DSM-5 criteria. Patients recorded voice samples through an H5 mini-program, and the collection protocol was carefully designed, covering sustained vowels, number counting, reading passages (e.g., the Rainbow Passage), speech under cognitive load, and open questions. Our mental health dataset, Oizys, now contains more than 43,000 audio sessions collected from patients with depression, patients with anxiety, and participants with neither condition, making it, to our knowledge, the world's leading voice dataset from DSM-5-diagnosed patients. A hypothetical sketch of how one such session could be structured appears below.
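The sketch below is purely illustrative: the field and task names are assumptions, not the actual Oizys schema.

```python
# Hypothetical structure for one collected session. The task list mirrors
# the elicitation protocol described above; names are illustrative only.
from dataclasses import dataclass, field

ELICITATION_TASKS = [
    "sustained_vowel",       # long vowels, e.g. /a/ held for several seconds
    "number_counting",       # counting aloud at a natural pace
    "reading_passage",       # standard passage reading (e.g. Rainbow Passage)
    "cognitive_load_speech", # speaking while performing a secondary task
    "open_question",         # free response to an open-ended prompt
]

@dataclass
class AudioSession:
    patient_id: str
    diagnosis: str   # DSM-5 diagnosis assigned by a psychiatrist
    site: str        # one of the six participating hospitals
    recordings: dict[str, str] = field(default_factory=dict)  # task -> file path
```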

Cutting-Edge Voice AI Technology

Leveraging advanced deep learning and transfer learning, Voice Health Tech has built a state-of-the-art AI model for depression assessment using voice biomarkers. The model produces accurate assessment results from 30-second voice recordings (16 kHz, 16-bit).
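For readers preparing their own recordings, the snippet below shows one way to convert arbitrary audio to the 16 kHz, 16-bit, 30-second format stated above; the function name and the trim/pad policy are illustrative assumptions, not our actual preprocessing code.

```python
# Minimal preprocessing sketch: resample to 16 kHz mono, fix the length
# at 30 seconds, and write 16-bit PCM. Trimming/padding policy is assumed.
import numpy as np
import librosa
import soundfile as sf

def prepare_clip(in_path: str, out_path: str, seconds: int = 30) -> None:
    y, _ = librosa.load(in_path, sr=16000, mono=True)  # resample to 16 kHz
    target = 16000 * seconds
    y = y[:target]                       # trim long recordings to 30 s
    if len(y) < target:                  # zero-pad short recordings
        y = np.pad(y, (0, target - len(y)))
    sf.write(out_path, y, 16000, subtype="PCM_16")  # 16-bit PCM WAV
```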

We first use self-supervised learning to learn latent feature representations from unlabeled voice data. These representations then serve as input to a second neural network trained on the Oizys dataset. Compared with AI models built on traditional feature engineering (MFCCs, etc.), our model achieves substantially higher performance (AUC 0.902) and is more robust in real-world scenarios.
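This two-stage design (frozen self-supervised encoder, small supervised head) can be sketched as follows. Since this page does not name the specific self-supervised model, the sketch uses wav2vec 2.0 from the Hugging Face transformers library as a stand-in; it is illustrative, not our actual architecture or weights.

```python
# Sketch: latent representations from a frozen SSL encoder, mean-pooled
# over time, fed to a small classifier head trained on labeled sessions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Frozen self-supervised encoder (wav2vec 2.0 is a stand-in choice here).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

class ClassifierHead(nn.Module):
    """Small downstream network over pooled latent representations."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        pooled = hidden.mean(dim=1)          # mean-pool over time frames
        return self.net(pooled).squeeze(-1)  # one logit per clip

head = ClassifierHead()  # in practice, trained on the labeled dataset

def score_clip(waveform, sr: int = 16000) -> torch.Tensor:
    """Return a depression-risk logit for one 16 kHz waveform (illustrative)."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():                    # encoder stays frozen
        hidden = encoder(**inputs).last_hidden_state  # (1, time, 768)
    return head(hidden)
```

Keeping the encoder frozen and training only the head is what lets the model transfer knowledge learned from large unlabeled corpora to the comparatively small labeled clinical dataset.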

Our Publication

07 November

Clinical study: A deep learning-based model for detecting depression in the senior population
