Research Question:

Can a machine learning model, trained on extracted audio features, accurately distinguish between truthful and deceptive speech in audio recordings across multiple natural languages?

Objective of the Exercise:

Develop a machine learning model to detect deception from short speech clips using acoustic features such as:

  • Mel-Frequency Cepstral Coefficients (MFCCs)

  • Pitch

  • Spectral characteristics

Dataset:

  • 100 labelled audio recordings in multiple languages (Hindi, English, Bengali)

  • Each sample is labelled as either truthful or deceptive


Methodology:

  1. Feature Extraction: MFCCs, pitch, spectral centroid, bandwidth

  2. Preprocessing:

    • Standardisation of features

    • Stratified 80/20 train-test split

  3. Model: Support Vector Machine (SVM) with linear kernel

  4. Evaluation Metrics:

    • Accuracy

    • Precision, Recall

    • F1-Score

    • Confusion Matrix
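
The methodology above can be sketched end to end with scikit-learn. Since the 100-recording dataset is not included here, the sketch uses synthetic feature vectors standing in for the extracted MFCC/pitch/spectral features; the feature count and random seeds are illustrative assumptions, not values from the report.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import (accuracy_score,
                             precision_recall_fscore_support,
                             confusion_matrix)

rng = np.random.default_rng(0)
# 100 samples x 15 features (e.g. 13 MFCC means + pitch + spectral centroid)
X = rng.normal(size=(100, 15))
y = rng.integers(0, 2, size=100)   # 0 = truthful, 1 = deceptive
X[y == 1] += 0.5                   # inject a weak class difference

# Stratified 80/20 split keeps the truthful/deceptive ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Standardise using training-set statistics only, then fit a linear SVM
scaler = StandardScaler().fit(X_tr)
clf = SVC(kernel="linear").fit(scaler.transform(X_tr), y_tr)
y_pred = clf.predict(scaler.transform(X_te))

# Evaluation metrics listed above
acc = accuracy_score(y_te, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_te, y_pred, average="binary")
cm = confusion_matrix(y_te, y_pred)
```

Fitting the scaler on the training split alone avoids leaking test-set statistics into preprocessing.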


Impact of Using SVM

The Support Vector Machine (SVM) classifier, especially with a linear kernel, proved well-suited for this task for several reasons:

  • Effective on small datasets: SVM is robust even with limited data (like the 100 samples used here), especially when the number of features is high after extraction.

  • High-dimensional space handling: Acoustic features like MFCCs and spectral statistics are high-dimensional and complex; SVM handles such feature spaces efficiently.

  • Reduced overfitting: Compared to more complex models, SVM performed well without overfitting, especially after dimensionality reduction via PCA.

Observed Improvements with SVM:

  • Performance improved over baseline models (e.g., logistic regression or naive classifiers).

  • Provided more balanced precision and recall, making it better at both detecting deception and minimising false accusations.

  • When combined with PCA, SVM training became faster and more stable, helping generalize better on the test set.
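
The PCA-plus-SVM combination described above can be expressed as a single scikit-learn pipeline; the component count (10) and the synthetic 40-dimensional feature matrix below are illustrative assumptions, not figures from the report.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 40))   # stand-in for a high-dimensional feature set
y = np.array([0, 1] * 50)        # balanced truthful/deceptive labels

# Scale -> project onto 10 principal components -> linear SVM
model = make_pipeline(StandardScaler(),
                      PCA(n_components=10),
                      SVC(kernel="linear"))

# 5-fold cross-validated accuracy of the whole pipeline
scores = cross_val_score(model, X, y, cv=5)
```

Wrapping the three stages in one pipeline ensures the scaling and PCA projection are refit inside each cross-validation fold, so the reported scores reflect the full training procedure.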

Key Insights

Limitations

  • Small Dataset: Only 100 samples were available, limiting the model’s ability to generalize effectively.

  • Artificial Deception: The dataset consists of acted, prompted deceptive stories, which may not reflect natural, spontaneous lying behaviour.

  • Language Imbalance: Uneven representation of languages may bias the model, as certain audio features may correlate with specific linguistic traits.

  • Limited Feature Scope: Only spectral and prosodic features were used; temporal or linguistic features could provide deeper insights.

  • Subtle Differences: Deceptive speech may not always differ clearly in acoustic features, making consistent detection difficult.

  • Speaker Variability: Natural differences in speech patterns across individuals introduce noise and reduce model reliability.

  • Lack of Context: Without non-verbal or situational context, audio-only analysis may miss key cues relevant to deception.

Future Improvements

  • Increase Dataset Size: Collect a larger dataset to improve the generalisability of the model. A bigger dataset would allow the model to capture more diverse patterns in speech.

  • Collect Natural Deception Data: Instead of relying on prompted deceptive stories, collect data in more realistic contexts where deception occurs naturally.

  • Prosodic Features: Include features like pitch range, speaking rate, or energy dynamics that might capture intentional modulations in deceptive speech.

  • Voice Quality Features: Analyse jitter, shimmer, or harmonics-to-noise ratio (HNR) to detect subtle changes in voice quality.

  • Higher-Level Linguistic Features: Extract semantic or syntactic features using tools like ASR (Automatic Speech Recognition) to analyse the content of the stories.

  • Experiment with Other Models: Test more advanced classifiers like Gradient Boosting Machines (e.g., XGBoost or LightGBM) or Neural Networks.

  • Ensemble Learning: Combine multiple models (e.g., SVM and Random Forest) to improve classification performance.
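
Of the voice-quality measures suggested above, jitter is the simplest to illustrate: it is the mean absolute difference between consecutive pitch periods, normalised by the mean period. The helper below is a minimal NumPy sketch (the function name and example period values are hypothetical, not from the report).

```python
import numpy as np

def local_jitter(periods):
    """Local jitter: mean absolute difference between consecutive
    pitch periods, divided by the mean period. Higher values mean a
    less stable (more irregular) voice."""
    periods = np.asarray(periods, dtype=float)
    cycle_to_cycle = np.abs(np.diff(periods))
    return cycle_to_cycle.mean() / periods.mean()

# Perfectly regular glottal periods give zero jitter; slightly
# perturbed periods give a small positive value.
regular = local_jitter([0.010] * 10)
perturbed = local_jitter([0.0100, 0.0102, 0.0098, 0.0101])
```

In practice the pitch periods would come from a pitch tracker rather than being listed by hand, and shimmer (the amplitude analogue) and HNR follow the same cycle-by-cycle pattern.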