Speech Processing Model in Embedded Media Processing


Speech processing plays an essential role in embedded media processing. Although speech data consumes less memory and processing power than audio and video data, its requirements are still large enough to matter.

Both speech processing and audio processing operate on audible data. However, speech processing covers the frequency range from 20 Hz to 4 kHz, while audio processing covers 20 Hz to 20 kHz. There is a further significant difference between the two: speech compression is based on a model of the human vocal system, whereas audio compression is based on a model of the human auditory system.
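To make the bandwidth difference concrete, here is a minimal sketch that applies the Nyquist criterion (the sampling rate must be at least twice the highest frequency of interest) to the two upper band limits quoted above. The function and constant names are illustrative assumptions, not part of any library.

```python
def min_sampling_rate_hz(max_frequency_hz: float) -> float:
    """Lowest sampling rate that satisfies the Nyquist criterion."""
    return 2.0 * max_frequency_hz


# Upper band limits quoted in the text above.
SPEECH_MAX_HZ = 4_000.0
AUDIO_MAX_HZ = 20_000.0

speech_rate = min_sampling_rate_hz(SPEECH_MAX_HZ)
audio_rate = min_sampling_rate_hz(AUDIO_MAX_HZ)

print(f"speech: at least {speech_rate:.0f} samples per second")
print(f"audio:  at least {audio_rate:.0f} samples per second")
print(f"audio needs {audio_rate / speech_rate:.0f}x the samples of speech")
```

Under these assumptions, speech needs at least 8,000 samples per second and audio at least 40,000, which is one reason speech is cheaper to store and process.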

Speech processing is a subset of digital signal processing. It combines specific properties of the human vocal apparatus with mathematical techniques to compress voice signals for transmission over VoIP and cellular networks.

Speech Processing Is Usually Divided Into

  • Speech encoding: Compressing speech to reduce its data size for storage and streaming by eliminating redundancies in the data.
  • Speech recognition: The ability of an algorithm to identify spoken words and convert them into text.
  • Speaker verification/identification: Verifying a speaker's identity, for example in security applications in the banking sector.
  • Speech enhancement: Suppressing noise and increasing the gain to make recorded speech more audible.
  • Speech synthesis: Artificial generation of human speech, as in text-to-speech conversion.

Anatomy of the Human Vocal Tract from the Speech Processing Perspective

A speech signal comprises a sequence of sounds. When air is expelled from the lungs, acoustic excitation of the vocal tract generates the sound/speech signal. The lungs act as the air supply during speech production. The vocal cords (as seen in the figure below) are two membranes that vary the area of the glottis. When we breathe, the vocal cords remain open, but they open and close when we talk.

Air pressure builds up near the vocal cords when air is expelled from the lungs. Once the air pressure reaches a certain threshold, the vocal cords/folds open, and the airflow through them makes the membranes vibrate. The frequency of vibration depends on the length of the vocal cords and the tension in them. This frequency is called the fundamental frequency or pitch frequency, and it defines the pitch of a person's voice. The fundamental frequency in humans is statistically within the following range:

  1. 50 Hz to 200 Hz for men
  2. 50 Hz to 300 Hz for women and
  3. 200 Hz to 400 Hz for children

Women and children have shorter vocal cords than men and therefore speak at higher frequencies.
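One common way to measure the fundamental frequency of a short frame of speech is to look for the lag at which the frame best matches a shifted copy of itself (autocorrelation). The article does not prescribe a method, so the sketch below is only an illustration: the function name, the 50 Hz to 400 Hz search range (taken from the ranges listed above), and the synthetic 100 Hz test tone are all assumptions, and a real speech frame would need windowing and voicing checks.

```python
import math


def estimate_f0_hz(frame, sample_rate_hz, f0_min_hz=50.0, f0_max_hz=400.0):
    """Estimate the fundamental frequency of one frame by autocorrelation."""
    min_lag = int(sample_rate_hz / f0_max_hz)  # shortest period searched
    max_lag = int(sample_rate_hz / f0_min_hz)  # longest period searched
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        score = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag


# Hypothetical test input: a pure 100 Hz tone sampled at 8 kHz for 60 ms.
SAMPLE_RATE_HZ = 8000
TONE_HZ = 100.0
frame = [math.sin(2.0 * math.pi * TONE_HZ * n / SAMPLE_RATE_HZ) for n in range(480)]

f0 = estimate_f0_hz(frame, SAMPLE_RATE_HZ)
print(f"estimated fundamental frequency: {f0:.1f} Hz")
```

The estimate should land close to the 100 Hz of the test tone. Because the lag is a whole number of samples, the resolution of this simple approach is limited by the sampling rate.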

Human Speech Can Be Broadly Categorized into Three Types of Sounds

  • Voiced sounds: These are produced by the vibration of the vocal cords when air flows from the lungs through the vocal tract, for example, a, b, m, n, etc. Voiced sounds carry low-frequency components. During voiced speech production, the vocal cords are closed most of the time.
  • Unvoiced sounds: The vocal cords do not vibrate for unvoiced sounds. A continuous flow of air through the vocal tract produces them, for example, shh, sss, f, etc. Unvoiced sounds carry high-frequency components. During unvoiced speech production, the vocal cords are open most of the time.
  • Other sounds: These sounds can be classified as:
  1. Nasal sounds: The vocal tract is acoustically coupled with the nasal tract, i.e., the sound is radiated through the nostrils and lips, for example, m, n, ing, etc.
  2. Plosive sounds: These sounds result from a sudden build-up and release of pressure behind a closure at the front of the vocal tract, for example, p, t, b, etc.
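Since voiced sounds carry mostly low-frequency components and unvoiced sounds mostly high-frequency components, a crude way to tell them apart is to count how often the waveform changes sign (the zero-crossing rate). The sketch below is a simplified illustration, not a production classifier: the 0.1 threshold is an arbitrary assumption, and the "voiced" and "unvoiced" frames are synthetic stand-ins (a low-frequency tone and white noise), not real speech.

```python
import math
import random


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)


def looks_voiced(frame, zcr_threshold=0.1):
    """Hypothetical rule: a low zero-crossing rate suggests a voiced frame."""
    return zero_crossing_rate(frame) < zcr_threshold


SAMPLE_RATE_HZ = 8000

# Stand-in for a voiced frame: a 120 Hz tone (low-frequency, periodic).
voiced_like = [
    math.sin(2.0 * math.pi * 120.0 * n / SAMPLE_RATE_HZ) for n in range(400)
]

# Stand-in for an unvoiced frame: white noise (energy spread to high frequencies).
rng = random.Random(0)
unvoiced_like = [rng.uniform(-1.0, 1.0) for _ in range(400)]

print("tone  ZCR:", round(zero_crossing_rate(voiced_like), 3), looks_voiced(voiced_like))
print("noise ZCR:", round(zero_crossing_rate(unvoiced_like), 3), looks_voiced(unvoiced_like))
```

Practical systems usually combine the zero-crossing rate with frame energy and other features, because nasal and plosive sounds do not fit neatly into either class.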

The vocal tract is a vase-shaped acoustic tube that is terminated at one end by the vocal cords and at the other end by the lips.


The properties of speech signals depend on the human speech production system, so the speech production model is derived from the fundamental principles of that system. Understanding the characteristics of the human speech production system is therefore essential for designing algorithms for speech compression, speech synthesis, and speech recognition. In addition, the speech production model is used when converting analog speech to digital form for transmission via telephony.
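A common way to turn this description into a model is the source-filter view: an excitation signal stands in for the vocal cords (a pulse train for voiced sounds, noise for unvoiced sounds), and a resonant filter stands in for the vocal tract. The sketch below is a toy illustration under that assumption. The single 500 Hz resonance, the pole radius of 0.95, and all function names are hypothetical choices, not values from the article or from any speech codec.

```python
import math
import random


def impulse_train(num_samples, period_samples):
    """Voiced-style excitation: one pulse per pitch period."""
    return [1.0 if n % period_samples == 0 else 0.0 for n in range(num_samples)]


def white_noise(num_samples, seed=0):
    """Unvoiced-style excitation: random samples in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]


def all_pole_filter(excitation, coeffs):
    """Compute y[n] = x[n] + sum over k of coeffs[k] * y[n - 1 - k]."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]
        out.append(y)
    return out


SAMPLE_RATE_HZ = 8000
RESONANCE_HZ = 500.0  # hypothetical single vocal-tract resonance
POLE_RADIUS = 0.95    # below 1.0, so the filter is stable

theta = 2.0 * math.pi * RESONANCE_HZ / SAMPLE_RATE_HZ
tract = [2.0 * POLE_RADIUS * math.cos(theta), -POLE_RADIUS ** 2]

# A 100 Hz pitch at 8 kHz corresponds to a period of 80 samples.
voiced = all_pole_filter(impulse_train(400, 80), tract)
unvoiced = all_pole_filter(white_noise(400), tract)

print("first voiced samples:", [round(v, 3) for v in voiced[:4]])
print("peak unvoiced amplitude:", round(max(abs(v) for v in unvoiced), 3))
```

In this view, a speech coder only has to describe the excitation type, the pitch period, and the filter coefficients for each frame rather than every sample, which is where the compression comes from.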
