Using Speech Recognition to Predict VoIP Quality
Wenyu Jiang
IRT Lab
April 3, 2002

Introduction to Voice Quality
- Quality factors in Voice over IP (VoIP)
  - Packet loss, delay, and jitter
  - Choice of voice codec
- Quality metric: Mean Opinion Score (MOS)
  - Widely used
  - Human based
  - Time consuming
  - Labor intensive
  - Results not available in real time

MOS scale:
  Grade      Score
  Excellent  5
  Good       4
  Fair       3
  Poor       2
  Bad        1

Motivation
- Features of a speech recognizer:
  - Automatic speech recognition (ASR), no human listeners needed
  - Accuracy of recognition is apparently coupled with the quality of input speech
  - Recognition can be done in real time, allowing online quality monitoring
  - Recognition performance may be related to speech intelligibility as well as quality

Related Work
- ITU-T E-model [G.107/G.108]
  - An analytical model for estimating perceived quality
  - Provides loss-to-MOS mapping for some common codecs (G.729, G.711, G.723.1)
- Chernick et al. studied speech recognition performance with the DoD-CELP codec
  - Effect of bit error rate instead of packet loss
  - Phoneme (instead of word) recognition ratio
  - Some MOS results, but not accurate enough

Experiment Setup
- Speech recognition engine
  - IBM ViaVoice on Linux
  - Wrote software for both voice model training and performance testing
- Training and testing
  - 2 scripts: #1 for training, #2 for testing
  - 2 speakers, A and B; both read both scripts
  - Script #2 is split into 25 audio clips, with 5 clips per loss condition (0%, 2%, 5%, 10%, 15%)
  - Codec: G.729
  - Training by G.729-processed audio

Experiment Setup, contd.
- Performance metric
  - Absolute word recognition ratio:
    $$R_{abs} = \frac{\text{number of correctly recognized words}}{\text{total number of spoken words}}$$
  - Relative word recognition ratio:
    $$R_{rel}(p) = \frac{R_{abs}(p)}{R_{abs}(0\%)}$$
    where p is the packet loss probability
- MOS listening tests: 22 listeners

Recognition Ratio vs. MOS
- Both MOS and R_abs decrease w.r.t. loss
- Then, eliminate the middle variable p
[Figure: Impact of packet loss on audio quality, G.729 codec. MOS vs. loss rate (%).]
[Figure: Impact of packet loss on automatic speech recognition, G.729 codec. Word recognition ratio (%) vs. loss rate (%).]
[Figure: Mapping from speech recognition performance to MOS. MOS vs. word recognition ratio (%).]
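Eliminating the middle variable p amounts to pairing, for each loss condition, the measured word recognition ratio with the measured MOS, and then interpolating between those pairs. The Python sketch below illustrates that idea. The function names are hypothetical, the numeric values are placeholders (not readings from the figures), and linear interpolation is only one simple choice of fit.

```python
from bisect import bisect_left

def build_mapping(ratio_by_loss, mos_by_loss):
    """Pair ratio and MOS measured at the same loss rates, dropping p itself."""
    shared = set(ratio_by_loss) & set(mos_by_loss)
    # Sort by recognition ratio so the pairs can be interpolated over.
    return sorted((ratio_by_loss[p], mos_by_loss[p]) for p in shared)

def predict_mos(ratio, pairs):
    """Linearly interpolate MOS for a ratio; clamp outside the measured range."""
    xs = [r for r, _ in pairs]
    if ratio <= xs[0]:
        return pairs[0][1]
    if ratio >= xs[-1]:
        return pairs[-1][1]
    i = bisect_left(xs, ratio)  # xs[i-1] < ratio <= xs[i]
    (x0, y0), (x1, y1) = pairs[i - 1], pairs[i]
    return y0 + (y1 - y0) * (ratio - x0) / (x1 - x0)

# Placeholder measurements keyed by loss rate in percent (hypothetical values).
ratio_by_loss = {0: 42.0, 2: 40.0, 5: 38.0, 10: 33.0, 15: 29.0}
mos_by_loss = {0: 3.6, 2: 3.4, 5: 3.0, 10: 2.5, 15: 2.2}

pairs = build_mapping(ratio_by_loss, mos_by_loss)
print(predict_mos(39.0, pairs))
```

Clamping outside the measured range is a conservative design choice here: the sketch does not extrapolate beyond the loss conditions that were actually tested.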
Properties of ASR Performance
- When loss probability is low
  - Recognition ratio changes slowly
  - Possibly due to robustness in ViaVoice
  - Less accurate MOS prediction in such case
- Importance of voice training method
  - Training audio should use same codec as testing
[Figure: Impact of packet loss on machine speech recognition; Speaker A, trained by PCM linear-16. Word recognition ratio (%) vs. loss rate (%).]
[Figure: Mapping from speech recognition performance to MOS; Speaker A, trained by PCM linear-16. MOS vs. word recognition ratio (%).]

Speaker Dependence in ASR
- ViaVoice SDK cites a 90% accuracy for
  - Average speaker without a heavy accent
  - Sampling at 22 kHz, PCM linear-16
- For speaker A, we achieved
  - About 42% accuracy with no packet loss
  - Reasons:
    - 8 kHz sampling + G.729 compression
    - Accent + talk speed
  - Does not interfere with MOS prediction, but need to check for speaker dependence

Speaker Dependence Check
- Absolute recognition ratio is 70% for speaker B, but 42% for speaker A
  - Dependent on the speaker
- But the relative recognition ratio R_rel is universal and speaker-independent
[Figure: Impact of packet loss on machine speech recognition; Speaker A and Speaker B. Word recognition ratio (%) vs. loss rate (%).]
[Figure: Impact of packet loss on machine speech recognition; Speaker A and Speaker B. Relative word recognition ratio R_rel (%) vs. loss rate (%).]

R_rel as Universal MOS Predictor
[Figure: Mapping from relative recognition ratio R_rel to MOS; speaker A and speaker B, both trained by G.729. MOS vs. relative word recognition ratio R_rel (%).]
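The normalization behind R_rel can be sketched in Python as follows. The slides do not say how "correctly recognized" words are counted; this sketch assumes an in-order alignment of reference and recognized words using the standard library's difflib, and the function names are hypothetical. Only the no-loss baselines of about 42% (speaker A) and 70% (speaker B) come from the slides; the example sentences are made up.

```python
from difflib import SequenceMatcher

def absolute_ratio(reference_text, recognized_text):
    """R_abs: correctly recognized words / total spoken words (in-order match)."""
    ref = reference_text.lower().split()
    hyp = recognized_text.lower().split()
    if not ref:
        raise ValueError("reference text is empty")
    # Assumption: a word counts as correct if it falls in an aligned matching block.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    correct = sum(block.size for block in matcher.get_matching_blocks())
    return correct / len(ref)

def relative_ratio(r_abs_at_loss, r_abs_no_loss):
    """R_rel(p) = R_abs(p) / R_abs(0%)."""
    if r_abs_no_loss <= 0:
        raise ValueError("baseline R_abs(0%) must be positive")
    return r_abs_at_loss / r_abs_no_loss

# Made-up transcript pair, only to exercise absolute_ratio().
spoken = "the quick brown fox jumps over the lazy dog"
heard = "the quick brown box jumps over lazy dog"
print(absolute_ratio(spoken, heard))

# Baselines from the slides: both speakers normalize to 1.0 at 0% loss.
for speaker, baseline in {"A": 0.42, "B": 0.70}.items():
    print(speaker, relative_ratio(baseline, baseline))
```

Dividing by each speaker's own no-loss baseline is what removes the gap between the 42% and 70% absolute ratios, so that curves from different speakers can be compared on one axis.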
Human Recognition Results
- Listeners are asked to transcribe what they hear in addition to MOS grading.
- Human recognition result curves are less smooth than MOS curves.
[Figure: Impact of packet loss on human speech recognition. Absolute word recognition ratio (%) vs. loss rate (%).]
[Figure: Mapping from human recognition performance to MOS. MOS vs. absolute word recognition ratio R_abs (%).]

Human Results, contd.
- Two flat regions in loss-human curve
  - 2-5% loss (some loss but not very high)
  - 10-15% loss (loss is already too high)
- Mapping between machine and human recognition performance
[Figure: Human vs. machine recognition performance. Human R_abs (%) vs. machine R_abs (%).]

Application Scenarios
- Sender transmits a pre-recorded audio clip of a speaker known to the receiver.
- Receiver does the following:
  - Looks up R_abs(0%) for this speaker
  - Performs speech recognition
  - Compares the result to the original text and computes R_rel
- No need to store the original audio clip
  - Just the text is sufficient, so less storage is needed
- Need not know packet loss probability
  - Suitable for e2e black-box measurements

Conclusions
- Evaluation of speech recognition performance as a MOS predictor
  - Used ViaVoice speech engine
  - Performance metric: word recognition ratio
  - The relative word recognition ratio is a universal, speaker-independent metric
- Also analyzed human recognition performance
- Future work: evaluate other codecs, e.g., G.726, GSM.
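The receiver-side procedure listed under Application Scenarios could be wired together roughly as below. This is a sketch under stated assumptions, not the author's software: `recognize` stands in for whatever speech engine is used (no ViaVoice API is shown), `rrel_to_mos` stands in for a previously measured R_rel-to-MOS curve, the word matching uses difflib as an assumed scoring rule, and all names are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Callable, Dict

@dataclass
class QualityProbe:
    reference_text: str                     # known text of the pre-recorded clip
    baseline_r_abs: Dict[str, float]        # speaker -> R_abs(0%), measured offline
    recognize: Callable[[bytes], str]       # hypothetical ASR hook
    rrel_to_mos: Callable[[float], float]   # hypothetical calibration curve

    def estimate_mos(self, speaker: str, received_audio: bytes) -> float:
        """Estimate MOS from received audio alone; no loss rate or original audio."""
        ref = self.reference_text.lower().split()
        hyp = self.recognize(received_audio).lower().split()
        # Assumed scoring rule: count words that fall in aligned matching blocks.
        blocks = SequenceMatcher(a=ref, b=hyp, autojunk=False).get_matching_blocks()
        r_abs = sum(block.size for block in blocks) / len(ref)
        r_rel = r_abs / self.baseline_r_abs[speaker]
        return self.rrel_to_mos(r_rel)
```

Note that the probe holds only the clip's text and a per-speaker baseline, which mirrors the storage and black-box points in the slide.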