This research into automated lip reading was part of a three-year project and was supported by the Engineering and Physical Sciences Research Council
Automated CCTV lip reading is challenging due to low frame rates and small
images, but the University of East Anglia is pushing the next stage of this technology

Scientists at the University of East Anglia in Norwich, England, are working on the next stage of automated lip reading technology that could be used for deciphering speech from video surveillance footage.

The visual speech recognition technology, created by Dr. Helen Bear and Professor Richard Harvey of UEA’s School of Computing Sciences, can be applied “any place where the audio isn’t good enough to determine what people are saying,” says Dr. Bear.

Training System To Recognize Lip Movements

She says that unique problems with determining speech arise when sound isn’t available – such as on CCTV footage – or if the audio is inadequate and there aren’t clues to give context to the conversation. The technology can also be used where there is audio but it is difficult to pick up because of ambient noise, such as in cars and aircraft.

The technology uses deep neural networks that “learn” the way people move their lips, explains Professor Harvey. Researchers “train” the system using one person’s lip movements, then test it on another person’s lip movements. The team’s database currently covers 12 people and a list of around 1,000 words. This produces a success rate of 80 percent when a single speaker is used, and 60 percent across two different speakers. An element of language modeling is also used to train the computer to recognize the context of the words spoken.
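
The train-on-one-speaker, test-on-another procedure described above can be sketched roughly as follows. This is a minimal illustration and not the UEA team's code: the sample layout, the function names `split_by_speaker` and `word_accuracy`, and the toy data are all assumptions made for the example, and no neural network is trained here.

```python
from typing import List, Sequence, Set, Tuple

# Hypothetical sample layout: (speaker_id, lip_features, spoken_word).
Sample = Tuple[str, List[float], str]


def split_by_speaker(
    samples: Sequence[Sample], test_speakers: Set[str]
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that no test speaker ever appears in the training set."""
    train = [s for s in samples if s[0] not in test_speakers]
    test = [s for s in samples if s[0] in test_speakers]
    return train, test


def word_accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Return the fraction of words that were recognized correctly."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Toy data, invented purely for illustration.
samples = [
    ("speaker_a", [0.1, 0.4], "hello"),
    ("speaker_a", [0.3, 0.2], "stop"),
    ("speaker_b", [0.2, 0.5], "hello"),
    ("speaker_b", [0.6, 0.1], "stop"),
]

# Train on speaker_a only; hold out every sample from speaker_b for testing.
train, test = split_by_speaker(samples, test_speakers={"speaker_b"})
```

Holding out whole speakers, rather than random samples, keeps the test from rewarding a model that has merely memorized one person's lip shapes.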

Challenges Of Lip Reading CCTV

“Lip-reading is one of the most challenging problems in artificial intelligence, so it’s great to make progress on one of the trickier aspects, which is how to train machines to recognize the appearance and shape of human lips,” says Harvey.

“CCTV is still a challenge – there’s lots of stuff working against you. For example, on most CCTV footage the lips are quite small and frame rates are low. But an easier application could be, for example, to enhance messages sent over radio by a security guard.”

Of course, most CCTV systems do not include audio, in part due to privacy and data protection laws, which tend to limit the use of audio except in specific circumstances.

The research was part of a three-year project and was supported by the Engineering and Physical Sciences Research Council. The research paper, Decoding Visemes: Improving Machine Lip-Reading, was presented at the IEEE International Conference on Acoustics, Speech and Signal Processing in Shanghai last month.


Author profile

Ron Alalouff Contributing Editor, SourceSecurity.com
