But today, I learned (via Rubel) that a company named TVEyes intends to offer a service named Podscope (by the end of the month) that will do exactly that. (Video, err, “vodcasts”, “podshows”, insert-clever-meme-name-here, will be supported too. In fact, it’s hard to imagine that any audio content, including recordings of VoIP/Skype conference calls, university/conference lectures, etc., wouldn’t eventually be supported.)
Beautiful! Anything that makes audio and video more searchable, referenceable, etc. would be a huge boon even if that were the only application.
But let’s project out to a future point (just riffing, not doing a feasibility assessment)… Apple, Odeo and others have made creating audio and video an order of magnitude easier. Google, Yahoo and Ourmedia’s efforts allow for a ton of personal audio and video to be hosted server-side at no cost to the publisher… All of it is automatically transcribed to text (and, largely, publishers care enough to fix the inevitable transcription mistakes)…
Now, let’s make a leap and assume that all of your transcribed audio is persisted within a personal voice profile (part of Identity 2.0, perhaps) and that it’s accessible via an API. That is, a single service, with a very large vocabulary drawn from your continuous, speaker-dependent (i.e., personal) speech, could be invoked by apps that you approve.
This would seem to be a huge boon for IVR applications. Finally, your bank would reliably understand you when you say “account balance” or “let me talk to the operator, you lousy piece of…”. More interestingly for our focus here, Yahoo! and Google voice-driven search from your cell phone (etc.) would have a large index of your pronunciations of common and industry-specific jargon to work with.
Lots of other possibilities of course… drop on by if you care to riff. 🙂
Aside: The other topic… I wanted to tie all of this into the excellent series of posts that Tim Oren has been writing about Machine (Language) Translation, using blog text for language pair seeding. (I guess I just did.) Yes yes, tons of “fidelity” might be lost in early systems, but Voice -> Text -> Translated Text seems very compelling… and if you could close the loop by going from Translated Text -> Translated Voice… well, OK, that’s just crazy talk. 😉