
Making Audio More Searchable

January 9, 2014

Anyone with experience in podcast production can tell you how much work goes into it. On top of the actual recording and editing, producers must draft descriptive content to summarize the material. This includes keywords, which help make the podcast search-friendly. So far, Google hasn’t made a search engine that can find audio clips. Until they do, producers are forced to hire interns and copywriters to listen to each clip before broadcast, take notes on what they hear, and produce a summary and keywords from their notes. This is labor-intensive and prone to human error.

There’s a Better Way: Process Automation

I face this problem every week when I produce a podcast with my good friend, Justin Davis (@jwd2a). We sit down with studio-quality recording equipment, drink bourbon, and talk about technology and design for an hour. We’re very proud of our show, called Distilled (https://itunes.apple.com/us/podcast/distilled/id574867062). We often hear from our fans how much they love listening to us. Still, we would love to reach a larger audience, and podcasts aren’t the most popular medium. If we could rely on some other form of searchable content to drive traffic, we might expand our listener base. So how might we do that?

As an information architect, I often find myself designing and improving large systems of complex parts. I tend to look at big problems as lots of little problems waiting to be decomposed and solved individually. This really helps reduce the weight of the mountain of difficulty, making it easier to feel confident in finding a solution. In the case of this unique problem, decomposition has proven to be a critical aspect of the solution.

Figure: decompose a single audio file into many clips, process each clip for text content, and analyze the results.

The High Level Solution

First, we take the final published MP3 and transcode it into WAV format. Then, taking a tip from crowdsourcing, we chop the raw audio into lots of short clips. Each clip is then submitted to the AT&T Speech API, which analyzes it and returns the set of words it detected, along with some other information. These words are then reassembled based on the original order of the clips, resulting in a transcript of the raw audio. Easy, right?

The Technical Nitty Gritty

Using Ruby gems and a little custom code, we can achieve our goals relatively easily. For reference, the following sections discuss implementation details, which are available as a GitHub gist (https://gist.github.com/agoodman/8337134).

WAV File I/O

The wavefile gem makes this very easy. It supports many different WAV formats and provides a nice abstraction layer for simple reading and writing of WAV sample data. The naive approach was to simply slice the sample data into 30-second clips. Implementation details can be found in wav_split.rb in the gist above.
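As a rough illustration (not the gist version itself), here is a minimal sketch of that slicing step using the wavefile gem; the input file name and the 30-second clip length are assumptions.

# Sketch: slice a WAV file into fixed-length clips with the wavefile gem
require 'wavefile'
include WaveFile

CLIP_SECONDS = 30

reader = Reader.new("episode.wav")
samples_per_clip = reader.format.sample_rate * CLIP_SECONDS

clip_index = 0
reader.each_buffer(samples_per_clip) do |buffer|
  clip_name = "clip_#{clip_index.to_s.rjust(3, '0')}.wav"
  Writer.new(clip_name, reader.format) do |writer|
    writer.write(buffer)
  end
  clip_index += 1
end
reader.close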

API Batching

Once the raw audio is chopped up into clips, each clip must be submitted to the Speech API for processing. Each request takes up to one minute to complete, so there must be some framework for processing API requests in parallel. We use threading for that, as shown in api_batch.rb in the gist above.
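Here is a minimal sketch of that batching pattern using plain Ruby threads. The submit_clip method is a hypothetical stand-in for the actual Speech API request, and the pool size is an assumption; see api_batch.rb in the gist for the real implementation.

require 'json'

# Placeholder for the actual Speech API request; in real use this would POST
# the clip to the speech-to-text endpoint and return the parsed response.
def submit_clip(path)
  { "words" => [], "clip" => path }
end

POOL_SIZE = 4
queue = Queue.new
Dir.glob("clip_*.wav").sort.each_with_index { |path, i| queue << [i, path] }

workers = Array.new(POOL_SIZE) do
  Thread.new do
    loop do
      # Non-blocking pop raises when the queue is drained, ending the worker
      index, path = queue.pop(true) rescue break
      result = submit_clip(path)
      File.write("result_#{index.to_s.rjust(3, '0')}.json", result.to_json)
    end
  end
end
workers.each(&:join)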

Post-Processing

The batch as a whole is assigned a hash for unique tracking and process monitoring. Each clip is assigned an index, based on its position in the sequence. The API batching script maintains the index for each clip when saving the results to output files. This makes it easy to reassemble the results to produce the final transcript. The naive approach simply appends all the words together and uses the word confidence data returned by the Speech API to filter out insignificant or inaccurate words. Implementation can be found in speech_post.rb in the gist above.
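A minimal sketch of that reassembly step, assuming the per-clip result files and field names produced by the batching sketch above and an arbitrary 0.5 confidence cutoff; speech_post.rb in the gist is the authoritative version.

require 'json'

MIN_CONFIDENCE = 0.5

# Result files are zero-padded by clip index, so a sorted glob restores order
transcript = Dir.glob("result_*.json").sort.flat_map do |path|
  words = JSON.parse(File.read(path)).fetch("words", [])
  words.select { |w| w["confidence"].to_f >= MIN_CONFIDENCE }
       .map    { |w| w["word"] }
end

File.write("transcript.txt", transcript.join(" "))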

Lessons Learned

This exercise has been extremely valuable, both from a technical implementation perspective and from a process automation standpoint. Socializing the concept has helped to measure concept viability as a potential product or service to offer as a business. Most of all, we now have some tools that make it easier to produce our podcast, which helps reduce the publishing cost and time, so we can focus on delivering entertaining and educational material for our listeners.

Decoupling Identity from Authentication

September 5, 2013

Many mobile and web apps connect to web services to provide content to the user. This usually means the app developer has control over the web service. Some apps rely on third-party services, using OAuth to negotiate the connection. OAuth is an excellent strategy for third-party authentication, but it’s outside the scope of this discussion. We want to focus on the direct connection between an app and the web service where one developer controls both the client and the service. Let’s look at a case study of the current status quo.

Case Study: Email Address As Login

The typical workflow for new and returning users is to show a sign-up/sign-in view on initial app launch, where the user either enters credentials to log in or provides the information required to create an account. In most systems, the vendor (probably you, if you’re reading this) wants to collect information like email address and phone number, so they can stay connected with their customer. While that can be incredibly valuable to the vendor, it may not be necessary for the user to derive value from the app. Generally, the user needs only to operate within a persistent session that uniquely and distinctly identifies their actions as belonging to one user. We refer to this as the personal silo.

Users are quite familiar with providing an email address as a required field when creating an account. This strategy presumes the user wants to reveal this information to the vendor. In some cases, users will refuse to create an account because they don’t wish to reveal their email address. In that case, everyone loses. In order to provide maximum value to the user, it is necessary to adjust the on-boarding process to require absolutely nothing from the user. For web apps, this is difficult (maybe impossible). For mobile apps, we’ve developed a strategy that lets the user start using the app without any data entry at all. Here’s how we do it.

Automatic Token-Based Account Creation

On first launch, the app establishes a connection with the web service to create a User object. This object requires no fields and has only a token to uniquely identify the user. At this point, the user is most likely still looking at a splash screen. Since most vendors take this opportunity to show some branding materials and force the user to wait a few seconds, we might as well go off and create the account while they wait. Once the account is created, we store the token on the device. This is the primary means of maintaining user session. This token is used for all subsequent API activity, as the app coordinates efforts with the web service.
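As a sketch of what the service side might look like, here is a Rails-style controller action that creates a token-only user; the model, field, and route names are assumptions for illustration, not the actual implementation.

# app/controllers/users_controller.rb (hypothetical)
class UsersController < ApplicationController
  # POST /api/v1/users -- called by the app on first launch; no params required
  def create
    user = User.create!(token: SecureRandom.uuid)
    render json: { token: user.token }, status: :created
  end
end

The app stores the returned token on the device (for example, in the keychain) and sends it with every subsequent request.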

Just-In-Time Access Prompting

Many apps ask for a bunch of access right up front, resulting in an unknown number of prompts stacked up when the user launches the app the first time. Often, developers incorrectly assume the user will allow access whenever the system prompts them. This is a slippery slope to a bad user experience. As a rule, your app should function regardless of user-authorized access. If the user declines the location services access prompt, the app should still allow them to navigate a map to find value. If they accept location services, the map should center on their location, but still allow them to navigate. The key here is that the app only prompts when the user has taken some action to achieve some task that may benefit from the services for which access is requested. For example, instead of prompting for push notification access on first launch, wait to prompt until the user has requested a messaging-related feature. That way, they maintain a perceptual context of their action, making it more obvious that they are being prompted for their convenience, as a mechanism for easing them through their current task.

Prefer In-App Messaging Over Email

Vendors like to keep in touch with their customers. This enables a richer, more interactive experience for the customer. Maintaining a conversation with users makes it much easier to guarantee consistent engagement. However, this need not happen via email. In some cases, such as newsletters and other infrequent scenarios, email may be the best choice. In those cases, prompt for access to user contact info only after the user taps the “Subscribe to Newsletter” button. Again, this improves the perception of value in the eyes of the user. It’s clear they are being asked to allow access to their personal information, so they don’t have to type anything. If the goal is to provide real-time information related to the app or the user’s app data, that should be done via push notification. In-app messaging solutions do not require any personal information to be shared by the user. All the major app platforms use anonymous tokens to coordinate messaging between server and device.

Emerging Best Practice

The goal of this strategy is to provide more ease and grace to the on-boarding process. By reducing barriers to entry and leveraging OS features for accessing device capabilities to eliminate the data entry tasks of setting up an app, we dramatically increase the quality of the user experience. This allows the user to dive head first into the value being delivered by the app. Use convenience features to reduce the cognitive load on your users, and don’t require them to do anything before they can use your app. Get them to the content as quickly as you can. They will thank you for it, hopefully with their wallets.

Client-Configurable Web Services

August 6, 2013

Whatever their information architecture (IA) and object model, nearly all modern data providers are adding APIs to their product offerings. Web services offer the expressive power of extensible structured data with the convenience of common formatting standards (JSON, XML). More conveniently still, just as the producer is free to build the service in any language they like, the consumer is free to implement a client in any language or framework of their choice. All either party needs to do is conform to a common protocol.

As web services have evolved, the protocol standards have become more sophisticated. RESTful design and naming standards promoted by the Rails community are quickly emerging as the dominant practice. Last week, we published a gem, called serviceable (https://rubygems.org/gems/serviceable), with support for standard simple endpoints. These endpoints included support for client-side configuration of the response content. In other words, the client could include directives in the query string to dictate the structure of the response data. Today, we took it one step further.

Using last week’s release, you can add only=id,name,created_at to the query string of any serviceable endpoint to return only the id, name, and created_at fields. This week's release adds the ability to configure basic associations as well. If you know how the objects in the API are related, you can include them in the query string. Here's an example of a user request, including a limited user detail and a list of the user's playlists:

GET /api/v1/user.json?only=id,created_at,first_name,last_name&include=playlists
{
  "id": 123,
  "created_at": "20130801T12:34:56.789Z",
  "first_name": "Bill",
  "last_name": "Brasky",
  "playlists": [
    {"id": 123, "name": "Pillaging"},
    {"id": 124, "name": "Drinking"}
  ]
}
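For illustration, here's how a consumer might build that same request from Ruby with HTTParty; the host name is a placeholder.

require 'httparty'

# Query params mirror the request shown above: restrict fields, include playlists
response = HTTParty.get(
  "https://example.com/api/v1/user.json",
  query: {
    only:    "id,created_at,first_name,last_name",
    include: "playlists"
  }
)
user = response.parsed_response
puts user["playlists"].map { |p| p["name"] }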

Simply by knowing about the relationships between objects in the data model, the client can request rich data sets. Granted, there will need to be some security precautions. The producer will likely want to whitelist the associations that are available to the API consumer. In the end, the result is a high-performance web service endpoint architecture that can be easily customized by the client. How many times have you found yourself thinking,

"oh man, if only this web service could give me some associated information, I could make three requests instead of fifty"

If you're reading this, I suspect it has at least crossed your mind. With serviceable, just ask in the query string. The gem does the rest.

Kinematic Simulation of a Dragonfly Wing with SolidWorks

July 24, 2013

If you’re anything like me, you watched this mesmerizing video clip before you started reading. I’m guessing you watched it twice. I watched it a lot more than that, and it’s still fascinating to me.

Isn’t it amazing how easy it is to watch it over and over? I would have published a shorter clip that loops, so you can stare at it for long stretches like I did. Sadly, YouTube doesn’t allow you to play it back in a loop. So, what exactly are you looking at?

The technical term is entomopter (http://en.wikipedia.org/wiki/Entomopter). That is, a vehicle that mimics insects to achieve flight. Purists will enjoy learning this is a subset of a larger class of devices, called ornithopters, which use flapping wings instead of airfoils to produce lift.

Whereas typical human flight systems rely on airflow over an airfoil to generate lifting force, insects move their wings in a cyclic motion to generate this force. For airfoils, there is usually a consistent flow over the surface. As long as the airframe doesn’t rotate significantly about any axis (pitch, roll, or yaw), the flow relative to the wings is steady. This is not true for insect wings. They change angle of attack continuously and flow is anything but steady. Moreover, the aerodynamics at the scale of insect wings are dominated by turbulent effects, not laminar effects as in high speed fixed wing aircraft. This area of aerodynamic theory is largely unstudied. So, it’s a challenging problem with significant unknowns.

In order to gain some insight into the aerodynamics, I set out to understand the basic wing motion that might produce lift. That meant visualizing the wing motion in a simulation. After sitting for many hours beating my arms back and forth, I began to understand the basic principles. Aside from giving me a reputation as the “crazy flapping wing guy” in the lab, this helped me to envision a linkage system that might produce the motion I was representing with my arms. As I expanded the vision into a simulated environment, it became clear that the driving system for the linkage is critical to the performance of the mechanism.

In the video above, you may notice that the pitch and yaw of the wings are dictated by the combined motion of two blocks. If the blocks move together (0deg phase angle between them), the wing motion is pure yaw (back and forth). If the blocks move opposite to each other (180deg phase angle), the wings move in pitch only (rotation about the long axis of the blade). When the phase angle is adjusted correctly, the motion of the wings is exactly what you might expect. In the video, the phase angle is 30deg. The phase angle and geometry of the linkage system dictate the amplitude of wing tip oscillation and the angle of attack at zero yaw. These parameters can be adjusted to achieve different wing motion profiles.
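To make that relationship concrete, here is a toy numeric sketch: two blocks driven sinusoidally with a phase offset, where the common-mode motion maps to yaw and the differential motion maps to pitch. The sinusoidal drive and the simple sum/difference mapping are assumptions for illustration, not the SolidWorks linkage itself.

# Toy model: in-phase block motion -> pure yaw, opposed motion -> pure pitch
PHASE_DEG = 30.0
phase = PHASE_DEG * Math::PI / 180.0

(0...360).step(30) do |deg|
  t  = deg * Math::PI / 180.0
  x1 = Math.sin(t)            # first drive block
  x2 = Math.sin(t - phase)    # second drive block, lagging by the phase angle
  yaw   = (x1 + x2) / 2.0     # common-mode component
  pitch = (x1 - x2) / 2.0     # differential component
  puts format("t=%3ddeg  yaw=%+.3f  pitch=%+.3f", deg, yaw, pitch)
end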

One final note about this mechanism concerns load balance. Without springs, the push rods must convey all the force to the wings. In other words, the aerodynamic forces resulting from wing motion will be carried in their entirety by the linkage system. That is, unless there is a spring involved somewhere, the drive motor must produce sufficient torque to overcome the aerodynamic load. By introducing a set of springs, we can use the motor to excite the natural oscillation of the system. This way, the springs do the heavy lifting (pun intended), dramatically reducing the wear on the motor.

Recording and Mixing Tips for Audio Podcasting

July 7, 2013

Over the last six months, I’ve been working with Justin Davis, user experience designer and founder of Madera Labs (http://maderalabs.com). We publish a one-hour weekly podcast about design and technology, wherein we drink bourbon and talk about the interaction between human and machine. If you’re reading this, there’s a good chance you’ll enjoy the podcast. It’s called Distilled. Find it on iTunes (https://itunes.apple.com/us/podcast/distilled/id574867062).

Along the way, I’ve learned a lot about audio recording, mixing, and editing. Justin was the front man for a band called Quarter to Nine, so he has a bunch of audio equipment. At some point, we looked at each other and said,

“Why the hell are we using an iPhone to record this podcast?”

Once we went from mono to stereo recordings, the editing task changed. It was no longer about redacting sensitive content or trimming uninteresting segments. Instead of recording whatever we wanted and cutting it down to an hour, we started training ourselves to record only an hour. This really helps cut down on the mixing and processing time. Having a routine also really helps, especially when you do it every week. Let’s dig into the details.

Gear and Settings

We’re using a Zoom H4n recorder with Shure SM57 and SM58 XLR mics. Justin uses the SM57, which survived a studio fire and still sounds great. I use the SM58. I’m not entirely sure, but I believe we typically record Justin on the left channel and me on the right. We use a 44.1kHz sample rate and 16-bit resolution, resulting in a 60min WAV file about 650MB in size. This file is stored on an SD card, which makes it easy to transfer to my MacBook Pro for post-processing.

Production Style

Until very recently, we recorded everything in high quality and published a mono clip, optimized for voice. This merges everything from left and right channels together evenly into one track. The primary reason for this was originally file size. Mono requires half the data of stereo. Recently, we started publishing in stereo. However, without any mixing, the result is Justin in your left ear and me in your right. That’s not really what we are going for. We added an extra step in the post-processing workflow to mix the two channels inward toward the center. In other words, each channel bleeds over into the other.

Mixing Strategy

First, I run the raw audio through Levelator to smooth out the dynamic range. Next, I open the output file in Audacity. This shows two distinct channels together as a stereo track. I split the track to mono, leaving two distinct mono tracks. I pan Justin’s track 30% left and mine 30% right to give the track a bit more depth. I export to the same quality as the source. Finally, I import into GarageBand for transcoding.
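In principle, the channel-bleed step could be scripted instead of done by hand in Audacity. Here is a minimal sketch using the wavefile gem; the file names and the 70/30 blend ratio (meant to approximate the 30% pan above) are assumptions.

# Sketch: bleed each stereo channel into the other by a fixed ratio
require 'wavefile'
include WaveFile

BLEED = 0.3
format = Format.new(:stereo, :pcm_16, 44100)

reader = Reader.new("levelated.wav", format)
Writer.new("mixed.wav", format) do |writer|
  reader.each_buffer(4096) do |buffer|
    blended = buffer.samples.map do |left, right|
      [((1 - BLEED) * left + BLEED * right).round,
       ((1 - BLEED) * right + BLEED * left).round]
    end
    writer.write(Buffer.new(blended, format))
  end
end
reader.close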

Distribution

It’s important to include metadata about the clip before publishing. We use Podomatic for hosting and delivery of episodes. Most hosting providers will include meta content from ID3 tags in the published RSS feed. iTunes uses only the information in the RSS stream to describe your podcast to listeners, so you want to make sure the content in your RSS feed is accurate and relevant to the episode content. I like to write the title and description for each episode as soon after recording as possible. Otherwise, I’m left listening to the whole hour again to make sure the title matches the discussion.
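The ID3 tagging itself can also be scripted. Here is a sketch using the ruby-mp3info gem; the gem choice and the field values are assumptions for illustration, not necessarily what we use.

require 'mp3info'

Mp3Info.open("episode.mp3") do |mp3|
  mp3.tag.artist   = "Distilled"
  mp3.tag.album    = "Distilled"
  mp3.tag.title    = "Hypothetical Episode Title"
  mp3.tag.comments = "A description of this week's discussion goes here."
end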

 

I typically spend about an hour in post-production each week. I’m certain the process could be completely automated, taking the raw audio file and producing MP4 as output. I’d also like to see an automated voice recognition layer that would annotate the conversation with time-stamped tags. I’m picturing a media player with a tag cloud visualization overlay. You could see on a timeline when the conversation topic changes, making it easy to navigate to the segment that interests you.

Uptimetry 2.0: Advanced URL Monitoring with Nokogiri and HTTParty

June 24, 2013

We’re working on a great new feature for the upcoming Uptimetry 2.0 release. In the process, I cobbled together this magnificent Ruby one-liner that I simply couldn’t keep quiet about.

Nokogiri::HTML(HTTParty.get(url)).xpath("//*/@href").map(&:value).select {|e| e[0..3]=='http'}.select {|e| e.match(URI.parse(url).host).nil?}

That's quite a mouthful. Let's decompose it:

HTTParty retrieves the contents of the given url.
HTTParty.get(url)

Nokogiri parses the response body as HTML.
Nokogiri::HTML(...)

Nokogiri performs an XPath match to find elements with href attributes.
.xpath("//*/@href")

From the resulting set, we reduce to the values of the attributes.
.map(&:value)

Then, we select only values starting with "http".
.select {|e| e[0..3]=='http'}

Finally, we remove any URLs pointing to resources on the same domain.
.select {|e| e.match(URI.parse(url).host).nil?}
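Pulled into a small script with the requires it needs, the same chain might look like this; the method name is just for illustration, and the response body is passed to Nokogiri explicitly.

require 'httparty'
require 'nokogiri'
require 'uri'

# Collect links on the page that point to other domains
def external_links(url)
  host = URI.parse(url).host
  Nokogiri::HTML(HTTParty.get(url).body)
          .xpath("//*/@href")
          .map(&:value)
          .select { |href| href.start_with?("http") }
          .reject { |href| href.match(host) }
end

puts external_links("http://uptimetry.com")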

We're left with an array of URLs linking to external resources. If any of these links are dead, the user experience will suffer. Uptimetry (http://uptimetry.com) already offers a powerful cloud-based solution for URL monitoring. Soon, we will offer the option to crawl your web properties automatically for external links to monitor, saving you time and giving you peace of mind every month.

Semantic Text Processing with CoWordinate

June 19, 2013

Today, we encountered a situation where we needed to determine the part of speech of words within a tag cloud widget. The original intent was to group verbs in one place, nouns in another, isolate and highlight proper names, and eliminate pronouns and prepositions. We also wanted to color-code words based on their part of speech. We could do this on the server when we generate the markup or on the client when we render it. In order to do it at all, though, we need some API that provides metadata about each word.

So, we built one. :)

Introducing CoWordinate, a RESTful web service and JavaScript API for determining the context of any word. Simply make a request to the web service endpoint or include a JavaScript code snippet in your web app, and you can see what kind of word it is. It’s super simple to use, whether through a raw web service request or our JavaScript library. Use it today in your apps to help make sense of your metadata presentation layer.

http://cowordinate.com
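As a sketch of how a lookup like this might be called from Ruby: the endpoint path and response field below are hypothetical stand-ins, not the actual CoWordinate API.

require 'httparty'
require 'json'

word = "distill"
# Hypothetical endpoint shape, for illustration only
response = HTTParty.get("http://cowordinate.com/api/v1/words/#{word}.json")
info = JSON.parse(response.body)
puts info["part_of_speech"]   # hypothetical field name, e.g. "verb"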
