Leveraging Sound Analysis to the Tune of 300 Sounds

// Written by Jordan Morgan // Apr 11th, 2022 // Read it in about 4 minutes // RE: Sound Analysis


Is it just me, or does Apple seem to roll out more machine learning advancements with nearly every new OS release lately? Sure, CreateML was a big one, but that’s a developer-facing tool. Beyond things meant for us code wranglers, we can’t ignore that iOS seems to make millions of choices each day based on what custom models conjure up: Siri handling requests, a watch detecting a fall or your iPhone’s mic automatically picking up sounds in the environment.

To that end, I stumbled upon the Sound Analysis framework a few weeks ago and was impressed at its breadth and depth. While I still feel like a novice when it comes to…uh, anything machine learning related, this framework has done a solid for us all by shipping with a built-in sound classifier that can detect over 300 sounds:

All of the sounds that Sound Analysis can find out of the box.

So if you don’t know your neural networks from your decision trees, you’re in luck. Apple has made classifying those sounds take about 20 lines of code, give or take.

Sound Off

Sound Analysis’ engineers must’ve listened to its developer audience, because looking at it from the start, the framework sounds easy to get started with – yielding only 8-10 top level objects to check out¹:

Some of the objects in the Sound Analysis framework.

If you’re curious which sounds you can detect with the classifier, you can query them all using SNClassifySoundRequest. All it needs is the identifier of a classifier to use. As I mentioned in the lede, there’s one such classifier that ships on-device. As such, to query it, use the extension Apple has given it off of SNClassifierIdentifier:

extension SNClassifierIdentifier {
    /**
     @brief Denotes the first edition of the Apple-provided sound classifier.
     */
    @available(iOS 15.0, *)
    public static let version1: SNClassifierIdentifier
}

From there, pipe it into the sound request object and off we go:

if let req = try? SNClassifySoundRequest(classifierIdentifier: .version1) {
    let sortedSounds: [String] = req.knownClassifications.sorted()
    sortedSounds.forEach {
        print($0)
    }
}

// As of iOS 15, you get something like:
accordion
acoustic_guitar
air_conditioner
air_horn
aircraft
airplane
alarm_clock
ambulance_siren
applause
artillery_fire
babble
baby_crying
baby_laughter
bagpipes
banjo
...and on 

You might be thinking that those titles look very…programmery. And they do. If you need to make them user facing, just snag some code courtesy of Apple’s demo project, which uses a localization table to translate them:

static func displayNameForLabel(_ label: String) -> String {
    let localizationTable = "SoundNames"
    let unlocalized = label.replacingOccurrences(of: "_", with: " ").capitalized
    return Bundle.main.localizedString(forKey: unlocalized,
                                       value: unlocalized,
                                       table: localizationTable)
}

let concatenatedThoughts = """

Keep in mind, that code could change over any iOS release. So, your mileage may vary and vary fast with the next WWDC on the horizon.

"""

Then you can use it like so:

let jankyName = "acoustic_guitar"
let nonJankName = displayNameForLabel(jankyName) // Acoustic Guitar

Listen to Me

So, we know what sounds we can query for. What next? The flow is fairly easy to grok and looks like this:

1. Get a hold of a sound classifier (done).
2. Figure out what sounds you want to listen for (done).
3. Identify the audio file to use.
4. Create an observer conforming to SNResultsObserving.

Step 3 is elementary:

let videoURL = getVideoURL()
let analyzer = try SNAudioFileAnalyzer(url: videoURL)

Step 4 is the meat and potatoes of the whole thing. An object conforming to SNResultsObserving is the connection point between the code that wants to know which sounds happened where and the magic that finds that information out. The protocol has one required function:

func request(_ request: SNRequest, didProduce result: SNResult)

And, like most of Apple’s API designs, it also has two optional functions you can use to know tangentially related information. In this case, when things failed or finished:

optional func request(_ request: SNRequest, didFailWithError error: Error)

optional func requestDidComplete(_ request: SNRequest)
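
To see how the three callbacks fit together, here is a minimal sketch of an observer that implements all of them. LoggingObserver is a made-up name for illustration, not something from Apple's sample, and it simply prints the most confident classification for each result:

```swift
import CoreMedia
import SoundAnalysis

/// Hypothetical observer, named here for illustration only.
final class LoggingObserver: NSObject, SNResultsObserving {
    // Required: called each time the analyzer produces a result.
    func request(_ request: SNRequest, didProduce result: SNResult) {
        // Classifications arrive sorted by confidence, highest first.
        guard let result = result as? SNClassificationResult,
              let top = result.classifications.first else {
            return
        }
        print("\(top.identifier) (\(top.confidence)) at \(result.timeRange.start.seconds)s")
    }

    // Optional: called if the request fails partway through.
    func request(_ request: SNRequest, didFailWithError error: Error) {
        print("Analysis failed: \(error.localizedDescription)")
    }

    // Optional: called once the request has finished without error.
    func requestDidComplete(_ request: SNRequest) {
        print("Analysis finished")
    }
}
```

Something like this is handy while exploring, since it shows what the classifier hears before you narrow things down to one sound.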

The idea here is to handle the results of the observation requests. Since we’re only concerned with applause sounds, the whole process might look like this:

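import Combine
import CoreMedia
import SoundAnalysis

class ApplauseDetector: NSObject, SNResultsObserving {
    private let applauseID: String = "applause"
    // Start time of the latest analysis window that reported an applause classification
    @Published var lastestTimeStamp: CMTime = .invalid
    
    func request(_ request: SNRequest, didProduce result: SNResult) {
        // Ignore anything that isn't a classification result containing our label
        guard let result = result as? SNClassificationResult,
              let _ = result.classification(forIdentifier: applauseID) else {
            return
        }
        
        lastestTimeStamp = result.timeRange.start
    }
}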

Then, putting it all together, we could find timestamps for applause from any audio file:

private let detector: ApplauseDetector = ApplauseDetector()
private var subs: [AnyCancellable] = []

guard let audioURL = getAudioURL(),
      let req = try? SNClassifySoundRequest(classifierIdentifier: .version1),
      let analyzer = try? SNAudioFileAnalyzer(url: audioURL) else {
    return
}
      
// Add our classifier and result handler
try? analyzer.add(req, withObserver: self.detector)

// Begin Combine observing and await comments
// For how I should use AsyncSequence 😉
// (Subscribe before analyzing: analyze() blocks until the file is processed)
self.detector.$lastestTimeStamp.sink { time in
    print("Found applause at \(time)")
}
.store(in: &self.subs)

// Kick off detection
analyzer.analyze()

And that’s it!
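
As for that AsyncSequence comment: one possible take, sketched under a few assumptions, is to read the published property through Combine's values property (iOS 15 and up). TimestampSource below is a stand-in I made up so the snippet is self-contained; in practice you would point it at the ApplauseDetector, and logApplause is likewise a hypothetical helper name:

```swift
import Combine
import CoreMedia

// Stand-in for the article's ApplauseDetector, reduced to its published property.
final class TimestampSource {
    @Published var lastestTimeStamp: CMTime = .invalid
}

/// Hypothetical helper: logs every valid timestamp until the returned task is cancelled.
@available(iOS 15.0, macOS 12.0, *)
func logApplause(from source: TimestampSource) -> Task<Void, Never> {
    Task {
        // `values` exposes a publisher that never fails as an AsyncSequence.
        // The initial `.invalid` placeholder is skipped by the `where` clause.
        for await time in source.$lastestTimeStamp.values where time.isValid {
            print("Found applause at \(time.seconds)s")
        }
    }
}
```

The loop runs for as long as the publisher is alive, so hang on to the returned task and cancel it when you are done listening.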

You may find that you need to make some tweaks to get what you’re after. Like many machine learning APIs, there is a confidence factor that you can play with:

// classification(forIdentifier:) returns an optional, so unwrap before comparing
let applause = result.classification(forIdentifier: applauseID)
let confidenceMet = (applause?.confidence ?? 0) > 0.7 // Or Whatever

Or, if you know roughly how long the sound lasts, giving the request a matching window duration will help too:

// Four second window for applause
req.windowDuration = CMTime(seconds: 4, preferredTimescale: 1000)

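In our case, there is no easy definition for how long applause is, so it helps to check which window durations the request supports. Those are described by SNTimeDurationConstraint, either as a list of durations or as a range: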

public enum SNTimeDurationConstraint {

    case enumeratedDurations([CMTime])

    case durationRange(CMTimeRange)
}
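
To peek at that constraint for a given request, something along these lines should work. It is a sketch that assumes the request's windowDurationConstraint property (iOS 15 and up), and describeWindowConstraint is a hypothetical helper name of mine:

```swift
import CoreMedia
import SoundAnalysis

/// Hypothetical helper that describes which window durations a request accepts.
@available(iOS 15.0, macOS 12.0, *)
func describeWindowConstraint(of request: SNClassifySoundRequest) -> String {
    switch request.windowDurationConstraint {
    case .enumeratedDurations(let durations):
        // A fixed list of permitted durations.
        let seconds = durations.map { $0.seconds }
        return "Allowed durations (s): \(seconds)"
    case .durationRange(let range):
        // Any duration between the range's start and end.
        return "Allowed range (s): \(range.start.seconds) to \(range.end.seconds)"
    @unknown default:
        return "Unrecognized constraint"
    }
}
```

Checking this before setting windowDuration means you can pick a value the classifier will accept rather than guessing.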

Further, if you don’t have a set audio file, there is also an object for streaming audio and finding things on the fly. That’s SNAudioStreamAnalyzer. Look at Apple’s demo project built in SwiftUI to see how to swing that.
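
For a rough idea of the streaming shape, here is a minimal sketch that taps the microphone with AVAudioEngine and hands each buffer to the analyzer. The class name, queue label and buffer size are my own illustrative choices rather than anything from Apple's demo, and it assumes the app already has microphone permission (plus a configured AVAudioSession on iOS):

```swift
import AVFoundation
import SoundAnalysis

/// Hypothetical wrapper around SNAudioStreamAnalyzer; names are illustrative.
@available(iOS 15.0, macOS 12.0, *)
final class LiveSoundListener {
    private let engine = AVAudioEngine()
    private let queue = DispatchQueue(label: "sound.analysis") // hypothetical label
    private var analyzer: SNAudioStreamAnalyzer?

    /// Starts the microphone and feeds every captured buffer to the classifier.
    /// Keep your own strong reference to `observer` to be safe.
    func start(observer: SNResultsObserving) throws {
        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)

        // The analyzer must be created with the same format as the incoming audio.
        let analyzer = SNAudioStreamAnalyzer(format: format)
        let request = try SNClassifySoundRequest(classifierIdentifier: .version1)
        try analyzer.add(request, withObserver: observer)
        self.analyzer = analyzer

        // Analyze off the audio thread so the tap callback returns quickly.
        input.installTap(onBus: 0, bufferSize: 8192, format: format) { [weak self] buffer, when in
            self?.queue.async {
                self?.analyzer?.analyze(buffer, atAudioFramePosition: when.sampleTime)
            }
        }
        try engine.start()
    }

    /// Stops capturing and drops any pending requests.
    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
        analyzer?.removeAllRequests()
    }
}
```

Any SNResultsObserving object can be passed to start(observer:), including the ApplauseDetector from earlier.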

Final Thoughts

These days I can erase my dross attempts at becoming a machine learning expert because…Apple has, in many cases, already done the work for me. The fact that something like this exists, and I just simply happened to stumble upon it while doc divin’ on Apple’s developer portal just goes to show that…what a freakin’ crazy time to be a developer, right? I think that if this shipped, say, even five years ago - it might’ve been the talk of the town.

Instead, Apple has advanced so far up the Machine Learning tree that it was barely a footnote. No matter, go forth and classify sounds free of charge thanks to the fine work from Cupertino & Friends™.

Until next time ✌️

¹ I promise I’m done with these terrible puns.