
The Machine Learning Landscape on iOS

// Written by Jordan Morgan // Jul 26th, 2023 // Read it in about 2 minutes // RE: Machine Learning

This post is brought to you by Emerge Tools, the best way to build on mobile.

As part of my research for the next chapter of my book series covering machine learning, I was honestly shocked at how much the landscape has grown for developers in that arena. Core ML has been around since iOS 11, which feels like a lifetime ago. Create ML has more recently surfaced (iOS 15), and it’s all but democratized training up custom models.

What used to be a delicate dance of third-party packages and custom instructions for each model is now housed within a tidy, easy-to-understand Mac app. You used to have to sing for your supper to make a custom model, much less one that Core ML could use - and now? You can use Continuity Camera to record audio and train up a model all while sipping on coffee.

And, all of that is great.

Where the magic lies, in my opinion, is in how all of these frameworks have built up an impressive API surface area leveraging models already made in-house for us. In fact, you can pull off some crazy stuff without even knowing there is a model powering it at all. The following is an excerpt taken directly from my book’s upcoming chapter on machine learning.

It covers some of the use cases and machine learning APIs we can use without creating custom models. Frameworks like Vision, Speech, Sound Analysis and Natural Language do the complicated stuff for us - and we can just call functions to leverage it. It’s honestly amazing - and I don’t think we, as a community, are paying enough attention to it, or even realize all of this stuff is here waiting to be used.

I think you’ll be impressed at how large this list is. Again - straight from my book:

Book excerpt begins…

Vision

Vision applications for machine learning are probably the most recognizable flavor of the four (Vision, Natural Language, Speech and Sound). Using frameworks such as Vision and VisionKit, you can use several under-the-hood models to perform plenty of interesting tasks. If you need to process, change or analyze video or images, the Vision framework is a perfect choice.

So, what kinds of things can we do here? Here are a few (with a quick code sketch after the list):

  • Classification: Identify certain aspects or general content in an image as a whole.
  • Saliency: Quantify the key parts of an image. Basically — it answers the question of where are people most likely to focus when they look at a given image.
  • Alignment: Computes how to align the content of two images.
  • Similarity: How close in subject matter are two images?
  • Detection: Label and identify objects within a photo. This is different from classification — one deals with the image as an entire unit, this one finds individual objects within a specific image.
  • Tracking: Find and track objects within a video.
  • Trajectory Detection: Identify the trajectory of objects in motion from a video.
  • Contour Detection: Traces the edges of objects in images and videos — kind of like a “magic lasso” in photo editing software.
  • Text Detection: Find and identify regions of text.
  • Text Recognition: Similar to the above, but this also extracts the text in addition to simply identifying its presence.
  • Face Detection: Just like it sounds — find faces in images or videos.
  • Face Tracking: Again, similar to the above — but this tracks faces in real time.
  • Face Landmarks: Finds facial features in an image, such as where the nose or eyes are.
  • Face Capture Quality: Compares facial capture quality between a set of images.
  • Human Body Detection: Finds regions that have, well, humans in them.
  • Body Poses: Detects landmarks on people in images and videos.
  • Hand Poses: The same as above, but unique to hand poses. You could use this to detect something like sign language, for example.
  • Animal Detection: Finds the pets! Limited to cats and dogs.
  • Barcodes: Finds and recognizes any barcode in an image.
  • Rectangle Detection: Finds rectangular regions in photos.
  • Horizon Detection: Analyzes where the horizon angle lies within a photo.
  • Optical Flow: Analyzes the pattern of motion of objects between video frames.
  • Person Segmentation: Generates a matte image for a person in an image — basically, it can pull out a person from a photo. This is similar to how the subject matter “lift” works on photos, where you can tap and hold on a person to lift it from a photo in your camera roll.
  • Document Detection: Finds rectangular regions in images that have text in them.
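
To make that concrete, here’s a minimal sketch, assuming a local image URL, that runs two of those requests (classification and text recognition) over a single image:

```swift
import Vision

// A minimal sketch: run two of Vision's built-in requests over one image.
// The image URL is a hypothetical placeholder.
func analyze(imageAt url: URL) throws {
    let classify = VNClassifyImageRequest()
    let readText = VNRecognizeTextRequest()
    readText.recognitionLevel = .accurate

    // One handler can perform several requests in a single pass.
    let handler = VNImageRequestHandler(url: url)
    try handler.perform([classify, readText])

    // Classification: labels for the image as a whole.
    for observation in (classify.results ?? []).prefix(3) {
        print("\(observation.identifier): \(observation.confidence)")
    }

    // Text recognition: the extracted strings, best candidate first.
    for observation in readText.results ?? [] {
        if let candidate = observation.topCandidates(1).first {
            print(candidate.string)
        }
    }
}
```

Nothing in there required training, downloading or bundling a model; Vision ships with all of it.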

As you can see, that is a…large list. Perhaps you can see why I think it’s more effective to gain a general understanding of what’s out there in terms of machine learning instead of diving into one particular topic. When you know what you can do, then you can dive into the how later on.

Natural Language

The Natural Language framework has many uses for handling bodies of text. For example, what words are in a sentence? Which are nouns or verbs? Here are some specific capabilities (with a short sketch after the list):

  • Tokenization: Finds and loops through the words in strings.
  • Language Identification: Recognizes the language a piece of text is written in.
  • Named Entity Recognition: This uses a linguistic tagger to name entities found within a string. For example, if you analyzed this sentence: “This book is popular in Missouri”, it might say that “book” is a product and “Missouri” is a location.
  • Parts of Speech: Identifies verbs, nouns, adjectives and other parts of speech in a string.
  • Word Embedding: This finds similar words or strings based on the proximity of their vectors. Basically, how close together are similar words?
  • Sentence Embedding: The same as above, but this time — it operates on entire sentences.
  • Sentiment Analysis: Attempts to determine the sentiment of a text — is it positive, negative or neutral?
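
Here’s a minimal sketch covering two items from that list, parts of speech and sentiment analysis, built on NLTagger. The sample string is just a placeholder:

```swift
import NaturalLanguage

let text = "This book is popular in Missouri."

// One tagger can handle multiple schemes at once.
let tagger = NLTagger(tagSchemes: [.lexicalClass, .sentimentScore])
tagger.string = text

// Parts of speech: enumerate word tokens and print each word's lexical class.
tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                     unit: .word,
                     scheme: .lexicalClass,
                     options: [.omitPunctuation, .omitWhitespace]) { tag, range in
    if let tag = tag {
        print("\(text[range]): \(tag.rawValue)") // e.g. "book: Noun"
    }
    return true // keep enumerating
}

// Sentiment: a score from -1.0 (negative) to 1.0 (positive), per paragraph.
let (sentiment, _) = tagger.tag(at: text.startIndex, unit: .paragraph, scheme: .sentimentScore)
print("Sentiment: \(sentiment?.rawValue ?? "n/a")")
```

Tokenization falls out of the same API, too; swap in the .tokenType scheme, or use NLTokenizer directly.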

Speech

One of the “smallest” of the four, but no less useful — the Speech framework allows you to recognize speech in any type of audio (live or prerecorded) to produce transcriptions, confidence levels and alternative interpretations.
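
As a rough sketch, transcribing a prerecorded file looks something like this. The file URL is a placeholder, and a shipping app must first ask permission via SFSpeechRecognizer.requestAuthorization(_:):

```swift
import Speech

// A minimal sketch: transcribe a prerecorded audio file.
func transcribe(fileAt url: URL) {
    // The default initializer uses the device's locale.
    guard let recognizer = SFSpeechRecognizer(), recognizer.isAvailable else {
        print("Speech recognition is unavailable")
        return
    }

    let request = SFSpeechURLRecognitionRequest(url: url)

    // In a real app, keep the returned task around to support cancellation.
    _ = recognizer.recognitionTask(with: request) { result, error in
        guard let result = result else {
            print("Recognition failed: \(error?.localizedDescription ?? "unknown error")")
            return
        }
        if result.isFinal {
            // The best transcription, plus a confidence value per spoken segment.
            print(result.bestTranscription.formattedString)
            for segment in result.bestTranscription.segments {
                print("\(segment.substring): \(segment.confidence)")
            }
        }
    }
}
```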

Sound

Finally, we’ve got the Sound Analysis framework. Like the Speech framework, it’s also focused on one job: determining particular sounds in audio. For example, is a baby crying? Is an audience laughing?
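
And a minimal sketch of that, running Apple’s built-in classifier (iOS 15 and up) over an audio file; the URL is a placeholder:

```swift
import SoundAnalysis

// Receives classification results as the analyzer works through the audio.
final class ResultsObserver: NSObject, SNResultsObserving {
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let top = result.classifications.first else { return }
        // The top label and its confidence for this window of audio.
        print("\(top.identifier): \(top.confidence)")
    }
}

func classifySounds(fileAt url: URL) throws {
    // The identifier picks Apple's built-in classifier; no custom model needed.
    let request = try SNClassifySoundRequest(classifierIdentifier: .version1)
    let analyzer = try SNAudioFileAnalyzer(url: url)
    let observer = ResultsObserver()

    try analyzer.add(request, withObserver: observer)
    analyzer.analyze() // blocks while it processes the file, reporting to the observer
}
```

The same request plugs into SNAudioStreamAnalyzer if you need to classify live audio instead.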

Book excerpt ends…

With all of that power - surely there is a killer use case for your own app in there.

Until next time ✌️
