Apple study shows LLMs can tell what you’re doing from audio data

Apple researchers have published a study that looks at how LLMs can analyze audio and motion data to get a better overview of user activities. Here are the details.

They’re good at it, but not in a creepy way

A new paper titled “Using LLM for Late Multimodal Sensor Fusion for Activity Recognition” provides insight into how Apple might consider incorporating LLM analysis with traditional sensor data to gain a more accurate understanding of user activity.

The researchers argue that the approach has great potential to make activity analysis more accurate, even in situations where sensor data alone isn't enough.

From the researchers:

“Sensor data streams provide valuable information about activities and context for downstream applications, although integrating complementary information can be challenging. We show that large language models (LLMs) can be used for late fusion for activity classification from audio and motion time series data. We prepared a subset of the data from the Ego4D dataset for activity recognition across diverse contexts (e.g., household activities, sports). The LLMs we evaluated achieved 12-class zero- and one-shot classification without any task-specific training, with F1-scores significantly higher than chance. Zero-shot classification via LLM-based fusion can enable multimodal temporal applications where there is limited aligned training data to learn shared embedding spaces. Additionally, LLM-based fusion can enable model deployment without requiring additional memory and computation for targeted application-specific multimodal models.”

In other words, LLMs turn out to be quite good at guessing what a user is doing from basic audio and motion signals, even without being specifically trained for the task. And when given just a single example, their accuracy improves further.

An important detail is that in this study, the LLMs were not given the actual audio recordings, but rather short text descriptions generated by an audio model and an IMU-based motion model (which tracks movement through accelerometer and gyroscope data), as shown below:

[Image: Apple's LLM audio study pipeline]
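To make that concrete, here's a minimal sketch (in Python, not code from the paper) of what that late-fusion step looks like in practice: the audio captioner and the IMU classifier each emit a short piece of text, and those strings, rather than the raw signals, are stitched into a single prompt for the LLM. The function name and example outputs below are hypothetical.

```python
# Minimal sketch of LLM-based late fusion over text outputs (illustrative, not the paper's code).
# The audio captioner and IMU classifier are assumed to have already run;
# only their short text outputs reach the LLM.

def build_fusion_prompt(audio_caption: str, audio_label: str, imu_prediction: str) -> str:
    """Combine the per-modality text outputs into one classification prompt."""
    return (
        "Below are descriptions of a 20-second recording from a wearable device.\n"
        f"Audio caption: {audio_caption}\n"
        f"Audio event label: {audio_label}\n"
        f"Motion (IMU) model prediction: {imu_prediction}\n"
        "What activity is the person most likely doing? Answer with a short phrase."
    )

# Hypothetical per-modality outputs for one segment.
prompt = build_fusion_prompt(
    audio_caption="water running and dishes clinking in a sink",
    audio_label="water tap, dishes",
    imu_prediction="standing with repetitive arm movement",
)
print(prompt)  # This text, not the audio itself, is what the LLM sees.
```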

Digging a little deeper

In the paper, the researchers explain that they used Ego4D, a massive dataset of media shot from a first-person perspective. It includes thousands of hours of footage of real-world environments and situations, from household tasks to outdoor activities.

From the study:

“We created a dataset of day-to-day activities from the Ego4D dataset by exploring activities of daily living within the provided descriptions. The curated dataset contains 20-second samples from twelve high-level activities: vacuum cleaning, cooking, washing clothes, eating, playing basketball, playing football, playing with pets, reading a book, using a computer, washing dishes, watching TV, and exercising/weightlifting. These activities cover a range of household and fitness tasks and were selected from the larger dataset based on their comprehensiveness.”
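As a rough illustration of that curation step (not the authors' actual tooling, and the field names are assumptions rather than the Ego4D schema), carving a labeled 20-second sample out of a longer clip could look like this:

```python
# Rough sketch of extracting a labeled 20-second sample from a longer clip
# (illustrative only; the field names are assumptions, not the Ego4D schema).

SAMPLE_SECONDS = 20

def extract_sample(segment_id: str, clip_duration_s: float, start_s: float, label: str) -> dict:
    """Return metadata for a 20-second labeled window within a longer clip."""
    end_s = start_s + SAMPLE_SECONDS
    if end_s > clip_duration_s:
        raise ValueError("window runs past the end of the clip")
    return {"segment_id": segment_id, "start_s": start_s, "end_s": end_s, "label": label}

sample = extract_sample("clip-0001", clip_duration_s=300.0, start_s=42.0, label="washing dishes")
print(sample)
```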

The researchers ran audio and motion data through small models that generated text captions and class predictions, then fed those outputs into different LLMs (Gemini-2.5-Pro and Qwen-32B) to see how well they could recognize the activity.

The researchers then compared the models' performance under two conditions: one in which they were given a list of the 12 possible activities to choose from (closed-set), and another in which they were given no options (open-ended).
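Here's a small sketch of how those two conditions might differ at the prompt level (the wording is assumed, not taken from the paper): in the closed-set condition the prompt lists the twelve candidate activities, while the open-ended condition leaves the answer space unconstrained.

```python
# Sketch of the closed-set vs. open-ended conditions (assumed prompt wording).

ACTIVITIES = [
    "vacuum cleaning", "cooking", "washing clothes", "eating",
    "playing basketball", "playing football", "playing with pets",
    "reading a book", "using a computer", "washing dishes",
    "watching TV", "exercising/weightlifting",
]

def make_prompt(fused_description: str, closed_set: bool) -> str:
    """Wrap the fused per-modality text in either a closed-set or an open-ended question."""
    question = "What activity is the person doing?"
    if closed_set:
        question += " Answer with exactly one of: " + ", ".join(ACTIVITIES) + "."
    return f"{fused_description}\n{question}"

fused = "Audio: dishes clinking, water running. Motion: standing with repetitive arm movement."
print(make_prompt(fused, closed_set=True))   # constrained to the 12 labels
print(make_prompt(fused, closed_set=False))  # the model answers freely
```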

For each trial, the models were given different combinations of audio captions, audio labels, IMU activity predictions, and additional context. Here's how they performed:

[Image: closed-set and open-ended results for Gemini and Qwen]
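And to put "F1-scores significantly higher than chance" in context, here's a small scoring sketch: each condition's predictions are compared against the ground-truth labels with macro F1 and set against a random-guessing baseline. The labels below are made up for demonstration.

```python
# Illustrative scoring sketch: macro F1 for one condition vs. a chance baseline.
# The ground truth and predictions below are made up for demonstration.
import random
from sklearn.metrics import f1_score

labels = ["cooking", "washing dishes", "watching TV", "using a computer"]  # subset for brevity

ground_truth    = ["cooking", "washing dishes", "washing dishes", "watching TV", "using a computer"]
llm_predictions = ["cooking", "washing dishes", "cooking", "watching TV", "using a computer"]

llm_f1 = f1_score(ground_truth, llm_predictions, average="macro")

# Chance baseline: guess uniformly at random over the label set.
random.seed(0)
chance_predictions = [random.choice(labels) for _ in ground_truth]
chance_f1 = f1_score(ground_truth, chance_predictions, average="macro")

print(f"LLM macro F1:    {llm_f1:.2f}")
print(f"Chance macro F1: {chance_f1:.2f}")
```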

Finally, the researchers note that the results offer interesting insights into how combining multiple models can improve activity and health analysis, especially in cases where raw sensor data alone is insufficient to provide a clear picture of a user's activity.

Perhaps more importantly, to assist researchers interested in reproducing the results, Apple published supplemental materials alongside the study, including Ego4D segment IDs, timestamps, signals, and one-shot examples used in the experiments.
