Synthetic data isn’t enough, so Apple wants users’ own data to help improve its AI.

Cupertino is already moving to strengthen one of the largest software ventures in its recent history. The effort has two fronts: the rumored restructuring of the team behind Siri, whose improved version has been delayed until 2026, and the creation of new techniques to improve its language models while maintaining its focus on privacy.

A step beyond synthetic data
Apple typically trains its models with synthetic data and data labeled by humans, a solution that has proven effective only to a certain extent: such data doesn’t always represent the real world, which limits the performance of AI products. This has led the company headed by Tim Cook to develop a new approach that combines synthetic data with anonymized signals from participating devices.

As explained in an article published this week, it all starts with a synthetic message, that is, an email invented by Apple itself with a format that simulates real emails. For example: “Would you like to play tennis tomorrow at 11:30?” From there, several variants are generated that change some elements, such as the sport, time, or tone, to test different possible structures.
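
The variant step can be pictured with a small sketch. The Python below is illustrative only: the template, the option lists, and the function name generate_variants are assumptions made for this example, not Apple's actual pipeline.

```python
from itertools import product

# Hypothetical template and options, chosen only to mirror the tennis example.
TEMPLATE = "{opener} play {sport} tomorrow at {time}?"
OPTIONS = {
    "opener": ["Would you like to", "Want to"],  # changes the tone
    "sport": ["tennis", "soccer", "basketball"],
    "time": ["11:30", "17:00"],
}

def generate_variants(template, options):
    """Return the template filled with every combination of option values."""
    keys = list(options)
    return [
        template.format(**dict(zip(keys, combo)))
        for combo in product(*(options[k] for k in keys))
    ]

variants = generate_variants(TEMPLATE, OPTIONS)
print(len(variants))  # 2 * 3 * 2 = 12 combinations
print(variants[0])    # Would you like to play tennis tomorrow at 11:30?
```

A real system would presumably generate variants with a language model rather than a fixed template; the template here only makes the idea of swapping sport, time, and tone concrete.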

These variants are sent to a portion of the devices whose users have agreed to share analytics with Apple. Here’s where something key happens: each iPhone, iPad, or Mac takes a handful of the user’s real emails and converts them locally into embeddings, that is, mathematical representations that turn each message into a set of numbers reflecting its subject, style, and length. The important thing is that these emails never leave the device.
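
To make the idea of an embedding concrete, here is a toy sketch in Python. It is not the model Apple uses: the function toy_embedding, the vector size DIM, and the word-hashing approach are all assumptions for illustration. A production embedding comes from a learned model and captures far more than word counts.

```python
import math
import zlib

DIM = 8  # hypothetical vector size; real embedding models use far more dimensions

def toy_embedding(message, dim=DIM):
    """Map a message to a fixed-length unit vector using the hashing trick."""
    vec = [0.0] * dim
    for word in message.lower().split():
        # crc32 is deterministic, so the same word always lands in the same slot.
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Normalize to unit length; an empty message stays as the zero vector.
    return [x / norm for x in vec] if norm else vec

vector = toy_embedding("Would you like to play tennis tomorrow at 11:30?")
print(len(vector))  # 8
```

Whatever the message, the result is a list of DIM numbers, so the rest of the process can work with the numbers instead of the raw text of the email.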
