Scale AI

Scale AI’s vision is to be the foundational infrastructure behind AI/ML applications. The company began with data labeling and annotation used in building AI/ML models. Data labeling and data annotation involve tagging relevant information or metadata in a dataset to use for training an ML model. To train and build any ML algorithm, the model needs to be grounded on accurate data that is correctly labeled. Scale AI’s core value proposition is built around ensuring companies have correctly labeled data that allows them to build effective ML models. By building comprehensive datasets to train AI/ML applications, Scale AI seeks to enable developers to build accurate applications with increased capability and limited vulnerability.

Founding Date

Jun 1, 2016

Headquarters

San Francisco, California

Total Funding

$2B

Stage

Series F

Employees

501-1000


Updated

September 6, 2024


Thesis

It has been over ten years since Marc Andreessen’s famous pronouncement that software was eating the world. Whether in shopping, entertainment, healthcare, or education, software has become a key component of almost every aspect of life. Now, however, artificial intelligence (AI) and machine learning (ML) are starting to eat software. There are early examples of this, such as Tesla’s Autopilot, Waymo’s driverless taxis, GitHub Copilot, TikTok content recommendations, and on-the-fly context-aware automated website guides like Ramp’s AI tour guide. This trend could accelerate, as generative AI has been estimated to improve software engineering productivity by 20-45%, primarily by reducing the time required for tasks such as generating initial code drafts, code correction and refactoring, root-cause analysis, and generating new system designs. But all of this stems from one key thing: data. As Clive Humby pointed out in 2006, “Data is the new oil.”

A persistent issue with building AI/ML applications has been a lack of the well-organized data necessary for developing models. Data advantages are self-reinforcing: initial data collection leads to improved AI models, enhancing user experience; better user experiences attract more users, leading to more data collection. Over time, this cycle continually upgrades the quality of the AI model and the user experience. A scarcity of data, by contrast, extends the timelines required to build AI models and reduces their accuracy. Without a strong dataset to train on, applications can often exhibit decreased capabilities and increased vulnerabilities. Moreover, a lack of data can prevent an application from being developed altogether. For example, in medical research, the limited availability of data for diagnosing rare diseases and conditions makes building an accurate AI application for identifying such conditions challenging and often unreliable.

One meaningful tailwind for building AI applications has been the dramatic increase in the volume of data available. The data being generated doesn’t just stem from the digital world, but from the physical world as well. As early as the 1960s, technology started impacting the physical world through projects like the Stanford Cart. Through advances like computer vision, sensor fusion, robotics, and autonomous vehicles, the volume of physical data has increased significantly. However, to leverage these types of data in building AI applications, the data needs to be not just available, but organized.

That’s where Scale AI comes in. Scale AI’s vision is to be the foundational infrastructure behind AI/ML applications. The company began with data labeling and annotation used in building AI/ML models. Data labeling and data annotation involve tagging relevant information or metadata in a dataset to use for training an ML model. To train and build any ML algorithm, the model needs to be grounded on accurate data that is correctly labeled. Scale AI’s core value proposition is built around ensuring companies have correctly labeled data that allows them to build effective ML models. By building comprehensive datasets to train AI/ML applications, Scale AI seeks to enable developers to build accurate applications with increased capability and limited vulnerability.


Founding Story

Scale AI was founded in 2016 by Alexandr Wang (CEO), an MIT dropout, and Lucy Guo, a Carnegie Mellon dropout and Thiel Fellow.

In 2015, Wang enrolled at MIT to study computer science, where he received perfect grades in his first year. Wang had the insight that artificial intelligence and machine learning were going to transform the world.

"First we built machines that could do arithmetic, but the idea that you could have them do these more nuanced tasks that required what we view as humanlike understanding was this very exciting technological concept."

To further explore AI, Wang started with a fairly small-stakes problem: knowing when to restock his fridge. That project eventually led to the creation of Scale AI. Obsessed with the grocery problem, Wang decided to build a camera inside his fridge to tell him when he was running low on milk. At this point, he realized that there was not enough data available to train his system to quantify the contents of the fridge properly. He also noticed his peers weren’t building AI products despite their training because there was a lack of well-organized data available for them to develop models.

Wang extrapolated this problem to the implications for AI in general, realizing that data would clearly be a meaningful hurdle. That’s where Scale AI was born. Wang had identified a hole in the market: in order to bridge the gap between human and machine-learning capabilities, there was a need for accurately labeled datasets that could train AI models.

During this time, Lucy Guo was at Carnegie Mellon studying computer science and human-computer interaction. In her second year, she applied for the Thiel Fellowship, which awards $100K to help motivated recipients build a business, on the condition that they drop out of school. So in 2014, her senior year, she dropped out.

After her initial company failed due to legal issues surrounding food delivery from non-commercial kitchens, she interned at Facebook and worked as a product designer at Quora and Snapchat. The two met at Quora, and Scale AI was accepted into Y Combinator, shortly thereafter raising a $120K seed round. After exploring several aspects of the infrastructure needed for AI, the team narrowed its efforts down to autonomous vehicles. Self-driving cars needed humans to label images so that the AI the cars used could be trained on those labeled images.

As the company got started, the team attended the Computer Vision and Pattern Recognition (CVPR) conference. There, according to Wang, they went “booth to booth with a laptop with a demo on it.” As the company grew, its products became applicable to other AI-related industries, including satellite imagery, ecommerce, and others.

By 2018, the company had grown significantly, and both Wang and Guo were named to the Forbes 30 Under 30 list. At this time, Guo left Scale AI to start a venture capital firm called Backend Capital. According to Guo, the separation was due to a “division in culture and ambition alignment.”

Product

To get a foundational understanding of Scale AI, it's important to understand the lifecycle of building a machine-learning model for any given industry vertical. That process begins with data and its sources before moving to data engineering, which is a component of data science.

Scale AI’s core value proposition is built around the data engineering component of this lifecycle. Specifically, Scale AI helps companies with data annotation and labeling of “ground truth” data. Ground truth data refers to data correctly labeled in an expected format, such as a picture of a cat tagged as a “cat”, or labels that differentiate a dog from a cat in an image.

As of September 2024, Scale segments its products into three sections, Build AI, Apply AI, and Evaluate AI, each with its own set of capabilities. Under Build AI is the Scale Data Engine, which is further segmented into three use cases: generative AI, government, and automotive. Apply AI comprises two products, Scale Donovan and the Scale GenAI Platform. Finally, under Evaluate AI there is Scale Evaluation, which also has three use cases: model development, public sector, and enterprise.

Source: Scale AI

Scale AI’s core products include Scale Data Engine, Scale Donovan, and Scale GenAI, while Scale Evaluation is embedded into the workflows of these three products.

Scale Data Engine

Scale AI’s core product is its data engine, which relies on a workforce of 240K people across Kenya, the Philippines, and Venezuela, managed through a subsidiary, Remotasks, to label data. Companies use the data engine to build and train ML algorithms; it enables customers to collect, curate, and annotate data to train and evaluate models. Companies including Lyft, Toyota, Airbnb, and General Motors pay Scale AI for high-quality annotated data labeled by human contractors, an ML algorithm, or a mixture of both.

Scale AI offers a comprehensive approach to data labeling by offering automated data labeling, human-only labeling, and human-in-the-loop (HITL) labeling, each with distinct advantages. Automated data labeling utilizes custom machine learning models to efficiently label large datasets with well-known objects, significantly accelerating the labeling process. However, it requires high-quality ground-truth datasets to ensure accuracy and struggles with edge cases.

Human-only labeling, on the other hand, relies on the nuanced understanding and adaptability of human annotators, providing superior quality in complex domains like vision and natural language processing, albeit at a higher cost and slower pace. HITL labeling combines both methods: automated systems handle the bulk of the labeling, and human experts review and refine the outputs. This hybrid approach leverages the strengths of both automation and human expertise to produce accurate data labels for machine learning applications with high efficiency.
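
To make the HITL pattern concrete, the sketch below routes each item through an automated pre-labeler and escalates low-confidence predictions to a human queue. It is a minimal illustration with stand-in functions and an assumed confidence threshold, not Scale AI’s implementation:

```python
import random

CONFIDENCE_THRESHOLD = 0.9  # assumed cutoff; tuned per project in practice

def model_predict(item):
    """Stand-in for an automated pre-labeling model."""
    return "cat", random.random()  # (label, confidence)

def send_to_annotator(item, suggested):
    """Stand-in for queueing the item to a human annotator."""
    return suggested  # a human would confirm or correct the label here

def route_item(item):
    label, confidence = model_predict(item)
    if confidence >= CONFIDENCE_THRESHOLD:
        return {"item": item, "label": label, "source": "auto"}
    # Low-confidence predictions are escalated for human review.
    return {"item": item, "label": send_to_annotator(item, label), "source": "human"}

labeled = [route_item(f"image_{i}.jpg") for i in range(5)]
print(labeled)
```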

Scale AI annotates many different types of data including 3D sensor fusion, image, video, text, audio, and maps. Although image, video, text, and audio products could be generalized across several industries, 3D sensor fusion and map labeling are typically relevant for autonomous driving, robotics, and augmented and virtual reality (AR/VR).

As part of the data engine offering, Scale AI offers three distinct data annotation solutions tailored to different needs: Scale Rapid, Scale Studio, and Scale Pro.

  1. Scale Rapid is designed for machine learning teams to quickly develop production-quality training data. It allows users to upload data, set up labeling instructions, and get feedback and calibration on preliminary labels within a few hours, enabling a rapid scale-up of the data labeling process to larger volumes. As a self-serve platform with no minimums, it lets users upload data, select or create annotation use cases, send tasks to the Scale workforce, and receive high-quality labeled data within hours. Scale AI provides the annotator workforce to ensure the data is labeled accurately and efficiently.

  2. Scale Studio focuses on maximizing the efficiency of users' own labeling teams. It allows users to upload data, choose or create annotation use cases, use their own workforce, and monitor project performance, making it ideal for organizations that want to manage labeling internally and boost productivity. The platform tracks and visualizes annotator metrics such as throughput, efficiency, and accuracy, and provides ML-assisted annotation tooling to speed up annotations.

  3. Scale Pro caters to AI-enabled businesses requiring scalable, high-quality data labeling for complex data formats. It features API integration, dedicated engagement managers for customized project setup, and the ability to handle large volumes of production data, and it guarantees quality through service-level agreements (SLAs), providing a premium, fully-managed labeling experience.

The key difference between these offerings is who labels the data: with Scale Rapid and Scale Pro, the data is annotated by Scale AI’s workforce, while Scale Studio requires the company to bring its own annotators. Each offering falls under the umbrella of the Scale Data Engine.

Key Terms

Task: A task is an individual unit of work to be completed. Each task corresponds directly to the data that needs to be labeled, ensuring a one-to-one relationship. For instance, there is a separate task for every image, video, or text that requires labeling.

Project: Within a given project, similar tasks can be organized based on instructions and the use case. All tasks within the project will share the same guidelines and annotation rules. A project is linked to a specific annotation use case, corresponding to a task type. Multiple projects can exist for each use case. For instance, one project might be dedicated to categorizing scenes, while another focuses on annotating images. Each task is explicitly associated with a project to maintain organization.

Batches: On Scale Rapid, projects allow batches of data to be launched to the Scale workforce for labeling, with three types of batches available: self-label, calibration, and production batches. On Scale Studio, projects enable batches of data to be labeled by a customer’s in-house annotation team, with all batches being standard production batches that can be used for various purposes such as self-labeling, experimental batches, or large-scale production pipelines. On Scale Pro, batches can be used to further divide work within high-volume projects, associating tasks with specific internal datasets or marking tasks as part of a weekly submission.

On Scale Rapid, three types of batches can be launched for data labeling. A self-label batch allows users to test their taxonomy setup or labeling experience by creating a batch of data for themselves or team members to label. A calibration batch is a smaller set of tasks sent to the Scale workforce for labeling, providing labeler feedback and enabling quick iteration on taxonomy and instructions; it undergoes fewer quality controls and is primarily used to refine labeling processes. Finally, a production batch involves scaling to larger volumes after iterating through calibration batches and refining quality tasks; it includes rigorous onboarding, training, and periodic performance checks for labelers to ensure high-quality labeling.

Taxonomy: A taxonomy in data annotation is a structured collection of labels, known as annotations, and associated information defined at the project level. Annotations can include various types such as boxes, polygons, points, ellipses, cuboids, events, text responses, list selections, tree selections, dates, linear scales, and rankings. Within a taxonomy, there are classes of annotations (different types of annotation), global attributes (information about the entire task), annotation attributes (details linked to a specific annotation), and link attributes (relationships between two annotations).

For example, a project may involve drawing boxes around all cats and dogs in an image, indicating the total number of cats and dogs. Each cat's box annotation would include an attribute for "sleeping or not sleeping," and each dog's box annotation would include a link attribute to indicate which cat it is looking at. Additionally, a global attribute would ask the labeler to specify the total number of cats and dogs in the image.
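
Expressed as data, that example taxonomy might look something like the following sketch. The field names here are illustrative assumptions, not Scale’s exact schema:

```python
# Hypothetical taxonomy for the cats-and-dogs example; field names are
# illustrative and do not mirror Scale's exact schema.
taxonomy = {
    "annotations": [
        {"name": "cat", "type": "box",
         "attributes": [{"name": "sleeping", "choices": ["sleeping", "not_sleeping"]}]},
        {"name": "dog", "type": "box",
         "link_attributes": [{"name": "looking_at", "links_to": "cat"}]},
    ],
    "global_attributes": [
        {"name": "total_cats_and_dogs", "type": "number"},
    ],
}
```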

Source: Scale AI

Workflows

When initiating a project on Scale, a user begins by selecting a template that fits the project’s use case or creating their own.

Source: Scale AI

Following this, the user must upload their data. The platform supports various data formats, including images, videos, text, documents, and audio. Once the data is uploaded, the user can choose from a list of available use cases tailored to their data format. Each use case provides a set of labels for building the project's taxonomy and a set of pipelines, which are sequences of stages that a task will go through before being delivered back to the user. Each project is assigned a single pipeline, and all tasks within the project will follow this same pipeline.

  1. Upload Data: The user starts by uploading their data in any of the supported formats.

  2. Select Use Case: The user chooses a use case relevant to their data format, which will define the available labels and pipelines.

  3. Build Taxonomy: The user utilizes the labels provided by the use case to create a comprehensive taxonomy for the project.

  4. Choose Pipeline: The user selects the pipeline that the project will use, ensuring that all tasks will follow the same sequence of stages.

Use Cases

Scale supports various data formats and use cases, including text (content classification, text generation, transcription, named entity recognition, content collection), images (object detection, semantic segmentation, entity extraction), video (object and event detection), PDFs/documents (entity extraction), and audio (same as text use cases except named entity recognition).

Object Detection: Comprehensive annotation for 2D images supports various geometric shapes, such as boxes, polygons, lines, points, cuboids, and ellipses. This task type is ideal for annotating images with vector geometric shapes.

Source: Scale AI

Source: Scale AI

Once a user selects the use cases, they are prompted to create a taxonomy via a set of labels and attributes. These labels and attributes can be added via a visual label maker or via a JSON editor for the API.
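
As a rough illustration of the API path, an image annotation task with a box taxonomy can be created along the following lines. This follows the general shape of Scale’s documented REST API, but the project name, attachment URL, and key are placeholders, and field details should be checked against the current API reference:

```python
import requests

# Sketch of creating an image annotation task via Scale's REST API.
# Payload shape follows Scale's public docs at a high level; treat the
# details as assumptions and consult the current API reference.
payload = {
    "project": "pets_demo",                       # placeholder project name
    "attachment": "https://example.com/img.jpg",  # image to be labeled
    "geometries": {"box": {"objects_to_annotate": ["cat", "dog"]}},
}
resp = requests.post(
    "https://api.scale.com/v1/task/imageannotation",
    json=payload,
    auth=("YOUR_SCALE_API_KEY", ""),  # API key sent as the basic-auth username
)
print(resp.json())
```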

Source: Scale AI

Instruction Writing

Instructions guide the labelers on how to handle each task accurately. After the taxonomy is created, an auto-generated instructions outline is produced. In each case, whether the project is net-new or based on a template, the outline includes the following sections:

  • Summary: Introduce the task, providing useful context like scenery, number of frames, objects to look for, and any unusual aspects.

  • Workflow: Provide a step-by-step guide on task completion, noting initial observations, deductive reasoning, and annotation impacts.

  • Rules: Describe annotation rules applicable to multiple labels or attributes, including well-labeled and poorly-labeled examples.

  • Label/Attribute/Field-Specific Sections: Detail unique rules with examples for each label/attribute/field.

  • Adding Examples: Separate well-labeled and poorly-labeled examples, highlighting differences.

Source: Scale AI

Calibration & Self-Label Batches

After setting up the project, taxonomy, and instructions, the user can proceed with one of the following:

  • Launch a Calibration Batch: This batch is reviewed by Scale labelers who provide feedback on the instructions and deliver the first set of task responses. This step helps identify discrepancies between the instructions and the task responses, allowing the user to refine their instructions.

  • Launch a Self-Label Batch: This option allows the user to test their own taxonomy setup and experience labeling on the Rapid platform firsthand.

Auditing & Improving

Once feedback from the calibration batch is received, the user analyzes the discrepancies and improves their instructions. During the auditing process, the user can use the labeled tasks to create examples embedded in the instructions and quality tasks to enhance the overall project setup.

Source: Scale AI

Project

Once a project has been created, there are different views a user can interact with. In the main view, a user can see the project’s definition, along with its tasks, labels, and instructions.

Source: Scale AI

Under the batches view, a user can see the batches of data that are in progress or completed. Labelers are those who have been assigned the tasks; assignment can be to a group or an individual user, depending on the admin’s configuration. The percentage completed refers to the number of tasks that are done, and it can be expanded to show the individual tasks still to be addressed.

Source: Scale AI

The quality lab is a way to evaluate taskers for their accuracy and view other statistics about the tasks they are handling. Tasks can be audited individually or within a date range. An admin can also set up training tasks to help onboard labelers to the project prior to starting labeling.

Source: Scale AI

There are advanced tools that can be used to support high-quality completed tasks. Some of these include:

Nucleus

In August 2020, Scale AI launched Nucleus, a “data debugging SaaS product.” Nucleus provides advanced tooling for understanding, visualizing, curating, and collaborating on a company’s data, allowing teams to build better ML models. Specifically, Nucleus allows for data exploration, debugging of bad labels, comparing accuracy metrics of different versions of ML models, and finding failure cases. This product offering also falls under the Scale Data Engine offering.

Source: Not Boring

Scale Nucleus is a dataset management platform that helps machine learning teams build better datasets by visualizing data, curating interesting slices within datasets, reviewing and managing annotations, and measuring and debugging model performance. By integrating data, labels, and model predictions, Nucleus enables teams to debug models and improve the overall quality of their datasets.

Source: Scale AI

The Nucleus dataset dashboard contains all the datasets associated with a user, whether public or private. As a user clicks into a dataset, a dashboard view provides an overview of its contents and associated statistics. Datasets can be created by importing images/tasks that a user already has in Scale, via the command-line interface, or via the platform’s UI.

Once the data has been uploaded, a preview is created. Datasets can be classified as one of three types: image, video, or lidar. Each type has specific properties that may or may not be included while creating the dataset. Once the dataset’s contents are uploaded, Nucleus supports uploading metadata that associates each dataset item with a scene, ground truth annotation, model prediction, and segmentation mask.
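
For programmatic workflows, Scale also publishes a Python client for Nucleus (scale-nucleus). A minimal upload might look like the sketch below; the dataset name, file path, and metadata are placeholders, and method names should be verified against the current SDK documentation:

```python
import nucleus  # pip install scale-nucleus

# Sketch of creating a Nucleus dataset and appending items with metadata.
# Method names follow the public SDK at a high level; verify against docs.
client = nucleus.NucleusClient("YOUR_API_KEY")
dataset = client.create_dataset("traffic-scenes-demo")  # placeholder name

items = [
    nucleus.DatasetItem(
        image_location="s3://my-bucket/frame_0001.jpg",  # placeholder path
        reference_id="frame_0001",
        metadata={"camera": "front", "time_of_day": "night"},
    )
]
dataset.append(items)
```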

Source: Scale AI

Under the explore tab, a user can see all the images and items that belong to the dataset. In this view, a user will also see the type of annotation being applied to the dataset items: a geometric annotation, segment annotation, or category annotation.

When a dataset item is clicked, more detail is provided on the right-hand side, including the item’s properties, metadata, annotations, and any effects applied to it, along with any tasks associated with the specific item and any charts or statistics the item might be a part of.

Source: Scale AI

The Nucleus query bar filters dataset items or annotations based on given conditions, which can be combined with AND or OR statements. This type of search is a structured query search, while a natural language search is akin to what users would type into Google. Natural language search enables users to locate images using simple English text queries like "a pedestrian wearing a suit" or "a chaotic intersection."

This functionality applies to all datasets, regardless of their image content, though the quality of the results may vary. Unlike structured querying, which depends on annotations, autotags, or predictions, natural language search allows users to search their datasets in plain English. Users can also search by similarity in this view by selecting an image and clicking “Find Similar”.

Source: Scale AI

Autotag is a machine learning feature that adds refined visual similarity scores as metadata to images or objects within a dataset. It starts with an initial set of example items to seed the search.

Source: Scale AI

Through multiple refinement steps, the search is fine-tuned by identifying relevant items from the returned results. Once the search is finalized, Autotag commits the metadata, assigning search scores (the higher, the more relevant) to the top twenty thousand most relevant items in the dataset. These items can then be queried within the Nucleus grid dashboard.

Source: Scale AI

Models

In Nucleus, models represent actual models in the inference pipeline. There are two types of models:

  1. Shell Models: These are empty models without associated artifacts. They are used to upload model predictions into Nucleus, suitable for when inference is done externally and only the results are needed in Nucleus (see the sketch after this list).

  2. Hosted Models: These are real models with associated artifacts (e.g., a trained TensorFlow model). They are used to host models and run inference on a chosen dataset in Nucleus, with predictions automatically linked to the hosted model.
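
A shell-model workflow, sketched with the same Python client (class and method names here are best-effort assumptions to verify against the SDK docs), might look like:

```python
import nucleus

client = nucleus.NucleusClient("YOUR_API_KEY")
dataset = client.get_dataset("ds_abc123")  # placeholder dataset ID

# A shell model: no hosted artifact, just a handle to attach predictions to.
model = client.create_model(name="offline-detector-v2", reference_id="det-v2")

# Predictions computed externally are uploaded and linked to the model.
predictions = [
    nucleus.BoxPrediction(
        label="car", x=120, y=80, width=60, height=40,
        reference_id="frame_0001", confidence=0.93,
    )
]
dataset.upload_predictions(model, predictions)
```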

Source: Scale AI

Under the Models View, a user will see additional information and many graphs. Two of the common ones are a precision-recall curve and a confusion matrix.

Source: Scale AI

A precision-recall curve offers a detailed view of a model's performance, showing the tradeoff between precision and recall.

  • High Recall, Low Precision: The model returns many results, but many are incorrect.

  • High Precision, Low Recall: The model returns few results, but most are correct.

  • High Precision and High Recall: The model returns many results, all correctly labeled.

The balance between precision and recall depends on the application's needs. For example, in cancer detection, high recall is crucial to avoid missing cancerous tumors. In contrast, spam filters prioritize high precision to avoid misclassifying important emails as spam.

Intersection over Union (IoU) is an evaluation metric used in computer vision to confirm label quality. It measures the ratio of the area of overlap between the predicted label and the ground truth label to the area of their union. A ratio closer to 1 indicates a better-trained model.
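
As a concrete illustration, IoU for two axis-aligned boxes reduces to a few lines of generic code (not Scale-specific):

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, roughly 0.143
```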

Source: Scale AI

Confusion matrices are a powerful tool for understanding class confusion in models. They compare predicted and actual classifications, helping to identify misclassifications, such as predicting a traffic sign when the ground truth is a train.

By combining confusion matrices with confidence scores, one can prioritize addressing misclassifications where the model is highly confident but incorrect. These confusions may be due to incorrect or missing labels or insufficient data for certain classes, like traffic signs and trains.
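
A minimal way to surface such cases, again as generic code rather than Scale’s tooling, is to collect the misclassified items and sort them by model confidence:

```python
# Generic sketch: surface confident misclassifications worth auditing first.
records = [  # (ground_truth, prediction, confidence); toy data
    ("train", "traffic_sign", 0.97),
    ("traffic_sign", "traffic_sign", 0.91),
    ("train", "train", 0.88),
    ("train", "traffic_sign", 0.64),
]
errors = [r for r in records if r[0] != r[1]]
# The highest-confidence mistakes often point to label errors or data gaps.
for truth, pred, conf in sorted(errors, key=lambda r: -r[2]):
    print(f"predicted {pred!r} for ground truth {truth!r} (confidence {conf:.2f})")
```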

For either of the two graphs, clicking on a data point navigates the user to a filtered view of the images behind that data point.

In the scenario tests view, users can regression-test the models they are using. Regression testing ensures that model quality does not degrade over time or with the addition of more dataset items. A Validate ScenarioTest is used to monitor model performance in key scenarios. It operates on a subset of data (a Slice) and includes various evaluation metrics. Users can compare the model's performance against baseline models or evaluate whether it meets specific thresholds, such as checking if the IoU of a model is greater than 0.8 on the selected data slice.

Source: Scale AI

The item-level performance view shows the performance of each evaluation function, which can be specified by the user. The list under the graph indicates which model ran, which data was used, and the overall metrics of the run.

Jobs

Jobs are specific tasks that are running in the pipeline, and their progress is displayed in this view.

Source: Scale AI

Generative AI Platform

The Scale GenAI Platform is a solution designed to enhance the development, deployment, and optimization of Generative AI applications. It leverages advanced Retrieval Augmented Generation (RAG) pipelines to transform proprietary data into high-quality training datasets and embeddings, supporting the fine-tuning of large language models (LLMs) with both proprietary and expert data to improve performance and reduce latency.
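
At its core, a RAG pipeline retrieves the documents most relevant to a query and injects them into the model’s prompt. The sketch below illustrates the idea with a stand-in embedding function; it is a conceptual outline, not Scale’s implementation:

```python
import numpy as np

def embed(text):
    """Stand-in embedder: hashes words into a fixed-size vector.
    A real pipeline would call a trained embedding model."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

docs = [
    "Q3 revenue grew 12% on new enterprise contracts.",
    "The data engine supports image, video, and lidar labeling.",
]
doc_vecs = np.stack([embed(d) for d in docs])

query = "How did revenue change in Q3?"
scores = doc_vecs @ embed(query)        # cosine similarity (vectors are unit norm)
context = docs[int(np.argmax(scores))]  # retrieve the best-matching document

prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# `prompt` would then be sent to a fine-tuned or off-the-shelf LLM.
```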

Source: Scale AI

The platform includes a range of tools for evaluating and monitoring AI models, ensuring their reliability and accuracy. This includes the ability to compare base models, perform automated and human-in-the-loop benchmarking, and manage test cases with detailed evaluation metrics. These features help in pinpointing model weaknesses and improving accuracy, which is important for applications in various domains, including defense and intelligence.

Source: Scale AI

In terms of deployment, Scale AI's infrastructure allows for the management and monitoring of custom models, with enterprise-grade safety and security built-in. Users can create new deployments, adjust settings, and monitor token consumption and API usage through convenient dashboards. The platform supports both commercial and open-source models, maintaining flexibility and avoiding vendor lock-in.

Source: Scale AI

Security and privacy also apply to the Scale GenAI Platform. It ensures data remains private and secure within virtual private clouds (VPCs) and supports rigorous testing to maintain the integrity of AI applications. By leveraging human-in-the-loop testing and extensive evaluation tools, the platform ensures that AI models are safe, responsible, and effective for various use cases, from employee productivity to customer support and data analysis.

Source: Scale AI

Scale Spellbook is Scale AI’s product intended for developers to build, compare, and deploy large language model apps. Spellbook was announced in November 2022. Its features include scaling CPU and GPU computing, managing model deployments and A/B testing, and monitoring real-time metrics such as uptime, latency, and performance. Spellbook also includes structured testing for ML models through regression tests and model comparisons.

A large language model (LLM) app consolidates prompts, models, and parameters for a specific use case, and users should create separate apps for each use case. Examples include apps for converting text to SQL, generating marketing copy, summarizing tweets, and classifying products.

Source: Scale AI

Prompt templates help users generate text by using structured prompts. These templates include pre-defined prompts with placeholders for specific information. Users can select a template and input the necessary details, enabling the model to create text based on that information.
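
Conceptually, a prompt template is a string with named placeholders that get filled in at request time, as in this simple illustration (not necessarily Spellbook’s exact syntax):

```python
# A prompt template with placeholders, filled in per request.
TEMPLATE = (
    "Summarize the following tweet in one sentence for a {audience} audience:\n"
    "{tweet}"
)

prompt = TEMPLATE.format(
    audience="non-technical",
    tweet="Scale AI raised a $1B Series F at a $13.8B valuation.",
)
# `prompt` is what actually gets sent to the selected model.
print(prompt)
```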

Source: Scale AI

Different versions of the created app can be viewed on the App Variants page. Users can also perform actions such as forking, which copies the variant's settings into a new variant; deploying, which creates a deployment from a variant; and viewing or running evaluations to assess the variant's performance through programmatic, human, or AI evaluation.

Source: Scale AI

While crafting a good prompt and providing context examples can enhance LLM performance, it may not always be sufficient or cost-effective. Fine-tuning the model on a larger set of task-specific examples offers several advantages: it eliminates the need for specific prompts, reduces token costs, improves inference latency by not requiring input examples, and allows the model to learn more deeply from numerous examples. As of September 2024, Spellbook supports fine-tuning on OpenAI models and FLAN-T5.

Source: Scale AI

For fine-tuning a model, two key parameters need to be set: epochs and the learning rate modifier. Epochs determine how many times the model trains on the entire dataset. The learning rate modifier adjusts the recommended learning rate multiplicatively. A smaller modifier (<1) results in slower training but, when combined with more epochs, can enhance training quality compared to using a larger modifier.
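
In other words, the modifier scales a recommended base rate multiplicatively; for example (illustrative numbers only):

```python
base_learning_rate = 1e-4      # hypothetical recommended rate
learning_rate_modifier = 0.5   # <1 trains more slowly but can improve quality
epochs = 4                     # full passes over the fine-tuning dataset

effective_lr = base_learning_rate * learning_rate_modifier  # 5e-5
print(f"training for {epochs} epochs at learning rate {effective_lr}")
```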

Source: Scale AI

Evaluations in Spellbook provide quantitative metrics to determine the best variant for specific use cases. Options include human evaluations and programmatic evaluations. For generative applications, users can employ Scale's human evaluation integrations: they select a variant and a dataset with at least 20 rows, choosing between Scale's global workforce and their internal workforce, and must define the task and evaluation criteria for the human evaluators.

Users may opt for programmatic evaluations for other use cases:

  • Classification: Compares ground truth with model outputs, generating F1 scores and accuracy metrics. Rather than looking only at overall accuracy, the F1 score evaluates a model's predictive ability class by class, combining precision and recall into a single balanced measure of the model's effectiveness (see the sketch after this list).

  • Mauve: Measures distribution similarities for longer generations, requiring a dataset of at least 50 rows.
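
For reference, precision, recall, and F1 can be computed directly from prediction counts, as in this generic sketch:

```python
def f1_score(true_positives, false_positives, false_negatives):
    """F1 is the harmonic mean of precision and recall."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Example: 80 correct positives, 10 false alarms, 20 misses.
print(f1_score(80, 10, 20))  # roughly 0.842
```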

Scale Donovan

Source: Scale AI

Scale Donovan is an AI suite for the federal government. Donovan ingests data from cloud, hybrid, and on-prem sources, organizes that data to make it interactive, and enables operators and analysts to ask questions of sensor feeds and map/model data. Donovan then produces courses of action, summary reports, and other actionable insights to help operators achieve mission objectives.

Information Retrieval

Source: Scale AI

Donovan integrates Retrieval-Augmented Generation (RAG), allowing users to use large language models (LLMs) to interact with mission-related information. Donovan has a chat interface that can extract information from documents and translate it using natural language semantics. Because Donovan is model-agnostic, users can easily identify which model works best for a given use case.

Source: Scale AI

Geospatial Chat

Donovan features geospatial chat, combining geographic filtering with LLM capabilities. Users can interact with a map, select specific areas, and pose location-based questions.

Source: Scale AI

Donovan provides responses relevant to the chosen area and pins locations on the map with enriched metadata from citations, offering detailed and contextual information.

Source: Scale AI

Text-to-API

Donovan's text-to-API feature allows for natural language queries to be translated into API requests, facilitating integration with other applications or systems. This enables Donovan to fetch and relay information from connected systems in natural language, streamlining interactions with complex databases and enhancing productivity and decision-making through partnerships with Flashpoint, Strider, and 4DV.
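
A common way to implement this pattern is to have the LLM emit a structured call that the application validates and executes. The sketch below is generic and invented for illustration; the tool names and schema are not Donovan’s interface:

```python
import json

def llm_to_api_call(question):
    """Stand-in for an LLM that maps a natural language question
    to a structured API call (hypothetical schema)."""
    return json.dumps({"endpoint": "search_reports",
                       "params": {"region": "Black Sea", "days": 7}})

def execute(call):
    # The application validates the call and queries the connected system.
    registry = {"search_reports": lambda p: f"3 reports found for {p['region']}"}
    return registry[call["endpoint"]](call["params"])

call = json.loads(llm_to_api_call("What happened in the Black Sea this week?"))
print(execute(call))  # the result would be relayed back in natural language
```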

Report Generation

Users can also generate reports using LLMs, incorporating new data as it becomes available. This feature significantly reduces manual effort, cutting workflow times from hours to minutes and freeing users to focus on cognitive tasks requiring human context and creativity.

Source: Scale AI

Donovan supports customized report generation through templates designed for common information requests or specific operational documents. Once the report’s general information is added, the chosen model acts on that information to generate a draft report or project that users can then refine to fit their needs.

Source: Scale AI

Scale AI frequently experiments with new products to determine market viability, leading to the partial launch or discontinuation of offerings like Synthetic, Document AI, Ecommerce AI, and Chat. This iterative approach allows Scale AI to identify and focus on products that gain traction and meet customer needs effectively.

Scale Evaluation

The state of AI evaluations limits progress due to several key challenges. One of the main issues is the shortage of high-quality, trustworthy datasets that haven't been overfitted, resulting in less reliable evaluation outcomes. Additionally, the lack of effective tools to analyze and iterate on evaluation results further hampers the ability to improve models. This comes alongside inconsistencies in model comparisons and unreliable reporting, making it difficult to draw meaningful insights. Together, these challenges create a bottleneck that restricts advancement in AI development.

According to the company, Scale Evaluation is built to help model developers gain insights into their models, offering detailed analyses of LLM performance and safety across various metrics to support continuous improvement.

Scale Evaluation identifies several key risks associated with LLMs, including the spread of misinformation, where false or misleading information is produced, and unqualified advice on sensitive topics like medical or legal issues that can cause harm. It also highlights concerns about bias, where harmful stereotypes are reinforced, and privacy risks involving the disclosure of personal data. Additionally, Scale Evaluation looks to combat how LLMs can be exploited for cyberattacks and may aid in the acquisition or creation of dangerous substances like bioweapons.

Scale AI has established a reputable evaluation platform through several strategic measures, demonstrated by its selection by the White House to conduct public assessments of AI models from leading developers. Scale AI’s research division, SEAL (Safety, Evaluations, and Alignment Lab), supports model-assisted research, enhancing the platform's evaluation capabilities. The company has trained thousands of red team members in advanced tactics and in-house prompt engineering, facilitating high-quality vulnerability testing.

Source: Scale AI

In May 2024, Scale AI introduced the SEAL Leaderboards, a ranking system for LLMs developed by its Safety, Evaluations, and Alignment Lab (SEAL). These leaderboards provide an unbiased comparison of frontier LLMs using curated, private datasets that cannot be exploited, ensuring that rankings accurately reflect model performance and safety. By covering domains such as coding, instruction following, math, and multilinguality, the SEAL Leaderboards offer a comprehensive evaluation process led by vetted domain experts.

Unlike most public benchmarks, Scale's leaderboards maintain integrity through private datasets and limited access to prevent gaming or overfitting. The platform also emphasizes transparency in its evaluation methods, providing insights beyond just rankings. Scale’s goal is to drive better evaluations and promote responsible AI development by continually updating leaderboards, expanding coverage to new domains, and working with trusted third-party organizations to ensure the quality and rigor of assessments.

Market

Customer

Source: Scale AI

As of September 2024, Scale AI defines its customer base as falling into three segments: generative AI, US government, and enterprise.

Scale AI’s generative AI customers include OpenAI, Nvidia, Cohere, and Adept.

In the enterprise segment, the customer base is split by industry. For example, in the automotive space Scale supports a number of companies, including General Motors’ Cruise, Zoox, Nuro, and other autonomous driving companies that require sizable volumes of labeled camera data. Scale AI’s customers include not only autonomous driving companies but also robotics companies, including Kodiak Trucks, Embark, Skydio, and Toyota Research Institute.

Under the government segment, Scale AI serves the federal government and defense contractors. Key customers include the US Army, the US Air Force, and the Defense Innovation Unit.

Market Size

The rise of AI can be attributed to several key factors, including increased computing power in AI chips, a growing volume of training data, breakthroughs that addressed technological bottlenecks (such as vanishing gradients, a problem whose study helped lead to the transformer architecture), and a decrease in cloud storage and compute costs. With its data labeling and annotation products, Scale AI started off by targeting data annotation. The data collection and labeling market is estimated to reach $17.1 billion by 2030, growing at a CAGR of 28.9% from 2023 to 2030. Since its early days, Scale AI has evolved into a fuller AI infrastructure service provider, ultimately helping companies build production models over time, representing a $27 billion market growing at over a 20% CAGR.

Source: Generational

Competition

Within the data collection and labeling market, Scale AI faces competition from players such as Amazon Mechanical Turk, Labelbox, Appen, and Hive. These competitors also utilize humans to label data for companies that don't have the resources to do it themselves. Given the commoditized nature of the data labeling industry, companies find it challenging to establish unique competitive advantages beyond operational efficiency.

Scale AI’s long-term competitive advantage comes from improving its in-house ML labeling algorithms to make human labeling more automated and cheaper, deriving economies of scale. As Scale AI expands its operations into different domains, the diversity of its datasets plays a crucial role in training these ML models and gives Scale AI a significant edge in data quality and variety. In a 2021 overview of Scale AI’s business, Packy McCormick explained:

“Scale would agree that a human-heavy approach isn’t the right one in the long-term, but it’s crucial to the data flywheel. As Scale’s human teams label data, they’re also training Scale’s labeling models. Over time, the ratio of human-to-machine has decreased; more work is being done by the algorithms. The move to more algorithmic tagging is actually a boon for Scale, which has trained its models on more human labels than nearly anyone in the world. It’s much worse for competitors like Appen, which are more akin to Upwork-for-labelers than an AI company.”

With the introduction of new products, Scale AI has evolved beyond data collection and labeling, attempting to become a comprehensive ML infrastructure company. The established players it faces in this category are much less commoditized. This space primarily follows two archetypes: ML companies and enterprise cloud platforms.

  • ML Companies: Companies like Databricks build ML products on top of a key differentiated wedge. For Databricks in particular, this wedge is its data lakehouse, which stores the data that their AI workflows and model training systems consume. Other companies like this include C3, H2O, and Dataiku.

  • Enterprise Cloud Platforms: Companies like AWS have an ML ecosystem as part of their product line, including everything from Mechanical Turk to label data, S3 and Redshift to store that data, and SageMaker to train ML models on top of that data. Microsoft and Google are building similar platforms on Azure and GCP.

Among these, Scale AI falls in the first category, attempting to build ML tools on top of its wedge of data labeling. However, since Scale AI doesn’t provide its own storage, it relies on external solutions like AWS’s S3, which can make Scale AI's subsequent ML products more expensive compared to AWS’s integrated offerings. If a company wanted to use Scale AI for labeling but SageMaker for model training, there isn’t much Scale AI could do to prevent it without offering competitive features of its own.

Data Collection & Labeling Market

Labelbox: Founded in 2018, Labelbox is a training data platform for machine learning applications. As of September 2024, the company had raised a total of $188 million in funding from investors such as Andreessen Horowitz and Snowpoint Ventures. In January 2022, Labelbox raised a $110 million Series D led by Softbank at an undisclosed valuation. Like Scale AI, Labelbox offers a platform for training data for AI models but differs in its more exclusive focus on machine learning applications.

Hive: Founded in 2013, Hive offers cloud-based AI solutions for understanding content, similar to Scale AI. As of September 2024, Hive has raised a total of $121 million from investors including General Catalyst and 8VC. In April 2021, the company raised a $50 million Series D at a $2 billion valuation. While Scale AI has an emphasis on government and enterprise cloud services as its customer base, Hive promotes prebuilt models for marketplaces, dating apps, and other B2C and peer-to-peer oriented companies. As a result, Hive focuses more on real-time content tagging for moderating user-generated content. Scale AI’s government and enterprise focus likely makes its product more useful for companies developing complex cloud services.

Appen: Founded in 1996, Appen collects and labels content to build and improve AI models. In January 2015, Appen was listed on the Australian Securities Exchange. As of September 2024, Appen was trading at a market capitalization of $170 million. Like Scale AI, Appen focuses on enterprise AI solutions, including extracting information from paperwork, object detection for autonomous vehicles, and various other data types. Appen highlights its partnerships with AWS, Nvidia, and Salesforce. Both Scale AI and Appen have been able to land enterprise and long-term contracts, but the fact that customers use both companies highlights a lack of product differentiation and limited moat.

V7 Darwin: Founded in 2018, V7 Darwin helps collect and label image and video content to improve computer vision models. As of September 2024, V7 Darwin had raised $43 million, with a $33 million Series A raise in November 2022. Scale AI focuses on large-scale, enterprise-level data labeling with a mix of human and automated efforts for high accuracy, while V7 Darwin provides an integrated platform for computer vision projects, ideal for smaller teams and individual data scientists.

ML SaaS Market

Databricks: Founded in 2013, Databricks helps companies build ML products and has a custom data storage solution that its AI workflows and model training systems consume. In February 2024, Databricks raised a funding round, following a $500 million Series I at a $43 billion valuation in September 2023. As of September 2024, the company had raised a total of $4 billion in funding across 12 rounds. In comparison to Scale AI, Databricks' unique selling point is its data lakehouse infrastructure, which serves as the foundation for all its ML products, whereas Scale AI has a broader ML focus.

Humanloop: Founded in 2020, Humanloop helps companies fine-tune LLMs in a simplified way through prompt and response ratings. In July 2022, the company raised a $2.6 million seed round led by Index Ventures, bringing its total funding to $2.7 million as of September 2024. Unlike Scale AI, which focuses on a wide range of data labeling services through an engineer-first API and platform, Humanloop focuses on natural language processing (NLP) models with no-code-first and API-second training solutions, indicating a more narrow and beginner-oriented focus in the AI space. Scale AI’s platform is more robust, offering data labeling for videos and documents, helping the company solve beyond generative text.

AWS Machine Learning Suite: The AWS ML Suite is a collection of more than 27 machine learning services provided by AWS, which competes with Scale AI but is part of Amazon's larger suite of cloud services. Amazon introduced its ML initiatives in 2015. Based on Scale AI’s past partnerships, however, Scale AI can be used alongside AWS or even integrate with it.

C3 AI: Founded in 2009, C3 AI helps companies build custom enterprise AI applications. Its flagship product offers the development, deployment, and operation of AI applications, driving efficiency and cost-effectiveness with a focus on enterprise data management. Scale AI, in contrast, provides customizable solutions tailored to specific needs; its platform can be adapted to various industries and use cases, making it versatile for different AI projects.

Business Model

Scale AI does not publicly disclose its pricing. It has two pricing tiers: one for enterprise clients and one for self-serve users.

Enterprise

Source: Scale AI

Scale AI provides data annotation for the enterprise on a custom pricing basis. Companies pay Scale AI to label data, with prices varying depending on the volume and the data type (image, video, text, 3D LiDAR, etc.). Scale AI labels the data using a labor source of more than 100K contractors and builds in-house algorithms to ensure data quality. Scale AI also automates parts of the labeling process using its own ML algorithms.

Self-Serve Data Engine

Source: Scale AI

For Scale AI’s self-serve data engine, a client can manage and annotate data for ML projects in one place while using its own workforce. Scale AI prices this product on a pay-as-you-go basis by credit card. For annotation, the first 1K labeling units are free, while for data management the first 10K images are at no cost.

Traction

One unverified estimate indicated that Scale’s annualized run rate grew from $290 million in 2022 to $760 million in 2023, up 162% year-over-year. This places it well within the pack of large AI startups: OpenAI brought in $1.3 billion in ARR, while Anthropic brought in $200 million. However, in January 2023, Scale AI laid off 20% of its workforce following excessive hiring in 2021 and 2022. As of September 2024, the company had 900 employees, had completed 13 billion annotations, and had labeled 87 million generative AI data points. The average gross margin of a software company is 75%; Scale’s gross margins are closer to 50-60% due to the heavy service component of data labeling.

Source: Sacra

Scale AI has gone beyond the autonomous vehicle labeling market to pick up large government contracts to label geospatial data. In addition, Scale AI has garnered enterprise contracts with companies like Brex and OpenAI for natural language processing, and in July 2024 it entered a multi-year strategic partnership with AWS to increase generative AI adoption via the AWS Marketplace. The company has ramped up its release of products in recent years, growing what was previously an exclusively annotation-based product line into one that includes model training, collection, and debugging.

Valuation

In May 2024, Scale reported raising a $1 billion Series F round at a $13.8 billion valuation. This fundraising round brought Scale’s total amount of funding to $1.6 billion over eight rounds as of September 2024. The financing is a mix of primary and secondary, led by existing investor Accel with nearly all existing investors including YC, Nat Friedman, Index Ventures, Founders Fund, Coatue, Thrive Capital, Spark Capital, NVIDIA, Tiger Global Management, Greenoaks, and Wellington Management. This round also included new investors Cisco Investments, DFJ Growth, Intel Capital, ServiceNow Ventures, AMD Ventures, WCM, Amazon, Elad Gil, Meta, and Qualcomm Ventures. This follows a $325 million Series E round at a $7.3 billion valuation co-led by Dragoneer, Greenoaks Capital, and Tiger Global in April 2021.

Key Opportunities

Data Labeling for Specific Industries

Scale AI has focused on developing data labeling and annotation services for specific industries including autonomous driving. Acquiring new customers and expanding to new industries is a key opportunity. Scale AI has already proven itself by labeling a variety of data types; in 2018, Scale AI focused on autonomous driving companies such as GM, Cruise, Lyft, Zoox, and nuTonomy.

In 2024, its customers include government agencies like the Department of Defense, marketplaces like Airbnb, fintech companies like Brex, and AI developer OpenAI. Each has very different data labeling needs, but Scale AI has proven it can win contracts and deliver quality service to each of them.

Partnerships

Creating strategic partnerships with large organizations can significantly drive Scale AI's growth by providing access to extensive and diverse datasets, enhancing credibility, expanding market reach, and enabling joint innovation. For example, Scale AI's partnership with Toyota Research Institute has allowed it to access vast amounts of autonomous driving data, improving the accuracy and performance of its data labeling services.

Additionally, collaborations with companies like OpenAI have bolstered its reputation and technological capabilities, enabling it to innovate and develop new AI solutions tailored to industry needs. These partnerships not only enhance Scale AI's operational efficiency but also facilitate global expansion by leveraging the established international presence of these large organizations.

Geographical Expansion

Scale AI generates most of its revenue from the US but has significant opportunities to expand into other regions. The European AI software market is projected to grow to $191 billion by 2026. In China, while much of the AI market growth has been driven by consumer internet giants like Alibaba and ByteDance, traditional sectors are expected to take the lead in the future. By 2030, AI is anticipated to contribute $600 billion annually to the Chinese economy, with automotive, transportation, and logistics—areas where Scale specializes—expected to account for 64% of that growth.


Key Risks

Regulatory Exposure

One key risk to Scale AI is legislation in the EU, such as the General Data Protection Regulation (GDPR) and the AI Act, which requires data collected on EU citizens to be stored in the EU and limits certain types of AI applications. This legislation means Scale AI may not use data collected in the EU in other geographic areas, requiring it to build additional services to ensure compliance. It may also lead to fewer AI applications being built in the EU, where Scale AI’s customers operate.

Competition

Scale AI is expanding to different parts of the ML stack beyond data labeling, including ML model debugging and evaluation. However, there are many more competitors in each of these areas of ML infrastructure, including Databricks, Labelbox Model, and Snorkel Flow. Scale AI’s core differentiator is its lower cost of human-in-the-loop data labeling at scale. Competition may have contributed to major customers, including Samsung, Nvidia, and Airbnb, leaving Scale AI in January 2023. As it expands into other parts of ML infrastructure against stiff competition, Scale AI may not enjoy the same product moat.

Margin Risk

Competition is intensifying in Scale's core data labeling market. As companies encounter financial challenges, the competitive focus may shift from features and efficiency to price, potentially reducing any margin expansion Scale gains from increased use of pre-labeling software. Scale's new products are entering a market dominated by established players and may not offer immediate margin benefits, as the company might need to price them competitively to attract new users.

Summary

Scale AI has established itself in the AI space with a focus on extensive data solutions that help fuel the best models a company can build. The company has expanded into multiple new spaces, and its products have evolved to fit them. Despite making progress and inroads with some sizable companies, the bulk of its business is still based on data labeling. Competition in the ML infrastructure space is fierce, and it puts Scale AI in the crosshairs of AWS, GCP, and Microsoft, which benefit from the economies of scale that come with owning their own data systems. Scale AI’s future success will be determined by its ability to execute in newer aspects of the machine learning lifecycle.

Disclosure: Nothing presented within this article is intended to constitute legal, business, investment or tax advice, and under no circumstances should any information provided herein be used or considered as an offer to sell or a solicitation of an offer to buy an interest in any investment fund managed by Contrary LLC (“Contrary”) nor does such information constitute an offer to provide investment advisory services. Information provided reflects Contrary’s views as of a time, whereby such views are subject to change at any point and Contrary shall not be obligated to provide notice of any change. Companies mentioned in this article may be a representative sample of portfolio companies in which Contrary has invested in which the author believes such companies fit the objective criteria stated in commentary, which do not reflect all investments made by Contrary. No assumptions should be made that investments listed above were or will be profitable. Due to various risks and uncertainties, actual events, results or the actual experience may differ materially from those reflected or contemplated in these statements. Nothing contained in this article may be relied upon as a guarantee or assurance as to the future success of any particular company. Past performance is not indicative of future results. A list of investments made by Contrary (excluding investments for which the issuer has not provided permission for Contrary to disclose publicly, Fund of Fund investments and investments in which total invested capital is no more than $50,000) is available at www.contrary.com/investments.

Certain information contained in here has been obtained from third-party sources, including from portfolio companies of funds managed by Contrary. While taken from sources believed to be reliable, Contrary has not independently verified such information and makes no representations about the enduring accuracy of the information or its appropriateness for a given situation. Charts and graphs provided within are for informational purposes solely and should not be relied upon when making any investment decision. Please see www.contrary.com/legal for additional important information.

Authors

Vardan Sawhney

Senior Fellow


Sachin Maini

Editor


© 2024 Contrary Research · All rights reserved
