<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>BBoxML Blog</title>
    <link>https://bboxml.com/blog/</link>
    <description>BBoxML blog posts for beginners learning image labelling, datasets, and the first steps of machine learning.</description>
    <language>en-gb</language>
    <item>
      <title>Multimodal AI Models: Reshaping the Data Annotation Landscape for ML Teams</title>
      <link>https://bboxml.com/blog/march-2026-real-time-object-detection-and-few-shot-learning-reshape-annotation-workflows/</link>
      <guid>https://bboxml.com/blog/march-2026-real-time-object-detection-and-few-shot-learning-reshape-annotation-workflows/</guid>
      <pubDate>Thu, 02 Apr 2026 00:00:00 GMT</pubDate>
      <description><![CDATA[The machine learning landscape is in constant flux, but few developments have been as transformative as the recent proliferation of highly capable multimodal AI models. These models, designed to process and generate information across various data types – text, images, audio, and video – are not merely incremental upgrades; they represent a significant paradigm shift that demands a re-evaluation of established data annotation practices.]]></description>
      <content:encoded><![CDATA[<p>The machine learning landscape is in constant flux, but few developments have been as transformative as the recent proliferation of highly capable multimodal AI models. These models, designed to process and generate information across various data types – text, images, audio, and video – are not merely incremental upgrades; they represent a significant paradigm shift that demands a re-evaluation of established data annotation practices.</p>
<h3 id="the-omnipresent-rise-of-multimodal-foundation-models">The Omnipresent Rise of Multimodal Foundation Models</h3>
<p>Recent months have seen key players unveil models with increasingly sophisticated multimodal capabilities. OpenAI, for instance, introduced <strong>GPT-4o</strong> in May 2024, a model that accepts prompts combining text, audio, image, and video input and responds with outputs in any combination of these modalities. Similarly, Google&#39;s <strong>Gemini 1.5 Pro</strong>, publicly released with a 1-million token context window in February 2024 and further enhanced through the year, demonstrated impressive abilities to process lengthy video transcripts, codebases, and large documents alongside images and text.</p>
<p>These models underscore a crucial trend: the future of AI often lies in its ability to understand and reason across disparate data types simultaneously, much like humans do. For machine learning teams, this isn&#39;t just an interesting research development; it&#39;s a direct challenge to established data annotation workflows.</p>
<h3 id="the-annotation-imperative-beyond-single-modality-silos">The Annotation Imperative: Beyond Single-Modality Silos</h3>
<p>Historically, data annotation has been largely siloed by modality. Image teams labelled images, natural language processing (NLP) teams annotated text, and audio teams processed speech. Multimodal AI shatters these silos, demanding datasets where relationships <em>between</em> modalities are explicitly captured and labelled. Consider these practical implications:</p>
<ul>
<li><strong>Cross-Modal Referencing:</strong> Instead of just labelling a bounding box around a car, you might need to link that car to a specific sentence in a narrative describing its make and model, or an audio clip of its engine sound. This requires annotating relationships, not just entities within a single modality.</li>
<li><strong>Contextual Understanding:</strong> A single image of a person might be ambiguous. However, paired with text describing their activity or an audio clip of their speech, the context becomes clear, enabling more precise and rich annotations that capture the full scene.</li>
<li><strong>Complex Instruction Following:</strong> Models are now being trained to follow instructions that combine visual and textual cues, e.g., &quot;Identify the red object <em>to the left of the blue one</em> and describe its texture.&quot;</li>
</ul>
]]></content:encoded>
    </item>
    <item>
      <title>YOLO annotation format explained: YOLO vs COCO vs Pascal VOC for beginners</title>
      <link>https://bboxml.com/blog/yolo-annotation-format-vs-coco-vs-pascal-voc/</link>
      <guid>https://bboxml.com/blog/yolo-annotation-format-vs-coco-vs-pascal-voc/</guid>
      <pubDate>Thu, 19 Mar 2026 00:00:00 GMT</pubDate>
      <description><![CDATA[A beginner-friendly guide to YOLO label format, why people talk about multiple YOLO variants, and how YOLO compares with COCO JSON and Pascal VOC XML.]]></description>
      <content:encoded><![CDATA[<p>If you are starting your first object detection project, one of the first confusing questions is usually this: what is the difference between the YOLO annotation format, COCO JSON, and Pascal VOC XML?</p>
<p>That confusion is normal. People often say &quot;export it in YOLO&quot; as if there is one single YOLO format, but then you also hear about YOLOv5, YOLOv8, YOLOv11, YOLOv12, COCO, Pascal VOC, and Google Colab training workflows. For a beginner, that sounds more complicated than it needs to be.</p>
<p>The practical answer is simple: these are mostly different <strong>object detection annotation formats</strong> and dataset packaging styles, not different definitions of what an object is. Your job is to pick the format that matches the training or tooling workflow you plan to use next.</p>
<h2 id="short-answer">Short answer</h2>
<p>If you want the fastest answer before we unpack the details:</p>
<ul>
<li>choose <strong>YOLO</strong> if your next step is a YOLO-style workflow or the BBoxML Google Colab notebook</li>
<li>choose <strong>COCO</strong> if another tool explicitly asks for COCO JSON</li>
<li>choose <strong>Pascal VOC</strong> if you already know you need an XML-based or legacy workflow</li>
</ul>
<p>That simple rule is good enough for most first-time builders.</p>
<blockquote>
<p>Format questions are easier once you can see the workflow clearly. BBoxML supports YOLO and COCO export, so you can start with a small labelled project first, then choose the format that matches your next training step.</p>
</blockquote>
<h2 id="what-the-yolo-annotation-format-actually-is">What the YOLO annotation format actually is</h2>
<p>For bounding boxes, the YOLO annotation format is usually:</p>
<ul>
<li>one image file</li>
<li>one matching <code>.txt</code> label file for that image</li>
<li>one line per object</li>
<li>each line storing the class id plus the bounding box values</li>
</ul>
<p>A typical YOLO label line looks like this:</p>
<pre><code class="language-text">0 0.512500 0.431250 0.245000 0.310000
</code></pre>
<p>That usually means:</p>
<ul>
<li><code>0</code> = the class id</li>
<li><code>0.512500</code> = box centre x</li>
<li><code>0.431250</code> = box centre y</li>
<li><code>0.245000</code> = box width</li>
<li><code>0.310000</code> = box height</li>
</ul>
<p>Those four box values are typically <strong>normalized</strong>, which means they are stored relative to image width and height rather than in raw pixel coordinates.</p>
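<p>As a rough sketch (the helper name here is illustrative, not part of any particular YOLO tooling), converting a pixel-space box into a YOLO label line looks like this:</p>

```python
# Sketch: turn a pixel-space box (x_min, y_min, x_max, y_max) into a
# YOLO label line. The helper name is illustrative, not a BBoxML API.
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    cx = (x_min + x_max) / 2 / img_w  # box centre x, normalized
    cy = (y_min + y_max) / 2 / img_h  # box centre y, normalized
    w = (x_max - x_min) / img_w       # box width, normalized
    h = (y_max - y_min) / img_h       # box height, normalized
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

print(to_yolo_line(0, 312, 221, 508, 469, 800, 800))
```

<p>On an 800 by 800 image, the box (312, 221, 508, 469) produces exactly the sample line shown earlier.</p>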
<p>That is why YOLO text files feel lightweight. You do not get a big JSON document or an XML file per image. You get a compact text representation that many object detection workflows already know how to read.</p>
<h2 id="why-people-talk-about-multiple-yolo-formats">Why people talk about multiple &quot;YOLO formats&quot;</h2>
<p>This is the part that trips beginners up.</p>
<p>When people say &quot;YOLO format&quot;, they are often mixing together two different ideas:</p>
<ol>
<li>the <strong>dataset layout</strong></li>
<li>the <strong>model family or training stack</strong></li>
</ol>
<p>In practice, many YOLO exports look very similar even when they are named after different model generations.</p>
<p>In BBoxML, the YOLO export options are <code>YOLOv5</code>, <code>YOLOv8</code>, <code>YOLOv11</code>, and <code>YOLOv12</code>, but they all use the same core export shape:</p>
<ul>
<li><code>data.yaml</code></li>
<li><code>images/train</code>, <code>images/val</code>, <code>images/test</code></li>
<li><code>labels/train</code>, <code>labels/val</code>, <code>labels/test</code></li>
<li>one <code>.txt</code> label file per image</li>
</ul>
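<p>A minimal <code>data.yaml</code> for that layout usually looks something like this (field names and values here are illustrative; check what your training stack expects):</p>

```yaml
# Typical YOLO-style data.yaml (exact keys can vary between exports)
train: images/train
val: images/val
test: images/test
nc: 2                  # number of classes
names: ["dog", "cat"]  # class id 0 = dog, class id 1 = cat
```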
<p>So when beginners ask, &quot;What is the difference between all those YOLO ones?&quot;, the useful answer is often: <strong>less than you think at the annotation-file level</strong>. The bigger difference is usually which training workflow, notebook, or checkpoint family expects that export label.</p>
<h2 id="yolo-vs-coco-vs-pascal-voc-at-a-glance">YOLO vs COCO vs Pascal VOC at a glance</h2>
<table>
<thead>
<tr>
<th>Format</th>
<th>How annotations are stored</th>
<th>Good fit for</th>
<th>Common friction</th>
</tr>
</thead>
<tbody><tr>
<td>YOLO</td>
<td>One <code>.txt</code> file per image, plus <code>data.yaml</code></td>
<td>Simple training workflows, especially YOLO-style pipelines</td>
<td>Easy to break if class order changes or image/label filenames stop matching</td>
</tr>
<tr>
<td>COCO</td>
<td>Structured JSON annotation files plus image folders</td>
<td>Tooling that wants a richer explicit schema</td>
<td>Harder to inspect by eye because everything sits inside JSON</td>
</tr>
<tr>
<td>Pascal VOC</td>
<td>One XML file per image</td>
<td>Older or XML-based workflows</td>
<td>More verbose, with more files to manage</td>
</tr>
</tbody></table>
<h2 id="what-coco-format-means">What COCO format means</h2>
<p>COCO stores annotations in JSON rather than per-image text files.</p>
<p>In BBoxML, a COCO Detection export is organized with image folders plus split annotation files such as:</p>
<ul>
<li><code>images/train</code></li>
<li><code>images/valid</code></li>
<li><code>images/test</code></li>
<li><code>annotations/train.json</code></li>
<li><code>annotations/valid.json</code></li>
<li><code>annotations/test.json</code></li>
</ul>
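<p>Stripped down to its core, a COCO detection file such as <code>annotations/train.json</code> has this shape (values are illustrative, and real exports carry more metadata; note that <code>bbox</code> is <code>[x_min, y_min, width, height]</code> in pixels):</p>

```json
{
  "images": [
    {"id": 1, "file_name": "frame-001.jpg", "width": 800, "height": 800}
  ],
  "annotations": [
    {"id": 1, "image_id": 1, "category_id": 1,
     "bbox": [312, 221, 196, 248], "area": 48608, "iscrowd": 0}
  ],
  "categories": [
    {"id": 1, "name": "dog"}
  ]
}
```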
<p>COCO is often a good fit when you want:</p>
<ul>
<li>a more explicit schema</li>
<li>easier interoperability with tools that expect JSON manifests</li>
<li>one place to inspect categories, images, and annotations together</li>
</ul>
<p>For many beginners, COCO feels more readable once they understand JSON, but less convenient if they only want to open one label file and check one image quickly.</p>
<h2 id="what-pascal-voc-format-means">What Pascal VOC format means</h2>
<p>Pascal VOC stores each image annotation in its own XML file.</p>
<p>A Pascal VOC export typically includes:</p>
<ul>
<li><code>JPEGImages/</code></li>
<li><code>Annotations/</code></li>
<li><code>ImageSets/Main/</code></li>
</ul>
<p>Each XML file contains the image metadata and the bounding box coordinates for that image.</p>
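<p>A stripped-down Pascal VOC annotation file looks roughly like this (real exports include extra fields such as <code>pose</code>, <code>truncated</code>, and <code>difficult</code>; note the corner-based pixel coordinates):</p>

```xml
<annotation>
  <folder>JPEGImages</folder>
  <filename>frame-001.jpg</filename>
  <size>
    <width>800</width>
    <height>800</height>
    <depth>3</depth>
  </size>
  <object>
    <name>dog</name>
    <bndbox>
      <xmin>312</xmin>
      <ymin>221</ymin>
      <xmax>508</xmax>
      <ymax>469</ymax>
    </bndbox>
  </object>
</annotation>
```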
<p>Pascal VOC is still useful when a downstream tool or older workflow expects it, but for a new solo project it is usually the least convenient format to edit or inspect manually.</p>
<h2 id="which-format-should-you-pick">Which format should you pick?</h2>
<p>If you want the shortest practical answer, use this:</p>
<ul>
<li>Pick <strong>YOLO</strong> if your next step is a YOLO-style training workflow or you want the simplest folder-and-text-file layout.</li>
<li>Pick <strong>COCO</strong> if your tooling expects JSON or you want a more structured annotation manifest.</li>
<li>Pick <strong>Pascal VOC</strong> if you already know your downstream workflow needs XML.</li>
</ul>
<p>For BBoxML users, there is one more practical detail worth knowing: the Google Colab notebook always trains with a YOLO checkpoint. COCO Detection and Pascal VOC exports can still work there, but they are converted to YOLO training layout first. If you want the most direct route, YOLO is usually the simplest choice.</p>
<h2 id="common-mistakes-beginners-make-with-annotation-formats">Common mistakes beginners make with annotation formats</h2>
<h3 id="1-thinking-yolo-always-means-one-exact-file-standard">1. Thinking &quot;YOLO&quot; always means one exact file standard</h3>
<p>It does not.</p>
<p>Sometimes &quot;YOLO&quot; means the model family. Sometimes it means the folder layout. Sometimes it only means the per-image text labels. That is why it is better to ask: <strong>which training script, notebook, or platform do I need to satisfy?</strong></p>
<h3 id="2-mixing-normalized-coordinates-with-pixel-coordinates">2. Mixing normalized coordinates with pixel coordinates</h3>
<p>This is one of the biggest causes of broken labels.</p>
<p>YOLO bounding boxes are usually stored as normalized values. COCO and Pascal VOC store boxes in pixel coordinates: COCO as <code>[x_min, y_min, width, height]</code> and Pascal VOC as corner coordinates <code>[x_min, y_min, x_max, y_max]</code>. If you convert between formats incorrectly, the labels can still look valid in a file while being completely wrong at training time.</p>
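<p>A minimal conversion sketch makes the failure mode concrete (the function name and values here are illustrative):</p>

```python
# Sketch: convert a COCO-style pixel bbox [x_min, y_min, width, height]
# into YOLO-style normalized (cx, cy, w, h). Names are illustrative.
def coco_to_yolo(bbox, img_w, img_h):
    x_min, y_min, w, h = bbox
    cx = (x_min + w / 2) / img_w  # centre x, relative to image width
    cy = (y_min + h / 2) / img_h  # centre y, relative to image height
    return (cx, cy, w / img_w, h / img_h)

# Skipping the division by image size leaves plausible-looking numbers
# in the file that are silently wrong at training time.
box = coco_to_yolo([312, 221, 196, 248], 800, 800)
```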
<h3 id="3-letting-class-order-drift">3. Letting class order drift</h3>
<p>In YOLO, the numeric class id only works if the class list stays in the same order.</p>
<p>If <code>0</code> meant <code>car</code> on Monday and <code>0</code> means <code>bus</code> on Friday, your dataset is now teaching the wrong thing. This is one reason a tool like BBoxML helps: you manage class names in one workspace and export clean labels from that source of truth.</p>
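<p>If you script your own conversions, deriving ids from one fixed, ordered class list is a cheap way to keep this stable (the class names here are illustrative):</p>

```python
# Sketch: derive numeric ids from one fixed, ordered class list so that
# id 0 means the same class in every export. Names are illustrative.
CLASS_NAMES = ["car", "bus"]  # append new classes at the end; never reorder
CLASS_TO_ID = {name: i for i, name in enumerate(CLASS_NAMES)}

print(CLASS_TO_ID["car"])  # 0 today and 0 next week, as long as order holds
```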
<h3 id="4-breaking-the-image-to-label-filename-pairing">4. Breaking the image-to-label filename pairing</h3>
<p>YOLO is simple, but that simplicity comes with a rule: image files and label files need to line up cleanly.</p>
<p>If the image is <code>frame-001.jpg</code>, the label file needs to match that basename. If files get renamed carelessly during a conversion, you can end up with missing labels or labels attached to the wrong image.</p>
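<p>A quick sanity check for that pairing can be sketched like this (paths and extensions are illustrative; adjust them to your export layout):</p>

```python
# Sketch: check that every image has a matching YOLO label file by
# basename, and that no label file is orphaned.
from pathlib import Path

def find_unpaired(image_dir, label_dir):
    images = {p.stem for p in Path(image_dir).glob("*.jpg")}
    labels = {p.stem for p in Path(label_dir).glob("*.txt")}
    missing_labels = sorted(images - labels)  # images with no label file
    orphan_labels = sorted(labels - images)   # labels with no image
    return missing_labels, orphan_labels
```

<p>Running a check like this after every conversion step catches careless renames before training does.</p>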
<h3 id="5-choosing-a-format-before-choosing-the-next-workflow">5. Choosing a format before choosing the next workflow</h3>
<p>Beginners sometimes obsess over the &quot;best&quot; annotation format before they have decided how they will actually train the model.</p>
<p>That is backwards.</p>
<p>Pick the training workflow first. Then choose the dataset format that fits it best.</p>
<h3 id="6-assuming-a-different-format-automatically-means-better-model-quality">6. Assuming a different format automatically means better model quality</h3>
<p>The format itself usually is not the main quality driver.</p>
<p>Tight boxes, consistent class rules, enough variety in the images, and clean exports matter more than whether your dataset lives in YOLO text files or a COCO JSON file.</p>
<p>If you want help on that side of the problem, read <a href="/blog/beginner-tips-better-object-detection-labels/">7 beginner tips for better object detection labels</a>.</p>
<h2 id="a-practical-workflow-for-first-time-builders">A practical workflow for first-time builders</h2>
<p>For a first project, a good pattern is:</p>
<ol>
<li>decide what you want to detect</li>
<li>keep your class list small</li>
<li>label a small batch consistently</li>
<li>export in the format your next tool expects</li>
</ol>
<p>In BBoxML, that usually means:</p>
<ul>
<li>create a project and upload images</li>
<li>create your classes</li>
<li>draw bounding boxes in the browser</li>
<li>save a dataset version</li>
<li>export as YOLO, COCO Detection, or Pascal VOC</li>
</ul>
<p>If you already have an existing dataset, BBoxML can import a YOLO or COCO zip into a new cloud project, which is useful if you want to clean up labels before the next export.</p>
<p>If you are brand new to the workflow, start with the <a href="/getting-started/">Getting Started guide</a> or the beginner post on <a href="/blog/what-is-image-labelling-and-how-do-i-start/">what image labelling is and how to start your first machine learning dataset</a>.</p>
<h2 id="the-simplest-decision-rule">The simplest decision rule</h2>
<p>If you still feel unsure, use this shortcut:</p>
<ul>
<li>choose <strong>YOLO</strong> for the simplest first export</li>
<li>choose <strong>COCO</strong> when another tool explicitly asks for COCO JSON</li>
<li>choose <strong>Pascal VOC</strong> only when a legacy or XML-based workflow requires it</li>
</ul>
<p>That is enough for most beginners.</p>
<p>You do not need to master every dataset standard before you label your first useful project. You just need to keep your labels consistent and export in a format the next step can actually use.</p>
<blockquote>
<p>Next step: create your workspace in <a href="/onboarding">onboarding</a>, use <a href="/getting-started/">Getting Started</a> to build the first dataset version, and return to this guide when you need to choose between YOLO and COCO export.</p>
</blockquote>
<h2 id="where-bboxml-fits">Where BBoxML fits</h2>
<p>BBoxML is built to make this part less messy.</p>
<p>You can prepare your labels in one browser-based workspace, keep your classes consistent, and export the dataset in the format that matches your next step instead of manually reorganizing folders by hand.</p>
<p>If your next goal is your first end-to-end run, use:</p>
<ul>
<li><a href="/onboarding">Onboarding</a> to start a new account</li>
<li><a href="/getting-started/">Getting Started</a> to create your first project</li>
<li><a href="/google-colab-training/">Google Colab Guide</a> to take a saved export into a training notebook</li>
<li><a href="/billing-and-credits/">Billing &amp; Credits</a> if you plan to use AI-assisted labelling and want to understand plan limits and credit usage</li>
</ul>
<p>The best annotation format is usually not the most fashionable one. It is the one that keeps your first workflow simple and your labels clean.</p>
]]></content:encoded>
    </item>
    <item>
      <title>7 beginner tips for better object detection labels</title>
      <link>https://bboxml.com/blog/beginner-tips-better-object-detection-labels/</link>
      <guid>https://bboxml.com/blog/beginner-tips-better-object-detection-labels/</guid>
      <pubDate>Sun, 15 Mar 2026 00:00:00 GMT</pubDate>
      <description><![CDATA[A practical guide for solo founders starting their first image dataset, with plain-English advice on box quality, dataset size, classes, mAP50, YOLO, and COCO.]]></description>
      <content:encoded><![CDATA[<p>Once you understand what image labelling is, the next problem is usually more practical: how do you label images in a way that actually helps a model perform well?</p>
<p>If you are a solo founder or side-project builder, that question matters a lot. You do not have time for a huge annotation team, and you probably do not want to spend weeks labelling images only to discover the model learned the wrong thing.</p>
<p>The good news is that first projects usually improve more from <strong>better dataset decisions</strong> than from fancy model changes.</p>
<blockquote>
<p>Use these tips as your quality checklist before you scale anything up. If you are still building the first version, <a href="/getting-started/">Getting Started</a> gives you the shortest path from blank account to a downloadable dataset.</p>
</blockquote>
<h2 id="1-start-with-one-narrow-use-case">1. Start with one narrow use case</h2>
<p>Beginners often start too broad.</p>
<p>&quot;Detect animals&quot; sounds exciting, but it creates immediate confusion:</p>
<ul>
<li>which animals count?</li>
<li>how small is too small?</li>
<li>do you label toys, drawings, or statues?</li>
</ul>
<p>A better first project is something like:</p>
<ul>
<li><code>detect suitcases in airport-style photos</code></li>
<li><code>detect dogs in outdoor photos</code></li>
<li><code>detect parcels on a doorstep</code></li>
</ul>
<p>The narrower the task, the easier it is to collect consistent examples and write clear labelling rules.</p>
<h2 id="2-keep-your-classes-simple-at-first">2. Keep your classes simple at first</h2>
<p>In object detection, a <strong>class</strong> is just the name you assign to a type of object, such as <code>dog</code>, <code>car</code>, or <code>suitcase</code>.</p>
<p>Too many classes too early creates weak data. A beginner dataset usually works better when you start with:</p>
<ul>
<li>one class</li>
<li>one camera angle or scene type</li>
<li>one definition of what should be boxed</li>
</ul>
<p>For example, start with <code>suitcase</code> before splitting into <code>hard-shell suitcase</code>, <code>soft suitcase</code>, <code>carry-on</code>, and <code>checked luggage</code>.</p>
<p>You can always add more detail later. You cannot easily recover consistency from a confusing first dataset.</p>
<h2 id="3-make-every-bounding-box-tight-and-consistent">3. Make every bounding box tight and consistent</h2>
<p>This is one of the most common quality problems in first datasets.</p>
<p>If your boxes are loose, the model learns background pixels as if they belong to the object. If your boxes are inconsistent, the model sees mixed teaching examples.</p>
<p>Good boxes should usually:</p>
<ul>
<li>sit close to the visible edges of the object</li>
<li>include the full visible object</li>
<li>avoid large amounts of empty background</li>
<li>follow the same rule every time</li>
</ul>
<p>If one image has a tight box around a dog and the next image includes half the grass around it, the model gets conflicting supervision.</p>
<p>Tight boxes matter even more when the object is small.</p>
<h2 id="4-get-enough-images-but-focus-on-variety-before-raw-volume">4. Get enough images, but focus on variety before raw volume</h2>
<p>New builders often ask, &quot;How many images do I need?&quot;</p>
<p>There is no universal number, but for a simple first detector, a rough starting point is:</p>
<ul>
<li>at least 100 to 300 labelled images for one class</li>
<li>more if the scene changes a lot</li>
<li>a separate validation set that the model never trains on</li>
</ul>
<p>What matters most is not just image count. It is <strong>coverage</strong>.</p>
<p>Your dataset should include reasonable variation in:</p>
<ul>
<li>lighting</li>
<li>distance from camera</li>
<li>object size</li>
<li>background</li>
<li>partial occlusion</li>
<li>orientation</li>
</ul>
<p>Fifty near-identical images teach less than fifty varied but consistently labelled images.</p>
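<p>Holding out that separate validation set can be as simple as a seeded shuffle-and-split before training (the 80/20 ratio and helper name are illustrative starting points, not rules):</p>

```python
# Sketch: hold out a validation split before any training run.
# The 80/20 ratio and fixed seed are common defaults, not rules.
import random

def train_val_split(items, val_frac=0.2, seed=42):
    items = list(items)
    random.Random(seed).shuffle(items)  # deterministic shuffle
    n_val = int(len(items) * val_frac)
    return items[n_val:], items[:n_val]  # (train, val)

train, val = train_val_split([f"frame-{i:03d}.jpg" for i in range(10)])
```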
<h2 id="5-watch-for-overfitting-early">5. Watch for overfitting early</h2>
<p><strong>Overfitting</strong> means the model learns your training images too specifically instead of learning the general pattern.</p>
<p>This often happens when:</p>
<ul>
<li>the dataset is too small</li>
<li>the images are too similar</li>
<li>the validation set looks almost the same as the training set</li>
<li>labels are inconsistent, so the model memorizes noise</li>
</ul>
<p>The warning sign is usually this: training performance looks great, but real-world performance is disappointing.</p>
<p>To reduce overfitting:</p>
<ul>
<li>keep a separate validation set from the start</li>
<li>include more scene variety, not just more copies of the same scene</li>
<li>add hard examples, such as cluttered backgrounds or partial occlusion</li>
<li>review mistakes and label edge cases consistently</li>
</ul>
<h2 id="6-add-negative-examples-and-hard-examples">6. Add negative examples and hard examples</h2>
<p>Many first datasets only contain positive examples of the target object. That is a mistake.</p>
<p>Your model also needs to learn what <strong>not</strong> to detect.</p>
<p>Useful examples include:</p>
<ul>
<li>images with no target object at all</li>
<li>scenes with similar-looking objects</li>
<li>busy backgrounds</li>
<li>borderline cases you decided to ignore</li>
</ul>
<p>If you only show clean product-style shots, the model may look excellent in testing and fail as soon as the background gets messy.</p>
<h2 id="7-learn-the-few-model-terms-that-actually-help">7. Learn the few model terms that actually help</h2>
<p>You do not need a full machine learning course to get started. A few plain-English concepts go a long way.</p>
<h3 id="what-yolo-means">What YOLO means</h3>
<p><strong>YOLO</strong> stands for &quot;You Only Look Once.&quot; In practice, people usually mean a family of object detection models and training formats that are popular because they are fast and widely supported.</p>
<p>When someone asks for a YOLO export, they usually mean:</p>
<ul>
<li>the image files</li>
<li>a text file per image</li>
<li>one row per object</li>
<li>class id plus normalized box coordinates</li>
</ul>
<h3 id="what-coco-means">What COCO means</h3>
<p><strong>COCO</strong> is another common dataset format. Instead of one text file per image, it usually stores annotations in a structured JSON file.</p>
<p>People often choose COCO when they want:</p>
<ul>
<li>a more explicit schema</li>
<li>compatibility with training and evaluation tools</li>
<li>support for richer metadata</li>
</ul>
<p>Neither format is &quot;better&quot; in every case. The right choice is usually whatever your training workflow expects.</p>
<h3 id="what-map50-means">What mAP50 means</h3>
<p><strong>mAP50</strong> is one of the most common object detection metrics.</p>
<p>A simple way to think about it is:</p>
<ul>
<li>the model predicts a box</li>
<li>that box is compared with the ground-truth box</li>
<li>if the overlap is good enough, it counts as a match</li>
<li><code>50</code> means the overlap threshold is 0.50 IoU</li>
</ul>
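<p>The overlap measure behind that threshold is IoU, intersection over union. A minimal sketch for axis-aligned pixel boxes:</p>

```python
# Sketch: IoU (intersection over union) for two axis-aligned pixel boxes
# given as (x_min, y_min, x_max, y_max). At mAP50, a prediction needs
# IoU of at least 0.50 with a ground-truth box to count as a match.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

print(iou((0, 0, 100, 100), (50, 0, 150, 100)))  # intersection 5000, union 15000
```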
<p>Higher mAP50 is usually better, but it is not the whole story.</p>
<p>A decent beginner rule is:</p>
<ul>
<li>use mAP50 as one signal</li>
<li>also inspect real predictions by eye</li>
<li>check whether the model misses small objects, duplicates boxes, or confuses similar classes</li>
</ul>
<p>You are not building a good model if the score looks fine but the boxes are wrong on real images.</p>
<h2 id="a-simple-checklist-before-you-train">A simple checklist before you train</h2>
<p>Before exporting your first dataset, ask:</p>
<ul>
<li>are my class names still simple and stable?</li>
<li>are my boxes tight in the same way across images?</li>
<li>do I have enough variety in backgrounds, size, and lighting?</li>
<li>do I have a validation set separated from training?</li>
<li>have I included hard examples and empty scenes?</li>
<li>does the export format match my training workflow, such as YOLO or COCO?</li>
</ul>
<p>If you can answer yes to most of those, you are in a much better position than many first-time projects.</p>
<blockquote>
<p>If you want to turn these tips into a repeatable workflow, begin in <a href="/onboarding">onboarding</a>, follow <a href="/getting-started/">Getting Started</a>, and check <a href="/billing-and-credits/">Billing &amp; Credits</a> before you run AI labelling on a bigger image set.</p>
</blockquote>
<h2 id="final-thought">Final thought</h2>
<p>For a first object detection project, the goal is not to build a perfect benchmark model. The goal is to create a dataset that teaches the model the right pattern clearly.</p>
<p>That usually comes down to a few unglamorous habits:</p>
<ul>
<li>narrow scope</li>
<li>consistent classes</li>
<li>tight boxes</li>
<li>enough varied images</li>
<li>honest validation</li>
</ul>
<p>Those habits scale surprisingly well. If you get them right early, your second dataset and your second model become much easier to improve.</p>
]]></content:encoded>
    </item>
    <item>
      <title>What image labelling is and how to start your first machine learning dataset</title>
      <link>https://bboxml.com/blog/what-is-image-labelling-and-how-do-i-start/</link>
      <guid>https://bboxml.com/blog/what-is-image-labelling-and-how-do-i-start/</guid>
      <pubDate>Sun, 15 Mar 2026 00:00:00 GMT</pubDate>
      <description><![CDATA[A beginner-friendly introduction to image labelling, why it matters, and the simplest way to prepare your first dataset for a machine learning project.]]></description>
      <content:encoded><![CDATA[<p>If you are brand new to machine learning, image labelling is one of the first practical jobs you will run into. It sounds technical, but the idea is simple: you show a computer examples of what you want it to notice.</p>
<p>For an image model, those examples usually start with humans looking at pictures and marking the important things in them. That marking process is called <strong>image labelling</strong> or <strong>annotation</strong>.</p>
<blockquote>
<p>If you want to move from the idea stage to a real dataset quickly, pair this guide with BBoxML&#39;s <a href="/getting-started/">Getting Started</a> flow. It turns the basics here into a clear next step: create a project, upload images, label a small batch, and prepare your first export.</p>
</blockquote>
<h2 id="what-image-labelling-actually-means">What image labelling actually means</h2>
<p>Imagine you want a model to spot dogs in photos.</p>
<p>You cannot just tell the computer &quot;this is a dog&quot; once and expect it to understand. You need to give it many examples. For each example image, you mark where the dog is and attach the correct label. Over time, the model learns patterns from those examples.</p>
<p>That means a labelled dataset is really just a teaching set:</p>
<ul>
<li>the image is the example</li>
<li>the label says what matters in the image</li>
<li>the collection of many labelled images becomes training data</li>
</ul>
<p>In BBoxML, one common way to do this is by drawing a bounding box around an object and assigning it a class name such as <code>dog</code>, <code>cat</code>, or <code>car</code>.</p>
<h2 id="why-labelling-matters-so-much">Why labelling matters so much</h2>
<p>When people first hear about machine learning, they often focus on the model. In practice, beginners usually get better results by focusing on the dataset first.</p>
<p>If the labels are unclear, inconsistent, or incomplete, the model learns from messy teaching material. If the labels are accurate and consistent, the model has a much better chance of learning the right pattern.</p>
<p>This is why image labelling is not busywork. It is one of the most important parts of the whole project.</p>
<h2 id="what-a-first-project-should-look-like">What a first project should look like</h2>
<p>Your first machine learning dataset does not need to be large or complicated.</p>
<p>A good first project usually looks like this:</p>
<ol>
<li>Pick one simple task.</li>
<li>Choose a small set of clear labels.</li>
<li>Label a manageable batch of images.</li>
<li>Export the results in a format your training workflow can use.</li>
</ol>
<p>For example, you might start with:</p>
<ul>
<li>one object type, such as <code>dog</code></li>
<li>50 to 200 images</li>
<li>a single rule for what should be boxed</li>
</ul>
<p>That is enough to learn the workflow without getting buried in edge cases too early.</p>
<h2 id="how-to-label-images-for-the-first-time">How to label images for the first time</h2>
<p>If you are about to create your first dataset, this sequence works well:</p>
<h3 id="1-decide-what-the-model-should-notice">1. Decide what the model should notice</h3>
<p>Be specific. &quot;Animals&quot; is broad. &quot;Dogs in outdoor photos&quot; is much clearer.</p>
<p>The clearer the goal, the easier it is to decide what should and should not be labelled.</p>
<h3 id="2-write-down-your-label-rules">2. Write down your label rules</h3>
<p>Before you start drawing boxes, decide the rules you will follow.</p>
<p>Examples:</p>
<ul>
<li>Should partly hidden objects still be labelled?</li>
<li>Should very small objects be ignored?</li>
<li>Should blurry objects be included?</li>
</ul>
<p>These decisions matter because consistency is often more important than perfection.</p>
<h3 id="3-keep-your-classes-simple">3. Keep your classes simple</h3>
<p>Beginners often create too many labels too soon. Start with the smallest useful set.</p>
<table>
<thead>
<tr>
<th>Good starting approach</th>
<th>Harder starting approach</th>
</tr>
</thead>
<tbody><tr>
<td><code>dog</code></td>
<td><code>small-dog</code>, <code>large-dog</code>, <code>puppy</code>, <code>running-dog</code>, <code>sleeping-dog</code></td>
</tr>
<tr>
<td><code>car</code></td>
<td><code>sedan</code>, <code>hatchback</code>, <code>SUV</code>, <code>pickup</code>, <code>van</code></td>
</tr>
</tbody></table>
<p>You can always add more detail later once the basic workflow is stable.</p>
<h3 id="4-label-a-small-batch-first">4. Label a small batch first</h3>
<p>Do not wait until you have labelled thousands of images to review your work.</p>
<p>Label a small batch, then stop and check:</p>
<ul>
<li>are the boxes placed consistently?</li>
<li>are class names clear?</li>
<li>are there confusing edge cases that need rules?</li>
</ul>
<p>This quick review saves a lot of rework later.</p>
<h2 id="common-beginner-mistakes">Common beginner mistakes</h2>
<p>Here are a few problems that show up again and again in first projects:</p>
<ul>
<li>changing class names halfway through the dataset</li>
<li>labelling some difficult examples but skipping similar ones later</li>
<li>starting with too many categories</li>
<li>collecting images before deciding what &quot;good&quot; labels look like</li>
</ul>
<p>None of these mistakes are unusual. They are part of the learning curve. The goal is simply to catch them early.</p>
<h2 id="what-happens-after-labelling">What happens after labelling</h2>
<p>Once your images are labelled, the dataset can usually be exported into a standard format such as YOLO or COCO. That exported data is what a training pipeline or machine learning engineer will use next.</p>
<p>You do not need to master model training on day one. A strong first step is just this:</p>
<ul>
<li>understand the problem you want to solve</li>
<li>label a small dataset consistently</li>
<li>export it cleanly</li>
</ul>
<p>That is already real progress.</p>
<blockquote>
<p>Ready to try the workflow on your own images? Start in <a href="/onboarding">onboarding</a>, follow <a href="/getting-started/">Getting Started</a>, and use <a href="/billing-and-credits/">Billing &amp; Credits</a> if you want to estimate AI usage before you label a larger batch.</p>
</blockquote>
<h2 id="a-good-mindset-for-your-first-dataset">A good mindset for your first dataset</h2>
<p>Your first dataset is not supposed to be perfect. It is supposed to teach you the workflow.</p>
<p>If you can explain:</p>
<ul>
<li>what the model should detect</li>
<li>what each class means</li>
<li>how you decided what to label</li>
</ul>
<p>then you are already doing the important work well.</p>
<p>Machine learning projects become much easier once the dataset has a clear structure. That is exactly why tools like BBoxML exist: to make the first part of the journey feel understandable, not overwhelming.</p>
]]></content:encoded>
    </item>
  </channel>
</rss>