The 'Goldilocks' Problem of Chunking (And How to Solve It)
AI Engineering · Oct 17, 2025 · 8 min read


If you survived the 'Data Hygiene' audit from my last post, you now have a pile of clean text. No mojibake, no headers in the middle of sentences, no whitespace wastelands.

Think of your data like the blueprints for the Death Star. Every time you ask the AI a question, you're sending a reconnaissance droid to find the answer. If the droid brings back the entire dataset, every floor plan, plumbing schematic, and trash compactor location, the Superlaser chokes on the noise. If the droid only brings back a single square inch of blueprint, the Rebels can't see that the small pipe leads to the thermal exhaust port. The chunks need to be the right size.

You can't just shove a 50-page PDF into a vector database and hope for the best. Retrieval has a Goldilocks problem.

  • Too Small: If you chop your text into single sentences, the AI loses context. It sees "He said yes" but has no idea who "He" is or what he agreed to.
  • Too Big: If you chop it into 5-page blocks, you dilute the meaning. The specific answer you need is buried in a mountain of irrelevant text, and the "search" step fails to find it.

You need to find the size that is just right. This process is called Chunking.

Four strategies, in order of sophistication.

Level 1: Fixed-Size Chunking

This is the default setting in most tutorials. You simply set a character limit (e.g., 500 characters) and chop the text blindly.

  • The Logic: "Every 500 characters, cut the tape."
  • The Problem: It creates "Orphaned Context." It will happily cut a sentence in half. The first chunk ends with "The CEO decided to..." and the second chunk starts with "...fire the marketing team." When you retrieve these later, neither chunk makes sense on its own.
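The idea fits in a few lines. Here's a minimal sketch (the function name and parameters are mine, not from any library); the `overlap` parameter is the usual mitigation for orphaned context, giving a severed sentence a second chance in the next chunk:

```python
def fixed_size_chunks(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Blindly cut `text` every `size` characters. A small `overlap`
    repeats the tail of each chunk at the start of the next, so a
    sentence severed at a boundary at least survives somewhere whole."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = fixed_size_chunks("The CEO decided to fire the marketing team. " * 30,
                           size=100, overlap=20)
```

Even with overlap, this still cuts mid-sentence; it's cheap insurance, not a fix.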

Level 2: Recursive Character Chunking

This is what 90% of you should use. It attempts to split by paragraphs first. If the paragraph is too big, it tries sentences. If the sentence is too big, it tries words.

It respects the natural borders of human language.

  • The Logic: "Try to keep paragraphs together. If that fails, break it gently."
  • The Vibe: It feels like how a human would highlight a book.
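In practice you'd reach for a library splitter (LangChain's RecursiveCharacterTextSplitter is the best-known), but the core idea is simple enough to sketch in plain Python. This is a simplified illustration, not the library's actual implementation; real splitters also merge small adjacent pieces back up toward the size limit, which this skips:

```python
def recursive_split(text: str, max_len: int = 500,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first (paragraphs), falling back
    to finer ones (lines, sentences, words) only for pieces that are
    still too large."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # Nothing gentler left to try: hard cut, like Level 1.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, max_len, rest)
    chunks = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, max_len, rest))
    return chunks
```

A paragraph that fits stays intact; only oversized pieces get broken further, which is why the output respects natural borders.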

Level 3: Markdown Split

If you are technical enough to convert your documents to Markdown before uploading (which you should), you unlock a superpower.

You can chunk by Headers. This guarantees that a section under ## Return Policy stays together and doesn't get mixed with ## Shipping Policy.
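Libraries like LangChain ship a MarkdownHeaderTextSplitter for this, but a minimal version is just a regex over lines. A rough sketch (function name and behavior are my own simplification, handling only one header level):

```python
import re

def split_by_headers(markdown: str, level: int = 2) -> dict[str, str]:
    """Group a Markdown document into sections keyed by their header text,
    so 'Return Policy' text never bleeds into 'Shipping Policy'."""
    pattern = rf"^({'#' * level}) (.+)$"
    sections, current = {}, "Preamble"
    for line in markdown.splitlines():
        m = re.match(pattern, line)
        if m:
            current = m.group(2).strip()
            sections[current] = ""
        else:
            sections[current] = sections.get(current, "") + line + "\n"
    return {k: v.strip() for k, v in sections.items() if v.strip()}
```

As a bonus, the header text makes a natural metadata label to store alongside each chunk in the vector database.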

Level 4: Semantic Chunking (Galaxy Brain)

This is the new frontier. Instead of splitting by characters (math), you split by meaning (vibes).

The AI scans your text and calculates the "embedding distance" between sentences. If Sentence A and Sentence B are talking about the same topic, they stay together. If Sentence C shifts the topic, the AI creates a cut.

  • The Pros: It creates perfect, topic-isolated chunks.
  • The Cons: It requires running an AI model during the chunking process, which is slower and costs money (tokens).

The Concept:

  1. Embed every sentence.
  2. Compare the "similarity score" of each sentence to the one before it.
  3. If the score drops below a threshold (say, 0.7), the topic changed. CUT HERE.
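Those three steps can be sketched end to end. To keep this runnable, I've swapped in a toy bag-of-words "embedding" where a real pipeline would call an embedding model (OpenAI, sentence-transformers, etc.), and lowered the threshold accordingly; everything here is illustrative, not a production recipe:

```python
import math
import re
from collections import Counter

def toy_embed(sentence: str) -> Counter:
    # Stand-in for a real embedding model: a bag-of-words vector.
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.3) -> list[list[str]]:
    """Start a new chunk whenever similarity to the previous sentence
    drops below `threshold`, i.e. the topic appears to have shifted."""
    chunks = [[sentences[0]]]
    prev = toy_embed(sentences[0])
    for s in sentences[1:]:
        emb = toy_embed(s)
        if cosine(prev, emb) < threshold:
            chunks.append([])  # topic shift detected: cut here
        chunks[-1].append(s)
        prev = emb
    return chunks
```

With real embeddings the right threshold depends on the model, which is why semantic chunkers usually need tuning per dataset.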

Which One to Use

Don't overthink it. Start with Level 2 (Recursive). It's free, fast, and works for 95% of use cases.

  • Use Fixed (Level 1) if you are hacking a prototype in 5 minutes.
  • Use Recursive (Level 2) for almost everything.
  • Use Markdown (Level 3) if you have control over the source data format.
  • Use Semantic (Level 4) only if you have complex, dense data where topics shift rapidly (like a transcript of a messy meeting).
