Aggregating 50K+ Product Reviews with AI | aversusb.net

Daniel Rozin

When someone searches "AirPods Pro 2 vs Sony WF-1000XM5," they don't want another opinion piece. They want the aggregate truth — what do thousands of real users actually think, across every source that matters?

That's what we built at aversusb.net. Here's how the review aggregation pipeline works under the hood.

The Challenge: Reviews Are Messy#

Product reviews exist everywhere — Amazon, Reddit, YouTube comments, RTINGS, Wirecutter, G2, Trustpilot. The data is:

Scattered across 50+ sources with different formats
Inconsistent — "AirPods Pro 2" vs "Apple AirPods Pro (2nd Generation)" vs "APP2"
Noisy — spam reviews, promotional content, outdated info
Unstructured — free text with no standard schema

Our job: turn this chaos into structured, reliable comparison data.

Step 1: Entity Resolution#

Before you can aggregate reviews, you need to know which product each review is about. This sounds trivial. It's not.

// These all refer to the same product:
const aliases = [
  "AirPods Pro 2",
  "AirPods Pro (2nd gen)",
  "Apple AirPods Pro 2nd Generation",
  "APP2",
  "AirPods Pro USB-C",
  "airpods pro 2022",
];

Our approach:

Seed entities from manufacturer specs (canonical names, model numbers, SKUs)
Fuzzy matching with normalized strings — strip punctuation, lowercase, expand abbreviations
AI disambiguation — for edge cases, Claude determines whether "Galaxy Buds" in a 2024 review means Buds3, Buds2 Pro, or Buds FE based on surrounding context
Manual overrides — a small lookup table for known tricky cases

Entity resolution accuracy improved from ~78% to ~96% after we added the AI disambiguation step.

Step 2: Multi-Source Collection#

We pull reviews from multiple source types:

Source Type	Examples	What We Extract
Structured	Amazon, G2, RTINGS	Ratings, pros/cons
Semi-structured	Reddit, forums	Opinions in context
Expert	Wirecutter, Tom's Guide	Detailed test results
Social	YouTube, Twitter	Sentiment signals

Each source type gets a different weight in our aggregate scoring:

Expert reviews (Wirecutter, RTINGS): 2x weight — standardized testing
Verified purchase reviews (Amazon): 1.5x weight
Community discussions (Reddit): 1x weight — great for real-world usage patterns
Social mentions: 0.5x weight — signal, not substance

Step 3: Attribute Extraction#

Raw reviews are free text. We need structured attributes.

For earbuds, our target schema:

Sound quality (1-5)
ANC effectiveness (1-5)
Comfort (1-5)
Battery life (hours, verified against specs)
Value for money (1-5)
Build quality (1-5)

We use Claude to extract these from review text:

const prompt = `
  Extract product attribute ratings from this review.
  Only extract attributes explicitly mentioned.
  Return null for attributes not discussed.

  Review: "${reviewText}"
  Product: "${productName}"

  Return JSON: { soundQuality, anc, comfort, batteryLife, value, buildQuality }
`;

Key insight: always return null for unmentioned attributes, never infer. A review that says "great sound" but doesn't mention ANC should not generate an ANC score. This avoids hallucinated data contaminating the aggregate.

Step 4: Sentiment Aggregation#

With structured attributes from thousands of reviews, we compute weighted aggregates:

function aggregateRatings(reviews: ExtractedReview[]): AggregateScore {
  const weighted = reviews.map(r => ({
    ...r,
    weight: SOURCE_WEIGHTS[r.source] * recencyFactor(r.date)
  }));

  // Only include attributes with 10+ data points
  const attributes = ATTRIBUTE_KEYS.filter(attr =>
    weighted.filter(r => r[attr] !== null).length >= 10
  );

  return attributes.reduce((acc, attr) => {
    const valid = weighted.filter(r => r[attr] !== null);
    const sum = valid.reduce((s, r) => s + r[attr] * r.weight, 0);
    const totalWeight = valid.reduce((s, r) => s + r.weight, 0);
    acc[attr] = Math.round((sum / totalWeight) * 10) / 10;
    return acc;
  }, {});
}

The `recencyFactor` gives newer reviews more weight — a 2024 review about firmware-updated ANC is more relevant than a 2022 launch-day review.

What We Learned#

1. Source diversity beats volume#

500 Amazon reviews + 50 Reddit threads + 5 expert reviews gives a more accurate picture than 5,000 Amazon reviews alone. Each source captures different usage patterns and user segments.

2. Negative reviews are more informative#

Users who rate a product 3/5 tend to write the most detailed, attribute-specific reviews. Five-star reviews often say "love it!" (not useful). One-star reviews are often about shipping/returns (not product quality). Three-star reviews are gold.

3. AI extraction needs guardrails#

Without explicit instructions to return null for unmentioned attributes, Claude will sometimes infer ratings from context. "These earbuds have great ANC" does NOT imply anything about comfort. We added validation that rejects any extraction where >80% of attributes are rated — that's a sign the model is filling in blanks.

4. Freshness matters more than you think#

Product firmware updates can dramatically change the experience. The AirPods Pro 2 got significantly better ANC via software update months after launch. Our recency weighting captures this — static aggregation would miss it.

Results#

Our aggregated comparison data now powers thousands of comparison pages on aversusb.net. Each page shows:

Weighted aggregate ratings across all sources
Attribute-by-attribute breakdown
Source count and freshness indicators
Confidence scores (higher when more sources agree)

The average comparison page references data from 15+ review sources and 500+ individual reviews.

Try It#

Browse comparisons at aversusb.net — click any comparison to see the aggregated review data in action. A good place to start: AirPods Pro 2 vs Sony WF-1000XM5.

You may also like our deep dives on ChatGPT vs Claude and Anthropic vs OpenAI for more on the AI models powering this kind of work.

Conclusion#

Aggregating reviews at scale is less about volume and more about source diversity, careful entity resolution, and disciplined extraction. The biggest gains came from refusing to let the model guess — null values are a feature, not a bug. If you're building something similar, start with the boring parts (entity resolution, source weighting) before you reach for fancier modeling. That's where the accuracy lives.

#ai #machine-learning #reviews #engineering #comparison

Share:

Get the best comparisons in your inbox

Weekly digest of trending comparisons, new categories, and expert insights. No spam.

Join 1,000+ readers · Unsubscribe anytime

1 head-to-head comparisons

C
C
VS
ChatGPT vs Claude

Back to Blog

Share this article

Get the best comparisons in your inbox

Related Comparisons