Scraping 900+ Cruise Ship Editor Reviews at Scale
Built a Node.js scraper that extracts editor reviews for ~900 cruise ships. Covers overview, dining, activities, and cabins sections using ScraperAPI with concurrent scraping, caching, and auto-resume.
Digital Evenings
Reading through hundreds of cruise ship reviews manually was driving me crazy. I wanted to compare what professional editors actually thought about different ships—the dining, activities, cabins—but clicking through 900+ ship pages one by one? No thanks.
So I built a scraper that pulls editor reviews for every cruise ship across all major cruise lines. ~900 ships, 4 review sections each, all saved to a nice structured JSONL or CSV file.
GitHub Repository: cruise-reviews-extractor
What It Actually Scrapes
This isn't pulling user reviews (that's coming later). It grabs professional editor reviews for 4 sections of each ship:
- Overview - General editor review with pros/cons
- Dining - Restaurant reviews and dining options
- Activities - Onboard activities and entertainment
- Cabins - Cabin types and accommodations
The scraper hits every cruise line (Royal Caribbean, Carnival, Norwegian, etc.) and extracts the editor's take on these sections for each ship. Some smaller ships don't have all sections—that's fine, it just saves null for those.
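Under the hood the four sections are just URL suffixes off each ship's base page, so they can live in a tiny config map. A minimal sketch (the exact structure in the repo may differ):
// The four editor-review sections and the URL suffix each one lives under.
// A null suffix means the section sits on the ship's base page.
const SECTIONS = {
  editor_review: null,
  dining: '/dining',
  activities: '/activities',
  cabins: '/cabins',
};
// Build the list of URLs to fetch for one ship
function sectionUrls(baseUrl) {
  return Object.entries(SECTIONS).map(([name, suffix]) => ({
    name,
    url: suffix ? `${baseUrl}${suffix}` : baseUrl,
  }));
}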
Why Editor Reviews?
User reviews are all over the place—"OMG BEST SHIP EVER!!!" vs "literally sinking garbage" for the same ship. Editor reviews are more consistent. They follow a structure, cover specific aspects, and actually help you compare apples to apples.
Plus, the data structure is predictable, which makes scraping way easier.
Key Features
1. ScraperAPI Integration
I tried building my own proxy rotation setup first. Spoiler: it was a nightmare. Managing proxy pools, handling rate limits, dealing with CAPTCHAs... I spent more time debugging the infrastructure than actually scraping.
Then I switched to ScraperAPI and honestly wish I'd done it from day one. Here's the setup:
const ScraperAPI = require('scraperapi-sdk');
const scraperClient = new ScraperAPI(process.env.SCRAPER_API_KEY);
async function fetchShipData(url) {
try {
const response = await scraperClient.get(url, {
country_code: 'us',
render: true, // Enable JavaScript rendering if needed
});
return response;
} catch (error) {
console.error('ScraperAPI request failed:', error);
throw error;
}
}
Why this works so well:
- They handle all the proxy rotation stuff automatically
- JavaScript-heavy sites? No problem, they render it
- CAPTCHAs just... work (still feels like magic)
- You only pay when the request actually succeeds
- Way less headaches than managing your own proxy pool
2. Actual Output Structure
Here's what the data looks like for each ship (JSONL format - one JSON object per line):
{
"cruiseline": "royal-caribbean",
"ship": "symphony-of-the-seas",
"editor_review": {
"review": "Symphony of the Seas is the world's largest cruise ship...",
"pros": ["Incredible variety of dining", "Outstanding entertainment"],
"cons": ["Can feel crowded", "Long lines at peak times"]
},
"dining": {
"review": "With over 20 dining venues...",
"restaurants": ["Wonderland", "Chops Grille", "Jamie's Italian"]
},
"activities": {
"review": "The Ultimate Abyss slide...",
"activities": ["Ice skating", "Zip line", "Rock climbing", "Water slides"]
},
"cabins": {
"review": "Staterooms range from cozy interior cabins..."
},
"scraped_at": "2025-10-15T12:30:45.123Z"
}
If a ship doesn't have a dedicated dining/activities/cabins page (common for smaller river cruise ships), those fields are just null. No big deal—the scraper handles it gracefully.
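Handling that gracefully just means treating a missing page as null instead of an error. Here's a minimal sketch of what scrapeSection could look like; parseSection and the exact way the client reports a 404 are assumptions, not the repo's actual code:
// Fetch one section page and parse it; return null when the page simply
// doesn't exist for this ship (common for smaller river cruise ships).
async function scrapeSection(url) {
  try {
    const html = await fetchShipData(url); // ScraperAPI wrapper from above
    return parseSection(html);             // hypothetical HTML -> object parser
  } catch (error) {
    // How the status surfaces depends on the client; adjust the check as needed.
    if (error.statusCode === 404 || /404/.test(String(error.message))) {
      return null; // no dining/activities/cabins page for this ship
    }
    throw error; // anything else is a real failure and should bubble up
  }
}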
3. Multiple Output Formats
Choose between JSONL (JSON Lines) or CSV based on your needs:
// JSONL output - one JSON object per line
{"cruiseline":"royal-caribbean","ship":"symphony-of-the-seas","editor_review":{...},"dining":{...},"activities":{...},"cabins":{...}}
{"cruiseline":"royal-caribbean","ship":"harmony-of-the-seas","editor_review":{...},"dining":{...},"activities":{...},"cabins":{...}}
{"cruiseline":"carnival","ship":"carnival-vista","editor_review":{...},"dining":{...},"activities":{...},"cabins":{...}}
// CSV output - flattened spreadsheet format
cruiseline,ship,editor_review,dining_review,activities_review,cabins_review
"royal-caribbean","symphony-of-the-seas","Symphony of the Seas is the world's largest cruise ship...","With over 20 dining venues...","The Ultimate Abyss slide...","Staterooms range from cozy interior cabins..."
"royal-caribbean","harmony-of-the-seas","Harmony of the Seas...","...","...","..."
JSONL Benefits:
- Easy to append new data
- Streaming-friendly for large datasets
- Simple to parse line-by-line
- Compact and efficient
CSV Benefits:
- Opens directly in Excel/Google Sheets
- Compatible with data analysis tools
- Universal format support
- Easy to share with non-technical users
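Writing either format is simple enough. Here's a rough sketch of an output writer; the CSV flattening assumes the schema shown above and isn't the repo's exact column layout:
const fs = require('fs');
// Append one ship record to a JSONL file: one JSON object per line.
function writeJsonl(filePath, record) {
  fs.appendFileSync(filePath, JSON.stringify(record) + '\n');
}
// Append one ship record to a CSV file, flattening the nested reviews.
// Quotes are doubled so commas and quotes inside review text don't break the row.
function writeCsv(filePath, record) {
  const esc = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;
  const row = [
    record.cruiseline,
    record.ship,
    record.editor_review?.review,
    record.dining?.review,
    record.activities?.review,
    record.cabins?.review,
  ].map(esc).join(',');
  fs.appendFileSync(filePath, row + '\n');
}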
4. Concurrent Scraping (The Smart Way)
Here's where it gets interesting. For each ship, the scraper hits 4 URLs simultaneously:
- Base URL (overview/editor review)
- /dining page
- /activities page
- /cabins page
All 4 requests fire at once, then the scraper moves to the next ship. With SCRAPERAPI_MAX_THREADS=3, that's 3 ships × 4 URLs = 12 concurrent requests at peak.
// Simplified version of the actual scraper
async function scrapeShip(cruiseline, ship) {
const baseUrl = `https://site.com/${cruiseline}/${ship}`;
// Fire all 4 requests at once
const [overview, dining, activities, cabins] = await Promise.all([
scrapeSection(baseUrl),
scrapeSection(`${baseUrl}/dining`),
scrapeSection(`${baseUrl}/activities`),
scrapeSection(`${baseUrl}/cabins`)
]);
return {
cruiseline,
ship,
editor_review: overview,
dining,
activities,
cabins,
scraped_at: new Date().toISOString()
};
}
Why this works:
- ScraperAPI handles the proxy rotation automatically
- Each section fails independently (dining might 404, but overview succeeds)
- You can scrape 900 ships in a few hours instead of days
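To cap the ship-level concurrency (the 3 ships in the example above), you can wrap scrapeShip in a limiter. A sketch using the p-limit package; the repo may use a different mechanism:
const pLimit = require('p-limit'); // CommonJS build (p-limit v3)
// At most 3 ships in flight; each ship still fires its 4 section requests
// in parallel, so peak concurrency is 3 x 4 = 12 requests.
const limit = pLimit(Number(process.env.SCRAPERAPI_MAX_THREADS) || 3);
async function scrapeFleet(shipList) {
  const results = await Promise.all(
    shipList.map(({ cruiseline, ship }) =>
      limit(() =>
        scrapeShip(cruiseline, ship).catch((error) => {
          console.error(`Failed ${cruiseline}/${ship}:`, error.message);
          return null; // keep going; failures get retried or resumed later
        })
      )
    )
  );
  return results.filter(Boolean);
}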
5. Smart Caching System
Avoid re-scraping the same data:
const fs = require('fs').promises;
const crypto = require('crypto');
class ScraperCache {
constructor(cacheDir = '.cache') {
this.cacheDir = cacheDir;
}
getCacheKey(url) {
return crypto.createHash('md5').update(url).digest('hex');
}
async get(url) {
const key = this.getCacheKey(url);
const cachePath = `${this.cacheDir}/${key}.json`;
try {
const data = await fs.readFile(cachePath, 'utf-8');
const cached = JSON.parse(data);
// Cache expires after 30 days
if (Date.now() - cached.timestamp < 30 * 24 * 60 * 60 * 1000) {
console.log(`Cache hit: ${url}`);
return cached.data;
}
} catch (error) {
// Cache miss
}
return null;
}
async set(url, data) {
const key = this.getCacheKey(url);
const cachePath = `${this.cacheDir}/${key}.json`;
await fs.mkdir(this.cacheDir, { recursive: true });
await fs.writeFile(cachePath, JSON.stringify({
url,
timestamp: Date.now(),
data
}));
}
}
// Usage
const cache = new ScraperCache();
async function extractWithCache(url) {
const cached = await cache.get(url);
if (cached) return cached;
const data = await extractShipData(url);
await cache.set(url, data);
return data;
}
6. Resume Capability
Interrupted scraping? Pick up right where you left off:
const fs = require('fs').promises;
class ScrapingSession {
constructor(sessionFile = '.session.json') {
this.sessionFile = sessionFile;
}
async loadSession() {
try {
const data = await fs.readFile(this.sessionFile, 'utf-8');
return JSON.parse(data);
} catch {
return { completed: [], failed: [] };
}
}
async saveProgress(completed, failed) {
await fs.writeFile(this.sessionFile, JSON.stringify({
completed,
failed,
lastUpdated: new Date().toISOString()
}, null, 2));
}
async scrapeWithResume(urls) {
const session = await this.loadSession();
const remaining = urls.filter(url =>
!session.completed.includes(url)
);
console.log(`Resuming: ${remaining.length} ships remaining`);
for (const url of remaining) {
try {
const data = await extractShipData(url);
session.completed.push(url);
await this.saveProgress(session.completed, session.failed);
} catch (error) {
session.failed.push(url);
await this.saveProgress(session.completed, session.failed);
}
}
}
}
Technical Implementation
Architecture
┌──────────────────┐
│ Ship URL List │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Resume Check │ ← Load previous session
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Cache Layer │ ← Check if already scraped
└────────┬─────────┘
│
▼
┌──────────────────┐
│ ScraperAPI │ ← Fetch ship data
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Data Parser │ ← Extract structured data
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Output Writer │ ← Export to JSONL/CSV
└──────────────────┘
Error Handling
Production-ready error handling with retries:
async function extractWithRetry(url, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const data = await fetchShipData(url);
return parseShipData(data);
} catch (error) {
console.error(`Attempt ${attempt} failed for ${url}:`, error.message);
if (attempt === maxRetries) {
throw new Error(`Failed after ${maxRetries} attempts: ${url}`);
}
// Exponential backoff
const backoffMs = Math.pow(2, attempt) * 1000;
console.log(`Retrying in ${backoffMs}ms...`);
await new Promise(resolve => setTimeout(resolve, backoffMs));
}
}
}
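Putting it all together, the pipeline from the diagram wires up roughly like this. A simplified sketch that reuses the helpers from the earlier snippets (including the writeJsonl sketch above), not the repo's actual entry point:
// End-to-end flow per URL: resume check -> cache -> ScraperAPI + parse -> output.
async function runPipeline(urls) {
  const session = new ScrapingSession();
  const cache = new ScraperCache();
  const progress = await session.loadSession();
  for (const url of urls) {
    if (progress.completed.includes(url)) continue;   // already done in a previous run
    try {
      let record = await cache.get(url);              // cache layer
      if (!record) {
        record = await extractWithRetry(url);         // ScraperAPI fetch + parse, with retries
        await cache.set(url, record);
      }
      writeJsonl('editor-reviews.jsonl', record);     // output writer
      progress.completed.push(url);
    } catch (error) {
      progress.failed.push(url);
    }
    await session.saveProgress(progress.completed, progress.failed);
  }
}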
Getting Started
You'll need Node.js (version 16 or higher) and a ScraperAPI account. They have a free tier that's perfect for testing, so grab an API key from their site if you don't have one yet.
Clone the repo and install dependencies:
git clone https://github.com/digitalevenings/cruise-reviews-extractor.git
cd cruise-reviews-extractor
npm install
Create a .env file with your API key:
SCRAPER_API_KEY=your_api_key_here
OUTPUT_FORMAT=jsonl # or 'csv'
CONCURRENCY=5
CACHE_ENABLED=true
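Those settings get picked up at startup with the usual dotenv pattern. A sketch, assuming the variable names from the .env example above:
require('dotenv').config(); // load .env into process.env
const config = {
  apiKey: process.env.SCRAPER_API_KEY,
  outputFormat: process.env.OUTPUT_FORMAT || 'jsonl',
  concurrency: Number(process.env.CONCURRENCY) || 5,
  cacheEnabled: process.env.CACHE_ENABLED !== 'false',
};
if (!config.apiKey) {
  throw new Error('SCRAPER_API_KEY is missing - add it to your .env file');
}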
Then just run it:
# Basic usage
npm start
# Want CSV instead?
npm start -- --format csv --output ships.csv
# Something crashed? Resume where you left off
npm start -- --resume
# Cache getting too big? Clear it out
npm run cache:clear
What Can You Do With This Data?
1. Compare Dining Across Cruise Lines
Ever wonder which cruise line has the best restaurants? Now you can actually analyze it:
const fs = require('fs');
const ships = fs.readFileSync('editor-reviews.jsonl', 'utf-8')
  .split('\n')
  .filter(Boolean)
  .map(line => JSON.parse(line));
// Find ships with exceptional dining reviews
const topDining = ships
.filter(s => s.dining?.review?.includes('exceptional'))
.map(s => `${s.cruiseline}/${s.ship}`);
console.log('Ships with exceptional dining:', topDining);
2. Analyze Common Pros and Cons
What do editors consistently praise or criticize?
const allPros = ships
.flatMap(s => s.editor_review?.pros || [])
.reduce((acc, pro) => {
acc[pro] = (acc[pro] || 0) + 1;
return acc;
}, {});
// Most mentioned pros
console.log(Object.entries(allPros)
.sort((a, b) => b[1] - a[1])
.slice(0, 10));
3. Build a Comparison Tool
Export specific ships to compare side-by-side:
// Compare Royal Caribbean's biggest ships
const rcShips = ships.filter(s =>
s.cruiseline === 'royal-caribbean' &&
s.ship.includes('seas')
);
// Export for analysis
fs.writeFileSync('royal-caribbean-comparison.json',
JSON.stringify(rcShips, null, 2)
);
Best Practices
Respect ScraperAPI Limits
Monitor your API usage:
// ScraperAPI provides usage stats in response headers
const response = await scraperClient.get(url);
console.log('API Credits Remaining:', response.headers['x-credits-remaining']);
Clean Data Storage
Organize output files by scraping session:
const timestamp = new Date().toISOString().split('T')[0];
const outputFile = `data/ships_${timestamp}.jsonl`;
Monitor Progress
Log scraping statistics:
console.log(`
Scraping Complete:
Total Ships: ${total}
Successful: ${successful}
Failed: ${failed}
Cached: ${cached}
Duration: ${duration}s
`);
Quick Note on Ethics
Look, I built this for educational purposes and personal research. If you use it, just don't be that person who hammers a website with 1000 requests per second and gets everyone's IP banned.
Some common sense rules:
- Check robots.txt before scraping
- Add delays between requests (seriously, don't be greedy)
- If there's an official API, use that instead
- Don't scrape data you shouldn't have access to
- Be respectful of the servers you're hitting
The nice thing about using ScraperAPI is they handle a lot of this automatically—rotating IPs, throttling requests, respecting rate limits. But still, use your brain and don't abuse it.
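If you want an explicit pause between ships on top of that, it's a couple of lines. A sketch:
// Simple politeness delay between ships (in addition to whatever ScraperAPI does)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
async function scrapePolitely(shipList) {
  for (const { cruiseline, ship } of shipList) {
    await scrapeShip(cruiseline, ship);
    await sleep(1000 + Math.random() * 1000); // 1-2 seconds of jitter between ships
  }
}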
Performance Tips
Optimize Concurrency
Balance speed and reliability:
// Too high: May trigger rate limits or blocks
const concurrency = 20; // ❌
// Optimal: Fast but respectful
const concurrency = 5; // ✅
// Conservative: For sensitive targets
const concurrency = 2; // ✅
Use Selective Scraping
Only scrape what you need:
// Instead of scraping all ships
const allShips = await scrapeAllShips(); // ❌
// Filter first, then scrape
const recentShips = shipList.filter(s => s.yearBuilt >= 2020);
const data = await scrapeShips(recentShips); // ✅
Common Issues (And How to Fix Them)
Getting 429 errors from ScraperAPI?
You're probably hitting rate limits. Dial back the concurrency:
const limit = pLimit(2); // Try 2 instead of 5
Scraper keeps crashing with parsing errors?
The HTML structure probably changed. Add some validation:
if (!html.includes('ship-info')) {
throw new Error('Ship info section not found');
}
Cache folder taking up too much space?
Yeah, that happens. Just clear it:
npm run cache:clear
Want to Contribute?
Found a bug? Have an idea for a cool feature? PRs are totally welcome. Whether it's fixing typos in the docs or adding a whole new feature, I appreciate any help.
The repo is on GitHub—feel free to open an issue or submit a pull request.
What's Next?
Right now, this scraper gets editor reviews only. That's perfect for getting a professional perspective, but the next big feature is user reviews.
Imagine having thousands of actual passenger reviews for each ship—ratings, complaints, highlights—all structured and searchable. That's coming soon. The challenge is handling pagination (some popular ships have 50+ pages of reviews) and dealing with varied formats.
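For what it's worth, the pagination loop will probably look something like this. Purely a hypothetical sketch; fetchReviewPage and the URL shape don't exist yet:
// Hypothetical: walk a ship's user-review pages until one comes back empty.
async function scrapeUserReviews(shipUrl) {
  const reviews = [];
  for (let page = 1; page <= 100; page++) {                                 // hard cap as a safety net
    const batch = await fetchReviewPage(`${shipUrl}/reviews?page=${page}`); // not implemented yet
    if (!batch || batch.length === 0) break;                                // no more pages
    reviews.push(...batch);
  }
  return reviews;
}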
Some other ideas I'm considering:
- Sentiment analysis on the reviews (positive/negative/neutral scoring)
- A simple web UI to search and compare ships
- Integration with Airtable or Google Sheets
- Automated weekly scraping to catch new ships
The scraper's been running solid for months now. ScraperAPI handles all the proxy/CAPTCHA nonsense, and the caching system means you can stop and resume anytime without wasting API calls.
If you use this or have ideas for improvements, open an issue on GitHub. Always happy to chat about scraping challenges.
Happy scraping! 🚢
Repository: github.com/digitalevenings/cruise-reviews-extractor