Scraping 900+ Cruise Ship Editor Reviews at Scale
Built a Node.js scraper that extracts editor reviews for ~900 cruise ships. Covers overview, dining, activities, and cabins sections using ScraperAPI with concurrent scraping, caching, and auto-resume.
Digital Evenings
Reading through hundreds of cruise ship reviews manually was driving me crazy. I wanted to compare what professional editors actually thought about different ships—the dining, activities, cabins—but clicking through 900+ ship pages one by one? No thanks.
So I built a scraper that pulls editor reviews for every cruise ship across all major cruise lines. ~900 ships, 4 review sections each, all saved to a nice structured JSONL or CSV file.
GitHub Repository: cruise-reviews-extractor
What It Actually Scrapes
This isn't pulling user reviews (that's coming later). It grabs professional editor reviews for 4 sections of each ship:
- Overview - General editor review with pros/cons
- Dining - Restaurant reviews and dining options
- Activities - Onboard activities and entertainment
- Cabins - Cabin types and accommodations
The scraper hits every cruise line (Royal Caribbean, Carnival, Norwegian, etc.) and extracts the editor's take on these sections for each ship. Some smaller ships don't have all sections—that's fine, it just saves null for those.
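Under the hood the four sections are just URL suffixes off each ship's base page, so they can live in a tiny config map. A minimal sketch (the exact structure in the repo may differ):
// The four editor-review sections and the URL suffix each one lives under.
// A null suffix means the section sits on the ship's base page.
const SECTIONS = {
  editor_review: null,
  dining: '/dining',
  activities: '/activities',
  cabins: '/cabins',
};
// Build the list of URLs to fetch for one ship
function sectionUrls(baseUrl) {
  return Object.entries(SECTIONS).map(([name, suffix]) => ({
    name,
    url: suffix ? `${baseUrl}${suffix}` : baseUrl,
  }));
}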
Why Editor Reviews?
User reviews are all over the place—"OMG BEST SHIP EVER!!!" vs "literally sinking garbage" for the same ship. Editor reviews are more consistent. They follow a structure, cover specific aspects, and actually help you compare apples to apples.
Plus, the data structure is predictable, which makes scraping way easier.
Key Features
1. ScraperAPI Integration
I tried building my own proxy rotation setup first. Spoiler: it was a nightmare. Managing proxy pools, handling rate limits, dealing with CAPTCHAs... I spent more time debugging the infrastructure than actually scraping.
Then I switched to ScraperAPI and honestly wish I'd done it from day one. Here's the setup:
const ScraperAPI = require('scraperapi-sdk');
const scraperClient = new ScraperAPI(process.env.SCRAPER_API_KEY);
async function fetchShipData(url) {
try {
const response = await scraperClient.get(url, {
country_code: 'us',
render: true, // Enable JavaScript rendering if needed
});
return response;
} catch (error) {
console.error('ScraperAPI request failed:', error);
throw error;
}
}
Why this works so well:
- They handle all the proxy rotation stuff automatically
- JavaScript-heavy sites? No problem, they render it
- CAPTCHAs just... work (still feels like magic)
- You only pay when the request actually succeeds
- Way less headaches than managing your own proxy pool
2. Actual Output Structure
Here's what the data looks like for each ship (JSONL format - one JSON object per line):
{
"cruiseline": "royal-caribbean",
"ship": "symphony-of-the-seas",
"editor_review": {
"review": "Symphony of the Seas is the world's largest cruise ship...",
"pros": ["Incredible variety of dining", "Outstanding entertainment"],
"cons": ["Can feel crowded", "Long lines at peak times"]
},
"dining": {
"review": "With over 20 dining venues...",
"restaurants": ["Wonderland", "Chops Grille", "Jamie's Italian"]
},
"activities": {
"review": "The Ultimate Abyss slide...",
"activities": ["Ice skating", "Zip line", "Rock climbing", "Water slides"]
},
"cabins": {
"review": "Staterooms range from cozy interior cabins..."
},
"scraped_at": "2025-10-15T12:30:45.123Z"
}
If a ship doesn't have a dedicated dining/activities/cabins page (common for smaller river cruise ships), those fields are just null. No big deal—the scraper handles it gracefully.
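Handling that gracefully just means treating a missing page as null instead of an error. Here's a minimal sketch of what scrapeSection could look like; parseSection and the exact way the client reports a 404 are assumptions, not the repo's actual code:
// Fetch one section page and parse it; return null when the page simply
// doesn't exist for this ship (common for smaller river cruise ships).
async function scrapeSection(url) {
  try {
    const html = await fetchShipData(url); // ScraperAPI wrapper from above
    return parseSection(html);             // hypothetical HTML -> object parser
  } catch (error) {
    // How the status surfaces depends on the client; adjust the check as needed.
    if (error.statusCode === 404 || /404/.test(String(error.message))) {
      return null; // no dining/activities/cabins page for this ship
    }
    throw error; // anything else is a real failure and should bubble up
  }
}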
3. Multiple Output Formats
Choose between JSONL (JSON Lines) or CSV based on your needs:
// JSONL output - one JSON object per line
{"cruiseline":"royal-caribbean","ship":"symphony-of-the-seas","editor_review":{...},"dining":{...},"activities":{...},"cabins":{...}}
{"cruiseline":"royal-caribbean","ship":"harmony-of-the-seas","editor_review":{...},"dining":{...},"activities":{...},"cabins":{...}}
{"cruiseline":"carnival","ship":"carnival-vista","editor_review":{...},"dining":{...},"activities":{...},"cabins":{...}}
// CSV output - flattened spreadsheet format
cruiseline,ship,editor_review,dining_review,activities_review,cabins_review
"royal-caribbean","symphony-of-the-seas","Symphony of the Seas is the world's largest cruise ship...","With over 20 dining venues...","The Ultimate Abyss slide...","Staterooms range from cozy interior cabins..."
"royal-caribbean","harmony-of-the-seas","Harmony of the Seas...","...","...","..."
JSONL Benefits:
- Easy to append new data
- Streaming-friendly for large datasets
- Simple to parse line-by-line
- Compact and efficient
CSV Benefits:
- Opens directly in Excel/Google Sheets
- Compatible with data analysis tools
- Universal format support
- Easy to share with non-technical users
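Writing either format is simple enough. Here's a rough sketch of an output writer; the CSV flattening assumes the schema shown above and isn't the repo's exact column layout:
const fs = require('fs');
// Append one ship record to a JSONL file: one JSON object per line.
function writeJsonl(filePath, record) {
  fs.appendFileSync(filePath, JSON.stringify(record) + '\n');
}
// Append one ship record to a CSV file, flattening the nested reviews.
// Quotes are doubled so commas and quotes inside review text don't break the row.
function writeCsv(filePath, record) {
  const esc = (value) => `"${String(value ?? '').replace(/"/g, '""')}"`;
  const row = [
    record.cruiseline,
    record.ship,
    record.editor_review?.review,
    record.dining?.review,
    record.activities?.review,
    record.cabins?.review,
  ].map(esc).join(',');
  fs.appendFileSync(filePath, row + '\n');
}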
4. Concurrent Scraping (The Smart Way)
Here's where it gets interesting. For each ship, the scraper hits 4 URLs simultaneously:
- Base URL (overview/editor review)
- /dining page
- /activities page
- /cabins page
All 4 requests fire at once, then the scraper moves to the next ship. With SCRAPERAPI_MAX_THREADS=3, that's 3 ships × 4 URLs = 12 concurrent requests at peak.
// Simplified version of the actual scraper
async function scrapeShip(cruiseline, ship) {
const baseUrl = `https://site.com/${cruiseline}/${ship}`;
// Fire all 4 requests at once
const [overview, dining, activities, cabins] = await Promise.all([
scrapeSection(baseUrl),
scrapeSection(`${baseUrl}/dining`),
scrapeSection(`${baseUrl}/activities`),
scrapeSection(`${baseUrl}/cabins`)
]);
return {
cruiseline,
ship,
editor_review: overview,
dining,
activities,
cabins,
scraped_at: new Date().toISOString()
};
}
Why this works:
- ScraperAPI handles the proxy rotation automatically
- Each section fails independently (dining might 404, but overview succeeds)
- You can scrape 900 ships in a few hours instead of days
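To cap the ship-level concurrency (the 3 ships in the example above), you can wrap scrapeShip in a limiter. A sketch using the p-limit package; the repo may use a different mechanism:
const pLimit = require('p-limit'); // CommonJS build (p-limit v3)
// At most 3 ships in flight; each ship still fires its 4 section requests
// in parallel, so peak concurrency is 3 x 4 = 12 requests.
const limit = pLimit(Number(process.env.SCRAPERAPI_MAX_THREADS) || 3);
async function scrapeFleet(shipList) {
  const results = await Promise.all(
    shipList.map(({ cruiseline, ship }) =>
      limit(() =>
        scrapeShip(cruiseline, ship).catch((error) => {
          console.error(`Failed ${cruiseline}/${ship}:`, error.message);
          return null; // keep going; failures get retried or resumed later
        })
      )
    )
  );
  return results.filter(Boolean);
}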
5. Smart Caching System
Avoid re-scraping the same data:
const fs = require('fs').promises;
const crypto = require('crypto');
class ScraperCache {
constructor(cacheDir = '.cache') {
this.cacheDir = cacheDir;
}
getCacheKey(url) {
return crypto.createHash('md5').update(url).digest('hex');
}
async get(url) {
const key = this.getCacheKey(url);
const cachePath = `${this.cacheDir}/${key}.json`;
try {
const data = await fs.readFile(cachePath, 'utf-8');
const cached = JSON.parse(data);
// Cache expires after 30 days
if (Date.now() - cached.timestamp < 30 * 24 * 60 * 60 * 1000) {
console.log(`Cache hit: ${url}`);
return cached.data;
}
} catch (error) {
// Cache miss
}
return null;
}
async set(url, data) {
const key = this.getCacheKey(url);
const cachePath = `${this.cacheDir}/${key}.json`;
await fs.mkdir(this.cacheDir, { recursive: true });
await fs.writeFile(cachePath, JSON.stringify({
url,
timestamp: Date.now(),
data
}));
}
}
// Usage
const cache = new ScraperCache();
async function extractWithCache(url) {
const cached = await cache.get(url);
if (cached) return cached;
const data = await extractShipData(url);
await cache.set(url, data);
return data;
}
6. Resume Capability
Interrupted scraping? Pick up right where you left off:
const fs = require('fs').promises;
class ScrapingSession {
constructor(sessionFile = '.session.json') {
this.sessionFile = sessionFile;
}
async loadSession() {
try {
const data = await fs.readFile(this.sessionFile, 'utf-8');
return JSON.parse(data);
} catch {
return { completed: [], failed: [] };
}
}
async saveProgress(completed, failed) {
await fs.writeFile(this.sessionFile, JSON.stringify({
completed,
failed,
lastUpdated: new Date().toISOString()
}, null, 2));
}
async scrapeWithResume(urls) {
const session = await this.loadSession();
const remaining = urls.filter(url =>
!session.completed.includes(url)
);
console.log(`Resuming: ${remaining.length} ships remaining`);
for (const url of remaining) {
try {
const data = await extractShipData(url);
session.completed.push(url);
await this.saveProgress(session.completed, session.failed);
} catch (error) {
session.failed.push(url);
await this.saveProgress(session.completed, session.failed);
}
}
}
}
Technical Implementation
Architecture
┌──────────────────┐
│ Ship URL List │
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Resume Check │ ← Load previous session
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Cache Layer │ ← Check if already scraped
└────────┬─────────┘
│
▼
┌──────────────────┐
│ ScraperAPI │ ← Fetch ship data
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Data Parser │ ← Extract structured data
└────────┬─────────┘
│
▼
┌──────────────────┐
│ Output Writer │ ← Export to JSONL/CSV
└──────────────────┘
Error Handling
Production-ready error handling with retries:
async function extractWithRetry(url, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
const data = await fetchShipData(url);
return parseShipData(data);
} catch (error) {
console.error(`Attempt ${attempt} failed for ${url}:`, error.message);
if (attempt === maxRetries) {
throw new Error(`Failed after ${maxRetries} attempts: ${url}`);
}
// Exponential backoff
const backoffMs = Math.pow(2, attempt) * 1000;
console.log(`Retrying in ${backoffMs}ms...`);
await new Promise(resolve => setTimeout(resolve, backoffMs));
}
}
}
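Putting it all together, the pipeline from the diagram wires up roughly like this. A simplified sketch that reuses the helpers from the earlier snippets (including the writeJsonl sketch above), not the repo's actual entry point:
// End-to-end flow per URL: resume check -> cache -> ScraperAPI + parse -> output.
async function runPipeline(urls) {
  const session = new ScrapingSession();
  const cache = new ScraperCache();
  const progress = await session.loadSession();
  for (const url of urls) {
    if (progress.completed.includes(url)) continue;   // already done in a previous run
    try {
      let record = await cache.get(url);              // cache layer
      if (!record) {
        record = await extractWithRetry(url);         // ScraperAPI fetch + parse, with retries
        await cache.set(url, record);
      }
      writeJsonl('editor-reviews.jsonl', record);     // output writer
      progress.completed.push(url);
    } catch (error) {
      progress.failed.push(url);
    }
    await session.saveProgress(progress.completed, progress.failed);
  }
}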
Getting Started
You'll need Node.js (version 16 or higher) and a ScraperAPI account. They have a free tier that's perfect for testing, so grab an API key from their site if you don't have one yet.
Clone the repo and install dependencies:
git clone https://github.com/digitalevenings/cruise-reviews-extractor.git
cd cruise-reviews-extractor
npm install
Create a .env file with your API key:
SCRAPER_API_KEY=your_api_key_here
OUTPUT_FORMAT=jsonl # or 'csv'
CONCURRENCY=5
CACHE_ENABLED=true
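Those settings get picked up at startup with the usual dotenv pattern. A sketch, assuming the variable names from the .env example above:
require('dotenv').config(); // load .env into process.env
const config = {
  apiKey: process.env.SCRAPER_API_KEY,
  outputFormat: process.env.OUTPUT_FORMAT || 'jsonl',
  concurrency: Number(process.env.CONCURRENCY) || 5,
  cacheEnabled: process.env.CACHE_ENABLED !== 'false',
};
if (!config.apiKey) {
  throw new Error('SCRAPER_API_KEY is missing - add it to your .env file');
}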
Then just run it:
# Basic usage
npm start
# Want CSV instead?
npm start -- --format csv --output ships.csv
# Something crashed? Resume where you left off
npm start -- --resume
# Cache getting too big? Clear it out
npm run cache:clear
What Can You Do With This Data?
1. Compare Dining Across Cruise Lines
Ever wonder which cruise line has the best restaurants? Now you can actually analyze it:
const fs = require('fs');
const ships = fs.readFileSync('editor-reviews.jsonl', 'utf-8')
  .split('\n')
  .filter(Boolean)
  .map(line => JSON.parse(line));
// Find ships with exceptional dining reviews
const topDining = ships
.filter(s => s.dining?.review?.includes('exceptional'))
.map(s => `${s.cruiseline}/${s.ship}`);
console.log('Ships with exceptional dining:', topDining);
2. Analyze Common Pros and Cons
What do editors consistently praise or criticize?
const allPros = ships
.flatMap(s => s.editor_review?.pros || [])
.reduce((acc, pro) => {
acc[pro] = (acc[pro] || 0) + 1;
return acc;
}, {});
// Most mentioned pros
console.log(Object.entries(allPros)
.sort((a, b) => b[1] - a[1])
.slice(0, 10));
3. Build a Comparison Tool
Export specific ships to compare side-by-side:
// Compare Royal Caribbean's biggest ships
const rcShips = ships.filter(s =>
s.cruiseline === 'royal-caribbean' &&
s.ship.includes('seas')
);
// Export for analysis
fs.writeFileSync('royal-caribbean-comparison.json',
JSON.stringify(rcShips, null, 2)
);
Best Practices
Respect ScraperAPI Limits
Monitor your API usage:
// ScraperAPI provides usage stats in response headers
const response = await scraperClient.get(url);
console.log('API Credits Remaining:', response.headers['x-credits-remaining']);
Clean Data Storage
Organize output files by scraping session:
const timestamp = new Date().toISOString().split('T')[0];
const outputFile = `data/ships_${timestamp}.jsonl`;
Monitor Progress
Log scraping statistics:
console.log(`
Scraping Complete:
Total Ships: ${total}
Successful: ${successful}
Failed: ${failed}
Cached: ${cached}
Duration: ${duration}s
`);
Quick Note on Ethics
Look, I built this for educational purposes and personal research. If you use it, just don't be that person who hammers a website with 1000 requests per second and gets everyone's IP banned.
Some common sense rules:
- Check robots.txt before scraping
- Add delays between requests (seriously, don't be greedy)
- If there's an official API, use that instead
- Don't scrape data you shouldn't have access to
- Be respectful of the servers you're hitting
The nice thing about using ScraperAPI is they handle a lot of this automatically—rotating IPs, throttling requests, respecting rate limits. But still, use your brain and don't abuse it.
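If you want an explicit pause between ships on top of that, it's a couple of lines. A sketch:
// Simple politeness delay between ships (in addition to whatever ScraperAPI does)
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
async function scrapePolitely(shipList) {
  for (const { cruiseline, ship } of shipList) {
    await scrapeShip(cruiseline, ship);
    await sleep(1000 + Math.random() * 1000); // 1-2 seconds of jitter between ships
  }
}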
Performance Tips
Optimize Concurrency
Balance speed and reliability:
// Too high: May trigger rate limits or blocks
const concurrency = 20; // ❌
// Optimal: Fast but respectful
const concurrency = 5; // ✅
// Conservative: For sensitive targets
const concurrency = 2; // ✅
Use Selective Scraping
Only scrape what you need:
// Instead of scraping all ships
const allShips = await scrapeAllShips(); // ❌
// Filter first, then scrape
const recentShips = shipList.filter(s => s.yearBuilt >= 2020);
const data = await scrapeShips(recentShips); // ✅
Common Issues (And How to Fix Them)
Getting 429 errors from ScraperAPI?
You're probably hitting rate limits. Dial back the concurrency:
const limit = pLimit(2); // Try 2 instead of 5
Scraper keeps crashing with parsing errors?
The HTML structure probably changed. Add some validation:
if (!html.includes('ship-info')) {
throw new Error('Ship info section not found');
}
Cache folder taking up too much space?
Yeah, that happens. Just clear it:
npm run cache:clear
Want to Contribute?
Found a bug? Have an idea for a cool feature? PRs are totally welcome. Whether it's fixing typos in the docs or adding a whole new feature, I appreciate any help.
The repo is on GitHub—feel free to open an issue or submit a pull request.
What's Next?
Right now, this scraper gets editor reviews only. That's perfect for getting a professional perspective, but the next big feature is user reviews.
Imagine having thousands of actual passenger reviews for each ship—ratings, complaints, highlights—all structured and searchable. That's coming soon. The challenge is handling pagination (some popular ships have 50+ pages of reviews) and dealing with varied formats.
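For what it's worth, the pagination loop will probably look something like this. Purely a hypothetical sketch; fetchReviewPage and the URL shape don't exist yet:
// Hypothetical: walk a ship's user-review pages until one comes back empty.
async function scrapeUserReviews(shipUrl) {
  const reviews = [];
  for (let page = 1; page <= 100; page++) {                                 // hard cap as a safety net
    const batch = await fetchReviewPage(`${shipUrl}/reviews?page=${page}`); // not implemented yet
    if (!batch || batch.length === 0) break;                                // no more pages
    reviews.push(...batch);
  }
  return reviews;
}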
Some other ideas I'm considering:
- Sentiment analysis on the reviews (positive/negative/neutral scoring)
- A simple web UI to search and compare ships
- Integration with Airtable or Google Sheets
- Automated weekly scraping to catch new ships
The scraper's been running solid for months now. ScraperAPI handles all the proxy/CAPTCHA nonsense, and the caching system means you can stop and resume anytime without wasting API calls.
If you use this or have ideas for improvements, open an issue on GitHub. Always happy to chat about scraping challenges.
Happy scraping! 🚢
Repository: github.com/digitalevenings/cruise-reviews-extractor