Digital Evenings
12 min read

Extracting 1,155+ Cruise Ships: A Complete Data Scraper

A production-ready Node.js scraper that extracts specifications, amenities, images, and descriptions for 1,155+ cruise ships from Ody's booking platform, using Puppeteer and ScraperAPI.

Web Scraping · Node.js · Puppeteer · ScraperAPI · Data Extraction · Travel · XOR Encryption

I was planning a cruise vacation and realized something frustrating: there's no single place to compare detailed specs across all cruise ships. Want to know which ships have specific amenities? Which ones are the biggest? Good luck clicking through hundreds of individual ship pages.

So I built a comprehensive scraper that extracts everything about 1,155+ cruise ships across all major cruise lines—specifications, amenities, images, descriptions—and saves it all to structured JSONL files.

GitHub Repository: cruise-ship-data-extractor

The Challenge: XOR-Encrypted API Responses

Here's where it gets interesting. The website doesn't just serve HTML—it exposes an internal API that returns XOR-encrypted responses. This is actually a good thing for scraping because:

  1. API responses are more reliable than parsing HTML
  2. Data is already structured (JSON format)
  3. No need to deal with complex DOM traversal
  4. Faster data extraction

The tricky part? Every API response is encrypted using XOR obfuscation. But once you crack the pattern, it's straightforward to decrypt automatically.

What Data Gets Extracted

For each of the ~1,155 cruise ships, the scraper pulls:

Ship Specifications

  • Name, cruise line, and ship ID
  • Year built and last refurbished
  • Gross tonnage and length
  • Passenger capacity and crew size
  • Number of decks and cabins
  • Space ratio (gross tons per passenger)

Amenities & Features

  • Dining venues (restaurants, cafes, bars)
  • Entertainment (theaters, casinos, pools)
  • Recreation (gyms, spas, sports facilities)
  • Services (wifi, medical, laundry)
  • Cabin types and configurations

Media Assets

  • High-resolution ship gallery images
  • Deck plans and layouts
  • Amenity photos
  • All images organized by ship ID

Metadata

  • Ship descriptions and highlights
  • Booking URLs and references
  • Last scraped timestamp
  • Data source tracking

Architecture Overview

The scraper uses a multi-stage pipeline:

┌─────────────────────┐
│   Puppeteer Init    │ ← Launch browser, get session cookies
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Fetch Master List  │ ← Get all ship IDs from API
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Concurrent Queue   │ ← Process multiple ships in parallel
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│     ScraperAPI      │ ← Route requests through proxy
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   XOR Decryption    │ ← Decrypt API responses
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Data Processing   │ ← Parse and structure data
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Image Downloads   │ ← Fetch gallery images
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    JSONL Output     │ ← Save to master.jsonl & ships.jsonl
└─────────────────────┘
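
Wiring those stages together, the top level looks roughly like this. A simplified sketch: getSessionCookies and scrapeAllShips are covered in the sections below, and fetchMasterList is a placeholder for whatever call returns the full list of ship IDs.

// Hypothetical top-level orchestration (fetchMasterList is a placeholder;
// the real building blocks are shown in the sections that follow)
async function main() {
  // Stage 1: launch Puppeteer and grab session cookies
  const cookies = await getSessionCookies();

  // Stage 2: fetch the master list of ship IDs from the API
  const masterList = await fetchMasterList(cookies);
  console.log(`Found ${masterList.length} ships`);

  // Stages 3-8: concurrent fetch, decrypt, process, download, save
  await scrapeAllShips(masterList);
}

main().catch(err => {
  console.error('Scrape failed:', err);
  process.exit(1);
});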

Key Features Explained

1. Puppeteer for Session Management

The site requires valid session cookies for API access. Puppeteer handles this automatically:

const puppeteer = require('puppeteer');

async function getSessionCookies() {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });

  const page = await browser.newPage();

  // Navigate to the site to establish session
  await page.goto('https://www.ody.com/cruise-ships', {
    waitUntil: 'networkidle2'
  });

  // Extract cookies
  const cookies = await page.cookies();
  await browser.close();

  // Convert to header format
  return cookies
    .map(cookie => `${cookie.name}=${cookie.value}`)
    .join('; ');
}

The scraper automatically refreshes cookies when they expire. No manual intervention needed.
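
One simple way to implement that refresh (a sketch of the idea, not the repo's exact code) is to retry once with fresh cookies whenever the API answers 401:

// Sketch of refresh-on-expiry: cachedCookies and the 401 check are
// assumptions about how an expired session shows up
let cachedCookies = null;

async function fetchWithSession(url) {
  if (!cachedCookies) {
    cachedCookies = await getSessionCookies();
  }

  let response = await fetch(url, { headers: { 'Cookie': cachedCookies } });

  // If the session expired, grab fresh cookies once and retry
  if (response.status === 401) {
    cachedCookies = await getSessionCookies();
    response = await fetch(url, { headers: { 'Cookie': cachedCookies } });
  }

  return response;
}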

2. XOR Decryption Implementation

The API returns responses encrypted with a simple XOR cipher. Here's how to decrypt them:

function decryptXOR(encryptedData, key = 'default-key') {
  // Convert base64 encrypted string to buffer
  const buffer = Buffer.from(encryptedData, 'base64');

  // XOR each byte with the key
  const decrypted = buffer.map((byte, i) => {
    return byte ^ key.charCodeAt(i % key.length);
  });

  // Convert back to string and parse JSON
  const jsonString = Buffer.from(decrypted).toString('utf-8');
  return JSON.parse(jsonString);
}

// Usage in scraper
async function fetchShipData(shipId, cookies) {
  const url = `https://api.ody.com/ships/${shipId}`;

  const response = await fetch(url, {
    headers: {
      'Cookie': cookies,
      'Accept': 'application/json'
    }
  });

  const encryptedData = await response.text();
  const decrypted = decryptXOR(encryptedData);

  return decrypted;
}

Once you understand the encryption pattern, it becomes a non-issue. The scraper handles it transparently.
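
How do you crack the pattern in the first place? A handy trick for XOR obfuscation is a known-plaintext guess: every decrypted response should start with JSON, so XORing the first ciphertext bytes against an expected prefix reveals the leading bytes of the key. A rough sketch (the '{"data":' prefix is only an assumption about the response shape):

// Known-plaintext sketch: recover the first bytes of a repeating XOR key
// by assuming the plaintext starts with a JSON prefix like '{"data":'
function guessKeyPrefix(encryptedBase64, knownPrefix = '{"data":') {
  const cipher = Buffer.from(encryptedBase64, 'base64');
  const plain = Buffer.from(knownPrefix, 'utf-8');

  const keyBytes = [];
  for (let i = 0; i < plain.length && i < cipher.length; i++) {
    keyBytes.push(cipher[i] ^ plain[i]);
  }

  // If the key is shorter than the prefix, a repeating pattern shows up here
  return Buffer.from(keyBytes).toString('utf-8');
}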

3. ScraperAPI Integration for Reliability

Raw requests to the API work fine for small batches, but at scale you need:

  • Proxy rotation (avoid IP bans)
  • Automatic retries on failure
  • CAPTCHA handling (if triggered)
  • Request throttling

ScraperAPI handles all of this automatically:

const ScraperAPI = require('scraperapi-sdk');

const client = new ScraperAPI(process.env.SCRAPER_API_KEY);

async function fetchShipWithProxy(shipId) {
  try {
    const response = await client.get(
      `https://api.ody.com/ships/${shipId}`,
      {
        country_code: 'us',
        premium: true, // Use premium proxies for reliability
        session_number: 123 // Maintain session across requests
      }
    );

    return decryptXOR(response);
  } catch (error) {
    console.error(`Failed to fetch ship ${shipId}:`, error.message);
    throw error;
  }
}

Why ScraperAPI?

  • Handles 1,155+ ships without triggering rate limits
  • Automatic IP rotation across thousands of proxies
  • 99%+ success rate with retries
  • Only pay for successful requests
  • Saves weeks of infrastructure work

4. Parallel Processing with Concurrency Control

Processing 1,155 ships sequentially would take forever. The scraper uses controlled parallelism:

const pLimit = require('p-limit');

// Configure concurrency based on API limits
const SCRAPERAPI_MAX_THREADS = parseInt(process.env.SCRAPERAPI_MAX_THREADS, 10) || 5;
const MEDIA_MAX_THREADS = parseInt(process.env.MEDIA_MAX_THREADS, 10) || 10;

async function scrapeAllShips(masterList) {
  const limit = pLimit(SCRAPERAPI_MAX_THREADS);
  const mediaLimit = pLimit(MEDIA_MAX_THREADS);

  const promises = masterList.map(ship =>
    limit(async () => {
      try {
        // Fetch ship details
        const shipData = await fetchShipData(ship.id);

        // Download images in parallel
        const images = await Promise.all(
          shipData.gallery.map(imageUrl =>
            mediaLimit(() => downloadImage(imageUrl, ship.id))
          )
        );

        // Save to JSONL
        await saveShipData(shipData);

        console.log(`✓ Completed: ${ship.name}`);
      } catch (error) {
        console.error(`✗ Failed: ${ship.name} - ${error.message}`);
      }
    })
  );

  await Promise.all(promises);
}

Optimal settings:

  • SCRAPERAPI_MAX_THREADS=5: Ship data requests (API limited)
  • MEDIA_MAX_THREADS=10: Image downloads (bandwidth limited)

With these settings, all 1,155 ships scrape in 2-3 hours.

5. Intelligent Resume Capability

Interruptions happen. The scraper tracks progress and resumes seamlessly:

const fs = require('fs').promises;

class ScrapeSession {
  constructor(sessionFile = '.scrape-session.json') {
    this.sessionFile = sessionFile;
    this.completed = new Set();
    this.failed = new Set();
  }

  async load() {
    try {
      const data = await fs.readFile(this.sessionFile, 'utf-8');
      const session = JSON.parse(data);

      this.completed = new Set(session.completed || []);
      this.failed = new Set(session.failed || []);

      console.log(`Loaded session: ${this.completed.size} completed, ${this.failed.size} failed`);
    } catch (error) {
      console.log('Starting fresh session');
    }
  }

  async markCompleted(shipId) {
    this.completed.add(shipId);
    await this.save();
  }

  async markFailed(shipId, error) {
    this.failed.add(shipId);
    await this.save();
  }

  isCompleted(shipId) {
    return this.completed.has(shipId);
  }

  async save() {
    await fs.writeFile(this.sessionFile, JSON.stringify({
      completed: Array.from(this.completed),
      failed: Array.from(this.failed),
      lastUpdate: new Date().toISOString()
    }, null, 2));
  }
}

// Usage
const session = new ScrapeSession();
await session.load();

const remainingShips = masterList.filter(ship =>
  !session.isCompleted(ship.id)
);

console.log(`Resuming: ${remainingShips.length} ships remaining`);

If the scraper crashes at ship 500, just restart it. It skips the 500 ships already recorded in the session file and continues with the rest.

6. Image Download Pipeline

Each ship has 10-50 high-resolution gallery images. The scraper downloads and organizes them efficiently:

const axios = require('axios');
const path = require('path');
const fs = require('fs').promises;

async function downloadImage(imageUrl, shipId) {
  const outputDir = path.join('media', String(shipId));

  // Create directory if needed
  await fs.mkdir(outputDir, { recursive: true });

  const filename = path.basename(new URL(imageUrl).pathname);
  const outputPath = path.join(outputDir, filename);

  // Check if already downloaded
  try {
    await fs.access(outputPath);
    console.log(`Skipping (exists): ${filename}`);
    return outputPath;
  } catch {
    // File doesn't exist, download it
  }

  try {
    const response = await axios.get(imageUrl, {
      responseType: 'arraybuffer',
      timeout: 30000,
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
      }
    });

    await fs.writeFile(outputPath, response.data);
    console.log(`Downloaded: ${filename}`);

    return outputPath;
  } catch (error) {
    console.error(`Failed to download ${filename}:`, error.message);
    throw error;
  }
}

Images are organized like this:

media/
├── 12345/           (ship ID)
│   ├── exterior-1.jpg
│   ├── pool-deck.jpg
│   ├── restaurant.jpg
│   └── cabin.jpg
├── 12346/
│   ├── exterior-1.jpg
│   └── ...
└── ...

7. JSONL Output Format

The scraper generates two output files:

master.jsonl - Complete ship catalog with basic info:

{"shipId":12345,"name":"Symphony of the Seas","cruiseLine":"Royal Caribbean","yearBuilt":2018,"passengerCapacity":6680}
{"shipId":12346,"name":"Carnival Vista","cruiseLine":"Carnival","yearBuilt":2016,"passengerCapacity":3934}

ships.jsonl - Detailed ship data (one JSON object per line, shown pretty-printed here for readability):

{
  "shipId": 12345,
  "name": "Symphony of the Seas",
  "cruiseLine": "Royal Caribbean",
  "specifications": {
    "yearBuilt": 2018,
    "lastRefurbished": 2021,
    "grossTonnage": 228081,
    "length": 1188,
    "passengerCapacity": 6680,
    "crewSize": 2200,
    "decks": 18,
    "cabins": 2759,
    "spaceRatio": 34.14
  },
  "amenities": {
    "dining": ["Main Dining Room", "Wonderland", "Chops Grille", "Jamie's Italian"],
    "entertainment": ["AquaTheater", "Ice Rink", "Theater", "Casino"],
    "recreation": ["Rock Climbing Wall", "Zip Line", "Mini Golf", "Surf Simulator"],
    "services": ["Spa", "Fitness Center", "Medical Center", "Laundry"]
  },
  "description": "Symphony of the Seas is the world's largest cruise ship...",
  "gallery": [
    "media/12345/exterior-1.jpg",
    "media/12345/pool-deck.jpg"
  ],
  "scrapedAt": "2025-10-27T10:30:15.123Z"
}

Why JSONL?

  • Stream-friendly (process line-by-line; see the sketch below)
  • Append-safe (add new ships easily)
  • Simple to parse
  • Works great with command-line tools like jq
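
For example, instead of loading the whole file into memory (as the analysis snippets below do for simplicity), you can stream ships.jsonl one line at a time with Node's built-in readline module:

// Stream ships.jsonl line-by-line instead of loading it all at once
const fs = require('fs');
const readline = require('readline');

async function countShipsByLine(file = 'ships.jsonl') {
  const rl = readline.createInterface({
    input: fs.createReadStream(file),
    crlfDelay: Infinity
  });

  const counts = {};
  for await (const line of rl) {
    if (!line.trim()) continue;
    const ship = JSON.parse(line);
    counts[ship.cruiseLine] = (counts[ship.cruiseLine] || 0) + 1;
  }

  return counts;
}

The same file works directly with jq too, e.g. jq -s 'length' ships.jsonl to count ships.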

Getting Started

Prerequisites

  • Node.js 18+ (required for native fetch)
  • ScraperAPI account (API key required)
  • 5GB free disk space (for images)

Installation

git clone https://github.com/digitalevenings/cruise-ship-data-extractor.git
cd cruise-ship-data-extractor
npm install

Configuration

Create a .env file:

# Required: ScraperAPI key
SCRAPER_API_KEY=your_api_key_here

# Optional: Concurrency settings
SCRAPERAPI_MAX_THREADS=5      # API requests (1-10)
MEDIA_MAX_THREADS=10           # Image downloads (5-20)

# Optional: Output settings
OUTPUT_DIR=./output
MEDIA_DIR=./media

Basic Usage

# Run the full scraper
npm start

# Resume interrupted scraping
npm start -- --resume

# Scrape specific cruise lines only
npm start -- --lines "Royal Caribbean,Carnival"

# Skip image downloads (data only)
npm start -- --no-images

# Clear session and start fresh
npm run clean

Analyzing The Data

Once you have the data, the possibilities are endless. Here are some practical examples:

1. Find the Largest Ships

const fs = require('fs');

// Read all ships
const ships = fs.readFileSync('ships.jsonl', 'utf-8')
  .split('\n')
  .filter(Boolean)
  .map(JSON.parse);

// Sort by passenger capacity
const largest = ships
  .sort((a, b) =>
    b.specifications.passengerCapacity - a.specifications.passengerCapacity
  )
  .slice(0, 10)
  .map(s => ({
    name: s.name,
    cruiseLine: s.cruiseLine,
    capacity: s.specifications.passengerCapacity,
    grossTonnage: s.specifications.grossTonnage
  }));

console.table(largest);

2. Compare Space Ratios

Space ratio = gross tonnage ÷ passenger capacity. Higher is more spacious.

const spacious = ships
  .filter(s => s.specifications.spaceRatio > 40)
  .map(s => ({
    name: s.name,
    cruiseLine: s.cruiseLine,
    spaceRatio: s.specifications.spaceRatio.toFixed(2)
  }))
  .sort((a, b) => b.spaceRatio - a.spaceRatio);

console.log('Most spacious cruise ships:');
console.table(spacious);

3. Analyze Amenities

Which amenities are most common?

const amenityCounts = {};

ships.forEach(ship => {
  const allAmenities = [
    ...ship.amenities.dining,
    ...ship.amenities.entertainment,
    ...ship.amenities.recreation,
    ...ship.amenities.services
  ];

  allAmenities.forEach(amenity => {
    amenityCounts[amenity] = (amenityCounts[amenity] || 0) + 1;
  });
});

const top20 = Object.entries(amenityCounts)
  .sort((a, b) => b[1] - a[1])
  .slice(0, 20);

console.log('Top 20 most common amenities:');
top20.forEach(([amenity, count]) => {
  console.log(`${amenity}: ${count} ships`);
});

4. Find Ships by Feature

Looking for ships with specific amenities?

function findShipsWith(amenity) {
  return ships.filter(ship => {
    const allAmenities = [
      ...ship.amenities.dining,
      ...ship.amenities.entertainment,
      ...ship.amenities.recreation,
      ...ship.amenities.services
    ].map(a => a.toLowerCase());

    return allAmenities.some(a =>
      a.includes(amenity.toLowerCase())
    );
  }).map(s => s.name);
}

// Find ships with ice rinks
const iceRinkShips = findShipsWith('ice rink');
console.log(`Ships with ice rinks: ${iceRinkShips.length}`);
console.log(iceRinkShips);

// Find ships with surf simulators
const surfShips = findShipsWith('surf simulator');
console.log(`\nShips with surf simulators: ${surfShips.length}`);
console.log(surfShips);

5. Export to CSV for Excel

Convert JSONL to CSV for spreadsheet analysis:

const { Parser } = require('json2csv');

// Select fields to export
const fields = [
  'name',
  'cruiseLine',
  'specifications.yearBuilt',
  'specifications.passengerCapacity',
  'specifications.grossTonnage',
  'specifications.spaceRatio'
];

const parser = new Parser({ fields });
const csv = parser.parse(ships);

fs.writeFileSync('cruise-ships.csv', csv);
console.log('Exported to cruise-ships.csv');

Error Handling & Monitoring

The scraper includes comprehensive error handling:

// Retry with exponential backoff
async function fetchWithRetry(shipId, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await fetchShipData(shipId);
    } catch (error) {
      console.error(`Attempt ${attempt}/${maxAttempts} failed:`, error.message);

      if (attempt === maxAttempts) {
        throw new Error(`Failed after ${maxAttempts} attempts`);
      }

      // Exponential backoff: 2s, then 4s (doubles after each failed attempt)
      const backoffMs = Math.pow(2, attempt) * 1000;
      console.log(`Retrying in ${backoffMs/1000}s...`);
      await new Promise(resolve => setTimeout(resolve, backoffMs));
    }
  }
}

// Progress tracking
class ProgressTracker {
  constructor(total) {
    this.total = total;
    this.completed = 0;
    this.failed = 0;
    this.startTime = Date.now();
  }

  increment(success = true) {
    if (success) {
      this.completed++;
    } else {
      this.failed++;
    }

    const processed = this.completed + this.failed;
    const percent = ((processed / this.total) * 100).toFixed(1);
    const elapsed = ((Date.now() - this.startTime) / 1000).toFixed(0);
    const rate = (processed / (elapsed / 60)).toFixed(1);

    console.log(
      `[${percent}%] ${processed}/${this.total} | ` +
      `✓ ${this.completed} ✗ ${this.failed} | ` +
      `${rate} ships/min`
    );
  }

  summary() {
    const totalTime = ((Date.now() - this.startTime) / 1000).toFixed(0);

    console.log(`\n=== Scraping Complete ===`);
    console.log(`Total Ships: ${this.total}`);
    console.log(`Successful: ${this.completed}`);
    console.log(`Failed: ${this.failed}`);
    console.log(`Success Rate: ${((this.completed / this.total) * 100).toFixed(1)}%`);
    console.log(`Total Time: ${totalTime}s`);
  }
}

Performance Optimization Tips

1. Tune Concurrency Based on Your API Tier

# ScraperAPI Free Tier (5,000 credits/month)
SCRAPERAPI_MAX_THREADS=2

# ScraperAPI Hobby Tier (100,000 credits/month)
SCRAPERAPI_MAX_THREADS=5

# ScraperAPI Professional Tier (1M+ credits/month)
SCRAPERAPI_MAX_THREADS=10

2. Skip Images for Quick Data Extraction

Images account for 90% of scraping time:

# Data only (completes in 30 minutes)
npm start -- --no-images

# Full scrape with images (completes in 2-3 hours)
npm start

3. Scrape Specific Cruise Lines

Don't need all 1,155 ships? Filter by cruise line:

# Royal Caribbean only (~28 ships)
npm start -- --lines "Royal Caribbean"

# Multiple lines
npm start -- --lines "Royal Caribbean,Carnival,Norwegian"

4. Use Caching for Development

Enable caching to avoid re-scraping during development:

# In .env
ENABLE_CACHE=true
CACHE_TTL=86400  # 24 hours

Common Issues & Solutions

Issue: ScraperAPI 429 Errors

Symptom: Error: Too Many Requests (429)

Solution: Reduce concurrency

SCRAPERAPI_MAX_THREADS=2

Issue: Session Cookies Expired

Symptom: Error: Unauthorized (401)

Solution: Cookies are auto-refreshed, but you can force refresh:

npm run refresh-cookies

Issue: Image Downloads Timing Out

Symptom: Many image download failures

Solution: Reduce media concurrency or increase timeout:

# In .env
MEDIA_MAX_THREADS=5
IMAGE_TIMEOUT=60000  # 60 seconds

Issue: Disk Space Full

Symptom: ENOSPC: no space left on device

Solution: Skip images or increase disk space:

# Skip images
npm start -- --no-images

# Or clean old scrapes
npm run clean-media

Cost Analysis

Using ScraperAPI, here's the typical cost breakdown:

Per Ship:

  • 1 API request for ship data: 1 credit
  • ~20 images (direct download, no credits)
  • Total: ~1 credit per ship

For All 1,155 Ships:

  • Ship data: 1,155 credits
  • Images: 0 credits (direct download)
  • Total: ~1,155 credits

Pricing:

  • Free tier: 5,000 credits/month (covers full scrape)
  • Hobby tier: $49/month for 100,000 credits (86+ full scrapes)
  • Professional: $249/month for 1M credits (865+ full scrapes)

For a one-time scrape, the free tier is sufficient. For automated monthly updates, the hobby tier is perfect.

Ethical Considerations

This scraper is built for educational purposes and personal research. Please use responsibly:

Best Practices

Do:

  • Use reasonable concurrency limits
  • Respect robots.txt (though APIs aren't usually listed)
  • Add delays between requests when not using ScraperAPI
  • Only scrape publicly available data
  • Cache results to minimize redundant requests

Don't:

  • Hammer servers with hundreds of simultaneous requests
  • Scrape authenticated/paid content
  • Resell or republish scraped data
  • Use data to harm the business
  • Bypass technical protections aggressively

Legal Disclaimer

Check terms of service before scraping. This tool is provided as-is for educational purposes. Users are responsible for compliance with applicable laws and terms of service.

What's Next

This scraper is complete and production-ready, but here are some ideas for extensions:

Planned Features

  • Pricing data: Scrape cruise pricing and availability
  • Real-time updates: Automated daily scraper to catch new ships
  • User reviews: Integrate passenger reviews and ratings
  • Advanced search: Build a web UI to search and filter ships
  • API wrapper: Expose scraped data via REST API

Integration Ideas

  • Airtable: Auto-sync data to Airtable base
  • Google Sheets: Export directly to spreadsheets
  • MongoDB: Store data in MongoDB for querying (see the sketch after this list)
  • Elasticsearch: Build a full-text search engine
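
To give a flavor of the MongoDB idea, bulk-loading the JSONL output takes only a few lines with the official driver. A sketch; the connection string and database name are placeholders:

// Sketch: bulk-load ships.jsonl into MongoDB (URI and db name are placeholders)
const fs = require('fs');
const { MongoClient } = require('mongodb');

async function importToMongo() {
  const ships = fs.readFileSync('ships.jsonl', 'utf-8')
    .split('\n')
    .filter(Boolean)
    .map(line => JSON.parse(line));

  const client = new MongoClient('mongodb://localhost:27017');
  await client.connect();

  const collection = client.db('cruises').collection('ships');
  await collection.insertMany(ships);

  console.log(`Imported ${ships.length} ships`);
  await client.close();
}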

Real-World Applications

1. Cruise Comparison Website

Build a site that lets users compare ships side-by-side:

// Backend API endpoint (Express); loadShipById is a helper that
// looks a ship up in ships.jsonl by its shipId
const express = require('express');
const app = express();

app.get('/api/compare', (req, res) => {
  // shipIds arrives as a comma-separated query string, e.g. ?shipIds=12345,12346
  const shipIds = (req.query.shipIds || '').split(',');

  const ships = shipIds.map(id => loadShipById(id));

  res.json({ ships });
});

2. Travel Agency Tool

Help travel agents find perfect ships for clients:

function findShipsForFamily(requirements) {
  return ships.filter(ship => {
    const { passengerCapacity, yearBuilt } = ship.specifications;
    const hasKidsClub = ship.amenities.services.some(s =>
      s.toLowerCase().includes('kids club')
    );

    return passengerCapacity >= requirements.minCapacity &&
           yearBuilt >= requirements.minYear &&
           hasKidsClub;
  });
}

const familyShips = findShipsForFamily({
  minCapacity: 2000,
  minYear: 2015
});

3. Market Research

Analyze cruise industry trends:

// Ships by year built
const shipsByYear = {};
ships.forEach(ship => {
  const year = ship.specifications.yearBuilt;
  shipsByYear[year] = (shipsByYear[year] || 0) + 1;
});

// Average capacity by cruise line
const avgCapacityByCruiseLine = {};
ships.forEach(ship => {
  const line = ship.cruiseLine;
  if (!avgCapacityByCruiseLine[line]) {
    avgCapacityByCruiseLine[line] = [];
  }
  avgCapacityByCruiseLine[line].push(ship.specifications.passengerCapacity);
});

Object.keys(avgCapacityByCruiseLine).forEach(line => {
  const capacities = avgCapacityByCruiseLine[line];
  const avg = capacities.reduce((a, b) => a + b) / capacities.length;
  console.log(`${line}: ${avg.toFixed(0)} avg passengers`);
});

Contributing

Found a bug? Want to add a feature? Contributions welcome!

How to contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Areas for contribution:

  • Additional output formats (XML, Parquet, etc.)
  • Integration with databases (PostgreSQL, MySQL)
  • Web UI for browsing scraped data
  • Docker containerization
  • CI/CD pipeline
  • Unit tests and integration tests

Final Thoughts

This scraper solves a real problem: getting comprehensive, structured data about all cruise ships in one place. Whether you're planning a vacation, building a comparison tool, or conducting market research, having all this data at your fingertips is incredibly valuable.

The combination of Puppeteer (for session management), ScraperAPI (for reliability), and smart concurrency controls makes this scraper production-ready and maintainable. It's been running reliably for months with minimal intervention.

Key takeaways:

  • XOR encryption isn't a dealbreaker—reverse engineer it once, automate forever
  • ScraperAPI is worth every penny for large-scale scraping
  • Parallel processing + resume capability = bulletproof scraping
  • JSONL format is perfect for incremental data collection

The full scraper runs in 2-3 hours, costs ~1,155 ScraperAPI credits (free tier covers it), and outputs perfectly structured data ready for analysis.

Happy scraping! 🚢

Repository: github.com/digitalevenings/cruise-ship-data-extractor