Extracting 21,000+ Travel Advisor Profiles: A High-Performance Scraper
A production-ready Node.js scraper that extracts comprehensive profiles for 21,000+ travel advisors from a travel advisory platform, using proxy rotation, caching, and concurrent processing with streaming NDJSON output.
Digital Evenings
I was researching the travel industry and needed to understand the landscape of travel advisors across the US. Who are they? Where are they located? What companies do they work for? The problem? This data is scattered across thousands of profile pages with no bulk export option.
So I built a high-performance scraper that extracts everything about 21,000+ travel advisors—names, locations, companies, contact info, and more—all saved to streaming NDJSON files.
GitHub Repository: travel-advisors-data-extractor
The Challenge: Large-Scale Data Extraction
The platform lists over 21,000 travel advisor profiles. Getting this data manually would take months. The challenge was building a scraper that could:
- Handle massive scale (21,000+ profiles)
- Avoid rate limiting and IP blocks
- Resume after failures (network issues happen)
- Cache intelligently (don't re-scrape unchanged data)
- Process concurrently (but not overwhelm the server)
This required a two-phase approach: first collect all agent IDs, then fetch detailed profiles for each one.
What Data Gets Extracted
For each of the ~21,000 travel advisors, the scraper pulls:
Basic Information
- Agent ID (unique identifier)
- First name and last name
- Job title and professional designation
- Company/agency affiliation
Location Data
- City and state
- Address information (when available)
- Regional market information
Contact Details
- Phone numbers
- Email addresses (when listed)
- Website URLs
Professional Information
- Years of experience
- Specializations and expertise
- Certifications and awards
- Bio and profile description
Metadata
- Profile URL
- Last scraped timestamp
- Data source tracking
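Put together, a single NDJSON line for one advisor might look roughly like this. The field names mirror the CSV export shown later in the post, but both the names and the values here are illustrative, not the exact schema:

{"id": 12345, "firstName": "Jane", "lastName": "Doe", "company": "Example Travel Co.", "city": "Denver", "state": "CO", "phone": "555-0100", "email": "jane@example.com", "profileUrl": "https://example.com/agents/12345", "scrapedAt": "2024-01-15T12:34:56Z"}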
Architecture Overview
The scraper uses a two-phase pipeline with intelligent caching:
┌─────────────────────┐
│    Phase 1: IDs     │ ← Collect all agent IDs (paginated)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│     Cache Check     │ ← Skip already-scraped profiles
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Phase 2: Profiles  │ ← Batch process with concurrency
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Proxy Rotation    │ ← Round-robin through proxy pool
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Data Extraction   │ ← Parse HTML and structure data
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    NDJSON Output    │ ← Stream to output files
└─────────────────────┘
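Tied together, the top-level run looks roughly like this. It is a simplified sketch built from the functions shown later in the post; the actual entry point in the repo may wire things differently:

async function run() {
  // Phase 1: collect every agent ID from the paginated listing
  const agentIds = await collectAgentIds();

  // Prepare streaming output
  const writer = new NDJSONWriter();
  await writer.initialize('agents.ndjson');

  // Phase 2: fetch profiles in batches (cache and proxies handled inside)
  const profiles = await extractProfiles(agentIds);
  profiles.forEach(profile => writer.write(profile));

  writer.close();
}

run().catch(err => {
  console.error('Scrape failed:', err);
  process.exit(1);
});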
Key Features Explained
1. Two-Phase Extraction Strategy
Phase 1: Collect Agent IDs
The first phase hits the paginated listing API to collect all agent IDs:
const axios = require('axios');

async function collectAgentIds() {
  const agentIds = [];
  let page = 1;
  let hasMore = true;

  while (hasMore) {
    try {
      const response = await axios.get(
        `https://example.com/api/agents`,
        {
          params: {
            page,
            pageSize: 500 // Max results per page
          },
          headers: {
            'User-Agent': getRandomUserAgent()
          }
        }
      );

      const agents = response.data.results;
      agentIds.push(...agents.map(a => a.id));
      console.log(`Page ${page}: Found ${agents.length} agents`);

      hasMore = agents.length === 500;
      page++;

      // Polite delay between pagination requests
      await delay(500);
    } catch (error) {
      console.error(`Failed to fetch page ${page}:`, error.message);
      break;
    }
  }

  console.log(`Total agents collected: ${agentIds.length}`);
  return agentIds;
}
This phase typically completes in 5-10 minutes and returns ~21,000 agent IDs.
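The delay() helper used here (and in the rest of the snippets) isn't part of any library; a one-liner like this is assumed:

// Sleep for the given number of milliseconds
function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}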
Phase 2: Extract Detailed Profiles
The second phase processes agent IDs in batches with concurrency control:
const pLimit = require('p-limit');

async function extractProfiles(agentIds) {
  const limit = pLimit(parseInt(process.env.BATCH_SIZE, 10) || 20);
  const results = [];

  const promises = agentIds.map(id =>
    limit(async () => {
      try {
        const profile = await fetchAgentProfile(id);
        results.push(profile);
        console.log(`✓ Extracted: ${profile.firstName} ${profile.lastName}`);
      } catch (error) {
        console.error(`✗ Failed: ${id} - ${error.message}`);
      }
    })
  );

  await Promise.all(promises);
  return results;
}
2. Proxy Rotation with Webshare.io
To handle 21,000 requests without getting blocked, proxy rotation is essential:
const { HttpsProxyAgent } = require('https-proxy-agent');

class ProxyRotator {
  constructor(proxyList) {
    this.proxies = proxyList;
    this.currentIndex = 0;
  }

  getNextProxy() {
    const proxy = this.proxies[this.currentIndex];
    this.currentIndex = (this.currentIndex + 1) % this.proxies.length;
    return proxy;
  }

  createAgent() {
    const proxy = this.getNextProxy();
    return new HttpsProxyAgent(
      `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`
    );
  }
}

// Load proxies from Webshare.io
const proxies = loadProxiesFromFile('proxies.txt');
const rotator = new ProxyRotator(proxies);

async function fetchWithProxy(url) {
  const agent = rotator.createAgent();

  try {
    const response = await axios.get(url, {
      httpsAgent: agent,
      timeout: 30000,
      headers: {
        'User-Agent': getRandomUserAgent()
      }
    });
    return response.data;
  } catch (error) {
    throw new Error(`Proxy request failed: ${error.message}`);
  }
}
Why Webshare.io?
- Affordable proxy service (~$2.99 for 10 proxies)
- Residential and datacenter proxies available
- Simple integration (username:password format)
- Reliable uptime and rotating IPs
- No complicated setup
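The loadProxiesFromFile() call above isn't shown in the post. A minimal version, assuming the username:password@host:port format used in proxies.txt later on, could look like this:

const fs = require('fs');

// Parse proxies.txt lines of the form username:password@host:port
function loadProxiesFromFile(filePath) {
  return fs.readFileSync(filePath, 'utf-8')
    .split('\n')
    .map(line => line.trim())
    .filter(Boolean)
    .map(line => {
      const [credentials, endpoint] = line.split('@');
      const [username, password] = credentials.split(':');
      const [host, port] = endpoint.split(':');
      return { username, password, host, port };
    });
}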
3. Intelligent Caching System
Avoid re-scraping data with a file-based cache:
const fs = require('fs').promises;
const path = require('path');
const crypto = require('crypto');

class FileCache {
  constructor(cacheDir = '.cache', ttl = 7 * 24 * 60 * 60 * 1000) {
    this.cacheDir = cacheDir;
    this.ttl = ttl; // Default: 7 days
  }

  getCacheKey(identifier) {
    return crypto.createHash('md5').update(identifier).digest('hex');
  }

  async get(agentId) {
    const key = this.getCacheKey(agentId.toString());
    const cachePath = path.join(this.cacheDir, `${key}.json`);

    try {
      const data = await fs.readFile(cachePath, 'utf-8');
      const cached = JSON.parse(data);

      // Check if cache is still valid
      if (Date.now() - cached.timestamp < this.ttl) {
        console.log(`Cache hit: ${agentId}`);
        return cached.data;
      }
      console.log(`Cache expired: ${agentId}`);
    } catch (error) {
      // Cache miss
    }
    return null;
  }

  async set(agentId, data) {
    const key = this.getCacheKey(agentId.toString());
    const cachePath = path.join(this.cacheDir, `${key}.json`);

    await fs.mkdir(this.cacheDir, { recursive: true });
    await fs.writeFile(cachePath, JSON.stringify({
      timestamp: Date.now(),
      data
    }));
  }
}

// Usage in scraper
const cache = new FileCache();

async function fetchAgentProfile(agentId) {
  // Check cache first
  const cached = await cache.get(agentId);
  if (cached) return cached;

  // Fetch from API
  const data = await fetchWithProxy(`https://example.com/agents/${agentId}`);
  const profile = parseProfileData(data);

  // Save to cache
  await cache.set(agentId, profile);
  return profile;
}
Cache benefits:
- Dramatically speeds up re-runs
- Saves bandwidth and proxy credits
- Respects data freshness (7-day TTL)
- Simple file-based storage (no DB needed)
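The parseProfileData() function called in fetchAgentProfile above is where a fetched page becomes a structured record. The selectors depend entirely on the platform's markup, so treat this as an illustrative sketch using cheerio rather than the repo's actual implementation:

const cheerio = require('cheerio');

// Illustrative only: the real selectors depend on the platform's HTML structure
function parseProfileData(html) {
  const $ = cheerio.load(html);

  return {
    firstName: $('.agent-first-name').text().trim(), // hypothetical selector
    lastName: $('.agent-last-name').text().trim(),   // hypothetical selector
    company: $('.agent-company').text().trim(),      // hypothetical selector
    city: $('.agent-city').text().trim(),            // hypothetical selector
    state: $('.agent-state').text().trim(),          // hypothetical selector
    phone: $('a[href^="tel:"]').first().text().trim(),
    scrapedAt: new Date().toISOString()
  };
}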
4. Concurrent Batch Processing
Process multiple agents simultaneously while controlling load:
const BATCH_SIZE = parseInt(process.env.BATCH_SIZE, 10) || 20;
const DELAY_BETWEEN_BATCHES_MS = parseInt(process.env.DELAY_BETWEEN_BATCHES_MS, 10) || 500;

async function processBatch(agentIds) {
  const batches = [];

  // Split into batches
  for (let i = 0; i < agentIds.length; i += BATCH_SIZE) {
    batches.push(agentIds.slice(i, i + BATCH_SIZE));
  }

  console.log(`Processing ${batches.length} batches of ${BATCH_SIZE}`);

  for (const [index, batch] of batches.entries()) {
    console.log(`\nBatch ${index + 1}/${batches.length}`);

    // Process batch concurrently
    await Promise.all(
      batch.map(id => extractAgentProfile(id))
    );

    // Polite delay between batches
    if (index < batches.length - 1) {
      await delay(DELAY_BETWEEN_BATCHES_MS);
    }
  }
}
Optimal settings:
- BATCH_SIZE=20: Balance between speed and server load
- DELAY_BETWEEN_BATCHES_MS=500: Polite pause between batches
- MAX_RETRIES=3: Retry failed requests up to 3 times
With these settings, all 21,000 agents scrape in 3-4 hours.
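That 3-4 hour figure is easy to sanity-check: 21,000 agents at a batch size of 20 means about 1,050 batches, and total time is dominated by per-batch request latency plus the configured pause. The latency value below is an assumption, not a measurement:

// Rough runtime estimate, assuming ~10s for the slowest request in each batch
const totalAgents = 21000;
const batchSize = 20;
const delayMs = 500;
const assumedBatchLatencyMs = 10000;

const batches = Math.ceil(totalAgents / batchSize); // 1,050 batches
const estimatedHours = (batches * (assumedBatchLatencyMs + delayMs)) / 1000 / 3600;
console.log(`~${estimatedHours.toFixed(1)} hours`); // ~3.1 hours with these assumptions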
5. User-Agent Rotation
Avoid bot detection by rotating browser user agents:
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
];

function getRandomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Use in requests
const response = await axios.get(url, {
  headers: {
    'User-Agent': getRandomUserAgent(),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
  }
});
6. Streaming NDJSON Output
Save data as it's extracted using streaming writes:
const fs = require('fs');
const path = require('path');

class NDJSONWriter {
  constructor(outputDir = './output') {
    this.outputDir = outputDir;
    this.stream = null;
  }

  async initialize(filename) {
    await fs.promises.mkdir(this.outputDir, { recursive: true });
    const filepath = path.join(this.outputDir, filename);
    this.stream = fs.createWriteStream(filepath, { flags: 'a' });
    console.log(`Writing to: ${filepath}`);
  }

  write(data) {
    if (!this.stream) {
      throw new Error('Stream not initialized');
    }
    this.stream.write(JSON.stringify(data) + '\n');
  }

  close() {
    if (this.stream) {
      this.stream.end();
    }
  }
}

// Usage
const writer = new NDJSONWriter();
await writer.initialize('agents.ndjson');

// Write as we scrape
for (const agentId of agentIds) {
  const profile = await fetchAgentProfile(agentId);
  writer.write(profile);
}

writer.close();
Why NDJSON?
- Memory efficient (stream-friendly)
- Append-safe (add new records easily)
- Simple to parse line-by-line
- Works great with Unix tools (jq, grep, etc.)
- No need to load the entire dataset into memory (see the streaming example below)
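To back up the memory-efficiency point: rather than reading the whole file at once (as the quick analysis examples later in the post do for convenience), you can stream it line by line with Node's built-in readline module:

const fs = require('fs');
const readline = require('readline');

// Stream the NDJSON file one record at a time, without loading it all into memory
async function countByState(filePath) {
  const counts = {};
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity
  });

  for await (const line of rl) {
    if (!line.trim()) continue;
    const agent = JSON.parse(line);
    const state = agent.state || 'Unknown';
    counts[state] = (counts[state] || 0) + 1;
  }

  return counts;
}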
7. Progress Tracking with CLI Progress Bar
Visual feedback during long-running scrapes:
const cliProgress = require('cli-progress');

class ProgressTracker {
  constructor(total) {
    this.bar = new cliProgress.SingleBar({
      format: 'Progress |{bar}| {percentage}% | {value}/{total} | ETA: {eta}s | {status}',
      barCompleteChar: '\u2588',
      barIncompleteChar: '\u2591',
      hideCursor: true
    });
    this.bar.start(total, 0, {
      status: 'Starting...'
    });
    this.completed = 0;
    this.failed = 0;
  }

  increment(success = true, status = '') {
    if (success) {
      this.completed++;
    } else {
      this.failed++;
    }
    this.bar.update(this.completed + this.failed, {
      status: status || `✓ ${this.completed} | ✗ ${this.failed}`
    });
  }

  stop() {
    this.bar.stop();
    console.log(`\nCompleted: ${this.completed}`);
    console.log(`Failed: ${this.failed}`);
    console.log(`Success Rate: ${((this.completed / (this.completed + this.failed)) * 100).toFixed(1)}%`);
  }
}

// Usage
const progress = new ProgressTracker(agentIds.length);

for (const id of agentIds) {
  try {
    await extractAgentProfile(id);
    progress.increment(true);
  } catch (error) {
    progress.increment(false);
  }
}

progress.stop();
Getting Started
Prerequisites
- Node.js 16+ (required for modern features)
- Webshare.io proxy account (or similar proxy service)
- 1GB free disk space (for cache and output)
Installation
git clone https://github.com/digitalevenings/travel-advisors-data-extractor.git
cd travel-advisors-data-extractor
npm install
Configuration
Create a .env file:
# Scraper settings
BATCH_SIZE=20 # Concurrent requests (10-50)
DELAY_BETWEEN_BATCHES_MS=500 # Delay between batches (ms)
MAX_RETRIES=3 # Retry attempts per request
PAGE_SIZE=500 # Results per API call
# Cache settings
CACHE_TTL=604800 # Cache validity (seconds, 7 days)
# Output settings
OUTPUT_DIR=./output
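One detail worth calling out: CACHE_TTL is given in seconds here, while the FileCache class above takes milliseconds, so the config loader has to convert. A small sketch of how the values might be read, assuming the dotenv package; the config object and its property names are illustrative, not the repo's actual module:

require('dotenv').config();

// Centralize env parsing so defaults and unit conversions live in one place
const config = {
  batchSize: parseInt(process.env.BATCH_SIZE, 10) || 20,
  delayBetweenBatchesMs: parseInt(process.env.DELAY_BETWEEN_BATCHES_MS, 10) || 500,
  maxRetries: parseInt(process.env.MAX_RETRIES, 10) || 3,
  pageSize: parseInt(process.env.PAGE_SIZE, 10) || 500,
  cacheTtlMs: (parseInt(process.env.CACHE_TTL, 10) || 604800) * 1000, // seconds → ms
  outputDir: process.env.OUTPUT_DIR || './output'
};

module.exports = config;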
Create a proxies.txt file with your Webshare.io proxies:
username:password@proxy1.webshare.io:80
username:password@proxy2.webshare.io:80
username:password@proxy3.webshare.io:80
Basic Usage
# Run the full scraper
npm start
# Phase 1 only: Collect agent IDs
npm run collect-ids
# Phase 2 only: Extract profiles
npm run extract-profiles
# Clear cache and start fresh
npm run clean-cache
Analyzing The Data
Once you have the data, here are some practical examples:
1. Find Advisors by Location
const fs = require('fs');

// Read NDJSON file
const agents = fs.readFileSync('output/agents.ndjson', 'utf-8')
  .split('\n')
  .filter(Boolean)
  .map(line => JSON.parse(line));

// Find advisors in California
const california = agents.filter(a =>
  a.state === 'CA' || a.state === 'California'
);

console.log(`Found ${california.length} advisors in California`);
2. Analyze by Company
const companyCounts = {};

agents.forEach(agent => {
  const company = agent.company || 'Independent';
  companyCounts[company] = (companyCounts[company] || 0) + 1;
});

const topCompanies = Object.entries(companyCounts)
  .sort((a, b) => b[1] - a[1])
  .slice(0, 20);

console.log('Top 20 companies by advisor count:');
console.table(topCompanies);
3. Export to CSV for Excel
Convert NDJSON to CSV:
const { Parser } = require('json2csv');

const fields = [
  'id',
  'firstName',
  'lastName',
  'company',
  'city',
  'state',
  'phone',
  'email'
];

const parser = new Parser({ fields });
const csv = parser.parse(agents);

fs.writeFileSync('advisors.csv', csv);
console.log('Exported to advisors.csv');
4. Geographic Distribution
const stateCounts = {};

agents.forEach(agent => {
  const state = agent.state || 'Unknown';
  stateCounts[state] = (stateCounts[state] || 0) + 1;
});

console.log('Advisors by state:');
console.table(
  Object.entries(stateCounts)
    .sort((a, b) => b[1] - a[1])
);
Error Handling & Retry Logic
Production-ready error handling with exponential backoff:
async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fetchWithProxy(url);
    } catch (error) {
      console.error(`Attempt ${attempt}/${maxRetries} failed:`, error.message);

      if (attempt === maxRetries) {
        throw new Error(`Failed after ${maxRetries} attempts`);
      }

      // Exponential backoff: 2s after the first failure, 4s after the second, ...
      const backoffMs = Math.pow(2, attempt) * 1000;
      console.log(`Retrying in ${backoffMs / 1000}s...`);
      await delay(backoffMs);
    }
  }
}
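One refinement worth considering (not part of the repo excerpt): only retry errors that are actually transient, and honor the Retry-After header on 429 responses. This sketch assumes the original axios error, with its response object, is propagated; the fetchWithProxy above rethrows a plain Error, so it would need a small tweak for the status checks to work:

// Sketch: retry only transient failures, respecting Retry-After on 429 responses
async function fetchWithSmartRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fetchWithProxy(url);
    } catch (error) {
      const status = error.response?.status; // requires the axios error to be rethrown as-is
      const retryable = !status || status === 429 || status >= 500;

      if (!retryable || attempt === maxRetries) throw error;

      const retryAfter = Number(error.response?.headers?.['retry-after']);
      const backoffMs = retryAfter > 0
        ? retryAfter * 1000
        : Math.pow(2, attempt) * 1000;

      console.log(`Attempt ${attempt} failed (${status || error.message}), retrying in ${backoffMs / 1000}s`);
      await delay(backoffMs);
    }
  }
}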
Performance Optimization Tips
1. Tune Concurrency Based on Your Setup
// Conservative (fewer proxies, slower connection)
BATCH_SIZE=10
DELAY_BETWEEN_BATCHES_MS=1000
// Balanced (recommended)
BATCH_SIZE=20
DELAY_BETWEEN_BATCHES_MS=500
// Aggressive (many proxies, fast connection)
BATCH_SIZE=50
DELAY_BETWEEN_BATCHES_MS=200
2. Use Cache Effectively
# First run: Scrapes everything (3-4 hours)
npm start
# Second run: Only scrapes new/changed profiles (minutes)
npm start
# Force fresh scrape: Clear cache first
npm run clean-cache && npm start
3. Process Specific Agents
// Collect the full ID list once...
const allAgents = await collectAgentIds();

// ...then filter to the subset you need instead of extracting all 21,000 profiles
const recentAgents = allAgents.filter(id =>
  id > 15000 // e.g. only agents added recently, assuming roughly sequential IDs
);

await extractProfiles(recentAgents); // ✅ far fewer requests
Common Issues & Solutions
Issue: Proxy Connection Failures
Symptom: Error: Proxy connection refused
Solution: Verify proxy credentials and test connectivity
# Test proxy manually
curl -x http://username:password@proxy.webshare.io:80 https://httpbin.org/ip
Issue: Rate Limiting (429 Errors)
Symptom: Error: Too Many Requests (429)
Solution: Increase delay between batches
DELAY_BETWEEN_BATCHES_MS=1000 # Increase to 1 second
BATCH_SIZE=10 # Reduce concurrent requests
Issue: Memory Usage
Symptom: Process crashes with "out of memory"
Solution: Use streaming and process in smaller batches
BATCH_SIZE=10 # Reduce batch size
Issue: Cache Taking Too Much Space
Symptom: .cache directory is gigabytes
Solution: Clear old cache files
npm run clean-cache
Cost Analysis
Proxy Costs (Webshare.io):
- 10 proxies: $2.99/month
- 25 proxies: $6.25/month
- 100 proxies: $22.00/month
For 21,000 agents:
- With 10 proxies + BATCH_SIZE=20: ~4 hours
- With 25 proxies + BATCH_SIZE=50: ~2 hours
- With 100 proxies + BATCH_SIZE=100: ~1 hour
Recommendation: Start with 10 proxies ($2.99/month) for testing, scale up if needed.
Ethical Considerations
This scraper is built for educational purposes and market research. Please use responsibly:
Best Practices
✅ Do:
- Use reasonable concurrency limits
- Add delays between batches
- Respect robots.txt
- Only scrape publicly available data
- Cache results to minimize requests
- Use proxies to distribute load
❌ Don't:
- Hammer servers with excessive requests
- Scrape private/authenticated data
- Resell or misuse personal information
- Bypass technical protections aggressively
- Ignore rate limits
Legal Disclaimer
Check terms of service before scraping. This tool is provided as-is for educational purposes. Users are responsible for compliance with applicable laws and terms of service.
What's Next
This scraper is complete and production-ready, but here are some ideas:
Planned Features
- Automated updates: Daily scraping to catch new advisors
- Advanced filtering: Scrape by specialization, region, or company
- Data enrichment: Cross-reference with LinkedIn/company websites
- Email validation: Verify email addresses are valid
- Database integration: Direct export to PostgreSQL/MongoDB
Integration Ideas
- Airtable: Auto-sync data to Airtable base
- Google Sheets: Export directly to spreadsheets
- CRM systems: Import into Salesforce, HubSpot, etc.
- Data visualization: Build dashboards with geographic heatmaps
Real-World Applications
1. Market Research
Analyze the travel advisor industry:
// Advisors by state
const distribution = analyzeDemographics(agents);
// Top companies by market share
const marketShare = calculateMarketShare(agents);
// Growth trends (compare with historical data)
const growth = compareWithPrevious(agents, historicalData);
2. Lead Generation
Build targeted lists for B2B marketing:
function findTargetedLeads(criteria) {
  return agents.filter(agent => {
    return (
      agent.state === criteria.targetState &&
      agent.company !== criteria.excludeCompany &&
      agent.email // Has contact info
    );
  });
}

const leads = findTargetedLeads({
  targetState: 'CA',
  excludeCompany: 'Large Corp'
});
3. Competitive Intelligence
Understand competitor presence:
function analyzeCompetitor(companyName) {
  const advisors = agents.filter(a =>
    a.company?.toLowerCase().includes(companyName.toLowerCase())
  );

  return {
    total: advisors.length,
    states: [...new Set(advisors.map(a => a.state))],
    locations: advisors.map(a => ({ city: a.city, state: a.state }))
  };
}
Contributing
Found a bug? Want to add a feature? Contributions welcome!
How to contribute:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Areas for contribution:
- Support for additional proxy providers
- Database export modules (PostgreSQL, MongoDB)
- Web UI for browsing scraped data
- Docker containerization
- Unit tests and integration tests
- CLI improvements and better error messages
Final Thoughts
This scraper solves a real problem: getting comprehensive, structured data about travel advisors at scale. Whether you're doing market research, generating leads, or analyzing industry trends, having all this data in one place is incredibly valuable.
The combination of proxy rotation (avoiding blocks), intelligent caching (saving time and bandwidth), and concurrent processing (speed) makes this scraper production-ready and maintainable.
Key takeaways:
- Two-phase approach is essential for large-scale scraping
- Proxy rotation prevents IP blocks (Webshare.io is affordable and reliable)
- Caching dramatically speeds up re-runs
- NDJSON format is perfect for streaming data collection
- Progress tracking keeps you informed during long scrapes
The full scraper runs in 3-4 hours, costs ~$3/month for proxies, and outputs perfectly structured data ready for analysis.
Happy scraping! ✈️
Repository: github.com/digitalevenings/travel-advisors-data-extractor