Extracting 21,000+ Travel Advisor Profiles: A High-Performance Scraper
A production-ready Node.js scraper that extracts comprehensive profiles for 21,000+ travel advisors from a travel advisory platform, using proxy rotation, caching, and concurrent processing with streaming NDJSON output.
Digital Evenings
I was researching the travel industry and needed to understand the landscape of travel advisors across the US. Who are they? Where are they located? What companies do they work for? The problem? This data is scattered across thousands of profile pages with no bulk export option.
So I built a high-performance scraper that extracts everything about 21,000+ travel advisors—names, locations, companies, contact info, and more—all saved to streaming NDJSON files.
GitHub Repository: travel-advisors-data-extractor
The Challenge: Large-Scale Data Extraction
The platform lists over 21,000 travel advisor profiles. Getting this data manually would take months. The challenge was building a scraper that could:
- Handle massive scale (21,000+ profiles)
- Avoid rate limiting and IP blocks
- Resume after failures (network issues happen)
- Cache intelligently (don't re-scrape unchanged data)
- Process concurrently (but not overwhelm the server)
This required a two-phase approach: first collect all agent IDs, then fetch detailed profiles for each one.
What Data Gets Extracted
For each of the ~21,000 travel advisors, the scraper pulls:
Basic Information
- Agent ID (unique identifier)
- First name and last name
- Job title and professional designation
- Company/agency affiliation
Location Data
- City and state
- Address information (when available)
- Regional market information
Contact Details
- Phone numbers
- Email addresses (when listed)
- Website URLs
Professional Information
- Years of experience
- Specializations and expertise
- Certifications and awards
- Bio and profile description
Metadata
- Profile URL
- Last scraped timestamp
- Data source tracking
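Put together, a single NDJSON line for one advisor might look roughly like this. The field names mirror the CSV export shown later in the post, but both the names and the values here are illustrative, not the exact schema:

{"id": 12345, "firstName": "Jane", "lastName": "Doe", "company": "Example Travel Co.", "city": "Denver", "state": "CO", "phone": "555-0100", "email": "jane@example.com", "profileUrl": "https://example.com/agents/12345", "scrapedAt": "2024-01-15T12:34:56Z"}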
Architecture Overview
The scraper uses a two-phase pipeline with intelligent caching:
┌─────────────────────┐
│    Phase 1: IDs     │ ← Collect all agent IDs (paginated)
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│     Cache Check     │ ← Skip already-scraped profiles
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│  Phase 2: Profiles  │ ← Batch process with concurrency
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Proxy Rotation    │ ← Round-robin through proxy pool
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│   Data Extraction   │ ← Parse HTML and structure data
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│    NDJSON Output    │ ← Stream to output files
└─────────────────────┘
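Tied together, the top-level run looks roughly like this. It is a simplified sketch built from the functions shown later in the post; the actual entry point in the repo may wire things differently:

async function run() {
  // Phase 1: collect every agent ID from the paginated listing
  const agentIds = await collectAgentIds();

  // Prepare streaming output
  const writer = new NDJSONWriter();
  await writer.initialize('agents.ndjson');

  // Phase 2: fetch profiles in batches (cache and proxies handled inside)
  const profiles = await extractProfiles(agentIds);
  profiles.forEach(profile => writer.write(profile));

  writer.close();
}

run().catch(err => {
  console.error('Scrape failed:', err);
  process.exit(1);
});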
Key Features Explained
1. Two-Phase Extraction Strategy
Phase 1: Collect Agent IDs
The first phase hits the paginated listing API to collect all agent IDs:
const axios = require('axios');

async function collectAgentIds() {
  const agentIds = [];
  let page = 1;
  let hasMore = true;

  while (hasMore) {
    try {
      const response = await axios.get(
        `https://example.com/api/agents`,
        {
          params: {
            page,
            pageSize: 500 // Max results per page
          },
          headers: {
            'User-Agent': getRandomUserAgent()
          }
        }
      );

      const agents = response.data.results;
      agentIds.push(...agents.map(a => a.id));
      console.log(`Page ${page}: Found ${agents.length} agents`);

      hasMore = agents.length === 500;
      page++;

      // Polite delay between pagination requests
      await delay(500);
    } catch (error) {
      console.error(`Failed to fetch page ${page}:`, error.message);
      break;
    }
  }

  console.log(`Total agents collected: ${agentIds.length}`);
  return agentIds;
}
This phase typically completes in 5-10 minutes and returns ~21,000 agent IDs.
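The delay() helper used here (and in the rest of the snippets) isn't part of any library; a one-liner like this is assumed:

// Sleep for the given number of milliseconds
function delay(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}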
Phase 2: Extract Detailed Profiles
The second phase processes agent IDs in batches with concurrency control:
const pLimit = require('p-limit');

async function extractProfiles(agentIds) {
  const limit = pLimit(parseInt(process.env.BATCH_SIZE, 10) || 20);
  const results = [];

  const promises = agentIds.map(id =>
    limit(async () => {
      try {
        const profile = await fetchAgentProfile(id);
        results.push(profile);
        console.log(`✓ Extracted: ${profile.firstName} ${profile.lastName}`);
      } catch (error) {
        console.error(`✗ Failed: ${id} - ${error.message}`);
      }
    })
  );

  await Promise.all(promises);
  return results;
}
2. Proxy Rotation with Webshare.io
To handle 21,000 requests without getting blocked, proxy rotation is essential:
const { HttpsProxyAgent } = require('https-proxy-agent');

class ProxyRotator {
  constructor(proxyList) {
    this.proxies = proxyList;
    this.currentIndex = 0;
  }

  getNextProxy() {
    const proxy = this.proxies[this.currentIndex];
    this.currentIndex = (this.currentIndex + 1) % this.proxies.length;
    return proxy;
  }

  createAgent() {
    const proxy = this.getNextProxy();
    return new HttpsProxyAgent(
      `http://${proxy.username}:${proxy.password}@${proxy.host}:${proxy.port}`
    );
  }
}

// Load proxies from Webshare.io
const proxies = loadProxiesFromFile('proxies.txt');
const rotator = new ProxyRotator(proxies);

async function fetchWithProxy(url) {
  const agent = rotator.createAgent();

  try {
    const response = await axios.get(url, {
      httpsAgent: agent,
      timeout: 30000,
      headers: {
        'User-Agent': getRandomUserAgent()
      }
    });
    return response.data;
  } catch (error) {
    throw new Error(`Proxy request failed: ${error.message}`);
  }
}
Why Webshare.io?
- Affordable proxy service (~$2.99 for 10 proxies)
- Residential and datacenter proxies available
- Simple integration (username:password format)
- Reliable uptime and rotating IPs
- No complicated setup
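The loadProxiesFromFile() call above isn't shown in the post. A minimal version, assuming the username:password@host:port format used in proxies.txt later on, could look like this:

const fs = require('fs');

// Parse proxies.txt lines of the form username:password@host:port
function loadProxiesFromFile(filePath) {
  return fs.readFileSync(filePath, 'utf-8')
    .split('\n')
    .map(line => line.trim())
    .filter(Boolean)
    .map(line => {
      const [credentials, endpoint] = line.split('@');
      const [username, password] = credentials.split(':');
      const [host, port] = endpoint.split(':');
      return { username, password, host, port };
    });
}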
3. Intelligent Caching System
Avoid re-scraping data with a file-based cache:
const fs = require('fs').promises;
const path = require('path');
const crypto = require('crypto');

class FileCache {
  constructor(cacheDir = '.cache', ttl = 7 * 24 * 60 * 60 * 1000) {
    this.cacheDir = cacheDir;
    this.ttl = ttl; // Default: 7 days
  }

  getCacheKey(identifier) {
    return crypto.createHash('md5').update(identifier).digest('hex');
  }

  async get(agentId) {
    const key = this.getCacheKey(agentId.toString());
    const cachePath = path.join(this.cacheDir, `${key}.json`);

    try {
      const data = await fs.readFile(cachePath, 'utf-8');
      const cached = JSON.parse(data);

      // Check if cache is still valid
      if (Date.now() - cached.timestamp < this.ttl) {
        console.log(`Cache hit: ${agentId}`);
        return cached.data;
      }
      console.log(`Cache expired: ${agentId}`);
    } catch (error) {
      // Cache miss
    }
    return null;
  }

  async set(agentId, data) {
    const key = this.getCacheKey(agentId.toString());
    const cachePath = path.join(this.cacheDir, `${key}.json`);

    await fs.mkdir(this.cacheDir, { recursive: true });
    await fs.writeFile(cachePath, JSON.stringify({
      timestamp: Date.now(),
      data
    }));
  }
}

// Usage in scraper
const cache = new FileCache();

async function fetchAgentProfile(agentId) {
  // Check cache first
  const cached = await cache.get(agentId);
  if (cached) return cached;

  // Fetch from API
  const data = await fetchWithProxy(`https://example.com/agents/${agentId}`);
  const profile = parseProfileData(data);

  // Save to cache
  await cache.set(agentId, profile);
  return profile;
}
Cache benefits:
- Dramatically speeds up re-runs
- Saves bandwidth and proxy credits
- Respects data freshness (7-day TTL)
- Simple file-based storage (no DB needed)
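The parseProfileData() function called in fetchAgentProfile above is where a fetched page becomes a structured record. The selectors depend entirely on the platform's markup, so treat this as an illustrative sketch using cheerio rather than the repo's actual implementation:

const cheerio = require('cheerio');

// Illustrative only: the real selectors depend on the platform's HTML structure
function parseProfileData(html) {
  const $ = cheerio.load(html);

  return {
    firstName: $('.agent-first-name').text().trim(), // hypothetical selector
    lastName: $('.agent-last-name').text().trim(),   // hypothetical selector
    company: $('.agent-company').text().trim(),      // hypothetical selector
    city: $('.agent-city').text().trim(),            // hypothetical selector
    state: $('.agent-state').text().trim(),          // hypothetical selector
    phone: $('a[href^="tel:"]').first().text().trim(),
    scrapedAt: new Date().toISOString()
  };
}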
4. Concurrent Batch Processing
Process multiple agents simultaneously while controlling load:
const BATCH_SIZE = parseInt(process.env.BATCH_SIZE, 10) || 20;
const DELAY_BETWEEN_BATCHES_MS = parseInt(process.env.DELAY_BETWEEN_BATCHES_MS, 10) || 500;

async function processBatch(agentIds) {
  const batches = [];

  // Split into batches
  for (let i = 0; i < agentIds.length; i += BATCH_SIZE) {
    batches.push(agentIds.slice(i, i + BATCH_SIZE));
  }

  console.log(`Processing ${batches.length} batches of ${BATCH_SIZE}`);

  for (const [index, batch] of batches.entries()) {
    console.log(`\nBatch ${index + 1}/${batches.length}`);

    // Process batch concurrently
    await Promise.all(
      batch.map(id => extractAgentProfile(id))
    );

    // Polite delay between batches
    if (index < batches.length - 1) {
      await delay(DELAY_BETWEEN_BATCHES_MS);
    }
  }
}
Optimal settings:
- BATCH_SIZE=20: Balance between speed and server load
- DELAY_BETWEEN_BATCHES_MS=500: Polite pause between batches
- MAX_RETRIES=3: Retry failed requests up to 3 times
With these settings, all 21,000 agents scrape in 3-4 hours.
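That 3-4 hour figure is easy to sanity-check: 21,000 agents at a batch size of 20 means about 1,050 batches, and total time is dominated by per-batch request latency plus the configured pause. The latency value below is an assumption, not a measurement:

// Rough runtime estimate, assuming ~10s for the slowest request in each batch
const totalAgents = 21000;
const batchSize = 20;
const delayMs = 500;
const assumedBatchLatencyMs = 10000;

const batches = Math.ceil(totalAgents / batchSize); // 1,050 batches
const estimatedHours = (batches * (assumedBatchLatencyMs + delayMs)) / 1000 / 3600;
console.log(`~${estimatedHours.toFixed(1)} hours`); // ~3.1 hours with these assumptions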
5. User-Agent Rotation
Avoid bot detection by rotating browser user agents:
const USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
];

function getRandomUserAgent() {
  return USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)];
}

// Use in requests
const response = await axios.get(url, {
  headers: {
    'User-Agent': getRandomUserAgent(),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
  }
});
6. Streaming NDJSON Output
Save data as it's extracted using streaming writes:
const fs = require('fs');
const path = require('path');

class NDJSONWriter {
  constructor(outputDir = './output') {
    this.outputDir = outputDir;
    this.stream = null;
  }

  async initialize(filename) {
    await fs.promises.mkdir(this.outputDir, { recursive: true });
    const filepath = path.join(this.outputDir, filename);
    this.stream = fs.createWriteStream(filepath, { flags: 'a' });
    console.log(`Writing to: ${filepath}`);
  }

  write(data) {
    if (!this.stream) {
      throw new Error('Stream not initialized');
    }
    this.stream.write(JSON.stringify(data) + '\n');
  }

  close() {
    if (this.stream) {
      this.stream.end();
    }
  }
}

// Usage
const writer = new NDJSONWriter();
await writer.initialize('agents.ndjson');

// Write as we scrape
for (const agentId of agentIds) {
  const profile = await fetchAgentProfile(agentId);
  writer.write(profile);
}

writer.close();
Why NDJSON?
- Memory efficient (stream-friendly)
- Append-safe (add new records easily)
- Simple to parse line-by-line
- Works great with Unix tools (jq, grep, etc.)
- No need to load the entire dataset into memory (see the streaming example below)
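To back up the memory-efficiency point: rather than reading the whole file at once (as the quick analysis examples later in the post do for convenience), you can stream it line by line with Node's built-in readline module:

const fs = require('fs');
const readline = require('readline');

// Stream the NDJSON file one record at a time, without loading it all into memory
async function countByState(filePath) {
  const counts = {};
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity
  });

  for await (const line of rl) {
    if (!line.trim()) continue;
    const agent = JSON.parse(line);
    const state = agent.state || 'Unknown';
    counts[state] = (counts[state] || 0) + 1;
  }

  return counts;
}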
7. Progress Tracking with CLI Progress Bar
Visual feedback during long-running scrapes:
const cliProgress = require('cli-progress');

class ProgressTracker {
  constructor(total) {
    this.bar = new cliProgress.SingleBar({
      format: 'Progress |{bar}| {percentage}% | {value}/{total} | ETA: {eta}s | {status}',
      barCompleteChar: '\u2588',
      barIncompleteChar: '\u2591',
      hideCursor: true
    });
    this.bar.start(total, 0, {
      status: 'Starting...'
    });
    this.completed = 0;
    this.failed = 0;
  }

  increment(success = true, status = '') {
    if (success) {
      this.completed++;
    } else {
      this.failed++;
    }
    this.bar.update(this.completed + this.failed, {
      status: status || `✓ ${this.completed} | ✗ ${this.failed}`
    });
  }

  stop() {
    this.bar.stop();
    console.log(`\nCompleted: ${this.completed}`);
    console.log(`Failed: ${this.failed}`);
    console.log(`Success Rate: ${((this.completed / (this.completed + this.failed)) * 100).toFixed(1)}%`);
  }
}

// Usage
const progress = new ProgressTracker(agentIds.length);

for (const id of agentIds) {
  try {
    await extractAgentProfile(id);
    progress.increment(true);
  } catch (error) {
    progress.increment(false);
  }
}

progress.stop();
Getting Started
Prerequisites
- Node.js 16+ (required for modern features)
- Webshare.io proxy account (or similar proxy service)
- 1GB free disk space (for cache and output)
Installation
git clone https://github.com/digitalevenings/travel-advisors-data-extractor.git
cd travel-advisors-data-extractor
npm install
Configuration
Create a .env file:
# Scraper settings
BATCH_SIZE=20 # Concurrent requests (10-50)
DELAY_BETWEEN_BATCHES_MS=500 # Delay between batches (ms)
MAX_RETRIES=3 # Retry attempts per request
PAGE_SIZE=500 # Results per API call
# Cache settings
CACHE_TTL=604800 # Cache validity (seconds, 7 days)
# Output settings
OUTPUT_DIR=./output
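One detail worth calling out: CACHE_TTL is given in seconds here, while the FileCache class above takes milliseconds, so the config loader has to convert. A small sketch of how the values might be read, assuming the dotenv package; the config object and its property names are illustrative, not the repo's actual module:

require('dotenv').config();

// Centralize env parsing so defaults and unit conversions live in one place
const config = {
  batchSize: parseInt(process.env.BATCH_SIZE, 10) || 20,
  delayBetweenBatchesMs: parseInt(process.env.DELAY_BETWEEN_BATCHES_MS, 10) || 500,
  maxRetries: parseInt(process.env.MAX_RETRIES, 10) || 3,
  pageSize: parseInt(process.env.PAGE_SIZE, 10) || 500,
  cacheTtlMs: (parseInt(process.env.CACHE_TTL, 10) || 604800) * 1000, // seconds → ms
  outputDir: process.env.OUTPUT_DIR || './output'
};

module.exports = config;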
Create a proxies.txt file with your Webshare.io proxies:
username:password@proxy1.webshare.io:80
username:password@proxy2.webshare.io:80
username:password@proxy3.webshare.io:80
Basic Usage
# Run the full scraper
npm start
# Phase 1 only: Collect agent IDs
npm run collect-ids
# Phase 2 only: Extract profiles
npm run extract-profiles
# Clear cache and start fresh
npm run clean-cache
Analyzing The Data
Once you have the data, here are some practical examples:
1. Find Advisors by Location
const fs = require('fs');

// Read NDJSON file
const agents = fs.readFileSync('output/agents.ndjson', 'utf-8')
  .split('\n')
  .filter(Boolean)
  .map(line => JSON.parse(line));

// Find advisors in California
const california = agents.filter(a =>
  a.state === 'CA' || a.state === 'California'
);

console.log(`Found ${california.length} advisors in California`);
2. Analyze by Company
const companyCounts = {};

agents.forEach(agent => {
  const company = agent.company || 'Independent';
  companyCounts[company] = (companyCounts[company] || 0) + 1;
});

const topCompanies = Object.entries(companyCounts)
  .sort((a, b) => b[1] - a[1])
  .slice(0, 20);

console.log('Top 20 companies by advisor count:');
console.table(topCompanies);
3. Export to CSV for Excel
Convert NDJSON to CSV:
const { Parser } = require('json2csv');

const fields = [
  'id',
  'firstName',
  'lastName',
  'company',
  'city',
  'state',
  'phone',
  'email'
];

const parser = new Parser({ fields });
const csv = parser.parse(agents);

fs.writeFileSync('advisors.csv', csv);
console.log('Exported to advisors.csv');
4. Geographic Distribution
const stateCounts = {};

agents.forEach(agent => {
  const state = agent.state || 'Unknown';
  stateCounts[state] = (stateCounts[state] || 0) + 1;
});

console.log('Advisors by state:');
console.table(
  Object.entries(stateCounts)
    .sort((a, b) => b[1] - a[1])
);
Error Handling & Retry Logic
Production-ready error handling with exponential backoff:
async function fetchWithRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fetchWithProxy(url);
    } catch (error) {
      console.error(`Attempt ${attempt}/${maxRetries} failed:`, error.message);

      if (attempt === maxRetries) {
        throw new Error(`Failed after ${maxRetries} attempts`);
      }

      // Exponential backoff: 2s after the first failure, 4s after the second, ...
      const backoffMs = Math.pow(2, attempt) * 1000;
      console.log(`Retrying in ${backoffMs / 1000}s...`);
      await delay(backoffMs);
    }
  }
}
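One refinement worth considering (not part of the repo excerpt): only retry errors that are actually transient, and honor the Retry-After header on 429 responses. This sketch assumes the original axios error, with its response object, is propagated; the fetchWithProxy above rethrows a plain Error, so it would need a small tweak for the status checks to work:

// Sketch: retry only transient failures, respecting Retry-After on 429 responses
async function fetchWithSmartRetry(url, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await fetchWithProxy(url);
    } catch (error) {
      const status = error.response?.status; // requires the axios error to be rethrown as-is
      const retryable = !status || status === 429 || status >= 500;

      if (!retryable || attempt === maxRetries) throw error;

      const retryAfter = Number(error.response?.headers?.['retry-after']);
      const backoffMs = retryAfter > 0
        ? retryAfter * 1000
        : Math.pow(2, attempt) * 1000;

      console.log(`Attempt ${attempt} failed (${status || error.message}), retrying in ${backoffMs / 1000}s`);
      await delay(backoffMs);
    }
  }
}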
Performance Optimization Tips
1. Tune Concurrency Based on Your Setup
// Conservative (fewer proxies, slower connection)
BATCH_SIZE=10
DELAY_BETWEEN_BATCHES_MS=1000
// Balanced (recommended)
BATCH_SIZE=20
DELAY_BETWEEN_BATCHES_MS=500
// Aggressive (many proxies, fast connection)
BATCH_SIZE=50
DELAY_BETWEEN_BATCHES_MS=200
2. Use Cache Effectively
# First run: Scrapes everything (3-4 hours)
npm start
# Second run: Only scrapes new/changed profiles (minutes)
npm start
# Force fresh scrape: Clear cache first
npm run clean-cache && npm start
3. Process Specific Agents
// Collect the full ID list once...
const allAgents = await collectAgentIds();

// ...then filter to the subset you need instead of extracting all 21,000 profiles
const recentAgents = allAgents.filter(id =>
  id > 15000 // e.g. only agents added recently, assuming roughly sequential IDs
);

await extractProfiles(recentAgents); // ✅ far fewer requests
Common Issues & Solutions
Issue: Proxy Connection Failures
Symptom: Error: Proxy connection refused
Solution: Verify proxy credentials and test connectivity
# Test proxy manually
curl -x http://username:password@proxy.webshare.io:80 https://httpbin.org/ip
Issue: Rate Limiting (429 Errors)
Symptom: Error: Too Many Requests (429)
Solution: Increase delay between batches
DELAY_BETWEEN_BATCHES_MS=1000 # Increase to 1 second
BATCH_SIZE=10 # Reduce concurrent requests
Issue: Memory Usage
Symptom: Process crashes with "out of memory"
Solution: Use streaming and process in smaller batches
BATCH_SIZE=10 # Reduce batch size
Issue: Cache Taking Too Much Space
Symptom: .cache directory is gigabytes
Solution: Clear old cache files
npm run clean-cache
Cost Analysis
Proxy Costs (Webshare.io):
- 10 proxies: $2.99/month
- 25 proxies: $6.25/month
- 100 proxies: $22.00/month
For 21,000 agents:
- With 10 proxies + BATCH_SIZE=20: ~4 hours
- With 25 proxies + BATCH_SIZE=50: ~2 hours
- With 100 proxies + BATCH_SIZE=100: ~1 hour
Recommendation: Start with 10 proxies ($2.99/month) for testing, scale up if needed.
Ethical Considerations
This scraper is built for educational purposes and market research. Please use responsibly:
Best Practices
✅ Do:
- Use reasonable concurrency limits
- Add delays between batches
- Respect robots.txt
- Only scrape publicly available data
- Cache results to minimize requests
- Use proxies to distribute load
❌ Don't:
- Hammer servers with excessive requests
- Scrape private/authenticated data
- Resell or misuse personal information
- Bypass technical protections aggressively
- Ignore rate limits
Legal Disclaimer
Check terms of service before scraping. This tool is provided as-is for educational purposes. Users are responsible for compliance with applicable laws and terms of service.
What's Next
This scraper is complete and production-ready, but here are some ideas:
Planned Features
- Automated updates: Daily scraping to catch new advisors
- Advanced filtering: Scrape by specialization, region, or company
- Data enrichment: Cross-reference with LinkedIn/company websites
- Email validation: Verify email addresses are valid
- Database integration: Direct export to PostgreSQL/MongoDB
Integration Ideas
- Airtable: Auto-sync data to Airtable base
- Google Sheets: Export directly to spreadsheets
- CRM systems: Import into Salesforce, HubSpot, etc.
- Data visualization: Build dashboards with geographic heatmaps
Real-World Applications
1. Market Research
Analyze the travel advisor industry:
// Advisors by state
const distribution = analyzeDemographics(agents);
// Top companies by market share
const marketShare = calculateMarketShare(agents);
// Growth trends (compare with historical data)
const growth = compareWithPrevious(agents, historicalData);
2. Lead Generation
Build targeted lists for B2B marketing:
function findTargetedLeads(criteria) {
  return agents.filter(agent => {
    return (
      agent.state === criteria.targetState &&
      agent.company !== criteria.excludeCompany &&
      agent.email // Has contact info
    );
  });
}

const leads = findTargetedLeads({
  targetState: 'CA',
  excludeCompany: 'Large Corp'
});
3. Competitive Intelligence
Understand competitor presence:
function analyzeCompetitor(companyName) {
  const advisors = agents.filter(a =>
    a.company?.toLowerCase().includes(companyName.toLowerCase())
  );

  return {
    total: advisors.length,
    states: [...new Set(advisors.map(a => a.state))],
    locations: advisors.map(a => ({ city: a.city, state: a.state }))
  };
}
Contributing
Found a bug? Want to add a feature? Contributions welcome!
How to contribute:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Areas for contribution:
- Support for additional proxy providers
- Database export modules (PostgreSQL, MongoDB)
- Web UI for browsing scraped data
- Docker containerization
- Unit tests and integration tests
- CLI improvements and better error messages
Final Thoughts
This scraper solves a real problem: getting comprehensive, structured data about travel advisors at scale. Whether you're doing market research, generating leads, or analyzing industry trends, having all this data in one place is incredibly valuable.
The combination of proxy rotation (avoiding blocks), intelligent caching (saving time and bandwidth), and concurrent processing (speed) makes this scraper production-ready and maintainable.
Key takeaways:
- Two-phase approach is essential for large-scale scraping
- Proxy rotation prevents IP blocks (Webshare.io is affordable and reliable)
- Caching dramatically speeds up re-runs
- NDJSON format is perfect for streaming data collection
- Progress tracking keeps you informed during long scrapes
The full scraper runs in 3-4 hours, costs ~$3/month for proxies, and outputs perfectly structured data ready for analysis.
Happy scraping! ✈️
Repository: github.com/digitalevenings/travel-advisors-data-extractor