Scaling Scraping on AWS Lambda with Bright Data: Complete Guide

19 min read
AWS Lambda
Bright Data
large-scale scraping
serverless

Learn how to build large-scale scraping systems with AWS Lambda and Bright Data. This comprehensive guide covers cost optimization, error handling, and scalability, with practical implementation examples and best practices.

Why AWS Lambda + Bright Data?

In large-scale web scraping, traditional EC2-based solutions struggle with complex infrastructure management and high running costs. Combining AWS Lambda with Bright Data addresses both problems at the root.

Traditional Challenges vs Solutions

Traditional Challenges:

  • High operational costs of 24/7 running EC2 instances
  • Complex configuration management during scaling
  • Manual IP blocking countermeasures

AWS Lambda + Bright Data Solutions:

  • Complete pay-per-use billing model
  • Auto-scaling (up to 1,000 concurrent executions)
  • Automatic IP management via global proxy network

AWS Lambda Scaling Mechanics

Scaling Specifications (as of 2025)

According to AWS official documentation, Lambda scaling behavior includes:

  • Concurrency Scaling Rate: each function can add up to 1,000 concurrent executions every 10 seconds
  • Request Throughput: capacity grows by up to 10,000 requests per second every 10 seconds
  • Function-Level Limits: each function scales independently of the others in your account

In theory, a single function can therefore work through on the order of 1 million requests per minute; the sketch below shows how quickly it can ramp to a given concurrency.
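
As a quick sanity check on these figures, here is a small helper that estimates ramp-up time from the documented scaling rate alone (a back-of-envelope sketch, not an official AWS formula):

import math

def ramp_up_time(target_concurrency, rate_per_step=1000, step_seconds=10):
    """Seconds until Lambda scaling reaches the target concurrency,
    at the documented rate of 1,000 new concurrent executions per 10 s."""
    steps = math.ceil(target_concurrency / rate_per_step)
    return steps * step_seconds

# e.g. 10,000 concurrent scrapers are reachable in ~100 seconds
print(ramp_up_time(10_000))  # -> 100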

Implementation Architecture

Basic Configuration

SQS Queue ──> Lambda Function ──> Bright Data ──> Target Websites
    ↓              ↓                    ↓
DynamoDB      CloudWatch          Result Storage
(State)       (Monitoring)         (S3/RDS)

Component Overview

  1. SQS: Scraping task queue management
  2. Lambda: Actual scraping processing
  3. Bright Data: Proxy and CAPTCHA bypass
  4. DynamoDB: State management and rate limiting
  5. S3/RDS: Result data persistence
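
Tasks enter this pipeline through SQS. A minimal producer sketch (the queue URL below is a hypothetical placeholder) batches URLs into messages of up to 10, the SQS batch maximum:

import json
import boto3

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/scraping-queue'  # hypothetical

def enqueue_scraping_tasks(urls, config=None):
    """Fan scraping tasks out to SQS in batches of 10 (the SQS maximum)."""
    for i in range(0, len(urls), 10):
        entries = [
            {'Id': str(n), 'MessageBody': json.dumps({'url': u, 'config': config or {}})}
            for n, u in enumerate(urls[i:i + 10])
        ]
        sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)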

Lambda Function Implementation

Main Processing Function

import json
import boto3
import requests
from datetime import datetime, timedelta
import os

# Environment variables
BRIGHT_DATA_USERNAME = os.environ['BRIGHT_DATA_USERNAME']
BRIGHT_DATA_PASSWORD = os.environ['BRIGHT_DATA_PASSWORD']
BRIGHT_DATA_ENDPOINT = os.environ['BRIGHT_DATA_ENDPOINT']

def lambda_handler(event, context):
    """
    Main Lambda function
    Process SQS messages and execute scraping
    """
    
    results = []
    
    # Process SQS batch messages
    for record in event['Records']:
        url = 'unknown'  # fallback so the error handler below can always report a URL
        try:
            # Parse message
            message = json.loads(record['body'])
            url = message['url']
            scraping_config = message.get('config', {})
            
            # Execute scraping
            result = scrape_with_bright_data(url, scraping_config)
            
            # Save result
            save_result_to_s3(result, url)
            
            results.append({
                'url': url,
                'status': 'success',
                'data_size': len(result.get('content', ''))
            })
            
        except Exception as e:
            print(f"Error processing: {url} - {str(e)}")
            results.append({
                'url': url,
                'status': 'error',
                'error': str(e)
            })
    
    return {
        'statusCode': 200,
        'body': json.dumps({
            'processed': len(results),
            'results': results
        })
    }

def scrape_with_bright_data(url, config):
    """
    Scraping using Bright Data proxy
    """
    
    # Bright Data proxy configuration
    proxy_config = {
        'http': f'http://{BRIGHT_DATA_USERNAME}:{BRIGHT_DATA_PASSWORD}@{BRIGHT_DATA_ENDPOINT}',
        'https': f'http://{BRIGHT_DATA_USERNAME}:{BRIGHT_DATA_PASSWORD}@{BRIGHT_DATA_ENDPOINT}'
    }
    
    # Header configuration (detection avoidance)
    headers = {
        'User-Agent': config.get('user_agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'),
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.7,ja;q=0.3',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive'
    }
    
    try:
        response = requests.get(
            url,
            proxies=proxy_config,
            headers=headers,
            timeout=30,
            verify=True
        )
        
        response.raise_for_status()
        
        return {
            'url': url,
            'status_code': response.status_code,
            'content': response.text,
            'headers': dict(response.headers),
            'timestamp': datetime.now().isoformat()
        }
        
    except requests.exceptions.RequestException as e:
        raise Exception(f"Scraping error: {str(e)}")

Error Handling and Retry Functionality

import time
import random
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1, max_delay=60):
    """
    Decorator for retry with exponential backoff
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    if attempt == max_retries - 1:
                        raise e
                    
                    # Exponential backoff + jitter
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    jitter = random.uniform(0, 0.1) * delay
                    time.sleep(delay + jitter)
                    
                    print(f"Retry {attempt + 1}/{max_retries}: {str(e)}")
            
            return None
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3)
def robust_scrape(url, config):
    """
    Enhanced error handling scraping function
    """
    return scrape_with_bright_data(url, config)

State Management with DynamoDB

Table Design

import hashlib
import time

import boto3
from boto3.dynamodb.conditions import Key, Attr
from botocore.exceptions import ClientError
from decimal import Decimal

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('scraping-state')

def manage_scraping_state(url, action='get'):
    """
    Scraping state management
    Used for duplicate execution prevention and rate limiting
    """
    
    url_hash = hashlib.md5(url.encode()).hexdigest()
    
    if action == 'lock':
        # Set processing state
        try:
            table.put_item(
                Item={
                    'url_hash': url_hash,
                    'url': url,
                    'status': 'processing',
                    'timestamp': Decimal(str(time.time())),
                    'ttl': int(time.time() + 3600)  # 1 hour TTL
                },
                ConditionExpression='attribute_not_exists(url_hash)'
            )
            return True
        except ClientError as e:
            if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
                return False  # Already processing
            raise
    
    elif action == 'complete':
        # Update to completed state
        table.update_item(
            Key={'url_hash': url_hash},
            UpdateExpression='SET #status = :status, completion_time = :time',
            ExpressionAttributeNames={'#status': 'status'},
            ExpressionAttributeValues={
                ':status': 'completed',
                ':time': Decimal(str(time.time()))
            }
        )
        return True
    
    elif action == 'get':
        # Get state
        response = table.get_item(Key={'url_hash': url_hash})
        return response.get('Item')
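
Inside the handler's record loop, the lock wraps each scrape so that two concurrent executions never process the same URL. A sketch, with failure handling simplified:

# Sketch: duplicate-execution guard inside the handler's record loop
if not manage_scraping_state(url, action='lock'):
    continue  # another execution already owns this URL

result = robust_scrape(url, scraping_config)
save_result_to_s3(result, url)
manage_scraping_state(url, action='complete')

# On failure, the 'processing' item simply expires via its one-hour TTL,
# after which the URL becomes eligible for retry.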

Cost Optimization Strategies

1. Lambda Cost Reduction

def optimize_lambda_performance():
    """
    Lambda function performance optimization
    """
    
    # Memory configuration optimization (1,769 MB recommended)
    # - At 1,769 MB, Lambda allocates the equivalent of one full vCPU
    # - Balances CPU performance, memory, and network bandwidth
    
    # Concurrent execution limit (cost control)
    RESERVED_CONCURRENCY = 100  # Initial setting
    
    # Batch processing to reduce SQS costs
    MAX_BATCH_SIZE = 10  # SQS messages
    
    return {
        'memory': 1769,
        'timeout': 120,  # 2 minutes
        'reserved_concurrency': RESERVED_CONCURRENCY,
        'batch_size': MAX_BATCH_SIZE
    }
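
To sanity-check infrastructure costs, a rough estimator helps. This sketch assumes us-east-1 x86 pricing of $0.0000166667 per GB-second and $0.20 per million requests; check current AWS pricing before relying on it:

def estimate_lambda_cost(invocations, avg_duration_s, memory_mb=1769,
                         price_per_gb_s=0.0000166667, price_per_million_req=0.20):
    """Rough monthly Lambda cost estimate; pricing assumptions may be outdated."""
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    return gb_seconds * price_per_gb_s + invocations / 1_000_000 * price_per_million_req

# e.g. 1M requests/month averaging 3 s at 1,024 MB comes to roughly $50
print(f"${estimate_lambda_cost(1_000_000, 3, memory_mb=1024):.2f}")  # -> $50.20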

2. Bright Data Cost Management

def track_bright_data_usage():
    """
    Bright Data usage tracking and limiting
    """
    
    # Usage tracking
    usage_table = dynamodb.Table('bright-data-usage')
    
    def log_request(url, bytes_used):
        today = datetime.now().strftime('%Y-%m-%d')
        
        usage_table.update_item(
            Key={'date': today},
            UpdateExpression='ADD total_bytes :bytes, request_count :count',
            ExpressionAttributeValues={
                ':bytes': bytes_used,
                ':count': 1
            }
        )
    
    def check_daily_limit():
        today = datetime.now().strftime('%Y-%m-%d')
        response = usage_table.get_item(Key={'date': today})
        
        if 'Item' in response:
            total_bytes = response['Item'].get('total_bytes', 0)
            # Daily limit: 100GB
            return total_bytes < (100 * 1024 * 1024 * 1024)
        
        return True
    
    return {
        'log_request': log_request,
        'check_limit': check_daily_limit
    }
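
A thin wrapper can then gate every request on the daily budget check (a sketch building on the functions above):

usage = track_bright_data_usage()

def guarded_scrape(url, config):
    """Refuse to scrape once the daily Bright Data bandwidth budget is spent."""
    if not usage['check_limit']():
        raise Exception('Daily Bright Data bandwidth limit reached')
    result = scrape_with_bright_data(url, config)
    usage['log_request'](url, len(result['content'].encode('utf-8')))
    return result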

3. Cost-Effectiveness Calculation

| Component | Traditional (EC2) | Lambda + Bright Data | Reduction |
| --- | --- | --- | --- |
| Infrastructure | $500/month | $50/month | 90% ↓ |
| Proxy | $300/month | $200/month | 33% ↓ |
| Management Hours | 40 hrs/month | 5 hrs/month | 87% ↓ |
| Total | $800/month | $250/month | 69% ↓ |

Monitoring and Alert Configuration

CloudWatch Monitoring

import boto3

cloudwatch = boto3.client('cloudwatch')

def setup_monitoring(success_rate, avg_response_time, bytes_used):
    """
    CloudWatch monitoring configuration
    Publishes the run's metrics under a custom namespace
    """
    
    # Custom metrics
    metrics = [
        {
            'MetricName': 'ScrapingSuccessRate',
            'Value': success_rate,
            'Unit': 'Percent'
        },
        {
            'MetricName': 'AverageResponseTime',
            'Value': avg_response_time,
            'Unit': 'Milliseconds'
        },
        {
            'MetricName': 'BrightDataUsage',
            'Value': bytes_used,
            'Unit': 'Bytes'
        }
    ]
    
    # A single call can carry all three data points
    cloudwatch.put_metric_data(
        Namespace='ScrapingSystem',
        MetricData=metrics
    )

def create_alarms():
    """
    CloudWatch alarm configuration
    """
    
    alarms = [
        {
            'AlarmName': 'HighErrorRate',
            'Namespace': 'AWS/Lambda',
            'MetricName': 'Errors',
            'Threshold': 10,
            'ComparisonOperator': 'GreaterThanThreshold'
        },
        {
            'AlarmName': 'HighCostAlert',
            'Namespace': 'ScrapingSystem',
            'MetricName': 'BrightDataUsage',
            'Threshold': 80 * 1024 * 1024 * 1024,  # 80GB, alerts before the 100GB daily cap
            'ComparisonOperator': 'GreaterThanThreshold'
        }
    ]
    
    for alarm in alarms:
        cloudwatch.put_metric_alarm(
            EvaluationPeriods=1,
            Period=300,  # evaluate over 5-minute windows
            Statistic='Sum',
            **alarm
        )

Deployment and CI/CD

SAM Template Example

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31

Parameters:
  BrightDataUsername:
    Type: String
    Description: Bright Data Username
  BrightDataPassword:
    Type: String
    NoEcho: true
    Description: Bright Data Password

Resources:
  ScrapingFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: src/
      Handler: lambda_function.lambda_handler
      Runtime: python3.9
      MemorySize: 1769
      Timeout: 120
      ReservedConcurrentExecutions: 100
      Environment:
        Variables:
          BRIGHT_DATA_USERNAME: !Ref BrightDataUsername
          BRIGHT_DATA_PASSWORD: !Ref BrightDataPassword
          BRIGHT_DATA_ENDPOINT: brd-customer-hl_123456-zone-datacenter.brd.superproxy.io:22225
      Events:
        SQSEvent:
          Type: SQS
          Properties:
            Queue: !GetAtt ScrapingQueue.Arn
            BatchSize: 10

  ScrapingQueue:
    Type: AWS::SQS::Queue
    Properties:
      VisibilityTimeoutSeconds: 180
      MessageRetentionPeriod: 1209600  # 14 days

  StateTable:
    Type: AWS::DynamoDB::Table
    Properties:
      TableName: scraping-state
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions:
        - AttributeName: url_hash
          AttributeType: S
      KeySchema:
        - AttributeName: url_hash
          KeyType: HASH
      TimeToLiveSpecification:
        AttributeName: ttl
        Enabled: true

Performance Tuning

1. Concurrent Execution Optimization

def calculate_optimal_concurrency(target_rps, avg_duration):
    """
    Calculate optimal concurrent execution count
    
    Args:
        target_rps: Target requests per second
        avg_duration: Average execution time (seconds)
    
    Returns:
        Recommended concurrent execution count
    """
    
    # Little's Law: L = λ × W
    # L = concurrent executions, λ = arrival rate, W = processing time
    optimal_concurrency = target_rps * avg_duration
    
    # Consider AWS Lambda limits
    max_concurrency = min(optimal_concurrency, 1000)
    
    return {
        'recommended': max_concurrency,
        'scaling_time': max_concurrency / 100,  # Scale at 100/second
        'max_rps': max_concurrency / avg_duration
    }

# Usage example
performance = calculate_optimal_concurrency(
    target_rps=100,  # 100 requests/second
    avg_duration=5   # 5 seconds/request
)
print(f"Recommended concurrency: {performance['recommended']}")

2. Memory Usage Optimization

import psutil  # not included in the Lambda runtime; bundle it as a dependency
import tracemalloc
from functools import wraps

def memory_profiler(func):
    """
    Memory usage profiler
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        
        # Memory before execution
        process = psutil.Process()
        memory_before = process.memory_info().rss / 1024 / 1024  # MB
        
        try:
            result = func(*args, **kwargs)
        finally:
            # Memory after execution
            memory_after = process.memory_info().rss / 1024 / 1024  # MB
            current, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            
            print(f"Memory usage: {memory_after - memory_before:.2f}MB")
            print(f"Peak memory: {peak / 1024 / 1024:.2f}MB")
        
        return result
    return wrapper
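
A quick way to apply it while load testing (the tracing adds overhead, so avoid it in production):

# Usage: wrap the scraping step during a load test
profiled_scrape = memory_profiler(scrape_with_bright_data)
result = profiled_scrape('https://example.com', {})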

Common Issues and Solutions

Q1: Lambda function times out

Cause: Large page processing takes too long

Solution:

def handle_large_pages(url, max_size=10*1024*1024):  # 10MB limit
    """
    Handle large page processing limits
    """
    response = requests.get(url, stream=True, timeout=30)
    response.raise_for_status()
    
    content = b""
    for chunk in response.iter_content(chunk_size=8192):
        content += chunk
        if len(content) > max_size:
            raise Exception(f"Page size exceeds limit: {len(content)} bytes")
    
    return content.decode('utf-8')

Q2: Bright Data costs higher than expected

Cause: Misunderstanding how Bright Data meters bandwidth usage

Solution:

def optimize_bandwidth_usage():
    """
    Bandwidth usage optimization
    """
    
    # Request only a byte range (honored only if the server supports Range requests)
    headers = {
        'Range': 'bytes=0-1023'  # First 1KB only
    }
    
    # Enable compression
    headers['Accept-Encoding'] = 'gzip, deflate, br'
    
    # Filter unnecessary resources
    headers['Accept'] = 'text/html'  # HTML only
    
    return headers

Q3: Hitting concurrent execution limits

Cause: Sudden traffic spikes

Solution:

def implement_rate_limiting():
    """
    Rate limiting implementation
    """
    
    # Control message delivery rate with SQS
    # Adjust DelaySeconds and VisibilityTimeout
    
    sqs_config = {
        'DelaySeconds': 1,  # 1 second delay
        'VisibilityTimeoutSeconds': 180,  # 3 minutes
        'MaxReceiveCount': 3  # Max retries (configured via the queue's redrive policy)
    }
    
    return sqs_config

Real Implementation Case Study

E-commerce Price Monitoring System

Requirements:

  • Price monitoring for 1,000 sites × 10,000 products
  • Hourly updates
  • 99.9% availability

Implementation Results:

  • Processing Capacity: 100K requests/hour
  • Success Rate: 99.7%
  • Monthly Cost: $180 (reduced from $800)
  • Response Time: Average 3.2 seconds

# Actual configuration example
PRODUCTION_CONFIG = {
    'lambda_memory': 1769,
    'lambda_timeout': 120,
    'reserved_concurrency': 200,
    'sqs_batch_size': 10,
    'retry_attempts': 3,
    'bright_data_pool': 'datacenter',
    'monitoring_interval': 300  # 5 minutes
}

Summary: Next-Generation Scraping Architecture

The combination of AWS Lambda and Bright Data represents the new standard for large-scale scraping.

Key Benefits:

  1. Cost Efficiency: 69% cost reduction compared to traditional solutions
  2. Scalability: Up to 1 million requests/minute processing capacity
  3. Reduced Operational Load: No server management required
  4. High Availability: Leveraging AWS's 99.99% SLA

Recommended Use Cases:

  • E-commerce price monitoring
  • Real estate data collection
  • News & social media monitoring
  • SEO competitive analysis

This technology stack makes large-scale data collection, previously only feasible for large enterprises, easily accessible to small and medium businesses.

Get Started Now: Sign up for the Bright Data free trial to test the proxy service, and start building next-generation scraping systems on AWS Lambda. For pricing details, see the Bright Data pricing explanation.


Information in this article is current as of January 6, 2025. AWS Lambda and Bright Data service specifications may change, so please check official documentation for the latest updates.
