How Much Does It Really Cost to Run Browser-Based Web Scraping at Scale?
Learn the real cost of running 1,000 browser-based web scraping requests. Compare commercial JS rendering providers vs cloud setups, and find out when it's cheaper to run your own scraping infrastructure.
Introduction
Web scraping at scale comes with a variety of challenges. Once you move from local development to production, issues inevitably arise. The most common are anti-scraping mechanisms, but there are other, subtler problems. In this blog, I’ll focus on one of those silent but significant challenges:
Running browsers at scale
Why Use Browsers for Web Crawling and Scraping?
There are two main reasons to use browsers:
- **JavaScript Rendering:** Some websites won't display content unless JavaScript is executed. That's something only a browser can handle properly.
- **Avoiding Detection:** Sending raw HTTP requests instead of using a browser can quickly flag your scraper as a bot. That can lead to bans and force frequent proxy changes. This is an oversimplification, but enough to make the point.
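To make this concrete, here is a minimal sketch of JavaScript rendering with Playwright in Python. The URL and the `networkidle` wait condition are placeholder choices, not recommendations for any particular site:

```python
from playwright.sync_api import sync_playwright

def render_page(url: str) -> str:
    """Load a page in headless Chromium and return the fully rendered HTML."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Wait until network activity settles so JS-injected content is present.
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return html

if __name__ == "__main__":
    # example.com is just a placeholder target.
    print(render_page("https://example.com")[:200])
```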
How much does it cost to run 1000 requests in a browser?
Commercial Options
Before diving into technical details, it's helpful to benchmark against commercial providers. We're not talking about advanced anti-scraping solutions, just basic JavaScript rendering.
The following prices reflect high-volume use, assuming you're already spending thousands of euros monthly. If you're under €1,000/month, expect higher costs.
| Provider | JS Rendering Price [$/1,000 requests] |
|---|---|
| Blat | 0.364 |
| Provider 2 | 0.798 |
| Provider 3 | 0.374 |
| Provider 4 | 0.310 |
| Provider 5 | 0.394 |
These prices give a good sense of market rates. Self-hosting is generally cheaper.
Real Cost of Running 1,000 Browser Requests (Excluding Proxies)
We're focusing on the browser cost only. Proxy pricing varies greatly depending on the vendor, quality, and target sites.
The key cost variable when running a pool of browsers in the cloud is the cloud provider.
Cloud Provider
The pricing breakdown shown below is based on the following assumptions:
- Each browser instance needs ~2 GB of RAM and 1 CPU.
- Average page load time: 10 seconds.
- More powerful machines might decrease the time required to load a page in the browser, but they are also more expensive.
Serverless Function (Lambda)
Description
Using serverless functions is ideal for absorbing bursts of requests that need to be handled in near real time.
However, some of the requests you send will incur extra cold-start time, as the Docker image needs to be loaded into the serverless function. For instance, Google Cloud Functions promises cold starts of 2 seconds, while other cloud providers, like Scaleway Serverless Functions, have cold starts closer to 15 seconds.
Keep in mind that if you send one request to a serverless function, you will be charged not only for the execution time of your request, but for the entire time the function instance stays up. This retention period is not explicitly documented, but it is generally observed to last between 5 and 15 minutes of inactivity.
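For context, here is a rough sketch of what such a function might look like, assuming an AWS-Lambda-style Python handler and a container image with Playwright and Chromium preinstalled. The event shape and browser flags are illustrative assumptions:

```python
from playwright.sync_api import sync_playwright

def handler(event, context):
    # Assumes the caller passes the target URL in the event payload.
    url = event["url"]
    with sync_playwright() as p:
        # Flags commonly needed in constrained serverless sandboxes.
        browser = p.chromium.launch(
            headless=True,
            args=["--no-sandbox", "--disable-dev-shm-usage"],
        )
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()
    return {"statusCode": 200, "body": html}
```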
Pricing
The cost of executing 1,000 requests follows this formula:

$$\text{Total Cost} = 1000 \cdot CC \cdot T + \frac{F}{1000}$$

Where:
- $CC$ is the compute cost per second [$/s].
- $T$ is the average execution time per invocation [s].
- $F$ is the fixed cost per 1 million invocations [$].
For instance, for Google Cloud Functions it is calculated as follows:

$$1000 \cdot 0.000029 \cdot 10 + \frac{0.40}{1000} = 0.290 + 0.0004 = 0.2904$$
| Cloud Provider | Compute Cost (2 GB, 1 CPU) [$/s] | Request Cost [$/1M invocations] | Total Cost [$/1,000 requests] |
|---|---|---|---|
| AWS Lambda | 0.0000333 | 0.20 | 0.33320 |
| Azure Functions | 0.000052 | 0.20 | 0.52020 |
| Google Cloud Functions | 0.000029 | 0.40 | 0.29040 |
| Scaleway | 0.000024 | 0.15 | 0.24015 |
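As a sanity check, the formula is easy to encode. This small Python sketch reproduces the Google Cloud Functions row of the table above:

```python
def serverless_cost_per_1000(cc: float, t: float, f: float) -> float:
    """cc: compute cost [$/s]; t: avg execution time [s]; f: cost per 1M invocations [$]."""
    return 1000 * cc * t + f / 1000

# Google Cloud Functions row: 0.290 + 0.0004 = 0.2904
print(serverless_cost_per_1000(cc=0.000029, t=10, f=0.40))
```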
Virtual Servers (on demand)
Description
Unlike serverless functions, virtual servers require you to manage the infrastructure, adding complexity. Additionally, launching a new virtual server takes minutes, significantly longer than the seconds needed for serverless function cold starts.
The main benefit of this approach is that it can reduce costs by a factor of ~3 compared to serverless functions.
Pricing
We are comparing machines with 4 GB of RAM and 2 CPUs, which means we can run 2 browsers on the same machine. Remember, one browser needs around 2 GB of RAM and 1 CPU to run smoothly.
In this case, the formula to calculate the `Total Cost` is a bit different:

$$\text{Total Cost} = \frac{1000 \cdot c_{vm} \cdot T}{3600 \cdot N}$$

Where:
- $c_{vm}$ is the cost of the virtual machine [$/h]; dividing by 3600 converts it to $/s.
- $T$ is the average execution time per invocation [s].
- $N$ is the number of browsers you can run on the instance. In this case $N = 2$, as the machines have 4 GB of RAM and 2 CPUs.
For instance, for AWS EC2 (c7i.large) it is calculated as follows:

$$\frac{1000 \cdot 0.08925 \cdot 10}{3600 \cdot 2} \approx 0.12395$$
| Cloud Provider | Machine | Cost per Hour [$/h] | Total Cost [$/1,000 requests] |
|---|---|---|---|
| AWS | c7i.large | 0.08925 | 0.12395 |
| Azure | F2s v2 | 0.0846 | 0.1175 |
| Google Cloud | c2d-highcpu-2 | 0.07496 | 0.1041138 |
| Scaleway | POP2-HC-2C-4G | 0.06100 | 0.084722 |
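The same sanity check works here. This sketch reproduces the c7i.large row of the table above:

```python
def vm_cost_per_1000(c_vm: float, t: float, n: int) -> float:
    """c_vm: VM price [$/h]; t: avg execution time [s]; n: browsers per VM."""
    return (1000 * c_vm * t) / (3600 * n)

# c7i.large row: 892.5 / 7200 ≈ 0.12395
print(vm_cost_per_1000(c_vm=0.08925, t=10, n=2))
```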
Long-term consumption commitments give you access to more competitive pricing: some cloud providers offer ~30% and ~50% savings for 1-year and 3-year commitments respectively.
Threshold (n)
Knowing the cost of running 1,000 requests in a browser (c), we can now calculate the number of requests (the threshold) beyond which it no longer makes sense to externalize the execution of browsers (from a cost perspective).
As a rule of thumb, we can use the formula below to know when it's better to internalize or externalize your pool of browsers:

$$(p - c) \cdot n < 2 \cdot s$$

Where:
- $p$ is the price per request offered by the commercial solutions [$/request].
- $c$ is your own cost per request [$/request].
- $n$ is the number of requests per month [requests/month].
- $s$ is the salary of a senior data engineer [$/month].

It makes sense to run your own pool of browsers (or even proxies) if the inequality above is false, i.e. when the monthly savings exceed the cost of two engineers.
Even if it is false, there might still be reasons to externalize, such as:
- Quick time to market, to validate products fast.
- Keeping your company focused on its core; externalize the rest.
In the formula above, we assume you need at least 2 engineers to ensure your pool of browsers is always up and running, and that someone is always available (holidays, sick leave, etc.) in case something breaks.
Assuming your team does not have the expertise to manage its own infrastructure, it is forced to use serverless functions (Lambda):

- Cost (c): $0.24015 per 1,000 requests
- Price of the commercial solution (p): $0.364 per 1,000 requests (Blat solution)
- Salary (s): $80,361.86 per year, or roughly $6,697 per month (the average salary of a data engineer in Germany)

Plugging these in, the break-even point is $n = 2s / (p - c) \approx 13{,}394 / 0.00012385 \approx 108$ million requests per month. So, with these prices (p), consider internalizing your pool of browsers at around 3.6 million requests per day.
If your team does have the knowledge (and the time) to manage its own infrastructure, it might be interested in using virtual servers (on demand). In this case, the cost (c) is most probably $0.084722 per 1,000 requests instead of $0.24015 per 1,000 requests.
If we run the numbers again, $n = 2s / (p - c) \approx 13{,}394 / 0.000279 \approx 48$ million requests per month. Your limit in this case is not 3.6 million requests per day, but roughly 1.6 million instead.
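Putting the rule of thumb into code, the sketch below reproduces both thresholds. The 30-day month and the 12-way salary split are simplifying assumptions:

```python
def breakeven_per_day(p: float, c: float, yearly_salary: float) -> float:
    """p, c in $ per request; returns the daily volume where savings cover 2 engineers."""
    monthly_salary = yearly_salary / 12          # simplifying assumption
    monthly_n = 2 * monthly_salary / (p - c)     # solve (p - c) * n = 2 * s for n
    return monthly_n / 30                        # 30-day month, another assumption

salary = 80_361.86  # average data engineer salary in Germany [$/year]
print(breakeven_per_day(0.364 / 1000, 0.24015 / 1000, salary))   # ~3.6M/day (serverless)
print(breakeven_per_day(0.364 / 1000, 0.084722 / 1000, salary))  # ~1.6M/day (on-demand VMs)
```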
Tip: keep in mind that these calculations only take into account the cost of running browsers in the cloud, without considering proxies. Adding the cost of proxies increases (c), reducing the per-request saving (p - c) over the commercial solution and therefore increasing the threshold (n).
Conclusion
If you're scraping at scale (1.6M - 3.6M requests/day or more), talk to your browser provider. You may be eligible for significant discounts, or it might be time to consider other browser providers or building your own pool.