Scraping Upwork Jobs [2024]
Oct 9, 2024
Learn how to scrape Upwork to get the latest jobs posted on the platform
Scraping Upwork
Who might be interested?
If you are a freelancer offering your services on Upwork, chances are you have felt constrained by the platform's limited job search capabilities. This limitation usually has two origins:
[Job Search Alert Frequency] Upwork does not notify me quickly enough when new jobs are posted. In this article we cover how to fix this problem by scraping Upwork.
[Job Search Recommendation] The filters that Upwork allows are too generic, so I spend hours reading job posts that are not suitable for my skills. This topic is covered in this other article.
How can we scrape Upwork job posts?
Find the target URL
Upwork has a specific URL that you can use to download the jobs posted on the website, and there is no need to be logged in to access it. Copy the link below into your browser and you will see Upwork jobs.
If you start playing with the filters in the `advanced search`, you will see how the URL updates dynamically.
Example: let's say we want to search for `web scraping` jobs and sort them by `recency`. Here is the new URL:
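At the time of writing, the resulting URL looked like the example below (treat the exact parameter names as an assumption and copy the real URL from your own browser):

```
https://www.upwork.com/nx/search/jobs/?q=web%20scraping&sort=recency
```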
Also, if you paginate, you can see how the URL is updated as well (in this case to page 2), and the same pattern works for any other page.
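Again based on the browser behavior at the time of writing (so treat it as an assumption), pagination simply appends a `page` parameter:

```
https://www.upwork.com/nx/search/jobs/?q=web%20scraping&sort=recency&page=2
```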
So far so good: thanks to this URL we have access to Upwork jobs without needing to log in.
Extract the URL content
The URLs above will work in any browser without logging in. However, if you try a plain GET request with curl in a terminal, you will quickly see that you are blocked with a `403` error. In this case, a way to fix this is to use `headless browsers`. Here is a tutorial and also a video covering the first steps in Playwright. In case you want to dockerize your application, you have a tutorial here.
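As a minimal sketch of the headless-browser approach, assuming Playwright for Python is installed (`pip install playwright` followed by `playwright install chromium`) and reusing the search URL shape from above:

```python
from playwright.sync_api import sync_playwright

# Assumed URL shape; copy the real one from your browser.
SEARCH_URL = "https://www.upwork.com/nx/search/jobs/?q=web%20scraping&sort=recency"

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto(SEARCH_URL, wait_until="domcontentloaded")
    html = page.content()  # the rendered HTML we will parse later
    browser.close()
```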
The tutorials above will let you scrape Upwork for free, but you will need to spend some time on setup if you have not done it before. You will also need proxies, depending on the number of requests you end up doing per day. There are lots of proxy providers; here we recommend proxyscrape. So with `headless browsers` and `proxies` you should be good to go!
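For reference, Playwright accepts a proxy at launch time, so plugging in a provider is a small change to the sketch above; the endpoint and credentials below are placeholders for whatever your proxy provider gives you:

```python
# Placeholder values: substitute your provider's endpoint and credentials.
browser = p.chromium.launch(
    headless=True,
    proxy={
        "server": "http://proxy.example.com:8080",
        "username": "YOUR_PROXY_USER",
        "password": "YOUR_PROXY_PASSWORD",
    },
)
```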
In case you do not want to spend time on that and you are looking for a solution that manages Playwright and proxies internally, we recommend using the Zenrows web unblocker. With a simple request you will be able to extract the HTML content; an example of the request is shown below.
You can get the ZENROWS_USER by registering at Zenrows.
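As a sketch of what that request can look like from Python (the endpoint and parameter names follow ZenRows' public API docs at the time of writing, so double-check them against the current docs):

```python
import requests

ZENROWS_USER = "YOUR_ZENROWS_USER"  # the key you get when registering
target = "https://www.upwork.com/nx/search/jobs/?q=web%20scraping&sort=recency"

# js_render asks ZenRows to execute JavaScript, emulating a headless browser.
resp = requests.get(
    "https://api.zenrows.com/v1/",
    params={"apikey": ZENROWS_USER, "url": target, "js_render": "true"},
)
html = resp.text  # the HTML content of the Upwork search page
```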
Parse the URL content
Unfortunately, the URL content is HTML. The job posts are in there, but it can be tricky to extract the exact content we are looking for. Isolating the specific data you are interested in from the HTML is called parsing. You can use well-known Python libraries like Scrapy or BeautifulSoup to convert the HTML into JSON using CSS or XPath selectors.
In this blog post you can see the XPaths that worked for us in October 2024:
| Field | XPath |
| --- | --- |
| `root` | `.//section[@data-ev-label='search_result_impression' and @data-ev-page_number='1']` |
| `root.job_listings.*` | `.//article[@data-ev-label='search_results_impression' and contains(@class, 'job-tile')]` |
| `root.job_listings.*.title` | `.//a[contains(@data-test, 'job-tile-title-link')]//text()` |
| `root.job_listings.*.posted_date` | `.//span[text()='Posted']/following-sibling::span[1]/text()` |
| `root.job_listings.*.job_type` | `.//li[@data-test='job-type-label']/strong/text()` |
| `root.job_listings.*.job_url` | `.//a[contains(@class, 'up-n-link')]/@href` |
| `root.job_listings.*.experience_level` | `.//li[@data-test='experience-level']/strong/text()` |
| `root.job_listings.*.estimated_budget` | `.//strong[contains(@class, 'mr-1')]/following-sibling::strong/text()` |
| `root.job_listings.*.description` | `.//p[contains(@class, 'text-body-sm')]/descendant::text()` |
| `root.job_listings.*.skills` | `.//div[contains(@class, 'air3-token-container')]` |
| `root.job_listings.*.skills.*` | `.//span[@data-v-d8f62af2]/text()` |
| `root.job_listings.*.estimated_time` | `.//li[@data-test='duration-label']//strong[position()=2]//text()` |
| `root.job_listings.*.hourly_rate` | `.//strong[contains(text(), 'Hourly: ')]/text()` |
Example: if you want to extract `root.job_listings.*.hourly_rate` for each job post, you need to combine the XPaths for `root`, `root.job_listings.*` and `root.job_listings.*.hourly_rate`, resulting in:

`.//section[@data-ev-label='search_result_impression' and @data-ev-page_number='1']//article[@data-ev-label='search_results_impression' and contains(@class, 'job-tile')]//strong[contains(text(), 'Hourly: ')]/text()`
If you follow this approach, you will be able to parse Upwork for free.
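Putting it together, here is a minimal parsing sketch using the parsel library (`pip install parsel`), reusing the `html` fetched in the previous step and a subset of the October 2024 selectors from the table above:

```python
from parsel import Selector

ROOT = ".//section[@data-ev-label='search_result_impression' and @data-ev-page_number='1']"
JOB = ".//article[@data-ev-label='search_results_impression' and contains(@class, 'job-tile')]"

def parse_jobs(html: str) -> list[dict]:
    """Convert the raw search-results HTML into a list of job dicts."""
    jobs = []
    for tile in Selector(text=html).xpath(ROOT).xpath(JOB):
        jobs.append({
            "title": "".join(tile.xpath(".//a[contains(@data-test, 'job-tile-title-link')]//text()").getall()).strip(),
            "posted_date": tile.xpath(".//span[text()='Posted']/following-sibling::span[1]/text()").get(),
            "job_type": tile.xpath(".//li[@data-test='job-type-label']/strong/text()").get(),
            "job_url": tile.xpath(".//a[contains(@class, 'up-n-link')]/@href").get(),
            "hourly_rate": tile.xpath(".//strong[contains(text(), 'Hourly: ')]/text()").get(),
        })
    return jobs
```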
All-in-One Solution
Keep in mind that the web is quite dynamic, so chances are that the XPaths provided above will not last forever: sooner or later some Upwork update will end up breaking the selectors. You might find similar problems with proxies; if they are working now, it does not mean they will keep working in the future.
At Blat we are constantly improving our AI agent, which is able to generate production-ready web scraping code in minutes. It manages proxies and parsing internally, and if there is some update on the web, this is detected and the web scraping algorithm is updated accordingly. In case you want to consume our all-in-one scraping solution, send a request here. We will send you a BLAT_API_KEY that will allow you to access the following endpoint:
Here you can see an example of the JSON output provided by the API:
Questions / Follow-Ups
Imagine all the cool automations you could build based on the information available on the internet. Literally, the sky's the limit! Blat is here to help you. We will be happy to hear from you about the data you need to feed your automations. If you know the URL that you would like to scrape, just send a request here and we will come back to you with a solution.
Do you have any questions? Do not hesitate to get in touch with us :)