How to earn money with web scraping in 2024?

4 APRIL, 2024 WEB-SCRAPING EXPERT OPINION
Interview with Pierluigi Vinciguerra, Co-Founder and CTO at Databoutique

What is the most sought-after data in 2023/2024? Which types/themes/categories of datasets are the most popular?

It’s difficult to say: web scraping is becoming mainstream thanks to the latest developments in AI and LLMs, which rely heavily on it, but it’s still far from mass adoption.

One of the most common use cases for web scraping is price comparison and market intelligence: every company would like to know where their products are sold and at what price, and how their competitors are behaving.

Another valuable piece of information comes from the inventory levels hidden in some websites. Imagine being able to monitor a company by scraping the inventory levels in their stores or warehouses every day: by doing so, you can estimate their revenues, best-selling products, and so on. This requires accurate data collection but, as you can imagine, it is a gold mine.
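As a rough illustration of the inventory idea, the sketch below turns daily stock snapshots into an estimate of units sold and revenue. The snapshot format, the numbers, and the unit price are all assumptions made for the example, not data from a real website. It counts every drop in stock as sales and treats every rise as a restock, so sales on a restock day are missed and the result is a lower bound.

```python
from datetime import date

def estimate_units_sold(snapshots):
    """Estimate units sold from (day, stock_level) snapshots.

    A drop between two consecutive days is counted as sales.
    A rise is treated as a restock and ignored.
    """
    ordered = sorted(snapshots)  # sort by day
    sold = 0
    for (_, prev_stock), (_, curr_stock) in zip(ordered, ordered[1:]):
        if curr_stock < prev_stock:
            sold += prev_stock - curr_stock
    return sold

# Hypothetical daily snapshots for a single product.
snapshots = [
    (date(2024, 4, 1), 120),
    (date(2024, 4, 2), 112),
    (date(2024, 4, 3), 105),
    (date(2024, 4, 4), 150),  # stock went up: restock, not sales
    (date(2024, 4, 5), 141),
]

units = estimate_units_sold(snapshots)
unit_price = 49.90  # assumed price, scraped from the same product page
estimated_revenue = round(units * unit_price, 2)
print(units, estimated_revenue)
```

The estimate is only as good as the collection: a missed day or a wrong stock figure flows straight into the result, which is why accuracy matters so much here.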

Last but not least, there is location data: Airbnb, hotels, real estate. Collected over a long period, it can describe the economic trend of a country or a city.

What are some ways to earn money through web scraping today? Who would be potential buyers, and what are the platforms or marketplaces available?

I can see three ways to make money with web scraping, and they are not mutually exclusive.

The first and most obvious is doing gigs as a freelancer. You can treat it like your 9-to-5 job.

Then you can sell your code on places like Apify Store: you publish your scraper (they call it an Actor), and other people run it on the Apify platform and get the results.

Last but not least, you can sell the dataset resulting from your scraper on Databoutique.com. This is a new marketplace for web-scraped data. We opened just a few months ago, and we’re working on bringing more traffic to the platform while shipping new features every week, so unfortunately, for the moment, you won’t get rich overnight.
The idea behind it is quite simple: until today, web scraping has been more like a tailor-made suit: it’s expensive, it’s made for you, and the seller will have a hard time selling the same thing to another buyer.

We wanted to sell the H&M shirt instead: standard datasets that cover the basic needs of the buyer, quality checked but at a lower price point.
Think about it: even if you have a SaaS that relies on web-scraped data, so that in theory the service is the same for everyone, you will always need to scrape some new websites for new customers. This makes your solution expensive and reduces the number of potential customers. But it’s also true that if these websites are new to me, someone else is surely already scraping them.

What we did was create a data marketplace where people already scraping some websites can upload their datasets (if they comply with the rules), adapted to certain predefined data schemas. In this way, we’re building a huge catalog of datasets that, since they’re standardized and quality-checked, can be bundled with other providers’ datasets, increasing their chances of being bought. And the more a dataset is bought, the less it can cost, since the extraction costs stay the same; and the less it costs, the more buyers it will attract, generating a positive flywheel for the mass adoption of web scraping.
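To make the two ideas above concrete, here is a minimal sketch of a schema check and of the pricing arithmetic. The schema fields and the cost figures are hypothetical and are not Databoutique’s actual schemas or prices. The assumption is simply that a record is accepted only if it has exactly the predefined fields with the expected types, and that a fixed extraction cost is shared among however many buyers there are.

```python
# Hypothetical schema: field name -> expected Python type.
PRICE_SCHEMA = {
    "product_code": str,
    "product_name": str,
    "price": float,
    "currency": str,
}

def conforms(record, schema=PRICE_SCHEMA):
    """Return True if the record has exactly the schema's fields,
    each holding a value of the expected type."""
    return record.keys() == schema.keys() and all(
        isinstance(record[field], expected)
        for field, expected in schema.items()
    )

def break_even_price(extraction_cost, buyers):
    """Price per buyer that just covers a fixed extraction cost."""
    return extraction_cost / buyers

good = {"product_code": "A1", "product_name": "Shirt",
        "price": 19.99, "currency": "EUR"}
bad = {"product_code": "A1", "price": "19.99"}  # missing fields, price is a string

print(conforms(good), conforms(bad))
print(break_even_price(100.0, 1), break_even_price(100.0, 20))
```

With an assumed extraction cost of 100, a single buyer has to cover all of it, while twenty buyers need to pay only 5 each: that is the flywheel in numbers.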

What does the web scraper's toolbox include? Which software and services would be effective in gathering data?

Things have changed a lot since I started doing web scraping 10 years ago: today the toolkit for a web scraper is quite varied. First of all, you’ll need a web scraping framework like Scrapy in Python, for all the websites that don’t have any anti-bot protection installed.

Then you’ll need one or more proxy providers, as your operations start to scale.

On top of that, you’ll need a browser automation tool like Playwright, Puppeteer, or Selenium, when things start to get complicated.
Last but not least, for websites with protections that rely heavily on browser fingerprinting, you’ll need an anti-detect browser like Octo to mimic a real user browsing them.

Between these macro layers, there are tons of tools that target specific issues, like TLS fingerprinting or human-like mouse movements.
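The layers described above can be read as an escalation ladder: start with the lightest tool and move up only when the website forces you to. The sketch below encodes that reading. The `SiteProfile` flags and the `pick_toolkit` function are hypothetical names invented for the example, and real decisions are rarely this clear-cut.

```python
from dataclasses import dataclass

@dataclass
class SiteProfile:
    """Hypothetical description of a target website."""
    has_anti_bot: bool = False        # any anti-bot protection installed
    checks_fingerprint: bool = False  # protection relies heavily on browser fingerprinting
    high_volume: bool = False         # operations have started to scale

def pick_toolkit(site):
    """Return the tool layers suggested for a site, lightest option first."""
    tools = []
    if site.checks_fingerprint:
        tools.append("anti-detect browser")
    elif site.has_anti_bot:
        tools.append("browser automation (Playwright, Puppeteer, Selenium)")
    else:
        tools.append("scraping framework (Scrapy)")
    if site.high_volume:
        tools.append("proxy provider")
    return tools

print(pick_toolkit(SiteProfile()))
print(pick_toolkit(SiteProfile(has_anti_bot=True, high_volume=True)))
print(pick_toolkit(SiteProfile(has_anti_bot=True, checks_fingerprint=True)))
```

Keeping the choice in one place like this makes it easy to start every new website on the cheapest layer and record why a heavier one was needed.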

What will be the biggest technical challenges for web scraping in 2024? Is web scraping facing new challenges due to LLMs and AI?

The biggest technical challenge is still anti-bot evasion. There are more and more sophisticated techniques for blocking bots, but luckily we also have more and more tools to compete. I don’t think LLMs and AI are a big issue; they can help with writing the code. At the moment we’re seeing some AI-powered products coming to market, both for auto-parsing the HTML and for anti-bot evasion.

Which are the most challenging websites to scrape? Could you also provide some insight into the protective systems that are particularly difficult to bypass?

Generally speaking, websites where scarce items are sold (Hermes bags, sneakers, tickets, and so on) are the hardest to scrape. In these cases, a legit fingerprint is usually not enough: the scraper should also behave like a human, for example by clicking around instead of jumping to pages via direct URLs. On these websites, even a real person can get blocked for doing something strange, like clicking around very fast.

Are there any legal issues that web scrapers should keep in mind? Could you comment on the recent Bright Data/Meta case and whether it will change the perception and legal status of web scraping?

I’m not a lawyer, so if readers have any doubts about their operations, it’s better to call a real one instead of listening to my suggestions. That said, there are some golden rules for being 100% safe when scraping:
  • Do not scrape any personal information.
  • Do not scrape any copyrighted information, especially if you plan to resell it as is.
  • Do not scrape anything behind a login or otherwise not publicly accessible.
  • Do not hurt the target website’s business.
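One practical way to respect the last rule is to limit how often you hit the target website. The sketch below is a minimal throttle that enforces a minimum interval between requests. The class name `PoliteThrottle` and the two-second interval are assumptions for the example, and a suitable interval depends on the website.

```python
import time

class PoliteThrottle:
    """Enforce a minimum interval, in seconds, between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, None before the first

    def wait(self):
        """Block until the interval has elapsed; return the seconds waited."""
        waited = 0.0
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last = self._clock()
        return waited

# Usage: call wait() before every request to the same website.
throttle = PoliteThrottle(min_interval=2.0)
throttle.wait()  # the first call returns immediately
```

Calling `wait()` before each request keeps the load on the target predictable, whatever the speed of the scraper itself.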

About the Meta vs Bright Data ruling: it is very specific to the case and to Meta’s ToS, so I would not generalize anything from it. But it’s a matter of fact that web scraping, when done ethically and with respect for the target website, is a completely legal practice and should not be seen as a grey area. In the end, it’s a tool like a hammer: it can be used for good, like building houses, or for bad, like breaking the windows of parked cars. It’s up to whoever is in charge of the tool to understand what can be done and what cannot.

Is there a place where one can learn about web scraping and interact with the community?

Thanks for this question, which lets me add my shameless plug. Almost two years ago I started my newsletter about web scraping, The Web Scraping Club. I write about my experiences in web scraping, the tools I’m testing, how to bypass anti-bots, and so on.

The idea came to me because I could not find a place with practical guidance on what to do when I needed to bypass an anti-bot. So I started sharing my notes with the world, and now the newsletter has 2400+ subscribers.

But there are also other great blogs for people who want to go into more detail about what’s happening under the hood of an anti-bot: Trickster.dev is one of them, along with botting.rocks and webscraping.wiki.
