Extracting Data from Amazon: Tips, Tricks, and Tools for Web Scraping

June 12, 2024 · Web Scraping · Expert Opinion
Artur Hvalei
Technical Support Specialist, Octo Browser
Amazon is one of the largest e-commerce platforms in the world and a vast source of valuable data. Effectively extracting and using information about products, prices, and customer reviews is crucial for business development. Whether you are promoting your own products or tracking competitors, you will need tools for data collection to analyze the market. However, scraping on Amazon has its peculiarities you need to be aware of. In this article, we will discuss the necessary steps you need to take to create a web scraper, and an expert from the Octo Team will offer an example code for Amazon scraping.


What is Web Scraping?

Web scraping is the process of automated data collection from websites. Special programs or scripts, called scrapers, extract information from web pages and convert it into a structured data format, convenient for further analysis and use. The most common formats for storing and processing data are CSV, JSON, SQL, or Excel.
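
For example, here is a minimal sketch of how scraped records might be saved in two of these formats using only Python's standard library (the sample records are illustrative placeholders, not real scraped data):

import csv
import json

# Illustrative records; in practice these would come from your scraper
records = [
    {"product_name": "Octopus plush toy", "price": "19.99", "rating": "4.7"},
    {"product_name": "Octopus desk lamp", "price": "34.50", "rating": "4.4"},
]

# Save as CSV
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

# Save as JSON
with open("products.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)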

Nowadays, web scraping is widely used in Data Science, marketing, and e-commerce. Web scrapers collect vast amounts of information for personal and professional purposes. Moreover, modern tech giants rely on web scraping methods to monitor and analyze trends.

Tools and Technologies for Web Scraping

Python and Libraries
Python is one of the most popular programming languages for web scraping. It is known for its simple and clear syntax, making it an ideal choice for both beginners and experienced coders. Another advantage is the wide range of available libraries for web scraping, such as Beautiful Soup, Scrapy, Requests, and Selenium. These libraries allow you to easily send HTTP requests, process HTML documents, and interact with web pages.
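
As a quick illustration, here is a minimal sketch that combines Requests and Beautiful Soup to download a page and list its links (example.com is used here as a neutral placeholder URL):

import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
response = requests.get("https://example.com", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Print the text and target of every link on the page
for link in soup.find_all("a"):
    print(link.get_text(strip=True), link.get("href"))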
APIs
Amazon provides APIs for accessing its data, such as the Amazon Product Advertising API. This allows you to request specific information in a structured format without having to parse the entire HTML page.
Cloud Services
Cloud platforms such as AWS Lambda and Google Cloud Functions can be used to automate and scale web scraping processes. They provide high performance and the ability to handle large volumes of data.
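
For illustration, here is a minimal sketch of an AWS Lambda handler that could wrap a scraping job; the scrape_prices() helper is hypothetical and stands in for your own scraping logic:

import json

def scrape_prices(keyword):
    # Hypothetical placeholder: call your scraper here and return a list of dicts
    return [{"keyword": keyword, "price": "N/A"}]

def lambda_handler(event, context):
    # Lambda passes the invocation payload in `event`
    keyword = event.get("keyword", "fashion")
    results = scrape_prices(keyword)
    return {"statusCode": 200, "body": json.dumps(results)}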
Specialized Tools
There are also specialized tools for web scraping, such as multi-accounting browsers and proxies. Multi-accounting browsers spoof the digital fingerprint, while proxies mask and rotate your IP address, helping you bypass website security restrictions and speed up data collection.

Applications of Amazon Web Scraping

Market and Competition Analysis
Web scraping allows you to collect data on your competitors' products and prices, analyze their product range, and identify trends. This helps companies adapt their strategies and remain competitive.
Price Monitoring
Collecting data on prices for similar products helps companies set competitive prices and respond promptly to market changes. This is especially important in the context of dynamic pricing and promotions.
Collecting Reviews and Ratings
Reviews and ratings are an important source of information about how consumers perceive products. Analyzing this data helps identify the strengths and weaknesses of products, as well as get ideas for their improvement.
Product Range Research
Using web scraping, you can analyze the range of products on Amazon, identify popular categories, and, based on this, make decisions about expanding or changing the product portfolio.
Tracking New Products
Web scraping can help you quickly learn about new products appearing on the platform, which can be useful for manufacturers, distributors, and market analysts.

Navigating Amazon Interface Elements

Before you start scraping, it's important to understand how web pages are structured. Most web pages are written in HTML and contain elements such as tags, attributes, and classes. Knowing HTML will help you correctly identify and extract the necessary data.

On Amazon's homepage, shoppers use the search bar to enter keywords related to the desired product. As a result, they receive a list with product names, prices, ratings, and other essential attributes. Additionally, products can be filtered by various parameters such as price range, product category, and customer reviews. Navigating these components helps users easily find the products they're interested in, compare alternatives, view additional information, and conveniently make purchases on Amazon.
Amazon homepage with a search query for Octopus.

To get a more extensive list of results, you can use the pagination buttons located at the bottom of the page. Each page typically contains a large number of listings, allowing you to browse more products. The filters at the top of the page let you refine your search according to your requirements.

To understand the HTML structure of Amazon, follow these steps:
  1. Go to the website.
  2. Search for the desired product using the search bar or select a category from the product list.
  3. Open the developer tools by right-clicking on the product and selecting Inspect from the context menu.
  4. Examine the HTML layout to identify the tags and attributes of the data you intend to extract.
Developer Tools
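
Once you have identified the relevant tags and classes in the developer tools, they translate directly into CSS selectors in code. Here is a minimal sketch with Beautiful Soup that uses the same selectors as the script later in this article; note that html_source is a placeholder for HTML you have fetched or saved, and Amazon's class names change from time to time, so treat them as examples rather than stable identifiers:

from bs4 import BeautifulSoup

# Placeholder for the page HTML you have fetched or saved
html_source = "<html>...</html>"
soup = BeautifulSoup(html_source, "html.parser")

# Each listing is a div with these Amazon layout classes
for listing in soup.select("div.a-section.a-spacing-small"):
    name = listing.select_one("h2.a-size-mini > a > span")
    price = listing.select_one("span.a-price > span.a-offscreen")
    print(name.get_text(strip=True) if name else "N/A",
          price.get_text(strip=True) if price else "N/A")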

Key Steps to Start Scraping

Web scraping involves two main steps: finding the necessary information and structuring it. After studying the website's structure, let’s set up the necessary components for automating the scraping process.

For this task, we will use Python. These are the most popular libraries for web scraping:
  • Requests: This is a popular third-party Python library for making HTTP requests. It provides a simple and intuitive interface for sending HTTP requests to web servers and receiving responses. This library is perhaps the most well-known one associated with web scraping.
  • BeautifulSoup: This library is designed for easy and quick parsing of HTML and XML documents. It provides a simple interface for navigating, searching, and modifying the document tree, making the web scraping process more intuitive. It allows you to extract information from a page by searching for tags, attributes, or specific text.
  • Selenium: For interacting with dynamic web pages.
  • Pandas: This is a powerful and reliable library for data processing and cleaning. For example, after extracting data from web pages, you can use Pandas to handle missing values, transform the data into the required format, and remove duplicates.
  • Playwright: Allows efficient interaction with web pages that use JavaScript for dynamic content updates. This makes it especially useful for scraping sites like Amazon, where many elements load asynchronously.
  • Scrapy: For more complex web scraping tasks.
A document tree is a structure of an HTML document that shows the connections between various elements of the page, their order and nesting.
Once you have prepared Python, open the terminal or shell and create a new project directory using the following commands:

mkdir scraping-amazon-python
cd scraping-amazon-python
To install the libraries used in this example, open the terminal or shell and run the following commands:

pip install httpx
pip install pandas
pip install playwright
playwright install
Note: The last command (playwright install) is crucial as it ensures the proper installation of necessary browser files.

Make sure the installation process completes without any issues before proceeding to the next step. If you encounter difficulties while setting up the environment, you can consult AI services like ChatGPT, Mistral AI, and others. These services can help with troubleshooting errors and provide step-by-step instructions for resolving them.
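
As an optional sanity check, you can verify that all three libraries import correctly before writing any code; if the command below prints OK, the environment is ready:

python3 -c "import httpx, pandas; from playwright.async_api import async_playwright; print('OK')"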

Using Python and Libraries for Web Scraping

In your project directory, create a new Python script named amazon_scraper.py and add the following code:

import httpx
from playwright.async_api import async_playwright
import asyncio
import pandas as pd
# UUID of the Octo Browser profile to use
PROFILE_UUID = "UUID_SHOULD_BE_HERE"
# Search query to run on Amazon
SEARCH_REQUEST = "fashion"
async def main():
    async with async_playwright() as p:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                'http://127.0.0.1:58888/api/profiles/start',
                json={
                    'uuid': PROFILE_UUID,
                    'headless': False,
                    'debug_port': True
                }
            )
            if not response.is_success:
                print(f'Start response is not successful: {response.json()}')
                return
            start_response = response.json()
            ws_endpoint = start_response.get('ws_endpoint')
        browser = await p.chromium.connect_over_cdp(ws_endpoint)
        page = browser.contexts[0].pages[0]
        # Opening Amazon
        await page.goto(f'https://www.amazon.com/s?k={SEARCH_REQUEST}')
        # Extract information
        results = []
        listings = await page.query_selector_all('div.a-section.a-spacing-small')
        for listing in listings:
            result = {}
            # Product name
            name_element = await listing.query_selector('h2.a-size-mini > a > span')
            result['product_name'] = await name_element.inner_text() if name_element else 'N/A'

            # Rating
            rating_element = await listing.query_selector('span[aria-label*="out of 5 stars"] > span.a-size-base')
            result['rating'] = (await rating_element.inner_text())[0:3] if rating_element else 'N/A'

            # Number of reviews
            reviews_element = await listing.query_selector('span[aria-label*="stars"] + span > a > span')
            result['number_of_reviews'] = await reviews_element.inner_text() if reviews_element else 'N/A'

            # Price
            price_element = await listing.query_selector('span.a-price > span.a-offscreen')
            result['price'] = await price_element.inner_text() if price_element else 'N/A'
            # Skip listings where no data could be extracted at all
            if all(value == 'N/A' for value in result.values()):
                continue
            results.append(result)
        # Close browser
        await browser.close()

        return results
# Run the scraper and save results to a CSV file
results = asyncio.run(main())
df = pd.DataFrame(results)
df.to_csv('amazon_products_listings.csv', index=False)
In this code, we use Python's asynchronous capabilities with the Playwright library to extract product listings from a specific Amazon page. We launch an Octo Browser profile, then connect to it via the Playwright library. The script opens a URL with a specific search query, which can be edited at the top of the script in the SEARCH_REQUEST variable.

By launching the browser and navigating to the target Amazon URL, you extract product information: name, rating, review count, and price. The script iterates through each listing on the page, marks missing fields as "N/A," and filters out listings that contain no data at all. The results are stored in a Pandas DataFrame and then exported to a CSV file named amazon_products_listings.csv.

The resulting CSV file amazon_products_listings.csv.
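
To quickly check the exported data, you can load the CSV file back with Pandas, for example:

import pandas as pd

# Load the exported file and preview the first rows
df = pd.read_csv("amazon_products_listings.csv")
print(df.head())
print(f"Total listings collected: {len(df)}")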

What Else is Needed for Effective Web Scraping?

Web scraping on Amazon without proxies and specialized scraping tools comes with numerous challenges. Like many other popular platforms, Amazon has request rate limits, meaning it can block your IP address if you exceed the set limit of requests. Additionally, Amazon uses bot detection algorithms that identify your digital fingerprint when you access site pages. In view of this, it is recommended to follow commonly used best practices to avoid detection and potential blocking by Amazon. Here are some of the most useful tips and tricks:
Simulating Natural Behavior
Amazon can block or temporarily suspend activity that it deems robotic or suspicious. It is crucial that your scraper looks as human-like as possible.

To develop a successful crawling pattern, think about how an average user would behave when exploring a page, and add corresponding clicks, scrolls, and mouse movements. To avoid blocking, introduce delays or random intervals between requests using functions like asyncio.sleep(random.uniform(1, 5)). This will make your pattern look less robotic.
Crawling is the process of discovering new or updated site pages and gathering information about them.
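
Here is a minimal sketch of such human-like behavior inside the Playwright script above; the timings and scroll distances are arbitrary examples, not tuned values:

import asyncio
import random

async def act_like_a_human(page):
    # Random pause before interacting with the page
    await asyncio.sleep(random.uniform(1, 5))
    # Scroll down in a few uneven steps, as a person would
    for _ in range(random.randint(2, 5)):
        await page.mouse.wheel(0, random.randint(300, 800))
        await asyncio.sleep(random.uniform(0.5, 2))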
A Realistic Fingerprint
Use a multi-accounting browser to spoof your digital fingerprint with that of a real device. Platforms like Amazon collect various fingerprint parameters to identify bots. To avoid detection, ensure that your fingerprint parameters and their combinations are always plausible.

Additionally, to reduce the risk of detection, you should rotate IP addresses. High-quality proxies play a crucial role when working with a multi-accounting browser, so choose reputable providers that offer a good balance of quality and price. Select proxies according to your scraping strategy and their geolocation, as Amazon serves different content to different regions. Make sure your proxies have a low spam/abuse/fraud score, and consider their speed: residential proxies, for example, may have high latency, which will affect the scraping speed.
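
Here is a minimal sketch of IP rotation with the Requests library; the proxy addresses are placeholders you would replace with those from your provider:

import random
import requests

# Placeholder proxy list from your provider
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
]

def fetch(url):
    # Pick a random proxy for every request
    proxy = random.choice(PROXIES)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=15)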
CAPTCHA Solving Services

Besides a multi-accounting browser, high-quality proxies, and a well-thought-out script simulating human behavior, an automatic CAPTCHA solver can also come in handy. For this, you can use OSS solutions, manual solving services like 2captcha and anti-captcha, or automated solvers such as Capmonster.

Why Use Multi-Accounting Browsers for Web Scraping?

Many websites, platforms and services use information about the user's device, browser, and connection to identify them. These sets of data are known as digital fingerprints. Based on fingerprint information, website security systems determine whether a user is suspicious.

The specific set of analyzed parameters can vary depending on the website's security system. To connect and properly display content, a browser provides over 50 different parameters related to your device, each of which can be part of the digital fingerprint.

Additionally, a browser can be tasked with creating a simple 2D or 3D image, and based on how the device performs this task, a hash can be generated. This hash will distinguish this device from other site visitors. This is how hardware fingerprinting via Canvas and WebGL works.

Minor changes to some of the characteristics the browser reports to a website's security systems will not prevent an already familiar user from being identified. You can change the browser, time zone, or screen resolution, but even if you do all of this simultaneously, the likelihood of identification remains high.

Fingerprinting, along with other anti-scraping technologies such as rate limiting, geolocation, WAF, challenges, and CAPTCHAs, exists to protect websites from automated interactions. A multi-accounting browser with a high-quality fingerprint spoofing system will help bypass website security systems. As a result, the effectiveness of web scraping will increase, as data collection becomes faster and more reliable.

How Do Multi-Accounting Browsers Work?

The role of multi-accounting browsers in bypassing website security systems lies in spoofing the digital fingerprint. Using a multi-accounting browser, you can create multiple browser profiles, which are virtual copies of the browser, isolated from each other and having their own set of characteristics and settings: cookies, browsing history, extensions, proxies, fingerprint parameters. Each multi-accounting browser profile appears to website security systems as a separate user.

Octo Browser Interface.

How to Scrape Using a Multi-Accounting Browser

Multi-accounting browsers typically offer automation capabilities through the Chrome DevTools Protocol (CDP), which allows you to automate the necessary scraping actions programmatically. For convenience, you can use open-source libraries such as Puppeteer, Playwright, or Selenium.
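
The connection flow is the same as in the script earlier in this article: start a browser profile through the local API, obtain its WebSocket endpoint, and attach your automation library over CDP. A condensed sketch with Playwright:

import httpx
from playwright.async_api import async_playwright

async def connect_to_profile(profile_uuid):
    # Start the Octo Browser profile and get its CDP WebSocket endpoint
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://127.0.0.1:58888/api/profiles/start",
            json={"uuid": profile_uuid, "headless": False, "debug_port": True},
        )
        ws_endpoint = response.json().get("ws_endpoint")
    # Attach Playwright to the running profile over the Chrome DevTools Protocol
    playwright = await async_playwright().start()
    return await playwright.chromium.connect_over_cdp(ws_endpoint)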

In Octo Browser, all the necessary documentation to get started is available here, and detailed API instructions can be found here.

Will a Multi-Accounting Browser Reduce the Scraping Costs?

Multi-accounting browsers can both increase and decrease scraping costs, depending on the resource and working conditions.

  • Costs can be reduced by minimizing the risk of blocks and automating manual tasks. Multi-accounting browsers offer a profile manager and profile data auto-synchronization features for this.
  • Costs can be increased primarily by the purchase of licenses for the required number of profiles.

All else being equal, using a multi-accounting browser saves budget and reduces scraping costs in the long run.

Is Web Scraping Worth the Automation Effort?

Web scraping is a powerful tool for automatic data collection and analysis. Companies use it to gather necessary information and make informed decisions in the sphere of e-commerce.

Use Python to efficiently search for products, reviews, descriptions, and prices on Amazon. Writing the necessary code may take some time and effort, but the results will exceed all expectations. To avoid the attention of security systems, simulate natural user behavior, use third-party IP addresses, and regularly change the digital fingerprint. Specialized tools such as multi-accounting browsers and proxy servers will allow you to rotate browser fingerprints and IP addresses to overcome restrictions and increase scraping speed.

Frequently Asked Questions

What kinds of data can be extracted using web scraping?
Web scraping can extract text, images, tables, metadata, and more.
Can scraping be detected?
Yes, scraping can be detected by anti-bot systems, which check your IP address, the consistency of your digital fingerprint parameters, and your behavioral patterns. If a check fails, access to the website from your IP address and device will be blocked.
How to avoid blocks during web scraping?
Use proxy servers, simulate real user actions, and add delays between requests.
What legal aspects of web scraping need to be considered?
The legal aspects of web scraping are governed by data protection laws and intellectual property rights. Scraping publicly available data on websites is not considered illegal if your actions do not violate their ToS. Follow the rules of the platform and always consider the legal aspects of web scraping.
Does Amazon allow scraping?
Scraping publicly available data on Amazon is not considered illegal provided your actions do not violate its ToS.
What are the main mistakes that can occur during web scraping, and how to avoid them?
Typical mistakes include brittle HTML parsing, failing to track changes in the website's structure, and exceeding request rate limits. To avoid them, check and update your code regularly.
How to minimize the occurrence of CAPTCHAs when scraping Amazon?
Use reliable proxy servers and rotate your IP addresses. Reduce the scraping speed by adding random intervals between requests and actions. Ensure that your digital fingerprint parameters match those of real devices and do not raise suspicions of anti-bot systems.
