Amazon homepage with a search query for Octopus.
mkdir scraping-amazon-python
cd scraping-amazon-python
pip3 install httpx
pip3 install pandas
pip3 install playwright
playwright install
Make sure the installation completes without errors before moving on. If you run into problems while setting up the environment, AI services such as ChatGPT or Mistral AI can help you troubleshoot the errors and walk you through resolving them step by step.
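To confirm that everything is in place, you can run a quick sanity check like the one below; it simply imports each dependency and prints the installed versions (the file name check_env.py is just illustrative).

# check_env.py - quick sanity check for the installed dependencies
import httpx
import pandas
import playwright.async_api  # raises ImportError if Playwright is missing

print('httpx', httpx.__version__)
print('pandas', pandas.__version__)
print('All dependencies imported successfully.')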
import asyncio

import httpx
import pandas as pd
from playwright.async_api import async_playwright

# Profile UUID from Octo Browser
PROFILE_UUID = "UUID_SHOULD_BE_HERE"
# Search query
SEARCH_REQUEST = "fashion"


async def main():
    async with async_playwright() as p:
        async with httpx.AsyncClient() as client:
            # Start the Octo Browser profile through the local API
            response = await client.post(
                'http://127.0.0.1:58888/api/profiles/start',
                json={
                    'uuid': PROFILE_UUID,
                    'headless': False,
                    'debug_port': True
                }
            )
            if not response.is_success:
                print(f'Start response is not successful: {response.json()}')
                return

            start_response = response.json()
            ws_endpoint = start_response.get('ws_endpoint')

            # Attach Playwright to the running profile over CDP
            browser = await p.chromium.connect_over_cdp(ws_endpoint)
            page = browser.contexts[0].pages[0]

            # Open the Amazon search results page
            await page.goto(f'https://www.amazon.com/s?k={SEARCH_REQUEST}')

            # Extract information from each listing
            results = []
            listings = await page.query_selector_all('div.a-section.a-spacing-small')
            for listing in listings:
                result = {}

                # Product name
                name_element = await listing.query_selector('h2.a-size-mini > a > span')
                result['product_name'] = await name_element.inner_text() if name_element else 'N/A'

                # Rating
                rating_element = await listing.query_selector('span[aria-label*="out of 5 stars"] > span.a-size-base')
                result['rating'] = (await rating_element.inner_text())[0:3] if rating_element else 'N/A'

                # Number of reviews
                reviews_element = await listing.query_selector('span[aria-label*="stars"] + span > a > span')
                result['number_of_reviews'] = await reviews_element.inner_text() if reviews_element else 'N/A'

                # Price
                price_element = await listing.query_selector('span.a-price > span.a-offscreen')
                result['price'] = await price_element.inner_text() if price_element else 'N/A'

                # Keep the listing only if at least one field was found
                if any(value != 'N/A' for value in result.values()):
                    results.append(result)

            # Close the browser
            await browser.close()
            return results


# Run the scraper and save the results to a CSV file
results = asyncio.run(main())
df = pd.DataFrame(results)
df.to_csv('amazon_products_listings.csv', index=False)
The script launches the browser, navigates to the target Amazon URL, and extracts the product information: name, rating, review count, and price. It iterates through every listing on the page; fields it cannot find are marked as "N/A", and listings with no data at all are filtered out. The results are collected into a Pandas DataFrame and exported to a CSV file named amazon_products_listings.csv.
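Note that timing can be fragile: Amazon may still be rendering results when the extraction starts. As an optional hardening step (not part of the original script), you could wait for the listings to appear first, using the same selector the script already relies on:

# Optional: call this inside main() right before query_selector_all
# to make sure at least one listing has rendered (timeout in ms).
async def wait_for_listings(page, timeout_ms=15000):
    await page.wait_for_selector('div.a-section.a-spacing-small', timeout=timeout_ms)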
Besides a multi-accounting browser, high-quality proxies, and a well-thought-out script that simulates human behavior, an automatic CAPTCHA solver can also come in handy. For this you can use open-source solutions, manual solving services such as 2Captcha and Anti-Captcha, or automated solvers such as CapMonster.
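With a solving service, the integration usually boils down to sending the CAPTCHA parameters to the service's API and injecting the returned token into the page. Below is a minimal sketch assuming the official 2captcha-python client (pip3 install 2captcha-python); the API key, sitekey, URL, and injection step are all placeholders that depend on the specific CAPTCHA the page serves.

from twocaptcha import TwoCaptcha

solver = TwoCaptcha('YOUR_2CAPTCHA_API_KEY')  # placeholder API key

# Submit the CAPTCHA parameters and block until a solved token comes back.
result = solver.recaptcha(
    sitekey='SITEKEY_FROM_THE_PAGE',          # page-specific placeholder
    url='https://example.com/page-with-captcha',
)
token = result['code']
# The token is then injected into the page (e.g. via page.evaluate)
# before the CAPTCHA form is submitted; the exact step depends on the page markup.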
Multi-accounting browsers help bypass website security systems by spoofing the digital fingerprint. With a multi-accounting browser you can create multiple browser profiles: virtual copies of the browser, isolated from each other, each with its own set of characteristics and settings, including cookies, browsing history, extensions, proxies, and fingerprint parameters. To a website's security systems, each profile looks like a separate user.
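In practice, each profile is started through the same local API endpoint the scraper above uses, and each one returns its own ws_endpoint, so several isolated sessions can run side by side. A sketch, assuming two profile UUIDs created beforehand in Octo Browser:

import asyncio
import httpx

# Hypothetical UUIDs of two profiles created in Octo Browser beforehand
PROFILE_UUIDS = ['FIRST_PROFILE_UUID', 'SECOND_PROFILE_UUID']

async def start_profiles():
    async with httpx.AsyncClient() as client:
        for uuid in PROFILE_UUIDS:
            response = await client.post(
                'http://127.0.0.1:58888/api/profiles/start',
                json={'uuid': uuid, 'headless': False, 'debug_port': True},
            )
            # Each profile gets its own CDP endpoint, so the sessions
            # remain fully isolated from each other.
            print(uuid, response.json().get('ws_endpoint'))

asyncio.run(start_profiles())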
All the documentation you need to get started with Octo Browser, including detailed API instructions, is available on the official Octo Browser documentation site.
Use Python to efficiently collect products, reviews, descriptions, and prices from Amazon. Writing the necessary code takes some time and effort, but the results are worth it. To avoid the attention of security systems, simulate natural user behavior, use third-party IP addresses, and regularly change your digital fingerprint. Specialized tools such as multi-accounting browsers and proxy servers let you rotate browser fingerprints and IP addresses to overcome restrictions and increase scraping speed.
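For example, natural behavior can be approximated with randomized scrolling and pauses between actions; a minimal sketch using standard Playwright calls (the delay and scroll ranges are arbitrary):

import asyncio
import random

async def human_pause(page):
    # Scroll a random distance and wait 1-3 seconds,
    # roughly imitating a user skimming the results.
    await page.mouse.wheel(0, random.randint(300, 900))
    await asyncio.sleep(random.uniform(1.0, 3.0))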