Scraping Google Search Results: A Tutorial
2026/4/27


Artur Hvalei
Technical Support Specialist, Octo Browser
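Scraping Google SERPs helps you understand which sites and content users actually see, which keywords drive traffic, and which snippet formats perform best. In this article, we review the data collection methods, compare them by accuracy and complexity, and highlight the best solutions for tasks ranging from basic monitoring to large-scale analysis. You will also find a ready-to-use scraping script.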
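Why scrape Google search results
Google is a global database of consumer demand and competitor activity. Analyzing search engine results pages (SERPs) yields key insights: how sites actually rank for keywords, competitors' titles and meta descriptions, the presence and format of rich snippets, and data from the "People Also Ask" block and search suggestions. This data helps companies and marketers:
Track rankings and visibility: analyze SEO performance and monitor progress over time.
Research competitors: understand their keywords and content strategy, and identify gaps in the market.
Discover niches and trends: find new keywords and queries to create relevant content.
Analyze advertising: study competitors' ads, headlines, copy, and strategies.
These insights are therefore most valuable to SEO specialists, marketers, analysts, business owners, and developers of online marketing tools.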
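Data collection tools and methods
1. Third-party SERP APIs (paid services)
These are specialized APIs that handle all the technical complexity of data collection. You send a request and receive structured JSON containing search results, ads, and other elements. The provider manages proxy rotation, solves CAPTCHAs, and renders JavaScript, delivering data that is ready to use.
Pros: easy to integrate, scalable, bans are handled by the provider, clean data structure.
Cons: high cost at scale (for example, Bright Data starts at $1 per 1,000 requests), vendor lock-in, processing latency.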
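To get a feel for how per-request pricing scales, here is a rough estimate. The rate of $1 per 1,000 requests is the example figure quoted above; the keyword count and the helper name `estimate_monthly_cost` are illustrative assumptions, and real pricing varies by provider and plan.

```javascript
// Rough monthly cost of rank tracking through a paid SERP API.
// Assumes flat pricing with no volume discounts (an assumption, not a quote).
function estimate_monthly_cost(keywords, checks_per_day, price_per_1000 = 1, days = 30) {
    const requests = keywords * checks_per_day * days;
    return { requests, cost_usd: (requests / 1000) * price_per_1000 };
}

// Hypothetical workload: 5,000 keywords checked once a day.
const estimate = estimate_monthly_cost(5000, 1);
console.log(`${estimate.requests} requests, about $${estimate.cost_usd.toFixed(2)} per month`);
// prints "150000 requests, about $150.00 per month"
```

Checking the same keywords several times a day, or across several locations, multiplies the request count accordingly.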
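2. Official Google API (Custom Search JSON API)
This is a legitimate way to access search data, built around embedding Google Search in your own site. It is fundamentally different, however: it does not simulate a real user search and does not return a "live" SERP with ads and dynamic elements. Results are often less up to date, and their structure differs as well.
Pros: legitimate, stable, easy to use, includes a free quota (100 requests per day).
Cons: it does not return real SERP data. The API provides structured results from a limited set of predefined sites, not the actual search page users see. It has quotas and limits, which makes it unsuitable for large-scale rank tracking or competitive analysis.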
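As a minimal sketch of what a Custom Search JSON API call looks like, the snippet below builds the request URL. The endpoint and the `key`, `cx`, `q`, `num`, and `start` parameters come from Google's public API; `YOUR_API_KEY` and `YOUR_ENGINE_ID` are placeholders, and the helper name `build_custom_search_url` is made up for this example.

```javascript
// Builds a Custom Search JSON API request URL.
// api_key and engine_id are placeholders you obtain from Google.
function build_custom_search_url(api_key, engine_id, query, start = 1) {
    const params = new URLSearchParams({
        key: api_key,
        cx: engine_id,        // ID of your Programmable Search Engine
        q: query,
        num: '10',            // the API returns at most 10 results per request
        start: String(start)  // 1-based index of the first result
    });
    return `https://www.googleapis.com/customsearch/v1?${params}`;
}

const request_url = build_custom_search_url('YOUR_API_KEY', 'YOUR_ENGINE_ID', 'arch linux');
console.log(request_url);
// prints "https://www.googleapis.com/customsearch/v1?key=YOUR_API_KEY&cx=YOUR_ENGINE_ID&q=arch+linux&num=10&start=1"

// With Node.js 18+ the request could then be sent with the built-in fetch:
// const data = await (await fetch(request_url)).json();
// data.items holds the results (title, link, snippet).
```

Because each request returns at most 10 results, paging through deeper positions consumes the daily quota quickly.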
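3. Direct HTTP requests (scraping)
This method imitates a standard browser request. Your script (Python, Node.js, and so on) sends a GET request to the Google Search URL and receives HTML, which then has to be parsed. To avoid detection, you need to use proxies and to emulate and rotate browser headers.
Pros: full control over the process, low cost (only a server and proxies are needed), high flexibility.
Cons: complex and fragile. Google actively blocks non-browser requests, so you have to keep solving CAPTCHAs and rotating fingerprints. Even advanced setups with TLS and header emulation can fail. Any change to Google's layout can break your parser.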
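The request-building part of this approach can be sketched as follows. The header sets are hypothetical examples rather than a vetted list, and `build_search_request` is a name invented for this sketch; in practice each request would also go out through a different proxy.

```javascript
// Hypothetical pool of browser-like header sets. Real values should be copied
// from current browser releases and kept internally consistent.
const header_pool = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9'
    },
    {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
        'Accept-Language': 'en-GB,en;q=0.8'
    }
];

// Returns the URL and headers for one search request, rotating header sets
// by request index.
function build_search_request(query, request_index) {
    const params = new URLSearchParams({ q: query, hl: 'en' });
    return {
        url: `https://www.google.com/search?${params}`,
        headers: header_pool[request_index % header_pool.length]
    };
}

const request = build_search_request('arch linux', 0);
console.log(request.url); // prints "https://www.google.com/search?q=arch+linux&hl=en"

// Sending it with axios (the HTTP client used by the script in this article):
// const response = await axios.get(request.url, { headers: request.headers });
```

Header rotation alone rarely suffices, which is why this method carries the highest ban risk of the four.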
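4. Browser automation (Puppeteer, Playwright, Selenium)
This method imitates real user behavior: opening a browser, typing a query, clicking, and scrolling. It closely mimics human interaction but requires more computing resources. Libraries such as Puppeteer control a Chrome instance and collect data from dynamic pages.
Pros: can get past complex protection, executes JavaScript, gives the most accurate data (you scrape exactly what the user sees), flexible and powerful.
Cons: high resource consumption (CPU, memory), slower than direct HTTP requests, complex to configure and maintain for large-scale projects.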
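Why proxies and anti-detect browsers are essential
Google actively protects its data and blocks automated requests. The two main obstacles are CAPTCHAs and IP-based bans, which are usually triggered when requests exceed a limit.
A proxy acts as an intermediary that hides your real IP address. The core strategy is proxy rotation: changing the IP regularly to imitate traffic from different users and avoid triggering anti-bot systems.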
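A minimal round-robin rotation sketch is shown below. The `ProxyRotator` class and the proxy addresses are hypothetical; a production setup would usually also track cooldowns and per-proxy request counts.

```javascript
// Cycles through a proxy list and lets the caller drop proxies that got banned.
class ProxyRotator {
    constructor(proxies) {
        if (proxies.length === 0) throw new Error('Proxy list is empty');
        this.proxies = [...proxies];
        this.index = 0;
    }

    // Returns the next proxy in round-robin order.
    next() {
        if (this.proxies.length === 0) throw new Error('No working proxies left');
        const proxy = this.proxies[this.index];
        this.index = (this.index + 1) % this.proxies.length;
        return proxy;
    }

    // Removes a proxy that hit a CAPTCHA or an IP ban, keeping the order intact.
    remove(proxy) {
        const position = this.proxies.indexOf(proxy);
        if (position === -1) return;
        this.proxies.splice(position, 1);
        if (position < this.index) this.index--;
        this.index = this.proxies.length > 0 ? this.index % this.proxies.length : 0;
    }
}

// Placeholder proxies, in the same URL format the script in this article uses.
const rotator = new ProxyRotator([
    'socks5://login:password@127.0.0.1:50000',
    'socks5://login:password@127.0.0.1:50001',
    'socks5://login:password@127.0.0.1:50002'
]);
const first_proxy = rotator.next();  // ...:50000
const second_proxy = rotator.next(); // ...:50001
rotator.remove(second_proxy);        // pretend this one got banned
console.log(rotator.next());         // prints "socks5://login:password@127.0.0.1:50002"
```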
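Anti-detect browsers solve a more advanced problem: masking the digital fingerprint. They let you spoof environment parameters such as the User-Agent, screen resolution, media devices, and GPU settings. Each new profile gets a realistic fingerprint, which is essential for getting past systems that analyze device fingerprints. Combining an anti-detect browser with high-quality proxies lets you create thousands of unique "users" and collect data at scale.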
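Octo Browser capabilities for Google SERP scraping
Octo Browser includes an API that lets you fully automate the data collection process. Octo also provides detailed API documentation with request examples.
The documentation contains code snippets for integrating Puppeteer, Playwright, and Selenium, which control the browser over the Chrome DevTools Protocol (CDP).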
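Practical tips
Study the official API documentation carefully.
Review the FAQ on API usage.
Read the detailed article on using the Octo API.
API requests in Octo Browser are rate-limited according to your subscription tier, although the limits can be raised. Use a function that checks the API limits reported in the response headers. Ignoring HTTP 429 errors can extend the duration of a block. If you run automation from several devices under one account, implement centralized request tracking (for example, with Redis).
Do not use unpatched versions of automation libraries, because they contain leaks that can be detected. For Puppeteer/Playwright, use the rebrowser patches. For Selenium, use undetected-chromedriver.
Use functions and libraries that imitate human behavior as closely as possible: mouse clicks, hovering, cursor movement, typing, scrolling, navigation flows, and random actions.
Use local caching for profiles to reduce proxy traffic. You can do this by passing `"local_cache": true` when creating a profile, or by pointing `--disk-cache-dir` at a shared cache directory, for example `flags: ["--disk-cache-dir=C:/Cache"]`.
Limit image loading in the profile settings to save proxy traffic. Set `"images_load_limit": 10240` when creating a profile to load only images no larger than 10,240 bytes.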
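Comparison of scraping methods
| Method | Cost | Complexity | Ban risk | Data quality |
|---|---|---|---|---|
| Paid SERP API | High (from $1 per 1,000 requests) | Low | Very low | High |
| Official API | Low / free | Low | None | Low (not real SERP data) |
| HTTP requests | Medium (proxies required) | High | Very high | High |
| Automation with an anti-detect browser | Medium (subscription and proxies required) | Medium | Very low | Highest |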
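A ready-made script for scraping Google SERPs
Below is an example scraping script that works with the Octo Browser API. You can use the script, or parts of it, as a starting point for a full project and adapt it to your needs.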
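1. Download and install VS Code.
2. Download and install Node.js.
3. Create a folder in a convenient location and name it, for example, `octo_scraper`. Open this folder in VS Code.
4. Create a `.js` file. It is best to name it after what it does, for example `google_scraping.js`. Paste the script code into the file.
5. In the `config` variable in the code, add your proxies to the `proxies` array.
6. In the same place, add your search queries to the `google_search_queries` array. In this example script, the number of queries must be greater than or equal to the number of proxies. You can easily modify the scraping logic to suit your needs.

Note: every array element must be enclosed in quotes, and elements are separated by commas.

7. Open the terminal and run `npm i rebrowser-puppeteer axios fkill` to install the Node.js dependencies.
8. If VS Code shows an error, open Windows PowerShell as administrator, enter `Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned`, and confirm. Then repeat the previous step.
9. Launch Octo Browser.
10. Run the program in VS Code (Ctrl/Cmd + F5) and wait for the script to finish.
11. The scraper creates a one-time profile for each proxy you added and runs the assigned queries in sequence. The script imitates real user behavior to get past Google's anti-fraud systems.
12. You can monitor progress in the debug console. If a CAPTCHA appears, the script closes that profile and moves on to the next one.
13. Search results are saved to the `search_results` folder in the project directory.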
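How the queries from the steps above end up split between proxies can be sketched as follows. The function mirrors the `distribute_queries` logic of the script in compact form; the proxy addresses and queries are placeholders.

```javascript
// Example config: two placeholder proxies and three queries.
const config = {
    proxies: [
        "socks5://login:password@127.0.0.1:50000",
        "socks5://login:password@127.0.0.1:50001"
    ],
    google_search_queries: ["nodejs", "arch linux", "puppeteer"]
};

// Splits the queries into one contiguous batch per proxy; the first
// (queries % proxies) batches receive one extra query.
function distribute_queries(queries, proxy_count) {
    const base = Math.floor(queries.length / proxy_count);
    const remainder = queries.length % proxy_count;
    const batches = [];
    let start = 0;
    for (let i = 0; i < proxy_count; i++) {
        const count = base + (i < remainder ? 1 : 0);
        batches.push(queries.slice(start, start + count));
        start += count;
    }
    return batches;
}

const batches = distribute_queries(config.google_search_queries, config.proxies.length);
console.log(batches);
// Profile 1 gets ["nodejs", "arch linux"], profile 2 gets ["puppeteer"].
```

With fewer queries than proxies, the trailing batches come out empty and the script skips those profiles, which is why the query count should be at least the proxy count.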

Script code
const axios = require('axios');
const puppeteer = require('rebrowser-puppeteer');
const fs = require('fs').promises;
const path = require('path');

const config = {
    octo_local_api_base_url: `http://localhost:58888/api/profiles`, //change port if you don't use default 58888
    headless_mode: false,
    proxies: [
        "socks5://login:password@127.0.0.1:50000", //paste your proxies
        "socks5://login:password@127.0.0.1:50000"
    ],
    google_search_queries: ["nodejs", "sidwudraq", "arch linux"] //change queries
}

// ============= HELPER FUNCTIONS =============
function random_range(min, max) {
    return min + Math.random() * (max - min);
}

async function sleep(seconds) {
    return new Promise(resolve => setTimeout(resolve, seconds * 1000));
}

async function human_delay(min_ms = 50, max_ms = 200) {
    const mu = Math.log((min_ms + max_ms) / 2);
    const sigma = random_range(0.3, 0.6);
    let delay = Math.exp(mu + sigma * (Math.random() - 0.5) * 2);
    delay = Math.min(max_ms, Math.max(min_ms, delay));
    await new Promise(resolve => setTimeout(resolve, delay));
}

async function kill_browser(pid) {
    const { default: fkill } = await import('fkill');
    await fkill(pid, { force: true });
    console.log(`✅ Process with PID ${pid} successfully stopped.`);
}

// ============= BEZIER CURVES FOR HUMAN-LIKE MOVEMENT =============
function bezier_curve(t, p0, p1, p2, p3) {
    const mt = 1 - t;
    const mt2 = mt * mt;
    const t2 = t * t;
    const x = mt2 * mt * p0.x + 3 * mt2 * t * p1.x + 3 * mt * t2 * p2.x + t2 * t * p3.x;
    const y = mt2 * mt * p0.y + 3 * mt2 * t * p1.y + 3 * mt * t2 * p2.y + t2 * t * p3.y;
    return { x, y };
}

function generate_bezier_points(start, end) {
    const distance = Math.hypot(end.x - start.x, end.y - start.y);
    const angle = Math.atan2(end.y - start.y, end.x - start.x);
    const deviation = random_range(distance * 0.2, distance * 0.5);
    const angle_variation = random_range(-Math.PI / 3, Math.PI / 3);
    const p1 = {
        x: start.x + Math.cos(angle + angle_variation) * deviation,
        y: start.y + Math.sin(angle + angle_variation) * deviation
    };
    const p2 = {
        x: end.x - Math.cos(angle - angle_variation) * deviation,
        y: end.y - Math.sin(angle - angle_variation) * deviation
    };
    return [start, p1, p2, end];
}

function generate_trajectory(start, end, steps = null) {
    const distance = Math.hypot(end.x - start.x, end.y - start.y);
    const actual_steps = steps || Math.max(20, Math.min(100, Math.floor(distance / 3)));
    const bezier_points = generate_bezier_points(start, end);
    const trajectory = [];
    for (let i = 0; i <= actual_steps; i++) {
        const t = i / actual_steps;
        const eased_t = Math.pow(t, 1 + Math.random() * 0.3);
        const point = bezier_curve(eased_t, ...bezier_points);
        const jitter = {
            x: (Math.random() - 0.5) * random_range(0.5, 2),
            y: (Math.random() - 0.5) * random_range(0.5, 2)
        };
        trajectory.push({
            x: Math.round(point.x + jitter.x),
            y: Math.round(point.y + jitter.y)
        });
    }
    return trajectory;
}

// ============= HUMAN-LIKE CLICK =============
async function human_click(page, selector_or_element, options = {}) {
    const {
        move_speed = 1.0,
        random_overshoot = true,
        click_delay = null,
        force_visible = true
    } = options;
    const element = typeof selector_or_element === 'string'
        ? await page.$(selector_or_element)
        : selector_or_element;
    if (!element) {
        throw new Error(`Element not found: ${selector_or_element}`);
    }
    if (force_visible) {
        await element.scrollIntoView();
        await human_delay(100, 300);
    }
    const current_mouse = await page.evaluate(() => ({
        x: window.mouseX || window.innerWidth / 2,
        y: window.mouseY || window.innerHeight / 2
    }));
    const box = await element.boundingBox();
    if (!box) throw new Error('Could not get element coordinates');
    const target = {
        x: box.x + random_range(box.width * 0.2, box.width * 0.8),
        y: box.y + random_range(box.height * 0.2, box.height * 0.8)
    };
    if (random_overshoot && Math.random() < 0.3) {
        const overshoot_x = (Math.random() - 0.5) * random_range(10, 30);
        const overshoot_y = (Math.random() - 0.5) * random_range(10, 30);
        const overshoot_target = {
            x: target.x + overshoot_x,
            y: target.y + overshoot_y
        };
        const overshoot_trajectory = generate_trajectory(current_mouse, overshoot_target);
        for (const point of overshoot_trajectory) {
            await page.mouse.move(point.x, point.y);
            await human_delay(1, 3);
        }
        const return_trajectory = generate_trajectory(overshoot_target, target);
        for (const point of return_trajectory) {
            await page.mouse.move(point.x, point.y);
            await human_delay(1, 3);
        }
    } else {
        const trajectory = generate_trajectory(current_mouse, target);
        for (const point of trajectory) {
            await page.mouse.move(point.x, point.y);
            const delay = Math.max(1, Math.min(5, 10 / move_speed));
            await human_delay(delay * 0.5, delay * 1.5);
        }
    }
    const final_delay = click_delay !== null ? click_delay : random_range(80, 250);
    await human_delay(final_delay * 0.8, final_delay * 1.2);
    if (Math.random() < 0.15) {
        const micro_offset_x = (Math.random() - 0.5) * random_range(1, 4);
        const micro_offset_y = (Math.random() - 0.5) * random_range(1, 4);
        await page.mouse.move(target.x + micro_offset_x, target.y + micro_offset_y);
        await human_delay(10, 30);
    }
    await page.mouse.down();
    await human_delay(random_range(50, 150));
    if (Math.random() < 0.2) {
        await page.mouse.move(
            target.x + (Math.random() - 0.5) * 2,
            target.y + (Math.random() - 0.5) * 2
        );
    }
    await page.mouse.up();
    await human_delay(50, 150);
    await page.evaluate(({ x, y }) => {
        window.mouseX = x;
        window.mouseY = y;
    }, target);
    return { success: true, position: target };
}

// ============= HUMAN-LIKE TEXT INPUT =============
async function human_type(page, selector, text, options = {}) {
    const {
        typing_speed = null,
        random_mistakes = false,
        backspace_fix = false
    } = options;
    const element = typeof selector === 'string' ? await page.$(selector) : selector;
    if (!element) {
        throw new Error(`Element not found: ${selector}`);
    }
    await human_click(page, element, { pre_hover: true });
    // Clear the field
    await page.keyboard.down('Control');
    await page.keyboard.press('a');
    await page.keyboard.up('Control');
    await page.keyboard.press('Backspace');
    await human_delay(100, 200);
    for (let i = 0; i < text.length; i++) {
        const char = text[i];
        let delay;
        if (typing_speed) {
            delay = typing_speed;
        } else {
            const base_delay = random_range(50, 200);
            const is_space = char === ' ';
            delay = is_space ? base_delay * 2 : base_delay;
        }
        if (random_mistakes && Math.random() < 0.02) {
            const wrong_char = String.fromCharCode(
                char.charCodeAt(0) + (Math.random() > 0.5 ? 1 : -1)
            );
            await page.keyboard.type(wrong_char, { delay: delay * 0.5 });
            await human_delay(100, 200);
            if (backspace_fix) {
                await page.keyboard.press('Backspace');
                await human_delay(50, 100);
            } else {
                continue;
            }
        }
        await page.keyboard.type(char, { delay: delay });
    }
    await human_delay(100, 300);
    return true;
}

// ============= HUMAN-LIKE SCROLL =============
async function human_scroll(page, options = {}) {
    const {
        scrolls = null,
        min_scroll = 300,
        max_scroll = 800
    } = options;
    const num_scrolls = scrolls || Math.floor(random_range(3, 8));
    for (let i = 0; i < num_scrolls; i++) {
        const scroll_distance = random_range(min_scroll, max_scroll);
        await page.evaluate((distance) => {
            window.scrollBy({ top: distance, behavior: 'smooth' });
        }, scroll_distance);
        await human_delay(800, 2000);
        if (Math.random() < 0.2) {
            const back_distance = random_range(100, 300);
            await page.evaluate((distance) => {
                window.scrollBy({ top: -distance, behavior: 'smooth' });
            }, back_distance);
            await human_delay(500, 1000);
        }
    }
}

// ============= DISTRIBUTE QUERIES AMONG PROFILES =============
function distribute_queries(queries, numProxies) {
    const total = queries.length;
    const baseCount = Math.floor(total / numProxies);
    const remainder = total % numProxies;
    const batches = [];
    let start = 0;
    for (let i = 0; i < numProxies; i++) {
        const count = baseCount + (i < remainder ? 1 : 0);
        const batch = queries.slice(start, start + count);
        batches.push(batch);
        start += count;
    }
    return batches;
}

// ============= PARSE GOOGLE RESULTS =============
async function parse_search_results(page, query) {
    return await page.evaluate((query) => {
        const results = [];
        // Find all result containers
        const organic_results = document.querySelectorAll('div.tF2Cxc');
        console.log(`Found ${organic_results.length} result containers`);
        organic_results.forEach((result, index) => {
            try {
                // Title
                const title_element = result.querySelector('h3.LC20lb.MBeuO.DKV0Md');
                const title = title_element ? title_element.innerText : '';
                // Link
                let link_element = result.querySelector('a');
                let link = link_element ? link_element.href : '';
                // Clean Google redirect
                if (link && link.includes('/url?q=')) {
                    const url_match = link.match(/\/url\?q=([^&]+)/);
                    if (url_match) {
                        link = decodeURIComponent(url_match[1]);
                    }
                }
                // Description
                let desc_element = result.querySelector('div.VwiC3b.yXK7lf.p4wth.r025kc.Hdw6tb');
                let description = desc_element ? desc_element.innerText : '';
                // Fallback selector
                if (!description) {
                    const fallback_desc = result.querySelector('div.VwiC3b');
                    description = fallback_desc ? fallback_desc.innerText : '';
                }
                if (title && title.trim() && link) {
                    results.push({
                        position: results.length + 1,
                        title: title.trim(),
                        link: link,
                        description: description.trim().substring(0, 500)
                    });
                }
            } catch (error) {
                console.error(`Error parsing result ${index}:`, error);
            }
        });
        console.log(`Successfully parsed ${results.length} results`);
        return {
            query: query,
            timestamp: new Date().toISOString(),
            total_results: results.length,
            results: results
        };
    }, query);
}

// ============= SAVE RESULTS TO FILE =============
async function save_results_to_file(query, data, is_appending = false) {
    const filename = `${query.replace(/[^a-z0-9]/gi, '_').toLowerCase()}_results.txt`;
    const filepath = path.join(__dirname, 'search_results', filename);
    // Create directory if needed
    await fs.mkdir(path.join(__dirname, 'search_results'), { recursive: true });
    let content = '';
    if (!is_appending) {
        content += `=== GOOGLE SEARCH RESULTS ===\n`;
        content += `Query: ${data.query}\n`;
        content += `Time: ${data.timestamp}\n`;
        content += `Total results: ${data.total_results}\n`;
        content += `${'='.repeat(80)}\n\n`;
    }
    for (const result of data.results) {
        content += `${result.position}. ${result.title}\n`;
        content += ` URL: ${result.link}\n`;
        content += ` Description: ${result.description.substring(0, 200)}...\n`;
        content += ` ${'-'.repeat(80)}\n`;
    }
    content += `\n📄 Page saved: ${new Date().toISOString()}\n`;
    content += `${'='.repeat(80)}\n\n`;
    await fs.writeFile(filepath, content, { flag: is_appending ? 'a' : 'w' });
    console.log(`✅ Results saved to: ${filepath}`);
    return filepath;
}

// ============= OPEN RANDOM RESULT PAGE =============
async function open_random_result(page, results) {
    if (!results || results.length === 0) {
        console.log('No results to open');
        return false;
    }
    // Choose a random result (usually not the first)
    let result_index = 0;
    if (results.length > 1) {
        result_index = Math.random() < 0.7
            ? Math.floor(random_range(1, Math.min(5, results.length)))
            : Math.floor(random_range(0, results.length));
    }
    const selected_result = results[result_index];
    console.log(`Opening result ${result_index + 1}: ${selected_result.title.substring(0, 50)}...`);
    try {
        // Check for captcha before opening
        const has_captcha = await check_for_captcha(page);
        if (has_captcha) {
            console.log('🚫 Captcha detected, not opening result');
            return false;
        }
        // Open in a new tab
        const new_page = await page.browser().newPage();
        await new_page.goto(selected_result.link, {
            waitUntil: 'domcontentloaded',
            timeout: 20000
        });
        await human_delay(2000, 4000);
        // Check for captcha on the opened page
        const page_has_captcha = await check_for_captcha(new_page);
        if (page_has_captcha) {
            console.log('🚫 Captcha detected on opened page');
            await new_page.close();
            return false;
        }
        // Scroll on the opened page
        await human_scroll(new_page, { scrolls: random_range(2, 5) });
        await human_delay(1500, 3000);
        // Close the tab
        await new_page.close();
        console.log(`✅ Page viewed and closed`);
        return true;
    } catch (error) {
        console.log(`❌ Error opening page: ${error.message}`);
        return false;
    }
}

// ============= CAPTCHA CHECK =============
async function check_for_captcha(page) {
    const captcha_selectors = [
        '#captcha-form',
        '.g-recaptcha',
        'iframe[src*="recaptcha"]',
        'form[action*="captcha"]',
        '#captcha',
        '.captcha',
        'div[jsname="Jai8Rc"]',
        'form[action*="sorry"]'
    ];
    for (const selector of captcha_selectors) {
        const element = await page.$(selector);
        if (element) return true;
    }
    const current_url = page.url();
    if (current_url.includes('sorry') || current_url.includes('captcha')) {
        return true;
    }
    const page_text = await page.evaluate(() => document.body.innerText);
    const captcha_keywords = ['captcha', 'robot', 'verify', 'unusual traffic', 'confirm', 'not a robot'];
    for (const keyword of captcha_keywords) {
        if (page_text.toLowerCase().includes(keyword)) {
            return true;
        }
    }
    return false;
}

// ============= MAIN SEARCH FUNCTION =============
async function google_search_human(page, query, results_data, retry_count = 0) {
    const max_retries = 2;
    console.log(`🔍 Searching: ${query}${retry_count > 0 ? ` (attempt ${retry_count + 1})` : ''}`);
    try {
        // Go to Google homepage
        await page.goto('https://www.google.com', {
            waitUntil: 'domcontentloaded',
            timeout: 30000
        });
        await human_delay(1000, 2000);
        // Check for captcha
        let has_captcha = await check_for_captcha(page);
        if (has_captcha) {
            console.log('🚫 Captcha detected!');
            return { error: 'captcha', query: query };
        }
        // Accept cookies if present
        try {
            const cookie_button = await page.$('#L2AGLb');
            if (cookie_button) {
                await human_click(page, cookie_button);
                console.log('✅ Cookies accepted');
                await human_delay(500, 1000);
            }
        } catch (error) {
            console.log('No cookie button');
        }
        // Enter search query
        const search_input = await page.$('textarea[name="q"], input[name="q"]');
        if (!search_input) {
            throw new Error('Search input not found');
        }
        await human_type(page, search_input, query, {
            random_mistakes: true,
            backspace_fix: true
        });
        await human_delay(500, 1000);
        // Check for captcha before submitting
        has_captcha = await check_for_captcha(page);
        if (has_captcha) {
            console.log('🚫 Captcha detected before submission!');
            return { error: 'captcha', query: query };
        }
        // Press Enter
        console.log('📤 Submitting query...');
        await Promise.all([
            page.waitForNavigation({ waitUntil: 'domcontentloaded', timeout: 15000 }).catch(e => {
                console.log(`⚠️ Navigation warning: ${e.message}`);
                return null;
            }),
            page.keyboard.press('Enter'),
            human_delay(500, 1000)
        ]);
        // Check for captcha after search
        has_captcha = await check_for_captcha(page);
        if (has_captcha) {
            console.log('🚫 Captcha detected after search!');
            return { error: 'captcha', query: query };
        }
        console.log('⏳ Waiting for results to load...');
        // Wait for results to appear
        try {
            await page.waitForSelector('div.tF2Cxc', { timeout: 15000, visible: true });
            console.log('✅ Results loaded');
        } catch (error) {
            console.log('⚠️ Results not found, continuing...');
        }
        await human_delay(1500, 2500);
        // Scroll through results
        console.log('📜 Scrolling through results...');
        await human_scroll(page, { scrolls: random_range(4, 8) });
        // Parse results
        console.log('📊 Parsing results...');
        const parsed_results = await parse_search_results(page, query);
        if (parsed_results.results.length === 0 && retry_count < max_retries) {
            console.log('⚠️ No results found, retrying...');
            await human_delay(2000, 3000);
            return await google_search_human(page, query, results_data, retry_count + 1);
        }
        // Save results
        const is_appending = results_data.has_results;
        await save_results_to_file(query, parsed_results, is_appending);
        results_data.has_results = true;
        results_data.all_results.push(...parsed_results.results);
        // Open 1-2 random result pages
        if (parsed_results.results.length > 0) {
            const pages_to_open = Math.floor(random_range(1, Math.min(3, parsed_results.results.length)));
            console.log(`📖 Opening ${pages_to_open} result pages...`);
            for (let i = 0; i < pages_to_open; i++) {
                await open_random_result(page, parsed_results.results);
                await human_delay(1000, 2000);
                // Return to results page
                const current_url = page.url();
                if (!current_url.includes('google.com/search')) {
                    try {
                        await page.goBack({ waitUntil: 'domcontentloaded', timeout: 10000 });
                        await human_delay(1000, 1500);
                    } catch (error) {
                        console.log('⚠️ Could not go back');
                        await page.reload({ waitUntil: 'domcontentloaded' });
                    }
                }
            }
        }
        console.log(`✅ Search "${query}" completed, found ${parsed_results.results.length} results`);
        return { success: true, query: query, results: parsed_results.results };
    } catch (error) {
        console.error(`❌ Error during search "${query}": ${error.message}`);
        const has_captcha = await check_for_captcha(page).catch(() => false);
        if (has_captcha) {
            console.log('🚫 Error caused by captcha');
            return { error: 'captcha', query: query };
        }
        if (retry_count < max_retries) {
            console.log(`🔄 Retrying in 5 seconds...`);
            await sleep(5);
            return await google_search_human(page, query, results_data, retry_count + 1);
        }
        return { error: 'timeout', query: query };
    }
}

// ============= OCTO FUNCTIONS =============
async function check_limits(response) {
    function parse_int_safe(value) {
        const parsed = parseInt(value, 10);
        return isNaN(parsed) ? 0 : parsed;
    }
    const ratelimit_header = response.headers.ratelimit;
    if (!ratelimit_header) {
        console.warn('No ratelimit header found!');
        return;
    }
    const limit_entries = ratelimit_header.split(',').map(entry => entry.trim());
    for (const entry of limit_entries) {
        const name_match = entry.match(/^([^;]+)/);
        const r_match = entry.match(/;r=(\d+)/);
        const t_match = entry.match(/;t=(\d+)/);
        if (!r_match || !t_match) {
            console.warn(`Invalid ratelimit format: ${entry}`);
            continue;
        }
        const limit_name = name_match ? name_match[1] : 'unknown_limit';
        const remaining_quantity = parse_int_safe(r_match[1]);
        const window_seconds = parse_int_safe(t_match[1]);
        if (remaining_quantity < 5) {
            const wait_time = window_seconds + 1;
            console.log(`Waiting ${wait_time} seconds due to ${limit_name} limit`);
            await sleep(wait_time);
        }
    }
}

function parse_proxy(proxy) {
    const regex = /^(\w+):\/\/(?:([^:]+):([^@]+)@)?([^:]+):(\d+)$/;
    const match = proxy.match(regex);
    if (!match) return null;
    const [, type, login, password, host, port] = match;
    return {
        type,
        host,
        port,
        login: login || null,
        password: password || null
    };
}

async function octo_one_time_profile(config, proxy) {
    const one_time_profile_config = {
        method: "post",
        url: `${config.octo_local_api_base_url}/one_time/start`,
        headers: { 'Content-Type': 'application/json' },
        data: {
            "profile_data": {
                "fingerprint": {
                    "os": Math.random() < 0.5 ? "win" : "mac"
                },
                "proxy": proxy,
                "images_load_limit": 10240,
            },
            "headless": config.headless_mode,
            "debug_port": true,
            "timeout": 60
        }
    }
    const response = await axios(one_time_profile_config);
    await check_limits(response);
    return response;
}

// ============= MAIN PROCESS =============
(async () => {
    console.log('🚀 Starting Google Scraper with Human-like Behavior...');
    console.log('🛡️ Captcha detection enabled - profiles with captcha will be skipped\n');
    const proxy_count = config.proxies.length;
    const all_queries = config.google_search_queries;
    const query_batches = distribute_queries(all_queries, proxy_count);
    console.log(`Total proxies: ${proxy_count}`);
    console.log(`Total search queries: ${all_queries.length}`);
    console.log('Query distribution:');
    query_batches.forEach((batch, idx) => {
        console.log(` Profile ${idx + 1}: ${batch.length} queries - ${batch.join(', ')}`);
    });
    console.log('');
    let successful_profiles = 0;
    let skipped_profiles = 0;
    let failed_profiles = 0;
    for (let i = 0; i < proxy_count; i++) {
        console.log(`\n${'='.repeat(80)}`);
        console.log(`📋 Processing profile ${i + 1}/${proxy_count}`);
        console.log(`${'='.repeat(80)}`);
        const queries_for_this_profile = query_batches[i];
        if (queries_for_this_profile.length === 0) {
            console.log(`⚠️ No queries assigned to profile ${i + 1}, skipping.`);
            continue;
        }
        let parsed_proxy = parse_proxy(config.proxies[i]);
        if (!parsed_proxy) {
            console.error(`❌ Failed to parse proxy: ${config.proxies[i]}`);
            failed_profiles++;
            continue;
        }
        console.log(`🔧 Creating and starting One Time Profile with proxy: ${parsed_proxy.host}:${parsed_proxy.port}`);
        let ws_endpoint;
        try {
            ws_endpoint = await octo_one_time_profile(config, parsed_proxy);
        } catch (error) {
            console.error(`❌ Failed to create or start profile: ${error.message}`);
            failed_profiles++;
            continue;
        }
        if (!ws_endpoint || !ws_endpoint.data.ws_endpoint || !ws_endpoint.data.uuid) {
            console.error('❌ Failed to create or start profile');
            failed_profiles++;
            continue;
        }
        console.log(`✅ Profile created and started: ${ws_endpoint.data.uuid}`);
        console.log(`🌐 Connecting to browser`);
        let browser;
        try {
            browser = await puppeteer.connect({
                browserWSEndpoint: ws_endpoint.data.ws_endpoint,
                defaultViewport: null
            });
        } catch (error) {
            console.error(`❌ Failed to connect to browser: ${error.message}`);
            await kill_browser(ws_endpoint.data.browser_pid);
            continue;
        }
        const page = await browser.newPage();
        const results_data = { has_results: false, all_results: [] };
        let captcha_detected = false;
        // Execute only the queries assigned to this profile
        for (let j = 0; j < queries_for_this_profile.length; j++) {
            const query = queries_for_this_profile[j];
            try {
                const search_result = await google_search_human(page, query, results_data);
                if (search_result.error === 'captcha') {
                    console.log(`\n🚨 CAPTCHA DETECTED! Skipping profile ${ws_endpoint.data.uuid}`);
                    captcha_detected = true;
                    break;
                }
                if (j < queries_for_this_profile.length - 1 && !captcha_detected) {
                    const delay_between = random_range(5, 10);
                    console.log(`\n⏰ Waiting ${delay_between.toFixed(1)} seconds before next search...`);
                    await sleep(delay_between);
                }
            } catch (error) {
                console.error(`❌ Error during search "${query}": ${error.message}`);
            }
        }
        console.log(`🛑 Stopping profile...`);
        await kill_browser(ws_endpoint.data.browser_pid);
        if (captcha_detected) {
            console.log(`⏭️ Profile ${ws_endpoint.data.uuid} skipped due to captcha`);
            skipped_profiles++;
        } else if (results_data.all_results.length > 0) {
            const summary_filename = `summary_${ws_endpoint.data.uuid}_${Date.now()}.txt`;
            const summary_path = path.join(__dirname, 'search_results', summary_filename);
            let summary_content = `=== SEARCH SUMMARY ===\n`;
            summary_content += `Profile: ${ws_endpoint.data.uuid}\n`;
            summary_content += `Proxy: ${parsed_proxy.host}:${parsed_proxy.port}\n`;
            summary_content += `Queries executed: ${queries_for_this_profile.length}\n`;
            summary_content += `Queries: ${queries_for_this_profile.join(', ')}\n`;
            summary_content += `Total results collected: ${results_data.all_results.length}\n`;
            summary_content += `Time: ${new Date().toISOString()}\n`;
            summary_content += `${'='.repeat(80)}\n\n`;
            await fs.writeFile(summary_path, summary_content);
            console.log(`\n📊 Summary saved: ${summary_path}`);
            successful_profiles++;
        } else {
            console.log(`⚠️ Profile ${ws_endpoint.data.uuid} finished without results`);
            failed_profiles++;
        }
        console.log(`✅ Profile ${i + 1} completed`);
        if (i < proxy_count - 1) {
            const delay_between = random_range(10, 20);
            console.log(`\n⏰ Waiting ${delay_between.toFixed(1)} seconds before next profile...`);
            await sleep(delay_between);
        }
    }
    console.log(`\n${'='.repeat(80)}`);
    console.log(`📊 FINAL STATISTICS:`);
    console.log(`${'='.repeat(80)}`);
    console.log(`✅ Successful profiles: ${successful_profiles}`);
    console.log(`⏭️ Skipped due to captcha: ${skipped_profiles}`);
    console.log(`❌ Failed profiles: ${failed_profiles}`);
    console.log(`📁 All results saved in "search_results" folder`);
    console.log(`\n🎉 Google Scraper finished!`);
})();
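The `check_limits` function in the script pauses whenever a rate limit is close to exhausted. As a standalone sketch of that idea, the helpers below parse the same `name;r=<remaining>;t=<window seconds>` entries that the script reads from the `ratelimit` response header. That format is inferred from the script itself, the sample header value is made up, and the helper names are hypothetical, so check the Octo API documentation for the exact header layout.

```javascript
// Parses a rate-limit header into a list of { name, remaining, window_seconds }.
// Entries without both r= and t= fields are ignored.
function parse_ratelimit_header(header_value) {
    if (!header_value) return [];
    return header_value
        .split(',')
        .map(entry => entry.trim())
        .flatMap(entry => {
            const remaining = entry.match(/;r=(\d+)/);
            const window_seconds = entry.match(/;t=(\d+)/);
            if (!remaining || !window_seconds) return [];
            return [{
                name: entry.split(';')[0],
                remaining: Number(remaining[1]),
                window_seconds: Number(window_seconds[1])
            }];
        });
}

// Returns how long to pause before the next API call: the longest window among
// nearly exhausted limits plus one second, or 0 if nothing is close to the cap.
function seconds_to_wait(limits, min_remaining = 5) {
    const exhausted = limits.filter(limit => limit.remaining < min_remaining);
    if (exhausted.length === 0) return 0;
    return Math.max(...exhausted.map(limit => limit.window_seconds)) + 1;
}

// Made-up header value for illustration only.
const sample_limits = parse_ratelimit_header('default;r=3;t=12, burst;r=40;t=60');
console.log(seconds_to_wait(sample_limits)); // prints 13
```

Keeping the parsing separate from the waiting makes it easier to feed the numbers into a shared counter, such as the centralized request tracking mentioned in the practical tips.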
${result.title}\n`; content += ` URL: ${result.link}\n`; content += ` Description: ${result.description.substring(0, 200)}...\n`; content += ` ${'-'.repeat(80)}\n`; } content += `\n📄 Page saved: ${new Date().toISOString()}\n`; content += `${'='.repeat(80)}\n\n`; await fs.writeFile(filepath, content, { flag: is_appending ? 'a' : 'w' }); console.log(`✅ Results saved to: ${filepath}`); return filepath; } // ============= OPEN RANDOM RESULT PAGE ============= async function open_random_result(page, results) { if (!results || results.length === 0) { console.log('No results to open'); return false; } // Choose a random result (usually not the first) let result_index = 0; if (results.length > 1) { result_index = Math.random() < 0.7 ? Math.floor(random_range(1, Math.min(5, results.length))) : Math.floor(random_range(0, results.length)); } const selected_result = results[result_index]; console.log(`Opening result ${result_index + 1}: ${selected_result.title.substring(0, 50)}...`); try { // Check for captcha before opening const has_captcha = await check_for_captcha(page); if (has_captcha) { console.log('🚫 Captcha detected, not opening result'); return false; } // Open in a new tab const new_page = await page.browser().newPage(); await new_page.goto(selected_result.link, { waitUntil: 'domcontentloaded', timeout: 20000 }); await human_delay(2000, 4000); // Check for captcha on the opened page const page_has_captcha = await check_for_captcha(new_page); if (page_has_captcha) { console.log('🚫 Captcha detected on opened page'); await new_page.close(); return false; } // Scroll on the opened page await human_scroll(new_page, { scrolls: random_range(2, 5) }); await human_delay(1500, 3000); // Close the tab await new_page.close(); console.log(`✅ Page viewed and closed`); return true; } catch (error) { console.log(`❌ Error opening page: ${error.message}`); return false; } } // ============= CAPTCHA CHECK ============= async function check_for_captcha(page) { const captcha_selectors = 
[ '#captcha-form', '.g-recaptcha', 'iframe[src*="recaptcha"]', 'form[action*="captcha"]', '#captcha', '.captcha', 'div[jsname="Jai8Rc"]', 'form[action*="sorry"]' ]; for (const selector of captcha_selectors) { const element = await page.$(selector); if (element) return true; } const current_url = page.url(); if (current_url.includes('sorry') || current_url.includes('captcha')) { return true; } const page_text = await page.evaluate(() => document.body.innerText); const captcha_keywords = ['captcha', 'robot', 'verify', 'unusual traffic', 'confirm', 'not a robot']; for (const keyword of captcha_keywords) { if (page_text.toLowerCase().includes(keyword)) { return true; } } return false; } // ============= MAIN SEARCH FUNCTION ============= async function google_search_human(page, query, results_data, retry_count = 0) { const max_retries = 2; console.log(`🔍 Searching: ${query}${retry_count > 0 ? ` (attempt ${retry_count + 1})` : ''}`); try { // Go to Google homepage await page.goto('https://www.google.com', { waitUntil: 'domcontentloaded', timeout: 30000 }); await human_delay(1000, 2000); // Check for captcha let has_captcha = await check_for_captcha(page); if (has_captcha) { console.log('🚫 Captcha detected!'); return { error: 'captcha', query: query }; } // Accept cookies if present try { const cookie_button = await page.$('#L2AGLb'); if (cookie_button) { await human_click(page, cookie_button); console.log('✅ Cookies accepted'); await human_delay(500, 1000); } } catch (error) { console.log('No cookie button'); } // Enter search query const search_input = await page.$('textarea[name="q"], input[name="q"]'); if (!search_input) { throw new Error('Search input not found'); } await human_type(page, search_input, query, { random_mistakes: true, backspace_fix: true }); await human_delay(500, 1000); // Check for captcha before submitting has_captcha = await check_for_captcha(page); if (has_captcha) { console.log('🚫 Captcha detected before submission!'); return { error: 
'captcha', query: query }; } // Press Enter console.log('📤 Submitting query...'); await Promise.all([ page.waitForNavigation({ waitUntil: 'domcontentloaded', timeout: 15000 }).catch(e => { console.log(`⚠️ Navigation warning: ${e.message}`); return null; }), page.keyboard.press('Enter'), human_delay(500, 1000) ]); // Check for captcha after search has_captcha = await check_for_captcha(page); if (has_captcha) { console.log('🚫 Captcha detected after search!'); return { error: 'captcha', query: query }; } console.log('⏳ Waiting for results to load...'); // Wait for results to appear try { await page.waitForSelector('div.tF2Cxc', { timeout: 15000, visible: true }); console.log('✅ Results loaded'); } catch (error) { console.log('⚠️ Results not found, continuing...'); } await human_delay(1500, 2500); // Scroll through results console.log('📜 Scrolling through results...'); await human_scroll(page, { scrolls: random_range(4, 8) }); // Parse results console.log('📊 Parsing results...'); const parsed_results = await parse_search_results(page, query); if (parsed_results.results.length === 0 && retry_count < max_retries) { console.log('⚠️ No results found, retrying...'); await human_delay(2000, 3000); return await google_search_human(page, query, results_data, retry_count + 1); } // Save results const is_appending = results_data.has_results; await save_results_to_file(query, parsed_results, is_appending); results_data.has_results = true; results_data.all_results.push(...parsed_results.results); // Open 1-2 random result pages if (parsed_results.results.length > 0) { const pages_to_open = Math.floor(random_range(1, Math.min(3, parsed_results.results.length))); console.log(`📖 Opening ${pages_to_open} result pages...`); for (let i = 0; i < pages_to_open; i++) { await open_random_result(page, parsed_results.results); await human_delay(1000, 2000); // Return to results page const current_url = page.url(); if (!current_url.includes('google.com/search')) { try { await page.goBack({ 
waitUntil: 'domcontentloaded', timeout: 10000 }); await human_delay(1000, 1500); } catch (error) { console.log('⚠️ Could not go back'); await page.reload({ waitUntil: 'domcontentloaded' }); } } } } console.log(`✅ Search "${query}" completed, found ${parsed_results.results.length} results`); return { success: true, query: query, results: parsed_results.results }; } catch (error) { console.error(`❌ Error during search "${query}": ${error.message}`); const has_captcha = await check_for_captcha(page).catch(() => false); if (has_captcha) { console.log('🚫 Error caused by captcha'); return { error: 'captcha', query: query }; } if (retry_count < max_retries) { console.log(`🔄 Retrying in 5 seconds...`); await sleep(5); return await google_search_human(page, query, results_data, retry_count + 1); } return { error: 'timeout', query: query }; } } // ============= OCTO FUNCTIONS ============= async function check_limits(response) { function parse_int_safe(value) { const parsed = parseInt(value, 10); return isNaN(parsed) ? 0 : parsed; } const ratelimit_header = response.headers.ratelimit; if (!ratelimit_header) { console.warn('No ratelimit header found!'); return; } const limit_entries = ratelimit_header.split(',').map(entry => entry.trim()); for (const entry of limit_entries) { const name_match = entry.match(/^([^;]+)/); const r_match = entry.match(/;r=(\d+)/); const t_match = entry.match(/;t=(\d+)/); if (!r_match || !t_match) { console.warn(`Invalid ratelimit format: ${entry}`); continue; } const limit_name = name_match ? 
name_match[1] : 'unknown_limit'; const remaining_quantity = parse_int_safe(r_match[1]); const window_seconds = parse_int_safe(t_match[1]); if (remaining_quantity < 5) { const wait_time = window_seconds + 1; console.log(`Waiting ${wait_time} seconds due to ${limit_name} limit`); await sleep(wait_time); } } } function parse_proxy(proxy) { const regex = /^(\w+):\/\/(?:([^:]+):([^@]+)@)?([^:]+):(\d+)$/; const match = proxy.match(regex); if (!match) return null; const [, type, login, password, host, port] = match; return { type, host, port, login: login || null, password: password || null }; } async function octo_one_time_profile(config, proxy) { const one_time_profile_config = { method: "post", url: `${config.octo_local_api_base_url}/one_time/start`, headers: { 'Content-Type': 'application/json' }, data: { "profile_data": { "fingerprint": { "os": Math.random() < 0.5 ? "win" : "mac" }, "proxy": proxy, "images_load_limit": 10240, }, "headless": config.headless_mode, "debug_port": true, "timeout": 60 } } const response = await axios(one_time_profile_config); await check_limits(response); return response; } // ============= MAIN PROCESS ============= (async () => { console.log('🚀 Starting Google Scraper with Human-like Behavior...'); console.log('🛡️ Captcha detection enabled - profiles with captcha will be skipped\n'); const proxy_count = config.proxies.length; const all_queries = config.google_search_queries; const query_batches = distribute_queries(all_queries, proxy_count); console.log(`Total proxies: ${proxy_count}`); console.log(`Total search queries: ${all_queries.length}`); console.log('Query distribution:'); query_batches.forEach((batch, idx) => { console.log(` Profile ${idx + 1}: ${batch.length} queries - ${batch.join(', ')}`); }); console.log(''); let successful_profiles = 0; let skipped_profiles = 0; let failed_profiles = 0; for (let i = 0; i < proxy_count; i++) { console.log(`\n${'='.repeat(80)}`); console.log(`📋 Processing profile ${i + 1}/${proxy_count}`); 
console.log(`${'='.repeat(80)}`); const queries_for_this_profile = query_batches[i]; if (queries_for_this_profile.length === 0) { console.log(`⚠️ No queries assigned to profile ${i + 1}, skipping.`); continue; } let parsed_proxy = parse_proxy(config.proxies[i]); if (!parsed_proxy) { console.error(`❌ Failed to parse proxy: ${config.proxies[i]}`); failed_profiles++; continue; } console.log(`🔧 Creating and starting One Time Profile with proxy: ${parsed_proxy.host}:${parsed_proxy.port}`); let ws_endpoint; try { ws_endpoint = await octo_one_time_profile(config, parsed_proxy); } catch (error) { console.error(`❌ Failed to create or start profile: ${error.message}`); failed_profiles++; continue; } if (!ws_endpoint || !ws_endpoint.data.ws_endpoint || !ws_endpoint.data.uuid) { console.error('❌ Failed to create or start profile'); failed_profiles++; continue; } console.log(`✅ Profile created and started: ${ws_endpoint.data.uuid}`); console.log(`🌐 Connecting to browser`); let browser; try { browser = await puppeteer.connect({ browserWSEndpoint: ws_endpoint.data.ws_endpoint, defaultViewport: null }); } catch (error) { console.error(`❌ Failed to connect to browser: ${error.message}`); await kill_browser(ws_endpoint.data.browser_pid); continue; } const page = await browser.newPage(); const results_data = { has_results: false, all_results: [] }; let captcha_detected = false; // Execute only the queries assigned to this profile for (let j = 0; j < queries_for_this_profile.length; j++) { const query = queries_for_this_profile[j]; try { const search_result = await google_search_human(page, query, results_data); if (search_result.error === 'captcha') { console.log(`\n🚨 CAPTCHA DETECTED! 
Skipping profile ${ws_endpoint.data.uuid}`); captcha_detected = true; break; } if (j < queries_for_this_profile.length - 1 && !captcha_detected) { const delay_between = random_range(5, 10); console.log(`\n⏰ Waiting ${delay_between.toFixed(1)} seconds before next search...`); await sleep(delay_between); } } catch (error) { console.error(`❌ Error during search "${query}": ${error.message}`); } } console.log(`🛑 Stopping profile...`); await kill_browser(ws_endpoint.data.browser_pid); if (captcha_detected) { console.log(`⏭️ Profile ${ws_endpoint.data.uuid} skipped due to captcha`); skipped_profiles++; } else if (results_data.all_results.length > 0) { const summary_filename = `summary_${ws_endpoint.data.uuid}_${Date.now()}.txt`; const summary_path = path.join(__dirname, 'search_results', summary_filename); let summary_content = `=== SEARCH SUMMARY ===\n`; summary_content += `Profile: ${ws_endpoint.data.uuid}\n`; summary_content += `Proxy: ${parsed_proxy.host}:${parsed_proxy.port}\n`; summary_content += `Queries executed: ${queries_for_this_profile.length}\n`; summary_content += `Queries: ${queries_for_this_profile.join(', ')}\n`; summary_content += `Total results collected: ${results_data.all_results.length}\n`; summary_content += `Time: ${new Date().toISOString()}\n`; summary_content += `${'='.repeat(80)}\n\n`; await fs.writeFile(summary_path, summary_content); console.log(`\n📊 Summary saved: ${summary_path}`); successful_profiles++; } else { console.log(`⚠️ Profile ${ws_endpoint.data.uuid} finished without results`); failed_profiles++; } console.log(`✅ Profile ${i + 1} completed`); if (i < proxy_count - 1) { const delay_between = random_range(10, 20); console.log(`\n⏰ Waiting ${delay_between.toFixed(1)} seconds before next profile...`); await sleep(delay_between); } } console.log(`\n${'='.repeat(80)}`); console.log(`📊 FINAL STATISTICS:`); console.log(`${'='.repeat(80)}`); console.log(`✅ Successful profiles: ${successful_profiles}`); console.log(`⏭️ Skipped due to 
captcha: ${skipped_profiles}`); console.log(`❌ Failed profiles: ${failed_profiles}`); console.log(`📁 All results saved in "search_results" folder`); console.log(`\n🎉 Google Scraper finished!`); })();
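To see how the script spreads queries across proxies, the `distribute_queries` step can be run on its own. The sketch below is a standalone copy of that function applied to the sample queries with two proxies; the input values are only illustrative.

```javascript
// Standalone copy of distribute_queries from the script above.
// Earlier profiles receive one extra query when the split is uneven.
function distribute_queries(queries, numProxies) {
  const baseCount = Math.floor(queries.length / numProxies);
  const remainder = queries.length % numProxies;
  const batches = [];
  let start = 0;
  for (let i = 0; i < numProxies; i++) {
    const count = baseCount + (i < remainder ? 1 : 0);
    batches.push(queries.slice(start, start + count));
    start += count;
  }
  return batches;
}

const batches = distribute_queries(["nodejs", "sidwudraq", "arch linux"], 2);
console.log(batches); // [ [ 'nodejs', 'sidwudraq' ], [ 'arch linux' ] ]
```

With fewer queries than proxies, the trailing batches come back empty, which is why the main loop skips profiles that have no queries assigned.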
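Practical tips
Study the official API documentation carefully.
Check the FAQ on using the API.
Read the detailed article on working with the Octo API.
API requests in Octo Browser are limited according to your subscription level, but the limits can be raised. Use a function that checks the API limits in the response headers. Ignoring HTTP 429 errors can lengthen the ban. If you automate from several devices under one account, implement centralized request tracking (for example, with Redis).
Do not use unpatched versions of automation libraries, because they contain flaws that can be detected. For Puppeteer/Playwright, use the rebrowser patches. For Selenium, use undetected-chromedriver.
Use functions and libraries that imitate human behavior as closely as possible: mouse clicks, hovering, cursor movement, typing, scrolling, navigation flow, and random actions.
Use a local cache for profiles to reduce proxy traffic. You can do this by passing `"local_cache": true` when creating a profile, or by using a shared cache directory via `--disk-cache-dir`, for example `flags:["--disk-cache-dir=C:/Cache"]`.
Limit image loading in the profile settings to save proxy traffic. Set `"images_load_limit": 10240` when creating a profile to restrict images to no more than 10,240 bytes.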
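As a rough illustration of the rate-limit tip, the sketch below parses a `ratelimit` response header and decides how long to pause before the next API call. The header layout (comma-separated entries of the form `name;r=<remaining>;t=<seconds>`) is taken from the `check_limits` function in this article's sample script. The helper names `parse_ratelimit` and `seconds_to_wait`, the threshold of 5, and the limit names in the example string are assumptions made for illustration and are not part of the Octo API.

```javascript
// Parse a header such as "default;r=3;t=10, burst;r=50;t=1" into entries.
// The limit names "default" and "burst" are made up for this example.
function parse_ratelimit(header) {
  if (!header) return [];
  return header.split(',').map(entry => entry.trim()).flatMap(entry => {
    const name = (entry.match(/^([^;]+)/) || [])[1] || 'unknown_limit';
    const r = entry.match(/;r=(\d+)/);
    const t = entry.match(/;t=(\d+)/);
    if (!r || !t) return []; // skip malformed entries
    return [{
      name,
      remaining: parseInt(r[1], 10),
      window_seconds: parseInt(t[1], 10)
    }];
  });
}

// Return how many seconds to pause before the next call (0 means no pause).
function seconds_to_wait(entries, min_remaining = 5) {
  let wait = 0;
  for (const e of entries) {
    if (e.remaining < min_remaining) {
      wait = Math.max(wait, e.window_seconds + 1);
    }
  }
  return wait;
}

const entries = parse_ratelimit('default;r=3;t=10, burst;r=50;t=1');
console.log(seconds_to_wait(entries)); // 11
```

Unlike `check_limits`, which sleeps once per exhausted limit, this sketch waits only for the longest window. Either choice keeps the client below the limit; the longest-window variant simply avoids stacking pauses.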
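Comparison of scraping methods
| Method | Cost | Complexity | Ban risk | Data quality |
|---|---|---|---|---|
| Paid SERP API | High (from $1 per 1,000 requests) | Low | Very low | High |
| Official API | Low / free | Low | None | Low (not real SERP data) |
| HTTP requests | Medium (proxies required) | High | Very high | High |
| Automation with an anti-detect browser | Medium (subscription and proxies required) | Medium | Very low | Highest |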
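A ready-made script for scraping the Google SERP
Below is a sample scraping script that works with the Octo Browser API. You can use the script, or parts of it, as a starting point for a full project and adapt it as needed.
1. Download and install VS Code.
2. Download and install Node.js.
3. Create a folder in a convenient location and name it, for example, `octo_scraper`. Open this folder in VS Code.
4. Create a `.js` file. It is best to name it after what it does, for example `google_scraping.js`. Paste the script code into the file.
5. In the `config` variable in the code, add your proxies to the `proxies` array.
6. In the same place, add your search queries to the `google_search_queries` array. In this sample script, the number of queries must be greater than or equal to the number of proxies. You can easily modify the scraping logic to suit your needs.
Note: each array element must be enclosed in quotes, and elements are separated by commas.
7. Open a terminal and run the command `npm i rebrowser-puppeteer axios fkill` to install the Node.js dependencies.
8. If VS Code shows an error, open Windows PowerShell as administrator, enter the command `Set-ExecutionPolicy -Scope CurrentUser -ExecutionPolicy RemoteSigned`, and confirm. Then repeat the previous step.
9. Launch Octo Browser.
10. Run the program in VS Code (Ctrl/Cmd + F5) and wait for the script to finish.
11. The scraper creates a one-time profile for each added proxy and runs the specified queries sequentially. The script imitates real user behavior to get past Google's anti-fraud systems.
12. You can monitor the process in the debug console. If a captcha appears, the script closes that profile and moves on to the next one.
13. The search results are saved in the `search_results` folder in the project directory.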
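Before running the script, it can help to check the `config` values against the rules in the steps above. The following is a minimal sketch: `validate_config` is a hypothetical helper that is not part of the published script, and its proxy pattern is copied from the script's `parse_proxy` function.

```javascript
// Hypothetical pre-flight check for the script's config object.
// Same pattern as parse_proxy: type://[login:password@]host:port
const PROXY_PATTERN = /^(\w+):\/\/(?:([^:]+):([^@]+)@)?([^:]+):(\d+)$/;

function validate_config(config) {
  const problems = [];
  const { proxies = [], google_search_queries: queries = [] } = config;

  if (proxies.length === 0) {
    problems.push('proxies array is empty');
  }
  proxies.forEach((proxy, i) => {
    if (typeof proxy !== 'string' || !PROXY_PATTERN.test(proxy)) {
      problems.push(`proxy #${i + 1} is not in type://[login:password@]host:port form`);
    }
  });
  if (queries.length < proxies.length) {
    problems.push('fewer queries than proxies: some profiles would get no queries');
  }
  return problems;
}

// Example: the second proxy lacks a scheme and there is only one query.
const problems = validate_config({
  proxies: ["socks5://login:password@127.0.0.1:50000", "127.0.0.1:50000"],
  google_search_queries: ["nodejs"]
});
console.log(problems);
```

An empty array means the configuration passes these basic checks. It does not confirm that the proxies are reachable.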

Script code
const axios = require('axios');
const puppeteer = require('rebrowser-puppeteer');
const fs = require('fs').promises;
const path = require('path');

const config = {
  octo_local_api_base_url: `http://localhost:58888/api/profiles`, // change port if you don't use default 58888
  headless_mode: false,
  proxies: [
    "socks5://login:password@127.0.0.1:50000", // paste your proxies
    "socks5://login:password@127.0.0.1:50000"
  ],
  google_search_queries: ["nodejs", "sidwudraq", "arch linux"] // change queries
};

// ============= HELPER FUNCTIONS =============
function random_range(min, max) {
  return min + Math.random() * (max - min);
}

async function sleep(seconds) {
  return new Promise(resolve => setTimeout(resolve, seconds * 1000));
}

// Randomized pause in milliseconds, clamped to [min_ms, max_ms]
async function human_delay(min_ms = 50, max_ms = 200) {
  const mu = Math.log((min_ms + max_ms) / 2);
  const sigma = random_range(0.3, 0.6);
  let delay = Math.exp(mu + sigma * (Math.random() - 0.5) * 2);
  delay = Math.min(max_ms, Math.max(min_ms, delay));
  await new Promise(resolve => setTimeout(resolve, delay));
}

async function kill_browser(pid) {
  const { default: fkill } = await import('fkill');
  await fkill(pid, { force: true });
  console.log(`✅ Process with PID ${pid} successfully stopped.`);
}

// ============= BEZIER CURVES FOR HUMAN-LIKE MOVEMENT =============
// Point on a cubic Bezier curve at parameter t in [0, 1]
function bezier_curve(t, p0, p1, p2, p3) {
  const mt = 1 - t;
  const mt2 = mt * mt;
  const t2 = t * t;
  const x = mt2 * mt * p0.x + 3 * mt2 * t * p1.x + 3 * mt * t2 * p2.x + t2 * t * p3.x;
  const y = mt2 * mt * p0.y + 3 * mt2 * t * p1.y + 3 * mt * t2 * p2.y + t2 * t * p3.y;
  return { x, y };
}

function generate_bezier_points(start, end) {
  const distance = Math.hypot(end.x - start.x, end.y - start.y);
  const angle = Math.atan2(end.y - start.y, end.x - start.x);
  const deviation = random_range(distance * 0.2, distance * 0.5);
  const angle_variation = random_range(-Math.PI / 3, Math.PI / 3);

  const p1 = {
    x: start.x + Math.cos(angle + angle_variation) * deviation,
    y: start.y + Math.sin(angle + angle_variation) * deviation
  };
  const p2 = {
    x: end.x - Math.cos(angle - angle_variation) * deviation,
    y: end.y - Math.sin(angle - angle_variation) * deviation
  };

  return [start, p1, p2, end];
}

function generate_trajectory(start, end, steps = null) {
  const distance = Math.hypot(end.x - start.x, end.y - start.y);
  const actual_steps = steps || Math.max(20, Math.min(100, Math.floor(distance / 3)));
  const bezier_points = generate_bezier_points(start, end);
  const trajectory = [];

  for (let i = 0; i <= actual_steps; i++) {
    const t = i / actual_steps;
    const eased_t = Math.pow(t, 1 + Math.random() * 0.3);
    const point = bezier_curve(eased_t, ...bezier_points);
    const jitter = {
      x: (Math.random() - 0.5) * random_range(0.5, 2),
      y: (Math.random() - 0.5) * random_range(0.5, 2)
    };
    trajectory.push({
      x: Math.round(point.x + jitter.x),
      y: Math.round(point.y + jitter.y)
    });
  }

  return trajectory;
}

// ============= HUMAN-LIKE CLICK =============
async function human_click(page, selector_or_element, options = {}) {
  const {
    move_speed = 1.0,
    random_overshoot = true,
    click_delay = null,
    force_visible = true
  } = options;

  const element = typeof selector_or_element === 'string'
    ? await page.$(selector_or_element)
    : selector_or_element;
  if (!element) {
    throw new Error(`Element not found: ${selector_or_element}`);
  }

  if (force_visible) {
    await element.scrollIntoView();
    await human_delay(100, 300);
  }

  // Last known cursor position (falls back to the viewport center)
  const current_mouse = await page.evaluate(() => ({
    x: window.mouseX || window.innerWidth / 2,
    y: window.mouseY || window.innerHeight / 2
  }));

  const box = await element.boundingBox();
  if (!box) throw new Error('Could not get element coordinates');

  // Pick a random point inside the central part of the element
  const target = {
    x: box.x + random_range(box.width * 0.2, box.width * 0.8),
    y: box.y + random_range(box.height * 0.2, box.height * 0.8)
  };

  if (random_overshoot && Math.random() < 0.3) {
    // Overshoot the target slightly, then come back to it
    const overshoot_x = (Math.random() - 0.5) * random_range(10, 30);
    const overshoot_y = (Math.random() - 0.5) * random_range(10, 30);
    const overshoot_target = { x: target.x + overshoot_x, y: target.y + overshoot_y };

    const overshoot_trajectory = generate_trajectory(current_mouse, overshoot_target);
    for (const point of overshoot_trajectory) {
      await page.mouse.move(point.x, point.y);
      await human_delay(1, 3);
    }

    const return_trajectory = generate_trajectory(overshoot_target, target);
    for (const point of return_trajectory) {
      await page.mouse.move(point.x, point.y);
      await human_delay(1, 3);
    }
  } else {
    const trajectory = generate_trajectory(current_mouse, target);
    for (const point of trajectory) {
      await page.mouse.move(point.x, point.y);
      const delay = Math.max(1, Math.min(5, 10 / move_speed));
      await human_delay(delay * 0.5, delay * 1.5);
    }
  }

  const final_delay = click_delay !== null ? click_delay : random_range(80, 250);
  await human_delay(final_delay * 0.8, final_delay * 1.2);

  // Occasional micro-adjustment just before the click
  if (Math.random() < 0.15) {
    const micro_offset_x = (Math.random() - 0.5) * random_range(1, 4);
    const micro_offset_y = (Math.random() - 0.5) * random_range(1, 4);
    await page.mouse.move(target.x + micro_offset_x, target.y + micro_offset_y);
    await human_delay(10, 30);
  }

  await page.mouse.down();
  await human_delay(50, 150); // hold the button for 50-150 ms
  if (Math.random() < 0.2) {
    await page.mouse.move(
      target.x + (Math.random() - 0.5) * 2,
      target.y + (Math.random() - 0.5) * 2
    );
  }
  await page.mouse.up();
  await human_delay(50, 150);

  // Remember the cursor position for the next movement
  await page.evaluate(({ x, y }) => {
    window.mouseX = x;
    window.mouseY = y;
  }, target);

  return { success: true, position: target };
}

// ============= HUMAN-LIKE TEXT INPUT =============
async function human_type(page, selector, text, options = {}) {
  const { typing_speed = null, random_mistakes = false, backspace_fix = false } = options;

  const element = typeof selector === 'string' ? await page.$(selector) : selector;
  if (!element) {
    throw new Error(`Element not found: ${selector}`);
  }

  await human_click(page, element, { pre_hover: true });

  // Clear the field
  await page.keyboard.down('Control');
  await page.keyboard.press('a');
  await page.keyboard.up('Control');
  await page.keyboard.press('Backspace');
  await human_delay(100, 200);

  for (let i = 0; i < text.length; i++) {
    const char = text[i];

    let delay;
    if (typing_speed) {
      delay = typing_speed;
    } else {
      const base_delay = random_range(50, 200);
      const is_space = char === ' ';
      delay = is_space ? base_delay * 2 : base_delay;
    }

    // Occasionally type a neighbouring character by mistake
    if (random_mistakes && Math.random() < 0.02) {
      const wrong_char = String.fromCharCode(
        char.charCodeAt(0) + (Math.random() > 0.5 ? 1 : -1)
      );
      await page.keyboard.type(wrong_char, { delay: delay * 0.5 });
      await human_delay(100, 200);

      if (backspace_fix) {
        await page.keyboard.press('Backspace');
        await human_delay(50, 100);
      } else {
        continue;
      }
    }

    await page.keyboard.type(char, { delay: delay });
  }

  await human_delay(100, 300);
  return true;
}

// ============= HUMAN-LIKE SCROLL =============
async function human_scroll(page, options = {}) {
  const { scrolls = null, min_scroll = 300, max_scroll = 800 } = options;
  const num_scrolls = scrolls || Math.floor(random_range(3, 8));

  for (let i = 0; i < num_scrolls; i++) {
    const scroll_distance = random_range(min_scroll, max_scroll);
    await page.evaluate((distance) => {
      window.scrollBy({ top: distance, behavior: 'smooth' });
    }, scroll_distance);
    await human_delay(800, 2000);

    // Sometimes scroll back up a little
    if (Math.random() < 0.2) {
      const back_distance = random_range(100, 300);
      await page.evaluate((distance) => {
        window.scrollBy({ top: -distance, behavior: 'smooth' });
      }, back_distance);
      await human_delay(500, 1000);
    }
  }
}

// ============= DISTRIBUTE QUERIES AMONG PROFILES =============
function distribute_queries(queries, numProxies) {
  const total = queries.length;
  const baseCount = Math.floor(total / numProxies);
  const remainder = total % numProxies;
  const batches = [];
  let start = 0;

  for (let i = 0; i < numProxies; i++) {
    const count = baseCount + (i < remainder ? 1 : 0);
    const batch = queries.slice(start, start + count);
    batches.push(batch);
    start += count;
  }

  return batches;
}

// ============= PARSE GOOGLE RESULTS =============
async function parse_search_results(page, query) {
  return await page.evaluate((query) => {
    const results = [];

    // Find all result containers
    const organic_results = document.querySelectorAll('div.tF2Cxc');
    console.log(`Found ${organic_results.length} result containers`);

    organic_results.forEach((result, index) => {
      try {
        // Title
        const title_element = result.querySelector('h3.LC20lb.MBeuO.DKV0Md');
        const title = title_element ? title_element.innerText : '';

        // Link
        let link_element = result.querySelector('a');
        let link = link_element ? link_element.href : '';

        // Clean Google redirect
        if (link && link.includes('/url?q=')) {
          const url_match = link.match(/\/url\?q=([^&]+)/);
          if (url_match) {
            link = decodeURIComponent(url_match[1]);
          }
        }

        // Description
        let desc_element = result.querySelector('div.VwiC3b.yXK7lf.p4wth.r025kc.Hdw6tb');
        let description = desc_element ? desc_element.innerText : '';

        // Fallback selector
        if (!description) {
          const fallback_desc = result.querySelector('div.VwiC3b');
          description = fallback_desc ? fallback_desc.innerText : '';
        }

        if (title && title.trim() && link) {
          results.push({
            position: results.length + 1,
            title: title.trim(),
            link: link,
            description: description.trim().substring(0, 500)
          });
        }
      } catch (error) {
        console.error(`Error parsing result ${index}:`, error);
      }
    });

    console.log(`Successfully parsed ${results.length} results`);

    return {
      query: query,
      timestamp: new Date().toISOString(),
      total_results: results.length,
      results: results
    };
  }, query);
}

// ============= SAVE RESULTS TO FILE =============
async function save_results_to_file(query, data, is_appending = false) {
  const filename = `${query.replace(/[^a-z0-9]/gi, '_').toLowerCase()}_results.txt`;
  const filepath = path.join(__dirname, 'search_results', filename);

  // Create directory if needed
  await fs.mkdir(path.join(__dirname, 'search_results'), { recursive: true });

  let content = '';
  if (!is_appending) {
    content += `=== GOOGLE SEARCH RESULTS ===\n`;
    content += `Query: ${data.query}\n`;
    content += `Time: ${data.timestamp}\n`;
    content += `Total results: ${data.total_results}\n`;
    content += `${'='.repeat(80)}\n\n`;
  }

  for (const result of data.results) {
    content += `${result.position}. ${result.title}\n`;
    content += ` URL: ${result.link}\n`;
    content += ` Description: ${result.description.substring(0, 200)}...\n`;
    content += ` ${'-'.repeat(80)}\n`;
  }

  content += `\n📄 Page saved: ${new Date().toISOString()}\n`;
  content += `${'='.repeat(80)}\n\n`;

  await fs.writeFile(filepath, content, { flag: is_appending ? 'a' : 'w' });
  console.log(`✅ Results saved to: ${filepath}`);
  return filepath;
}

// ============= OPEN RANDOM RESULT PAGE =============
async function open_random_result(page, results) {
  if (!results || results.length === 0) {
    console.log('No results to open');
    return false;
  }

  // Choose a random result (usually not the first)
  let result_index = 0;
  if (results.length > 1) {
    result_index = Math.random() < 0.7
      ? Math.floor(random_range(1, Math.min(5, results.length)))
      : Math.floor(random_range(0, results.length));
  }

  const selected_result = results[result_index];
  console.log(`Opening result ${result_index + 1}: ${selected_result.title.substring(0, 50)}...`);

  try {
    // Check for captcha before opening
    const has_captcha = await check_for_captcha(page);
    if (has_captcha) {
      console.log('🚫 Captcha detected, not opening result');
      return false;
    }

    // Open in a new tab
    const new_page = await page.browser().newPage();
    await new_page.goto(selected_result.link, { waitUntil: 'domcontentloaded', timeout: 20000 });
    await human_delay(2000, 4000);

    // Check for captcha on the opened page
    const page_has_captcha = await check_for_captcha(new_page);
    if (page_has_captcha) {
      console.log('🚫 Captcha detected on opened page');
      await new_page.close();
      return false;
    }

    // Scroll on the opened page
    await human_scroll(new_page, { scrolls: random_range(2, 5) });
    await human_delay(1500, 3000);

    // Close the tab
    await new_page.close();
    console.log(`✅ Page viewed and closed`);
    return true;
  } catch (error) {
    console.log(`❌ Error opening page: ${error.message}`);
    return false;
  }
}

// ============= CAPTCHA CHECK =============
async function check_for_captcha(page) {
  const captcha_selectors = [
    '#captcha-form',
    '.g-recaptcha',
    'iframe[src*="recaptcha"]',
    'form[action*="captcha"]',
    '#captcha',
    '.captcha',
    'div[jsname="Jai8Rc"]',
    'form[action*="sorry"]'
  ];

  for (const selector of captcha_selectors) {
    const element = await page.$(selector);
    if (element) return true;
  }

  const current_url = page.url();
  if (current_url.includes('sorry') || current_url.includes('captcha')) {
    return true;
  }

  const page_text = await page.evaluate(() => document.body.innerText);
  const captcha_keywords = ['captcha', 'robot', 'verify', 'unusual traffic', 'confirm', 'not a robot'];
  for (const keyword of captcha_keywords) {
    if (page_text.toLowerCase().includes(keyword)) {
      return true;
    }
  }

  return false;
}

// ============= MAIN SEARCH FUNCTION =============
async function google_search_human(page, query, results_data, retry_count = 0) {
  const max_retries = 2;
  console.log(`🔍 Searching: ${query}${retry_count > 0 ? ` (attempt ${retry_count + 1})` : ''}`);

  try {
    // Go to Google homepage
    await page.goto('https://www.google.com', { waitUntil: 'domcontentloaded', timeout: 30000 });
    await human_delay(1000, 2000);

    // Check for captcha
    let has_captcha = await check_for_captcha(page);
    if (has_captcha) {
      console.log('🚫 Captcha detected!');
      return { error: 'captcha', query: query };
    }

    // Accept cookies if present
    try {
      const cookie_button = await page.$('#L2AGLb');
      if (cookie_button) {
        await human_click(page, cookie_button);
        console.log('✅ Cookies accepted');
        await human_delay(500, 1000);
      }
    } catch (error) {
      console.log('No cookie button');
    }

    // Enter search query
    const search_input = await page.$('textarea[name="q"], input[name="q"]');
    if (!search_input) {
      throw new Error('Search input not found');
    }
    await human_type(page, search_input, query, { random_mistakes: true, backspace_fix: true });
    await human_delay(500, 1000);

    // Check for captcha before submitting
    has_captcha = await check_for_captcha(page);
    if (has_captcha) {
      console.log('🚫 Captcha detected before submission!');
      return { error: 'captcha', query: query };
    }

    // Press Enter
    console.log('📤 Submitting query...');
    await Promise.all([
      page.waitForNavigation({ waitUntil: 'domcontentloaded', timeout: 15000 }).catch(e => {
        console.log(`⚠️ Navigation warning: ${e.message}`);
        return null;
      }),
      page.keyboard.press('Enter'),
      human_delay(500, 1000)
    ]);

    // Check for captcha after search
    has_captcha = await check_for_captcha(page);
    if (has_captcha) {
      console.log('🚫 Captcha detected after search!');
      return { error: 'captcha', query: query };
    }

    console.log('⏳ Waiting for results to load...');

    // Wait for results to appear
    try {
      await page.waitForSelector('div.tF2Cxc', { timeout: 15000, visible: true });
      console.log('✅ Results loaded');
    } catch (error) {
      console.log('⚠️ Results not found, continuing...');
    }
    await human_delay(1500, 2500);

    // Scroll through results
    console.log('📜 Scrolling through results...');
    await human_scroll(page, { scrolls: random_range(4, 8) });

    // Parse results
    console.log('📊 Parsing results...');
    const parsed_results = await parse_search_results(page, query);

    if (parsed_results.results.length === 0 && retry_count < max_retries) {
      console.log('⚠️ No results found, retrying...');
      await human_delay(2000, 3000);
      return await google_search_human(page, query, results_data, retry_count + 1);
    }

    // Save results
    const is_appending = results_data.has_results;
    await save_results_to_file(query, parsed_results, is_appending);
    results_data.has_results = true;
    results_data.all_results.push(...parsed_results.results);

    // Open 1-2 random result pages
    if (parsed_results.results.length > 0) {
      const pages_to_open = Math.floor(random_range(1, Math.min(3, parsed_results.results.length)));
      console.log(`📖 Opening ${pages_to_open} result pages...`);

      for (let i = 0; i < pages_to_open; i++) {
        await open_random_result(page, parsed_results.results);
        await human_delay(1000, 2000);

        // Return to results page
        const current_url = page.url();
        if (!current_url.includes('google.com/search')) {
          try {
            await page.goBack({
waitUntil: 'domcontentloaded', timeout: 10000 }); await human_delay(1000, 1500); } catch (error) { console.log('⚠️ Could not go back'); await page.reload({ waitUntil: 'domcontentloaded' }); } } } } console.log(`✅ Search "${query}" completed, found ${parsed_results.results.length} results`); return { success: true, query: query, results: parsed_results.results }; } catch (error) { console.error(`❌ Error during search "${query}": ${error.message}`); const has_captcha = await check_for_captcha(page).catch(() => false); if (has_captcha) { console.log('🚫 Error caused by captcha'); return { error: 'captcha', query: query }; } if (retry_count < max_retries) { console.log(`🔄 Retrying in 5 seconds...`); await sleep(5); return await google_search_human(page, query, results_data, retry_count + 1); } return { error: 'timeout', query: query }; } } // ============= OCTO FUNCTIONS ============= async function check_limits(response) { function parse_int_safe(value) { const parsed = parseInt(value, 10); return isNaN(parsed) ? 0 : parsed; } const ratelimit_header = response.headers.ratelimit; if (!ratelimit_header) { console.warn('No ratelimit header found!'); return; } const limit_entries = ratelimit_header.split(',').map(entry => entry.trim()); for (const entry of limit_entries) { const name_match = entry.match(/^([^;]+)/); const r_match = entry.match(/;r=(\d+)/); const t_match = entry.match(/;t=(\d+)/); if (!r_match || !t_match) { console.warn(`Invalid ratelimit format: ${entry}`); continue; } const limit_name = name_match ? 
name_match[1] : 'unknown_limit'; const remaining_quantity = parse_int_safe(r_match[1]); const window_seconds = parse_int_safe(t_match[1]); if (remaining_quantity < 5) { const wait_time = window_seconds + 1; console.log(`Waiting ${wait_time} seconds due to ${limit_name} limit`); await sleep(wait_time); } } } function parse_proxy(proxy) { const regex = /^(\w+):\/\/(?:([^:]+):([^@]+)@)?([^:]+):(\d+)$/; const match = proxy.match(regex); if (!match) return null; const [, type, login, password, host, port] = match; return { type, host, port, login: login || null, password: password || null }; } async function octo_one_time_profile(config, proxy) { const one_time_profile_config = { method: "post", url: `${config.octo_local_api_base_url}/one_time/start`, headers: { 'Content-Type': 'application/json' }, data: { "profile_data": { "fingerprint": { "os": Math.random() < 0.5 ? "win" : "mac" }, "proxy": proxy, "images_load_limit": 10240, }, "headless": config.headless_mode, "debug_port": true, "timeout": 60 } } const response = await axios(one_time_profile_config); await check_limits(response); return response; } // ============= MAIN PROCESS ============= (async () => { console.log('🚀 Starting Google Scraper with Human-like Behavior...'); console.log('🛡️ Captcha detection enabled - profiles with captcha will be skipped\n'); const proxy_count = config.proxies.length; const all_queries = config.google_search_queries; const query_batches = distribute_queries(all_queries, proxy_count); console.log(`Total proxies: ${proxy_count}`); console.log(`Total search queries: ${all_queries.length}`); console.log('Query distribution:'); query_batches.forEach((batch, idx) => { console.log(` Profile ${idx + 1}: ${batch.length} queries - ${batch.join(', ')}`); }); console.log(''); let successful_profiles = 0; let skipped_profiles = 0; let failed_profiles = 0; for (let i = 0; i < proxy_count; i++) { console.log(`\n${'='.repeat(80)}`); console.log(`📋 Processing profile ${i + 1}/${proxy_count}`); 
console.log(`${'='.repeat(80)}`); const queries_for_this_profile = query_batches[i]; if (queries_for_this_profile.length === 0) { console.log(`⚠️ No queries assigned to profile ${i + 1}, skipping.`); continue; } let parsed_proxy = parse_proxy(config.proxies[i]); if (!parsed_proxy) { console.error(`❌ Failed to parse proxy: ${config.proxies[i]}`); failed_profiles++; continue; } console.log(`🔧 Creating and starting One Time Profile with proxy: ${parsed_proxy.host}:${parsed_proxy.port}`); let ws_endpoint; try { ws_endpoint = await octo_one_time_profile(config, parsed_proxy); } catch (error) { console.error(`❌ Failed to create or start profile: ${error.message}`); failed_profiles++; continue; } if (!ws_endpoint || !ws_endpoint.data.ws_endpoint || !ws_endpoint.data.uuid) { console.error('❌ Failed to create or start profile'); failed_profiles++; continue; } console.log(`✅ Profile created and started: ${ws_endpoint.data.uuid}`); console.log(`🌐 Connecting to browser`); let browser; try { browser = await puppeteer.connect({ browserWSEndpoint: ws_endpoint.data.ws_endpoint, defaultViewport: null }); } catch (error) { console.error(`❌ Failed to connect to browser: ${error.message}`); await kill_browser(ws_endpoint.data.browser_pid); continue; } const page = await browser.newPage(); const results_data = { has_results: false, all_results: [] }; let captcha_detected = false; // Execute only the queries assigned to this profile for (let j = 0; j < queries_for_this_profile.length; j++) { const query = queries_for_this_profile[j]; try { const search_result = await google_search_human(page, query, results_data); if (search_result.error === 'captcha') { console.log(`\n🚨 CAPTCHA DETECTED! 
Skipping profile ${ws_endpoint.data.uuid}`); captcha_detected = true; break; } if (j < queries_for_this_profile.length - 1 && !captcha_detected) { const delay_between = random_range(5, 10); console.log(`\n⏰ Waiting ${delay_between.toFixed(1)} seconds before next search...`); await sleep(delay_between); } } catch (error) { console.error(`❌ Error during search "${query}": ${error.message}`); } } console.log(`🛑 Stopping profile...`); await kill_browser(ws_endpoint.data.browser_pid); if (captcha_detected) { console.log(`⏭️ Profile ${ws_endpoint.data.uuid} skipped due to captcha`); skipped_profiles++; } else if (results_data.all_results.length > 0) { const summary_filename = `summary_${ws_endpoint.data.uuid}_${Date.now()}.txt`; const summary_path = path.join(__dirname, 'search_results', summary_filename); let summary_content = `=== SEARCH SUMMARY ===\n`; summary_content += `Profile: ${ws_endpoint.data.uuid}\n`; summary_content += `Proxy: ${parsed_proxy.host}:${parsed_proxy.port}\n`; summary_content += `Queries executed: ${queries_for_this_profile.length}\n`; summary_content += `Queries: ${queries_for_this_profile.join(', ')}\n`; summary_content += `Total results collected: ${results_data.all_results.length}\n`; summary_content += `Time: ${new Date().toISOString()}\n`; summary_content += `${'='.repeat(80)}\n\n`; await fs.writeFile(summary_path, summary_content); console.log(`\n📊 Summary saved: ${summary_path}`); successful_profiles++; } else { console.log(`⚠️ Profile ${ws_endpoint.data.uuid} finished without results`); failed_profiles++; } console.log(`✅ Profile ${i + 1} completed`); if (i < proxy_count - 1) { const delay_between = random_range(10, 20); console.log(`\n⏰ Waiting ${delay_between.toFixed(1)} seconds before next profile...`); await sleep(delay_between); } } console.log(`\n${'='.repeat(80)}`); console.log(`📊 FINAL STATISTICS:`); console.log(`${'='.repeat(80)}`); console.log(`✅ Successful profiles: ${successful_profiles}`); console.log(`⏭️ Skipped due to 
captcha: ${skipped_profiles}`); console.log(`❌ Failed profiles: ${failed_profiles}`); console.log(`📁 All results saved in "search_results" folder`); console.log(`\n🎉 Google Scraper finished!`); })();
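Before running the full script against real proxies, it can help to check what its two parsing helpers accept. The sketch below is a standalone copy of `parse_proxy` plus a side-effect-free version of the parsing step inside `check_limits`. The name `parse_ratelimit`, the sample proxy strings, and the sample header value are assumptions made up for illustration; the header shape (`name;r=<remaining>;t=<seconds>`, comma-separated) is simply what the regular expressions in `check_limits` look for, not a confirmed description of the Octo API response.

```javascript
// Same regex as parse_proxy in the script: type://[login:password@]host:port
function parse_proxy(proxy) {
  const regex = /^(\w+):\/\/(?:([^:]+):([^@]+)@)?([^:]+):(\d+)$/;
  const match = proxy.match(regex);
  if (!match) return null;
  const [, type, login, password, host, port] = match;
  return { type, host, port, login: login || null, password: password || null };
}

// Hypothetical helper: the parsing half of check_limits, without the sleeping.
function parse_ratelimit(header) {
  return header
    .split(',')
    .map(entry => entry.trim())
    .map(entry => {
      const name_match = entry.match(/^([^;]+)/);
      const r_match = entry.match(/;r=(\d+)/);
      const t_match = entry.match(/;t=(\d+)/);
      if (!r_match || !t_match) return null; // malformed entry, skipped
      return {
        name: name_match ? name_match[1] : 'unknown_limit',
        remaining: parseInt(r_match[1], 10),
        window_seconds: parseInt(t_match[1], 10),
      };
    })
    .filter(Boolean);
}

// Made-up sample inputs (documentation-range IP, example.com host).
const sample_proxy = parse_proxy('socks5://user:secret@198.51.100.7:1080');
const no_auth_proxy = parse_proxy('http://proxy.example.com:8080');
const bad_proxy = parse_proxy('198.51.100.7:1080'); // no scheme, so no match
const limits = parse_ratelimit('"default";r=3;t=12, "burst";r=40;t=1');

console.log(sample_proxy);
console.log(no_auth_proxy);
console.log(bad_proxy);
// Entries below the script's threshold of 5 remaining requests would trigger a wait.
console.log(limits.filter(limit => limit.remaining < 5));
```

Two details are worth noting: the port comes back as a string rather than a number, and a proxy written without a scheme is rejected, which the main loop then counts as a failed profile.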
