
# Web data retrieval

> Extract structured data from any website at scale with Browserbase. Use Stagehand or Playwright with Browserbase Verified, proxy rotation, and CAPTCHA solving.

Extract data from websites using cloud browsers that handle JavaScript rendering, bot protection, and dynamic content. Browserbase gives you reliable infrastructure for data extraction workflows, whether you're using Stagehand or Playwright.

* Scale data extraction across concurrent sessions without managing infrastructure
* Browse protected sites with [Browserbase Verified](/platform/identity/overview)
* Rotate IPs and geolocations with [proxies](/platform/identity/proxies)
* Debug and monitor extraction runs with [session recordings](/platform/browser/observability/session-recording) and [live views](/platform/browser/observability/session-live-view)

<Info>
  **Need scheduled or webhook-triggered data collection?** [Functions](/platform/runtime/overview) let you deploy data extraction workflows that can be invoked on demand or on a schedule, which suits data pipelines and monitoring workflows.
</Info>

## Template

Get started quickly with a ready-to-use data extraction template.

<Card title="Company Value Prop Generator" icon="rocket" href="https://www.browserbase.com/templates/company-value-prop-generator">
  Clone, configure, and run in minutes
</Card>

## Example: Extracting a book catalog

To demonstrate data extraction with Browserbase, this example pulls book titles, prices, and availability from a sample catalog site.

### Code example

<Tabs>
  <Tab title="Node.js">
    <CodeGroup>
      ```typescript Stagehand theme={null}
      import { Stagehand } from "@browserbasehq/stagehand";
      import { z } from "zod";
      import dotenv from "dotenv";
      dotenv.config();

      const stagehand = new Stagehand({
          env: "BROWSERBASE",
          verbose: 0,
      });

      async function scrapeBooks() {
          await stagehand.init();
          const page = stagehand.context.pages()[0];

          await page.goto("https://books.toscrape.com/");

          const scrape = await stagehand.extract({
              instruction: "Extract the books from the page",
              schema: z.object({
                  books: z.array(z.object({
                      title: z.string(),
                      price: z.string(),
                      image: z.string(),
                      inStock: z.string(),
                      link: z.string(),
                  }))
              }),
          });

          console.log(scrape.books);

          await stagehand.close();
      }

      scrapeBooks().catch(console.error);
      ```

      ```typescript Playwright theme={null}
      import { chromium } from "playwright-core";
      import { Browserbase } from "@browserbasehq/sdk";
      import * as dotenv from "dotenv";
      dotenv.config();

      async function createSession() {
          const bb = new Browserbase({ apiKey: process.env.BROWSERBASE_API_KEY });
          const session = await bb.sessions.create();

          return session;
      }

      async function scrapeBooks() {  
          const session = await createSession();
          const browser = await chromium.connectOverCDP(session.connectUrl);
          const defaultContext = browser.contexts()[0];
          const page = defaultContext.pages()[0];
          
          // Navigate to site
          await page.goto("https://books.toscrape.com/");

          // Extract the books from the page
          const books = await page.evaluate(() => {
              const items = document.querySelectorAll("article.product_pod");
              return Array.from(items).map(item => {
                  const titleElement = item.querySelector("h3 > a");
                  const priceElement = item.querySelector("p.price_color");
                  const imageElement = item.querySelector("img");
                  const inStockElement = item.querySelector("p.instock.availability");
                  const linkElement = item.querySelector("h3 > a");

                  return {
                      title: titleElement?.getAttribute("title"),
                      price: priceElement?.textContent,
                      image: imageElement?.src,
                      inStock: inStockElement?.textContent?.trim(),
                      link: linkElement?.getAttribute("href")
                  };
              });
          });

          await browser.close();
          return books;
      }

      // scrapeBooks() returns a Promise, so log the resolved array rather than the Promise itself
      scrapeBooks().then(console.log).catch(console.error);
      ```
    </CodeGroup>
  </Tab>

  <Tab title="Python">
    <CodeGroup>
      ```python Stagehand theme={null}
      import os
      import asyncio
      from stagehand import AsyncStagehand
      from dotenv import load_dotenv

      load_dotenv()

      async def main():
          client = AsyncStagehand(
              browserbase_api_key=os.environ["BROWSERBASE_API_KEY"],
              model_api_key=os.environ["MODEL_API_KEY"],
          )
          session = await client.sessions.create(model_name="google/gemini-2.5-flash")

          # Navigate to the site
          await session.navigate(url="https://books.toscrape.com/")

          # Extract structured data from the page
          extract_response = await session.extract(
              instruction="Extract the books from the page including title, price, and stock status",
              schema={
                  "type": "object",
                  "properties": {
                      "books": {
                          "type": "array",
                          "items": {
                              "type": "object",
                              "properties": {
                                  "title": {"type": "string"},
                                  "price": {"type": "string"},
                                  "inStock": {"type": "string"},
                              },
                          },
                      },
                  },
              },
          )

          print(extract_response.data.result)

          await session.end()

      if __name__ == "__main__":
          asyncio.run(main())
      ```

      ```python Playwright theme={null}
      import os
      from playwright.sync_api import sync_playwright
      from browserbase import Browserbase
      from dotenv import load_dotenv

      load_dotenv()

      def create_session():
          """Creates a Browserbase session."""
          bb = Browserbase(api_key=os.environ["BROWSERBASE_API_KEY"])
          session = bb.sessions.create(
              # Add configuration options here if needed
          )
          return session

      def extract_books():
          """Extracts book data using Playwright with Browserbase."""
          session = create_session()
          print(f"View session recording at https://browserbase.com/sessions/{session.id}")

          with sync_playwright() as p:
              browser = p.chromium.connect_over_cdp(session.connect_url)

              # Get the default browser context and page
              context = browser.contexts[0]
              page = context.pages[0]

              # Navigate to the page
              page.goto("https://books.toscrape.com/")

              # Extract the books from the page
              items = page.locator('article.product_pod')
              books = items.all()

              book_data_list = []
              for book in books:

                  book_data = {
                      "title": book.locator('h3 a').get_attribute('title'),
                      "price": book.locator('p.price_color').text_content(),
                      "image": book.locator('div.image_container img').get_attribute('src'),
                      "inStock": book.locator('p.instock.availability').text_content().strip(),
                      "link": book.locator('h3 a').get_attribute('href')
                  }
                  
                  book_data_list.append(book_data)

              print("Shutting down...")
              page.close()
              browser.close()

              return book_data_list

      if __name__ == "__main__":
          books = extract_books()
          print(books)
      ```
    </CodeGroup>
  </Tab>
</Tabs>

### Example output

```
[
  {
    title: 'A Light in the Attic',
    price: '£51.77',
    image: 'https://books.toscrape.com/media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg',
    inStock: 'In stock',
    link: 'catalogue/a-light-in-the-attic_1000/index.html'
  },
  ...
]
```
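
The records above keep the price as a display string and the link as a path relative to the catalog page. A minimal post-processing sketch, assuming you want a numeric price and an absolute URL, could look like the following. The `normalize_book` helper and its output field names (`currency`, `in_stock`) are hypothetical and not part of Browserbase, Stagehand, or Playwright.

```python
from decimal import Decimal
from urllib.parse import urljoin

# The catalog page scraped in the example; relative links resolve against it.
BASE_URL = "https://books.toscrape.com/"


def normalize_book(raw: dict, base_url: str = BASE_URL) -> dict:
    """Return a cleaned copy of one scraped record (hypothetical helper)."""
    price_text = raw["price"].strip()
    # Everything left after trimming trailing digits and dots is the currency symbol.
    currency = price_text.rstrip("0123456789.")
    amount = Decimal(price_text[len(currency):])

    return {
        "title": raw["title"],
        "currency": currency,
        "price": amount,
        # Treat any availability text starting with "in stock" as available.
        "in_stock": raw["inStock"].strip().lower().startswith("in stock"),
        # Resolve the relative href against the page it was scraped from.
        "link": urljoin(base_url, raw["link"]),
    }


if __name__ == "__main__":
    sample = {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "inStock": "In stock",
        "link": "catalogue/a-light-in-the-attic_1000/index.html",
    }
    print(normalize_book(sample))
```

Using `Decimal` rather than `float` avoids rounding surprises if you later sum or compare prices.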

## Best practices for data extraction

Follow these best practices to build reliable, efficient, and ethical data extraction workflows with Browserbase.

### Ethical data collection

* **Respect robots.txt**: Check the website's robots.txt file for crawling guidelines
* **Rate limiting**: Implement reasonable delays between requests (2-5 seconds)
* **Terms of Service**: Review the website's terms of service before extracting data
* **Data usage**: Only collect and use data in accordance with the website's policies
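
As a rough illustration of the first two points, the sketch below checks a URL against robots.txt rules with Python's standard `urllib.robotparser` and pauses for a random 2 to 5 seconds between requests. The user agent name and the `load_robots` and `polite_delay` helpers are assumptions made for this example, not Browserbase APIs.

```python
import random
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-extraction-bot"  # hypothetical user agent name


def load_robots(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(min_seconds: float = 2.0, max_seconds: float = 5.0) -> float:
    """Sleep for a random duration in the given range and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


if __name__ == "__main__":
    rules = load_robots("User-agent: *\nDisallow: /private/\n")
    for url in [
        "https://example.com/catalogue/page-1.html",
        "https://example.com/private/report.html",
    ]:
        if rules.can_fetch(USER_AGENT, url):
            print(f"allowed: {url}")
            polite_delay()  # wait before the next request
        else:
            print(f"skipped (disallowed by robots.txt): {url}")
```

The rules are passed in as text here so the example runs offline. Against a live site you would more likely point the parser at the file with `set_url(...)` and fetch it with `read()`.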

### Performance optimization

* **Batch processing**: Process multiple pages in batches with [concurrent sessions](/optimizations/concurrency/overview)
* **Selective extraction**: Only extract the data you need
* **Resource management**: Close browser sessions promptly after use
* **Connection reuse**: [Reuse browsers](/platform/browser/long-sessions/overview#using-keep-alive) for sequential extraction tasks

### Protected sites

* **Enable Browserbase Verified**: Verified sessions are recognized by bot protection partners
* **Randomize behavior**: Add variable delays between actions
* **Use proxies**: Rotate IPs to distribute requests
* **Mimic human interaction**: Add realistic mouse movements and delays
* **Handle CAPTCHAs**: Enable Browserbase's automatic CAPTCHA solving

## Next steps

<CardGroup cols={3}>
  <Card title="Verified" icon="user-secret" href="/platform/identity/overview">
    Configure fingerprinting and CAPTCHA solving
  </Card>

  <Card title="Browser Contexts" icon="browser" href="/platform/browser/core-features/contexts">
    Persist cookies and session data
  </Card>

  <Card title="Proxies" icon="network-wired" href="/platform/identity/proxies">
    Configure IP rotation and geolocation
  </Card>

  <Card title="Browserbase Functions" icon="bolt" href="/platform/runtime/overview">
    Deploy data extraction workflows as cloud functions
  </Card>
</CardGroup>
