Guide to Binance Scraper

WebScraping

Jump to section
  1. Quick comparison
  2. Set Up a Python Binance Scraper Environment
  3. Install Dependencies and Create the Project Structure
  4. Configure Public Access and Safe Secrets Handling
  5. Build Core Scrapers for Symbols, Prices, Order Books, and Candles
  6. Scrape Trading Pairs and Live Market Stats
  7. Scrape Order Book Snapshots and Historical Candles
  8. Use WebSocket Streams for Real-Time Binance Data
  9. Handle Pagination, Rate Limits, and Dynamic Content
  10. Paginate Historical Data Without Gaps
  11. Throttle Requests and Recover from Errors
  12. Work with Dynamic Pages and Reduce Anti-Bot Friction
  13. Store, Export, and Use Binance Data in Analytics Workflows
  14. Export Clean Data for Analytics and Reporting
  15. Turn Raw Data into Business-Ready Metrics
  16. Conclusion: From DIY Binance Scraper to Reliable Data Operations
  17. FAQs
  18. Do I need an API key for public Binance data?
  19. How do I keep candle data gap-free?
  20. What is the best format for storing Binance data?

Guide to Binance Scraper

If I want Binance market data, I should use the API first, not browser scraping. That is the main point. Binance gives me public access to symbol lists, prices, order books, and candle data through REST and WebSockets, and that covers most data jobs without a private key.

Here’s the short version:

  • I use REST for snapshots and history
  • I use WebSockets for live prices and order book updates
  • I use browser tools only for UI-only pages like announcements
  • I page through candles in 1,000-row batches
  • I watch rate-limit weight, not just request count
  • I save large datasets in Parquet, and small flat exports in CSV or JSON
  • I track both event time and ingest time
  • I check for gaps, duplicates, and schema changes while preparing for common web scraping errors

A few numbers matter right away:

  • Spot API limit: 1,200 weight per minute per IP
  • Klines limit: 1,000 candles per request
  • Order book depth: up to 5,000 levels
  • WebSocket sessions: Binance may close them after 24 hours
  • A 50-symbol pipeline can produce about 5 GB to 7.5 GB per day
Binance Data Collection Methods: REST vs WebSocket vs Browser Scraping

Binance Data Collection Methods: REST vs WebSocket vs Browser Scraping

Quick comparison

Method Best use What I get Main watchout
REST API History and snapshots Exchange info, tickers, candles, depth snapshots Weight limits
WebSocket Live market feeds Trades, price streams, depth diffs Reconnects and sequence checks
Browser scraping UI-only pages Announcements, page-only items CAPTCHA, WAF, page changes

The big takeaway is simple: a Binance scraper is less about “scraping” pages and more about building a clean data pipeline. I need type conversion, gap-free pagination, backoff for 429 and 5XX errors, and storage that can handle high-volume market data without making analysis a mess.

Set Up a Python Binance Scraper Environment

Install Dependencies and Create the Project Structure

Start by creating an isolated environment so this project doesn't mix packages with anything else on your machine. Inside a new folder, run python -m venv venv. Then activate it with source venv/bin/activate on macOS/Linux or venv\Scripts\activate on Windows.

Once that's done, install everything in one shot:

pip install requests httpx websocket-client pandas python-dotenv

Each package has a clear job:

  • requests for simple REST calls
  • httpx for async work or heavier pipelines
  • websocket-client for WebSocket streams
  • pandas for cleanup and export
  • python-dotenv for secret handling

It's also smart to set up a clean folder layout from the start. That saves you from the usual mess later.

Directory Purpose Key Files
binance_scraper/ Main source code scraper.py, utils.py
data/ Raw and processed output prices.csv, order_book.json
config/ Settings and secrets .env, config.py
logs/ Execution and error tracking scraper.log

Order book scraping can produce large daily files fast. To keep things under control, split data/ into subfolders like spot/, futures/, and klines/.

Configure Public Access and Safe Secrets Handling

Most public Binance data doesn't need auth. Still, it's smart to set up secret handling now, especially if you plan to add private endpoints later.

Create a .env file inside config/:

BINANCE_API_KEY=your_actual_key_here
BINANCE_SECRET_KEY=your_actual_secret_here

Then load those values with python-dotenv:

from dotenv import load_dotenv
import os

load_dotenv("config/.env")
api_key = os.getenv("BINANCE_API_KEY")
api_secret = os.getenv("BINANCE_SECRET_KEY")

Add .env, logs/, and data/ to .gitignore. That helps protect secrets and keeps bulky output files out of version control.

Also create a requirements.txt file so anyone on your team can recreate the same setup without guesswork.

With that in place, you're ready to connect to Binance and start pulling live and historical market data.

Build Core Scrapers for Symbols, Prices, Order Books, and Candles

Scrape Trading Pairs and Live Market Stats

Start with the basic market data endpoints. Use GET /api/v3/exchangeInfo to pull symbols and filters, GET /api/v3/ticker/price for the latest price, and GET /api/v3/ticker/24hr for 24-hour market stats.

One detail matters here: the 24-hour ticker endpoint has a request weight of 2–80 based on how many symbols you ask for. That makes a full batch request the smart move. Pull the whole set in one call instead of looping through symbols one by one.

Binance also returns most numeric fields as strings. If you skip type conversion during ingestion, your analysis can get messy fast. Convert those values to float or Decimal before you analyze or export them.

Here’s a small pattern that fetches all 24-hour tickers and writes them to CSV:

import requests
import pandas as pd
from datetime import datetime, timezone

def fetch_24hr_tickers():
    url = "https://api.binance.com/api/v3/ticker/24hr"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()

    df = pd.DataFrame(data)
    numeric_cols = [
        "lastPrice", "priceChangePercent", "volume", "highPrice", "lowPrice",
        "priceChange", "weightedAvgPrice", "openPrice", "quoteVolume",
        "bidPrice", "bidQty", "askPrice", "askQty"
    ]
    df[numeric_cols] = df[numeric_cols].astype(float)

    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    df.to_csv(f"data/spot/tickers_{timestamp}.csv", index=False)
    return df

Once you have symbol-level snapshots, the next step is depth data and candle history.

Scrape Order Book Snapshots and Historical Candles

The GET /api/v3/depth endpoint gives you bid and ask levels for a symbol. The limit parameter goes as high as 5,000 levels. But there’s a tradeoff: request weight climbs with depth. A limit of 1–100 costs 5, 101–500 costs 25, 501–1,000 costs 50, and 1,001–5,000 costs 250.

So don’t grab 5,000 levels just because you can. Use the smallest depth that matches your use case.

def fetch_order_book(symbol: str, limit: int = 100) -> dict:
    url = "https://api.binance.com/api/v3/depth"
    params = {"symbol": symbol, "limit": limit}
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    book = response.json()

    # Convert string prices/quantities to float
    book["bids"] = [[float(p), float(q)] for p, q in book["bids"]]
    book["asks"] = [[float(p), float(q)] for p, q in book["asks"]]
    return book

For candles, use GET /api/v3/klines. You must send a symbol and an interval, such as 1m, 1h, or 1d. Each response returns up to 1,000 candles.

If you want a full history with no holes, fetch in a loop with startTime and endTime. After each batch, set the next startTime to the last candle’s closeTime plus one millisecond. That tiny + 1ms step is easy to miss, but it keeps your pagination clean.

Binance timestamps are in UTC milliseconds, so convert them to UTC datetimes as part of normalization.

def fetch_klines(symbol: str, interval: str, start_ms: int, end_ms: int) -> pd.DataFrame:
    url = "https://api.binance.com/api/v3/klines"
    all_candles = []
    current_start = start_ms

    while current_start < end_ms:
        params = {
            "symbol": symbol,
            "interval": interval,
            "startTime": current_start,
            "endTime": end_ms,
            "limit": 1000,
        }
        resp = requests.get(url, params=params, timeout=10)
        resp.raise_for_status()
        candles = resp.json()
        if not candles:
            break
        all_candles.extend(candles)
        current_start = candles[-1][6] + 1  # closeTime + 1ms

    cols = [
        "open_time", "open", "high", "low", "close", "volume",
        "close_time", "quote_volume", "trade_count",
        "taker_buy_base", "taker_buy_quote", "ignore"
    ]
    df = pd.DataFrame(all_candles, columns=cols)
    df["open_time"] = pd.to_datetime(df["open_time"], unit="ms", utc=True)
    df["close_time"] = pd.to_datetime(df["close_time"], unit="ms", utc=True)
    for col in ["open", "high", "low", "close", "volume"]:
        df[col] = df[col].astype(float)
    return df

If you’re storing long candle histories, compressed columnar formats are a better fit for analysis and export.

Use WebSocket Streams for Real-Time Binance Data

For live prices and order-book diffs, use WebSockets. After the connection is open, WebSocket messages do not consume request weight. That said, the stream code still needs reconnect handling, heartbeats, and sequence checks. Without those pieces, things fall apart at the worst time.

import json
import time
from datetime import datetime, timezone

import websocket

def on_message(ws, message):
    data = json.loads(message)
    date_str = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    with open(f"data/spot/stream_{date_str}.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(data) + "\n")

def on_error(ws, error):
    print(f"WebSocket error: {error}")

def start_stream(stream: str):
    url = f"wss://stream.binance.com:9443/ws/{stream}"
    while True:
        ws = websocket.WebSocketApp(url, on_message=on_message, on_error=on_error)
        ws.run_forever(ping_interval=20, ping_timeout=10)
        print("Connection closed  -  reconnecting...")
        time.sleep(5)

start_stream("btcusdt@depth")

Binance closes WebSocket connections after 24 hours. Treat reconnect logic as part of the core design, not a cleanup task for later. You also need to reply to ping frames within 1 minute to avoid getting disconnected early.

To keep a local order book in sync, combine the REST depth snapshot with diff streams. The order matters. Start the stream first, then fetch the REST snapshot so you can apply the buffered diffs without gaps.

Here’s the handoff logic:

  • Buffer diff events as they arrive.
  • Drop any buffered event whose final update ID, u, is less than the snapshot’s lastUpdateId.
  • Apply the first event that satisfies U <= lastUpdateId + 1 <= u.

If you skip that sequence check, your local book can drift out of sync without much warning.

Handle Pagination, Rate Limits, and Dynamic Content

Paginate Historical Data Without Gaps

Once your core scrapers are working, the next job is reliability. That usually comes down to three things: gap-free pagination, steady pacing, and fallback handling.

Binance limits klines requests to 1,000 candles per call. To paginate safely, move startTime forward with the last candle's close time + 1 ms. Then sort every batch by open_time before you save it. That extra check makes it easier to spot out-of-order rows before they slip into your dataset.

As you write batches to disk, compare each batch's open_time values against what you already have stored. Also check that each candle moves forward by exactly one interval. If you find a duplicate or a missing time window, re-fetch that range instead of skipping past it. It takes a bit more work up front, but it saves you from messy data later.

Throttle Requests and Recover from Errors

Binance uses a weight-based rate limit instead of a plain request-count limit. The standard Spot API limit is 1,200 request weights per minute per IP. Track the X-MBX-USED-WEIGHT-1M header on every response, and slow down before you hit the wall. A good rule is to back off when usage reaches 80% of the minute budget.

If you get a 429, stop all requests at once and wait for the time listed in Retry-After. Keep pushing, and Binance can answer with an automated 418 ban that lasts anywhere from 2 minutes to 3 days.

For transient 5XX responses, treat the batch as incomplete. Wait a short time, then retry with exponential backoff plus jitter. If timeouts start stacking up, a simple circuit breaker can stop things from snowballing. Adaptive throttling or a token bucket also helps keep your scraper paced without leaning on fixed delays.

After request pacing is under control, you only need DOM handling for the pages that still depend on it.

Work with Dynamic Pages and Reduce Anti-Bot Friction

Use browser automation only when Binance doesn't expose the data through REST or WebSockets.

That usually means public pages the API doesn't cover, like announcement pages, public dashboards, or UI-only features. Binance uses a custom WAF, CAPTCHA challenges, and TLS fingerprinting, so plain HTTP clients can get blocked. curl_cffi can help by mimicking a browser TLS handshake and cutting down WAF friction.

When Playwright or Puppeteer is necessary, wait for specific DOM elements instead of using fixed delays. For infinite-scroll pages, check row counts to decide when loading is done. Scroll position by itself isn't dependable.

Store, Export, and Use Binance Data in Analytics Workflows

Export Clean Data for Analytics and Reporting

Once your collection setup is stable, the next step is simple: save data in formats that keep timing and numeric precision intact.

For high-volume Binance data from REST snapshots and WebSocket streams, like order books and candles, use Parquet with ZSTD. It handles heavy workloads well and keeps storage under control. A production scraper that covers 50 symbols can produce between 5 GB and 7.5 GB of data per day.

For smaller and flatter datasets, such as 24-hour ticker stats, CSV or JSON is usually enough.

You’ll also want to store both event time and ingest time for each record. That gives you a clean way to measure latency and check whether your captures are lining up as expected.

A clear file layout helps a lot here. Group files by market type, data category, and date. For example:

spot/candles/BTCUSDT/2026-06-17.parquet

Partitioning by market and date makes queries easier and speeds up later analysis work. For prices, use decimals or scaled integers instead of floats. Float rounding errors may look small at first, but over a large dataset they can snowball. This setup also makes feature engineering faster.

Turn Raw Data into Business-Ready Metrics

Raw Binance data is useful, but derived metrics are where things start to click.

During ingestion, turn candles and order book snapshots into metrics like spread, mid-price, imbalance, microprice, rolling volume, and volume velocity. Doing this early saves time later and gives your analytics layer cleaner inputs.

For timeframe aggregation, use pandas or polars to resample intraday records into daily or higher-timeframe OHLCV candles.

One common headache is schema drift. Binance may change field names or response formats without warning, and that can break downstream models fast. Add schema checks so missing keys get caught early instead of turning into silent null values in your metrics layer.

Conclusion: From DIY Binance Scraper to Reliable Data Operations

With collection, storage, and validation working together, a scraper stops being a side project and starts acting like a repeatable data pipeline.

As volume grows, add checks for sequence gaps and schema drift. A production-ready scraper that handles 50+ symbols can run on a 4 vCPU, 8 GB RAM instance for about $7.50 per month. Keep checking for gaps, precision issues, and schema changes as the dataset gets larger.

FAQs

Do I need an API key for public Binance data?

No. Public Binance market data, including price tickers, order book snapshots, historical klines, and trade history, does not require an API key.

These requests are still subject to IP-based rate limits and weight budgets, even without authentication. Binance makes this data available publicly on purpose.

How do I keep candle data gap-free?

To keep candle data gap-free, use automated resumption and deduplication. Your script should check for data that already exists, read the last saved timestamp, and continue requests from that point.

It also helps to add error handling with exponential backoff, so short-term network problems or rate limits don't kill the collection process. When you append data, merge the new rows with the old ones and remove duplicate candles that can show up if the connection drops and then resumes.

What is the best format for storing Binance data?

The right format comes down to what you need from the data and how you plan to work with it.

Apache Parquet is usually the best pick when efficiency matters, especially for large datasets. It works well for high-frequency order book snapshots and historical trade data, and it’s often paired with ZSTD compression to keep file sizes down without making data access a pain.

CSV is still a common option. It’s easy to inspect, plays nicely with many tools, and makes sense for smaller projects or cases where compatibility matters more than storage efficiency.

Whichever format you choose, stick to a consistent directory structure organized by market type, symbol, and interval.

Want this done for you?

Send us the URLs. We'll quote it in 24 hours.

Paste the URL(s) you want scraped. We'll reply within 24 hours with a feasibility check and a ballpark quote.

Monthly budget

Or, browse our 3 case studies →

FAQ

FAQs

Find answers to commonly asked questions about our Data as a Service solutions, ensuring clarity and understanding of our offerings.

How will I receive my data and in which formats?

We offer versatile delivery options including FTP, SFTP, AWS S3, Google Cloud Storage, email, Dropbox, and Google Drive. We accommodate data formats such as CSV, JSON, JSONLines, and XML, and are open to custom delivery or format discussions to align with your project needs.

What types of data can your service extract?

We are equipped to extract a diverse range of data from any website, while strictly adhering to legal and ethical guidelines, including compliance with Terms and Conditions, privacy, and copyright laws. Our expert teams assess legal implications and ensure best practices in web scraping for each project.

How are data projects managed?

Upon receiving your project request, our solution architects promptly engage in a discovery call to comprehend your specific needs, discussing the scope, scale, data transformation, and integrations required. A tailored solution is proposed post a thorough understanding, ensuring optimal results.

Can I use AI to scrape websites?

Yes, You can use AI to scrape websites. Webscraping HQ’s AI website technology can handle large amounts of data extraction and collection needs. Our AI scraping API allows user to scrape up to 50000 pages one by one.

What support services do you offer?

We offer inclusive support addressing coverage issues, missed deliveries, and minor site modifications, with additional support available for significant changes necessitating comprehensive spider restructuring.

Is there an option to test the services before purchasing?

Absolutely, we offer service testing with sample data from previously scraped sources. For new sources, sample data is shared post-purchase, after the commencement of development.

How can your services aid in web content extraction?

We provide end-to-end solutions for web content extraction, delivering structured and accurate data efficiently. For those preferring a hands-on approach, we offer user-friendly tools for self-service data extraction.

Is web scraping detectable?

Yes, Web scraping is detectable. One of the best ways to identify web scrapers is by examining their IP address and tracking how it's behaving.

Why is data extraction essential?

Data extraction is crucial for leveraging the wealth of information on the web, enabling businesses to gain insights, monitor market trends, assess brand health, and maintain a competitive edge. It is invaluable in diverse applications including research, news monitoring, and contract tracking.

Can you illustrate an application of data extraction?

In retail and e-commerce, data extraction is instrumental for competitor price monitoring, allowing for automated, accurate, and efficient tracking of product prices across various platforms, aiding in strategic planning and decision-making.