How to Scrape Tokopedia Data?

Scraping Tokopedia data can provide valuable insights into product pricing, market trends, and competitor strategies. Tokopedia, Indonesia's largest e-commerce platform, uses JavaScript-rendered pages, making it necessary to handle dynamic content and anti-bot measures effectively. Here's a quick summary of the process:

  • Data Available: Product details, pricing, reviews, seller info, logistics, and media.
  • Challenges: JavaScript-heavy structure, Cloudflare protection, rate limits (1–5 requests per minute).
  • Tools Needed: Python, libraries like crawlbase, BeautifulSoup, requests, and pandas.
  • Legal Considerations: Public data scraping is generally allowed under U.S. law, but ensure compliance with Tokopedia's terms and avoid extracting personal data.

Key Steps:

  1. Set Up Tools: Install the required Python libraries and get a Crawlbase JS Token for handling JavaScript-rendered content.
  2. Scrape Search Results: Use specific CSS selectors to extract product names, prices, and store info.
  3. Extract Product Details: Dive deeper into product pages for descriptions, ratings, and media URLs.
  4. Handle Anti-Bot Systems: Use residential proxies, rotate User-Agent strings, and implement delays to avoid bans.
  5. Export Data: Save results in CSV or JSON formats for analysis.

For large-scale projects, managed services like Web Scraping HQ can simplify the process by automating tasks like proxy rotation and compliance checks.

Compliance Tip: Focus only on publicly available data, respect rate limits, and avoid violating terms of service.

How to Scrape Tokopedia Data: Step-by-Step Workflow

Setting Up the Scraping Environment

Before diving into coding, it's essential to have all your tools ready. Tokopedia's scraping difficulty is rated as medium (3/5) as of May 2026, primarily due to Cloudflare protection and rate limiting. Basic Python requests often fall short since they can't handle JavaScript execution or mimic browser TLS fingerprints. That’s why this guide opts for a more robust approach.

Installing Required Tools and Libraries

Start by verifying that Python 3 is installed. You can do this by running:

python --version

Once confirmed, install the necessary libraries using this pip command:

pip install crawlbase beautifulsoup4 requests pandas

Here's how each library contributes to the scraping process:

Library | Purpose
crawlbase | Manages JavaScript rendering and bypasses anti-bot protections
beautifulsoup4 | Parses HTML and extracts specific data using CSS selectors
requests | Handles HTTP requests, useful for interacting with dynamic endpoints
pandas | Organizes scraped data and exports it into formats like CSV or JSON

For writing and testing your scraper, tools like Visual Studio Code, PyCharm, or Jupyter Notebook are highly recommended.

The next step involves securing API access to handle Tokopedia's dynamic content.

Getting and Configuring API Access

Head over to the Crawlbase website and create a free account. You'll be granted 1,000 free requests to experiment with your setup. After signing up, log in to your dashboard and copy your JavaScript (JS) Token.

This token is crucial for scraping dynamic, JavaScript-heavy content. Avoid using the Normal Token, as it’s designed for static HTML-only sites.

Token Type | Rendering | Anti-Bot Features | Suitable For
Normal Token | None | Basic IP rotation | Static HTML websites
JS Token | Full browser-like | Advanced handling | Dynamic, JavaScript-heavy sites

Once you have your JS Token, initialize the Crawlbase API in your script:

from crawlbase import CrawlingAPI

crawling_api = CrawlingAPI({ 'token': 'YOUR_JS_TOKEN' })

To ensure Tokopedia's JavaScript content is fully loaded before scraping, configure your request options. Set ajax_wait to 'true' and page_wait to at least '5000' (5 seconds).

Testing the Setup

Run the script below to verify that your environment is correctly configured:

from crawlbase import CrawlingAPI

crawling_api = CrawlingAPI({ 'token': 'YOUR_JS_TOKEN' })

def test_setup(url):
    response = crawling_api.get(url)
    if response['headers']['pc_status'] == '200':
        print("Environment configured correctly!")
        return response['body'].decode('utf-8')
    else:
        print(f"Setup failed. Status: {response['headers']['pc_status']}")
        return None

test_setup('https://www.tokopedia.com/')

If the pc_status returns '200', it confirms that your setup is functioning as expected.

How to Scrape Tokopedia Data

With your JS Token ready, you can start collecting data from Tokopedia. This guide explains two main tasks: scraping search results and extracting detailed product information.

Scraping Search Results

Tokopedia search URLs follow a consistent format. For instance, searching for "laptop" uses the URL: https://www.tokopedia.com/search?st=product&q=laptop. You can paginate by appending &page=2, &page=3, and so on.

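Here are the key selectors for extracting data from search result cards. The class names are generated hashes ending in "==", so they may change when Tokopedia updates its front end, and they have to be written as attribute selectors because "=" is not valid inside a .class selector:

Data Point | CSS Selector
Product Name | span[class~="OWkG6oHwAppMn1hIBsC3pQ=="]
Price | div[class~="ELhJqP-Bfiud3i5eBR8NWg=="]
Store Name | span[class~="X6c-fdwuofj6zGvLKVUaNQ=="]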

Below is an example script to scrape the first page of search results:

from urllib.parse import quote_plus

from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

crawling_api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'})

def scrape_search_results(keyword, page=1):
    # URL-encode the keyword so multi-word searches produce a valid query string
    url = f"https://www.tokopedia.com/search?st=product&q={quote_plus(keyword)}&page={page}"
    # Wait for AJAX calls and give the page 5 seconds to render
    options = {'ajax_wait': 'true', 'page_wait': '5000'}
    response = crawling_api.get(url, options)

    if response['headers']['pc_status'] != '200':
        print(f"Failed to retrieve page {page}")
        return []

    soup = BeautifulSoup(response['body'].decode('utf-8'), 'html.parser')
    products = []

    for card in soup.select('div[data-testid="divSRPContentProducts"] > div'):
        # The class names contain "=", which is not valid in a ".class" selector,
        # so match them with [class~="..."] attribute selectors instead
        name = card.select_one('span[class~="OWkG6oHwAppMn1hIBsC3pQ=="]')
        price = card.select_one('div[class~="ELhJqP-Bfiud3i5eBR8NWg=="]')
        store = card.select_one('span[class~="X6c-fdwuofj6zGvLKVUaNQ=="]')
        link = card.select_one('a[href]')

        products.append({
            'name': name.text.strip() if name else None,
            'price': price.text.strip() if price else None,
            'store': store.text.strip() if store else None,
            'url': link['href'] if link else None
        })

    return products

results = scrape_search_results('laptop', page=1)
for product in results:
    print(product)

To scrape multiple pages, wrap this function in a loop and increment the page parameter until you have collected enough data or a page returns no products.
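The pagination loop could look something like the sketch below. It assumes a page-scraping function with the same signature as scrape_search_results(keyword, page). The helper name scrape_all_pages, the max_pages cap, and the random 2 to 5 second pause are illustrative choices, not Tokopedia or Crawlbase requirements. A stand-in function replaces the real scraper so the example runs without network access.

```python
import random
import time


def scrape_all_pages(scrape_page, keyword, max_pages=5,
                     min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Collect results page by page until a page comes back empty.

    scrape_page is any callable taking (keyword, page) and returning a list,
    for example the scrape_search_results function defined earlier.
    """
    all_products = []
    for page in range(1, max_pages + 1):
        products = scrape_page(keyword, page)
        if not products:
            # An empty page usually means a failed request or no more results
            break
        all_products.extend(products)
        if page < max_pages:
            # Pause for a random interval so requests are not evenly spaced
            sleep(random.uniform(min_delay, max_delay))
    return all_products


# Stand-in for scrape_search_results (hypothetical data, two pages only)
def fake_scrape(keyword, page):
    return [{'name': f'{keyword} {page}'}] if page <= 2 else []


demo = scrape_all_pages(fake_scrape, 'laptop', sleep=lambda seconds: None)
print(len(demo))
```

In real use, pass scrape_search_results as the first argument and leave sleep at its default so the pauses actually happen.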

Extracting Product Details

Once you've gathered basic search results, you can extract more detailed information from each product page. The data-testid attributes in Tokopedia's HTML tend to be more stable selectors than the generated CSS class names.

Here are the key selectors for a product detail page:

Data Point | CSS Selector
Product Name | h1[data-testid="lblPDPDetailProductName"]
Price | div[data-testid="lblPDPDetailProductPrice"]
Description | div[data-testid="lblPDPDescriptionProduk"]
Store Name | a[data-testid="llbPDPFooterShopName"]
Image URLs | button[data-testid="PDPImageThumbnail"] img

The following script extracts details from a single product URL:

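from crawlbase import CrawlingAPI
from bs4 import BeautifulSoup

crawling_api = CrawlingAPI({'token': 'YOUR_JS_TOKEN'})

def scrape_product_details(product_url):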
    options = {'ajax_wait': 'true', 'page_wait': '5000'}
    response = crawling_api.get(product_url, options)

    if response['headers']['pc_status'] != '200':
        print("Failed to retrieve product page")
        return {}

    soup = BeautifulSoup(response['body'].decode('utf-8'), 'html.parser')

    name = soup.select_one('h1[data-testid="lblPDPDetailProductName"]')
    price = soup.select_one('div[data-testid="lblPDPDetailProductPrice"]')
    description = soup.select_one('div[data-testid="lblPDPDescriptionProduk"]')
    store = soup.select_one('a[data-testid="llbPDPFooterShopName"]')
    images = [img['src'] for img in soup.select('button[data-testid="PDPImageThumbnail"] img') if img.get('src')]

    return {
        'name': name.text.strip() if name else None,
        'price': price.text.strip() if price else None,
        'description': description.text.strip() if description else None,
        'store': store.text.strip() if store else None,
        'images': images
    }

details = scrape_product_details('https://www.tokopedia.com/example-store/example-product')
print(details)

For additional data like ratings and reviews, HTML scraping alone may not suffice. Tokopedia often loads this information via GraphQL endpoints (gql.tokopedia.com/graphql), which return structured JSON data. By inspecting network activity in your browser's DevTools (under the Fetch/XHR tab), you can identify these calls. The stats object in the response typically includes fields like reviewCount and rating.
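Once you have captured one of these JSON responses from DevTools, pulling out the stats object is mostly a matter of walking the nested structure. The sketch below is a minimal example under stated assumptions: the sample payload is invented for illustration and does not reproduce Tokopedia's real response shape, and find_key is a hypothetical helper rather than part of any library.

```python
import json


def find_key(payload, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(payload, dict):
        if key in payload:
            return payload[key]
        for value in payload.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(payload, list):
        for item in payload:
            found = find_key(item, key)
            if found is not None:
                return found
    return None


# Invented sample; a real response will be nested differently
sample = json.loads(
    '{"data": {"product": {"name": "Example", '
    '"stats": {"reviewCount": 12, "rating": 4.8}}}}'
)
stats = find_key(sample, 'stats')
print(stats['rating'], stats['reviewCount'])
```

Searching by key name instead of hard-coding a path keeps the sketch usable even if the surrounding nesting differs from the sample.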

Handling Anti-Bot Measures

Tokopedia employs a combination of Cloudflare and server-side rate limiting to deter automated traffic. As of May 14, 2026, scraping Tokopedia is rated as "Medium" difficulty (3 out of 5). The platform enforces strict limits, typically allowing only 1–5 requests per minute from a single IP before triggering rate limits or temporary bans. Cloudflare adds another layer of defense with JavaScript challenges, TLS fingerprinting, and behavioral analysis to detect and block bots. To navigate these barriers, using residential proxies along with tailored configurations is highly effective.

Using Proxies and Custom Configurations

A dependable way to keep your scraper functional is by leveraging residential proxies, which route requests through real home networks, making the traffic appear like that of ordinary users. Unlike datacenter proxies (which cost around $1–$5/GB), residential proxies are better equipped to bypass Tokopedia's advanced fingerprinting and blocking mechanisms.

When setting up your API, ensure that it includes options for handling JavaScript rendering. For example, enable sufficient wait times (e.g., ajax_wait set to true and page_wait at 5000ms) to ensure pages load fully.

If you encounter a 429 error, implement exponential backoff by waiting progressively longer intervals (e.g., 2, 4, 8 seconds) before retrying. Rotate through a pool of at least 50 genuine browser User-Agent strings, picking a different one for each request. Additionally, use curl_cffi to simulate Chrome's TLS handshake, which can help bypass Cloudflare's checks. Establishing a session by first visiting Tokopedia's homepage can also reduce the likelihood of triggering Cloudflare's challenge loop, as this creates a valid cookie chain.
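The backoff and User-Agent rotation described above can be sketched as follows. This is an illustration, not a tested Tokopedia client. The fetch argument is a hypothetical callable returning (status_code, body), which in practice might wrap curl_cffi or another HTTP client. The three User-Agent strings are examples standing in for a much larger pool. A simulated server replaces real requests so the retry logic can be run offline.

```python
import random
import time

# Example pool only; the text above suggests at least 50 real browser strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/17.4 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0',
]


def fetch_with_backoff(fetch, url, max_retries=3, base_delay=2.0,
                       sleep=time.sleep):
    """Call fetch(url, headers), retrying on HTTP 429 with doubling delays."""
    for attempt in range(max_retries + 1):
        # Pick a fresh User-Agent for every attempt
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        status, body = fetch(url, headers)
        if status != 429 or attempt == max_retries:
            return status, body
        # Wait 2, 4, 8, ... seconds before the next attempt
        sleep(base_delay * (2 ** attempt))


# Simulated server: two 429 responses, then success
responses = iter([(429, ''), (429, ''), (200, '<html></html>')])
waits = []
result = fetch_with_backoff(lambda url, headers: next(responses),
                            'https://www.tokopedia.com/', sleep=waits.append)
print(result, waits)
```

Passing sleep as a parameter lets the demo record the waits instead of pausing. With the default time.sleep, the function blocks between attempts.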

Proxies vs. Headless Browsers

Proxies and headless browsers serve distinct purposes, and knowing when to use each is essential. Proxies focus on overcoming IP-based restrictions, while headless browsers like Playwright or Selenium are designed to handle JavaScript-heavy pages and interactive challenges. The table below outlines their differences:

Feature | Residential Proxies | Headless Browsers (Playwright/Selenium)
Primary Benefit | Bypasses IP reputation and rate limits | Handles complex JavaScript challenges and interactions
Speed | Fast with minimal overhead | Slower; higher CPU and memory consumption
Detection Risk | Low (mimics real home users) | Medium (headless signatures can be detected)
Cost | Moderate (around $5–$15/GB) | Higher due to infrastructure requirements
Best For | High-volume data extraction and price tracking | Tasks requiring full browser emulation, like interacting with dynamic content

For most scraping tasks on Tokopedia, combining residential proxies with a Crawlbase JS Token strikes a good balance between speed, cost, and efficiency. However, headless browsers are better suited for tasks that require interacting with the page, such as clicking buttons, scrolling through lazy-loaded content, or handling more advanced interactive challenges.

Storing and Exporting Scraped Data

Once your scraper reliably collects data, the next step is to organize and format it into a structured dataset. A clean export process ensures that raw data becomes usable for analysis. This section explains how to store and format the results of a Tokopedia scrape.

Saving Data in JSON or CSV Format

Start by collecting each scraped product as a Python dictionary. Then, compile these dictionaries into a pandas DataFrame for cleaning and exporting. Consistency in your schema is key - every product should include the same fields (even if some values are None) so that rows align correctly in the DataFrame.

Here are some important fields to capture:

Field | Source Key | Purpose
Product ID | product_id / data-productid | Helps avoid duplicates
Product Name | name / prd_link-product-name | Useful for text analysis
Price (IDR) | price.text_idr | Tracks price trends
Rating | stats.rating | Filters for quality
Review Count | stats.reviewCount | Measures popularity
Seller Location | shop.city | Assists in logistics planning

Once your data is organized, export it using either CSV or JSON formats. For CSV, use the following pandas command to preserve Indonesian characters:

df.to_csv('tokopedia_products.csv', index=False, encoding='utf-8')

For JSON, which works well for hierarchical or web-integrated data, use:

json.dump(data, json_file, indent=4)
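To make the consistent-schema advice concrete, here is a small sketch that pads every record to the same set of fields before serializing. The field list follows the column names used elsewhere in this guide but is an assumption about your own schema. The sample rows are invented, and normalize and to_json_text are hypothetical helper names.

```python
import json

# Assumed schema; adjust the field names to match your own scraper output
FIELDS = ['product_id', 'name', 'price_idr', 'rating',
          'review_count', 'seller_city']


def normalize(record):
    """Return a dict with every expected field, using None for missing ones."""
    return {field: record.get(field) for field in FIELDS}


def to_json_text(records):
    """Serialize records with a uniform schema, keeping non-ASCII text readable."""
    rows = [normalize(record) for record in records]
    return json.dumps(rows, indent=4, ensure_ascii=False)


# Invented sample rows; the second one is missing most fields
scraped = [
    {'product_id': '1', 'name': 'Tas Kulit', 'price_idr': 'Rp178.000',
     'rating': 4.8, 'review_count': 12, 'seller_city': 'Jakarta'},
    {'product_id': '2', 'name': 'Sepatu'},
]
text = to_json_text(scraped)
print(text)
```

Writing text to a file opened with encoding='utf-8' gives the JSON export. The same normalized rows can be passed to pandas.DataFrame so that every column lines up before calling to_csv.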

After exporting, prepare your data for analysis by converting text-based fields, such as prices, into numeric formats.

Converting Currency and Batch Processing

To analyze pricing data, you'll need to standardize numerical fields. Tokopedia displays prices in Indonesian Rupiah (IDR), formatted like "Rp178.000", where the dot is a thousands separator (178,000 rupiah). Clean these strings by removing the "Rp" prefix and dots, then convert them into integers:

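# regex=False treats '.' as a literal dot rather than a regex wildcard
df['price_idr'] = df['price_idr'].str.replace('Rp', '', regex=False).str.replace('.', '', regex=False).astype(int)
# 15500 is an example IDR-per-USD rate; substitute the current one
df['price_usd'] = df['price_idr'] / 15500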

This adds a price_usd column by converting Rupiah values into U.S. Dollars at a fixed rate of 15,500 IDR per USD, which is only an example; use the current exchange rate in practice. Additional steps, such as replacing missing sales values with 0, removing terms like terjual and ribu, and deduplicating rows with drop_duplicates(), further refine your dataset for analysis.
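The cleaning steps above can be sketched with two small helpers. Both are hypothetical names, and the label formats they handle are assumptions: the sketch treats "rb" or "ribu" as a thousands multiplier rather than simply deleting it, and reads a comma as a decimal mark. Check these against the strings your scraper actually returns.

```python
import re


def parse_price_idr(text):
    """Turn a display price such as 'Rp178.000' into the integer 178000."""
    if not isinstance(text, str):
        return None
    digits = re.sub(r'\D', '', text)  # keep digits only
    return int(digits) if digits else None


def parse_sold(text):
    """Turn a sales label such as '250+ terjual' into an integer count.

    Assumes 'rb' or 'ribu' means thousands and a comma is a decimal mark.
    """
    if not isinstance(text, str):
        return 0  # missing values count as zero sales
    cleaned = text.lower().replace('terjual', '').replace('+', '').strip()
    multiplier = 1
    for suffix in ('ribu', 'rb'):
        if cleaned.endswith(suffix):
            multiplier = 1000
            cleaned = cleaned[:-len(suffix)].strip()
            break
    try:
        return int(round(float(cleaned.replace(',', '.')) * multiplier))
    except ValueError:
        return 0  # unrecognized label


print(parse_price_idr('Rp178.000'), parse_sold('1,5rb+ terjual'), parse_sold(None))
```

With pandas, these can be applied column-wise, for example df['sold'] = df['sold'].apply(parse_sold), followed by drop_duplicates() on the product ID column.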

When scraping multiple URLs, respect rate limits by adding delays of 2 to 5 seconds between requests using time.sleep(). Always save raw results as a JSON backup before cleaning or exporting to CSV. This ensures you won’t lose data if an error occurs during the cleaning process, saving you from having to re-scrape everything.

Scaling Scraping with Web Scraping HQ

Extracting large-scale data from Tokopedia manually is a daunting task. As scraping projects grow in size, the operational hurdles - like managing proxies, handling rate limits and scraping errors, and streamlining data pipelines - can quickly become overwhelming. That’s where a managed solution like Web Scraping HQ steps in to simplify and automate the process.

Benefits of Using Web Scraping HQ

Web Scraping HQ takes the headache out of scaling Tokopedia data extraction. It automates critical tasks, reducing the need for constant manual intervention while ensuring smooth operations even as complexity increases.

Here’s what makes it stand out:

  • Managed scalability: Automatically handles tasks like bypassing Cloudflare, rotating residential IPs, and managing TLS fingerprinting - allowing you to focus on insights rather than infrastructure.
  • Compliance and quality assurance: Delivers data that’s legally compliant and accurate, with built-in quality checks that eliminate the need for manual preprocessing.
  • Structured delivery: Provides clean, ready-to-use data in formats like JSON or CSV, perfect for integrating directly into analytics tools or robotic process automation workflows.

With Tokopedia ranked as the #251 most scraped website globally as of May 2026, and boasting over 90 million active users and 350 million monthly visits, the sheer scale of data makes a managed solution essential for businesses looking to gain a competitive edge.

Pricing Plans and Custom Solutions

Web Scraping HQ offers two tailored pricing plans to suit different operational needs:

Plan | Price | Best For | Key Features
Standard | $449/month | Businesses needing structured, reliable data | JSON/CSV output, automated QA, compliance assurance, dedicated support
Custom | $999+/month | Enterprises with high-volume requirements | Custom data schema, double-layer QA, scalable to millions of rows, priority support, 24-hour turnaround

The Standard plan is ideal for businesses that need consistent, structured data within 5 business days. For enterprises tracking pricing across thousands of Tokopedia listings daily, the Custom plan offers faster turnarounds (24 hours) and advanced features like custom data schemas and priority support.

Both plans ensure seamless integration into your analytics workflow, saving time and effort while delivering actionable insights.

Integrating Web Scraping HQ Data into a Workflow

Once your data is delivered, integrating it into your analytics pipeline is straightforward. Here’s a Python example to get you started:

import pandas as pd

# Load the delivered dataset
df = pd.read_csv('webscrapinghq_tokopedia_products.csv', encoding='utf-8')

# Convert the price column to numeric for analysis and convert from IDR to USD
df['price_idr'] = pd.to_numeric(df['price_idr'], errors='coerce')
df['price_usd'] = (df['price_idr'] / 15500).round(2)

# Filter products with a rating of 4.5 or higher
top_rated = df[df['rating'] >= 4.5].sort_values('review_count', ascending=False)

print(top_rated[['name', 'price_usd', 'rating', 'seller_city']].head(10))

For even more convenience, Web Scraping HQ supports direct data delivery to Amazon S3 or Google Cloud Storage, keeping your workflow fully automated and eliminating the need for manual downloads.

Conclusion

Scraping Tokopedia comes with its own set of challenges - like handling JavaScript-heavy content, bypassing Cloudflare’s anti-bot defenses, and dealing with rate limits that kick in after just 1–5 requests per minute from a single IP. To tackle these issues, tools such as curl_cffi for TLS fingerprinting and residential proxies for rotating IPs are essential components for any serious scraping project on the platform.

For those looking to simplify the process, services like Web Scraping HQ provide managed web scraping. Plans start at $449/month, offering structured and compliance-checked data in JSON or CSV. For enterprise needs, custom schemas with a 24-hour turnaround are available for $999+/month. These options ensure consistent and reliable data delivery, which is critical for effective analysis and long-term success.

Whether you’re building your own scraper or opting for a managed service, the core principles remain the same: respect rate limits, rotate IPs, and ensure your data is well-structured before analysis. Following these fundamentals is key to creating a scraper that’s not just a one-time tool but a dependable, long-term solution.

