Best AI Prompts for Web Scraping Scripts with ChatGPT
TL;DR
- ChatGPT can generate complete Python web scraping scripts when provided with the target website structure, data requirements, and technical context
- BeautifulSoup scripts work best for static HTML pages; Selenium scripts are needed for JavaScript-heavy pages
- The most effective scraping prompts specify the target website, data fields, and technical stack — vague prompts produce broken scripts
- Script quality depends on prompt specificity — website structure, anti-scraping measures, and error handling requirements must be included
- Always add rate limiting and respectful scraping practices to generated scripts to avoid IP blocks and server overload
- Generated scripts should be reviewed and tested before running on large datasets
Introduction
Web scraping scripts automate data extraction from websites. Unlike no-code scraping tools, scripts give you complete control over the extraction process, handle complex scenarios, and can be scheduled and integrated into data pipelines. The challenge is that writing scraping scripts from scratch is time-consuming, and maintaining them as websites change requires ongoing engineering effort.
ChatGPT accelerates scraping script development by generating working code from clear specifications. Given a description of what you want to extract and from which website, ChatGPT can produce a Python script using BeautifulSoup, Selenium, or Scrapy that handles the core extraction logic. You then test, debug, and maintain it.
The key to getting working scripts from ChatGPT is specificity. A prompt that says “write a script to scrape a website” produces an outline. A prompt that describes the website structure, the specific data fields, the technical environment, and the error handling requirements produces a script that is much closer to production-ready.
Table of Contents
- BeautifulSoup vs. Selenium: Which to Use
- Basic BeautifulSoup Script Prompts
- Selenium Script Generation Prompts
- Pagination and Multi-Page Scraping Prompts
- Data Cleaning in Python Prompts
- Error Handling and Robustness Prompts
- Common Script Mistakes
- FAQ
BeautifulSoup vs. Selenium: Which to Use {#beautifulsoup-vs-selenium}
BeautifulSoup parses the HTML that a web server sends back. It works on any website where the data you want is in the initial HTML response. It is faster and uses less memory than Selenium. Use BeautifulSoup when:
- The target data loads with the initial page HTML
- The website does not require JavaScript execution to display content
- Speed and efficiency are important
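As a quick illustration of how small a static-page extraction can be, here is a minimal parsing sketch. The `h2.title` selector is an assumption for illustration; the actual fetch with requests is shown only as a comment:

```python
from bs4 import BeautifulSoup

def extract_titles(html: str) -> list[str]:
    """Collect text from each matching element in already-fetched HTML.
    The 'h2.title' selector is a placeholder; inspect your target page
    and substitute the real one."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("h2.title")]

# Typical use with requests (not executed here):
#   html = requests.get("https://example.com/products", timeout=10).text
#   print(extract_titles(html))
```

Because the parsing logic is a pure function of the HTML string, it can be tested against saved page snapshots without hitting the live site.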
Selenium automates a real browser (Chrome, Firefox) to render pages the way a human user would see them. It executes JavaScript, fills forms, clicks buttons, and handles dynamic content loading. Use Selenium when:
- The target data loads via JavaScript after the initial page load
- The website requires interaction (login, search, filtering) before data appears
- The website has anti-bot measures that detect non-browser access
ChatGPT can generate scripts for both. Specify which you need based on the website you are scraping.
Basic BeautifulSoup Script Prompts {#basic-beautifulsoup-script-prompts}
Prompt:
Generate a Python script using requests and BeautifulSoup to scrape [WEBSITE/URL].
Data to extract:
1. [FIELD NAME]: located at [CSS SELECTOR OR ELEMENT DESCRIPTION]
2. [FIELD NAME]: located at [CSS SELECTOR OR ELEMENT DESCRIPTION]
...etc.
Output: CSV file with headers [LIST HEADERS]
Technical requirements:
- Python version: 3.x
- Libraries: requests, beautifulsoup4
- Error handling: skip pages that fail, log errors
- Rate limiting: [DELAY IN SECONDS] between requests
Include:
1. requests.Session() for connection pooling
2. Proper headers (User-Agent, Accept-Language) to avoid basic blocks
3. BeautifulSoup parsing with html.parser
4. CSV writer with proper encoding
5. Error handling for request failures and parsing errors
Test URL: [URL WITH DATA TO SCRAPE]
[WEBSITE + DATA + TECHNICAL REQUIREMENTS]
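A script generated from a prompt like this tends to look something like the sketch below. Everything site-specific (the `div.item`, `h2.name`, and `span.price` selectors, the field names) is a hypothetical placeholder; the structural pieces mirror the prompt's checklist: a `requests.Session()` for connection pooling, browser-like headers, `html.parser`, a UTF-8 CSV writer, skip-and-log error handling, and a delay between requests:

```python
import csv
import logging
import time

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}
DELAY_SECONDS = 2.0  # rate limit between requests

def parse_page(html: str) -> list[dict]:
    """Extract one record per 'div.item' element (assumed selector)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select("div.item"):
        name = item.select_one("h2.name")
        price = item.select_one("span.price")
        if name is None or price is None:
            logging.warning("Skipping malformed item")  # log and skip
            continue
        records.append({"name": name.get_text(strip=True),
                        "price": price.get_text(strip=True)})
    return records

def scrape(urls: list[str], out_path: str) -> int:
    session = requests.Session()       # reuses connections across requests
    session.headers.update(HEADERS)
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        for url in urls:
            try:
                resp = session.get(url, timeout=15)
                resp.raise_for_status()
            except requests.RequestException as exc:
                logging.error("Request failed for %s: %s", url, exc)
                continue                # skip pages that fail
            for record in parse_page(resp.text):
                writer.writerow(record)
                written += 1
            time.sleep(DELAY_SECONDS)
    return written
```

Keeping `parse_page` separate from the network code is worth asking for explicitly: it makes the fragile, site-specific part easy to retest when the HTML changes.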
Selenium Script Generation Prompts {#selenium-script-generation-prompts}
Prompt:
Generate a Python Selenium script to scrape [WEBSITE/URL].
This website requires [JAVASCRIPT RENDERING / LOGIN / BUTTON CLICK TO LOAD DATA — describe what makes Selenium necessary]
Data to extract:
1. [FIELD NAME]: located at [CSS SELECTOR OR ELEMENT DESCRIPTION]
2. [FIELD NAME]: located at [CSS SELECTOR OR ELEMENT DESCRIPTION]
...etc.
Output: CSV file with headers [LIST HEADERS]
Technical requirements:
- Browser: Chrome (headless mode for automation)
- Libraries: selenium, webdriver-manager
- Explicit wait conditions for elements to load
- Error handling: gracefully handle elements not found
Selenium-specific requirements:
1. Initialize Chrome in headless mode
2. Use WebDriverWait with expected_conditions for dynamic content
3. Handle StaleElementReferenceException with retry logic
4. Close the browser properly in finally block
Include polite scraping practices: [DELAY] seconds between actions.
[WEBSITE + DATA + TECHNICAL REQUIREMENTS]
Pagination and Multi-Page Scraping Prompts {#pagination-multi-page-scraping-prompts}
Prompt:
Generate a Python scraping script that handles pagination for [WEBSITE].
Pagination type: [NUMBERED PAGES WITH NEXT BUTTON / "LOAD MORE" BUTTON / PAGE NUMBER IN URL / INFINITE SCROLL]
URL pattern: [HOW PAGE NUMBERS APPEAR IN URL — e.g., page=1, /page/1/]
Data to extract (same fields on each page):
1. [FIELD NAME]
2. [FIELD NAME]
...etc.
Pages to scrape: [NUMBER OR "ALL PAGES UNTIL NO MORE DATA"]
Script requirements:
1. Loop through all pages using the pagination pattern
2. Extract data from each page
3. Handle the last page (no more next button or URL pattern breaks)
4. Add [DELAY] seconds between page requests
5. Save incrementally or save all at once at the end
6. Report total pages scraped and records extracted
[WEBSITE + PAGINATION TYPE + DATA + PAGE COUNT]
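For the "page number in URL" case, the generated loop usually follows the shape below. The URL pattern is a made-up example, and the fetch function is injected as a parameter so the pagination logic can be exercised without network access:

```python
import time

from bs4 import BeautifulSoup

BASE_URL = "https://example.com/products?page={page}"  # assumed URL pattern
DELAY_SECONDS = 2.0

def parse_items(html: str) -> list[str]:
    # 'div.item' is an assumed selector for the per-page records.
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("div.item")]

def scrape_all_pages(fetch, max_pages=100, delay=DELAY_SECONDS):
    """Walk page=1, page=2, ... until a page yields no items.
    `fetch` is any callable url -> html, e.g. a requests.Session wrapper."""
    all_items = []
    pages_scraped = 0
    for page in range(1, max_pages + 1):
        html = fetch(BASE_URL.format(page=page))
        items = parse_items(html)
        if not items:               # empty page: pagination is exhausted
            break
        all_items.extend(items)
        pages_scraped += 1
        time.sleep(delay)           # be polite between page requests
    print(f"Scraped {pages_scraped} page(s), {len(all_items)} record(s)")
    return all_items
```

The empty-page check doubles as the "URL pattern breaks" handler: a 404 page or a results page with no matching elements both terminate the loop cleanly.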
Data Cleaning in Python Prompts {#data-cleaning-python-prompts}
Prompt:
I have scraped raw data from [SOURCE]. The data needs cleaning. Generate a Python script to clean and format it.
Raw data sample:
[PASTE 5-10 ROWS WITH RAW VALUES]
Issues to fix:
1. [ISSUE — e.g., HTML tags in text fields]
2. [ISSUE — e.g., inconsistent date formats]
3. [ISSUE — e.g., trailing whitespace]
4. [ISSUE — e.g., special characters not encoding properly]
5. [ISSUE — e.g., blank/null values that should be marked]
Target output format:
[CSV / JSON with specific field names and types]
Generate a Python cleaning script that:
1. Reads the raw data file
2. Applies cleaning transformations to each field
3. Validates data types (dates are dates, numbers are numbers, etc.)
4. Writes the cleaned data to [OUTPUT FORMAT]
5. Reports any records that could not be cleaned or validated
[CLEANING RULES + OUTPUT FORMAT]
Error Handling and Robustness Prompts {#error-handling-robustness-prompts}
Prompt:
I have a basic Python scraping script that works on [WEBSITE]. Help me add robust error handling and make it production-ready.
Current script approach:
[BRIEF DESCRIPTION OF WHAT THE SCRIPT DOES]
Common failure points I have observed:
1. [FAILURE — e.g., ConnectionError when website is slow]
2. [FAILURE — e.g., AttributeError when page structure changes]
3. [FAILURE — e.g., Timeout when JavaScript takes too long]
Add the following improvements:
1. Retry logic with exponential backoff for connection failures (max [NUMBER] retries)
2. Graceful handling of parsing errors — log and skip malformed records
3. Timeout handling for requests (max [SECONDS] seconds per request)
4. Logging to file for debugging failed scrapes
5. A check for robots.txt compliance before scraping
6. A signal handler to allow graceful interruption (Ctrl+C saves progress)
Generate the updated script with these error handling improvements.
[SCRIPT DESCRIPTION + FAILURE POINTS]
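Two of these improvements, retry with exponential backoff and a robots.txt check, can be sketched as standalone helpers using only the standard library. The delay schedule and retry count are the parameters you would fill in from the prompt:

```python
import time
import urllib.robotparser

def retry_with_backoff(func, max_retries=3, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError)):
    """Call func(); on failure wait base_delay * 2**attempt and retry.
    Raises the last exception once max_retries is exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

def allowed_by_robots(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check a URL against already-fetched robots.txt text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

In a real script you would wrap each page fetch in `retry_with_backoff` and call `allowed_by_robots` once per path before scraping; fetching the robots.txt file itself is left out of this sketch.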
Common Script Mistakes {#common-script-mistakes}
The most common mistake is not handling website structure changes. Even well-written scraping scripts break when websites update their HTML structure. Build in logging and error reporting so you know when a script stops working, and schedule regular checks to verify the script is extracting data correctly.
Another common mistake is not respecting rate limits. A scraping script that hammers a website with requests can overload the server and will often trigger IP blocks or rate-limit bans. Always add delays between requests, and reuse a single session for connection pooling rather than opening a new connection per request.
A third mistake is not handling encoding correctly. Websites use various character encodings (UTF-8, ISO-8859-1, Windows-1252). BeautifulSoup usually handles this automatically, but always verify the output encoding in your CSV and JSON files to avoid garbled characters.
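One way to verify encoding end to end is a quick round trip: write a row containing non-ASCII characters with an explicit encoding, read it back, and compare. The file path and sample values here are arbitrary:

```python
import csv
import os
import tempfile

# Sample row with non-ASCII characters that break under a wrong encoding.
rows = [{"name": "Café Münster", "price": "€12"}]

path = os.path.join(tempfile.gettempdir(), "products_demo.csv")

# newline="" is required by the csv module; encoding is stated explicitly.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

with open(path, encoding="utf-8") as f:
    round_tripped = list(csv.DictReader(f))
```

On the request side, setting `response.encoding = response.apparent_encoding` in requests can fix pages whose declared charset does not match their actual bytes, before the text ever reaches BeautifulSoup.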
FAQ {#faq}
What is the fastest Python library for web scraping?
For static pages, requests with BeautifulSoup is fast and efficient. For JavaScript-heavy pages, requests-html (which renders pages with a bundled headless Chromium) can be lighter-weight than full Selenium when you only need JavaScript rendering without browser interaction. For large-scale scraping, Scrapy is typically the fastest option because it issues requests asynchronously and has built-in concurrency.
How do I avoid getting blocked while scraping?
Use respectful scraping practices: add random delays between requests (2-5 seconds), rotate User-Agent headers, do not scrape during peak hours, and only scrape the data you actually need. Also check robots.txt and the website’s terms of service. If a website offers an API, use it instead of scraping.
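The delay-and-rotate advice above can be packaged as two small helpers. The User-Agent strings below are illustrative examples, not a maintained list; in practice you would keep them current with real browser releases:

```python
import random
import time

# Assumed pool of browser-like User-Agent strings (examples only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def polite_headers() -> dict:
    """Pick a different User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS),
            "Accept-Language": "en-US,en;q=0.9"}

def polite_sleep(low: float = 2.0, high: float = 5.0) -> None:
    """Random delay so requests don't arrive at a fixed, bot-like cadence."""
    time.sleep(random.uniform(low, high))
```

These would be called once per request, e.g. `session.get(url, headers=polite_headers())` followed by `polite_sleep()`.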
Can I scrape login-protected pages?
Yes, with Selenium you can automate login by filling in credentials and clicking the login button. Alternatively, you can use requests with a session cookie obtained from a logged-in browser session. Note that scraping behind login may violate the website’s terms of service.
How do I handle CAPTCHAs?
CAPTCHAs are intentionally difficult for automated tools. The practical options are: use the website's API if available, use a CAPTCHA solving service (which has ethical and legal implications), or avoid triggering CAPTCHAs in the first place through respectful scraping practices that keep your request patterns from looking bot-like.
Conclusion
ChatGPT accelerates Python web scraping script development by generating code that is close to production-ready from clear specifications. The key is specificity in your prompts — website structure, data fields, technical environment, and error handling requirements.
Key takeaways:
- Choose BeautifulSoup for static pages, Selenium for JavaScript-rendered content
- Be specific about website structure, data fields, and technical requirements in prompts
- Always add rate limiting and error handling to production scripts
- Test scripts on small samples before running large-scale extractions
- Monitor scripts for failures and update when website structure changes
Your next step: take a scraping task you need and run it through the appropriate script generation prompt. Test the output on a small sample before running the full extraction.