Best AI Prompts for Web Scraping Scripts with ChatGPT
TL;DR
- ChatGPT can generate complete Python web scraping scripts when provided with the target website structure, data requirements, and technical context
- BeautifulSoup scripts work best for static HTML pages; Selenium scripts are needed for JavaScript-heavy pages
- The most effective scraping prompts specify the target website, data fields, and technical stack — vague prompts produce broken scripts
- Script quality depends on prompt specificity — website structure, anti-scraping measures, and error handling requirements must be included
- Always add rate limiting and respectful scraping practices to generated scripts to avoid IP blocks and server overload
- Generated scripts should be reviewed and tested before running on large datasets
Introduction
Web scraping scripts automate data extraction from websites. Unlike no-code scraping tools, scripts give you complete control over the extraction process, handle complex scenarios, and can be scheduled and integrated into data pipelines. The challenge is that writing scraping scripts from scratch is time-consuming, and maintaining them as websites change requires ongoing engineering effort.
ChatGPT accelerates scraping script development by generating working code from clear specifications. Given a description of what you want to extract and from which website, ChatGPT can produce a Python script using BeautifulSoup, Selenium, or Scrapy that handles the core extraction logic. You then test, debug, and maintain it.
The key to getting working scripts from ChatGPT is specificity. A prompt that says “write a script to scrape a website” produces an outline. A prompt that describes the website structure, the specific data fields, the technical environment, and the error handling requirements produces a script that is much closer to production-ready.
Table of Contents
- BeautifulSoup vs. Selenium: Which to Use
- Basic BeautifulSoup Script Prompts
- Selenium Script Generation Prompts
- Pagination and Multi-Page Scraping Prompts
- Data Cleaning in Python Prompts
- Error Handling and Robustness Prompts
- Common Script Mistakes
- FAQ
BeautifulSoup vs. Selenium: Which to Use {#beautifulsoup-vs-selenium}
BeautifulSoup parses the HTML that a web server sends back. It works on any website where the data you want is in the initial HTML response. It is faster and uses less memory than Selenium. Use BeautifulSoup when:
- The target data loads with the initial page HTML
- The website does not require JavaScript execution to display content
- Speed and efficiency are important
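As a quick illustration of how small a static-page extraction can be, here is a minimal parsing sketch. The `h2.title` selector is an assumption for illustration; the actual fetch with requests is shown only as a comment:

```python
from bs4 import BeautifulSoup

def extract_titles(html: str) -> list[str]:
    """Collect text from each matching element in already-fetched HTML.
    The 'h2.title' selector is a placeholder; inspect your target page
    and substitute the real one."""
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("h2.title")]

# Typical use with requests (not executed here):
#   html = requests.get("https://example.com/products", timeout=10).text
#   print(extract_titles(html))
```

Because the parsing logic is a pure function of the HTML string, it can be tested against saved page snapshots without hitting the live site.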
Selenium automates a real browser (Chrome, Firefox) to render pages the way a human user would see them. It executes JavaScript, fills forms, clicks buttons, and handles dynamic content loading. Use Selenium when:
- The target data loads via JavaScript after the initial page load
- The website requires interaction (login, search, filtering) before data appears
- The website has anti-bot measures that detect non-browser access
ChatGPT can generate scripts for both. Specify which you need based on the website you are scraping.
Basic BeautifulSoup Script Prompts {#basic-beautifulsoup-script-prompts}
Prompt:
Generate a Python script using requests and BeautifulSoup to scrape [WEBSITE/URL].
Data to extract:
1. [FIELD NAME]: located at [CSS SELECTOR OR ELEMENT DESCRIPTION]
2. [FIELD NAME]: located at [CSS SELECTOR OR ELEMENT DESCRIPTION]
...etc.
Output: CSV file with headers [LIST HEADERS]
Technical requirements:
- Python version: 3.x
- Libraries: requests, beautifulsoup4
- Error handling: skip pages that fail, log errors
- Rate limiting: [DELAY IN SECONDS] between requests
Include:
1. requests.Session() for connection pooling
2. Proper headers (User-Agent, Accept-Language) to avoid basic blocks
3. BeautifulSoup parsing with html.parser
4. CSV writer with proper encoding
5. Error handling for request failures and parsing errors
Test URL: [URL WITH DATA TO SCRAPE]
[WEBSITE + DATA + TECHNICAL REQUIREMENTS]
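A script generated from a prompt like this tends to look something like the sketch below. Everything site-specific (the `div.item`, `h2.name`, and `span.price` selectors, the field names) is a hypothetical placeholder; the structural pieces mirror the prompt's checklist: a `requests.Session()` for connection pooling, browser-like headers, `html.parser`, a UTF-8 CSV writer, skip-and-log error handling, and a delay between requests:

```python
import csv
import logging
import time

import requests
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO)

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}
DELAY_SECONDS = 2.0  # rate limit between requests

def parse_page(html: str) -> list[dict]:
    """Extract one record per 'div.item' element (assumed selector)."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for item in soup.select("div.item"):
        name = item.select_one("h2.name")
        price = item.select_one("span.price")
        if name is None or price is None:
            logging.warning("Skipping malformed item")  # log and skip
            continue
        records.append({"name": name.get_text(strip=True),
                        "price": price.get_text(strip=True)})
    return records

def scrape(urls: list[str], out_path: str) -> int:
    session = requests.Session()       # reuses connections across requests
    session.headers.update(HEADERS)
    written = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        for url in urls:
            try:
                resp = session.get(url, timeout=15)
                resp.raise_for_status()
            except requests.RequestException as exc:
                logging.error("Request failed for %s: %s", url, exc)
                continue                # skip pages that fail
            for record in parse_page(resp.text):
                writer.writerow(record)
                written += 1
            time.sleep(DELAY_SECONDS)
    return written
```

Keeping `parse_page` separate from the network code is worth asking for explicitly: it makes the fragile, site-specific part easy to retest when the HTML changes.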
Selenium Script Generation Prompts {#selenium-script-generation-prompts}
Prompt:
Generate a Python Selenium script to scrape [WEBSITE/URL].
This website requires [JAVASCRIPT RENDERING / LOGIN / BUTTON CLICK TO LOAD DATA — describe what makes Selenium necessary]
Data to extract:
1. [FIELD NAME]: located at [CSS SELECTOR OR ELEMENT DESCRIPTION]
2. [FIELD NAME]: located at [CSS SELECTOR OR ELEMENT DESCRIPTION]
...etc.
Output: CSV file with headers [LIST HEADERS]
Technical requirements:
- Browser: Chrome (headless mode for automation)
- Libraries: selenium, webdriver-manager
- Explicit wait conditions for elements to load
- Error handling: gracefully handle elements not found
Selenium-specific requirements:
1. Initialize Chrome in headless mode
2. Use WebDriverWait with expected_conditions for dynamic content
3. Handle StaleElementReferenceException with retry logic
4. Close the browser properly in finally block
Include polite scraping practices: [DELAY] seconds between actions.
[WEBSITE + DATA + TECHNICAL REQUIREMENTS]
Pagination and Multi-Page Scraping Prompts {#pagination-multi-page-scraping-prompts}
Prompt:
Generate a Python scraping script that handles pagination for [WEBSITE].
Pagination type: [NUMBERED PAGES WITH NEXT BUTTON / "LOAD MORE" BUTTON / PAGE NUMBER IN URL / INFINITE SCROLL]
URL pattern: [HOW PAGE NUMBERS APPEAR IN URL — e.g., page=1, /page/1/]
Data to extract (same fields on each page):
1. [FIELD NAME]
2. [FIELD NAME]
...etc.
Pages to scrape: [NUMBER OR "ALL PAGES UNTIL NO MORE DATA"]
Script requirements:
1. Loop through all pages using the pagination pattern
2. Extract data from each page
3. Handle the last page (no more next button or URL pattern breaks)
4. Add [DELAY] seconds between page requests
5. Save incrementally or save all at once at the end
6. Report total pages scraped and records extracted
[WEBSITE + PAGINATION TYPE + DATA + PAGE COUNT]
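For the "page number in URL" case, the generated loop usually follows the shape below. The URL pattern is a made-up example, and the fetch function is injected as a parameter so the pagination logic can be exercised without network access:

```python
import time

from bs4 import BeautifulSoup

BASE_URL = "https://example.com/products?page={page}"  # assumed URL pattern
DELAY_SECONDS = 2.0

def parse_items(html: str) -> list[str]:
    # 'div.item' is an assumed selector for the per-page records.
    soup = BeautifulSoup(html, "html.parser")
    return [el.get_text(strip=True) for el in soup.select("div.item")]

def scrape_all_pages(fetch, max_pages=100, delay=DELAY_SECONDS):
    """Walk page=1, page=2, ... until a page yields no items.
    `fetch` is any callable url -> html, e.g. a requests.Session wrapper."""
    all_items = []
    pages_scraped = 0
    for page in range(1, max_pages + 1):
        html = fetch(BASE_URL.format(page=page))
        items = parse_items(html)
        if not items:               # empty page: pagination is exhausted
            break
        all_items.extend(items)
        pages_scraped += 1
        time.sleep(delay)           # be polite between page requests
    print(f"Scraped {pages_scraped} page(s), {len(all_items)} record(s)")
    return all_items
```

The empty-page check doubles as the "URL pattern breaks" handler: a 404 page or a results page with no matching elements both terminate the loop cleanly.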
Data Cleaning in Python Prompts {#data-cleaning-python-prompts}
Prompt:
I have scraped raw data from [SOURCE]. The data needs cleaning. Generate a Python script to clean and format it.
Raw data sample:
[PASTE 5-10 ROWS WITH RAW VALUES]
Issues to fix:
1. [ISSUE — e.g., HTML tags in text fields]
2. [ISSUE — e.g., inconsistent date formats]
3. [ISSUE — e.g., trailing whitespace]
4. [ISSUE — e.g., special characters not encoding properly]
5. [ISSUE — e.g., blank/null values that should be marked]
Target output format:
[CSV / JSON with specific field names and types]
Generate a Python cleaning script that:
1. Reads the raw data file
2. Applies cleaning transformations to each field
3. Validates data types (dates are dates, numbers are numbers, etc.)
4. Writes the cleaned data to [OUTPUT FORMAT]
5. Reports any records that could not be cleaned or validated
[CLEANING RULES + OUTPUT FORMAT]
Error Handling and Robustness Prompts {#error-handling-robustness-prompts}
Prompt:
I have a basic Python scraping script that works on [WEBSITE]. Help me add robust error handling and make it production-ready.
Current script approach:
[BRIEF DESCRIPTION OF WHAT THE SCRIPT DOES]
Common failure points I have observed:
1. [FAILURE — e.g., ConnectionError when website is slow]
2. [FAILURE — e.g., AttributeError when page structure changes]
3. [FAILURE — e.g., Timeout when JavaScript takes too long]
Add the following improvements:
1. Retry logic with exponential backoff for connection failures (max [NUMBER] retries)
2. Graceful handling of parsing errors — log and skip malformed records
3. Timeout handling for requests (max [SECONDS] seconds per request)
4. Logging to file for debugging failed scrapes
5. A check for robots.txt compliance before scraping
6. A signal handler to allow graceful interruption (Ctrl+C saves progress)
Generate the updated script with these error handling improvements.
[SCRIPT DESCRIPTION + FAILURE POINTS]
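Two of these improvements, retry with exponential backoff and a robots.txt check, can be sketched as standalone helpers using only the standard library. The delay schedule and retry count are the parameters you would fill in from the prompt:

```python
import time
import urllib.robotparser

def retry_with_backoff(func, max_retries=3, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError)):
    """Call func(); on failure wait base_delay * 2**attempt and retry.
    Raises the last exception once max_retries is exhausted."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...

def allowed_by_robots(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Check a URL against already-fetched robots.txt text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

In a real script you would wrap each page fetch in `retry_with_backoff` and call `allowed_by_robots` once per path before scraping; fetching the robots.txt file itself is left out of this sketch.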
Common Script Mistakes {#common-script-mistakes}
The most common mistake is not handling website structure changes. Even well-written scraping scripts break when websites update their HTML structure. Build in logging and error reporting so you know when a script stops working, and schedule regular checks to verify the script is extracting data correctly.
Another common mistake is not respecting rate limits. A scraping script that hammers a website with requests can overload the server and will often trigger IP blocks or rate-limit bans. Always add delays between requests, and reuse a single session for connection pooling rather than opening a new connection per request.
A third mistake is not handling encoding correctly. Websites use various character encodings (UTF-8, ISO-8859-1, Windows-1252). BeautifulSoup usually handles this automatically, but always verify the output encoding in your CSV and JSON files to avoid garbled characters.
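One way to verify encoding end to end is a quick round trip: write a row containing non-ASCII characters with an explicit encoding, read it back, and compare. The file path and sample values here are arbitrary:

```python
import csv
import os
import tempfile

# Sample row with non-ASCII characters that break under a wrong encoding.
rows = [{"name": "Café Münster", "price": "€12"}]

path = os.path.join(tempfile.gettempdir(), "products_demo.csv")

# newline="" is required by the csv module; encoding is stated explicitly.
with open(path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

with open(path, encoding="utf-8") as f:
    round_tripped = list(csv.DictReader(f))
```

On the request side, setting `response.encoding = response.apparent_encoding` in requests can fix pages whose declared charset does not match their actual bytes, before the text ever reaches BeautifulSoup.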
FAQ {#faq}
What is the fastest Python library for web scraping?
For static pages, requests with BeautifulSoup is fast and efficient. For JavaScript-heavy pages, requests-html (which renders pages with a bundled headless Chromium) can be lighter-weight than full Selenium when you only need JavaScript rendering without browser interaction. For large-scale scraping, Scrapy is typically the fastest option because it issues requests asynchronously and has built-in concurrency.
How do I avoid getting blocked while scraping?
Use respectful scraping practices: add random delays between requests (2-5 seconds), rotate User-Agent headers, do not scrape during peak hours, and only scrape the data you actually need. Also check robots.txt and the website’s terms of service. If a website offers an API, use it instead of scraping.
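The delay-and-rotate advice above can be packaged as two small helpers. The User-Agent strings below are illustrative examples, not a maintained list; in practice you would keep them current with real browser releases:

```python
import random
import time

# Assumed pool of browser-like User-Agent strings (examples only).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
]

def polite_headers() -> dict:
    """Pick a different User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS),
            "Accept-Language": "en-US,en;q=0.9"}

def polite_sleep(low: float = 2.0, high: float = 5.0) -> None:
    """Random delay so requests don't arrive at a fixed, bot-like cadence."""
    time.sleep(random.uniform(low, high))
```

These would be called once per request, e.g. `session.get(url, headers=polite_headers())` followed by `polite_sleep()`.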
Can I scrape login-protected pages?
Yes, with Selenium you can automate login by filling in credentials and clicking the login button. Alternatively, you can use requests with a session cookie obtained from a logged-in browser session. Note that scraping behind login may violate the website’s terms of service.
How do I handle CAPTCHAs?
CAPTCHAs are intentionally difficult for automated tools. The practical options are: use the website's API if available, use a CAPTCHA solving service (which has ethical and legal implications), or avoid triggering CAPTCHAs in the first place through respectful scraping practices that keep your request patterns from looking bot-like.
Conclusion
ChatGPT accelerates Python web scraping script development by generating code that is close to production-ready from clear specifications. The key is specificity in your prompts — website structure, data fields, technical environment, and error handling requirements.
Key takeaways:
- Choose BeautifulSoup for static pages, Selenium for JavaScript-rendered content
- Be specific about website structure, data fields, and technical requirements in prompts
- Always add rate limiting and error handling to production scripts
- Test scripts on small samples before running large-scale extractions
- Monitor scripts for failures and update when website structure changes
Your next step: take a scraping task you need and run it through the appropriate script generation prompt. Test the output on a small sample before running the full extraction.