Web Scraping
An automated process of extracting data from web pages for further processing or storage.
What is web scraping?
Web scraping (also known as web harvesting or web data extraction) is an automated process in which software downloads web page content and extracts structured data from it, such as text, prices, contact details, or other information.
How web scraping works
- A bot requests a web page and receives its HTML code (as a browser would, but usually without rendering the page).
- It parses the HTML and extracts the required data (headings, paragraphs, tables).
- The data is saved to a structured format (JSON, CSV) or directly into a database.
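The three steps above can be sketched with Python's standard library alone. This is a minimal illustration, not a production scraper: the `SAMPLE_HTML` string stands in for a downloaded page, and the names `TextExtractor` and `scrape` are hypothetical helpers introduced only for this example.

```python
import json
from html.parser import HTMLParser

# Stand-in for a downloaded page (assumption: a real scraper would
# fetch this over HTTP instead of using a hard-coded string).
SAMPLE_HTML = """
<html><body>
  <h1>Product list</h1>
  <p>Widget, 9.99 EUR</p>
  <p>Gadget, 19.50 EUR</p>
</body></html>
"""


class TextExtractor(HTMLParser):
    """Collects the text content of <h1> and <p> elements."""

    def __init__(self):
        super().__init__()
        self.records = []     # extracted {"tag": ..., "text": ...} dicts
        self._current = None  # tag currently being read, if any
        self._buffer = []     # text fragments of the current element

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            self.records.append({"tag": tag, "text": text})
            self._current = None


def scrape(html):
    """Parse an HTML string and return the extracted records."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.records


records = scrape(SAMPLE_HTML)          # step 2: parse and extract
as_json = json.dumps(records, indent=2)  # step 3: structured output
```

Real pages are rarely this tidy, so scrapers often rely on a dedicated HTML parsing library with selector support rather than a hand-written `HTMLParser` subclass.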
Use in RAG systems
Scraping is a key tool when building a knowledge base for RAG (retrieval-augmented generation): it lets you automatically download the content of an entire website (for example, by following its sitemap), convert it to Markdown, and store it in a vector database.
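As a rough sketch of the sitemap step only, the snippet below lists the page URLs declared in a sitemap so that each one can then be downloaded and converted. It assumes the standard sitemap XML namespace; `SAMPLE_SITEMAP` and `urls_from_sitemap` are hypothetical names used for illustration, and the Markdown conversion and vector storage are left out because they depend on the tools chosen.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Stand-in for a downloaded sitemap.xml (assumption: normally fetched
# from the site being scraped).
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/intro</loc></url>
</urlset>
"""


def urls_from_sitemap(xml_text):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


page_urls = urls_from_sitemap(SAMPLE_SITEMAP)
```

Each URL in `page_urls` would then go through the load, parse, and save steps described earlier.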
Legal considerations
Always check the website's terms of service and its robots.txt file before scraping. Some websites explicitly prohibit scraping.
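The robots.txt part of that check can be automated. The sketch below uses Python's standard `urllib.robotparser` module on an inline example file; `SAMPLE_ROBOTS`, `is_allowed`, and the user agent name `my-scraper` are assumptions made for illustration. It does not cover the terms of service, which still have to be read by a person.

```python
from urllib.robotparser import RobotFileParser

# Stand-in for a site's robots.txt (assumption: normally downloaded
# from the root of the site, e.g. https://example.com/robots.txt).
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


public_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/blog/post")
private_ok = is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data")
```

A scraper would typically run this check before each request and skip any URL for which it returns `False`.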