Web Scraping

An automated process of extracting data from web pages for further processing or storage.

What is web scraping?

Web scraping (also known as web harvesting or web data extraction) is an automated process in which software downloads web page content and extracts structured data from it, such as text, prices, contact details, or other information.

How web scraping works

1. A bot requests a web page and downloads its HTML, much like a browser does, but usually without rendering the page.
2. It parses the HTML and extracts the required data (headings, paragraphs, tables).
3. It saves the data in a structured format (JSON, CSV) or writes it directly to a database.
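The three steps above can be sketched with Python's standard library alone. This is a minimal illustration, not a production scraper: the function names, the `example-scraper/0.1` user agent, and the choice to collect only `h1` to `h3` and `p` elements are assumptions made for the example.

```python
import json
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, user_agent="example-scraper/0.1"):
    """Step 1: download the raw HTML (nothing is rendered or executed)."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HeadingAndParagraphParser(HTMLParser):
    """Step 2: collect the text of headings and paragraphs."""

    WANTED = {"h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._tag = None      # the wanted element we are currently inside
        self._chunks = []     # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self._tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            # Collapse runs of whitespace into single spaces.
            text = " ".join("".join(self._chunks).split())
            if text:
                self.records.append({"tag": tag, "text": text})
            self._tag = None


def extract_records(html):
    parser = HeadingAndParagraphParser()
    parser.feed(html)
    parser.close()
    return parser.records


def to_json(records):
    """Step 3: serialise the extracted data to a structured format."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

A typical call would be `to_json(extract_records(fetch_html(url)))`. Real projects often use a dedicated parsing library instead of a hand-written `HTMLParser` subclass; the standard library is used here only to keep the sketch self-contained.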

Use in RAG systems

Scraping is a key tool when building a knowledge base for RAG (retrieval-augmented generation): it lets you automatically download the content of an entire website (for example, by following its sitemap), convert it to Markdown, and store it in a vector database.
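As a rough sketch of the first two stages of that pipeline, the code below lists the page URLs in a sitemap and converts simple HTML to Markdown. The function names are hypothetical, the converter handles only headings and paragraphs, and the sitemap is assumed to use the standard sitemaps.org namespace. Embedding the text and writing it to a vector database is left out because it depends on the product you choose.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml):
    """Return every <loc> URL listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


class MarkdownConverter(HTMLParser):
    """Very small HTML-to-Markdown converter (headings and paragraphs only)."""

    PREFIX = {"h1": "# ", "h2": "## ", "h3": "### ", "p": ""}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.PREFIX:
            self._tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:
            text = " ".join("".join(self._chunks).split())
            if text:
                self.blocks.append(self.PREFIX[tag] + text)
            self._tag = None


def html_to_markdown(html):
    converter = MarkdownConverter()
    converter.feed(html)
    converter.close()
    # Separate blocks with a blank line, as Markdown expects.
    return "\n\n".join(converter.blocks)
```

In a full pipeline, each URL returned by `urls_from_sitemap` would be downloaded, passed through `html_to_markdown`, split into chunks, and embedded before being stored.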

Legal considerations

Always check the website's terms of service and robots.txt file before scraping. Some websites explicitly prohibit it.
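The robots.txt part of that check can be automated with `urllib.robotparser` from Python's standard library. The helper name `is_allowed` and the `example-scraper` user agent below are assumptions for illustration. Passing this check does not replace reading the terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example robots.txt content (made up for this sketch).
robots_txt = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(robots_txt, "example-scraper", "https://example.com/public/page"))
print(is_allowed(robots_txt, "example-scraper", "https://example.com/private/data"))
```

To check a live site rather than a string, `RobotFileParser` also provides `set_url()` and `read()`, which download the file before `can_fetch()` is called.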