Sponge
Sponge helps you transform the limitless amount of unstructured content on the web into valuable information that can be used anywhere. Whether the source is a website, a blog or a forum, Sponge is a full-featured, flexible and extensible web crawler that runs on any platform and helps you crawl what you want, how you want.
Websites are crawled with configurable HTTP spiders. Sponge provides a very simple web UI where users can annotate any web page and define how to extract the data they are interested in.
Once data is extracted, it is fed to the processing pipeline, where it can be manipulated before you use it in your own service or application.
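To make the idea of a processing pipeline concrete, here is a minimal sketch in Python. It is not Sponge's actual API: the names `Record`, `strip_whitespace`, `lowercase_keys` and `run_pipeline` are hypothetical and only illustrate how extracted records could pass through a chain of steps before reaching your application.

```python
from typing import Callable, Iterable

# Assumption: an extracted record is a flat mapping of field name to text.
Record = dict[str, str]
Step = Callable[[Record], Record]


def strip_whitespace(record: Record) -> Record:
    """Trim leading and trailing whitespace from every field value."""
    return {key: value.strip() for key, value in record.items()}


def lowercase_keys(record: Record) -> Record:
    """Lowercase field names so downstream code sees consistent keys."""
    return {key.lower(): value for key, value in record.items()}


def run_pipeline(records: Iterable[Record], steps: list[Step]) -> list[Record]:
    """Apply each step, in order, to every extracted record."""
    processed = []
    for record in records:
        for step in steps:
            record = step(record)
        processed.append(record)
    return processed
```

For example, `run_pipeline([{"Title": "  Hello  "}], [strip_whitespace, lowercase_keys])` yields a single record with the key `title` and the value `Hello`. Keeping each step a plain function makes it easy to reorder, add or remove steps.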
Any website is a valuable source of information. The robots you design in Sponge let you decode any raw website by iteratively highlighting areas of the site and mapping them to the structure you want. With the user-friendly web interface, data mapping takes only a couple of clicks.
Sponge's algorithms identify the unique, relevant HTML landmarks for each highlighted area: CSS classes, DOM components, DOM tree branch paths, or unique combinations of these.
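As an illustration of what a CSS-class landmark means in practice, the sketch below collects the text of every element that carries a given class, using only Python's standard `html.parser` module. It is a simplified example and not Sponge's own algorithm; `ClassTextExtractor` and `extract_by_class` are hypothetical names, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text of every element carrying a given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.depth = 0                  # > 0 while inside a matching element
        self.buffer: list[str] = []     # text pieces of the current match
        self.matches: list[str] = []    # finished matches, in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1             # nested element inside a match
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class in classes:
            self.depth = 1              # entering a matching element

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        self.depth -= 1
        if self.depth == 0:             # the matching element just closed
            self.matches.append("".join(self.buffer).strip())
            self.buffer.clear()

    def handle_data(self, data):
        if self.depth > 0:
            self.buffer.append(data)


def extract_by_class(html: str, css_class: str) -> list[str]:
    """Return the text content of each element with the given class."""
    parser = ClassTextExtractor(css_class)
    parser.feed(html)
    parser.close()
    return parser.matches
```

A real landmark would usually combine several signals, such as a class together with a DOM path, so that the match stays unique when the page layout changes slightly.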

Features

    Web UI for easy configuration
    Multi-threaded
    Job scheduling
    Monitoring
    Supports pages rendered with JavaScript
    Language detection
    URL normalization
    Configurable crawling speed
    Detects modified and deleted documents
    Supports sitemap.xml
    Supports robot rules
    Supports canonical URLs
    Document filters based on URL, HTTP headers, content, or metadata
    Can re-process or delete URLs no longer linked by other crawled pages
    Different URL extraction strategies for different content types (RSS, HTML, XML)
    Reference XML/HTML elements using simple DOM tree navigation
    Configurable hit intervals according to different schedules
    Customizable User Agents
    Configurable maximum crawling depth
    Different frequencies for re-crawling certain pages
    Can crawl millions of records on average hardware
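Two of the features above, URL normalization and robot rules, can be sketched with Python's standard library. This is only an illustration under stated assumptions, not how Sponge implements them: `normalize_url` and `is_allowed` are hypothetical helpers, the normalization rules shown are a common minimal subset, and the sketch ignores userinfo and IPv6 hosts.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form so equivalent URLs are crawled only once."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # An empty path and "/" address the same resource.
    path = parts.path or "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, parts.query, ""))


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the rules of an already downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these helpers, `HTTP://Example.COM:80/a/b?x=1#top` and `http://example.com/a/b?x=1` normalize to the same string, and a robots.txt containing `Disallow: /private/` for all user agents causes `is_allowed` to reject any URL under `/private/`.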