Sponge
Sponge helps you transform vast amounts of unstructured content into valuable information that can be used anywhere. Whether the source is a website, a blog, or a forum, Sponge is a full-featured, flexible, and extensible web crawler that runs on any platform and helps you crawl what you want, how you want.
Websites are crawled with configurable HTTP spiders. Sponge provides a simple web UI where users can annotate any web page and define how to extract the data they are interested in.
Once data is extracted, it is fed to the processing pipeline, where it can be manipulated before being used in your own service or application.
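As a rough illustration of the pipeline idea, the sketch below passes an extracted record through an ordered list of steps. This is not Sponge's actual API: the record shape and the names `run_pipeline`, `strip_whitespace`, and `lowercase_keys` are assumptions made for the example.

```python
from typing import Callable, Dict, Iterable

# Hypothetical record shape: field name -> extracted text.
Record = Dict[str, str]
Step = Callable[[Record], Record]


def strip_whitespace(record: Record) -> Record:
    """Trim leading and trailing whitespace from every value."""
    return {key: value.strip() for key, value in record.items()}


def lowercase_keys(record: Record) -> Record:
    """Normalize field names to lowercase."""
    return {key.lower(): value for key, value in record.items()}


def run_pipeline(record: Record, steps: Iterable[Step]) -> Record:
    """Apply each step in order; each step receives the previous step's output."""
    for step in steps:
        record = step(record)
    return record


if __name__ == "__main__":
    extracted = {"Title": "  Hello world ", "Author": " Jane "}
    print(run_pipeline(extracted, [strip_whitespace, lowercase_keys]))
```

Keeping each step a plain function that takes and returns a record makes steps easy to reorder or test in isolation.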
Any website is a valuable source of information. The robots you design in Sponge let you decode any raw website by iteratively highlighting areas of the site and mapping them to the structure you want. With the user-friendly web interface, data mapping can be done in a couple of clicks.
Sponge identifies the unique relevant HTML landmarks (CSS classes, DOM components, DOM tree branch paths or unique combinations of these) using complex and smart algorithms.
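To make the notion of an HTML landmark concrete, here is a minimal sketch that finds elements by CSS class and reports the DOM tree branch path leading to each one. It uses only Python's standard `html.parser` and is not Sponge's algorithm; the class name `LandmarkExtractor` is hypothetical, and the code assumes reasonably well-formed HTML in which every non-void start tag has a matching end tag.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LandmarkExtractor(HTMLParser):
    """Collect (branch path, text) pairs for elements carrying a CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.stack = []            # currently open tags, root first
        self.capture_depth = None  # stack depth of the element being captured
        self.buffer = []           # text collected inside that element
        self.matches = []          # results: (path, text)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        classes = (dict(attrs).get("class") or "").split()
        if self.capture_depth is None and self.css_class in classes:
            self.capture_depth = len(self.stack)
            self.buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        if self.capture_depth == len(self.stack):
            path = " > ".join(self.stack)
            self.matches.append((path, "".join(self.buffer).strip()))
            self.capture_depth = None
        self.stack.pop()

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.buffer.append(data)


if __name__ == "__main__":
    page = ('<html><body><div class="post">'
            '<h1 class="title big">Hello</h1><p>Body text</p>'
            '</div></body></html>')
    extractor = LandmarkExtractor("title")
    extractor.feed(page)
    print(extractor.matches)
```

A class name and a branch path can then be combined to single out one element when the class alone is not unique on the page.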

Features

  • Web UI for easy configuration
  • Multi-threaded
  • Job scheduling
  • Monitoring
  • Supports pages rendered with JavaScript
  • Language detection
  • URL normalization
  • Configurable crawling speed
  • Detects modified and deleted documents
  • Supports sitemap.xml
  • Supports robot rules
  • Supports canonical URLs
  • Document filters based on URL, HTTP headers, content, or metadata
  • Can re-process or delete URLs no longer linked by other crawled pages
  • Different URL extraction strategies for different content types (RSS, HTML, XML)
  • Reference XML/HTML elements using simple DOM tree navigation
  • Configurable hit intervals according to different schedules
  • Customizable User Agents
  • Configurable maximum crawling depth
  • Different frequencies for re-crawling certain pages
  • Can crawl millions of records on average hardware
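Two of the features above, URL normalization and robot rules, can be sketched with Python's standard library. This is an illustrative assumption about what such steps might involve, not a description of how Sponge implements them; `normalize_url`, `allowed_by_robots`, and the `ExampleBot` user agent are hypothetical names. The normalization here only lowercases the scheme and host, drops default ports and fragments, and defaults an empty path to `/`; it also discards any user-info part of the URL.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form so equivalent URLs compare equal."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # The fragment is dropped because it is not sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(normalize_url("HTTP://Example.COM:80/a/b#section"))
    rules = "User-agent: *\nDisallow: /private/\n"
    print(allowed_by_robots(rules, "ExampleBot", "http://example.com/private/x"))
    print(allowed_by_robots(rules, "ExampleBot", "http://example.com/public"))
```

Normalizing before checking whether a URL has already been visited helps avoid crawling the same page twice under different spellings.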