gocolly/colly

Pluggable cache system?

Open

#103 opened on Feb 11, 2018

enhancement, help wanted

Description

Hi,

Thanks for Colly! :)

I have a task with hundreds of thousands of pages, so obviously I am using Colly's caching, but it's basically 'too much' for the filesystem (wasted disk space, a pain to manage, slow to back up, etc.).

I'd like to propose a pluggable cache system, similar to how you've made other Colly components.

Perhaps with an API like this:-

type Cache interface {
	Init() error
	Get(url string) (*colly.Response, error)
	Put(url string, r *colly.Response) error
	Remove(url string) error
}

...or...

type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}

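The first one won't be possible if you then wish to implement FileSystemCache in a subpackage of Colly, though: the subpackage would have to import colly for colly.Response while colly imports the subpackage, and Go does not allow import cycles.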
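To make the second (byte-based) variant concrete, here is a minimal sketch of an in-memory backend that satisfies it. MemoryCache and ErrCacheMiss are hypothetical names invented for this example; the proposal itself does not say how a cache miss should be reported, so returning a sentinel error is just one assumption.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// Cache mirrors the byte-based interface proposed above.
type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}

// ErrCacheMiss is a hypothetical sentinel error; how a miss is signalled
// is an assumption of this sketch, not part of the proposal.
var ErrCacheMiss = errors.New("cache: miss")

// MemoryCache is a hypothetical in-memory backend used only for illustration.
// Init must be called before any other method.
type MemoryCache struct {
	mu      sync.RWMutex
	entries map[string][]byte
}

// Init allocates (or resets) the underlying map.
func (c *MemoryCache) Init() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries = make(map[string][]byte)
	return nil
}

// Get returns a copy of the stored body, or ErrCacheMiss.
func (c *MemoryCache) Get(url string) ([]byte, error) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	data, ok := c.entries[url]
	if !ok {
		return nil, ErrCacheMiss
	}
	return append([]byte(nil), data...), nil
}

// Put stores a copy of data so later changes by the caller do not leak in.
func (c *MemoryCache) Put(url string, data []byte) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[url] = append([]byte(nil), data...)
	return nil
}

// Remove deletes the entry; removing a missing URL is not an error here.
func (c *MemoryCache) Remove(url string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.entries, url)
	return nil
}

// Compile-time check that MemoryCache satisfies the interface.
var _ Cache = (*MemoryCache)(nil)

func main() {
	var c Cache = &MemoryCache{}
	if err := c.Init(); err != nil {
		panic(err)
	}
	_ = c.Put("https://example.com/", []byte("<html>ok</html>"))
	data, err := c.Get("https://example.com/")
	fmt.Println(string(data), err)

	_ = c.Remove("https://example.com/")
	_, err = c.Get("https://example.com/")
	fmt.Println(errors.Is(err, ErrCacheMiss))
}
```

Because this variant only deals in strings and byte slices, a backend like this needs nothing from the colly package and could live in a subpackage.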

The reason I also need a Remove method is that one project has a site that sometimes serves maintenance pages. I can detect these, but Colly currently has no way of saying "stop", or of rejecting a page after processing. Obviously, the last thing I want part way through a crawl is to have my cache poisoned. That's probably a separate issue, though, and one I can live with if I can remove the bad pages myself.
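As a rough sketch of how I would use Remove for the maintenance-page case: after processing a page, check the body and evict the URL if it looks bad. Everything here except the Remove signature is hypothetical (PageRemover, memCache, isMaintenancePage, evictIfMaintenance, and the marker text), and a real detector would be site-specific.

```go
package main

import (
	"bytes"
	"fmt"
)

// PageRemover is the one method of the proposed Cache interface needed here.
type PageRemover interface {
	Remove(url string) error
}

// memCache is a hypothetical stand-in backend so the sketch runs on its own.
type memCache struct{ pages map[string][]byte }

func (m *memCache) Remove(url string) error {
	delete(m.pages, url)
	return nil
}

// isMaintenancePage is a hypothetical detector; the marker string is made up.
func isMaintenancePage(body []byte) bool {
	return bytes.Contains(body, []byte("Down for maintenance"))
}

// evictIfMaintenance removes url from the cache when body looks like a
// maintenance page, and reports whether it did so.
func evictIfMaintenance(c PageRemover, url string, body []byte) (bool, error) {
	if !isMaintenancePage(body) {
		return false, nil
	}
	if err := c.Remove(url); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	cache := &memCache{pages: map[string][]byte{
		"https://example.com/a": []byte("<h1>Down for maintenance</h1>"),
		"https://example.com/b": []byte("<h1>Product list</h1>"),
	}}
	for _, url := range []string{"https://example.com/a", "https://example.com/b"} {
		evicted, err := evictIfMaintenance(cache, url, cache.pages[url])
		fmt.Println(url, "evicted:", evicted, "err:", err)
	}
	fmt.Println("pages left in cache:", len(cache.pages))
}
```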

If pluggable caches were to be implemented, I have existing code from another project that has a cache built using SQLite as a key-value store, compressing/decompressing the data with Zstandard (it's both surprisingly quick and super efficient on disk space), that I would happily port over. This can either become part of Colly, or a separate thing on my own GitHub.
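To give an idea of the shape (this is not my actual code), a compressing cache can be written as a wrapper around any other backend. To keep the sketch standard-library only, gzip stands in for Zstandard and a plain map stands in for SQLite; CompressedCache and mapCache are hypothetical names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"errors"
	"fmt"
	"io"
)

// Cache mirrors the byte-based interface proposed above.
type Cache interface {
	Init() error
	Get(url string) ([]byte, error)
	Put(url string, data []byte) error
	Remove(url string) error
}

// mapCache is a hypothetical stand-in for a real store such as SQLite.
type mapCache struct{ m map[string][]byte }

func (c *mapCache) Init() error { c.m = make(map[string][]byte); return nil }
func (c *mapCache) Get(url string) ([]byte, error) {
	data, ok := c.m[url]
	if !ok {
		return nil, errors.New("cache miss")
	}
	return data, nil
}
func (c *mapCache) Put(url string, data []byte) error { c.m[url] = data; return nil }
func (c *mapCache) Remove(url string) error           { delete(c.m, url); return nil }

// CompressedCache compresses on Put and decompresses on Get, delegating
// storage to Inner. gzip is used here only as a stand-in for Zstandard.
type CompressedCache struct{ Inner Cache }

func (c *CompressedCache) Init() error { return c.Inner.Init() }

func (c *CompressedCache) Put(url string, data []byte) error {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(data); err != nil {
		return err
	}
	// Close flushes the remaining compressed data into buf.
	if err := w.Close(); err != nil {
		return err
	}
	return c.Inner.Put(url, buf.Bytes())
}

func (c *CompressedCache) Get(url string) ([]byte, error) {
	packed, err := c.Inner.Get(url)
	if err != nil {
		return nil, err
	}
	r, err := gzip.NewReader(bytes.NewReader(packed))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func (c *CompressedCache) Remove(url string) error { return c.Inner.Remove(url) }

func main() {
	inner := &mapCache{}
	cache := &CompressedCache{Inner: inner}
	if err := cache.Init(); err != nil {
		panic(err)
	}
	page := bytes.Repeat([]byte("<p>hello</p>"), 1000)
	if err := cache.Put("https://example.com/", page); err != nil {
		panic(err)
	}
	got, err := cache.Get("https://example.com/")
	if err != nil {
		panic(err)
	}
	fmt.Println("round trip ok:", bytes.Equal(got, page))
	fmt.Println("raw bytes:", len(page), "stored bytes:", len(inner.m["https://example.com/"]))
}
```

The wrapper approach would also mean compression is not tied to one storage backend.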

I did start implementing this myself, but ran into a problem with how I went about it. I followed the pattern you have established of having the separate components as subpackages, and then got bitten because my FileSystemCache couldn't easily reach into the Collector to get the CacheDir. I was trying to preserve existing behaviour and API compatibility; maybe that's not an issue, or maybe these bits shouldn't be subpackages. Once I started thinking along those lines, I figured it was best to discuss before (or instead of) proceeding any further.
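One possible way around the CacheDir problem, purely as a sketch: give the filesystem cache its own directory at construction time, so it never has to look inside the Collector. NewFileSystemCache, the Dir field, and the SHA-256 file naming are all assumptions made for this example and are not Colly's existing behaviour or on-disk layout.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// FileSystemCache is a hypothetical backend that is told its directory up
// front instead of reading it from the Collector.
type FileSystemCache struct{ Dir string }

// NewFileSystemCache is a hypothetical constructor.
func NewFileSystemCache(dir string) *FileSystemCache {
	return &FileSystemCache{Dir: dir}
}

// Init creates the cache directory if it does not exist yet.
func (c *FileSystemCache) Init() error { return os.MkdirAll(c.Dir, 0o750) }

// path maps a URL to a file name; hashing with SHA-256 is an assumption
// of this sketch.
func (c *FileSystemCache) path(url string) string {
	sum := sha256.Sum256([]byte(url))
	return filepath.Join(c.Dir, hex.EncodeToString(sum[:]))
}

func (c *FileSystemCache) Get(url string) ([]byte, error) {
	return os.ReadFile(c.path(url))
}

func (c *FileSystemCache) Put(url string, data []byte) error {
	return os.WriteFile(c.path(url), data, 0o640)
}

// Remove deletes the cached file; a missing file is not treated as an error.
func (c *FileSystemCache) Remove(url string) error {
	err := os.Remove(c.path(url))
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	return err
}

func main() {
	dir, err := os.MkdirTemp("", "colly-cache-")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	cache := NewFileSystemCache(dir)
	if err := cache.Init(); err != nil {
		panic(err)
	}
	if err := cache.Put("https://example.com/", []byte("<html>ok</html>")); err != nil {
		panic(err)
	}
	data, err := cache.Get("https://example.com/")
	fmt.Println(string(data), err)

	fmt.Println(cache.Remove("https://example.com/"))
	_, err = cache.Get("https://example.com/")
	fmt.Println(errors.Is(err, os.ErrNotExist))
}
```

The cost is that callers would have to construct the cache themselves rather than just setting CacheDir, so this trades some API compatibility for decoupling.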

Your thoughts?
