gocolly/colly

cache and proxy

Open

#187 opened on Jul 10, 2018

Labels: enhancement, help wanted

Description

Hey guys,

I have an issue. In real-world scraping with proxies, checking whether the response code is 200 or 500 is simply not enough. Plenty of proxies, or even the website itself, can return a body with status 200 that is invalid from the developer's point of view, and yet it still gets cached. Is there any way to apply custom logic that determines whether a response is what we want? For example:

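// Proposed API, not something colly supports today: a predicate decides whether to cache.
colly.CacheDir("some/dir", func(resp *colly.Response) bool {
   if bytes.Contains(resp.Body, []byte("some magic marker")) {
      return false // don't cache
   }
   return true // cache everything else
})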

I'm new to colly and I dug through the source code for a bit, but didn't find such a feature. It would also be useful without proxies.
