Description
Hi there!
I'm attempting to set up Hound to search the code from every WordPress plugin, which is approximately 48.5k separate repositories. I'm using #156 to avoid needing actual repositories for these, and instead using a home-grown tool; this avoids the overhead of svn/git history. I'm then using an external Python script to generate the config.json.
(Per #138, it seems like Hound might be copying the data across too? There's something like 20GB worth of data in the repos, so copying it is a tangible cost that I'd prefer to avoid, but one step at a time ;) )
From what I can see, there's significant overhead from having each of these as a separate repo; it appears Hound is generating an index for each repo? (It also runs into #139 despite bumping ulimit -n significantly.) They're all quite small repos, so ideally, I'd like to have them in a combined index instead.
With a single combined index, however, I can't build the URLs how I'd like. My directory structure is /srv/sync/plugins/<project>/<files>, but the URL structure is .../browser/<project>/trunk/<file>. With separate repos in the config, I can get this right, but not with a single combined repo.
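For reference, here's a minimal sketch of the per-repo approach that does produce the right URLs: one config entry per project directory, each with its own `url-pattern`. This is illustrative rather than my exact script. The `browser_base` argument stands in for the URL prefix elided above, the `"local"` vcs value is an assumption about what #156 accepts, and the `#L{line}` anchor is an assumption about the code browser's line-link format.

```python
import json
from pathlib import Path


def build_config(plugins_root, browser_base, vcs="local"):
    """Build a Hound-style config dict with one repo entry per project directory.

    browser_base is a placeholder for the real code-browser URL prefix, and
    vcs="local" is an assumption about the driver name provided by #156.
    """
    repos = {}
    for project_dir in sorted(Path(plugins_root).iterdir()):
        if not project_dir.is_dir():
            continue  # ignore stray files sitting next to the project directories
        project = project_dir.name
        repos[project] = {
            # file:// URI pointing at the already-synced plugin directory
            "url": project_dir.resolve().as_uri(),
            "vcs": vcs,
            "url-pattern": {
                # {path} and {anchor} are left for Hound to substitute; only the
                # project name is baked in here, which is exactly what a single
                # combined repo can't express.
                "base-url": f"{browser_base}/{project}/trunk/{{path}}{{anchor}}",
                "anchor": "#L{line}",  # assumed line-anchor format
            },
        }
    return {"dbpath": "data", "repos": repos}


def write_config(config, out_path):
    """Serialise the config dict to out_path as JSON."""
    with open(out_path, "w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
```

Calling `build_config("/srv/sync/plugins", browser_base)` and passing the result to `write_config` gives one entry per plugin, which is where the ~48.5k separate indexes come from.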
I'm not sure how best to solve this in a way that minimises duplication but still gives me the flexibility I need. Any immediate thoughts on how to attack this? Also, any thoughts on how to run Hound at this sort of scale would be appreciated. :)
(Happy to contribute code as needed. The motivation behind this is basically that whenever we want to make potentially breaking changes in WordPress, we want to check against the actual uses in the wild so we can determine impact.)