GSA/datagov-wptheme

Analysis on what open source projects are using federal datasets, using GitHub search

Open

#464 opened on Sep 16, 2014

7 comments, 0 reactions, 0 assignees. JavaScript, 1,889 stars, 457 forks.
Labels: batch import, content, feature-request, help wanted, public feedback

Description

You can search the full text of all open source code on GitHub (which is insane and magnificent), and one trick is to search it for domain names and URLs.

A code search for "data.gov" on GitHub returns over 45,000 results, plus ~350 issues. There are all kinds of data quality issues -- for example, "data.gov" is a substring of "api.data.gov" (so searches for one also match the other), dots aren't handled perfectly, and many repos will come up more than once. But it's still pretty cool.
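To make those data quality issues concrete, here's a rough sketch of how scraped hits could be cleaned up afterwards. It assumes each hit arrives as a `(repo, snippet)` pair, which is a made-up shape and not anything GitHub returns, and it assumes only `data.gov` and `www.data.gov` should count as matches. Which subdomains to count is a judgment call.

```python
import re

# Match "data.gov" (optionally "www.data.gov") only when it is not part of a
# longer hostname, so "api.data.gov", "mydata.gov" and "data.gov.uk" are skipped.
# Assumption: other subdomains are treated as separate sites.
BARE_DATA_GOV = re.compile(
    r"(?<![\w.-])(?:www\.)?data\.gov(?![\w-]|\.\w)", re.IGNORECASE
)


def mentions_bare_data_gov(snippet: str) -> bool:
    """Return True if the snippet references data.gov itself."""
    return BARE_DATA_GOV.search(snippet) is not None


def unique_repos(hits):
    """Collapse hits to one entry per repo, keeping first-seen order.

    `hits` is an iterable of (repo_full_name, snippet) pairs; this shape is
    hypothetical and would depend on how the results were collected.
    """
    matching = (repo for repo, snippet in hits if mentions_bare_data_gov(snippet))
    return list(dict.fromkeys(matching))
```

Deduplicating by repo name handles the "many repos come up more than once" problem, and the lookaround in the pattern handles the substring problem.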

I think it'd be more useful to take the URLs of datasets (and landing pages for datasets) known to Data.gov, run them through GitHub search, and see what comes up. It'd be a neat lead generator for finding people working on niche or in-progress tools that you'd be unlikely to come across in the press. Maybe it's even a way to get deserving projects more press.
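As a minimal sketch of the first step, dataset URLs could be normalized into search terms before being fed to search. The `example.gov` URLs in the usage lines are placeholders, and wrapping the term in quotes is an assumption about how the search box handles phrases, so treat this as a starting point only.

```python
from urllib.parse import urlsplit


def search_term(dataset_url: str) -> str:
    """Turn a dataset URL into a quoted host+path term.

    The scheme is dropped so that http:// and https:// references both
    match, and a trailing slash is ignored.
    """
    parts = urlsplit(dataset_url)
    host = parts.netloc.lower()
    path = parts.path.rstrip("/")
    return f'"{host}{path}"'


def search_terms(dataset_urls):
    """Build de-duplicated search terms, preserving input order."""
    return list(dict.fromkeys(search_term(url) for url in dataset_urls))


# Placeholder URLs, not real datasets.
terms = search_terms([
    "http://example.gov/data/file.csv",
    "https://example.gov/data/file.csv",
])
```

Here both URLs collapse into the single term `"example.gov/data/file.csv"`, which keeps the number of searches down.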

Unfortunately, GitHub doesn't offer an API or feed of search result data across GitHub -- you have to use the web interface. (There is a Search API, but unlike the web interface, it requires that searches be limited to a user, organization, or repository.)
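For the cases where scoping to one owner or repo is acceptable, a scoped query against the Search API could be built roughly like this. The function name and parameters are made up for illustration, and the sketch only builds the URL. It doesn't send a request or deal with authentication, rate limits, or paging through results.

```python
from urllib.parse import urlencode

SEARCH_CODE_ENDPOINT = "https://api.github.com/search/code"


def scoped_code_search_url(term, org=None, user=None, repo=None, page=1):
    """Build a code search URL limited to an org, user, or repo.

    Raises ValueError when no scope is given, mirroring the restriction
    described above.
    """
    scopes = (("org", org), ("user", user), ("repo", repo))
    qualifiers = [f"{name}:{value}" for name, value in scopes if value]
    if not qualifiers:
        raise ValueError("code search needs an org, user, or repo qualifier")
    query = " ".join([term, *qualifiers])
    return f"{SEARCH_CODE_ENDPOINT}?{urlencode({'q': query, 'page': page})}"


url = scoped_code_search_url("data.gov", org="GSA")
```

This obviously doesn't solve the "search all of GitHub" problem, but it could be looped over a known list of orgs or repos.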

So you'd probably have to scrape GitHub.com, the website, and page through the results. Which sounds like no fun, but to the right kind of brain might actually sound really fun.

One other maybe-fun caveat is that if there are datasets that have support libraries built for them, some projects may just reference those libraries instead -- and so the URL for the dataset would only appear in the library, not the project. I can imagine this being the case for a well-established data agency like the US Census, for example. So identifying support libraries and searching GitHub's code for references to them would also help dig up some leads. But even for the Census' domain name, there are lots of little projects.
