Web scraping - Get keys from website

I'm the creator of Cruzalinhas, and I apologize in advance for not having (yet) translated the site from Portuguese to English, which would make these matters clearer (and this answer shorter).

As you may have noticed, the site shows bus, subway and train routes in São Paulo in a way that makes it easier to find the connections between them when planning a trip (Google Maps and others do this automatically, but sometimes miss routes that a more interactive search would find).

It uses geohashes for "cheap" proximity searches between routes, and crawls its data from the São Paulo Transportation Company (SPTrans) website.
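To illustrate the idea, here is a hedged sketch (not the code Cruzalinhas actually uses; the route structure, function names and precision are made-up assumptions): each point of a route is encoded into a short geohash string, and routes are indexed by the cells they pass through, so finding nearby routes becomes a dictionary lookup instead of comparing every pair of coordinates.

    from collections import defaultdict

    BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

    def geohash_encode(lat, lon, precision=6):
        # Standard geohash: interleave longitude/latitude bits, emit base32 chars.
        lat_lo, lat_hi = -90.0, 90.0
        lon_lo, lon_hi = -180.0, 180.0
        bits, even = [], True
        while len(bits) < precision * 5:
            if even:  # longitude bit
                mid = (lon_lo + lon_hi) / 2
                bits.append(1 if lon > mid else 0)
                lon_lo, lon_hi = (mid, lon_hi) if lon > mid else (lon_lo, mid)
            else:     # latitude bit
                mid = (lat_lo + lat_hi) / 2
                bits.append(1 if lat > mid else 0)
                lat_lo, lat_hi = (mid, lat_hi) if lat > mid else (lat_lo, mid)
            even = not even
        return "".join(BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
                       for i in range(0, len(bits), 5))

    def index_routes(routes, precision=6):
        # routes: {route_id: [(lat, lon), ...]} -- hypothetical structure.
        cells = defaultdict(set)
        for route_id, points in routes.items():
            for lat, lon in points:
                cells[geohash_encode(lat, lon, precision)].add(route_id)
        return cells

    def routes_crossing(route_id, routes, cells, precision=6):
        # Any route sharing at least one geohash cell is a candidate connection.
        near = set()
        for lat, lon in routes[route_id]:
            near |= cells[geohash_encode(lat, lon, precision)]
        near.discard(route_id)
        return near

Two routes that share at least one cell are candidates for a connection; for points near a cell boundary you would also check the neighboring cells, which is the usual caveat with geohash-based lookups.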

To answer your first question: those IDs are the ones from the original site. However, they are not very stable (I've seen them remove an old ID and assign a new one just because a line changed routes), so Cruzalinhas does a full crawl every now and then and updates the entire database (I'd rather replace it completely, but Google App Engine makes that a bit harder than usual).

The good news: the site is open-sourced (http://github.com/chesterbr/cruzalinhas) under the MIT license. The documentation is also still in Portuguese, but you will mostly be interested in sptscraper, the command-line crawler.

The most efficient way to get the data is to run sptscraper.py download, then sptscraper.py parse, then sptscraper.py dump, and import the result from there (a short sketch of that workflow follows the help screen below). There are more options; here is a quick translation of the help screen:

Downloads and parses data from public transportation routes from the SPTrans
website.

It parses HTML files and stores the result in the linhas.sqlite file, which
can be used in other applications, converted to JSON or used to update
cruzalinhas itself.

Commands:
info          Shows the number of pending updates
download [id] Downloads HTML files from SPTrans (all or starting with id)
resume        Continues the download from the last line saved.
parse         Reads downloaded HTMLs and (soft) inserts/updates/deletes in
              the database.
list          Outputs a JSON with the route IDs from the database.
dump [id]     Outputs a JSON with all routes in the database (or just one)
hashes        Prints a JSON with the geohashes for each line (mapping to
              the routes that cross the hash)
upload        Uploads the pending changes in the database to cruzalinhas.
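As a minimal sketch of the download/parse/dump workflow driven from Python (assumptions not stated above: sptscraper.py sits in the current directory, runs under your local Python, and the dump command prints its JSON to stdout; the structure of that JSON is not documented here, so the sketch only loads and inspects it):

    import json
    import subprocess

    # Fetch the HTML pages from SPTrans, then parse them into linhas.sqlite.
    subprocess.run(["python", "sptscraper.py", "download"], check=True)
    subprocess.run(["python", "sptscraper.py", "parse"], check=True)

    # Dump all routes as JSON and load them for your own application.
    dump = subprocess.run(["python", "sptscraper.py", "dump"],
                          check=True, capture_output=True, text=True)
    routes = json.loads(dump.stdout)
    print(type(routes), len(routes))  # inspect the structure before importing it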

Keep in mind that this data is not collected with SPTrans' consent, even though it is public information and they are legally obliged to publish it. The site and the scraper were created as an act of protest against that, before the specific freedom-of-digital-information law was passed (even though previous legislation already regulated the availability of public-service information, so no illegal act was committed here, nor will be if you use it responsibly).

For that reason (and because the back end is a bit... "challenged"), the scraper is very careful to throttle its requests in order to avoid overloading their servers. That makes a full crawl span several hours, but you don't want to overload the service (which might force them to block you, or even to change the site to make crawling harder).
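If you write a similar crawler, here is a hedged sketch of that kind of throttling (not the scraper's actual code; the delay value and the idea of a plain URL list are illustrative assumptions):

    import time
    import urllib.request

    DELAY_SECONDS = 5  # generous pause between requests; tune it to stay gentle

    def fetch(url):
        with urllib.request.urlopen(url) as response:
            return response.read()

    def crawl(urls):
        pages = {}
        for url in urls:
            pages[url] = fetch(url)
            time.sleep(DELAY_SECONDS)  # throttle so the server is never hammered
        return pages

A fixed sleep between requests is the simplest approach; saving each page to disk as you go (as sptscraper's resume command suggests) also lets you restart a multi-hour crawl without hitting the server for pages you already have.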

I'll eventually do a full rewrite of that code (it was likely my first Python/App Engine code, written a few years ago as a quick hack focused on showing how useful this public data can be outside the confines of SPTrans' website). It will have a saner crawling process, should make the latest data available for download in a single click, and will likely expose a full list of lines through the API.

For now, if you just want the latest crawl (which I did a month or two ago), contact me and I'll be happy to send you the sqlite/JSON files.
