Thursday, March 31, 2011

Visualization: Pixels, Degrees and lots of data!

My visualization project has led me down a rabbit hole I never knew existed.  When I think of the world, I think about the miles between location X and location Y, which can easily be translated into differences in latitude and longitude.  Though I never knew the actual conversion, I knew it was mathematically trivial, and I had never really thought about how this works in something like Google Maps.

Mapping Coordinates and Pixels
Of course, when dealing with mapping software the world can no longer be represented as an ellipsoid.  The standard way to project a globe onto a flat surface is the Mercator projection.  Using this projection, we can display the globe as a flat map.  There is some weirdness about the way the Mercator projection works, which you can read about here.  Once the world has been projected into this form we can easily display the map, but Google maps this coordinate system onto yet another coordinate system: their tile system.  This tile system is likely not a surprise to anyone who has ever used Google Maps, but the way that it works did surprise me a little bit.  Google has a predefined tile size (256 x 256 pixels by default).  As you zoom in, the entire world is broken up into more and more tiles, but the viewport always shows about the same number of them.  For example, at zoom level 1 the world is a 2x2 grid of tiles, but at zoom level 19 it's 524,288 x 524,288.  That's a lot of tiles!
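To make the mapping from coordinates to tiles concrete, here is a minimal sketch of the usual Web Mercator tile math.  It assumes the common convention that zoom level z has 2^z tiles per side with the origin at the top-left corner; the function name and the latitude clamp are my own choices for illustration, not anything taken from the Google Maps API.

```python
import math

MAX_MERCATOR_LAT = 85.05112878  # assumed clamp; Mercator is undefined at the poles


def lat_lng_to_tile(lat_deg, lng_deg, zoom):
    """Return the (x, y) tile that contains a point at the given zoom level."""
    tiles_per_side = 2 ** zoom
    lat_deg = max(min(lat_deg, MAX_MERCATOR_LAT), -MAX_MERCATOR_LAT)
    lat_rad = math.radians(lat_deg)
    # Normalize to [0, 1]: x grows eastward, y grows southward from the top-left.
    x_norm = (lng_deg + 180.0) / 360.0
    y_norm = (1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0
    # Clamp so that the far edges (x_norm or y_norm == 1) land in the last tile.
    x = min(int(x_norm * tiles_per_side), tiles_per_side - 1)
    y = min(int(y_norm * tiles_per_side), tiles_per_side - 1)
    return x, y


# The same point falls into a different tile at every zoom level.
for zoom in (0, 1, 3):
    print(zoom, lat_lng_to_tile(41.85, -87.65, zoom))
```

The same function can be called once per zoom level for each point, which is exactly the lookup the project needs.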

Why does it matter?
Now the question is, where does this fit into my visualization project?  What I need is the ability to map a lat/long (or group of lat/longs) into a particular tile at every zoom level.  The initial scope of this project is to gather about a week's worth of data and allow the user to view this data at all zoom levels.  The data is currently being gathered at about 6 tweets/second.  6 * 60 * 60 gives us 21,600 tweets per hour, and 21,600 * 24 * 7 gives us 3,628,800 (call it 4 million) for the week.  The obvious (and bad) solution to this problem would be to simply create a database of the 4 million tweets at each zoom level, and then when I wanted information for a particular tile (for a particular time) I would count the tweets in that tile/time, make a color out of that number and color the tile.  This means scanning millions of rows every time a tile is drawn, and we can do much better.  The initial plan is to do a bunch of data preprocessing that will allow the data to be grouped by both location and time, but that is a topic for another day.
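As a rough illustration of why per-tile counts can be computed ahead of time, here is one possible sketch (not the preprocessing deferred above, and it ignores the time dimension).  It assumes each tweet has already been assigned an (x, y) tile at the deepest zoom level, and it relies on the quadtree property that a tile's parent one zoom level up is found by halving both coordinates.  The function name is hypothetical.

```python
from collections import Counter

# Volume estimate from the paragraph above: 6 tweets/second for one week.
TWEETS_PER_WEEK = 6 * 60 * 60 * 24 * 7  # 3,628,800


def aggregate_counts(deepest_tiles, max_zoom):
    """Count tweets per (zoom, x, y) for every zoom level from max_zoom down to 0.

    deepest_tiles: iterable of (x, y) tile coordinates at max_zoom, one per tweet.
    """
    counts = Counter()
    for x, y in deepest_tiles:
        for zoom in range(max_zoom, -1, -1):
            counts[(zoom, x, y)] += 1
            # Each parent tile covers a 2x2 block of child tiles.
            x >>= 1
            y >>= 1
    return counts


counts = aggregate_counts([(0, 0), (1, 1), (3, 2)], max_zoom=2)
print(counts[(1, 0, 0)], counts[(0, 0, 0)])
```

With counts stored per tile, coloring a tile becomes a single lookup instead of a scan over every tweet.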


Thursday, March 24, 2011

Visualization Project and Google Maps

This semester I've been tasked with finding a reasonable project for my advanced graphics course.  A small part of me wanted to use a basic physics engine, get some nice graphics and make my own version of Angry Birds or something, but I decided I'd go with something else.  Maybe something to do with the social world...

I switched my plan to a visualization of individual tweets on top of something like NASA World Wind, but as I began to investigate it, I found the idea wasn't very useful. A visualization of an individual tweet isn't nearly as cool as a visualization of many tweets over time, which brought me to my final idea (with the help of Tim)...

I think I'm going to do a visualization of Twitter geographical data on top of Google Maps (or Google Earth).  This way I'm aggregating the data before showing it to the user, which should be much easier to understand.  My plan is as follows:

  1. Divide the map up into a grid.  For starters I will just do this on a single major city.
  2. At different zoom levels the data will be regrouped.  This way when a user is looking at the entire city it should show them a coarse set of grid cells, and as they zoom in, the cells will become more detailed.
  3. Provide (in tooltips or based on click events) details about the data in the grid cell they're selecting.
  4. Provide some kind of coloring of each grid cell so that a user can tell where the most tweets are coming from.
  5. Weight the tweets based on how old they are (e.g. a tweet from yesterday doesn't influence the coloring as much as a tweet from today).
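
Steps 4 and 5 could be sketched roughly as follows.  The plan does not say how tweets should be weighted or colored, so the exponential decay, the one-day half-life, and the white-to-red color ramp are all assumptions made for illustration, as are the function names.

```python
def tweet_weight(age_days, half_life_days=1.0):
    """Weight of a tweet that is age_days old; halves every half_life_days (assumed)."""
    return 0.5 ** (age_days / half_life_days)


def cell_score(tweet_ages_days, half_life_days=1.0):
    """Sum of the age-weighted tweets that fall in one grid cell."""
    return sum(tweet_weight(age, half_life_days) for age in tweet_ages_days)


def score_to_rgb(score, max_score):
    """Map a cell score to a color from white (no activity) to red (busiest cell)."""
    if max_score <= 0:
        return (255, 255, 255)
    intensity = max(0.0, min(score / max_score, 1.0))
    fade = int(round(255 * (1.0 - intensity)))
    return (255, fade, fade)


# A cell with tweets from today, yesterday and two days ago.
score = cell_score([0, 1, 2])
print(score, score_to_rgb(score, max_score=3.0))
```

A decay like this means yesterday's tweets still count, just half as much as today's under the assumed half-life.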

This should provide an aggregation of data that shows where the most geo-coded tweets come from.  The plan is to architect this in such a way that, if it works well, I could expand my search to include a larger area.


Tuesday, March 22, 2011

Facebook, REST APIs and latency

Last week I took some time to do some research about Facebook and the amount of time it would take to get the friend data I wanted.  To get a baseline I used our internal development servers to get my list of friends.  The problem was that I didn't have a good way to actually see what was coming across the wire, so I chose to use Fiddler to capture this information.  My friend and co-worker Leif Wickland did a bit of hand holding to show me the ins and outs of this tool and we used it to reroute requests on our local network and capture the data that was coming across the wire.

Here's what we found out:

  • The current API we're using for Facebook supports XML but does not support JSON
  • The current API also does not allow gzip-encoding

Both of these things were a huge bummer, but since it's what we have to work with I continued on.  I did some tests and found that about 500 friends was about 50KB of data.  Now, this might not really sound like much in the world today, but the request takes a whopping 550ms to complete.  That's crazy for only 50KB of data!  Because of the way TCP works, the transfer takes 6 round trips, and since our ping to Facebook is about 70ms, 420ms of that 550ms (6 x 70ms) is wasted in TCP overhead.  While there is nothing I can do about the TCP overhead (though it will be much better in our hosted environment), it's definitely something I will keep in mind when dealing with our other REST APIs.
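
To see where a number like 6 round trips can come from, here is a small, idealized model of TCP slow start, in which the sender doubles the number of segments it sends each round trip.  The 1460-byte segment size and the initial window of 2 segments are assumptions, not values measured in the capture, and the model ignores delayed ACKs, packet loss and receiver limits.

```python
import math


def slow_start_round_trips(payload_bytes, mss_bytes=1460, initial_cwnd=2):
    """Round trips needed to deliver a payload under idealized slow start."""
    segments_left = math.ceil(payload_bytes / mss_bytes)
    cwnd = initial_cwnd  # segments the sender may send in this round trip
    round_trips = 0
    while segments_left > 0:
        segments_left -= cwnd
        cwnd *= 2  # the window doubles every round trip during slow start
        round_trips += 1
    return round_trips


def round_trip_overhead_ms(round_trips, rtt_ms):
    """Time spent purely waiting on round trips."""
    return round_trips * rtt_ms


data_trips = slow_start_round_trips(50 * 1024)
total_trips = 1 + data_trips  # one extra round trip for the TCP handshake
overhead = round_trip_overhead_ms(total_trips, rtt_ms=70)
print(total_trips, overhead, round(overhead / 550, 2))
```

Under these assumed values the model lands on the same six round trips and 420ms, but a different initial window gives a different count, so treat it as an explanation of the shape of the problem rather than a measurement.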

Just to satisfy my curiosity, I calculated that each friend was about 100 bytes of XML.  If I were to convert to using the Facebook Graph API and switch to JSON as my data format, it would be more like 80 bytes per friend.  While that's a nice savings, it's even better if you then use the compressed JSON stream, which is only 15 bytes per friend.
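
A comparison like this is easy to reproduce on synthetic data.  The sketch below builds 500 made-up friend records (the field names and values are hypothetical, not Facebook's actual schema), serializes them as XML and as compact JSON, and gzips the JSON.  The absolute numbers will not match the 100/80/15 bytes above, since they depend on the real payload, but the ordering should be the same.

```python
import gzip
import json
import xml.etree.ElementTree as ET


def make_friends(count):
    """Build hypothetical friend records; the field names are made up."""
    return [{"id": str(100000000 + i), "name": f"Friend Number {i}"}
            for i in range(count)]


def to_xml_bytes(friends):
    """Serialize the records as a flat XML document."""
    root = ET.Element("friends")
    for friend in friends:
        node = ET.SubElement(root, "friend")
        ET.SubElement(node, "id").text = friend["id"]
        ET.SubElement(node, "name").text = friend["name"]
    return ET.tostring(root, encoding="utf-8")


def to_json_bytes(friends):
    """Serialize the records as compact JSON (no extra whitespace)."""
    return json.dumps(friends, separators=(",", ":")).encode("utf-8")


friends = make_friends(500)
xml_payload = to_xml_bytes(friends)
json_payload = to_json_bytes(friends)
gzip_payload = gzip.compress(json_payload)

for label, payload in [("XML", xml_payload),
                       ("JSON", json_payload),
                       ("JSON + gzip", gzip_payload)]:
    print(f"{label}: {len(payload) / len(friends):.1f} bytes per friend")
```

Synthetic records like these are far more repetitive than real names and IDs, so gzip will look better here than it would on real data.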

I think it might be time to do some rewriting...