Andrew Rollins Technology, entrepreneurship, the Internet, and video games

21Feb/100

MongoDB Sharding

Messed around today with MongoDB sharding on version 1.2.2. It was pretty easy to setup. All I had to do was:

  1. Download MongoDB
  2. Do this setup

Here's the question that prompted me to try it: does MongoDB only fetch the necessary subset of chunks when doing range queries? The short answer is yes, it does. It was natural to assume it would, but I wanted to see it in action.

To test this, I created the test.people collection from step 2 above, then ran this with the mongo client:

for (i=0; i < 3000000; i++) { test.people.insert({name: i}) }

When that finished, I had three chunks.

> printShardingStatus(db.getSisterDB("config"))

--- Sharding Status ---
sharding version: { "_id" : ObjectId("4b7df95cc03e000000005d6b"), "version" : 2 }
shards:
{ "_id" : ObjectId("4b7df969c03e000000005d6c"), "host" : "localhost:10000" }
{ "_id" : ObjectId("4b7df96ec03e000000005d6d"), "host" : "localhost:10001" }
databases:
{ "name" : "admin", "partitioned" : false, "primary" : "localhost:20000", "_id" : ObjectId("4b7df969b90f00000000
56aa") }
my chunks
{ "name" : "test", "partitioned" : true, "primary" : "localhost:10001", "sharded" : { "test.people" : { "key" :
{ "name" : 1 }, "unique" : true } }, "_id" : ObjectId("4b7df982b90f0000000056ab") }
my chunks
test.people { "name" : { $minKey : 1 } } -->> { "name" : 0 } on : localhost:10001 { "t" : 126654
7599000, "i" : 4 }
test.people { "name" : 0 } -->> { "name" : 2595572 } on : localhost:10001 { "t" : 1266547599000,
"i" : 2 }
test.people { "name" : 2595572 } -->> { "name" : { $maxKey : 1 } } on : localhost:10000 { "t" :
1266547599000, "i" : 5 }

You'll see that three chunks exist above. Two are on one shard, one is on the other. The important point to notice is that one is for [0, 2595572) and another for [2595572, maxkey). I'm not sure why [minkey, 0) and [0, 2595572) wasn't just [minkey, 2595572), but that's something for another day. For the purposes of my range test, this suffices.

I then tried operations such as:

> db.people.find({ name: { $gt: 1, $lt: 3 } } )
{ "_id" : ObjectId("4b7df9c85a4800000000485d"), "name" : 2 }

> db.people.find({ name: { $gt: 2595573, $lt: 2595575 } } )
{ "_id" : ObjectId("4b7dfb8f903c000000003c92"), "name" : 2595574 }

> db.people.find({ name: { $gt: 2595570, $lt: 2595575 } } )
{ "_id" : ObjectId("4b7dfb8d5a4800000027e34f"), "name" : 2595572 }
{ "_id" : ObjectId("4b7dfb8f903c000000003c91"), "name" : 2595573 }
{ "_id" : ObjectId("4b7dfb8f903c000000003c92"), "name" : 2595574 }
{ "_id" : ObjectId("4b7dfb8d5a4800000027e34e"), "name" : 2595571 }

I watched the mongod output on these finds. The first two queries only hit one shard. The last query hit both shards. So MongoDB does in fact only query the necessary chunks even when doing range queries.

Tagged as: No Comments
29Dec/090

TechStars Application Tips

With the deadline looming for TechStars Boston 2010, I've been asked for tips from a few people. It's time to share them in a blog post. Hopefully this helps.

Disclaimer: this is just my perspective from Localytics, a TechStars Boston 2009 company. My tips are anecdotal at best. As always, exercise your own judgement.

  1. There are a lot of applications, so short and sweet works well. Eliminate fluff. Most of our answers were less than a few concise paragraphs. The only answer that was long was for "Tell us about each founder." That's because...
  2. ... it's more about the founders than it is about the idea. A non-trivial percentage of TechStars companies change there idea, and it's not a big deal. Your application should help people like David and Shawn get to know you. Highlight cool things you've worked on in the past. Show that you're motivated and smart.
  3. Be mentor-able. TechStars is all about mentoring. You need to be open to advice while still being able to think critically and make decisions. To the extent that you can, convey those personality traits in your application.
  4. Be dedicated to the startup. TechStars strongly favors founding teams that are 100% committed and ready to go.
  5. Get your application in early - TechStars is already looking at them. Getting in early means your application has more time to float in front of different eyes. You are at less risk of being lost in the last minute flood.
  6. Get feedback. If you're connected to any alums or mentors, see if they'll review your application. We resubmitted our application several times based on feedback (though be warned, I'm not sure if resubmitting is possible on the application form, we sent updates over email).
  7. If your company/product is already chugging along, then demonstrate progress. Show how your company has met goals or achieved milestones like X number of users or high profile client Y.
  8. Try to get into TechStars for a Day. It gives you a chance for face time. If you do go, bring a laptop and be ready to informally demo your product.
  9. Apply! It isn't that much work, so just do it.

You should also check out:

Best of luck to all of the applicants! TechStars is a truly amazing program.

17Nov/090

Ruby MD5 and SHA1 Digest Benchmark

I did a benchmark of MD5 and SHA1 digests in Ruby. The benchmark was done in Ruby 1.8.6 on Windows.

The code used to benchmark:

require 'benchmark'
require 'digest/sha1'
require 'digest/md5'
require 'base64'

# Demonstration of digests

puts "SHA1.hexdigest     #{Digest::SHA1.hexdigest('test')}"
puts "MD5.hexdigest      #{Digest::MD5.hexdigest('test')}"
puts "SHA1.digest        #{Base64.encode64(Digest::SHA1.digest('test'))}"
puts "MD5.digest         #{Base64.encode64(Digest::MD5.digest('test'))}"
puts "MD5.digest.inspect #{Digest::MD5.digest('test').inspect}"

print "\nSHA1.digest bytes "
Digest::SHA1.digest('test').each_byte {|c| printf("%4d", c) }
print "\nMD5.digest bytes  "
Digest::MD5.digest( 'test').each_byte {|c| printf("%4d", c) }
puts "\n\n"

# Benchmark

TIMES = 50_000

Benchmark.bmbm do |b|
b.report("sha1 hexdig") { TIMES.times do |x|; Digest::SHA1.hexdigest(x.to_s); end }
b.report("md5  hexdig") { TIMES.times do |x|; Digest::MD5.hexdigest(x.to_s); end }

b.report("sha1 digest") { TIMES.times do |x|; Digest::SHA1.digest(x.to_s); end }
b.report("md5  digest") { TIMES.times do |x|; Digest::MD5.digest(x.to_s); end }
end

The output:

SHA1.hexdigest     a94a8fe5ccb19ba61c4c0873d391e987982fbbd3
MD5.hexdigest      098f6bcd4621d373cade4e832627b4f6
SHA1.digest        qUqP5cyxm6YcTAhz05Hph5gvu9M=
MD5.digest         CY9rzUYh03PK3k6DJie09g==
MD5.digest.inspect "\t\217k\315F!\323s\312\336N\203&amp;amp;amp;'\264\366"

SHA1.digest bytes  169  74 143 229 204 177 155 166  28  76   8 115 211 145 233 135 152  47 187 211
MD5.digest bytes     9 143 107 205  70  33 211 115 202 222  78 131  38  39 180 246

Rehearsal -----------------------------------------------
sha1 hexdig   0.797000   0.000000   0.797000 (  0.796000)
md5  hexdig   0.641000   0.000000   0.641000 (  0.641000)
sha1 digest   0.625000   0.000000   0.625000 (  0.625000)
md5  digest   0.500000   0.000000   0.500000 (  0.500000)
-------------------------------------- total: 2.563000sec

user     system      total        real
sha1 hexdig   0.750000   0.032000   0.782000 (  0.781000)
md5  hexdig   0.593000   0.000000   0.593000 (  0.594000)
sha1 digest   0.594000   0.031000   0.625000 (  0.625000)
md5  digest   0.469000   0.000000   0.469000 (  0.468000)

As expected, digest methods are faster than hexdigest and MD5 is faster than SHA1.

Filed under: Ruby No Comments
11Aug/090

Database Normalization: First, Second, and Third Normal Forms

I read a great explanation of first, second, and third normal form a few weeks ago. For those that know what database normalization is but haven't seen the "forms", the different forms are essentially rules for having a well normalized relation DB. Keeping them in mind when doing DB design is key to keeping a great database. I'd like to make an attempt at condensing the linked tutorial into its essentials.

First Normal Form (1NF): No repeating elements or groups of elements

Don't repeat your columns. Avoid this:

OrderId ItemId1 ItemId2
1 100 101

ItemId1, 2, ... should be split out into relational tables.

Second Normal Form (2NF): No partial dependencies on a concatenated key

This is a complex way of saying that if a column isn’t intrinsically related to the entire primary key, then you should break out the primary key into different tables.

Example:

OrderId (PK) ItemId (PK) OrderDate
1 100 2009-01-01
1 101 2009-01-01

The primary key is (OrderId, ItemId).

Consider OrderDate. It is conceptually part of an order. An order always occurs at some time. But is an OrderDate related to an Item? Not really.

You may be saying, “but items are part of an order!”, and you would be right. But that’s not what I’m getting at. OrderDate is independent of the item itself.

Look at another way: in the table above the OrderDate will always be the same for a given OrderId regardless of the value of the ItemId column. This means data duplication, which is denormalization.

Here’s how we correct the problem:

Orders
OrderId (PK) OrderDate
1 2009-01-01
Order_Items
OrderId (PK) ItemId (PK)
1 100
1 101

Here is an excellent line from the article, “All we are trying to establish here is whether a particular order on a particular date relies on a particular item.”

Third Normal Form (3NF): No dependencies on non-key attributes

2NF covers the case of multi-column primary keys. 3NF is meant to cover single column keys. Simply stated, pull out columns that don’t directly relate to the subject of the row (the primary key), and put them in their own table.

Example:

Orders
OrderId (PK) OrderDate CustomerName CustomerCity
1 2009-01-01 John Smith Chicago

Customer information could be the subject of its own table. Pull out customer name and other customer fields into another table, and then put a Customer foreign key into Orders.

Wikipedia has a great quote from Bill Kent: “every non-key attribute ‘must provide a fact about the key, the whole key, and nothing but the key’.”

Tagged as: No Comments
21Jun/090

MySQL Join Performance

Earlier this week I was curious about the performance of JOINs in MySQL. How severe is the performance hit of joins? How much slower is a string join over an integer join? I decided to do some tests, and I’m going to share my results here.

I did these tests on a computer with the following specs and software:

  • 3.2 GHz Pentium 4
  • 2 GB of RAM
  • Windows XP
  • MySQL 5.0.51b running in development mode (no query caching)

I made two tables:

CREATE TABLE `parents` (
`id` int(10) unsigned NOT NULL auto_increment,
`uuid` varchar(128) default NULL,
`time` datetime default NULL,
`string` varchar(128) default NULL,
PRIMARY KEY  (`id`),
UNIQUE KEY `uuid` (`uuid`),
KEY `time` (`time`),
KEY `string` (`string`)
)

CREATE TABLE `children` (
`id` int(10) unsigned NOT NULL auto_increment,
`uuid` varchar(128) default NULL,
`parent_id` int(10) unsigned NOT NULL,
`parent_uuid` varchar(128) default NULL,
`time` datetime default NULL,
`string` varchar(128) default NULL,
PRIMARY KEY  (`id`),
UNIQUE KEY `uuid` (`uuid`),
KEY `parent_id` (`parent_id`),
KEY `parent_uuid` (`parent_uuid`),
KEY `time` (`time`),
KEY `string` (`string`)
)

I filled these tables with records using the following Ruby code:

require 'mysql'
require 'digest/sha1'
require 'base64'

RECORDS   = 10_000
stringS = ['blue', 'red', 'green'].freeze

my = Mysql::new("localhost", "user", "password", "db_perf")

my.query("DELETE FROM parents")
my.query("DELETE FROM children")
my.query("ALTER TABLE parents  AUTO_INCREMENT = 1")
my.query("ALTER TABLE children AUTO_INCREMENT = 1")

def make_uuid(arg)
Digest::SHA1.hexdigest(arg.to_s)
end

start_time = Time.now.tv_sec
(1..RECORDS).each do |i|
if ((i % 500) == 0)
puts "Iteration #{i}"
time    = Time.now.tv_sec
elapsed = time - start_time
time_per_record = elapsed.to_f/i
time_to_complete = time_per_record * RECORDS
puts "Minutes to complete: #{time_to_complete/60}"
end

uuid   = make_uuid(i)
time   = Time.at(i * 3600).strftime('%Y-%m-%d %H:%M')
string = stringS[i % stringS.length]
res  = my.query(
"INSERT INTO parents "                 +
"(uuid, time, string) VALUES " +
"('#{uuid}', '#{time}', '#{string}')" )
res  = my.query(
"INSERT INTO children "         +
"(uuid, time, string, parent_id, parent_uuid) VALUES " +
"('#{uuid}', '#{time}', '#{string}', #{i}, '#{uuid}')" )
end

puts ''
puts 'DONE'
end_time = Time.now.tv_sec
elapsed = end_time - start_time
puts "Elapsed time in minutes: #{elapsed.to_f/60}"

The table engine you use and the character encoding will make a difference. Why? Because the engine will affect how keys are stored and used, and the character encoding will affect the speed of string comparisons.

I use Rails these days, and when dealing with Rails, InnoDB (which is transactional) and UTF8 are common, so let’s look at them:

image

image

Blue designates queries without a join. They are:

Select Group by String SELECT * FROM children GROUP BY string
Select Group By Time Func SELECT * FROM children GROUP BY YEAR(time), DAYOFYEAR(time)
Select SELECT * FROM children

Red designates queries with a join, which are:

Join Integer Group by String SELECT * FROM children INNER JOIN (parents) ON (children.parent_id = parents.id) GROUP BY parents.string
Join Integer Group by Time Func SELECT * FROM children INNER JOIN (parents) ON (children.parent_id = parents.id) GROUP BY YEAR(children.time), DAYOFYEAR(children.time)
Join String Group by String SELECT * FROM children INNER JOIN (parents) ON (children.parent_uuid = parents.uuid) GROUP BY parents.string
Join String Group by Time Func SELECT * FROM children INNER JOIN (parents) ON (children.parent_uuid = parents.uuid) GROUP BY YEAR(children.time), DAYOFYEAR(children.time)
Join Integer SELECT * FROM children INNER JOIN (parents) ON (children.parent_id = parents.id)
Join String SELECT * FROM children INNER JOIN (parents) ON (children.parent_uuid = parents.uuid)

The 10K queries were run 10 times, and the 100K queries were run 6 times. I did a warmup round before each measured round (so for a 10K query, that means 10 times to warmup, 10 times to measure). The times you see in the charts are the average times. Be sure to turn off query caching.

The results speak for themselves, so I’m offering them up with no analysis for now.

Tagged as: , No Comments
24Mar/090

Twitter Search

A few weeks ago, I finally got around to trying Twitter’s search. All it took was a few queries for me to realize its potential.

(For those that don’t know, Twitter gained in-house search functionality when it acquired Summize last year.)

One afternoon, while I was mucking around with Ruby on Rails, I encountered a problem with a release candidate. Annoyed, I took to the Internet searching for a solution. Google wasn’t of much help; the RC was just too fresh for there to be much information on regular web pages.

Disappointed, I started thinking about other ways to find an answer. The Rails IRC channel was one possibility, their bug tracker was another. That's when it hit me - why not try Twitter?

It seemed logical. After all, Twitter is the place for soapbox style venting. Someone must have complained about the same problem. Maybe they even offered up a link to a solution.

I did a search, and to my amazement, I actually saw some interesting comments. For the first time, Twitter actually helped me save time rather than waste it.

Therein lies most of the value in Twitter. Forget for a moment the mundane, narcissistic, attention whoring tweets that occupy most of Twitter. If you cut out all the garbage, the rest of Twitter is a rich source of real-time information about problems, products, places, news, and more.

Social marketers already know this. They're using Twitter search to monitor brands and engage their customers. But allow me to go even broader. Twitter search isn’t just for social marketers, it holds great promise for the average seeker of information.

Want to hear what’s going on at an event? Favorite web site down? Check Twitter.

This is a new breed of real-time information. Think of it as collective experience. Sure, you could say Facebook has been doing it for ages, but Twitter is special. Twitter’s culture encourages public sharing. That’s what makes it great for searching.

Over time, people will slowly discover its value. As the mainstream starts to search Twitter and similar public streams, third parties like Google will start to realize the value in indexing it. They will provide new views into the information that we haven’t seen yet. Startups will also jump on board and innovate.

One could even argue that this is half the point of Google Friend Connect. Google is inserting itself into the chain so it can start analyzing and presenting that information for search.

Let’s not forget that a new source of information means a new area for targeted advertising. This is great for a company like Twitter, because it gives them another option to consider for their business model.

On that note, I’m starting to come around on the idea that Twitter could be a viable business. Expanding their reach, introducing users to their search, and then monetizing the results could be a great source of revenue just like it is for Google.

I’d also like to see Twitter start introducing business accounts. They should give businesses ways to manage groups of accounts so employees can participate in company tweeting.

Another possibility is to provide features to analyze the response to tweets (e.g., see how many people are tuning in, and what kinds of people are listening).

Twitter has already started taking steps in the right direction, and I really do hope it works out.

Tagged as: , No Comments
6Mar/090

The Past Few Months

It's been a while since I've updated my blog, so I thought I'd provide a rundown of what I've been doing.

Startup

We really found our focus over the past couple months. We are committed to making tools for mobile developers. Our name is Localytics, and our first tool is an analytics service for mobile applications.

We launched an early version of our analytics engine last week at a DemocampBoston1. Democamp involves a quick succession of presentations by local startups. Check out my recap.

We captured a video of our presentation. The presentation was 5 minutes with a 5 minute Q&A at the end. Check it out:

Registration is closed right now, but we will be opening it up in a couple weeks. During that time, we will add more features, documentation, and hopefully some guides on how to get the most out of the service.

For now, you can check out our demo, which might not make a whole lot of sense, but at least it's there and it works.

Gaming

I managed to get through a couple of the 2008 holiday season games. I have mixed feelings about most of them. The only ones I'd flat out recommend are Gears of War 2 and Left 4 Dead. Perhaps I'll go into more detail in future posts.

I finally discovered the wonder that is Xbox Live Arcade. There are some awesome games on there, many of them with a retro feel. Braid, Castle Crashers, and Geometry Wars are all great. Each is an update of classic game styles with gorgeous HD visuals.

Braid is a side-scroller with time manipulation, Castle Crashers invokes memories of Teenage Muntant Ninja Turtles beat-em-up games, and Geometry Wars is Asteroids on crack. I especially like how Geometry Wars pulls in high scores from your friends, encouraging you to beat them.

Tagged as: No Comments
21Oct/084

What an insane fall lineup

The games coming out this fall season are an assault on the senses. This is a way better lineup than last year. Games I care about include:

  • Dead Space - released Oct 13 to great reviews
  • Fable 2 - Oct 21
  • Far Cry 2 - Oct 22
  • Command and Conquer: Red Alert 3 - Oct 28
  • Fallout 3 - Oct 28
  • Gears of War 2 - Nov 7
  • Call of Duty: World at War - Nov 11
  • Mirror's Edge - Nov 11
  • World of Warcraft: Wrath of the Lich King - Nov 13
  • Left 4 Dead - Nov 17
  • Prince of Persia - Dec 2

I've bolded the ones I'm definitely going to buy. Almost all of these are Xbox 360 titles. There are a couple good PS3 games (such as Little Big Planet), but I don't have a PS3.

I'm cautious about Fallout 3. It has probably the biggest advertising campaign out of all of them, but I have a feeling it won't meet expectations.

Lots of sequels in there. In fact, the only ones that aren't a sequel or in some way part of an existing series are Dead Space, Mirror's Edge, and Left 4 Dead.

Filed under: Video Games 4 Comments
12Aug/080

Yahoo Fire Eagle Now Public

Yahoo Fire Eagle has come out of private beta, and it's now open to everyone. Fire Eagle provides a unified way for applications and web sites to share location data tied to specific users.

For your average web surfer or mobile device user, this means a single, unified way for sharing your location across all of the sites and applications you use (that are on Fire Eagle). When you update your location in one application, other applications get updated as well. Privacy controls let you decide exactly how that sharing should work.

So imagine you are using an application on your phone that works with Fire Eagle. Let's say that app uses your GPS location to find restaurants near you. When it does so, it notifies Fire Eagle where you are. Now suppose a completely different web site exists that lets people tell their friends where they are. If it works with Fire Eagle, then it'll see the update from the restaurant finder on your phone.

As a developer, this opens up a whole new range of possibilities, and it also increases the amount of data you could observe.

This is relevant to one of my previous posts on location tracking, and it's an area I'm really interested in. Will developers embrace it? Will users fear the privacy implications? We'll see.

6Aug/083

Languages and Their Library Cultures

I saw an interesting blog post today titled Why I stick with Perl. In short, the author's reason for sticking with Perl was that he believes it to have the best "library culture." A strong library culture is more important to him than the language itself.

What does "library culture" even mean? I would define it as the practices, beliefs, and institutions regarding the sharing of modular code. In other words, it refers to the systems for sharing code and the practices of developers concerning those systems.

Perl certainly has a very strong library culture. CPAN is one of the most extensive library repositories out there. In my experiences with the Perl community, the use of CPAN has always been encouraged, and I've generally been pleased with code quality and documentation.

Besides Perl, this made me think about other library cultures. I never realized how drastically different some of them are. Here are my thoughts on a few of the more interesting cases.

Ruby on Rails

Ruby on Rails is very much so a cut and paste world. Despite the existence of RubyForge, I still find myself grabbing code from blogs, wikis, or some random guy's self-hosted repository way too often. It's always a little unsettling.

For example, a typical plugin hunt involves going to the Rails wiki, grabbing the plugin from a random guy's subversion repo, making the changes suggested on the Rails wiki, then going to a forum or blog and copy-pasting the changes suggested there. After all this work, you later find out that some other guy did all this and checked it into GitHub in a new repo. Now whose changes do I follow, the original random guy's repo or the new repo from a new random guy? And who are these people anyway? It all feels very questionable.

PHP

PHP provides so much in the core extensions that I rarely need to look outside of what's already there. Whenever I do need something, PEAR or a professional framework like Zend Framework often has it. I can't think of a single time I ever needed code from a random repo.

C# and .NET

.NET is a monstrous library which has everything under the sun. Microsoft also provides great documentation. Due to these two factors, C# doesn't even really need a strong library culture, it already has a behemoth with a ton of money taking care of it.

Microsoft can't cover everything, though, so they'll always be the need for other libraries. Unfortunately, the third-party library culture isn't as open as it is in languages like Perl. The Microsoft world encourages licensing libraries instead of collaborative repositories.

Close
E-mail It