Recap: Northeast Scala 2014 Day 2

I was in NY this past weekend attending nescala 2014 day 2. The day was organized as an unconference. I managed to attend a few presentations, namely Functional Reactive Programming, Concurrent Revisions, and Batshit-crazy Algebra with Types (full unconference grid is here). Here are my notes.

Functional Reactive Programming

Peter Fraenkel and Guillaume Marceau both presented on this topic at nescala.

The basic idea of FRP is explained on this Stack Overflow page: What is (functional) reactive programming?

If I had to boil it down, I might say something like this: imagine that variables which change over time are first-class entities. You incorporate these variables into other calculations or reference them elsewhere, and whenever they change, all dependent calculations "react" and update automatically. Now try to do this in a functional way. Under the covers this works through a directed acyclic graph (DAG) that tracks what depends on what.
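To make that concrete, here's a toy sketch in Ruby (my own illustration, not a real FRP library, and not particularly functional): a Cell holds a value, a derived Cell recomputes from its inputs, and setting an input pushes the change through the dependency graph.

# Toy illustration of the "react" idea -- not a real FRP library.
class Cell
  attr_reader :value, :dependents

  def initialize(value = nil, &compute)
    @value      = value
    @compute    = compute
    @dependents = []
  end

  # Build a cell whose value is computed from other cells.
  def self.derive(*inputs, &compute)
    cell = new { compute.call(*inputs.map(&:value)) }
    inputs.each { |input| input.dependents << cell }
    cell.refresh
  end

  # Setting a source value re-runs every dependent calculation.
  def value=(new_value)
    @value = new_value
    @dependents.each(&:refresh)
  end

  def refresh
    @value = @compute.call
    @dependents.each(&:refresh)
    self
  end
end

a   = Cell.new(1)
b   = Cell.new(2)
sum = Cell.derive(a, b) { |x, y| x + y }

puts sum.value   # => 3
a.value = 10     # sum "reacts" automatically
puts sum.value   # => 12

A real FRP implementation also worries about glitches, update ordering, and time-varying semantics; this just shows the propagation idea.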

Here are some resources for follow-up:

Concurrent Programming with Revisions and Isolation Types

This presentation was by Daniel James (@dwhjames). The idea: imagine managing state in your Scala code as if it were git, with concurrent tasks working against their own isolated revisions of the data and merging the results afterwards. Here's a link to the presentation.

Batshit-crazy Algebra with Types

I always enjoy going to Scala meetups and watching talks on type theory, especially ones that create great "aha!" moments. This was one of those talks. The talk was by Jon Pretty. Here are the slides.

The slides alone can't do justice to how well Jon presented the topic, so you might want to look at a series of blog posts by Chris Taylor starting with The Algebra of Algebraic Data Types, Part 1. Chris is listed as inspiration for Jon's talk.
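To give a flavor of the correspondence (my quick summary, not Jon's or Chris's notation): count each type by its number of inhabitants. Nothing is 0, Unit is 1, Either[A, B] is a + b, a pair (A, B) is a * b, and Option[A] is 1 + a. Recursive types become equations you can manipulate algebraically: List[A] satisfies L = 1 + a * L, which "solves" to L = 1 / (1 - a) = 1 + a + a^2 + a^3 + ..., i.e. a list is empty, or has one element, or two, and so on.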

Someone in the audience also suggested these books (some of which may be free online):

  • Analytic Combinatorics
  • generatingfunctionology
  • Enumerative Combinatorics

I'm not sure this talk has much practical application, but it presents another way of looking at the world, one that will expand your mind as a programmer. For those of us who don't live and breathe algebraic types, Jon helps you make a few mental leaps that might just surprise you.


MongoDB "Too Many Open Files"? Raise the limit

This blog post is intended to supplement the "Too Many Open Files" page in the mongoDB docs.

Raising the file limit for MongoDB

If you installed from the Ubuntu/Debian package, there is a simple way to increase the open file limit. MongoDB's startup script is /etc/init/mongodb.conf. This is an upstart script, which supersedes the /etc/init.d scripts we're all used to. All you need to do is add "ulimit -n {X}" to the script block before mongod is launched, replacing X with the limit you want (I use 65000). That sets the ulimit for the script and any processes it launches (and therefore mongod). Here is an example /etc/init/mongodb.conf:

# Ubuntu upstart file at /etc/init/mongodb.conf

pre-start script
  mkdir -p /db/mongodb/
  mkdir -p /var/log/mongodb/
end script

start on runlevel [2345]
stop on runlevel [06]

script
  ulimit -n 65000

  ENABLE_MONGODB="yes"
  if [ -f /etc/default/mongodb ]; then . /etc/default/mongodb; fi
  if [ "x$ENABLE_MONGODB" = "xyes" ]; then
    exec start-stop-daemon --start --quiet --chuid mongodb --exec  /usr/bin/mongod -- --config /etc/mongodb.conf
  fi

end script

Once MongoDB is up, you can check the limit for the process by doing cat /proc/{pid}/limits. Replace {pid} with the pid of mongod. To get the pid, just do "ps aux | grep mongod".

If you aren't using the install packages, then you'll need to add the ulimit command to your own startup script. Keep in mind that ulimit is a bash built-in, not a standalone binary, so look at "man bash" (or "help ulimit") for more info on ulimit.

This blog post suggests that you can set system-wide limits by editing /etc/security/limits.conf and /etc/pam.d/common-session; however, this only applies to interactive and non-interactive shells (and processes started by them). On its own, this technique didn't appear to affect the open file limit of the mongod process started by the default upstart script (that is, without my added ulimit statement).

If you want to try system wide limits, then add a line like the following to /etc/security/limits.conf:

*  -  nofile 65000

See "man limits.conf" for more info.

In /etc/pam.d/common-session, enable the limits module by adding:

session required pam_limits.so

Keep in mind that all this really does is have PAM set the limit for interactive and non-interactive shells when it loads. Processes then pick up these limits when started from those shells/scripts. You should probably reboot to fully apply these settings. If someone out there gets system-wide settings to apply to MongoDB, let me know with a comment.

Update:

It occurred to me after writing this post that the proper place for "ulimit -n 65000" is /etc/default/mongodb. If that file doesn't exist, create it. As you can see in the upstart script above, /etc/default/mongodb is sourced, if it exists, right before mongod is launched.
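For example, a minimal /etc/default/mongodb might look like this (ENABLE_MONGODB is the same flag the upstart script above already checks):

# /etc/default/mongodb -- sourced by the upstart script before mongod starts
ENABLE_MONGODB="yes"
ulimit -n 65000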

Building Ruby Gem Native Extensions on Windows

If you're using Ruby on Windows but keep running into gems that require native extensions, then the new(ish) RubyInstaller project is for you.

When browsing the Ruby download page, you may have noticed the newfangled Windows installer. The old installer (ever wonder where the option for installing SciTE went?) has been swapped out, and packaging is now done by the kind folks at RubyInstaller.

Besides just providing newer/better Ruby for Windows, the RubyInstaller team has also been working on the RubyInstaller Development Kit (DevKit), an add-on for building native extensions under Windows. You'll find a download link for DevKit here and instructions for installation here.

Installing DevKit is pretty easy. It amounts to just extracting some files into your Ruby install path. Once that's done, building native extensions just works (at least for the ones I tried). This is great for gems like ruby-debug-ide, which haven't been shipping pre-compiled Windows extensions with the latest releases.

It looks like RubyInstaller's first stable releases came out around March of this year. I didn't notice it until now, but I'm glad someone is putting in the effort to make Windows Ruby development more seamless.

MongoDB Sharding

Messed around today with MongoDB sharding on version 1.2.2. It was pretty easy to set up. All I had to do was:

  1. Download MongoDB
  2. Do this setup

Here's the question that prompted me to try it: does MongoDB fetch only the necessary subset of chunks when doing range queries? The short answer is yes, it does. It was natural to assume it would, but I wanted to see it in action.

To test this, I created the test.people collection from step 2 above, then ran this with the mongo client:

for (i = 0; i < 3000000; i++) { db.people.insert({name: i}) }

When that finished, I had three chunks.

> printShardingStatus(db.getSisterDB("config"))

--- Sharding Status ---
sharding version: { "_id" : ObjectId("4b7df95cc03e000000005d6b"), "version" : 2 }
shards:
{ "_id" : ObjectId("4b7df969c03e000000005d6c"), "host" : "localhost:10000" }
{ "_id" : ObjectId("4b7df96ec03e000000005d6d"), "host" : "localhost:10001" }
databases:
{ "name" : "admin", "partitioned" : false, "primary" : "localhost:20000", "_id" : ObjectId("4b7df969b90f0000000056aa") }
my chunks
{ "name" : "test", "partitioned" : true, "primary" : "localhost:10001", "sharded" : { "test.people" : { "key" :
{ "name" : 1 }, "unique" : true } }, "_id" : ObjectId("4b7df982b90f0000000056ab") }
my chunks
test.people { "name" : { $minKey : 1 } } -->> { "name" : 0 } on : localhost:10001 { "t" : 126654
7599000, "i" : 4 }
test.people { "name" : 0 } -->> { "name" : 2595572 } on : localhost:10001 { "t" : 1266547599000,
"i" : 2 }
test.people { "name" : 2595572 } -->> { "name" : { $maxKey : 1 } } on : localhost:10000 { "t" :
1266547599000, "i" : 5 }

You'll see that three chunks exist above. Two are on one shard, one is on the other. The important point to notice is that one chunk covers [0, 2595572) and another covers [2595572, maxkey). I'm not sure why [minkey, 0) and [0, 2595572) weren't simply combined into [minkey, 2595572), but that's a question for another day. For the purposes of my range test, this suffices.

I then tried operations such as:

> db.people.find({ name: { $gt: 1, $lt: 3 } } )
{ "_id" : ObjectId("4b7df9c85a4800000000485d"), "name" : 2 }

> db.people.find({ name: { $gt: 2595573, $lt: 2595575 } } )
{ "_id" : ObjectId("4b7dfb8f903c000000003c92"), "name" : 2595574 }

> db.people.find({ name: { $gt: 2595570, $lt: 2595575 } } )
{ "_id" : ObjectId("4b7dfb8d5a4800000027e34f"), "name" : 2595572 }
{ "_id" : ObjectId("4b7dfb8f903c000000003c91"), "name" : 2595573 }
{ "_id" : ObjectId("4b7dfb8f903c000000003c92"), "name" : 2595574 }
{ "_id" : ObjectId("4b7dfb8d5a4800000027e34e"), "name" : 2595571 }

I watched the mongod output on these finds. The first two queries hit only one shard. The last query hit both shards. So MongoDB does in fact query only the necessary chunks, even when doing range queries.

TechStars Application Tips

With the deadline looming for TechStars Boston 2010, a few people have asked me for tips. It's time to share them in a blog post. Hopefully this helps.

Disclaimer: this is just my perspective from Localytics, a TechStars Boston 2009 company. My tips are anecdotal at best. As always, exercise your own judgement.

  1. There are a lot of applications, so short and sweet works well. Eliminate fluff. Most of our answers were less than a few concise paragraphs. The only answer that was long was for "Tell us about each founder." That's because...
  2. ... it's more about the founders than it is about the idea. A non-trivial percentage of TechStars companies change their idea, and it's not a big deal. Your application should help people like David and Shawn get to know you. Highlight cool things you've worked on in the past. Show that you're motivated and smart.
  3. Be mentor-able. TechStars is all about mentoring. You need to be open to advice while still being able to think critically and make decisions. To the extent that you can, convey those personality traits in your application.
  4. Be dedicated to the startup. TechStars strongly favors founding teams that are 100% committed and ready to go.
  5. Get your application in early - TechStars is already looking at them. Getting in early means your application has more time to float in front of different eyes. You are at less risk of being lost in the last minute flood.
  6. Get feedback. If you're connected to any alums or mentors, see if they'll review your application. We revised and resubmitted our application several times based on feedback (though be warned: I'm not sure whether resubmitting is possible through the application form; we sent our updates over email).
  7. If your company/product is already chugging along, then demonstrate progress. Show how your company has met goals or achieved milestones like X number of users or high profile client Y.
  8. Try to get into TechStars for a Day. It gives you a chance for face time. If you do go, bring a laptop and be ready to informally demo your product.
  9. Apply! It isn't that much work, so just do it.

You should also check out:

Best of luck to all of the applicants! TechStars is a truly amazing program.

Ruby MD5 and SHA1 Digest Benchmark

I did a benchmark of MD5 and SHA1 digests in Ruby. The benchmark was done in Ruby 1.8.6 on Windows.

The code used to benchmark:

require 'benchmark'
require 'digest/sha1'
require 'digest/md5'
require 'base64'

# Demonstration of digests

puts "SHA1.hexdigest     #{Digest::SHA1.hexdigest('test')}"
puts "MD5.hexdigest      #{Digest::MD5.hexdigest('test')}"
puts "SHA1.digest        #{Base64.encode64(Digest::SHA1.digest('test'))}"
puts "MD5.digest         #{Base64.encode64(Digest::MD5.digest('test'))}"
puts "MD5.digest.inspect #{Digest::MD5.digest('test').inspect}"

print "\nSHA1.digest bytes "
Digest::SHA1.digest('test').each_byte {|c| printf("%4d", c) }
print "\nMD5.digest bytes  "
Digest::MD5.digest( 'test').each_byte {|c| printf("%4d", c) }
puts "\n\n"

# Benchmark

TIMES = 50_000

Benchmark.bmbm do |b|
  b.report("sha1 hexdig") { TIMES.times { |x| Digest::SHA1.hexdigest(x.to_s) } }
  b.report("md5  hexdig") { TIMES.times { |x| Digest::MD5.hexdigest(x.to_s) } }

  b.report("sha1 digest") { TIMES.times { |x| Digest::SHA1.digest(x.to_s) } }
  b.report("md5  digest") { TIMES.times { |x| Digest::MD5.digest(x.to_s) } }
end

The output:

SHA1.hexdigest     a94a8fe5ccb19ba61c4c0873d391e987982fbbd3
MD5.hexdigest      098f6bcd4621d373cade4e832627b4f6
SHA1.digest        qUqP5cyxm6YcTAhz05Hph5gvu9M=
MD5.digest         CY9rzUYh03PK3k6DJie09g==
MD5.digest.inspect "\t\217k\315F!\323s\312\336N\203&'\264\366"

SHA1.digest bytes  169  74 143 229 204 177 155 166  28  76   8 115 211 145 233 135 152  47 187 211
MD5.digest bytes     9 143 107 205  70  33 211 115 202 222  78 131  38  39 180 246

Rehearsal -----------------------------------------------
sha1 hexdig   0.797000   0.000000   0.797000 (  0.796000)
md5  hexdig   0.641000   0.000000   0.641000 (  0.641000)
sha1 digest   0.625000   0.000000   0.625000 (  0.625000)
md5  digest   0.500000   0.000000   0.500000 (  0.500000)
-------------------------------------- total: 2.563000sec

user     system      total        real
sha1 hexdig   0.750000   0.032000   0.782000 (  0.781000)
md5  hexdig   0.593000   0.000000   0.593000 (  0.594000)
sha1 digest   0.594000   0.031000   0.625000 (  0.625000)
md5  digest   0.469000   0.000000   0.469000 (  0.468000)

As expected, digest methods are faster than hexdigest and MD5 is faster than SHA1.
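That gap makes sense: hexdigest is essentially the raw digest plus a hex-encoding step. A quick check (not part of the benchmark) shows the relationship:

require 'digest/md5'

raw = Digest::MD5.digest('test')      # 16 raw bytes
hex = Digest::MD5.hexdigest('test')   # 32 hex characters
puts raw.unpack('H*').first == hex    # => true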

Database Normalization: First, Second, and Third Normal Forms

I read a great explanation of first, second, and third normal form a few weeks ago. For those who know what database normalization is but haven't seen the "forms", the forms are essentially rules for a well-normalized relational DB. Keeping them in mind when doing DB design is key to maintaining a great database. I'd like to make an attempt at condensing the linked tutorial into its essentials.

First Normal Form (1NF): No repeating elements or groups of elements

Don't repeat your columns. Avoid this:

OrderId | ItemId1 | ItemId2
1       | 100     | 101

ItemId1, ItemId2, ... should be split out into a separate, related table with one item per row.

Second Normal Form (2NF): No partial dependencies on a concatenated key

This is a complicated way of saying: if a column isn't intrinsically related to the entire (concatenated) primary key, move it into a table keyed by only the part it actually depends on.

Example:

OrderId (PK) | ItemId (PK) | OrderDate
1            | 100         | 2009-01-01
1            | 101         | 2009-01-01

The primary key is (OrderId, ItemId).

Consider OrderDate. It is conceptually part of an order. An order always occurs at some time. But is an OrderDate related to an Item? Not really.

You may be saying, “but items are part of an order!”, and you would be right. But that’s not what I’m getting at. OrderDate is independent of the item itself.

Look at it another way: in the table above, the OrderDate will always be the same for a given OrderId, regardless of the value of the ItemId column. That means duplicated data, which is denormalization.

Here’s how we correct the problem:

Orders
OrderId (PK) | OrderDate
1            | 2009-01-01

Order_Items
OrderId (PK) | ItemId (PK)
1            | 100
1            | 101

Here is an excellent line from the article, “All we are trying to establish here is whether a particular order on a particular date relies on a particular item.”

Third Normal Form (3NF): No dependencies on non-key attributes

2NF covers the case of multi-column primary keys. 3NF is meant to cover single column keys. Simply stated, pull out columns that don’t directly relate to the subject of the row (the primary key), and put them in their own table.

Example:

Orders
OrderId (PK) | OrderDate  | CustomerName | CustomerCity
1            | 2009-01-01 | John Smith   | Chicago

Customer information could be the subject of its own table. Pull out customer name and other customer fields into another table, and then put a Customer foreign key into Orders.
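A sketch of the fix (the CustomerId values here are made up for illustration):

Orders
OrderId (PK) | OrderDate  | CustomerId (FK)
1            | 2009-01-01 | 7

Customers
CustomerId (PK) | CustomerName | CustomerCity
7               | John Smith   | Chicago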

Wikipedia has a great quote from Bill Kent: “every non-key attribute ‘must provide a fact about the key, the whole key, and nothing but the key’.”

MySQL Join Performance

Earlier this week I was curious about the performance of JOINs in MySQL. How severe is the performance hit of joins? How much slower is a string join over an integer join? I decided to do some tests, and I’m going to share my results here.

I did these tests on a computer with the following specs and software:

  • 3.2 GHz Pentium 4
  • 2 GB of RAM
  • Windows XP
  • MySQL 5.0.51b running in development mode (no query caching)

I made two tables:

CREATE TABLE `parents` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `uuid` varchar(128) default NULL,
  `time` datetime default NULL,
  `string` varchar(128) default NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uuid` (`uuid`),
  KEY `time` (`time`),
  KEY `string` (`string`)
)

CREATE TABLE `children` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `uuid` varchar(128) default NULL,
  `parent_id` int(10) unsigned NOT NULL,
  `parent_uuid` varchar(128) default NULL,
  `time` datetime default NULL,
  `string` varchar(128) default NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `uuid` (`uuid`),
  KEY `parent_id` (`parent_id`),
  KEY `parent_uuid` (`parent_uuid`),
  KEY `time` (`time`),
  KEY `string` (`string`)
)

I filled these tables with records using the following Ruby code:

require 'mysql'
require 'digest/sha1'
require 'base64'

RECORDS = 10_000
STRINGS = ['blue', 'red', 'green'].freeze

my = Mysql::new("localhost", "user", "password", "db_perf")

my.query("DELETE FROM parents")
my.query("DELETE FROM children")
my.query("ALTER TABLE parents  AUTO_INCREMENT = 1")
my.query("ALTER TABLE children AUTO_INCREMENT = 1")

def make_uuid(arg)
  Digest::SHA1.hexdigest(arg.to_s)
end

start_time = Time.now.tv_sec
(1..RECORDS).each do |i|
  if (i % 500) == 0
    puts "Iteration #{i}"
    time             = Time.now.tv_sec
    elapsed          = time - start_time
    time_per_record  = elapsed.to_f / i
    time_to_complete = time_per_record * RECORDS
    puts "Minutes to complete: #{time_to_complete / 60}"
  end

  uuid   = make_uuid(i)
  time   = Time.at(i * 3600).strftime('%Y-%m-%d %H:%M')
  string = STRINGS[i % STRINGS.length]
  res    = my.query(
    "INSERT INTO parents " +
    "(uuid, time, string) VALUES " +
    "('#{uuid}', '#{time}', '#{string}')" )
  res    = my.query(
    "INSERT INTO children " +
    "(uuid, time, string, parent_id, parent_uuid) VALUES " +
    "('#{uuid}', '#{time}', '#{string}', #{i}, '#{uuid}')" )
end

puts ''
puts 'DONE'
end_time = Time.now.tv_sec
elapsed  = end_time - start_time
puts "Elapsed time in minutes: #{elapsed.to_f / 60}"

The table engine you use and the character encoding will make a difference. Why? Because the engine will affect how keys are stored and used, and the character encoding will affect the speed of string comparisons.

I use Rails these days, and when dealing with Rails, InnoDB (which is transactional) and UTF8 are common, so let’s look at them:

[Charts: average query times (blue = no join, red = join) for the queries listed below, against the 10K and 100K row datasets]

Blue designates queries without a join. They are:

  • Select Group by String: SELECT * FROM children GROUP BY string
  • Select Group by Time Func: SELECT * FROM children GROUP BY YEAR(time), DAYOFYEAR(time)
  • Select: SELECT * FROM children

Red designates queries with a join, which are:

  • Join Integer Group by String: SELECT * FROM children INNER JOIN (parents) ON (children.parent_id = parents.id) GROUP BY parents.string
  • Join Integer Group by Time Func: SELECT * FROM children INNER JOIN (parents) ON (children.parent_id = parents.id) GROUP BY YEAR(children.time), DAYOFYEAR(children.time)
  • Join String Group by String: SELECT * FROM children INNER JOIN (parents) ON (children.parent_uuid = parents.uuid) GROUP BY parents.string
  • Join String Group by Time Func: SELECT * FROM children INNER JOIN (parents) ON (children.parent_uuid = parents.uuid) GROUP BY YEAR(children.time), DAYOFYEAR(children.time)
  • Join Integer: SELECT * FROM children INNER JOIN (parents) ON (children.parent_id = parents.id)
  • Join String: SELECT * FROM children INNER JOIN (parents) ON (children.parent_uuid = parents.uuid)

The 10K queries were run 10 times, and the 100K queries were run 6 times. I did a warmup round before each measured round (so for a 10K query, that means 10 times to warmup, 10 times to measure). The times you see in the charts are the average times. Be sure to turn off query caching.

The results speak for themselves, so I’m offering them up with no analysis for now.

The Past Few Months

It's been a while since I've updated my blog, so I thought I'd provide a rundown of what I've been doing.

Startup

We really found our focus over the past couple months. We are committed to making tools for mobile developers. Our name is Localytics, and our first tool is an analytics service for mobile applications.

We launched an early version of our analytics engine last week at DemocampBoston1. Democamp involves a quick succession of presentations by local startups. Check out my recap.

We captured a video of our presentation. The presentation was 5 minutes with a 5 minute Q&A at the end. Check it out:

Registration is closed right now, but we will be opening it up in a couple weeks. During that time, we will add more features, documentation, and hopefully some guides on how to get the most out of the service.

For now, you can check out our demo, which might not make a whole lot of sense, but at least it's there and it works.

Gaming

I managed to get through a couple of the 2008 holiday season games. I have mixed feelings about most of them. The only ones I'd flat out recommend are Gears of War 2 and Left 4 Dead. Perhaps I'll go into more detail in future posts.

I finally discovered the wonder that is Xbox Live Arcade. There are some awesome games on there, many of them with a retro feel. Braid, Castle Crashers, and Geometry Wars are all great. Each is an update of classic game styles with gorgeous HD visuals.

Braid is a side-scroller with time manipulation, Castle Crashers invokes memories of Teenage Mutant Ninja Turtles beat-em-up games, and Geometry Wars is Asteroids on crack. I especially like how Geometry Wars pulls in high scores from your friends, encouraging you to beat them.

Wordpress and .htaccess Password Protected Directories

I tried to make a password protected directory using a .htaccess file earlier today and found out that the root Wordpress .htaccess file causes a little trouble.

The situation looks something like:

  • /
    • .htaccess <- From Wordpress
  • /ProtectedDir
    • .htaccess <- My file

As of Wordpress 2.3.2, its .htaccess file looks like this:

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress

This lives at the base of the Wordpress install, which for me happens to be the root of my web site. The rewrite rule sends any URL that points to a non-existent file to Wordpress' index.php. This lets Wordpress do search-engine-friendly URLs and custom 404 handling.

The problem here is subtle. In order to do password authentication, the web server needs to serve up a "401 Unauthorized" header and optionally an error document. My server defaults to a specific error document path, but I have not created a document at that path. However, Apache still passes the predefined 401 document path through the rewrite rules in an attempt to use it. Since it doesn't exist, the request gets snagged by the Wordpress rewrite rules and index.php. The end result is that what should have been a 401 page now turns into a 404 and you can't authenticate to the directory.

The solution is simple: override the non-existent, predefined error document path using the ErrorDocument directive in either of the .htaccess files mentioned above.

You have several options. You can use Apache's default hard-coded string for the error document by specifying:

ErrorDocument 401 default

Or you can use your own hard-coded error string:

ErrorDocument 401 "Unauthorized access"

Or you can create a 401 document and point the server to it like so:

ErrorDocument 401 /401.html

Pick one of those methods and your password authentication should work. Remember, the error document file must exist if you pick the last method.
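For reference, the .htaccess inside the protected directory itself is just the usual Basic-auth setup, something along these lines (the AuthUserFile path is only an example; the ErrorDocument line can live here or in the root file):

AuthType Basic
AuthName "Protected Directory"
AuthUserFile /home/example/.htpasswd
Require valid-user
ErrorDocument 401 default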

Note that the same problem can occur with other HTTP error codes, so be on the lookout for other situations where Wordpress might bite you.

Updated Jan. 27 thanks to a suggestion made in this web forum post.
