AWS re:Invent Day One Keynote Recap and Thoughts

AWS re:Invent is happening this week, and the day one keynote today was already packed with awesome announcements. The recording is on YouTube. There were major announcements for AI, IoT, natural language processing, and big data. I'm excited to see what people do with the new services. Here are my highlights and thoughts.

AI & Language

  • Lex - a conversational bot framework and backend, essentially the backend for Amazon Alexa now offered as a service on AWS.
  • Rekognition - image recognition as a service powered by deep learning.
  • Polly - text to speech as a service.

We are increasingly seeing the big three in cloud (AWS, Google, Microsoft) launch AI and NLP services, and it's great to see. Lots of applications become easier to create with services like these. It does make me wonder, though: will these companies end up owning much of the infrastructure for AI, or are there still opportunities for startups? I'm not the only one debating this.

IoT

  • Greengrass - for syncing data and running applications across IoT devices and providing seamless connectivity up into the cloud. This is more of an application framework for IoT combined with backend hooks. So far I find the explanations of Greengrass to be a bit muddy. It requires a closer look.

Big Data

  • Athena - run SQL on top of data in S3; completely serverless, pay per query at $5 per TB of data scanned. Built on top of Presto and supports a few different file formats.

I've been screaming for this for a long time. Athena greatly simplifies running analytics in the cloud. In other words, Amazon just took a major step toward making it pointless to run your own analytics clusters. Before this, creating any sort of production-ready, highly available cluster for SQL queries on terabytes of data required a ton of effort - I know because I've been through it many times. This lowers the barrier to entry so much.

I'm interested in seeing how the cost works out for Athena at $5 per TB queried compared to running your own databases on EC2. There are so many variables in that cost equation. You can lower your S3 footprint using different file formats and compression. You can partition data in different ways to lower the number of bytes scanned by each query.

For comparison, a d2.8xlarge instance with 48 TB of storage is $5.52 per hour. That's way cheaper if you're querying all 48 TB every hour. However, with the d2.8xlarge you get 36 cores. Athena might be using something like Lambda behind the scenes to run on many more cores, delivering faster results over an equivalent amount of data housed in S3. That only scratches the surface. I can't wait until more people get their hands on Athena and share results.
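To put rough numbers on that comparison, here's a back-of-the-envelope sketch using only the figures above (a toy calculation, not a real workload model):

```scala
// Figures from the text: $5/TB scanned on Athena,
// $5.52/hour for a d2.8xlarge with 48 TB of local storage.
val athenaPricePerTB = 5.00
val d2PricePerHour   = 5.52

// Athena costs less whenever you scan fewer TB per hour than this:
val breakEvenTBPerHour = d2PricePerHour / athenaPricePerTB // ~1.10 TB/hour

// Scanning all 48 TB once an hour on Athena would cost 48 * $5 = $240/hour,
// versus $5.52/hour for the instance (ignoring ops effort and S3 storage costs).
val athenaFullScanCostPerHour = 48 * athenaPricePerTB // $240.00
```

Of course, partitioning and columnar formats can cut the bytes scanned dramatically, which is exactly why the comparison is so hard to make in general.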

Compute

  • FPGA instances - a first in the cloud as far as I know: programmable hardware, more specifically, Field Programmable Gate Arrays.

Really curious to see what the world does with this. We'll probably see an explosion of FPGA images being created and shared for all sorts of use cases. This could bring a bunch of people into the FPGA community.

Closing Thoughts

I always look forward to re:Invent. AWS never ceases to amaze. I'm glad to see another re:Invent start strong.

Becoming a VC

I'm excited to announce that as of October, I've joined Sigma Prime Ventures full time as Venture Partner. Sigma Prime is a venture capital firm based in Boston, and we do seed and Series A investments in promising startups. My role includes sourcing companies and refining Sigma's mission and message as we look toward the future. If you have an innovative software or hardware startup, especially in promising areas such as AI, I encourage you to reach out to me and start a conversation.

It's been quite a journey, and I'd like to explain how I got here.

In 2008, I left my job at Microsoft and cofounded a little startup called Localytics. Since then, that little startup has employed hundreds of people, raised $60 million in venture capital, and helped thousands of companies engage with their users. It moved between at least six offices and has been around long enough to see several generations of startups come and go. Localytics weathered the housing crash of 2008 and went through the first Boston class of Techstars in 2009. The engineering team built infrastructure that could handle billions of incoming data points every day. Sales and marketing brought that to thousands of companies. Together, we built products people wanted. I worked with some of the smartest people in the world at Localytics. None of this could have been accomplished without great people and great co-founders. Localytics is now one of the leading providers of marketing automation and analytics for apps.

During the last eight years in my role as Chief Software Architect of Localytics, I had the good fortune to become deeply familiar with running large systems processing tons of data. This experience gave me all kinds of ideas for products that could greatly simplify the management, speed up the querying, and reduce the cost of running large scale data warehouses. For a long time I've dreamed of bringing some of those ideas to market (developing high level analytics and marketing automation at Localytics was a great experience, but my passion was always for the underlying infrastructure and technologies that brought big data to the masses). This year, it came time to act. This February I made the leap out of Localytics and into full time exploration of these ideas. With that, a chapter of my life eight years in the making came to a close and a new one began.

I started doing customer development. Along the way I met a company called TVisions that needed a little advice building their analytics backend. Through TVisions I met Dave Andre from Sigma Prime. TVisions is a Sigma Prime portfolio company, and Dave was advising them. In our discussions, Dave mentioned that Sigma Prime has a fellow program where they invite entrepreneurs to be visiting members of the Sigma Prime team. Fellows get to participate in the venture investing process but also have the freedom to spend time on their next thing (or maybe find their next thing in a Sigma portfolio company).

Dave introduced me to John Simon, a Sigma partner, and after a few conversations, I happily agreed to spend the summer as the next Sigma Fellow. Sigma has deep connections and experience investing in data driven companies, so I knew it would be advantageous to have their help while working on my ideas. They set me up with a desk, and for the rest of the summer I went to their office to listen to pitches and weigh in. I also observed a few board meetings. Along the way, I got to meet a great bunch of people and learn a ton. Up until that point, I had only experienced venture from the perspective of a founder. Now I got to see it from the other side.

As the summer unfolded, my ideas evolved. I started to realize that my concepts for improving data warehousing were really just a piece of something much bigger - the general movement toward machine learning and AI (I'd like to do a more thorough write up of what I learned over the summer, but for now, I'll leave it at that). I wondered if there were ways to have a bigger impact, to work across the ecosystem instead of tinkering with one piece.

I didn't quite have the answer, but when Sigma asked me out of the blue about joining full time, it hit me like a ton of bricks. Venture investing, if done right, could be a way to have a bigger impact than building my one piece of technology. I could instead support a whole spectrum of companies, offering something unique: a deeply technical background, experience founding a data driven company, and a history of building scalable infrastructure. That's the realization that brought me here today.

Looking to the future, I believe that good things are coming for Sigma and advancements in life changing technologies. I'm also bullish on Boston's ability to create companies in areas like AI, machine learning, robotics, and IoT. We have some of the best universities and talent in the world, with heavy investments from companies like Google, Amazon, Microsoft, and Facebook.

Wrapping up, my full faith and support goes out to everyone at Localytics. Keep building great product. Thanks to everyone who helped me vet my ideas over the summer, including those of you at Sigma, but also everyone I met in Boston, Silicon Valley, and abroad. To the team at Sigma: let's do this.

The Cake Pattern in Scala - Self Type Annotations vs. Inheritance

I'm a fan of the Cake Pattern for managing dependencies in Scala code. However, for many people, the Cake Pattern is often the first time they see self type annotations. Upon seeing them, a natural question is often, "why use self type annotations instead of just using inheritance?"

In other words, what is the point of doing this:

trait A
trait B { this: A => }

when you could instead just do this:

trait A
trait B extends A

It's a valid question if you've never used self type annotations. Why should you use them?

Some have answered this question on Stack Overflow, and the answer generally comes down to "B requiring A" (annotations) vs. "B being an A" (inheritance), and that the former is better for dependency management. However, there isn't much of an explanation as to why - we're still left hanging. Why does that difference matter in practice?

The best hint is in this comment:

The practical problem is that using subclassing will leak functionality into your interface that you don't intend.

Huh? Leak how? I want some code to show me.

Look at what happens in this example using self type annotations:

trait A { def foo = "foo" }

trait B { this: A => def foobar = foo + "bar" }

// !!! This next line throws a "not found: value foo" compilation error
trait C { this: B => def fooNope = foo + "Nope" }

A has method foo. B requires A. C requires B. But C can't call methods in A! If I were to do the same thing with inheritance:

trait A { def foo = "foo" }

trait B extends A { def foobar = foo + "bar" }

// This next line works
trait C extends B { def fooYep = foo + "Yep" }

So C can call foo from A. Now we start to see what we mean by "requiring an A" vs "being an A" and what it means to "leak functionality".

But why is it bad if C can call methods from A? That's just normal inheritance. What is so special about the Cake Pattern that we care about hiding A from C? To answer that, let's consider what A, B, and C could be in a real Cake Pattern scenario.

Suppose we are building a typical web app.

"A" is a database interface.

"B" is an abstraction on the database interface to just manipulate user information.

"C" is a service that sends emails to users based on their username. It looks up the user's email address from their username via the user abstraction that B provides.

With inheritance it looks something like this:

trait Database {
  // ...
}

trait UserDb extends Database {
  // ...
}

trait EmailService extends UserDb {
  // Ends up getting access to all Database methods
  // when it really should just be able to talk to the UserDb abstraction!
}

If we instead used self type annotations:

trait Database {
  // ...
}

trait UserDb {
  this: Database =>
    // ...
}

trait EmailService {
  this: UserDb =>
    // Can only access UserDb methods, cannot touch Database methods
}

So we've hidden the full Database functionality from EmailService. This was a fairly simple example, but Database or UserDb could have required many other components in practice. We've avoided all of it being exposed to EmailService by using self type annotations instead of inheritance. That's probably a good thing when doing dependency management, right?
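One piece the snippets above leave out is the final assembly: a trait with a self type can't be instantiated until its requirement is actually mixed in, so the compiler forces you to do the wiring at the end. Here's a minimal sketch of that assembly (the method bodies are made-up stubs for illustration, not part of the original example):

```scala
trait Database { def query(sql: String): String }

trait UserDb { this: Database =>
  // Only this narrow abstraction is visible to components that require UserDb.
  def emailFor(username: String): String = query(s"select email for $username")
}

trait EmailService { this: UserDb =>
  // Can call emailFor, but cannot call query directly.
  def notify(username: String): String = s"mailing ${emailFor(username)}"
}

// The final object must satisfy every self type, or it won't compile.
object Wiring extends EmailService with UserDb with Database {
  def query(sql: String): String = "jdoe@example.com" // stub "database"
}

val result = Wiring.notify("jdoe") // "mailing jdoe@example.com"
```

Note that `EmailService` alone cannot be instantiated; leaving `UserDb` or `Database` out of the mix is a compile error, which is the compiler-enforced separation in action.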

There's a famous quote in the book Design Patterns: Elements of Reusable Object-Oriented Software which says:

Favor 'object composition' over 'class inheritance'.

(Gang of Four 1995:20)

Further explanation is given in the form of the following:

Because inheritance exposes a subclass to details of its parent's implementation, it's often said that 'inheritance breaks encapsulation'.

(Gang of Four 1995:19)

There are other hidden benefits with self-type annotations, too. My colleague at Localytics, Dan Billings, reminded me that you can have cyclical references with self type annotations, which you can't do with inheritance - at least in Scala. In other words, you can do this:

trait A { this: B => }
trait B { this: A => }

This might make sense in some settings.

UPDATE (Aug 9, 2014): It's worth considering composition via constructors or members as another possibility. We could do the following:

class A { def foo = 1 }

class B( val a: A )

class C( val b: B )

Or this:

trait A { def foo = 1 }

trait B { def a: A }

trait C { def b: B }

In these cases, however, whatever ultimately ends up using or extending C will have access to b.a.foo.


At the end of the day, you can probably get away with using inheritance and notice no difference in practice, but sometimes that extra bit of compiler enforced separation can keep your code neatly compartmentalized.

I'll leave you with a fun quote:

All problems in computer science can be solved by another level of indirection, except of course for the problem of too many indirections.

(said by David Wheeler, brought to my attention by Jon Bass)


Recap: Northeast Scala 2014 Day 2

I was in NY this past weekend attending nescala 2014 day 2. The day was organized as an unconference. I managed to attend a few presentations, namely Functional Reactive Programming, Concurrent Revisions, and Batshit-crazy Algebra with Types (full unconference grid is here). Here are my notes.

Functional Reactive Programming

Peter Fraenkel and Guillaume Marceau both presented on this topic at nescala.

The basic idea of FRP is explained on this Stack Overflow page: What is (functional) reactive programming?

If I had to boil it down, I might say something like this: imagine variables that change over time as first class entities. You incorporate these variables into other calculations or reference them elsewhere, and when they change, all dependent calculations "react" and update automatically, too. Now try to do all of this in a functional way. Under the covers this works through a directed acyclic graph (DAG) that maintains what depends on what.
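To make the "react automatically" idea concrete, here's a toy, decidedly non-purist sketch in Scala - a mutable cell that pushes changes to derived cells. All names here are invented for illustration; real FRP libraries build and maintain the dependency graph in a far more principled, functional way:

```scala
import scala.collection.mutable.ListBuffer

// A time-varying value that pushes changes to its dependents.
final class Cell[A](private var value: A) {
  private val dependents = ListBuffer.empty[A => Unit]
  def get: A = value
  def set(a: A): Unit = { value = a; dependents.foreach(_(a)) }
  // A derived cell that always holds f(this cell's value).
  def map[B](f: A => B): Cell[B] = {
    val derived = new Cell(f(value))
    dependents += (a => derived.set(f(a)))
    derived
  }
}

def boilingPointDemo: Double = {
  val celsius    = new Cell(0.0)
  val fahrenheit = celsius.map(c => c * 9.0 / 5.0 + 32.0)
  celsius.set(100.0) // the dependent cell updates automatically
  fahrenheit.get     // 212.0
}
```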


Concurrent Programming with Revisions and Isolation Types

This presentation was by Daniel James @dwhjames. The idea: imagine managing state in your Scala code as if it were git. Here's a link to the presentation.

Batshit-crazy Algebra with Types

I always enjoy going to Scala meetups and watching talks on type theory, especially ones that create great "aha!" moments. This was one of those talks. The talk was by Jon Pretty. Here are the slides.

The slides alone can't do justice to how well Jon presented the topic, so you might want to look at a series of blog posts by Chris Taylor starting with The Algebra of Algebraic Data Types, Part 1. Chris is listed as inspiration for Jon's talk.

Someone in the audience also suggested these books (some of which might be free online):

  • Analytic Combinatorics
  • generatingfunctionology
  • Enumerative Combinatorics

I'm not sure if this talk has much practical application, but it presents another way of looking at the world that will expand your mind as a programmer. For those of us who don't live and breathe algebraic types, Jon helps you make a few mental leaps that might just surprise you.
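If you want a taste of one of those leaps without the slides, here's the basic counting argument in plain Scala: sum types add inhabitant counts, product types multiply them (a toy illustration of mine, not taken from the talk):

```scala
// Boolean has 2 inhabitants.
val booleans = List(true, false)

// Sum type: Either[Boolean, Boolean] has 2 + 2 = 4 inhabitants.
val sums: List[Either[Boolean, Boolean]] =
  booleans.map(Left(_)) ++ booleans.map(Right(_))

// Product type: (Boolean, Boolean) has 2 * 2 = 4 inhabitants.
val products: List[(Boolean, Boolean)] =
  for { a <- booleans; b <- booleans } yield (a, b)

// Option[A] behaves like 1 + A, so Option[Boolean] has 1 + 2 = 3 inhabitants.
val options: List[Option[Boolean]] = None :: booleans.map(Some(_))
```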


MongoDB "Too Many Open Files"? Raise the limit

This blog post is intended to supplement the "Too Many Open Files" page in the MongoDB docs.

Raising the file limit for MongoDB

If you installed from the Ubuntu/Debian package, then there is a simple way to increase the open file limit. MongoDB's startup script is /etc/init/mongodb.conf. This is an upstart script, which supersedes the /etc/init.d scripts we're all used to. All you need to do is add "ulimit -n {X}" to the script block before mongodb is launched, replacing X with the limit you want (I use 65000). That sets the ulimit for the script and any processes it launches (and therefore mongodb). Here is an example /etc/init/mongodb.conf:

# Ubuntu upstart file at /etc/init/mongodb.conf

pre-start script
  mkdir -p /db/mongodb/
  mkdir -p /var/log/mongodb/
end script

start on runlevel [2345]
stop on runlevel [06]

script
  ulimit -n 65000

  ENABLE_MONGODB="yes"
  if [ -f /etc/default/mongodb ]; then . /etc/default/mongodb; fi
  if [ "x$ENABLE_MONGODB" = "xyes" ]; then
    exec start-stop-daemon --start --quiet --chuid mongodb --exec  /usr/bin/mongod -- --config /etc/mongodb.conf
  fi

end script

Once MongoDB is up, you can check the limits for the process by doing cat /proc/{pid}/limits. Replace {pid} with the pid of MongoDB. To get the pid, just do "ps aux | grep mongod".

If you aren't using the install packages, then you'll need to add the ulimit command to your own startup script. Keep in mind that ulimit is a shell builtin, not a regular binary tool, so look at "man bash" for more info on ulimit.

This blog post suggests that you can set system wide limits by editing /etc/security/limits.conf and /etc/pam.d/common-session; however, this only applies to interactive and non-interactive shells (and processes started by them). When using just this technique, it didn't appear to affect the open file limit of the mongodb process started by the default upstart script (without my added ulimit statement).

If you want to try system wide limits, then add a line like the following to /etc/security/limits.conf:

*  -  nofile 65000

See "man limits.conf" for more info.

In /etc/pam.d/common-session, enable the limits module by adding:

session required pam_limits.so

Keep in mind that all this really does is have PAM set the limit for interactive and non-interactive shells when loaded. Processes then pick up these limits when started from shells/scripts. You should probably do a restart to fully enact these settings. If someone out there gets system wide settings to apply to MongoDB, let me know with a comment.

Update:

It occurred to me after writing this post that you should put "ulimit -n 65000" in /etc/default/mongodb instead. If that file doesn't exist, create it. This is the proper place for it. As you can see in the upstart file, it sources /etc/default/mongodb, if it exists, before launching mongodb.

Building Ruby Gem Native Extensions on Windows

If you're using Ruby on Windows but keep encountering gems that require native extensions, then the new(ish) RubyInstaller project is for you.

When browsing the Ruby download page, you may have noticed the newfangled Windows installer for download. They've swapped out the old installer (ever wonder where the option for installing SciTE went?) in favor of packaging now being done by the kind folks at RubyInstaller.

Besides just providing newer/better Ruby for Windows, the RubyInstaller team has also been working on the RubyInstaller Development Kit (DevKit), an add-on for building native extensions under Windows. You'll find a download link for DevKit here and instructions for installation here.

Installing DevKit is pretty easy. It amounts to just extracting some files to your Ruby install path. Once done, building native extensions just works (at least for the ones I tried). This is great for gems like ruby-debug-ide, which haven't been shipping pre-compiled Windows extensions with the latest releases.

It looks like RubyInstaller's first stable releases came out around March of this year. I didn't notice it until now, but I'm glad someone is putting in the effort to make Windows Ruby development more seamless.

MongoDB Sharding

Messed around today with MongoDB sharding on version 1.2.2. It was pretty easy to set up. All I had to do was:

  1. Download MongoDB
  2. Do this setup

Here's the question that prompted me to try it: does MongoDB only fetch the necessary subset of chunks when doing range queries? The short answer is yes, it does. It was natural to assume it would, but I wanted to see it in action.

To test this, I created the test.people collection from step 2 above, then ran this with the mongo client:

for (i=0; i < 3000000; i++) { test.people.insert({name: i}) }

When that finished, I had three chunks.

> printShardingStatus(db.getSisterDB("config"))

--- Sharding Status ---
sharding version: { "_id" : ObjectId("4b7df95cc03e000000005d6b"), "version" : 2 }
shards:
{ "_id" : ObjectId("4b7df969c03e000000005d6c"), "host" : "localhost:10000" }
{ "_id" : ObjectId("4b7df96ec03e000000005d6d"), "host" : "localhost:10001" }
databases:
{ "name" : "admin", "partitioned" : false, "primary" : "localhost:20000", "_id" : ObjectId("4b7df969b90f0000000056aa") }
{ "name" : "test", "partitioned" : true, "primary" : "localhost:10001", "sharded" : { "test.people" : { "key" : { "name" : 1 }, "unique" : true } }, "_id" : ObjectId("4b7df982b90f0000000056ab") }
my chunks
test.people { "name" : { $minKey : 1 } } -->> { "name" : 0 } on : localhost:10001 { "t" : 1266547599000, "i" : 4 }
test.people { "name" : 0 } -->> { "name" : 2595572 } on : localhost:10001 { "t" : 1266547599000, "i" : 2 }
test.people { "name" : 2595572 } -->> { "name" : { $maxKey : 1 } } on : localhost:10000 { "t" : 1266547599000, "i" : 5 }

You'll see that three chunks exist above. Two are on one shard, one is on the other. The important point to notice is that one is for [0, 2595572) and another for [2595572, maxkey). I'm not sure why [minkey, 0) and [0, 2595572) weren't just [minkey, 2595572), but that's something for another day. For the purposes of my range test, this suffices.

I then tried operations such as:

> db.people.find({ name: { $gt: 1, $lt: 3 } } )
{ "_id" : ObjectId("4b7df9c85a4800000000485d"), "name" : 2 }

> db.people.find({ name: { $gt: 2595573, $lt: 2595575 } } )
{ "_id" : ObjectId("4b7dfb8f903c000000003c92"), "name" : 2595574 }

> db.people.find({ name: { $gt: 2595570, $lt: 2595575 } } )
{ "_id" : ObjectId("4b7dfb8d5a4800000027e34f"), "name" : 2595572 }
{ "_id" : ObjectId("4b7dfb8f903c000000003c91"), "name" : 2595573 }
{ "_id" : ObjectId("4b7dfb8f903c000000003c92"), "name" : 2595574 }
{ "_id" : ObjectId("4b7dfb8d5a4800000027e34e"), "name" : 2595571 }

I watched the mongod output on these finds. The first two queries only hit one shard. The last query hit both shards. So MongoDB does in fact only query the necessary chunks even when doing range queries.

TechStars Application Tips

With the deadline looming for TechStars Boston 2010, I've been asked for tips from a few people. It's time to share them in a blog post. Hopefully this helps.

Disclaimer: this is just my perspective from Localytics, a TechStars Boston 2009 company. My tips are anecdotal at best. As always, exercise your own judgement.

  1. There are a lot of applications, so short and sweet works well. Eliminate fluff. Most of our answers were less than a few concise paragraphs. The only answer that was long was for "Tell us about each founder." That's because...
  2. ... it's more about the founders than it is about the idea. A non-trivial percentage of TechStars companies change their idea, and it's not a big deal. Your application should help people like David and Shawn get to know you. Highlight cool things you've worked on in the past. Show that you're motivated and smart.
  3. Be mentor-able. TechStars is all about mentoring. You need to be open to advice while still being able to think critically and make decisions. To the extent that you can, convey those personality traits in your application.
  4. Be dedicated to the startup. TechStars strongly favors founding teams that are 100% committed and ready to go.
  5. Get your application in early - TechStars is already looking at them. Getting in early means your application has more time to float in front of different eyes. You are at less risk of being lost in the last minute flood.
  6. Get feedback. If you're connected to any alums or mentors, see if they'll review your application. We resubmitted our application several times based on feedback (though be warned, I'm not sure if resubmitting is possible on the application form; we sent updates over email).
  7. If your company/product is already chugging along, then demonstrate progress. Show how your company has met goals or achieved milestones like X number of users or high profile client Y.
  8. Try to get into TechStars for a Day. It gives you a chance for face time. If you do go, bring a laptop and be ready to informally demo your product.
  9. Apply! It isn't that much work, so just do it.

Best of luck to all of the applicants! TechStars is a truly amazing program.

Ruby MD5 and SHA1 Digest Benchmark

I did a benchmark of MD5 and SHA1 digests in Ruby. The benchmark was done in Ruby 1.8.6 on Windows.

The code used to benchmark:

require 'benchmark'
require 'digest/sha1'
require 'digest/md5'
require 'base64'

# Demonstration of digests

puts "SHA1.hexdigest     #{Digest::SHA1.hexdigest('test')}"
puts "MD5.hexdigest      #{Digest::MD5.hexdigest('test')}"
puts "SHA1.digest        #{Base64.encode64(Digest::SHA1.digest('test'))}"
puts "MD5.digest         #{Base64.encode64(Digest::MD5.digest('test'))}"
puts "MD5.digest.inspect #{Digest::MD5.digest('test').inspect}"

print "\nSHA1.digest bytes "
Digest::SHA1.digest('test').each_byte {|c| printf("%4d", c) }
print "\nMD5.digest bytes  "
Digest::MD5.digest( 'test').each_byte {|c| printf("%4d", c) }
puts "\n\n"

# Benchmark

TIMES = 50_000

Benchmark.bmbm do |b|
  b.report("sha1 hexdig") { TIMES.times { |x| Digest::SHA1.hexdigest(x.to_s) } }
  b.report("md5  hexdig") { TIMES.times { |x| Digest::MD5.hexdigest(x.to_s) } }

  b.report("sha1 digest") { TIMES.times { |x| Digest::SHA1.digest(x.to_s) } }
  b.report("md5  digest") { TIMES.times { |x| Digest::MD5.digest(x.to_s) } }
end

The output:

SHA1.hexdigest     a94a8fe5ccb19ba61c4c0873d391e987982fbbd3
MD5.hexdigest      098f6bcd4621d373cade4e832627b4f6
SHA1.digest        qUqP5cyxm6YcTAhz05Hph5gvu9M=
MD5.digest         CY9rzUYh03PK3k6DJie09g==
MD5.digest.inspect "\t\217k\315F!\323s\312\336N\203&'\264\366"

SHA1.digest bytes  169  74 143 229 204 177 155 166  28  76   8 115 211 145 233 135 152  47 187 211
MD5.digest bytes     9 143 107 205  70  33 211 115 202 222  78 131  38  39 180 246

Rehearsal -----------------------------------------------
sha1 hexdig   0.797000   0.000000   0.797000 (  0.796000)
md5  hexdig   0.641000   0.000000   0.641000 (  0.641000)
sha1 digest   0.625000   0.000000   0.625000 (  0.625000)
md5  digest   0.500000   0.000000   0.500000 (  0.500000)
-------------------------------------- total: 2.563000sec

user     system      total        real
sha1 hexdig   0.750000   0.032000   0.782000 (  0.781000)
md5  hexdig   0.593000   0.000000   0.593000 (  0.594000)
sha1 digest   0.594000   0.031000   0.625000 (  0.625000)
md5  digest   0.469000   0.000000   0.469000 (  0.468000)

As expected, digest methods are faster than hexdigest and MD5 is faster than SHA1.

Database Normalization: First, Second, and Third Normal Forms

I read a great explanation of first, second, and third normal form a few weeks ago. For those who know what database normalization is but haven't seen the "forms", the forms are essentially rules for having a well-normalized relational DB. Keeping them in mind when doing DB design is key to maintaining a great database. I'd like to make an attempt at condensing the linked tutorial into its essentials.

First Normal Form (1NF): No repeating elements or groups of elements

Don't repeat your columns. Avoid this:

OrderId ItemId1 ItemId2
1 100 101

ItemId1, 2, ... should be split out into relational tables.

Second Normal Form (2NF): No partial dependencies on a concatenated key

This is a complex way of saying that if a column isn't intrinsically related to the entire primary key, then that column should be broken out into a separate table.

Example:

OrderId (PK) ItemId (PK) OrderDate
1 100 2009-01-01
1 101 2009-01-01

The primary key is (OrderId, ItemId).

Consider OrderDate. It is conceptually part of an order. An order always occurs at some time. But is an OrderDate related to an Item? Not really.

You may be saying, “but items are part of an order!”, and you would be right. But that’s not what I’m getting at. OrderDate is independent of the item itself.

Look at it another way: in the table above, the OrderDate will always be the same for a given OrderId regardless of the value of the ItemId column. This means data duplication, which is denormalization.

Here’s how we correct the problem:

Orders
OrderId (PK) OrderDate
1 2009-01-01
Order_Items
OrderId (PK) ItemId (PK)
1 100
1 101

Here is an excellent line from the article, “All we are trying to establish here is whether a particular order on a particular date relies on a particular item.”

Third Normal Form (3NF): No dependencies on non-key attributes

2NF covers the case of multi-column primary keys. 3NF is meant to cover single column keys. Simply stated, pull out columns that don’t directly relate to the subject of the row (the primary key), and put them in their own table.

Example:

Orders
OrderId (PK) OrderDate CustomerName CustomerCity
1 2009-01-01 John Smith Chicago

Customer information could be the subject of its own table. Pull out customer name and other customer fields into another table, and then put a Customer foreign key into Orders.
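The corrected layout mirrors the 2NF fix above. Sketching the normalized shape as Scala case classes standing in for tables (my own illustration; the original tutorial's corrected tables aren't reproduced here):

```scala
// Customers: CustomerId (PK), CustomerName, CustomerCity
final case class Customer(customerId: Int, name: String, city: String)

// Orders now carry a foreign key instead of repeating customer fields.
final case class Order(orderId: Int, orderDate: String, customerId: Int)

val customers = Map(1 -> Customer(1, "John Smith", "Chicago"))
val order     = Order(1, "2009-01-01", customerId = 1)

// Each customer fact now lives in exactly one place:
val cityOnFile = customers(order.customerId).city // "Chicago"
```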

Wikipedia has a great quote from Bill Kent: “every non-key attribute ‘must provide a fact about the key, the whole key, and nothing but the key’.”
