Sunday, October 17, 2010

The Friends List: Fun With Facebook, Part 2

(Note: I'll be uploading the source of all the scripts I'm writing for this project to pastebin, but I make no guarantees on the quality of the code. I'm just writing quick-and-dirty Ruby to get some results without a ton of regard for efficiency or readability. That'll come later.)

Source for this post

As I mentioned in my intro to this series, the friends list is the simplest info in this whole package - it's quite literally just a list of names in plain text (592 total for me). The first thing I did was just a basic frequency count, which came out basically how you'd expect.
Top 5 first names and their frequency:
  1. Alex, 11
  2. Michael, 10
  3. Matt, 10
  4. Chris, 9
  5. Will, 8
It might be worthwhile to try and put together a list of nicknames, because I know have some Michaels on my friends list that go by Mike, and some Christophers instead of Chris, etc. That'll be a project for another day. The last names list is kind of funny though, as you'll see.

Top 5 last names and their frequency:
  1. Smith, 8
  2. Kim, 8
  3. Lewis, 6
  4. Starr, 6
  5. Williams, 5
Starr is the 4th most common last name? I don't think so. The last names deviated from the actual averages for the US much faster than the first names, presumably because the space of last names is much bigger than the space of first names.

Anyway, those results aren't very interesting. The next thing I tried out is actually inspired by iTunes. When I'm trying to search for a specific song, artist, or album, I often make a little game out of trying to find the shortest search term that'll pick out only what I want. For example, if I wanted to bring up all the songs I have by Pendulum, I could just search "Pendulum", but that's boring. What about just "ndu"? Close, but no cigar. That also brings up a couple songs with the word "Industry" in their title. Turns out there are no 3-letter-long search terms that will work for Pendulum, but there are some with 4 letters that will work, e.g. "ulum".

(Of course, this all depends on my music library. If someone else had a much broader library than me, "ulum" might not work, and there might not even be a solution for "Pendulum" at all, e.g. if they had a song by another artist with "pendulum" in the title. There would be no way to do a search and only get the artist Pendulum without that other song. Taken to the other extreme, someone could have a library with only Pendulum, and then even searching just "p" would work.)

Unfortunately, the actual Facebook search doesn't work this way. Searching "ck" wouldn't bring up "Nick"; your query has to actually start at the beginning of a result, e.g. "Ni" for "Nick". Presumably this is because Facebook is using a tree internally to look up the search results, which would cut the time required exponentially. Definitely a good design decision, but it blocks me from playing my little game. Thankfully, I now have my nice plain-text list of friends, so I can do it myself!

Of course, I'm not about to start listing the full names of all my friends on here, so these results are going to be pretty non-specific, but there's still some interesting stuff. The first thing I did was to take the reverse approach from my example. I looked at all search strings of a given length and saw which ones got exactly one search result. Unsurprisingly, there were no single letters that pulled up a single name. Expanding it to two letters saw some results. In total, 55 of my friends can be uniquely identified by two letter search strings, and two of them actually had two search strings that would identify them.

When I upped the length to three letters, the number of people (and redundancy) increased dramatically. 344 people (more than half of my friends!) can be uniquely identified by a three-letter string, and quite a lot of them have multiple strings that will identify them. The winner actually had 8 different 3 letter strings uniquely identifying him, but he also had a longer name than most (Note to self - normalize by name length later).

However, when I took the next step and ran it for 4 letters, my code took nearly an hour to run. I knew from the start that the method I'd coded was a simpler but ultimately very inefficient solution. Since I was looking at all the possible strings of a certain length, the amount of search strings that returned no results, like "zxqx" or "kkkk", would outnumber the useful strings at an ever-increasing rate (I'm guessing at least exponential, but I haven't proved it). In technical terms, only a tiny volume of the search space was actually useful, but I was still searching the whole thing.

How bad was the situation? Well, I did some number crunching to show you just how bad. Let's go back to two-letter search strings. There's 26 choices (a-z) for one letter, and 26 choices for the other letter, so there's a total of 26^2 = 676 two-letter strings. Of those, 379 actually returned some results among my friends list, which is about 56%. Not bad, but not great either. For three letters, there's now 26^3 = 17,576 total strings, but only 1564 of them are relevant. Now we're down to 8.9%. Uh-oh.

Moving on to four letters, we now only have 2039 out of 456,976, which is only 0.45%. Notice how the number of useful strings has barely gone up, but the total space has jumped a whole order of magnitude? That's a bad sign. For five letters, a little back-of-the-envlope calculations suggests it'll take around two hours to run (it'll probably end up being more). I'll leave that one running just for kicks, but there's an obvious (though trickier) way to implement it so that I can only examine the relevant search strings. However, I don't feel like coding any more right now, so right now this is to be continued...


Edit: It finally finished! I'm proud of how well my estimation worked out: it ended up taking 121 minutes and 47 seconds, which is just a little bit over two hours! Ha HA.

Anyway, the total number of five-letter search strings which return results is...... 1604? Fewer useful searches than for four letters? Hold on a minute. How can that be? When I first thought about it, it seemed like the number should go up indefinitely. After all, say we had someone named "Nicholas". Then "icho" is a valid four-letter search term, and from that, we get two five-letter searches, "nicho" and "ichol". Moreover, that same process can happen for every name that "icho" matches. Based on that, it seems like the number should just keep going up and up.

However, it has to go down at some point. After all, there are no 1000 letter searches that will match the names of any of my friends. So how does it go down? Well, eventually the growth will run into issues with the length of people's names. Take "Nick" for example. "Nick" is a four letter search term that will match, well, "Nick", but it actually doesn't give rise to any five-letter search terms, because there's nothing to add on to it. That's one mechanism for the number to go down, but I didn't think it would happen so soon. It doesn't seem like there are enough four-letter names for the name-length mechanism to overpower the adding-on mechanism.

Thankfully for my sanity, Drodge pointed out another mechanism that I'd missed. Two search terms can collapse together when you up the number of letters. For example, "nic" and "ick" both match "Nick", but when you go up to four letters and use the adding-on process I described, they both become "nick", so two terms collapse into one. Unfortunately, if I were to run my current code for 6 letter strings, it would take more than two days, so I'm going to wait on that until I write the better code.

Anyway, crunching the numbers, 1604 useful searches out of 11,881,376 possibilities comes out to 0.014%. Yikes.

No comments:

Post a Comment