Monday, October 18, 2010

Fun With Facebook, Part 3

(Previous post here, and source code for this post here)

As I mentioned before, my method for generating searches to find names was really bad, because I was searching the whole space of strings and wasting a lot of time on searches that found no results, like "zx" or "qq". Last time, it took a little over two hours to generate only the useful searches that were 5 letters long. With my new, non-terrible code, it takes under 10 seconds to generate all 8348 search strings of any length. That's a pretty massive improvement. How massive?

Well, there are a total of 8348 search strings of any length that will produce at least one result. The actual alphabet they drew from is bigger than just "a" to "z", because some people's names have special symbols, like dashes for compound last names, but we'll just assume that they drew from "a" to "z" in order to get a lower bound. The longest search string that gets results is 13 letters long, so the space of all possible search strings I would have had to look through using the old method is the sum of all the spaces for each length, i.e. all the 1 letter strings plus all the 2 letter strings plus all the 3 letter strings, and so on up to 13. Computing that sum yields a whopping 2,580,398,988,131,886,038 possible searches. The 8348 searches that return results represent a rather small portion of those, approximately 3.2 * 10^-13%. In other words, if I'd run my bad program enough to find all the searches,  only 0.00000000000032% of my time would have been productive.

Anyway, I wrote a new program that didn't suck, and you can read a description of how it works after I go over some of the results. Like I said before, there were 8348 search strings that produced any results at all. Of those, 5507 return exactly one result. Of course, a lot of those are redundant. For example, if "icho" were to only hit "Nicholas", then so would "Nicho", "Nichol", "Nichola", "ichola", and so on. If you then reduce the redundant results down to just the shortest one, we'll get a list of what we'll call "minimal searches": searches (1) that return only one result and (2) for which there are no shorter searches that return the same result. Condition (2) is worded a little bit awkwardly because there can be more than one minimal search of a certain length. For example, it could be that "ich" and "cho" are both minimal searches for "Nicholas", if they both only match "Nicholas" and there are no two-letter strings that match only "Nicholas".

With that concept in mind, we can answer some even more questions. For example, how many names have minimal searches at all? Shouldn't all of them? Interestingly, there are only 481 names that have minimal searches. Part of this oddity is a result of the way I've implemented the searching. One name that fails to have a minimal search is mine. It doesn't work because there are other people in my friends list named Nick and other ones with the last name Starr. The way I've implemented this so far, you can't have multiple words (separated by spaces) in a search, so you couldn't search something like "ck rr" to try and get "Nick Starr". With that capability, there wouldn't be any names without minimal searches, unless there were literal duplicates in the list.

Well, that's about all I can think of doing with a list of my friends' names. If only facebook had provided more info (like which of my friends were friends with each other), there would be all kinds of other things I could do. As it is, my next goal is to find a good HTML parsing library for Ruby and write some code to start messing with the other data, hopefully in a more organized way than my friends list code.

Appendix A: How The Better Algorithm Works
The gist of how I made my code better is to restrict it to only look at valid search terms, by working from the names themselves rather than looking at all the possible search strings. For every name I go through a process of generating all the search strings that will match it. For example, for Nick, I would generate the following:

n, i, c, k
ni, ic, ck
nic, ick
nick

For each one of those search strings, I would run it against the whole list of names, seeing which names are matched by "n", which ones are matched by "i", and so on. Fortunately, this is a textbook example for the use of a simple optimization technique known as memoization. If you start running this method for a few sample names, you'll see that there's a lot of duplication. For example, in my dataset the string "nic" will match "Nick", "Nicholas", "Nichole", "Nicole", and a few last names. This means that every time I run the process above on any of those names (most of which are repeated a few times throughout the list), my code would have to search the whole list for names to see what matches "nic", even though it's already done that before. The key of memoization is to save the results of a function that gets repeated a lot. Sometimes you have to be careful about which results to save, but this computation is small enough that I just saved all of them. Then you end up with something like this (in pseudocode)

          Did we already search for this string?
                    If yes
                              Use the stored value instead of re-searching.
                    If no
                              Perform the search and then store the result for the future.


It may seem like a minor change, but memoization can often have a huge impact on a function's efficiency. In this case, it cut off about two thirds of my code's running time. Of course, memoization only works if the function being considered is what's called referentially transparent, which is a fancy way of saying that given the same input, it'll always return the same output no matter how many times you call it. Trivial examples of functions that are not referentially transparent would be one that returns a random number, or one that asks the user for input. In this case, if I were to change the underlying list of names that I'm working with, my memoized values would no longer be valid, because my searching function would no longer be referentially transparent.

No comments:

Post a Comment