hacking day at yahoo

6 April, 2007

Chad Dickerson presented his experiences with the Yahoo Hack Day at Etech 2007. This post can be fairly short. Yahoo regularly organizes internal hack days to get developers together to work on quick prototypes or demonstrations. There are just a few rules:

  • it’s about building stuff, not about PowerPoint
  • the day runs for 24 hours: it starts at 12 o’clock on one day and ends at 12 o’clock the next, with presentations to colleagues and management (no PowerPoint!)
  • presentations last 90 seconds
  • there are no upfront reviews
  • and basically, that’s it!

At Yahoo this results in hundreds of prototypes. One result is that a lot of people get to know new colleagues, because they do not all build prototypes with their direct colleagues. People also learn about each other’s knowledge and skills, which is a benefit in day-to-day work: people know where to go with specific problems. And most importantly: it’s a lot of fun! (Yahoo has a website with a gallery of ‘mashups’ where typical examples from such hack events can be found.)


The way Jeff Jonas gave his talk at Etech 2007 was similar to the content he was presenting. His theme was ‘enterprise amnesia’. Jonas has a history in Las Vegas, where he worked on fraud detection for casinos. The main problem in these organisations is that the left hand does not know what the right hand is doing. For example, a casino might not know that a dealer and a player at the same table share the same street address, which could indicate a fraud case. Tying different databases (e.g. the employee and the visitor database) together could solve some of these problems.

Of course, just tying the databases together does not solve the problem automagically. Data in different places can be slightly different. Therefore, some smart techniques (pdf link) need to be in place to connect data from one record to another. This technique is now featured by IBM (where Jonas is a chief scientist). Basically, all data that shares some elements is compared in order to connect it into one ‘entity’. A byproduct is that while a database keeps accumulating records on individuals, the number of distinct individuals grows more slowly than the database. In other words: the information overload becomes a virtue, because more detailed information about each individual is known.
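The idea of merging records that share some elements into one entity can be sketched in a few lines of Python. The field names and the matching rule (same name or same address) are made up for illustration; real systems like Jonas’s use far smarter, fuzzier comparison logic:

```python
# Minimal sketch of entity resolution: records that share an
# identifying attribute (here: name or address) are merged into
# one entity. Matching rules are illustrative only.

def resolve_entities(records):
    entities = []  # each entity is a list of matching records
    for record in records:
        match = None
        for entity in entities:
            if any(r["name"] == record["name"] or
                   r["address"] == record["address"]
                   for r in entity):
                match = entity
                break
        if match is not None:
            match.append(record)
        else:
            entities.append([record])
    return entities

# the casino example: a dealer and a player share a street address
dealer = {"name": "J. Smith", "address": "12 Main St"}
player = {"name": "John Smith", "address": "12 Main St"}
entities = resolve_entities([dealer, player])
print(len(entities))  # 1: both records resolve to one entity
```

Note that the hard part, glossed over here, is deciding when two slightly different values (“J. Smith” vs. “John Smith”) refer to the same thing.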

Another interesting point Jonas made is to treat data and queries as the same thing. When someone queries a system, that query is itself information that enriches the system. A simple example: a user looks for information, doesn’t find it, but does find that someone else was looking for the same thing. Having stored the earlier query makes it possible to connect these two individuals. Treating new data that enters the database as a query also has serious benefits: the new data is used to ask the question “what does this change about what we already knew?” Jonas calls this ‘perpetual analytics’.

These techniques can be used for good and for bad. An example where a sound data store about individuals might have helped is the Katrina aftermath.

At Etech 2007, Marc Hedlund and Brad Greenlee gave a technical talk about privacy techniques for web applications. They both work for Wesabe, an online community where people can manage their money. Users can upload their bank account information, which is aggregated in the community. From the collected information, good tips and recommendations can be made to help people reach their financial goals.

However, the talk was mostly about low-level techniques for better privacy. Five techniques were dealt with, plus some miscellaneous best practices:

  1. critical data local – for a user it can be rather frightening to upload all his account information to Wesabe. It is not just about ‘shameful expenses’; there are simply some things you don’t want the rest of the world to see about your spending habits. The solution Wesabe chose is to offer a downloadable local client with filters. This tool downloads information from your bank, filters it, and then uploads it to Wesabe. It may not be about real privacy (a user cannot really verify that the information is actually filtered), but it solves the trust issue. There are some downsides to this approach: users have to cross a threshold (download a client), the burden is now placed on the security of the user’s computer, and there is a serious risk of trojans.
  2. privacy wall – this is a clever idea. Normally, tables in a database are connected through keys: each row in one table has an identifier, and the other table holds a reference to that identifier. In Wesabe’s case, there are tables that connect the user (with his id) to some piece of information (referencing that id). However, it would be better to keep this connection secret. This is easily done by storing a cryptographic hash of the reference to an id. This way, without some sort of password the connection cannot be made. Again, there are some problems with this approach: the biggest is when a user forgets his password. In that scenario, it takes a lot more effort to get all the information back. (Read Brad’s comment on this writeup.) For a more in-depth explanation of this idea, read Brad’s blog posting on the subject.
  3. partitioning – in a way, this concept is related to the previous one. It is always possible that a system becomes compromised by people with bad intentions. When this happens, the actual damage should be kept to a minimum. What Wesabe does is partition the databases in such a way that different kinds of data about the same user are stored in different places. For example, a membership database and an account database can be kept apart. Then a security breach stays compartmentalised. This compartmentalisation is even better when the databases are stored on different systems (not only physically, but also on different OSes, database systems, etc.).
  4. data fuzzing and log scrubbing – when building a web application with modern tools, a lot of debugging and logging is done automatically by the framework (for example in Ruby on Rails or Django). This poses a serious threat, as these logs often contain sensitive information. Not just explicitly: timestamps and IP addresses might also be traced back to certain users or other information. When designing and building such a system, logs and debug information should be handled very carefully. Wesabe made a point of scrubbing the logs meticulously, and had a retention policy for them. Error messages, which are normally sent around to developers, are now stored on disk and only a link is sent. When the error is dealt with, the log is immediately deleted. However, it remains a challenge to fix all possible holes (for example, backups of logs also pose problems).
  5. voting algorithms – Wesabe relies on the community to build up knowledge about account information. For example, the codes used for bank account numbers are hard to read. When a user changes such a number into a sensible name, this might be interesting for other users as well. Again, this could be a privacy problem: not all users should see the name someone gives to an account number. This is fixed by a voting algorithm, just like the one Google uses to classify pictures: if a certain number of people classify a picture as being a cat, then it is probably a cat. This way, only common knowledge becomes public, without introducing privacy problems.
  6. miscellaneous – furthermore, there were some general best practices. Of course, one should always hash passwords in a database. Database IDs should be randomised instead of sequential (even though sequential IDs are often the default in database systems). Finally, the company or website should have a policy describing how privacy-sensitive information is dealt with.
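The privacy wall from point 2 can be sketched in a few lines of Python. This is my own toy version, not Wesabe’s implementation: the join key stored in the account table is a hash of the user id plus a secret only the user knows, so the table itself never reveals which user the rows belong to:

```python
# Sketch of a 'privacy wall': instead of a plain foreign key, the
# account table is keyed by hash(secret + user_id). Without the
# user's secret, user and account rows cannot be linked.
import hashlib

def join_key(user_id, user_secret):
    return hashlib.sha256(f"{user_secret}:{user_id}".encode()).hexdigest()

users = {42: "alice"}  # the user table knows nothing about accounts
accounts = {join_key(42, "alice-secret"): ["NL01BANK0123456789"]}

# with the secret, the connection can be made...
print(accounts[join_key(42, "alice-secret")])
# ...without it, the rows are anonymous. This also shows the downside:
# if the user loses the secret, the link is lost with it.
```

A real system would derive the secret from the user’s password with a proper key-derivation function rather than concatenating strings, but the principle is the same.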
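The log scrubbing from point 4 boils down to masking sensitive patterns before a line ever reaches the log. The patterns below (long digit runs as account numbers, IPv4 addresses) are my own illustrative choices, not Wesabe’s actual rules:

```python
# Sketch of log scrubbing: mask sensitive patterns before writing.
import re

SCRUBBERS = [
    (re.compile(r"\b\d{9,18}\b"), "[ACCOUNT]"),           # long digit runs
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[IP]"),  # IPv4 addresses
]

def scrub(line):
    for pattern, mask in SCRUBBERS:
        line = pattern.sub(mask, line)
    return line

print(scrub("login from 192.168.1.20 for account 123456789012"))
# login from [IP] for account [ACCOUNT]
```

The hard part, as the talk noted, is completeness: a scrubber like this only catches the patterns you thought of, and does nothing for log backups made before it ran.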
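The voting algorithm from point 5 is easy to sketch: a user-supplied name for an account number only becomes public once enough independent users agree on it. The threshold of 5 here is an arbitrary illustrative choice:

```python
# Sketch of threshold voting: a label only becomes public knowledge
# once enough users independently agree on it.
from collections import Counter

THRESHOLD = 5

def public_name(votes):
    """votes: list of names that users gave to one account number."""
    if not votes:
        return None
    name, count = Counter(votes).most_common(1)[0]
    return name if count >= THRESHOLD else None

votes = ["Acme Groceries"] * 6 + ["my secret vice"] * 2
print(public_name(votes))  # Acme Groceries
print(public_name(["my secret vice"] * 2))  # None: stays private
```

The effect is exactly the privacy property described above: idiosyncratic, potentially embarrassing names never cross the threshold, so only common knowledge is shared.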

In a way I’m a bit disappointed by the talk Jeff Hawkins gave on Hierarchical Temporal Memory (HTM). But, I guess, that’s my own fault as well: I had already read his book (On Intelligence), read up on the NuPIC platform and played around with the demo application. So what can you expect from a 45-minute talk on the subject?

One thing that came up in the talk was the limitations of the software. At this moment, two things are not yet tackled. First, precise timing cannot be addressed by the system, which means that (near) real-time control of a system will be a challenge. The other issue is modeling natural language with HTMs; there is no clear idea yet how to go about this. In general, Numenta does not focus on specific applications. Their business model is based on licensing. For demonstration purposes they focus on vision systems, as shown in the released demo application.

In the end I’m still curious to see what NuPIC can do, and I would like to see an application other than the (still impressive) demo.
