In 2014 I attended a talk by Ramanathan V. Guha on schema.org. Its remarkable adoption offers several lessons about building internet properties.
Schema.org is an ontology - a vocabulary for annotating HTML documents with structured information. For instance, a website can add markup stating that what it is displaying is a blog post or a list of products; it can declare that a page is an instance of Blog (say). Schema.org is the central repository describing what Blog is and where it fits in the ontology.
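As a concrete illustration (the headline and date here are made up), a blog post might declare its type with a JSON-LD block like this:

```html
<!-- Hypothetical markup: tells a crawler this page is a schema.org BlogPosting. -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "BlogPosting",
  "headline": "An example post",
  "datePublished": "2015-06-01"
}
</script>
```

A search engine that understands the vocabulary can then render the page as a rich result rather than a bare link.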
As of 2015, Schema.org is by far the most widely used such vocabulary on the internet.
Schema.org accomplished this feat by aligning incentives correctly. First, annotated webpages earn a richer search-result presentation - which translates to better traffic for the website - a massive economic incentive. Second, the top four players in search agreed to support Schema.org: Google was obviously the mothership, and Bing, Yahoo, and Yandex backed it as well.
Schema.org offers an amazing lesson in setting up the incentives correctly. When you’re building a standard, waiting for people to adopt your rules on the basis of their merit is slightly less efficient than not building the standard at all. A clear economic incentive is a powerful motivator.
The Schema.org team understood that well. And now there is an amazing foundation that is being exploited by a new generation of applications.
LMDB is the nicest, no-nonsense, no-surprise key-value store I’ve ever used.
In several benchmarks, LMDB destroys its competitors - it is a beloved tool in high-profile circles.
Here are Clojure bindings: https://github.com/shriphani/clj-lmdb
One of the more memorable courses I took at Carnegie Mellon was Manuel Blum’s algorithms course.
The CSD made it a mandatory course for some majors (I don’t know the details since I was just fucking around for the most part).
It is a little-known secret that you don’t attend Manuel’s course to learn about algorithms - you can do that quite well by opening a book.
You attend to watch a genius combine expertise, brilliance, and passion.
When I took the course, Manuel wanted to get up to speed with machine learning. As of 2014, Manuel had won the Turing Award, his students counted three Turing Awards among them, had founded new fields of computer science like quantum computing, and had started successful companies that went on to become cultural phenomena - reCAPTCHA and Duolingo among them.
Manuel’s course essentially involved delivering a lecture on a topic of your choice - I made a proper horlicks of my talk. However, it inspired a lively blog post and visualization here.
I was pretty excited to meet Manuel in person. Back in high school I was an avid follower of Scott Aaronson’s blog, and he spoke highly of his advisor Umesh Vazirani. Manuel was Umesh’s advisor, so it was a meeting-your-childhood-hero moment for me.
From the entire course, three things stood out:
- Manuel is incredibly hard to lecture to. This is not a criticism; it is a harsh reminder that most academic talks are pathetic. Manuel placed immense emphasis on getting the axioms, the aphorisms, and the fundamentals right. Opening a talk the way you open an abstract was a surefire way of getting asked a question two sentences in.
- Manuel engaged heavily with students. Prior to World War II, a student was evaluated based on an oral examination by their instructors. My colleagues at work who were educated in Europe mention that this tradition is still preserved across the pond. Manuel believed in this interaction - education for him was a Socratic dialogue.
- Manuel randomly sampled material from tons of books - a technique that broadened his expertise and world-view immensely.
Manuel has tons of amazing advice for students on his website. Personally, when I took his course, due to a variety of circumstances, I had mentally checked out of CMU.
Manuel was the display of intellect I needed to not descend into deep cynicism about a life devoted to learning.
Pegasus is a durable, multithreaded web-crawler for clojure.
I wrote Pegasus after the existing choices in the Java ecosystem left me frustrated.
The more popular crawler projects (Heritrix and Nutch) are clunky and not easy to configure. I have often wanted to be able to supply my own extractors, save payloads directly to a database and so on. Short of digging into large codebases, there isn’t much of an option there.
Tiny crawlers hold all their data structures in memory and are incapable of crawling the entire web. A simple crash somewhere causes you to lose all state built over a long-running crawl. I also want to be able to (at times) modify critical data-structures and functions mid-crawl.
Pegasus gives you the following:
- Parallelism using the excellent
- Disk-backed data structures that allow crawls to survive crashes, system restarts etc. (I am still implementing the restart bits).
- The bare minimum politeness needed in crawlers (support for
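The disk-backed idea above can be sketched in a few lines. This is an illustrative Python sketch using SQLite, not Pegasus’s actual implementation: the frontier of URLs to crawl lives on disk, so a crash or restart does not lose the state built up over a long-running crawl.

```python
# Illustrative sketch of a disk-backed crawl frontier (NOT Pegasus's
# actual code): queued URLs live in SQLite and survive a process restart.
import os
import sqlite3
import tempfile

class DiskFrontier:
    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS frontier "
            "(url TEXT PRIMARY KEY, done INTEGER DEFAULT 0)")
        self.db.commit()

    def add(self, url):
        # PRIMARY KEY + OR IGNORE deduplicates URLs for free.
        self.db.execute(
            "INSERT OR IGNORE INTO frontier (url) VALUES (?)", (url,))
        self.db.commit()

    def next_url(self):
        # Pop one pending URL, marking it done so it isn't re-fetched.
        row = self.db.execute(
            "SELECT url FROM frontier WHERE done = 0 LIMIT 1").fetchone()
        if row is None:
            return None
        self.db.execute(
            "UPDATE frontier SET done = 1 WHERE url = ?", (row[0],))
        self.db.commit()
        return row[0]

# Reopening the same file after a "crash" resumes where we left off.
path = os.path.join(tempfile.mkdtemp(), "frontier.db")
DiskFrontier(path).add("http://example.com/")
print(DiskFrontier(path).next_url())  # http://example.com/
```

An in-memory queue gives you none of this: one crash and the whole crawl is gone.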
This is what it took to start up a 90s F1 car.
This is the PED for your mind. A brilliant read.
My Uber driver a few weeks ago was a refugee from Afghanistan.
When he first arrived here, he spoke no English. He had no family and friends.
Uber was the first job he held. Conversing with dozens of passengers every day was vital in picking up English.
When I met him, he had spent a year in the Bay Area. His command of English was solid and he was considering going to community college.
I am extremely proud of the fact that this new economy allows a person with no connections or language skills - incredible barriers to success - to actually move up in life.
This is the definition of a successful society - one where everyone has a shot at a good life. And one where forces beyond your control don’t get in the way of your success.
I came across this test recently. If you need to discover a Russian undercover agent, you can show them color words (words like red, green, blue) printed in a different color.
Green Blue Purple
If you comprehend the words, you perform worse on this test than otherwise.
The Stroop test on Wikipedia.
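The interference is easy to feel in a terminal. This toy sketch (my own, not from the article) prints each color word in a deliberately mismatched ink color using ANSI escape codes - try naming the inks aloud:

```python
# Toy Stroop stimulus generator: every color word is rendered in a
# *different* ink color via ANSI escape codes, forcing the mismatch.
import random

ANSI = {"red": "31", "green": "32", "blue": "34", "purple": "35"}

def stroop_stimuli(n, seed=0):
    """Return (word, ink, rendered) triples where word != ink."""
    rng = random.Random(seed)
    words = list(ANSI)
    stimuli = []
    for _ in range(n):
        word = rng.choice(words)
        ink = rng.choice([w for w in words if w != word])  # force a mismatch
        stimuli.append((word, ink, f"\033[{ANSI[ink]}m{word}\033[0m"))
    return stimuli

# Name the *ink*, not the word - a reader of English will feel the drag.
for word, ink, rendered in stroop_stimuli(5):
    print(rendered, f"(say: {ink})")
```

Someone who cannot read the words sees only colored shapes and breezes through.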