Pegasus: A Modular, Durable Web Crawler For Clojure

Pegasus is a durable, multithreaded web crawler for Clojure.

I wrote Pegasus after the existing choices in the Java ecosystem left me frustrated.

The more popular crawler projects (Heritrix and Nutch) are clunky and not easy to configure. I have often wanted to be able to supply my own extractors, save payloads directly to a database and so on. Short of digging into large codebases, there isn’t much of an option there.

Tiny crawlers hold all their data structures in memory and are incapable of crawling the entire web. A simple crash somewhere causes you to lose all the state built up over a long-running crawl. I also want to be able to (at times) modify critical data structures and functions mid-crawl.

Pegasus gives you the following:

  1. Parallelism using the excellent core.async library.
  2. Disk-backed data structures that allow crawls to survive crashes, system restarts etc. (I am still implementing the restart bits).
  3. The bare minimum politeness needed in crawlers (support for robots.txt and rel='nofollow').
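
To make the politeness item concrete, here is a rough sketch of what honoring robots.txt and rel='nofollow' involves. It is written in Python using only the standard library, not in Clojure, and it is not Pegasus's API: the names build_robots, FollowableLinks and polite_links, the example-crawler user agent and the example.com URLs are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


def build_robots(robots_txt):
    """Parse robots.txt content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class FollowableLinks(HTMLParser):
    """Collect href values from <a> tags that are not marked rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel can hold several space-separated tokens, or be missing entirely.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if href and "nofollow" not in rel_tokens:
            self.links.append(href)


def polite_links(html, robots, user_agent):
    """Return the links a polite crawler may enqueue from this page."""
    collector = FollowableLinks()
    collector.feed(html)
    return [url for url in collector.links if robots.can_fetch(user_agent, url)]


# Hypothetical inputs: a robots.txt body and a page with three absolute links.
ROBOTS = "User-agent: *\nDisallow: /private/\n"
PAGE = (
    '<a href="http://example.com/a">A</a>'
    '<a rel="nofollow" href="http://example.com/b">B</a>'
    '<a href="http://example.com/private/c">C</a>'
)

allowed = polite_links(PAGE, build_robots(ROBOTS), "example-crawler")
print(allowed)
```

Of the three links in the sample page, only the first should survive: the second is marked nofollow and the third sits under a path that robots.txt disallows.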

The F1 Spaceship

This is what it took to start up a 90s F1 car.

1990 Leyton House March Start up

How to start a 1990 Formula 1 car

Posted by Motorsport Retro on Sunday, November 29, 2015

Previously.


A More Perfect Society

My Uber driver a few weeks ago was a refugee from Afghanistan.

When he first arrived here, he spoke no English. He had no family and friends.

Uber was the first job he held. Conversing with dozens of passengers every day was vital in picking up English.

When I met him, he had spent a year in the Bay Area. His command of English was solid and he was considering going to community college.

I am extremely proud of the fact that this new economy allows a person with no connections or language skills - incredible barriers to success - to actually move up in life.

This is the definition of a successful society - one where everyone has a shot at a good life. And one where forces beyond your control don’t get in the way of your success.


The Stroop Test

I came across this test recently. If you need to unmask an undercover Russian agent, you can show them color words (words like red, green, blue) printed in a different color and ask them to name the color of the ink.

Like so:

Green Blue Purple

If you comprehend the words, you perform worse at this test than you otherwise would.

The Stroop test on Wikipedia.


The Grind

I was recently wading through a very horrible codebase and my morale had dropped to an all-time low. It was also paper-acceptance season and I began suffering from the grass-is-greener syndrome.

In such situations it helps to read Philip Guo’s excellent post on the subject: Unicorn Jobs.

I sometimes use this mental trick. I ask myself why I am putting myself through this. Shitty codebases come in two main flavors:

  1. Shitty packaging and quality but high barrier to duplication - research prototypes tend to fall in this category. Mastery of this codebase translates to an economic advantage.
  2. Shitty packaging, shitty quality and low barrier to duplication - this is a candidate ripe for disruption. Rewrite it well and you will have done the world a massive favor.

Voice Activity Detection in Python and SWIG

The WebRTC codebase contains a very solid voice activity detection (VAD) algorithm. The project itself is a treasure-trove of solid solutions to common problems in speech, audio and video streaming, encoding etc.

Recently, I was in need of a solid VAD I could use from Python. I wrote one myself in college (and to be fair it was a bit shit).

In a few hours I was able to isolate the source code from the WebRTC project and write a Python wrapper for it in SWIG.

A working VAD for Linux in Python on x86_64 is available in this repo.

The WebRTC VAD components are in this repo.

Some SWIG tips:

  • C functions typically have the following signature: int funcName(int *input_array, size_t array_size); NumPy ships with a fantastic set of typemaps (defined in numpy.i) for just this sort of thing. Drop numpy.i into your directory and include it in your SWIG setup.
  • A lot of typemaps aren’t defined in numpy.i - do not hesitate to write a header. For instance, numpy.i doesn’t have a typemap involving a const int * - a small wrapper around your desired function call is perfect and allows you to use existing typemaps.

Academic Software - Spaceships on Life Support

After a day of trying to get some source code from an academic group to work, I took to Facebook to kvetch about academia and polish.

Ranting about the quality and polish of academic software is not fair to academics. I am reminded of this excerpt about Renault’s Formula 1 car in the LA Times:

At Indy, their garage next to pit road contained Kovalainen’s and Fisichella’s cars and one complete spare, in case either crashes. The garage was about twice the size of a typical residential two-car garage.

But the word “garage” doesn’t do justice to the area. It looked more like a hospital operating room, and, when the cars were parked, they looked as if they were on life-support systems.

And like many doctors, each crewman was a specialist in only one aspect of the car — tires, engines, front end, rear end, traction, hydraulics and so forth.

Before the practice runs and qualifying, above each car was a high-tech metal canopy with lights, electrical outlets and more than a dozen black cables that dropped down and attached to the cars.

What were the cables for? After Fisichella and Kovalainen brought their cars in from the track, some cables instantly transmitted data about the cars to the team’s computers: Fuel consumption, tire wear and the car’s balance, to name a few areas.

Four other cables provided power to the blankets placed around each tire. Teams wanted the tires kept warm, so they would be soft and “grippy” the moment a driver went back racing.

Conversely, teams wanted to prevent the car’s brakes, engine and oil from overheating. That’s why they instantly attached the blowers that look like giant hair dryers to the wheels. They were actually pumping cool air to the brakes.

F1 cars don’t have radiators. So the team attached what look like oversized cup holders — each containing another blower — to each side of the body, then poured in dry ice. That forced cold air into the engine.

When the drivers came in, I saw their crews put a TV monitor in front of them and hand the drivers the remote control. That way Fisichella and Kovalainen picked from two viewing choices: A readout of every driver’s lap times, or the TV feed showing cars going around the track.

Full article

When you ask for research prototype code, you’re getting a Formula 1 car - a spaceship on life support.

Don’t expect a German luxury saloon car. That’s an itch for the industry to scratch.


Obituary - Beryl Nelson

I first met Beryl Nelson in high school. Our first discussion covered Lisp, math, biology and her insights on education. Beryl was instrumental in my decision to study CS and mathematics in college.

Over the years, Beryl’s immense contributions to CS and CS outreach have had a massive impact on hundreds of people directly (and several millions through her work at Google). She led teams in the early years at Google Hyderabad, spoke at several ACM Grace Hopper events and did amazing work steering hundreds of high-schoolers like me towards productive careers in computing.

Here’s a detailed bio and interview - Beryl Nelson.



Twitter: @shriphani
Instagram: @life_of_ess
Fortior Per Mentem
(c) Shriphani Palakodety 2013-2020