How to Build a Twitter Agent

on February 15, 2008

Note: while I was working on this project, this ReadWriteWeb article was released, illustrating the future potential of the Jabber/XMPP protocol.

In this article we will build an actually useful Twitter service that will allow us to track the Blogosphere. In the process we will get hands-on programming experience with Ruby, DRb, Twitter and Jabber. This will sharpen our developer skill set and get us ready for the upcoming (Folk)Semantic Web. We will also evaluate the problems encountered and the opportunities ahead.

Background

Whether you want to call it Web 2.0, Web 3.0, the Semantic Web or the Web of Data – change is happening. Over the past years we’ve seen the tremendous power of Folksonomies, and now this Social Web is colliding with the emergence of the Semantic Web, resulting in the first semantic services. For us developers and creative entrepreneurs it’s important to get ready for this new wave of business opportunities. I find the whole notion of Intelligent Agents very interesting. For our little project however, we will create a Stupid Agent :]

Technologies like Jabber/XMPP and DRb will enable us to move from a reactive web to a proactive web. Right now this proactive realtime push of data is important for the more liquid content creation services. Micro/nano blogging platforms such as Twitter and Tumblr are good examples of this. This is one of the reasons that they already have a Jabber service set up.

I had used Jabber before to communicate with my geek friends. For this project, I had to set up a Jabber client and Jabber account. Call me stupid, but I actually had to spend 30 minutes figuring out how the hell I had to create an account and choose which server to use (turns out you can do that in the client). Now of course XMPP/Jabber is just a standard for enabling IM communication, but apart from Google’s GTalk there hasn’t been much widespread use by ordinary users. In my view, these uses of XMPP for machine to machine (to human) communication are much more interesting.

Case: The Observatory Bot

I’ve been programming for quite some time now. When learning a new language I really hate doing little examples that produce zero user/business value. That’s why I think the best way to learn new technologies is to solve real world problems right away. Of course tutorials are valuable to just get a general idea of what’s going on, but don’t waste too much time on them – implement straight away.

Our Twitter Service will also need to create some value for the user and must be production ready. However, don’t get your hopes up too much since these are experimental technologies with dependence on external services like Jabber and Twitter.

Remember my last little project? Wigitize.com is actually generating a lot of data. It’s tracking about 5000 to 6000 feeds every hour! Let’s do something with that data :]

The Observatory:
  • is a Twitter-only service
  • will allow you to ‘track’ the Blogosphere
  • will send you a direct message when something happens in the Blogosphere

Basically this is like Twitter’s IM functionality to track the Twittersphere. So in a sense The Observatory will be a proof of concept portal between the Twittersphere and the Blogosphere.

The Architecture

Right now Wigitize.com uses BackgrounDRb to perform background tasks and also to update all RSS feeds periodically. Every time the feed aggregation process finds a new feed entry it will create a FeedEntry instance. The creation of these objects serves as an event for the Observatory Bot. These events have to be pushed to the Observatory Bot in some way.

What can Twitter’s IM service do for us?

To play around with Twitter’s agent you need to set up a Jabber account and a Twitter account. For debugging I’ve found the MacOS tool JabberFox very helpful.

Basically, these commands are available:

However, I think there are a lot more hidden commands which can be used. When I emailed the Twitter developers, they told me there is a command called “d”, which can be used to send direct messages (d username message). Very useful!

Coding the Bot

Our implementation choice for today will be Ruby. If you’ve programmed intensively in other languages before you’ve probably come to the conclusion that Ruby is quite different from most others. Ruby’s flexible object model allows for great extension of the language itself (eg 3.minutes.ago). It is therefore no surprise that interesting Semantic Web projects like ActiveRDF are choosing Ruby as their language.

To communicate with Twitter we can use this cool Ruby Twitter API. Unfortunately all of these APIs are HTTP/REST driven and really limit what we can do in terms of realtime response. Also, if you want to build a serious production ready service, constantly polling Twitter will kill both parties.

So we need to interface with their Jabber Service. Jabber is a friendly name for Instant Messaging (IM) using the open XMPP protocol. Luckily, there is a Ruby library called XMPP4R which does most of the XMPP work for us. This blog post provides some simple examples and this German wiki entry provides sample code how to use callbacks (very important for a bot).

I’ve wrapped all of this in a simple JabberBot class jabber_bot.rb that can be used like this:

  class MyJabberBot < JabberBot
    def on_message(from, body)
      say(from, "You said: #{body}")
    end
  end
  my_jabber_bot = MyJabberBot.new('observatory@jabber.org', 'password')
  my_jabber_bot.connect_and_authenticate
  my_jabber_bot.run

As you can see in the diagram, I’ve built a TwitterBot on top of this JabberBot. Unfortunately it’s not possible to do all communication with Twitter through Jabber yet. For example: there are no events for when users start following other users, nor ways to retrieve information. This is why twitter_bot.rb is essentially a hybrid using both the Twitter API and Twitter’s Jabber service. Feel free to use all sourcecode provided here, I know it will be useful to some of you out there. This is how to use this TwitterBot:

  twitter_bot = TwitterBot.new('observatory', 'password', 'observatory@jabber.org', 'password')
  twitter_bot.track_phrases = ['observatory.topoints.com']
  twitter_bot.on_directed_tweet do |username, message|
    puts("directed tweet: #{username} says #{message}")
  end
  twitter_bot.on_tweet do |username, message|
    puts("something from #{username}: #{message}")
  end
  twitter_bot.on_track do |username, message, phrase|
    puts("track: #{username} says #{message} (keyword: #{phrase})")
  end
  twitter_bot.runn(:follow_all_followers => true)

Now that we have the basic building blocks to build our service, let’s build our core business logic (observatory_twitter_bot.rb):

This means that we will send a greeting when people start following us:

  on_follow do |username|
    logger.info("#{username} is following us, will follow #{username} too and send welcome message")
    follow(username)
    direct_message(username, "the Observatory is now ready to serve you, use '@observatory track [keyword]' to get blogosphere updates.")
  end
  on_unfollow do |username|
    logger.info("#{username} stopped following us")
  end

Note: in order to get the on_follow event, we have to poll the Twitter HTTP API. Since Twitter limits the rate to 70 requests per hour, I poll every two minutes (30 requests per hour) to be on the safe side.
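The polling arithmetic and the follow/unfollow detection can be sketched roughly like this. This is a minimal sketch and not the actual twitter_bot.rb: the FollowerWatcher class and its poll method are hypothetical, and the follower list is passed in by hand rather than fetched from the Twitter HTTP API.

```ruby
require 'set'

# Hypothetical helper, not part of twitter_bot.rb: turns successive
# follower snapshots into follow/unfollow events.
class FollowerWatcher
  RATE_LIMIT_PER_HOUR = 70   # limit quoted in the text
  POLL_INTERVAL = 120        # seconds between polls

  def initialize(known_followers = [])
    @known = Set.new(known_followers)
    @follow_callbacks = []
    @unfollow_callbacks = []
  end

  # Register blocks, mirroring the on_follow / on_unfollow style used above.
  def on_follow(&block)
    @follow_callbacks << block
  end

  def on_unfollow(&block)
    @unfollow_callbacks << block
  end

  # Compare the freshly fetched follower list with the previous snapshot
  # and fire one callback per change.
  def poll(current_followers)
    current = Set.new(current_followers)
    (current - @known).each { |name| @follow_callbacks.each { |cb| cb.call(name) } }
    (@known - current).each { |name| @unfollow_callbacks.each { |cb| cb.call(name) } }
    @known = current
  end

  # How many requests per hour a given polling interval costs.
  def self.requests_per_hour(interval = POLL_INTERVAL)
    3600 / interval
  end
end

watcher = FollowerWatcher.new(%w[alice bob])
events = []
watcher.on_follow   { |name| events << [:follow, name] }
watcher.on_unfollow { |name| events << [:unfollow, name] }
watcher.poll(%w[bob carol])
```

Keeping the previous snapshot in memory means a restart forgets who was already following, so a real bot would probably want to persist that list somewhere.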

And that we will start tracking the Blogosphere for them when they say the magic word:

  on_directed_tweet do |username, message|
    logger.info("directed tweet: #{username} says #{message}")
    if (phrase = track_phrase(message))
      logger.info("tracking '#{phrase}' for user #{username}")
      begin
        direct_message(username, "Will send a direct message anytime something happens in the Blogosphere regarding '#{phrase}'")
        Tracker.for(username, phrase)
      rescue => e
        logger.error("tracking failure: #{e.to_s}")
      end
    end
  end
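The track_phrase helper used above is not shown in this article. One possible implementation, assuming the command format from the welcome message ('@observatory track [keyword]'), might look like the sketch below. The regular expression and the quote handling are assumptions, not the original code.

```ruby
# Hypothetical sketch of track_phrase: returns the phrase to track,
# or nil when the message is not a track command.
def track_phrase(message)
  # Accept the command with or without the leading "@observatory".
  match = message.to_s.strip.match(/\A(?:@observatory\s+)?track\s+(.+)\z/i)
  return nil unless match

  # Strip surrounding quotes and extra whitespace from the keyword.
  phrase = match[1].strip.gsub(/\A['"]+|['"]+\z/, '')
  phrase.empty? ? nil : phrase
end

quoted  = track_phrase("@observatory track 'Twitter'")
plain   = track_phrase('track semantic web')
chatter = track_phrase('hello there')
```

Returning nil for anything that is not a track command is what lets the `if (phrase = track_phrase(message))` check above ignore ordinary chatter.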

Now when a new FeedEntry is created, we need to make sure that these Twitter users get notified when their tracked phrase matches the FeedEntry. Since this might take up some time, I’ve created a background worker task for it:

As you might see, Distributed Ruby (DRb) makes it extremely easy to control our bot remotely. In the ObservatoryBot we say:

  DRb.start_service("druby://:8997", self)

And all bot functionality can be accessed by calling: observatory_bot = DRbObject.new(nil, 'druby://:8997')
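To see the DRb round trip in isolation, here is a small self-contained sketch. FakeBot and its methods are hypothetical stand-ins for the real bot, and the server uses a random free port on 127.0.0.1 instead of the fixed port 8997, so that it can run anywhere. Server and client live in the same script here purely for demonstration; normally the client would be another process, such as the background worker.

```ruby
require 'drb/drb'

# Stand-in for the bot; the real ObservatoryBot exposes its own methods.
class FakeBot
  def initialize
    @sent = []
  end

  def direct_message(username, text)
    @sent << [username, text]
    "sent to #{username}"
  end

  def sent_count
    @sent.size
  end
end

# Server side: expose the object. Port 0 lets the OS pick a free port.
DRb.start_service('druby://127.0.0.1:0', FakeBot.new)
uri = DRb.uri

# Client side: method calls on the proxy are forwarded to the served object.
remote_bot = DRbObject.new_with_uri(uri)
reply = remote_bot.direct_message('some_user', 'New entry about Twitter')
count = remote_bot.sent_count

DRb.stop_service
```

Keep in mind that DRb has no authentication of its own, so a port like this should only be reachable from machines you trust.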

Now that we have our autonomous agent it would be nice if we could easily start and stop it in a production environment. I found the Ruby Gem called Daemons extremely useful to wrap these things up.

First, set up a file that runs the never ending process (eg script/observatory_twitter_bot.rb):

require 'logger'
require File.dirname(__FILE__) + '/../config/boot'
require File.dirname(__FILE__) + '/../config/environment'
require File.dirname(__FILE__) + '/../lib/observatory_twitter_bot'

logger = Logger.new(File.join(RAILS_ROOT, 'log/observatory_twitter_bot.log'))
observatory_twitter_bot = ObservatoryTwitterBot.new(logger)
observatory_twitter_bot.runn

Next, wrap this up in a daemon script (eg script/observatory_twitter_bot):

#!/usr/bin/env ruby
require File.dirname(__FILE__) + '/../config/boot'
require 'rubygems'
require 'daemons'

Daemons.run('script/observatory_twitter_bot.rb')


The Demo

Right now if all communication lines with Twitter are working fine, the service is up and running. I’ve made a little bot homepage at observatory.topoints.com

@observatory track ‘Twitter’:

Problems and Opportunities

In the development of the Observatory I had one big obstacle: Twitter is often down and it can cripple your service and development time. I understand that Twitter is a small team under enormous pressure but there have been a lot of complaints about this.

Nevertheless Twitter and its developers really kick ass. I emailed Alex Payne and he was excited about what I’m doing (and also about the Twitter things happening in the iKnow! project in Japan). He responded fairly quickly and immediately whitelisted my Twitter account to up the rate limits.

While working on this project I realized that Twitter in its current state isn’t really suitable for system-to-human notifications. Twitter could expand their system to be a true notification framework, but I’m not sure if they will. If they don’t, there is a tremendous business opportunity here. Imagine an open API that mashes up with technologies like XMPP and Growl. A service like that could become THE notification-bus of the web! (Growl is already pretty big in Mac land.) A full blog post about this startup idea is coming up!

Geek Food for Thought

What about RubyOnRails, Jabber, DRb, Daemons and BackgrounDRb? I think we are seeing a new framework here! In this interesting article by Danny Ayers he talks about a toolset for agents. On Java there is already “an Agent Framework” that I haven’t checked out yet. I can imagine that these frameworks ease doing development like this and facilitate better system autonomy. Of course it’s desirable to give such a framework a pragmatic paintjob by using Rails paradigms.

Above here I’ve illustrated Rails’ missing brother. I think it’s also good to take into account interesting technologies like Juggernaut and Comet. These are basically Javascript Push techniques to make a synchronous interaction on the asynchronous web possible.

When you combine all these micro asynchronous communication lines you get one big synchronous connection line between machine agents and user agents.

Building a .com in 24 hours

on January 07, 2008

This is about how I spent 24 concentrated hours, spread out over 4 days during the holidays, building the online service Wigitize.com. It is part of my ongoing learning process on how to run a successful web startup.

Even though I’m a super pimple-faced code-geek, I strive to be a creative entrepreneur who can utilize modern day tools and navigate the chaos to build cool stuff. What I tried to do for this project is use some new methods/tools out there to solve practical problems in my weak areas: design, frontend coding, system administration and SEO.

The purpose of this article is to show my thought process on the multidisciplinary aspects of this project. I also hope to provoke discussion on how these things could be done much better (correct me!) and to educate other entrepreneurially minded hackers.

Wigitize.com, A geeky Idea

While working for my current client here in Tokyo, I’m often running into problems where I wish there was a third party web service that could solve them. For example: Uploading/Managing pictures for user generated content. We want to build cool shit, not reinvent the wheel.

Another one of these ‘problems’ – by saying problems I mean ‘important new features’ – was allowing users to import their blog’s RSS feed. When users input the URL of their blog, a listing of their most recent blog entries will be displayed, sounds simple enough right?

Several weeks before, I already wrote a feature that allowed users to display their latest Twitter ‘tweets’. Building this ‘passive twitter integration’ seriously did not take any longer than one hour! This is because Twitter provides a blog badge or widget. More and more sites are starting to provide these widgets making it easier for people to take their data and display it on their blog. And that’s great!

However, the most popular data feed format – RSS – does not have these widget benefits. Widgets rely on a smart technique called JSON which allows your browser to fetch the actual data. With RSS feeds you need relatively complicated server-side processing to display the data.

So wouldn’t it be cool if there was a web service that allowed you to convert any RSS/Atom feed into an embeddable widget? I’m sure some of you will suggest similar services now :]

A service like that should be:

  • simple, with one big ass URL input
  • smart, it should detect these data feeds
  • integratable (another adjective!), by providing an API for other web services

Aahhh a new project has been born. Let’s open up MS Word and fill in our Prince 2 Project template. Herein we can properly plan for all things we need to be doing and all risks and…

Neh, just kidding! Let’s open up that bottle of wine and… just fucking build it!™

Design Mockup, 3 hours

For me, this is one of the trickier parts. But fortunately, the project is simple and Web 2.0 designs are simple too! Since I don’t aspire to be the next cool designer, I will just give you my Idiot’s view of doing design:

  • being able to draw is not a necessity
  • always start by putting down your content first
  • learn a bit about colors: Color in Motion, Vleere and Kuler (not cool but useful)
  • don’t force making a design, do it in the evening when you find the creative flow – or in an onsen if you have a waterproof mac

The weapon of choice: Adobe Photoshop (CS3, the latest shizne!). (Any good web alternatives yet? :) They say Photoshop has a steep learning curve and I agree. Still there are some things you can do to speed things up:

Learn the most important shortcut keys, you will use them a hell of a lot (ordered by usage):

  • V, the move tool (arrow) to start dragging shit around
  • M, marquee to select things
  • T, text tool
  • I, eyedropper to pick a color
  • CTRL+T, transform current selection
  • G, paint bucket to fill in stuff
  • W, magic wand to select stuff of the same shape/color

There are some things you will be seeing a lot in particular when doing Web20-like designs:

When performing blending options on a layer, you should be playing a lot with the following marked options:

You can learn here and here how to make these babies:

Woops, I shouldn’t get too deep into the details. Anyway, based solely on the content a design like this rolled out:

The colors can be justified as soil and nature, yin yang, peace man!

Storing your Files in an online Project Manager, 30 minutes

After having made those designs, I’d like to throw them online somewhere right away. Preferably in a SCM repository like Subversion. Luckily, I remembered an article on ReadWriteWeb about such Developer Tools

Basically there are two startups that provide freemium SVN and project management hosting. I chose Assembla for disk space, but Unfuddle was a close second.

That said, 10 minutes later I could start throwing things into my very own http://svn2.assembla.com/svn/wigitize/trunk

Hell, it even puts changeset notifications in my Junk box!

Playing with Rails 2.0, 4 hours

Weapons of choice: RubyOnRails 2.0 and Textmate.

Note: If one piece of software is worth paying for, it’s TextMate – am doing so now

RubyOnRails is very suitable for me, but I’m not sure if it is that suitable for you. I myself have a clear CS-engineering background and am very comfortable with digging in deep. On the other hand, depending slightly on luck, you can do a lot with this framework by just modestly hacking away and watching a screencast every now and then.

Rails is quite controversial in terms of scaling and production readiness. However, I think things are changing fast, as I will show you later. Serious innovation is being done on how this framework plays with its systems layer and technologies like Amazon’s Elastic Compute Cloud.

Like making the design, the coding also starts with content. First you jack in the important texts/inputs you have in your design and then you enclose them in the necessary divs so they can be CSS-styled later. In Rails this starts with making a home/welcome controller and writing the default layout: app/views/layouts/application.html.erb. Yes, we are using the new Rails 2.0 way of html.erb.

A very important part of Rails I think is the config/routes.rb file. This file holds all your pretty URLs like: http://wigitize.com/json/for/http://dominiek.com/ . In a way routes.rb lays out all the abstract functionality of your web application. Rumors are that the routes part of the framework is internally the most complex one.

Some quick pointers on Rails practices that will speed things up:

  • Never do things like: link_to(’’, :controller => ’’), use Named Routes
  • If something seems complicated, break it down into simple steps and write a unit test first
  • Breathe consistency, everything you code – even whitespace – has a reason. This also means applying Don’t Repeat Yourself all the time as a habit.
  • Naming can be a bitch and can take up a lot of time. Do your best and pick a name consistently, you can always change it later.
  • In Rails, take Skinny Controller, Fat Model to extreme heights, it will make your life easier
  • Do it in AJAX right away, it’s often much more simple than supporting plain HTML CRUD in the first place

There is a lot of material out there available on coding Rails in general, so I will let go here.

These first two hours are basically setting up all the routes, filling in the HTML and making things work in a very basic sense. For wigitize that meant:

  • hooking up the URL input to a feed detector and aggregator (most of that code is from here)
  • making sure that the aggregated data is stored in JSON format (In rails that means calling .to_json on any Object, easy as pie!)
  • adding a Widget model that can actually hold the URL, detected feed URL and JSON data:

I spent the next 2 hours seriously code-monkeying on the feed detecting and parsing part of the system. I will soon open source it under the name feedeater.
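To give an idea of what feed detection involves, here is a minimal sketch of feed autodiscovery. It is not the feedeater code: the detect_feeds helper is hypothetical, and it uses a deliberately simple regular expression instead of a real HTML parser, so it will miss unusual markup.

```ruby
require 'uri'

# MIME types that advertise a feed in a <link rel="alternate"> tag.
FEED_TYPES = ['application/rss+xml', 'application/atom+xml'].freeze

# Hypothetical helper: returns the absolute feed URLs advertised by a page.
def detect_feeds(html, page_url)
  html.scan(/<link\b[^>]*>/i).map do |tag|
    # Collect the tag's attributes into a hash, e.g. {"rel" => "alternate"}.
    attrs = {}
    tag.scan(/([\w-]+)\s*=\s*["']([^"']*)["']/) { |name, value| attrs[name.downcase] = value }

    next unless attrs['rel'].to_s.downcase == 'alternate'
    next unless FEED_TYPES.include?(attrs['type'].to_s.downcase)
    next if attrs['href'].to_s.empty?

    # Resolve relative hrefs against the page URL.
    URI.join(page_url, attrs['href']).to_s
  end.compact
end

html = <<~HTML
  <html><head>
    <link rel="stylesheet" href="/style.css">
    <link rel="alternate" type="application/rss+xml" title="RSS" href="/feed.rss">
    <link rel="alternate" type="application/atom+xml" href="http://example.com/atom.xml">
  </head></html>
HTML

feeds = detect_feeds(html, 'http://example.com/blog/')
```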

Style that HTML! 1 hour

Styling, already? Yes, I think it’s good to style quite early in the process. For me, there is one single argument: Flow.

When you are making something work it is nice when things already look quite tangible and usable. When you apply styling in an early stage you can see direct usable results of the things you are building, increasing the psychological state of the Flow.

Ok, so I also kinda suck doing CSS, but I learned enough to turn an image into a web page. These are my bullet point lessons:

  • Develop for Firefox first, using the tool Firebug. If you aren’t using Firebug and doing web development you’re either slow or an imbecile ;]
  • Always put in some basic CSS, I got some here and made it into my own html.css which I can include as a base all the time.
  • Padding and margin: These things are great and you need a lot of them. However, padding often gives you shit so try to choose margin over padding. Also, choose margin-bottom over margin-top since everything tends to float upwards (Thanks Simon)
  • Current fashion: Get some nice fonts going on, I’m using Trebuchet MS a lot for Wigitize and I mix it in with plain Arial.
  • Current fashion dictates that you use a less fierce black for your text, make it #111 or #222.
  • For mozilla/safari I’m using -moz-border-radius/border-radius, will never work in IE6 – fuck them, being a plain user is fine, but you’re not getting any round corners! Besides, isn’t Microsoft planning a forced upgrade soon?
  • You only need tables for tables. They will make your life a pain and you will not be cool if you use them for layout. Other than that it’s fine.
  • Little side note about AJAX: don’t use AJAX for navigation EVER! There are strong SEO and usability arguments against it (learned that the hard way, like most things in my life).

Spin it Baby! 2 hours

OK, I must admit that I spent way WAY too much time on this! However, when you’re doing things with AJAX, you need to put a spinner somewhere to indicate loading.

This so called spinner is pretty kickass and based on a piece of Javascript I wrote earlier:

Yeah Baby! It’s all hypnotic and stuff.

Note: As someone pointed out in the comments, on ajaxload.info you can get a lot of spinner images, like the one I’m using here.

Designing and Coding the Footer, 2 hours

I really like footers and I think they are becoming more and more important. Nowadays footers are used as sitemaps and often they are contextualized as well. These are nice examples of footers: last.fm and snooth.com

For Wigitize the footer is rather small since it’s a small site:

Making the JSON Embeddable, 3 hours

The embeddable Widget has the following code:
<div class="feed_widget">
  <ul id="feed_widget_34"></ul>
  <script type="text/javascript" src="http://wigitize.com/javascripts/wigitize.js"></script>
  <script type="text/javascript" src="http://wigitize.com/feeds/34.json"></script>
</div>

This works in three steps:

  • define a containing list (ul)
  • include a JS library that has a special callback function, in our case wigitize_feed()
  • include the JSON file that will call wigitize_feed with the appropriate data
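On the server side, the third step amounts to wrapping the stored JSON in a call to the callback function. Here is a minimal sketch, assuming a payload with an id and a list of entries. The feed_payload helper and the field names are hypothetical; only the wigitize_feed callback name comes from the text above.

```ruby
require 'json'

# Hypothetical helper: renders the body of a file like /feeds/34.json,
# i.e. JavaScript that hands the feed data to the wigitize_feed callback.
def feed_payload(widget_id, entries)
  data = { 'id' => widget_id, 'entries' => entries }
  "wigitize_feed(#{JSON.generate(data)});"
end

entries = [
  { 'title' => 'How to Build a Twitter Agent', 'link' => 'http://example.com/twitter-agent' }
]
payload = feed_payload(34, entries)
```

Because the browser loads this file through a script tag, it is executed as JavaScript. That is what lets the widget pull in data from another domain without any server-side processing on the embedding site.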

Providing styles obviously complicates things a bit. When choosing a style, it will include a generic wigitize.css and assign a class to the containing div.

Near future improvements:

  • Provide a ‘grab the grabber’ so that people can provide widgets of their own feeds (eg ReadWriteWeb providing a last-10 articles widget). This could have a lot of potential if provided in a simple feed-burner like button.
  • Put in better default styles than the lame ass ones I have now.
  • Option to include data (useful for photo feeds).
  • Option to display other kinds of aggregated data, eg microformats.

Making things run in the Background, 3 hours

Running things in the background – dubbed backgrounding – is an important part of production ready web applications. It’s a relatively new concept, since web applications used to be less complex. Now however, we are marching towards the Semantic Web where web apps are expected to become intelligent (the Intelligent Agents are coming, just like in the Langoliers!). I think being able to make your applications autonomous now will already reap you benefits (to be continued as an article).

There are several ways to achieve backgrounding in Rails, but by far the easiest one is using BackgrounDRb. BackgrounDRb is a plugin for Rails that allows you to easily kick off background processes and schedule regular tasks. Good for our purposes: detecting and fetching feed data.

I don’t agree with all of Zed Shaw’s big rant about the Rails community being a ghetto, but I sure do agree that there are a lot of idiots out there that produce things that can screw over your production apps. BackgrounDRb has become one of these projects and I strongly recommend that you do not use the latest code. If you start comparing code and read the mailing list you will see that a new guy has taken over Ezra’s fine project and I suspect that he has lowered the project’s level to pre-stable. I don’t hear any signals from the community and that worries me. Either I’m seeing ghosts or people are blindly accepting anything that’s marked as stable. In any case, I’m using Ezra’s version which works fine.

Finishing up the API, 4 hours

The Wigitize API for now is quite quick and dirty. There are two simple ways of using the API and examples are provided. In this area, a lot of improvements will be made down the line since it’s a key point in making any future freemium revenue.

Domain and Domain Email, 30 minutes

When I buy a domain I always buy a DNS managing package with it as well. This means that I can login somewhere and setup subdomains and set mail records. So the total price of Wigitize.com was 20$ per year.

Providing info@domain.com email is easy, just get yourself a free Gmail for organizations account. In your Google Apps domain manager you can simply add your domain and in your DNS tool you set the MX record to ASPMX.L.GOOGLE.COM. Now you can use gmail and IMAP to read mails sent to your domain.

Setting up a Production Server, 1 hour

I was really eager to put this project on Amazon’s Elastic Compute Cloud until I calculated my monthly costs. Running costs alone for simple projects like these will cost you 60$ a month. Still I think it’s worth looking into once you scale beyond a simple project.

After digging around I figured out that slicehost.com would be a good cheap second. For 20$ a month I have a 256MB memory slice with 100GB of data transfer – awesome.

Setting up your slice takes 5 minutes with a credit card. This slice is essentially a virtual machine with an IP address, completely yours. And the best thing: you get an awesome web console to control everything! Adding a new machine is a no brainer.

Now, I’ve used linux/unix for a long time as a working station. Eventually I got lazy and switched to my current MacBook. Fortunately, you can be lazy for the systems side of things too. All thanks to a lovely tool called Deprec.

Deprec allows you to install the complete Rails stack with a small set of commands. Shortly thereafter you can deploy your application to your production server by typing cap deploy_with_migrations. Please note that for Deprec you need to install Ubuntu Linux on your machine which you can do as follows:

Deprec installs the Rails stack as: Apache, Mongrel cluster (default of 2 instances) and MySQL. As I’ve written earlier, NginX is a nice nano-alternative for Apache. I would like to see that in my Rails stack someday, but I’m not going to worry about that now. Clock is ticking!

Little Pimps and Tweaks, 3 hours

I think it’s good to prepare a little bit for the storm (and I felt like doing something else for a bit), so I’ve created a nice maintenance message for in case there are system/scaling problems. In here I think it’s important to give people an extra reminder to bookmark and come back.

Which brings us to another great service, addthis.com. AddThis provides you with a button that makes it easy to bookmark on multiple social bookmarking services.

Another one of those little tweaks was proper error checking and displaying it to the user. I am using a pink error message to make it look more friendly (maybe I should go even further and make it yellow or something):

Statistics and Search Engine Optimization, 1 hour

Statistics is another 10 minute no-brainer by using Google Analytics. However I’m on the lookout to find something more real-time like Mint (but then free). Any suggestions?

Note: I just added getclicky.com to get more realtime stats than GA

I found a blog which is solely about Rails and SEO, which I thought was very promising but in fact doesn’t have much content. I did find something on how to provide different meta tags in Rails, which I applied right away.

After looking around a bit I also saw some discussion about whether to use www. or not. The way to roll with this is: you permanently redirect your www. domain to say http://wigitize.com. It makes sense, www is old and architecture centric, http://wigitize.com is less typing and pretty.
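As an illustration of what such a permanent redirect does, here is a Rack-style middleware sketch in plain Ruby. The StripWww class is hypothetical and is not how Wigitize.com does it. With the Apache setup described above, a rewrite rule in the web server would do the same job.

```ruby
# Hypothetical Rack-style middleware: answers requests for www.<host>
# with a 301 to the bare domain and passes everything else through.
class StripWww
  def initialize(app)
    @app = app
  end

  def call(env)
    host = env['HTTP_HOST'].to_s
    if host.start_with?('www.')
      # Rebuild the same path (and query string) on the bare domain.
      target = "http://#{host.sub(/\Awww\./, '')}#{env['PATH_INFO']}"
      query = env['QUERY_STRING'].to_s
      target += "?#{query}" unless query.empty?
      [301, { 'Location' => target, 'Content-Type' => 'text/plain' }, ['Moved Permanently']]
    else
      @app.call(env)
    end
  end
end

# A trivial inner app so the sketch can be exercised without a web server.
inner = ->(env) { [200, { 'Content-Type' => 'text/plain' }, ['ok']] }
app = StripWww.new(inner)

redirect = app.call({ 'HTTP_HOST' => 'www.wigitize.com', 'PATH_INFO' => '/json/for/x', 'QUERY_STRING' => '' })
direct   = app.call({ 'HTTP_HOST' => 'wigitize.com', 'PATH_INFO' => '/', 'QUERY_STRING' => '' })
```

The 301 status matters here: it tells search engines that the move is permanent, so links to the www. address count towards the bare domain.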

On that note, make sure that you always write pretty URLs as a matter of practice. This means thinking them through extra carefully, because changing URLs is painful after going live.

Wigitize.com was also chosen with SEO in mind. The word is a mis-spelling of the term widgetize and yields 105 Google results (at this moment). Additionally, it is a verb, which lubricates the prettiness of the URLs :]

Near future improvements:
  • Making a list of ‘last wigitized sites’ and ‘most wigitized sites’, those pages can be accessed by the search spiders and thus associating external content with Wigitize.com. I think this might improve search rankings.
  • Providing a sitemap.xml, would that help?
  • Focus on the viral aspects of these widgets. For example a FeedBurner style button for use on popular blogs.

Let’s throw it out there!

While I was writing this article (which took quite some time), Wigitize.com was already running and doing its job. However, I’m sure there are still some kinks to work out, which I will do over the coming days (eg IE6 support, SEO/viral tasks).

Also, I didn’t discuss anything about an important aspect: How to make revenue with Wigitize.com? I’m not sure yet, but since I’m solving a problem for myself, I’m sure others out there had it. Besides, the costs are extremely low at this point so I will worry about monetizing later. Although I would like to hear YOUR thoughts about it!

I realize that this is a geeky project and I must say it’s quite different than the web apps I normally work on. It was fun for me however to write down my thought process, especially on the non-tech parts of building which I find increasingly interesting.

Note about BackgrounDRb

on January 03, 2008

BackgrounDRb was a nice project that allowed you to easily do background jobs in Ruby and Rails.

Now, it is maintained by some PRICK who is rewriting everything making it IMPOSSIBLE to understand. (Maybe Zed Shaw has a Point?)

While doing some quick prototyping for one of my projects I was building a week, month, year period selector. When loading stuff with AJAX it is important to indicate to your users that you are actually refreshing a piece of data on your page. There are many ways of indicating this load, but for now I just wanted something quick and useful (without requiring any strings that might have to be translated in the future):

example data:

when loading:

I meshed together a quick piece of Javascript and CSS since I couldn’t find anything out there. If you know something that does this, please let me know! (Since my code is obviously quick and dirty and only tested on Firefox).

Put this in one of your Javascript files (in case of rails: application.js):

// Overlays a translucent spinner on the given container (uses Prototype's $()).
function spin_div(div_id) {
  var container = $(div_id);
  var positioning = 'top: ' + container.offsetTop + 'px; width: ' + container.offsetWidth + 'px; height: ' + container.offsetHeight + 'px; ';
  container.innerHTML += '<div class="spin_div" style="position: absolute; ' + positioning + '"></div>';
}

And here is some CSS which you can customize (spinner.gif is a generated spinner from ajaxload.info):

.spin_div {
  background: #fff url('/images/spinner.gif') no-repeat center center;
  opacity: 0.75;
  filter:alpha(opacity: 75);
  -moz-opacity: 0.75;
  -khtml-opacity: 0.75;
}

Now, when using Rails’ or Prototype’s AJAX routines, just call spin_div from the :loading callback:

link_to_remote('label', :url => takes_awhile_url, :loading => "spin_div('container_id');")

The great thing about this method is that the spinner gets destroyed automatically when the content of the container is refreshed.

These are quick notes I spontaneously ranted down about my experience with Rails and making it perform.

One of the reasons I am working on this current project here in Tokyo is that I can experience the hardships that come with user growth. Apart from learning how to actually get a project to take off, it is also interesting to learn what to do when it actually does!

When we had our first growth spikes, we had a lot of people using the system at the same time. Being a learning system, we have the disadvantage of doing a lot of data-intensive processing. This article is about the code optimization part of a Rails project rather than the systems part (which is another chapter, properly described in articles like these).

Finding the slowest requests

There are several tools to sift through the production logs to find out what the slowest pages are. These are the pages you will have to tackle first.
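If you don’t have such a tool at hand, a few lines of Ruby get you surprisingly far. The sketch below is only an illustration: it assumes the "Completed in ... | Rendering: ... | DB: ... | 200 OK [url]" summary line that Rails of this era writes per request, and the method names parse_completed_line and slowest_requests are made up for this example.

```ruby
# Parse one "Completed in ..." summary line from a Rails log.
# Assumed line format (Rails 1.x/2.0 style), e.g.:
#   Completed in 0.41270 (2 reqs/sec) | Rendering: 0.10310 (24%) | DB: 0.29870 (72%) | 200 OK [http://example.com/users]
# Returns nil for every other kind of log line.
def parse_completed_line(line)
  total = line[/^Completed in (\d+\.\d+)/, 1]
  return nil unless total
  {
    :url    => line[/\[(\S+)\]\s*$/, 1],
    :total  => total.to_f,
    :render => (line[/Rendering: (\d+\.\d+)/, 1] || 0).to_f,
    :db     => (line[/DB: (\d+\.\d+)/, 1] || 0).to_f
  }
end

# Sum up the time spent per URL and return the `limit` most expensive URLs
# as [url, stats] pairs, slowest first.
def slowest_requests(log_text, limit = 10)
  totals = Hash.new { |hash, url| hash[url] = { :hits => 0, :total => 0.0, :db => 0.0 } }
  log_text.each_line do |line|
    entry = parse_completed_line(line)
    next unless entry
    stats = totals[entry[:url]]
    stats[:hits]  += 1
    stats[:total] += entry[:total]
    stats[:db]    += entry[:db]
  end
  totals.sort_by { |_url, stats| -stats[:total] }.first(limit)
end
```

Feed it File.read('log/production.log') and compare :db with :total per URL to see whether a page is DB-heavy or render-heavy.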

Depending on your system load (high mem / cpu / db), you might want to prioritize render-heavy over db-heavy pages, but from what I’ve seen most of the first optimization steps are in DB-heavy pages.

Optimizing a request / page

Render heavy, database heavy? What are you talking about? Basically, when you look at the mongrel development log, load time is separated into two categories: render time and DB time. Render time is simply the time the request takes excluding calls to the database.

When you have a page with a high render time but a low percentage of database time, it means that a lot of time is spent on calculations or on moving data around. With these pages you have to make sure that you:

  • Have no code that blocks the request (HTTP/network calls, external commands, disk IO). This code should be moved to its own background worker.
  • Don’t have too much ActiveRecord code that loads big chunks of data. These calls will appear as having a low DB load, but in fact use up a lot of CPU and memory. Only load the data you display.

When optimizing individual pages, this is the way I go about it:

  • run mongrel_rails on your powerful development machine
  • this might be controversial, but…. LOAD THE ENTIRE PRODUCTION DATABASE. I’m not kidding. This will give you benefits in terms of optimization but also for the usability aspect of developing (might be an extreme literal example of a getting real chapter). However, I do recommend that you make a ‘rake db:make_developer_friendly’ task that will obfuscate the private user data.
  • open up a terminal and run ‘tail -f log/development.log’. That way you can take a good realtime look at all the stuff happening when a request is done. Hit enter a couple of times to create a visual separation between requests :)
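The ‘rake db:make_developer_friendly’ task mentioned above is just a name I made up, and what it should scrub depends on your schema. As a rough sketch of the core idea, assuming users are plain hashes with :id, :email and :full_name keys (hypothetical column names), the obfuscation could look like this:

```ruby
require 'digest/md5'

# Replace a real e-mail address with a fake one at example.com.
# The same input always gives the same output, so uniqueness constraints
# and "same person" relationships in the data survive the scrubbing.
def obfuscate_email(email)
  "user-#{Digest::MD5.hexdigest(email.downcase)[0, 12]}@example.com"
end

# Return a copy of a user row with the private fields replaced.
# The :email and :full_name keys are assumptions for this example.
def obfuscate_user(row)
  row.merge(:email => obfuscate_email(row[:email]),
            :full_name => "User #{row[:id]}")
end
```

In a real rake task you would loop over the users table and write the obfuscated values back, on the development copy only of course.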

ActiveRecord is killing with a thousand cuts

ActiveRecord is a great thing, but when it comes to performance you have to keep it in check. (Even when you’re in production I think it still adds great value!) Stuff like blog_entry.user.username looks quite innocent, but when you have a listing of 100 blog_entries, you’re screwed: it will do a query to load the belongs_to :user relationship every time this is called in the listing, so you end up with another 100 queries for your 1 HTTP request.

You can combat this by preloading and customizing ActiveRecord loads. In the case of blog_entry.user.username, you could do a BlogEntry.find(:all, :include => :user) which will preload the user belongs_to, however this might be inappropriate:

  • :include doesn’t preload polymorphic associations
  • :include doesn’t play well with :joins/:select yet
  • if you only need the username, don’t load the whole user
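To make the difference in query counts concrete without needing a database, here is a toy model in plain Ruby. FakeDatabase is a made-up stand-in that only counts "queries"; it is not how ActiveRecord works internally, it just mimics the access pattern of lazy loading versus preloading.

```ruby
# A stand-in for the database that counts how many queries it receives.
class FakeDatabase
  attr_reader :query_count

  def initialize(usernames_by_id)
    @usernames_by_id = usernames_by_id
    @query_count = 0
  end

  # One query per user, like blog_entry.user.username without preloading.
  def find_username(id)
    @query_count += 1
    @usernames_by_id[id]
  end

  # One query for many users, like preloading with :include
  # (and it only fetches the usernames, not the whole user).
  def find_usernames(ids)
    @query_count += 1
    ids.each_with_object({}) { |id, found| found[id] = @usernames_by_id[id] }
  end
end

# The innocent looking version: one lookup per listed entry.
def usernames_lazily(db, entries)
  entries.map { |entry| db.find_username(entry[:user_id]) }
end

# The preloading version: collect the ids first, then ask once.
def usernames_preloaded(db, entries)
  usernames = db.find_usernames(entries.map { |entry| entry[:user_id] }.uniq)
  entries.map { |entry| usernames[entry[:user_id]] }
end
```

With a listing of 100 entries the first method hits the fake database 100 times, the second one exactly once.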

I’m not sure if I remember correctly, but sometimes it is actually faster not to preload at all and just let the ActiveRecord craze do its thing.

Explain these slooow queries

When you have one of those queries that take more than 0.03 secs, you might want to analyze it a bit.

In my current project there are MySQL pro hired-gun consultants that go very far with this stuff, but it’s always good to know a few of their tricks yourself:

Open up your MySQL client and run the query against your production data copy:

  • Always put SQL_NO_CACHE after your SELECT statement, this will make sure you aren’t looking at MySQL cache load-times.
  • Put the ‘explain’ statement in front of your query to look for big integers which might mean that you’re missing an index.
  • Geez, this output looks fucked! Yes. Put \G at the end of your query instead of the semicolon and the pipe characters are gone.

Appropriate indexes should be set up at an early stage. Adding or removing an index can take hours once you’ve accumulated a lot of data!

SQL caching is great, but totally useless for data sources that change by the second; SQL_NO_CACHE can be faster in those cases. For those places that ARE suitable for SQL caching, make sure you don’t work against it. The SQL cache only hits when the query text is exactly the same every time.

('created_at > ?', 5.days.ago.utc) # Not SQL cached
('created_at > ?', 5.days.ago.beginning_of_day.utc) # SQL cached!

In some cases you might be pulling in data for a restricted subset of parents. For example: you want to get all the messages posted by the users that belong to a certain group with x conditions. In those cases, it might be faster to retrieve the ids of all those users in one query and then do a second SELECT with a giant “user_id IN (?)” condition.

Fragment cache the hell out of it!

You can lower the load of your pages by fragment caching certain areas in your views. A fragment cache works like this:

<% cache_method(identifier) do %>
  your code here
<% end %>

Code in that block will only run once until clear_cache_method(identifier) is called.

There are several ways of clearing these fragments:
  • Clearing it on specific places during the execution of alterations. This requires specific knowledge of the behavior.
  • Clearing it whenever a change is made to an entity/model. You can use cache_sweepers (observers) for that.
  • Clearing it periodically with a cronjob. This is useful for when behavior is very complicated.
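If the mechanics feel a bit magical, here is a toy version in plain Ruby. FragmentStore is made up for this example and keeps everything in a Hash; the real thing stores rendered HTML on disk or in memory, but the run-once-until-cleared behavior is the same idea.

```ruby
# A toy fragment cache: the block only runs when nothing is stored
# under the identifier yet.
class FragmentStore
  def initialize
    @fragments = {}
  end

  # Return the stored fragment, or run the block once and store its result.
  def cache_method(identifier)
    return @fragments[identifier] if @fragments.key?(identifier)
    @fragments[identifier] = yield
  end

  # Forget the fragment so the next cache_method call runs the block again.
  def clear_cache_method(identifier)
    @fragments.delete(identifier)
  end
end
```

Anything expensive that happens inside the block, including database calls, is skipped on a cache hit. Anything that happens before the block is not.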

There is one rails good-practice guideline that plays very well with fragment caching: Fat models, Skinny controllers.

<% cache_method(identifier) do %>
  <% @newest_users.each do |user| %>
    ... # no DB calls cached here!!

The instance variable @newest_users is populated in the controller, so the query runs on every request and the cache is nothing more than an HTML cache. What you should do is MAKE SURE THAT THE DATABASE CALL IS DONE IN THE VIEW.

PHP users will go insane now. What? Database queries in the view? Are you an amateur? Well, it’s actually quite elegantly tucked away in the model:

<% cache_method(identifier) do %>
  <% User.newest.each do |user| %>
    ...

No code in the controller, all execution in the fragment. Yeah!

Join the summaries!

When you have a system with very complicated datasets you will need big queries with a lot of joins. To improve performance you can ‘denormalize’ the database – making the structure more simple. But sometimes you can’t. What you CAN do and probably have to do, is summarize that data so it can be accessed quickly.

Finding the right architecture for a summary table took a few trials and errors. At the moment this is the way I roll with this:
  • add an AR model MyEntitySummary
  • add a class method MyEntitySummary.full_regenerate (truncate table and insert all my_entity_summaries)
  • add a class method MyEntitySummary.update_for(my_entity) (update one row of my_entity_summaries)
  • make sure that both are using the same pieces of SQL (DRY)
  • the first time you migrate, call MyEntitySummary.full_regenerate
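The shape of this is easier to see in code. The sketch below is an in-memory stand-in, not a real AR model: the rows live in a Hash instead of a table, entities are plain hashes with :id and :items (made-up fields), and summarize plays the role of the shared piece of SQL.

```ruby
class MyEntitySummary
  @rows = {}

  class << self
    # The "summary table": entity id => summary row.
    attr_reader :rows

    # The one place where the summary is computed (the DRY part).
    def summarize(entity)
      { :item_count => entity[:items].size,
        :item_total => entity[:items].inject(0) { |sum, item| sum + item } }
    end

    # Throw everything away and rebuild all rows. Expensive!
    def full_regenerate(entities)
      @rows = {}
      entities.each { |entity| update_for(entity) }
    end

    # Recompute the row of a single entity. Cheap.
    def update_for(entity)
      @rows[entity[:id]] = summarize(entity)
    end
  end
end
```

Because full_regenerate is built on top of update_for, both are guaranteed to produce the same rows, which is exactly what you want when you fall back to a full regenerate in an emergency.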

And now comes the tricky part. Preferably, you only want to call update_for from now on; full_regenerate is only for the first time or for emergencies. You can call update_for from an after_save or an observer (preferably through BackgrounDRb).

You might be tempted to put full_regenerate in a cronjob and run it every hour. Only do that when it’s really necessary, since it will cause big load spikes on your servers. We have also had some trouble with table locks.

Size, count, length?? Cache that count

As you might know, these methods have different behavior when running them on an association. For example user.blog_entries.length will pull in the full data set and return the size of that data set. user.blog_entries.count on the other hand will just do a count query without pulling in any data.

I could show you a nice table of when to use what, but I’m not going to. Actually, I’m not so sure anymore since I’ve seen some weird stuff lately. Basically, you don’t use length unless you know what you’re doing. I like size, but to be sure I just use count.

If you have a lot of count queries or you want to join in a count for a big query, you might want to take a look at counter_cache. Documentation for counter_cache is hard to find, so I will tell you briefly:

  • counter cache stores the count in the parent that has many
  • this count is stored in an SQL field and should ALWAYS BE AN INTEGER (I wasted some time on that, rails will not say anything)
  • In the example of user.blog_entries, you have to open the BlogEntry model and add: belongs_to :user, :counter_cache => true
  • All you need to do now is add a migration saying: add_column :users, :blog_entries_count, :integer, :default => 0
  • If you are testing properly, you WILL get failing tests now :-)

C’est tout

This is just a small set of things you can do to get better performance. Some of it might be wrong or idiotic since it is based on my own trial and error experience (only way I learn). I hope you can use this to solve your performance luxury problem soon ;-)

In the current SNS - Social Networking Site - boom it is becoming increasingly important to deal with usability. People have accounts for many different websites and it's getting more and more tiring to register for a new account. This is one of the main reasons why Confabio.com doesn't require you to sign up and log in. And it's also one of the main reasons why websites like Wakoopa.com make their registration as painless as possible. A colleague of mine was even experimenting with the idea of omitting the username/email requirement altogether. Also, OpenID is still too young and Sun's Liberty Alliance is just too corporate and slow.

But for most social networking sites it's pretty simple: they just need people to enter information. So let's make that as easy for the user as possible.

Entering Syndication Feeds

For one of my projects I have to let users enter information about themselves. This is so they can build up their own profile. What I really like about some of the new sites is that they aggregate your blog's contents and your FlickR pictures.

One such website is the Tokyo based Social Networking Site Asooboo.com. After signing up you can enter your blog feed and FlickR username and it will keep track of all your stories and pictures. I think that's really cool and it's one of the first steps in making the web more ubiquitous. You can later change your Feed URL in your 'edit my profile':

Entering Links instead of Feeds

Entering feeds is nice, but for users who are not tech-savvy, 'Feed, RSS and Atom' might raise question marks. Therefore I think it would be nice if users didn't have to worry about feeds, but could instead just enter their links like:

My Websites and Profiles:

  • http://blog.dominiek.com/
  • http://www.flickr.com/photos/dominiekterheide/
  • http://del.icio.us/dominiekth

It would then show a fancy spinner and convert it to 'My Blog', 'My Pictures' and 'My Links'. All content will be automatically aggregated if it can detect any RSS feeds on those pages.

Detecting RSS feeds

When you use a proper browser like Mozilla Firefox you will see a syndication icon every time you visit a website that has RSS feeds:

It does this by reading certain HTML tags.

After a quick search I couldn't find any code to do this in my own project, so I wrote a little piece of code for it with a RubyOnRails integration test.

You can use it like this:

 FeedDetector.fetch_feed_url('http://blog.dominiek.com/')
 => "http://blog.dominiek.com/feed/atom.xml"
 FeedDetector.fetch_feed_url('http://blog.dominiek.com/feed/atom.xml')
 => "http://blog.dominiek.com/feed/atom.xml"
 FeedDetector.fetch_feed_url('http://www.flickr.com/photos/dominiekterheide/', :rss)
 => "http://api.flickr.com/services/feeds/photos_public.gne?id=71386598@N00&amp;lang=en-us&format=rss_200"
 # alternatively you can parse HTML with FeedDetector.get_feed_path(html_data)
 # see integration test for more examples

FeedDetector + Test

Excuse my quick mash code. The FeedDetector (lib/feed_detector.rb):


require 'net/http'

class FeedDetector

  ##
  # return the feed url for a url
  # for example: http://blog.dominiek.com/ => http://blog.dominiek.com/feed/atom.xml
  # only_detect can force detection of :rss or :atom
  def self.fetch_feed_url(page_url, only_detect=nil)
    url = URI.parse(page_url)
    # dup the host: appending to url.host directly would modify the URI object itself
    host_with_port = url.host.dup
    host_with_port << ":#{url.port}" unless url.port == 80
    # an empty path (e.g. "http://example.com") is not a valid request path
    req = Net::HTTP::Get.new(url.path.empty? ? '/' : url.path)
    res = Net::HTTP.start(url.host, url.port) {|http|
      http.request(req)
    }
    feed_url = self.get_feed_path(res.body, only_detect)
    # turn a relative feed path into an absolute URL on the same host
    feed_url = "http://#{host_with_port}/#{feed_url.gsub(/^\//, '')}" unless !feed_url || feed_url =~ /^http:\/\//
    feed_url || page_url
  end

  ##
  # get the feed href from an HTML document
  # for example:
  # ...
  # <link href="/feed/atom.xml" rel="alternate" type="application/atom+xml" />
  # ...
  # => /feed/atom.xml
  # only_detect can force detection of :rss or :atom
  def self.get_feed_path(html, only_detect=nil)
    unless only_detect && only_detect != :atom
      md ||= /<link.*href=['"]*([^\s'"]+)['"]*.*application\/atom\+xml.*>/.match(html) 
      md ||= /<link.*application\/atom\+xml.*href=['"]*([^\s'"]+)['"]*.*>/.match(html) 
    end
    unless only_detect && only_detect != :rss
      md ||= /<link.*href=['"]*([^\s'"]+)['"]*.*application\/rss\+xml.*>/.match(html) 
      md ||= /<link.*application\/rss\+xml.*href=['"]*([^\s'"]+)['"]*.*>/.match(html) 
    end
    md && md[1]
  end

end

The integration test (test/integration/feed_detector_test.rb):


require "#{File.dirname(__FILE__)}/../test_helper"


class FeedDetectorTest < ActionController::IntegrationTest

  def test_fetch_feed_url
    return # comment out this line to test HTTP fetching

    # test mephisto
    feed_url = FeedDetector.fetch_feed_url('http://blog.dominiek.com/')
    assert_equal('http://blog.dominiek.com/feed/atom.xml', feed_url)
    # test wordpress
    feed_url = FeedDetector.fetch_feed_url('http://digigen.nl/')
    assert_equal('http://digigen.nl/feed/', feed_url)

    # test non conventional port
    feed_url = FeedDetector.fetch_feed_url('http://blog.dominiek.com:8000/')
    assert_equal('http://blog.dominiek.com:8000/feed/atom.xml', feed_url)

    # test only_detect rss/atom on flickr
    feed_url = FeedDetector.fetch_feed_url('http://www.flickr.com/photos/dominiekterheide/', :atom)
    assert_equal('http://api.flickr.com/services/feeds/photos_public.gne?id=71386598@N00&amp;lang=en-us&format=atom', feed_url)
    feed_url = FeedDetector.fetch_feed_url('http://www.flickr.com/photos/dominiekterheide/', :rss)
    assert_equal('http://api.flickr.com/services/feeds/photos_public.gne?id=71386598@N00&amp;lang=en-us&format=rss_200', feed_url)

    # make sure that feeds return themselves
    feed_url = FeedDetector.fetch_feed_url('http://blog.dominiek.com/feed/atom.xml')
    assert_equal('http://blog.dominiek.com/feed/atom.xml', feed_url)
    feed_url = FeedDetector.fetch_feed_url('http://digigen.nl/feed/')
    assert_equal('http://digigen.nl/feed/', feed_url)
  end

  def test_get_feed_path
    body = []
    body << ' <html>'
    body << '  <head>'
    body << '   <link href="/super.css" rel="alternate" type="text/css"/>'
    body << '   <link href="/feed/atom.xml" rel="alternate" type="application/atom+xml" />'
    body << '  </head>'
    body << ' </html>'

    # Mephisto
    feed_path = FeedDetector.get_feed_path(body.join("\n"))
    assert_equal('/feed/atom.xml', feed_path)
    body[3] = '   <link href=\'/feed/atom.xml\' rel="alternate" type="application/atom+xml" />'
    feed_path = FeedDetector.get_feed_path(body.join("\n"))
    assert_equal('/feed/atom.xml', feed_path)

    # FlickR
    body[3] = '<link rel="alternate" type="application/atom+xml" title="Flickr: Photos from dominiekth Atom feed" href="http://api.flickr.com/services/feeds/photos_public.gne?id=71386598@N00&amp;lang=en-us&format=atom">'
    feed_path = FeedDetector.get_feed_path(body.join("\n"))
    assert_equal('http://api.flickr.com/services/feeds/photos_public.gne?id=71386598@N00&amp;lang=en-us&format=atom', feed_path)
    body[4] = '<link rel="alternate"   type="application/rss+xml" title="Flickr: Photos from dominiekth RSS feed" href="http://api.flickr.com/services/feeds/photos_public.gne?id=71386598@N00&amp;lang=en-us&format=rss_200">'
    feed_path = FeedDetector.get_feed_path(body.join("\n"))
    assert_equal('http://api.flickr.com/services/feeds/photos_public.gne?id=71386598@N00&amp;lang=en-us&format=atom', feed_path)
    feed_path = FeedDetector.get_feed_path(body.join("\n"), :rss)
    assert_equal('http://api.flickr.com/services/feeds/photos_public.gne?id=71386598@N00&amp;lang=en-us&format=rss_200', feed_path)

    # Wordpress
    body[3] = '<link rel="alternate" type="application/rss+xml" title="Digigen RSS Feed" href="http://digigen.nl/feed/" />'
    body[4] = ' </head>'
    feed_path = FeedDetector.get_feed_path(body.join("\n"), :atom)
    assert_equal(nil, feed_path)
    feed_path = FeedDetector.get_feed_path(body.join("\n"), :rss)
    assert_equal('http://digigen.nl/feed/', feed_path)
  end

end

I'm sure this might be useful to some people so Enjoy!

Tokyo on Rails

on February 14, 2007

This post will be a personal life update (with pics).

Today was my first working day as a RubyOnRails Developer in Shibuya. The company I work at does normal Japanese hours (from 10 till 10), but fortunately, they have a laid back international atmosphere. On a normal working day I communicate in Japanese, English and Flemish-Dutch with my colleagues. My co-developers are pretty cool and like-minded. The core Rails guy lives in Honolulu and flies in every few weeks. Also, most of them have read the book that just got delivered to me: Getting Real.

Another core activity next to my job is studying Japanese. Every Tuesday and Thursday evening I go to Japanese Kaiwa (conversation) class. It’s kind of interesting how slowly – but gradually – all that noise starts having a melody.

Also, I will try to make some more time for art. This weekend I visited an Australian Contemporary Art Exhibition and the Dutch movie ‘Phileine zegt Sorry’

Also I came up with my own idea of a cool art project: Yamanote Clock (This link is a writeboard, password is empty, feel free to edit!)

Since my domain darkwired.org is only used for IRC (chatting) I decided to write a web interface. The Web interface uses the EyeRC API and is built on RubyOnRails.

Currently, you can join our test channel on http://www.darkwired.org/ :

I have no intention to develop it any further, however, I might take some time to finish it off (sourcecode):

  • refactoring
  • unit test case

I hope this was the last IRC client I ever wrote :)

(related code: BASH IRC Client )