MongoDB UK & FR Conferences Ex Post Facto

The recent MongoDB conferences in London & Paris were very well organized, and also useful in a practical way. The 10gen founders were there, so we could discuss issues/challenges, and hear future development plans first-hand. Plus there were lots of interesting talks, covering tons of topics, including case studies by other folks using MongoDB, showing different use scenarios.

Both conferences were very good as a refresher course on Mongo’s features. We discovered cool things we assumed Mongo wasn’t doing because previous versions weren’t (but are now!), and also new things we never even dreamt Mongo could do. The most interesting highlights for us were:

  • The official sharding features, planned to launch in version 1.6 of MongoDB this July, will finally allow us to trash our ops scripts.
  • It was good to see how labs use MongoDB for their academic research. It shows different document designs to what we’re used to in the industry.
  • The talk about administration tools by Mathias Stearn was a good reminder that MongoDB admin commands return a lot of useful numbers.
  • And finally, a Personal Message: Eliot, you should speak slower in front of an international audience ;-)
For those of you who missed it, we’ve posted our slides on Slideshare: One Year with MongoDB at Silentale. Other shared presentations we’ve found include: Let us know if there are any others out there that we missed, and we’ll add ‘em to the list!

Attend MongoUK or MongoFR, get 20% off

Join Nicolas and myself at the MongoUK or MongoFR conferences on June 18 and June 21, and see how MongoDB plays a key role in handling the massive amount of data we’re storing for our users, how we’ve built a search engine on top of it, and which challenges we’ve been facing. And use the discount code ’silentale’ during registration to get 20% off.

We’re really excited to participate in the first MongoDB-related events in Europe and are looking forward to meeting you in London and Paris!

Gmail and OAuth, Ruby developers haz it

You’ve probably heard about Google launching OAuth to access Gmail and Google Apps Mail.

Developers can now grab emails thanks to the IMAP protocol, without asking for a password. It’s awesome to see an “old” technology, IMAP, meeting a new one, OAuth. And no need to use a new API.

To help other Ruby developers integrate Gmail+OAuth, we published a new gem, gmail_xoauth. Once installed, you get updated Net::IMAP and Net::SMTP libraries, ready to authenticate via XOAUTH. And of course, it works with Google Apps too.

To authenticate on ‘imap.gmail.com’, instead of giving a string password, you give a hash of options so that the SASL Initial Client Request will be generated and sent over the wire for you.

require 'gmail_xoauth'
imap = Net::IMAP.new('imap.gmail.com', 993, usessl = true, certs = nil, verify = false)
imap.authenticate('XOAUTH', 'roger.moore@gmail.com',
  :consumer_key => 'anonymous',
  :consumer_secret => 'anonymous',
  :token => '4/nM2QAaunKUINb4RrXPC55F-mix_k',
  :token_secret => '41r18IyXjIvuyabS/NDyW6+m'
)
messages_count = imap.status('INBOX', ['MESSAGES'])['MESSAGES']
puts "Seeing #{messages_count} messages in INBOX"

The principle is the same for SMTP:

require 'mail'
require 'gmail_xoauth'
 
mail = Mail.new do
     from 'roger.moore@gmail.com'
       to 'marcel@amont.com'
  subject 'This is a test email'
     body 'Hi!'
end
 
smtp = Net::SMTP.new('smtp.gmail.com', 587)
smtp.enable_starttls_auto
secret = {
  :consumer_key => 'anonymous',
  :consumer_secret => 'anonymous',
  :token => '4/nM2QAaunKUINb4RrXPC55F-mix_k',
  :token_secret => '41r18IyXjIvuyabS/NDyW6+m'
}
smtp.start('gmail.com', 'roger.moore@gmail.com', secret, :xoauth) do |session|
  session.send_message(mail.encoded, mail.from_addrs.first, mail.destinations)
end

Feel free to fork the public github repository and add new features !

Note: we just launched OAuth support for Gmail and Google Apps Mail on Silentale.

Don’t miss the Silentale Case Study at the MongoDB UK & FR Events!

10gen, the company behind MongoDB, is organizing two cool events on June 18th (London) and June 21st (Paris). The events will cover everything from sharding to workshops on Ruby & Python, as well as examples of MongoDB in action from folks like Fotopedia, Oupsnow, OCWSearch, and Boxed Ice.

We’ll also be presenting a case-study of what we’re doing at Silentale with MongoDB, including the challenges we face with the massive amount of data we store for our users, and what our infrastructure looks like. The agenda is very practical and interesting, and we are looking forward to talking with you at both events !

p.s. follow MongoDB on Twitter to keep up-to-date

Everyone deserves a Face

Faces: a pluggable avatar architecture for universal implementation of avatars from multiple/external sources.

Avatars have become a fundamental building block for any person-orientated web application. Avatars add a spark of life into your site and allow your users to relate to an otherwise faceless name. (Yeah, you liked that one didn’t you?)

As any developer who has worked with implementing an avatars system will know however, it can sometimes be a real pain in the derrière. Usually avatars start in a web application as a quick-fix solution with no thought for later expansion & scaling. This leaves the code isolated so that if you then decide to grow it out later, you’re going to hit some serious road bumps. This is sometimes down to bad planning by the developer and sometimes down to a lack of implementation consistency between multiple gems/plugins. The end result however is the same – it sucks.

I joined Silentale just at the start of January and one of my first tasks was to improve our current avatar architecture. When first starting out on the project I battled with many gems & plugins to try and achieve the result we wanted. When I completed my first build I sat back and looked at my code. The thought that came into my head was:

“What a big pile of dung!”

The reason for this was that to achieve the result we needed, I had to hack together 3 gems, some of our existing avatars architecture and some other messy bits and bobs to just get close to an okay implementation. This resulted in a big pile of slow and yucky code that when the next developer came to… well, I don’t like to think about that.

It was because of this that I decided to create Faces. Faces is an open-source gem that I built to rectify the problem we faced at Silentale. It allows developers to pull avatars from multiple external sources by simply providing an identifier and provider name like so:

avatar_url('someone@example.com', 'gravatar')

Pretty neat, huh? Faces can also:

  • Generate/manage a consistent HTML style with classes, ids, alts and sizes
  • Auto-calculate the closest possible size to an image size requested
  • Manage multiple default image sizes
  • A whole load more
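
To give an idea of what "closest possible size" could mean in practice, here is a minimal sketch. It is not taken from the Faces source: the `AVAILABLE_SIZES` list and the `closest_size` helper are hypothetical names made up for this illustration, and Faces itself may pick sizes differently.

```ruby
# Hypothetical list of the sizes (in pixels) a provider can serve.
AVAILABLE_SIZES = [16, 32, 48, 64, 128].freeze

# Return the available size that is nearest to the requested one.
def closest_size(requested, available = AVAILABLE_SIZES)
  available.min_by { |size| (size - requested).abs }
end

puts closest_size(50)   # 48 is only 2px away, 64 is 14px away
puts closest_size(100)  # 128 is 28px away, 64 is 36px away
```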

Faces is crazily customizable – the entire purpose of the gem is to have a universal method for managing all your different types of avatars. I’d really advise reading the documentation to get the really juicy info on how it works and how far you can take it.

At the moment Faces supports by default the following providers:


Faces is built in such a way to allow developers to add their own providers with ease by either passing Faces a direct avatar URL or building their own provider.rb file (which is so simple that monkeys could do it).
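
To show how little a provider really has to do, here is a rough standalone sketch of turning an identifier into an avatar URL for Gravatar. It does not use the Faces provider API (the `GravatarSketch` module and its `url` method are hypothetical names); it only relies on the publicly documented Gravatar scheme, where the avatar is addressed by the MD5 hex digest of the trimmed, lowercased email address.

```ruby
require 'digest/md5'

# Hypothetical provider sketch: maps an email address to a Gravatar URL.
# This is an illustration only, not the provider shipped with Faces.
module GravatarSketch
  BASE_URL = 'https://www.gravatar.com/avatar/'.freeze

  # size is the requested width/height in pixels (Gravatar's "s" parameter).
  def self.url(email, size = 80)
    hash = Digest::MD5.hexdigest(email.strip.downcase)
    "#{BASE_URL}#{hash}?s=#{size}"
  end
end

puts GravatarSketch.url('someone@example.com', 50)
```

Because the email is trimmed and lowercased before hashing, 'Someone@Example.com ' and 'someone@example.com' end up pointing at the same avatar.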

At the moment Faces is only available as a Ruby gem/plugin; however, I will be releasing it first in PHP and then in Python in the near future, so make sure to keep an eye out for that.

Make sure to check it out for implementing your next avatars system – we hope it makes your next project that little bit easier!

How to check streamed files in a Rails’ functional test

For administration purposes, I had to rework an uploaded CSV file, and stream it back to the user. I started wondering how I could test this uncommon response body type.

I knew about:

fixture_file_upload

which allows the simulation of a file upload, so I looked at the Rails source, in ActionController’s TestProcess class, where the method lies.

There I found the:

binary_content

method, which does the job by returning the entire streamed content in a string!

How it works

First, let’s take a look at the send_file source:

...
render :status => options[:status], :text => Proc.new { |response, output|
  logger.info "Streaming file #{path}" unless logger.nil?
  len = options[:buffer_size] || 4096
  File.open(path, 'rb') do |file|
    while buf = file.read(len)
      output.write(buf)
    end
  end
}
...

We see that the response is returned as a Proc. All that binary_content has to do is to:

  1. create an IO object,
  2. call the proc and pass it the IO object
  3. read the IO and return the content as a string

Here is the binary_content method:

def binary_content
  raise "Response body is not a Proc: #{body.inspect}" unless body.kind_of?(Proc)
  require 'stringio'
 
  sio = StringIO.new
  body.call(self, sio)
 
  sio.rewind
  sio.read
end

StringIO’s for the win!
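
If you want to play with the idea outside of Rails, the three steps can be reproduced in a few lines of plain Ruby. This is only a sketch: `streamed_body` is a made-up stand-in for the Proc that send_file renders, the CSV chunks are sample data, and `capture_stream` is a hypothetical helper name.

```ruby
require 'stringio'

# Stand-in for a streamed response body: a Proc that writes chunks to
# whatever output object it is given (the chunks are invented sample data).
streamed_body = proc do |_response, output|
  ["id,name\n", "1,Roger\n", "2,Marcel\n"].each { |chunk| output.write(chunk) }
end

# Collect everything a streaming Proc writes and return it as one String.
def capture_stream(body, response = nil)
  sio = StringIO.new        # 1. create an IO object
  body.call(response, sio)  # 2. call the proc and pass it the IO object
  sio.rewind                # 3. read the IO back from the start...
  sio.read                  #    ...and return the content as a string
end

csv = capture_stream(streamed_body)
puts csv
```

Once the content is a plain string, the usual assertions on its lines or columns apply.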

Cacti templates for beanstalkd

Following up on the previous post about monitoring beanstalkd with cacti, I have just uploaded the corresponding templates and script to github.

I hope you’ll find them useful !

Multiple Elastic IP addresses on AWS instances

Recently, we reworked our web frontend infrastructure to separate our corporate website from the Silentale web application. The main goal was to make sure we don’t have to involve our operations resources (a.k.a … me :-) ) too much when we update our corporate web site …

This involved setting up a few new virtual hosts and names:

My original idea was to use virtual servers, and have the web servers running from the same host. I had prepared all the necessary valid certificates and DNS configuration, and the nginx configuration was ready and working, but I had forgotten one little problem with SSL and virtual hosts. The server picks the virtual server that should respond by reading the Host: request header, but by then the SSL connection must already have been established, so a proper certificate must already have been presented to the client. nginx cannot decide which certificate to send, as it can’t read the Host: header yet!

OK, I’m dumb, and I shouldn’t forget that you can’t host several SSL servers with different names on the same IP address unless you buy one of those expensive wildcard certificates (buying certificates is as close to extortion as ‘renewing’ a domain name or trying to use one of those “limited unlimited 3G+” offers, but that will be the subject of another post …)

Listening to different ports ?

I could have decided to make the servers listen to different ports, but
  • our platform is getting more and more complex, and we’d rather avoid doing that sort of thing
  • we don’t want to confuse our users (and especially the new beta users who have to get a very good impression of our service) with TCP ports that could be filtered on their end …

Off to elastic IP ?

Amazon Web Services provides a very useful tool for web servers which they call Elastic IP. You reserve a static, public IP address in their subnet, and you can seamlessly attach it to any instance. You can switch it from instance to instance, and in a matter of seconds, the new server will be reachable on this address, without you having to update DNS records or inflict any downtime on your users. Very cool stuff when upgrading or migrating!

So my problem could have been solved if I could assign multiple elastic IPs to the same EC2 instance, and make each of my SSL servers listen on one of them … unfortunately, you just can’t do that :-( . An EC2 instance gets an address on an RFC 1918 subnet, and that’s the only one you can see from it. In order for you to be able to connect to it, Amazon creates some sort of reverse NAT between the outside world and your instance (I’d like to know how they handle that in their border routers. It must be really interesting and challenging to operate ;-) )

Several instances …

So, in our case, the only proper solution is to start 3 instances when we could have used only one. This is really frustrating and unfortunate … If you came across the same problem on AWS and have found a better solution, I’d be very happy to hear about it.

If someone from Amazon reads this: the feature isn’t critical, but its absence prevents any kind of resource sharing on your cloud. I don’t think it is part of your core business, but it may represent a significant share of it in the future (low cost hosters, etc …)

Monitoring beanstalkd with cacti & creating custom cacti templates

[update: the script and templates discussed below are available on github: http://github.com/earzur/cacti-beanstalkd-templates]

At Silentale, we use working queues to orchestrate the different services in our platform.

Passing messages between different processes is a pretty common pattern for distributing a process across many computers, and our platform makes extensive use of Beanstalkd, which is a fast, lightweight and robust server that does just that: it lets us implement working queues.

Monitoring the number of jobs available (ready, in the beanstalkd idiom), the number of workers consuming jobs, and the number of jobs that failed (buried) allows operators to get a pretty good idea of how the platform is performing and to add or remove capacity when needed. This is particularly useful in our AWS-based environment, where you can add and remove capacity in no time.

For instance, when sending a new batch of invitations to beta users to test our product, we get significant spikes in our working queues, and may need to temporarily add new servers, and then shut them down when we have finished processing the queues.

The protocol used by beanstalkd is pretty simple and provides ways to easily collect usage statistics about the server itself and every individual queue that you might have defined.

Another tool we use a lot at Silentale is Cacti. By default, Cacti lets you graph data collected from SNMP-enabled servers pretty easily, so you can graph metrics about your servers such as network bandwidth, load average, and memory usage. But you are not limited to SNMP: you can also graph data collected through scripts. This is a bit more involved, as the MySQL Cacti templates demonstrate.

After spending a while looking for an equivalent project that would support beanstalkd and work in our environment, I went on to write my own cacti templates for monitoring our beanstalk queues …

available statistics

The beanstalk protocol provides a lot of different figures about what’s going on in the server both globally and by individual queue. For my little project, I chose to concentrate on the individual queues, as monitoring the number of new jobs in each queue is critical to our operations.

The beanstalk-client gem’s ‘stats_tube’ method returns a hash containing a lot of different values, among which are the ones that we want to graph:

  • current-jobs-buried
  • current-jobs-delayed
  • current-jobs-ready
  • current-jobs-reserved
  • current-jobs-urgent
  • current-using
  • current-waiting
  • current-watching
  • total-jobs
Collecting those involves writing a very short and simple ruby script:

#!/usr/bin/env ruby

require 'rubygems'
require 'beanstalk-client'
require 'trollop'

# Parse and validate the command-line options.
script_name = __FILE__.split('/').last
opts = Trollop::options do
  version "#{script_name} © 2009 Silentale S.A."
  banner <<-EOS
display statistics about a queue on a beanstalkd server
  EOS
  opt :server, 'beanstalk server address', :type => :string
  opt :port, 'beanstalk server port (default: 11300)', :type => :integer
  opt :queue, 'name of the beanstalk queue', :type => :string
end
Trollop::die :server, 'is mandatory' unless opts[:server]
Trollop::die :queue, 'is mandatory' unless opts[:queue]
opts[:port] = 11300 unless opts[:port]

# Connect to the server and fetch the statistics of the requested queue.
connection = Beanstalk::Connection.new "#{opts[:server]}:#{opts[:port]}"
stats = connection.stats_tube opts[:queue]

# The queue name is not a metric, so drop it.
stats.delete 'name'

# Build a single line of "name:value" pairs, sorted by name.
result = ''
stats.keys.sort.each do |k|
  result << "#{k}:#{stats[k]} "
end

puts result

We just connect to the server, call the stats_tube method, and dump the resulting hash after sorting the keys by name.

cacti configuration

The output of the script is pretty straightforward: a single line of name:value pairs, separated by spaces. Cacti can directly parse the output using what is called a data input method. Data input methods link external scripts with data fields that can be used in data templates, and those data templates can then be used in graph templates.
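
To make the output format concrete without needing a running beanstalkd server, here is a small sketch that formats a hash the same way as the script and parses the line back. The numbers in `sample_stats` are invented, and `format_for_cacti` and `parse_cacti_line` are hypothetical helper names used only for this illustration (unlike the script, the formatter does not leave a trailing space).

```ruby
# Invented sample values; a real run would use the hash returned by the server.
sample_stats = {
  'current-jobs-ready'  => 42,
  'current-jobs-buried' => 0,
  'current-waiting'     => 3
}

# Same idea as the collection script: sorted keys, "name:value" pairs,
# separated by spaces, all on one line.
def format_for_cacti(stats)
  stats.keys.sort.map { |k| "#{k}:#{stats[k]}" }.join(' ')
end

# Reverse operation, handy for checking the script output by hand.
def parse_cacti_line(line)
  pairs = line.split.map { |pair| pair.split(':', 2) }
  pairs.map { |name, value| [name, Integer(value)] }.to_h
end

line = format_for_cacti(sample_stats)
puts line
# current-jobs-buried:0 current-jobs-ready:42 current-waiting:3
```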

One unfortunate problem with cacti is that everything has to be done using the GUI, involving a lot of clicks and typing. I didn’t find any way to circumvent this problem, using scripts or any automated tool.

If you managed to read this far and know of any such tool (or a pointer to cacti’s data model) so we can script such tasks, I will be more than happy to receive comments ;-)

data input method

Using cacti’s console, defining a new data input method is straightforward. You just have to create the output fields one by one and map them to the output of the script.

Data input method screen

As you can see in this screenshot, you need to specify the full path to the data-collection script, specify input parameters between angle brackets (< and >) (here we define <server>, <port> and <queue>), and then specify the resulting data fields one by one.

Now we need to define a data template, that cacti will use to update the RRD archives and generate the graphs.

data template

The data template maps the output fields of the data input method to “data source items”, which are fields in the RRD archive that will store the collected data.

Those fields have to be carefully named. I have used a convention which I found in the MySQL cacti templates: a few capital letters, the same for every field in the template, followed by the name of the parameter. It is extremely useful to do that, because cacti orders the field names alphabetically in the field selection dropdown list, and there can be many of them! So remember to carefully name your input fields in this screen!

In this screen you also tell cacti how to collect the input parameters for your script. This way, cacti will allow you to specify default values when you create graphs by associating hosts with the graph templates. Checking the box at the left of the input field will make cacti ask for the associated input value in the forms where you create graphs for hosts. Not checking the box will let cacti decide for you. Here, I’ve left the host field blank with the box unchecked to make sure cacti will automatically fill it with the name of the host to which you are attaching a graph using the template.

Bottom of the "data template" screen

graph template

Graph templates are pretty simple to define. You specify the list of fields that you want to display in the graphs and the graph style (colors, line size, etc.).

In our platform, we want new users, those that just registered, to see messages in their timeline as soon as possible, so we use a nifty beanstalkd feature which allows us to mark some jobs “urgent” and make sure that those jobs will be handled before the others. Another very nice feature is the ability to delay jobs for later handling. For instance, if our backend fails to parse some kinds of messages because of encoding issues or whatever, instead of giving up we just delay those jobs for a while, get an alert, try to fix the problem in our backend and let the messages flow back in again. That’s the kind of thing that beanstalkd and working queues can help you do. I just can’t imagine the complexity of doing such things in a monolithic, old-school process!
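
To illustrate how urgent and delayed jobs show up in the statistics, here is a toy, in-memory model. It is neither the beanstalkd implementation nor the beanstalk-client API: `ToyTube`, `Job`, the job bodies, the priorities and the delays are all made up for this sketch. The only things borrowed from beanstalkd are the conventions that lower priority numbers are served first and that ready jobs with a priority below 1024 are counted as urgent.

```ruby
# Toy model of a beanstalkd queue, for illustration only.
Job = Struct.new(:body, :priority, :ready_at)

class ToyTube
  URGENT_BELOW = 1024 # ready jobs with a lower priority number count as urgent

  def initialize
    @jobs = []
  end

  # Lower priority numbers are served first; delay is in seconds after `now`.
  def put(body, priority: 65_536, delay: 0, now: 0)
    @jobs << Job.new(body, priority, now + delay)
  end

  # Remove and return the highest-priority job that is ready at `now`, or nil.
  def reserve(now)
    job = ready(now).min_by(&:priority)
    @jobs.delete_at(@jobs.index(job)) if job
    job
  end

  # A subset of the counters graphed in this post.
  def stats(now)
    ready_jobs = ready(now)
    {
      'current-jobs-ready'   => ready_jobs.size,
      'current-jobs-urgent'  => ready_jobs.count { |j| j.priority < URGENT_BELOW },
      'current-jobs-delayed' => @jobs.size - ready_jobs.size
    }
  end

  private

  def ready(now)
    @jobs.select { |j| j.ready_at <= now }
  end
end

tube = ToyTube.new
tube.put('index old message')
tube.put('index message for new user', priority: 0)  # urgent
tube.put('retry badly encoded message', delay: 300)  # comes back in 5 minutes

p tube.stats(0)
puts tube.reserve(0).body
```

In this model the urgent job is handed out first, and the delayed job only becomes ready once its five minutes have passed, which is the behaviour the graph is meant to make visible.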

Let’s define a graph that will display

  • the number of jobs ready to be processed
  • the number of jobs with the “urgent” flag
  • the number of jobs which have been delayed
  • the number of processes currently actively picking up jobs from the queue
Create a new graph template … Most of the default values will work OK. Again, take care with the naming convention here: your graphs will be much easier to pick out when linking them with hosts …

Add graph template screenshot ...

Then you’ll be presented with a screen where you will have to painfully enter the names of the fields you wish to graph, and you will quickly realise why I insisted on the naming convention for the data fields above … ;-)

Graph template with data fields defined

Almost there !!!!

And now, you just have to associate this graph template to the host running your beanstalk server. After a while, you’ll realize that cacti started collecting data and will display nice graphs showing the state of your beanstalk queues, thanks to rrdtool !

Not wanting to disclose too much about our operations, I have edited this screenshot to remove figures and hostnames, but you get the idea. The most interesting one is in the upper right corner, with the crawler queue, where we have a constant pool (green stripe on the lower part) of workers pulling crawling jobs (IMAP, POP, Twitter, etc …) and posting every new message they’ve found in another queue whose job is to index them … One of my goals as an operator is to keep all these queues as flat as possible. The whitespace indicates that I need to add some crawling capacity, because some jobs are not performed on time. They are performed eventually, but not within the window of time that we allowed ourselves to operate … We are also currently working on some changes that should help reduce the spikes in that particular graph.

resulting graphs in cacti's thumbnail view mode

What’s next ?

As you can see, defining custom templates for cacti is a really involved task. You spend a lot of time fiddling with settings and try/error/revert cycles. That’s why I plan to package my templates, make sure they are generic enough to work outside of our own environment and publish the result or contribute it to the MySQL cacti templates project. Stay tuned …

Having such monitoring in place is invaluable and has been absolutely necessary for us to keep our service online !

Lazy loading with Javascript and Firefox

Lazy loading has been a hot topic on the internet recently. The amount of bandwidth available to the average user has grown tenfold during the last few years, but the web itself has evolved a lot, and so have the contents of the temp folder.

This has created a need to dynamically load a piece of script (to make the host machine download it) only when it is needed, instead of just forcing the user to download everything on the very first load.
