Ruby & OpenCalais: Semantically Tag Anything
The Semantic What??
If you’ve been wondering what all the fuss regarding the Semantic Web is about, fear not – you’re not the only one! As well as being a trendy buzz phrase, the Semantic Web has many practical applications. In this tutorial, we’ll explore a few of these using a Semantic Analysis Web Service right from within our favorite programming language!
The OpenCalais Web Service
In short, and quoting the OpenCalais creators:
“The OpenCalais Web Service automatically creates rich semantic metadata for the content you submit – in well under a second. Using natural language processing (NLP), machine learning and other methods, Calais analyzes your document and finds the entities within it. But, Calais goes well beyond classic entity identification and returns the facts and events hidden within your text as well.”
The OpenCalais Simple XML Response Format (which we’ll be using for this tutorial) returns three kinds of tags: Entities, Events and Topics. Entities are static ‘things’, like Persons, Places, et al. that are involved in the textual context in some capacity. OpenCalais assigns a relevance score to each entity to indicate its relevance within the context of the data source’s general topic. Events are facts or actions that pertain to one or more Entities. Topics are a characterization or generic description of the data source’s context.
We can use this metadata and its attributes to extract relevant information from the data or to draw useful conclusions about it, as we shall see in the next few sections.
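To make the three kinds of tags a bit more concrete, here is a minimal sketch in plain Ruby. Everything in it is made up for illustration: the `Tag` struct, its field names and the sample values are assumptions, not the OpenCalais response format or the DoverToCalais API. It only shows the sort of question the metadata lets us ask, such as "which entities matter most in this text?"

```ruby
# Hypothetical stand-in for a semantic tag; NOT the real OpenCalais schema.
Tag = Struct.new(:kind, :name, :value, :relevance)

# Invented sample data, for illustration only.
SAMPLE_TAGS = [
  Tag.new(:entity, 'Person', 'Jane Doe', 0.71),
  Tag.new(:entity, 'City', 'Springfield', 0.32),
  Tag.new(:entity, 'Company', 'Acme Corp', 0.12),
  Tag.new(:event, 'Acquisition', 'Acme Corp acquires Widget Ltd', nil),
  Tag.new(:topic, 'Topic', 'Business_Finance', nil)
].freeze

# Return the entities whose relevance is at least `threshold`,
# most relevant first.
def relevant_entities(tags, threshold)
  tags.select { |t| t.kind == :entity && t.relevance >= threshold }
      .sort_by { |t| -t.relevance }
end

# Return the values of all topic tags.
def topic_names(tags)
  tags.select { |t| t.kind == :topic }.map(&:value)
end

puts relevant_entities(SAMPLE_TAGS, 0.3).map(&:value).inspect
puts topic_names(SAMPLE_TAGS).inspect
```

Filtering on the relevance score is what lets us ignore entities that are only mentioned in passing.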
Before We Start…
- To use the OpenCalais Web Service, you need to get an OpenCalais API key, which is easily obtainable from the OpenCalais web site.
- To interact with OpenCalais through our Ruby code, we’ll use the DoverToCalais gem. DoverToCalais wraps nicely around the OpenCalais API and provides some useful functionality on top, such as callbacks, multiple data-source formats, and result filtering. To use the gem, just add this line to your application’s Gemfile:
gem 'dover_to_calais'
And then execute:
$ bundle
Or install it yourself as:
$ gem install dover_to_calais
Also, bear in mind that DoverToCalais requires the presence of a working JRE.
Let’s Begin
We’ll start small by analyzing a URL and seeing what we get back. As DoverToCalais uses the power of EventMachine, our code must be placed within an EM reactor loop:
1 require 'dover_to_calais'
2
3 EM.run do
4
5 # use Control + C to stop the EM
6 Signal.trap('INT') { EventMachine.stop }
7 Signal.trap('TERM') { EventMachine.stop }
8
9 # we need an API key to use OpenCalais
10 DoverToCalais::API_KEY = 'my-opencalais-api-key'
11 # create a new dover
12 dover = DoverToCalais::Dover.new('https://www.bbc.co.uk/news/world-africa-24412315')
13 # parse the text and send it to OpenCalais
14 dover.analyse_this
15 puts 'do something....'
16 # set a callback for when we receive a response
17 dover.to_calais { |response| puts response.error ? response.error : response }
18
19 puts 'do something else....'
20
21 end
The above is quite straightforward: we wrap our data source in a Dover object (line 12), set a callback to do what we’d like with the response data (line 17), and send our dover for tagging (line 14). Run it and the callback prints the raw OpenCalais response, which is basically a big pile of XML. Fortunately, DoverToCalais allows us to easily parse and filter this XML, as we’ll see further on in this tutorial.
Also bear in mind that, although we passed a URL to the DoverToCalais constructor, the gem can handle most file formats, too. We could just as easily have passed it a path to any (well, almost any) kind of file and have its content tagged as well.
To get a feel for some more in-depth and realistic usage of DoverToCalais, we have to talk about Connor.
Connor’s Conundrum
Connor is a budding journalist for an upstate newspaper. His boss has just asked him to write a one-pager on all the significant sport-related news that took place in the (fictional) town of Alderwood during the last year. She sends him a link to a network folder containing a scan of every article published by the Alderwood Gazette in the last year (provided courtesy of the VAST Challenge 2006). “That’s where you’ll find what you’ll need”, she says. “I want your page on my desk first thing tomorrow morning!” And with that, she leaves the office. Connor is left contemplating his choices:
- Read each and every article and filter out the ones containing sports news. Unfortunately, poor Connor just doesn’t have that many hours, nor the Buddhist-monk-like mental focus required for this task.
- Use a grep-like tool to search for sport-related keywords in each article. At first, this seems like a good idea. However, Connor soon realizes that it isn’t so practical after all. What sports is he meant to search for? Basketball, football, baseball, who knows what the good residents of Alderwood are keen on? And what about the more exotic sports, such as sumo wrestling or nude yodelling? How is he meant to know all the possible sport-related activities that were mentioned in the Alderwood Gazette over the last year so that he can search for them?! Not to mention the false positives he’s going to get when an unrelated article mentions a sport in passing and out of context. No, he thinks, grep isn’t the answer either.
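Connor’s doubts about the keyword approach are easy to reproduce. The sketch below is plain Ruby with invented article snippets and a guessed keyword list (the `keyword_matches` helper is hypothetical too); it shows both failure modes at once: a passing mention that matches, and an unlisted sport that doesn’t.

```ruby
# Invented article snippets, for illustration only.
ARTICLES = {
  'article_1.txt' => 'The high school football team won the regional final on Saturday.',
  'article_2.txt' => 'The council produced enough paperwork to cover a football field.',
  'article_3.txt' => 'The local sumo wrestling club opens its doors to new members.'
}.freeze

# A guessed keyword list; any sport not on it is silently missed.
SPORT_KEYWORDS = %w[football basketball baseball].freeze

# Return the names of the articles containing at least one keyword
# as a whole word, ignoring case.
def keyword_matches(articles, keywords)
  pattern = Regexp.union(keywords.map { |k| /\b#{Regexp.escape(k)}\b/i })
  articles.select { |_name, text| text.match?(pattern) }.keys
end

puts keyword_matches(ARTICLES, SPORT_KEYWORDS).inspect
```

The council article is matched even though it has nothing to do with sport, while the sumo article slips through because nobody thought to list "sumo".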
Luckily, Connor knows Ruby and he’s read about DoverToCalais. Suddenly, Connor’s future looks bright!
Making Sense of Big Data
Connor starts churning out code:
1 require 'dover_to_calais'
2 EM.run do
3
4 # use Control + C to stop the EM
5 Signal.trap('INT') { EventMachine.stop }
6 Signal.trap('TERM') { EventMachine.stop }
7
8 DoverToCalais::API_KEY = 'my-opencalais-api-key'
9 data_dir = '/home/connor/data/Alderwood_News/'
10
11 dovers = []
12 Dir.foreach(data_dir) do |filename|
13 next if filename == '.' or filename == '..'
14 dover = DoverToCalais::Dover.new(data_dir + filename)
15 dovers << dover
16 end
17
18 ##what now?
19 end
So far, so good. Connor has created a list of all his data files as Dover objects. Next, he has to analyze those objects. He replaces the ##what now? comment (line 18) with:
18 dovers.each do |dover|
19   dover.to_calais do |response|
20     if response.error
21       puts "*** Data source #{dover.data_src} error: #{response.error}"
22     else
23       topics = response.filter({:entity => 'Topic'})
24       puts "Data file: #{dover.data_src} is about #{topics.map{|x| x.value}.join(",")}"
25
26     end #if
27   end #block
28
29   dover.analyse_this
30 end #each
For each dover object (i.e. each news article), Connor has set a callback (line 19) specifying what to do when he gets a valid response, which is to extract the ‘Topic’ entities so that he can see what the article is all about. Then he’s sending the object to OpenCalais for tagging (line 29).
Also note that, in order to display only the names of the topics of each dover, Connor has mapped and joined the value attribute of each Topic into a string. Clever Connor!
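If the map-and-join step looks cryptic, here it is in isolation. This is only a sketch: `FakeTopic` is a hypothetical stand-in for whatever objects `response.filter` returns (the one thing assumed about them is the `value` attribute used above), and the topic names are invented.

```ruby
# Hypothetical stand-in for a filtered Topic; the real DoverToCalais
# class may differ. Only the `value` attribute is assumed.
FakeTopic = Struct.new(:value)

# Collect each topic's value, then join them into one readable string.
def topic_summary(topics)
  topics.map(&:value).join(', ')
end

# Invented topics, for illustration only.
sample_topics = [FakeTopic.new('Sports'), FakeTopic.new('Human Interest')]
puts "Data file: example.txt is about #{topic_summary(sample_topics)}"
```

Here `map(&:value)` is shorthand for the `map { |x| x.value }` block in Connor’s code.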
Now he’s ready to run the code. But, when he does, his face turns ashen… a number of error messages have started popping up on his screen!
Connor shouldn’t panic though. The explanation is quite simple: OpenCalais limits the number of concurrent requests per user to a maximum of 4 per second. When Connor calls dover.analyse_this, DoverToCalais makes an HTTP request to OpenCalais. As this request happens once for each iteration over the dovers array, it means that dozens of requests are being fired at OpenCalais at the same time. No wonder OpenCalais complains about it!
How can Connor fix this problem?
Hand on the Throttle
Connor might be tempted to address the problem by adding a sleep statement just before dover.analyse_this. However, that just wouldn’t work. Connor’s forgetting that his code runs within an EventMachine thread, and sleep would block the reactor itself, stalling every pending callback. The details of the EM threading model are outside the scope of this tutorial, but suffice it to say that using sleep would be a baaad idea!
This is where the aptly-named EM Throttled Queue comes into play. A throttled queue allows us to control the rate at which items are popped off it.
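To see what "controlling the rate" means in numbers, here is a small sketch in plain Ruby that doesn’t touch EventMachine at all. It is not how EM Throttled Queue is implemented; the `release_schedule` and `max_per_window` helpers are hypothetical and use a simplified fixed-window model to work out the earliest second at which each queued item could be released.

```ruby
# Hypothetical helper: earliest release time (in seconds) for each of
# `count` queued items when at most `limit` items may leave the queue
# every `interval` seconds. A simplified fixed-window model.
def release_schedule(count, limit, interval = 1)
  (0...count).map { |i| (i / limit) * interval }
end

# Largest number of items released at the same instant.
def max_per_window(schedule)
  schedule.group_by(&:itself).values.map(&:size).max || 0
end

schedule = release_schedule(7, 2)
puts schedule.inspect
puts max_per_window(schedule)
```

With a limit of two per second, seven items are spread over four seconds and no second ever sees more than two of them, which is comfortably below the OpenCalais limit of four.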
So, Connor modifies the code like so:
require 'dover_to_calais'
require 'em/throttled_queue'

EM.run do
  # use Control + C to stop the EM
  Signal.trap('INT') { EventMachine.stop }
  Signal.trap('TERM') { EventMachine.stop }

  DoverToCalais::API_KEY = 'my-opencalais-api-key'
  data_dir = '/home/connor/data/Alderwood_News/'

  # allow up to 2 dovers to be de-queued in a second
  # that should keep us well within the OpenCalais
  # concurrency limit
  queue = EM::ThrottledQueue.new(2, 1)

  dovers = []
  Dir.foreach(data_dir) do |filename|
    next if filename == '.' || filename == '..'
    dover = DoverToCalais::Dover.new(data_dir + filename)
    dovers << dover
    # push the dover on our throttled queue as well
    queue.push(dover)
  end

  # set the callback on every dover before any request is sent
  dovers.each do |dover|
    dover.to_calais do |response|
      puts "----------------------------------------"
      if response.error
        puts "*** Data source #{dover.data_src} error: #{response.error}"
      else
        topics = response.filter({:entity => 'Topic'})
        puts "Data file: #{dover.data_src} is about #{topics.map{|x| x.value}.join(",")}"
      end #if
    end #block
  end #each

  # pop each dover exactly once, outside the loop above; because we told
  # the queue to pop a maximum of two dovers per second, we stay under
  # the OpenCalais limit and should get no rate errors
  dovers.length.times { queue.pop { |dover| dover.analyse_this } }
end
Connor is all smiles. In a short time, his code tagged all the articles with Topic tags, allowing him to quickly know what each article is about. Now, he can easily select the sports-related articles for further reading and save hours of effort and worry. He can scan through only the relevant articles and produce his piece in time for tomorrow’s edition.
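That last selection step can be automated as well. The sketch below assumes Connor has collected the topic names into a hash keyed by file name instead of just printing them; the hash contents, the `files_about` helper and the 'Sports' topic label are all made up for illustration.

```ruby
# Invented results, for illustration only: file name => topic names,
# in the shape Connor's callback could collect them.
TOPICS_BY_FILE = {
  'news_001.txt' => ['Politics'],
  'news_002.txt' => ['Sports', 'Human Interest'],
  'news_003.txt' => ['Business_Finance'],
  'news_004.txt' => ['Sports']
}.freeze

# Return the files whose topic list includes `topic`.
def files_about(topics_by_file, topic)
  topics_by_file.select { |_file, topics| topics.include?(topic) }.keys
end

puts files_about(TOPICS_BY_FILE, 'Sports').inspect
```

Unlike the keyword search, this matches on what the article is about, not on which words happen to appear in it.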
Epilogue
In this tutorial, we’ve seen how we can use the OpenCalais service to tag and summarize large volumes of content. Still, we’ve only just scratched the surface of the rich semantic metadata the service offers and how we can use it for some more refined data analysis, mining and intelligent search.