Hypertable
Taking Hypertable 0.9.2.1 for a ride.
Filed in archive Tools by Mateusz Berezecki on February 4, 2009
This is the first post in the series about tools and how you can leverage them to increase your ability to scale. In this post we're going to take a look how it is possible to store a large quantity of data in the Hypertable. The data store chosen here is for a reason, as you will see in a moment.

Consider the following scenario, that you have a lot of data exposed in some internet location that you want to have at local site and be able to sift through it. Perhaps you are creating a web crawler and you must have this data on site, not in a remote location to avoid latency factor when processing it. Let's take Wikipedia for example. You could get the dump of the whole Wikipedia and try one of the two: search through the exported wikipedia in a XML file, maybe even extract articles into separate files (that would be inefficient and would probably be a real pain for your filesystem) or store it in some more clever way. What you need in such scenario is a data storage that can handle this amount of data with great ease. It turns out a Hypertable is a good tool for doing this kind of work.

If you want to follow this example then I'm assuming that you have hypertable compiled, installed and configured on some cluster already, and if you are having trouble then please leave me a note in the comments.

We're going to consider the latest data dump of an english version of Wikipedia that can be downloaded at the following address: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This file is 4.1 gigabytes in size, and it's about 20 gigabytes after decompression so you'll need at least 25 gigabytes of free space to decompress it. After decompression you can remove the compressed file.

Downloading the Wikipedia can be done with a standard set of tools



$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2 && rm enwiki-latest-pages-articles.xml.bz2



Once you are done, then grab this script and save it somewhere.
The next step is to prepare Python thrift library so that we can connect to Hypertable using Python.

This can be done by copying the gen-py and hyperthrift directories from the source code directory you've compiled Hypertable at. For example



$ mkdir ~/wiki_example
$ cd ~/wiki_example
$ wget http://meme.pl/convert_wiki.py
$ cp -R ~/hypertable/src/py/ThriftClient/gen-py ~/wiki_example/gen-py/
$ cp -R ~/hypertable/src/py/ThriftClient/hypertable ~/wiki_example/hypertable/
$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2 && rm enwiki-latest-pages-articles.xml.bz2



Next step is to create a wikipedia table and we do so by typing in following commands into the
hypertable command line utility



hypertable> create table wikipedia (title,article, ACCESS GROUP meta (title), ACCESS GROUP text (article));



The above command will create a table called wikipedia with just two column families:
  • title - containing the title of the article

  • article - containing the page content


and two access groups
  • meta

  • text


These two access groups tell the hypertable to store their member column families in a separate physical file. With the above type of configuration we're going to have titles stored in a different file than full text of the articles. This will enable us to search through titles much faster as the files containing article text will not be processed at all during this process.
Once done with this setup we go back to the command shell and run the conversion script



$ cd ~/wiki_example
$ env PYTHONPATH="gen-py" python convert_wiki.py



This step is a last step and it can take a few hours depending on your cluster configuration, but once everything goes well you'll have a wikipedia dump imported into your hypertable installation.

The import script uses articles' titles as record keys so they are used for querying.
So let's search for some articles which titles start with Dirichlet:



hypertable> select title from wikipedia where row =^ "Dirichlet";
[... output truncated for brevity ...]
Elapsed time: 0.21 s
Avg value size: 22.02 bytes
Avg key size: 23.02 bytes
Throughput: 13245.13 bytes/s
Total cells: 63
Throughput: 294.13 cells/s



We can see there're 63 articles starting with the word Dirichlet.
Doing the same type of query for the word Time yields following results



hypertable> select title from wikipedia where row =^ "Time";
[... output truncated for brevity ...]
Elapsed time: 0.04 s
Avg value size: 26.74 bytes
Avg key size: 27.74 bytes
Throughput: 4348717.61 bytes/s
Total cells: 3585
Throughput: 79815.66 cells/s



You can clearly see the throughput numbers are really high but to be honest with the readers I have to admit that this is a heavy caching in action, as doing the query for the first time is not going to yield such impressive numbers. For example running the query for articles starting with ZX yields these results:



hypertable> select title from wikipedia where row =^ "ZX";
[... output truncated for brevity ...]
Elapsed time: 0.19 s
Avg value size: 9.51 bytes
Avg key size: 10.51 bytes
Throughput: 3783.18 bytes/s
Total cells: 35
Throughput: 188.89 cells/s



Now, running it for the second time gives us much higher numbers



Elapsed time: 0.01 s
Avg value size: 9.51 bytes
Avg key size: 10.51 bytes
Throughput: 47612.58 bytes/s
Total cells: 35
Throughput: 2377.23 cells/s



Let's assume, for the sake of example , that you're interested in articles starting with a letter 'A'. You can get these easily by running this query



hypertable> select title from wikipedia where ('A' <= ROW < 'B');



and the numbers you get for cached and uncached version are really high in both cases. Initial query results in these performance numbers



[... output truncated for brevity ...]
Elapsed time: 14.17 s
Avg value size: 18.42 bytes
Avg key size: 19.42 bytes
Throughput: 1005019.91 bytes/s
Total cells: 376198
Throughput: 26557.52 cells/s



Subsequent queries result in this kind of performance



Elapsed time: 2.77 s
Avg value size: 18.42 bytes
Avg key size: 19.42 bytes
Throughput: 5131230.21 bytes/s
Total cells: 376197
Throughput: 135591.75 cells/s



As you can see clearly for yourself the numbers are really high. They could be even better! Trying to be objective I have to confess that the first test was much slower due to output going to the console over a wireless network which results in delays when performing output.

This post's goal was not to be a thorough performance review of hypertable so if you are interested in numbers then you should definitely take hypertable for spin yourself, but if you're interested in how you can utilize extra scalability then stay tuned for the next post in the Tools series. I'm going to explain how to make wikipedia imported into the hypertable searchable.

Update: I forgot to mention that these tests were performed on a x86_64 Ubuntu 8.10 Linux with hypertable compiled and operating in 64 bit mode on a single Intel Core 2 Quad Q6600 processor clocked at 2.4 GHz with 4 GB of physical memory and a single 500 GB SATA II disk.

Related Entries:

Permalink: Taking Hypertable 0.9.2.1 for a ride.
Tags: wikipedia  hypertable  import  convert  openx  noscript+section  size+bytes  enwiki+latest 
Trackback: http://publish.creative-weblogging.com/publish/mt-tb.pl/142492
img Addthis img Ask img Blinklist img del.icio.us img Digg img Fark img Facebook img Google img Lycos img Ma.gnolia Add this page to Mister Wong Mr Wong img Netscape img Netvousz img Newsvine img Reddit img StumbleUpon img Slashdot img Tailrank img Technorati img Wink img Yahoo

Vote for Taking Hypertable 0.9.2.1 for a ride.:

  • Currently 7.00/10
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
Rating: 7.00 out of 5 vote(s) cast.
 
Subscribe
Share It
RSSrss
See all blog subscribe options
Google google
What is RSS?
Yahoo! yahoo
Addthis Subscribe using any feed reader!
Bloglines Bloglines
Newsletter

TwitterFollow us on Twitter!