Hypertable
Hypertable installation guide.
Filed in archive Tools by Mateusz Berezecki on February 21, 2009
During the course of many months there were many people complaining about the difficulty of installing Hypertable on their machines, but the root cause of the problem was many sophisticated dependencies required for performance reasons, and not Hypertable itself. Time passed and things have changed. There's now a more or less complete, down to the gory details dependency installation guide for various linux flavors as well as OS X 10.4 and 10.5 family.

For more technical details please visit HowToBuild page on Hypertable Wiki.
Bookmark
img Addthis
img Ask
img Blinklist
img del.icio.us
img Digg
img Fark
img Facebook
img Google
img Lycos
img Ma.gnolia
Add this page to Mister Wong Mr Wong
img Netscape
img Netvousz
img Newsvine
img Reddit
img StumbleUpon
img Slashdot
img Tailrank
img Technorati
img Wink
img Yahoo
Relational vs Key/Value based
Filed in archive Articles by Mateusz Berezecki on February 14, 2009
Tony Bain from Read Write Web is running an interesting piece which is easy to follow, yet may become an eye opener. Brief, concise, straight to the point in just 3 pages. This is highly recommended reading:
http://www.readwriteweb.com/archives/is_the_relational_database_doomed.php

I am just wondering why they did not mention Hypertable?
Bookmark
img Addthis
img Ask
img Blinklist
img del.icio.us
img Digg
img Fark
img Facebook
img Google
img Lycos
img Ma.gnolia
Add this page to Mister Wong Mr Wong
img Netscape
img Netvousz
img Newsvine
img Reddit
img StumbleUpon
img Slashdot
img Tailrank
img Technorati
img Wink
img Yahoo
Taking Hypertable 0.9.2.1 for a ride.
Filed in archive Tools by Mateusz Berezecki on February 4, 2009
This is the first post in the series about tools and how you can leverage them to increase your ability to scale. In this post we're going to take a look how it is possible to store a large quantity of data in the Hypertable. The data store chosen here is for a reason, as you will see in a moment.

Consider the following scenario, that you have a lot of data exposed in some internet location that you want to have at local site and be able to sift through it. Perhaps you are creating a web crawler and you must have this data on site, not in a remote location to avoid latency factor when processing it. Let's take Wikipedia for example. You could get the dump of the whole Wikipedia and try one of the two: search through the exported wikipedia in a XML file, maybe even extract articles into separate files (that would be inefficient and would probably be a real pain for your filesystem) or store it in some more clever way. What you need in such scenario is a data storage that can handle this amount of data with great ease. It turns out a Hypertable is a good tool for doing this kind of work.

If you want to follow this example then I'm assuming that you have hypertable compiled, installed and configured on some cluster already, and if you are having trouble then please leave me a note in the comments.

We're going to consider the latest data dump of an english version of Wikipedia that can be downloaded at the following address: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This file is 4.1 gigabytes in size, and it's about 20 gigabytes after decompression so you'll need at least 25 gigabytes of free space to decompress it. After decompression you can remove the compressed file.

Downloading the Wikipedia can be done with a standard set of tools



$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2 && rm enwiki-latest-pages-articles.xml.bz2



Once you are done, then grab this script and save it somewhere.
The next step is to prepare Python thrift library so that we can connect to Hypertable using Python.

This can be done by copying the gen-py and hyperthrift directories from the source code directory you've compiled Hypertable at. For example



$ mkdir ~/wiki_example
$ cd ~/wiki_example
$ wget http://meme.pl/convert_wiki.py
$ cp -R ~/hypertable/src/py/ThriftClient/gen-py ~/wiki_example/gen-py/
$ cp -R ~/hypertable/src/py/ThriftClient/hypertable ~/wiki_example/hypertable/
$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
$ bunzip2 enwiki-latest-pages-articles.xml.bz2 && rm enwiki-latest-pages-articles.xml.bz2



Next step is to create a wikipedia table and we do so by typing in following commands into the
hypertable command line utility



hypertable> create table wikipedia (title,article, ACCESS GROUP meta (title), ACCESS GROUP text (article));



The above command will create a table called wikipedia with just two column families:
  • title - containing the title of the article

  • article - containing the page content


and two access groups
  • meta

  • text


These two access groups tell the hypertable to store their member column families in a separate physical file. With the above type of configuration we're going to have titles stored in a different file than full text of the articles. This will enable us to search through titles much faster as the files containing article text will not be processed at all during this process.
Once done with this setup we go back to the command shell and run the conversion script



$ cd ~/wiki_example
$ env PYTHONPATH="gen-py" python convert_wiki.py



This step is a last step and it can take a few hours depending on your cluster configuration, but once everything goes well you'll have a wikipedia dump imported into your hypertable installation.

The import script uses articles' titles as record keys so they are used for querying.
So let's search for some articles which titles start with Dirichlet:



hypertable> select title from wikipedia where row =^ "Dirichlet";
[... output truncated for brevity ...]
Elapsed time: 0.21 s
Avg value size: 22.02 bytes
Avg key size: 23.02 bytes
Throughput: 13245.13 bytes/s
Total cells: 63
Throughput: 294.13 cells/s



We can see there're 63 articles starting with the word Dirichlet.
Doing the same type of query for the word Time yields following results



hypertable> select title from wikipedia where row =^ "Time";
[... output truncated for brevity ...]
Elapsed time: 0.04 s
Avg value size: 26.74 bytes
Avg key size: 27.74 bytes
Throughput: 4348717.61 bytes/s
Total cells: 3585
Throughput: 79815.66 cells/s



You can clearly see the throughput numbers are really high but to be honest with the readers I have to admit that this is a heavy caching in action, as doing the query for the first time is not going to yield such impressive numbers. For example running the query for articles starting with ZX yields these results:



hypertable> select title from wikipedia where row =^ "ZX";
[... output truncated for brevity ...]
Elapsed time: 0.19 s
Avg value size: 9.51 bytes
Avg key size: 10.51 bytes
Throughput: 3783.18 bytes/s
Total cells: 35
Throughput: 188.89 cells/s



Now, running it for the second time gives us much higher numbers



Elapsed time: 0.01 s
Avg value size: 9.51 bytes
Avg key size: 10.51 bytes
Throughput: 47612.58 bytes/s
Total cells: 35
Throughput: 2377.23 cells/s



Let's assume, for the sake of example , that you're interested in articles starting with a letter 'A'. You can get these easily by running this query



hypertable> select title from wikipedia where ('A' <= ROW < 'B');



and the numbers you get for cached and uncached version are really high in both cases. Initial query results in these performance numbers



[... output truncated for brevity ...]
Elapsed time: 14.17 s
Avg value size: 18.42 bytes
Avg key size: 19.42 bytes
Throughput: 1005019.91 bytes/s
Total cells: 376198
Throughput: 26557.52 cells/s



Subsequent queries result in this kind of performance



Elapsed time: 2.77 s
Avg value size: 18.42 bytes
Avg key size: 19.42 bytes
Throughput: 5131230.21 bytes/s
Total cells: 376197
Throughput: 135591.75 cells/s



As you can see clearly for yourself the numbers are really high. They could be even better! Trying to be objective I have to confess that the first test was much slower due to output going to the console over a wireless network which results in delays when performing output.

This post's goal was not to be a thorough performance review of hypertable so if you are interested in numbers then you should definitely take hypertable for spin yourself, but if you're interested in how you can utilize extra scalability then stay tuned for the next post in the Tools series. I'm going to explain how to make wikipedia imported into the hypertable searchable.

Update: I forgot to mention that these tests were performed on a x86_64 Ubuntu 8.10 Linux with hypertable compiled and operating in 64 bit mode on a single Intel Core 2 Quad Q6600 processor clocked at 2.4 GHz with 4 GB of physical memory and a single 500 GB SATA II disk.
Bookmark
img Addthis
img Ask
img Blinklist
img del.icio.us
img Digg
img Fark
img Facebook
img Google
img Lycos
img Ma.gnolia
Add this page to Mister Wong Mr Wong
img Netscape
img Netvousz
img Newsvine
img Reddit
img StumbleUpon
img Slashdot
img Tailrank
img Technorati
img Wink
img Yahoo
Hypertable project has a new sponsor!
Filed in archive News by Mateusz Berezecki on January 24, 2009
After a few months of evaluating Hypertable project, many of you probably noticed that Hypertable has a new sponsor - Baidu. Here are the numbers according to Wikipedia:

Baidu provides an index of over 740 million web pages, 80 million images, and 10 million multimedia files.[5] The domain baidu.com attracted at least 5.5 million visitors annually by 2008 according to a Compete.com scentury
, and what's more Baidu is going to throw Hypertable at their engineers and the whole infrastructure will be operating under the control of Hypertable. Neat! This is a no doubt one of the early symptoms of project quality acknowledged by the community.

Here's the official announcement: Baidu now an Official Sponsor of Hypertable

Update: Here you can find the official announcement on the Hypertable PR page
Bookmark
img Addthis
img Ask
img Blinklist
img del.icio.us
img Digg
img Fark
img Facebook
img Google
img Lycos
img Ma.gnolia
Add this page to Mister Wong Mr Wong
img Netscape
img Netvousz
img Newsvine
img Reddit
img StumbleUpon
img Slashdot
img Tailrank
img Technorati
img Wink
img Yahoo
Columns versus Rows
Filed in archive Concepts by Mateusz Berezecki on January 11, 2009
Columns versus Rows
In an attempt to bring the column oriented storage closer to the readers, I've managed to find a series of really great explanations about these two distinct storage schemes. While still theoretical these will definitely serve as a good and thorough (but not so hard to read) explanation on the subject matter. Once we got through these we'll see some more practical things in the next posts. Here's the list of my findings:


The last of these links contains interesting benchmark results which will definitely make you think and maybe reconsider your approach when it comes to managing your data.
Bookmark
img Addthis
img Ask
img Blinklist
img del.icio.us
img Digg
img Fark
img Facebook
img Google
img Lycos
img Ma.gnolia
Add this page to Mister Wong Mr Wong
img Netscape
img Netvousz
img Newsvine
img Reddit
img StumbleUpon
img Slashdot
img Tailrank
img Technorati
img Wink
img Yahoo
Subscribe
Share It
RSSrss
See all blog subscribe options
Google google
What is RSS?
Yahoo! yahoo
Addthis Subscribe using any feed reader!
Bloglines Bloglines
Newsletter

TwitterFollow us on Twitter!