
 

Hypertable installation guide
PermaLink: http://www.googlestack.com/50226711/hypertable_installation_guide.php

Filed in archive Tools by Mateusz Berezecki on February 20, 2009

Over the past months many people have complained about how difficult Hypertable is to install on their machines. The root cause was not Hypertable itself but the many sophisticated dependencies it requires for performance reasons. Time has passed and things have changed: there is now a more or less complete dependency installation guide, down to the gory details, for various Linux flavors as well as the OS X 10.4 and 10.5 families.

For more technical details please visit the HowToBuild page on the Hypertable wiki.

 

Relational vs Key/Value based
PermaLink: http://www.googlestack.com/50226711/relational_vs_keyvalue_based.php

Filed in archive Articles by Mateusz Berezecki on February 14, 2009

Tony Bain from Read Write Web has published an interesting piece that is easy to follow, yet may prove to be an eye opener. It is brief, concise and straight to the point in just 3 pages. Highly recommended reading:
http://www.readwriteweb.com/archives/is_the_relational_database_doomed.php

I just wonder why they did not mention Hypertable.

 

Taking Hypertable 0.9.2.1 for a ride.
PermaLink: http://www.googlestack.com/50226711/taking_hypertable_0921_for_a_ride.php

Filed in archive Tools by Mateusz Berezecki on February 3, 2009

This is the first post in a series about tools and how you can leverage them to increase your ability to scale. In this post we're going to look at how to store a large quantity of data in Hypertable. This data store was chosen for a reason, as you will see in a moment.

Consider the following scenario: you have a lot of data exposed at some internet location, and you want to have it at your local site and be able to sift through it. Perhaps you are creating a web crawler and must have the data on site rather than in a remote location, to avoid latency when processing it. Let's take Wikipedia as an example. You could get a dump of the whole Wikipedia and then do one of two things: search through the exported Wikipedia in one XML file, maybe even extracting articles into separate files (which would be inefficient and probably a real pain for your filesystem), or store it in some more clever way. What you need in such a scenario is a data store that can handle this amount of data with ease. It turns out that Hypertable is a good tool for this kind of work.

If you want to follow this example, I'm assuming that you already have Hypertable compiled, installed and configured on some cluster. If you are having trouble, please leave me a note in the comments.

We're going to use the latest data dump of the English version of Wikipedia, which can be downloaded from the following address: http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2. This file is 4.1 gigabytes in size and about 20 gigabytes after decompression, so you'll need at least 25 gigabytes of free space to decompress it. After decompression the compressed file is no longer needed.

Downloading Wikipedia can be done with a standard set of tools:



$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# bunzip2 deletes the .bz2 file itself after a successful decompression, so no separate rm is needed
$ bunzip2 enwiki-latest-pages-articles.xml.bz2



Once you are done, grab the conversion script (convert_wiki.py) and save it somewhere.
The next step is to prepare the Python Thrift library so that we can connect to Hypertable from Python.

This can be done by copying the gen-py and hypertable directories from the source code directory in which you compiled Hypertable. For example:



$ mkdir ~/wiki_example
$ cd ~/wiki_example
$ wget http://meme.pl/convert_wiki.py
$ cp -R ~/hypertable/src/py/ThriftClient/gen-py ~/wiki_example/gen-py/
$ cp -R ~/hypertable/src/py/ThriftClient/hypertable ~/wiki_example/hypertable/
$ wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
# bunzip2 deletes the .bz2 file itself after a successful decompression, so no separate rm is needed
$ bunzip2 enwiki-latest-pages-articles.xml.bz2



The next step is to create a wikipedia table. We do so by typing the following command into the hypertable command line utility:



hypertable> create table wikipedia (title,article, ACCESS GROUP meta (title), ACCESS GROUP text (article));



The above command will create a table called wikipedia with just two column families:
  • title - containing the title of the article

  • article - containing the page content


and two access groups
  • meta

  • text


These two access groups tell Hypertable to store their member column families in separate physical files. With this configuration, titles are stored in a different file than the full text of the articles. This lets us search through titles much faster, because the files containing article text are not read at all during such a search.
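To make the benefit of separate access groups concrete, here is a toy model in Python. It is only an illustration under my own assumptions and has nothing to do with Hypertable's real on-disk format: each access group is treated as one "file" that has to be read in full, and the function name bytes_scanned and the sample rows are made up for this sketch.

```python
# Toy model only: this is NOT Hypertable's storage format.
# Each access group is treated as one "file" that must be read in full
# whenever any of its columns is scanned.

def bytes_scanned(rows, access_groups, wanted_column):
    """Return how many bytes a scan of wanted_column would touch."""
    for group_columns in access_groups.values():
        if wanted_column in group_columns:
            # Every column stored in the same group gets read as well.
            return sum(
                len(row[column].encode("utf-8"))
                for row in rows
                for column in group_columns
            )
    raise KeyError(wanted_column)


# Hypothetical sample data standing in for two Wikipedia articles.
rows = [
    {"title": "Dirichlet series", "article": "x" * 5000},
    {"title": "Time", "article": "y" * 8000},
]

# Layout from the post: titles and article text in separate groups.
split = {"meta": ["title"], "text": ["article"]}
# Alternative layout: everything in one group.
single = {"default": ["title", "article"]}

print("split layout :", bytes_scanned(rows, split, "title"), "bytes")
print("single layout:", bytes_scanned(rows, single, "title"), "bytes")
```

In this model a title scan over the split layout touches only the title bytes, while the single-group layout drags the article text along with it.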
Once this setup is done, we go back to the command shell and run the conversion script:



$ cd ~/wiki_example
$ env PYTHONPATH="gen-py" python convert_wiki.py



This is the last step, and it can take a few hours depending on your cluster configuration. If everything goes well, you'll have a Wikipedia dump imported into your Hypertable installation.
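The conversion script is not reproduced in this post. As a rough idea of the parsing half of such an importer, here is a minimal sketch that streams (title, text) pairs out of a MediaWiki XML export without loading the whole file into memory. It is not the actual convert_wiki.py: the function names and the tiny sample document are made up, the namespace URL in the sample is an assumption, and the step that writes each pair into Hypertable over Thrift is deliberately left out.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(xml_file):
    """Yield (title, text) for every <page> element in a MediaWiki export."""
    title, text = None, None
    for _event, elem in ET.iterparse(xml_file, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if title is not None:
                yield title, text or ""
            title, text = None, None
            # Free the children of the finished page to keep memory use low.
            elem.clear()


# Hypothetical two-page sample in the general shape of a Wikipedia dump.
sample = io.BytesIO(b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Dirichlet series</title><revision><text>First article.</text></revision></page>
  <page><title>Time</title><revision><text>Second article.</text></revision></page>
</mediawiki>""")

for title, text in iter_articles(sample):
    print(title, len(text))
```

Since iter_articles only needs a file object, it should in principle also accept the object returned by Python's bz2.open() on the compressed dump, at the cost of decompressing while parsing.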

The import script uses article titles as row keys, so titles are what we query on.
Let's search for articles whose titles start with Dirichlet:



hypertable> select title from wikipedia where row =^ "Dirichlet";
[... output truncated for brevity ...]
Elapsed time: 0.21 s
Avg value size: 22.02 bytes
Avg key size: 23.02 bytes
Throughput: 13245.13 bytes/s
Total cells: 63
Throughput: 294.13 cells/s



We can see there are 63 articles starting with the word Dirichlet.
Doing the same type of query for the word Time yields the following results:



hypertable> select title from wikipedia where row =^ "Time";
[... output truncated for brevity ...]
Elapsed time: 0.04 s
Avg value size: 26.74 bytes
Avg key size: 27.74 bytes
Throughput: 4348717.61 bytes/s
Total cells: 3585
Throughput: 79815.66 cells/s



The throughput numbers are clearly high, but to be honest with the readers I have to admit that this is heavy caching in action: running a query for the first time does not yield such impressive numbers. For example, running the query for articles starting with ZX for the first time yields these results:



hypertable> select title from wikipedia where row =^ "ZX";
[... output truncated for brevity ...]
Elapsed time: 0.19 s
Avg value size: 9.51 bytes
Avg key size: 10.51 bytes
Throughput: 3783.18 bytes/s
Total cells: 35
Throughput: 188.89 cells/s



Running it a second time gives us much higher numbers:



Elapsed time: 0.01 s
Avg value size: 9.51 bytes
Avg key size: 10.51 bytes
Throughput: 47612.58 bytes/s
Total cells: 35
Throughput: 2377.23 cells/s
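The elapsed times above are rounded to two decimals, which makes 0.01 s a poor basis for comparing the two runs. Here is a small sketch that works backwards from the reported byte throughput instead. It assumes that the byte throughput is the total key plus value bytes divided by the elapsed time; that is my reading of the output, which the numbers seem consistent with, not something taken from the Hypertable documentation. The function name is made up.

```python
def implied_elapsed(cells, avg_key_bytes, avg_value_bytes, bytes_per_second):
    """Estimate elapsed seconds from the statistics printed by the shell.

    Assumption: bytes/s = cells * (avg key size + avg value size) / elapsed.
    """
    total_bytes = cells * (avg_key_bytes + avg_value_bytes)
    return total_bytes / bytes_per_second


# Numbers copied from the two ZX runs above.
cold = implied_elapsed(35, 10.51, 9.51, 3783.18)
warm = implied_elapsed(35, 10.51, 9.51, 47612.58)
speedup = 47612.58 / 3783.18

print(f"first run : about {cold:.4f} s")
print(f"second run: about {warm:.4f} s")
print(f"speedup from caching: about {speedup:.1f}x")
```

Both estimates round to the elapsed times the shell printed (0.19 s and 0.01 s), and the throughput ratio puts the caching speedup for this query at roughly 12 to 13 times.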



Let's assume, for the sake of example, that you're interested in articles starting with the letter 'A'. You can get these easily by running this query:



hypertable> select title from wikipedia where ('A' <= ROW < 'B');



The numbers for both the uncached and the cached run are high. The initial query results in these performance numbers:



[... output truncated for brevity ...]
Elapsed time: 14.17 s
Avg value size: 18.42 bytes
Avg key size: 19.42 bytes
Throughput: 1005019.91 bytes/s
Total cells: 376198
Throughput: 26557.52 cells/s



Subsequent queries result in this kind of performance:



Elapsed time: 2.77 s
Avg value size: 18.42 bytes
Avg key size: 19.42 bytes
Throughput: 5131230.21 bytes/s
Total cells: 376197
Throughput: 135591.75 cells/s



As you can see for yourself, the numbers are high, and they could be even better. To be objective, I have to confess that the first test was much slower because its output went to a console over a wireless network, which adds delays when printing results.

This post was not meant to be a thorough performance review of Hypertable, so if you are interested in numbers you should definitely take Hypertable for a spin yourself. If you're interested in how you can make use of the extra scalability, stay tuned for the next post in the Tools series, in which I'm going to explain how to make the Wikipedia data imported into Hypertable searchable.

Update: I forgot to mention that these tests were performed on x86_64 Ubuntu 8.10 Linux, with Hypertable compiled and running in 64-bit mode on a single Intel Core 2 Quad Q6600 processor clocked at 2.4 GHz, with 4 GB of physical memory and a single 500 GB SATA II disk.
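Both kinds of query used in this post, the prefix match (=^) and the explicit row interval, can be thought of as range scans over row keys kept in sorted order, as in the BigTable design. The sketch below shows that idea on a plain Python list. It is an illustration only: how Hypertable actually evaluates these queries is not shown here, the function names are made up, and the keys are a handful of hypothetical titles.

```python
import bisect


def range_scan(sorted_keys, start, end):
    """Return all keys k with start <= k < end from a sorted list."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with prefix, expressed as a range scan.

    Assumes a non-empty prefix whose last character is not the highest
    possible code point. The end of the range is the prefix with its last
    character incremented, e.g. "ZX" -> "ZY".
    """
    end = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    return range_scan(sorted_keys, prefix, end)


# Hypothetical row keys standing in for article titles.
keys = sorted([
    "Apple",
    "Dirichlet",
    "Dirichlet series",
    "Dirichlet's theorem",
    "Time",
    "ZX Spectrum",
    "ZX80",
])

print(prefix_scan(keys, "Dirichlet"))
print(range_scan(keys, "A", "B"))
```

Because only the two ends of the range have to be located, the cost of such a scan depends mostly on how many rows fall inside the range, not on the size of the whole table.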

 

Hypertable project has a new sponsor!
PermaLink: http://www.googlestack.com/50226711/hypertable_project_has_a_new_sponsor.php

Filed in archive News by Mateusz Berezecki on January 23, 2009

After a few months of evaluating the Hypertable project, many of you have probably noticed that Hypertable has a new sponsor: Baidu. Here are the numbers according to Wikipedia:

Baidu provides an index of over 740 million web pages, 80 million images, and 10 million multimedia files.[5] The domain baidu.com attracted at least 5.5 million visitors annually by 2008 according to a Compete.com survey.

What's more, Baidu is going to throw Hypertable at their engineers, and the whole infrastructure will be operating under the control of Hypertable. Neat! This is no doubt one of the early signs that the community acknowledges the quality of the project.

Here's the official announcement: Baidu now an Official Sponsor of Hypertable

Update: Here you can find the official announcement on the Hypertable PR page

 

Columns versus Rows
PermaLink: http://www.googlestack.com/50226711/columns_versus_rows.php

Filed in archive Concepts by Mateusz Berezecki on January 10, 2009

In an attempt to bring column-oriented storage closer to the readers, I've managed to find a series of really great explanations of these two distinct storage schemes. While still theoretical, they will definitely serve as a good and thorough (but not so hard to read) explanation of the subject. Once we get through these, we'll see some more practical things in the next posts. Here's the list of my findings:


The last of these links contains interesting benchmark results which will definitely make you think and maybe reconsider your approach when it comes to managing your data.

 

Common MapReduce misconceptions
PermaLink: http://www.googlestack.com/50226711/common_mapreduce_misconceptions.php

Filed in archive MapReduce by Mateusz Berezecki on January 3, 2009

MapReduce as a programming paradigm has no doubt attracted a lot of attention and has been mentioned throughout the blogosphere many times, in both good and bad ways. In fact, one of the largest database-related blogs criticized the very programming paradigm here, and then gently backed off from that critique here. When I read the first of these posts I thought: "MapReduce critiqued by database guys? What the heck?"

The main points of the argument in the posts mentioned above were:


  • A giant step backward in the programming paradigm for large-scale data intensive applications

  • A sub-optimal implementation, in that it uses brute force instead of indexing

  • Not novel at all - it represents a specific implementation of well known techniques developed nearly 25 years ago

  • Missing most of the features that are routinely included in current DBMS

  • Incompatible with all of the tools DBMS users have come to depend on


My response to these arguments (or rather a clarification of misconceptions), after thinking all of them over for a while, closely matches the one posted here:

  • It's one of the most fundamental ways of processing data used in functional programming and one of the most established ones in computer science.

  • There are a lot of computational methods that proceed by exhausting the solution space or by going through whole data sets, and they are there for a reason; e.g. branch and bound algorithms are a perfect candidate for running as MapReduce tasks

  • Not being novel and being well established is an advantage, not a drawback

  • MapReduce is not a database. It's a programming concept.

  • MapReduce is not a database. It's a programming concept.
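To show how small the concept is once it is separated from any particular system, here is a single-process word count written in the map, shuffle and reduce style. It is only a sketch of the programming model in plain Python: it is not Google's or Hadoop's implementation, nothing in it is distributed, and all function names are my own.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the values of one key into a single count."""
    return key, reduce(lambda a, b: a + b, values, 0)


def word_count(documents):
    """Run the three phases in order over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(word_count(["the quick brown fox", "the lazy dog", "The end"]))
```

The map and reduce functions know nothing about storage, indexes or query languages, which is the point of the last two responses above: a real framework adds distribution and fault tolerance around exactly this shape of program.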


Although many months have passed since I first saw these posts, I am still amazed at how one can misjudge a great concept simply by not spending enough time reading and thinking about it.

If you find the original MapReduce paper too hard to understand, then before making a final judgment maybe you could try reading this tutorial.


 

Downtime incidents
PermaLink: http://www.googlestack.com/50226711/downtime_incidents.php

Filed in archive Entertainment by Mateusz Berezecki on December 27, 2008

Pingdom has quite an entertaining article about the relative safety of online operations at data centers. My personal favorite?

In August, an update to SiteMeter's script (websites can have it included on their pages to get visitor statistics) started crashing popular blogs like Gawker, Lifehacker, Gizmodo, Valleywag and Problogger for Internet Explorer users. Presumably every single website using SiteMeter had this problem. This incident revealed how a third-party script can quite easily stop a whole site from working, which is a vulnerability that every site owner should keep in mind.


Read the full list here.

 

Hypertable version updates and some more
PermaLink: http://www.googlestack.com/50226711/hypertable_version_updates_and_some_more.php

Filed in archive Tools by Mateusz Berezecki on December 24, 2008

It has not been long since version 0.9.1.0 came out, and today we've got another patch release: 0.9.1.1 is out and ready to be picked up here.

And as a Christmas bonus here's the link to an article, "Scale-out with Hypertable" by Doug Judd posted in Linux Magazine here (free registration required).

 

KFS at Quantcast
PermaLink: http://www.googlestack.com/50226711/kfs_at_quantcast.php

Filed in archive Scalability stories by Mateusz Berezecki on December 23, 2008

With people claiming that Quantcast is using Hadoop and not giving any credit to KFS, it is necessary to give more details about the story. Not so surprisingly, the details are readily available, and here they are (they refer to the Quantcast KFS deployment):


Two deployments:
- 130 node cluster hosting log data
- ~2M files; 70TB of data; WORM system
- Metaserver uses ~2GB RAM
- ~1TB of data copied in during a week
- Used for daily jobs in read mode

Plan is to use both KFS and HDFS
- For job output, backup from KFS to HDFS using Hadoop's distcp


For some more insight and technical details about moving petabyte data centers, Hadoop and KFS, and their uses in practice, please see these:

 

Open-source alternatives to BigTable
PermaLink: http://www.googlestack.com/50226711/opensource_alternatives_to_bigtable.php

Filed in archive Tools by Mateusz Berezecki on December 22, 2008

Disclosure: I am a Hypertable MapReduce contributor. This post has nothing to do with increasing the publicity of the project, and the opinions presented here are my own.



One of the most important Google stack components is BigTable, the underlying data store. With its proven scalability and availability characteristics, it is no wonder that sooner or later an open source alternative would be created. I have already described the underlying concepts in an earlier post, and here's the link. Now on to the free, open-source alternatives.




At present, there are two open source projects underway that are based on the original BigTable publication. The projects in question are:

  • Hypertable

  • HBase


The first difference between these two projects is the language used to develop them. HBase stems from the Hadoop project and as such is written in Java, whereas Hypertable is a from-scratch C++ implementation aiming for the highest performance. By the nature of these projects, Hypertable has yet to develop a larger community, while HBase enjoys a full-fledged community of supporters and contributors coming from the Hadoop project (which is also backed by Yahoo engineers). This is, however, not the only difference. In fact, there are many more. During my experiments with Hypertable and HBase I had a genuine feeling that the former was superior to the latter, starting with the easier compilation process, the deployment process with provided scripts, a promised (and now ready and available) Thrift broker enabling different language bindings (think Ruby, Python, PHP, etc.), as well as overall perceived performance. In fact, that's how I decided to start contributing to Hypertable instead of HBase. For some more unbiased information on Hadoop and HBase you could refer to this Wikipedia article.

If you feel like experimenting with these new tools of the trade, preparing to scale your web application or web service, or planning a data storage scheme that can give you an extra level of flexibility, then Hypertable at version 0.9.1.0 is actually worth checking out, even though it is not yet ready for prime time at the time of writing this post. I'd say more: I encourage you to do so, because if done right it might some day replace old MySQL, at least in most scenarios! Being prepared might give you a serious advantage, as you never know when you'll have to handle a flash crowd from your garage.



A good starting point when it comes to using Hypertable is the project wiki located at http://code.google.com/p/hypertable/w/list, and of course the README file in the downloaded source code package. You can download the source code package here.


