2010-02-10 Wed
Oracle has agreed to acquire Convergin, a leading provider of real-time service brokering solutions.
Convergin’s industry-leading J2EE-based Service Broker platform enables communications service providers (CSPs) to manage services for a wide range of networks and application platforms, including pre-paid charging. The solution allows CSPs to focus on launching innovative services while modernizing to next-generation networks.
CSPs are increasingly looking to transition from inflexible and costly intelligent network platforms to deliver value-added services. The combination of Oracle and Convergin is expected to provide a single carrier-grade, standards-based IT platform allowing CSPs to effectively evolve their service delivery capabilities at a lower total cost of ownership.
Convergin products complement Oracle Communications’ integrated product suite, including Oracle Communications Billing and Revenue Management, Oracle Communications Converged Application Server and Oracle Communications service fulfillment applications.
The transaction is expected to close in the first half of this year. Financial details of the transaction were not disclosed.
It can take a fairly stable team of programmers as long as six months to get to a point where they’re estimating programming time fairly close to actuals, says Suvro Upadhyaya, a Senior Software Engineer at Oracle. Accurately estimating programming time is a process of defining limitations, he says. The programmers’ experience, domain knowledge, and speed vs. quality all come into play, and it is highly dependent upon the culture of the team/organization. Upadhyaya uses Scrum to estimate programming time. How do you do it?
via Slashdot Ask Slashdot Story | How Do You Accurately Estimate Programming Time?.
A "referral URL" is one of many signals we use to deliver contextually relevant ads on your website. The referral URL contains information about the link a user followed to arrive at your website, whether from a search engine or another site on the Internet. Any webmaster for any site can look at referral URLs to see how users arrive at their site.
Let's see how this works today when a user arrives at your golfing advice website from a search engine results page. Imagine that someone searches on Google for [golf shop atlanta] and clicks on a search result that takes them to your site. The referral URL that is passed to your site may look something like this: http://www.google.com/search?q=golf+shop+atlanta. I'm using Google as an example here, but the same type of information is transmitted if a user arrives at your website from another search engine.
To deliver the most relevant ad, we treat the query words [golf shop atlanta] in the referral URL as if they're part of the content of your webpage. We can then better tailor the ad we deliver on your site. In this example, we could use the additional information from the query words to show an ad for a golf shop in Atlanta rather than for one in Chicago (depending on the other words in the page).
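As a rough illustration of the step above, the query words can be pulled out of a referral URL with Ruby's standard uri library. This is only a sketch: the helper name referral_query_words and the assumption that the search terms live in a "q" parameter are taken from the example URL, not from any description of the actual ad system.

```ruby
require 'uri'

# Hypothetical helper: returns the search words carried in a referral URL.
# Assumes the search engine puts the query in the "q" parameter.
def referral_query_words(referral_url, param = 'q')
  query = URI.parse(referral_url).query
  return [] if query.nil?

  # decode_www_form turns "golf+shop+atlanta" into "golf shop atlanta"
  pair = URI.decode_www_form(query).assoc(param)
  pair ? pair.last.split : []
end

p referral_query_words('http://www.google.com/search?q=golf+shop+atlanta')
# => ["golf", "shop", "atlanta"]
```

A referral URL without a query string simply yields an empty list, so the page content alone would drive the match.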
We've recently started to expand the use of the query words in referral URLs to a few hours so we can continue to deliver more relevant ads. The technical way that we're doing this is by associating the relevant query words in the referral URL with the existing advertising cookie on the user's browser. After a short period of time (a few hours) the query words are no longer used for the purposes of matching ads. Of course, users can continue to opt out of our advertising cookie at any time here.
This allows us to deliver more relevant ads on a wider range of AdSense partner sites that a user may browse over the course of a few hours. Let's assume the user in our example leaves your golf website and browses through to a news website that is also an AdSense partner. Since [golf shop atlanta] is in a referral URL that was visited in the past few hours, we may use those query words, along with the content of the news webpage itself, to determine the most relevant ad to show the user on the news website.
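The time-limited association described above could be modeled roughly as follows. Everything here is an assumption made for illustration: the class name RecentQueryWords, the in-memory hash, the cookie id string, and the three-hour window (the post only says "a few hours").

```ruby
# Hypothetical in-memory model: query words are tied to a cookie id and
# ignored once they are older than the time-to-live.
class RecentQueryWords
  def initialize(ttl_seconds = 3 * 60 * 60) # "a few hours"; 3 is an assumption
    @ttl = ttl_seconds
    @entries = {}
  end

  def remember(cookie_id, words, now = Time.now)
    @entries[cookie_id] = { :words => words, :seen_at => now }
  end

  def words_for(cookie_id, now = Time.now)
    entry = @entries[cookie_id]
    return [] if entry.nil?

    if now - entry[:seen_at] > @ttl
      @entries.delete(cookie_id) # expired: stop using these words
      return []
    end
    entry[:words]
  end
end

recent = RecentQueryWords.new
t0 = Time.at(0)
recent.remember('cookie-1', %w[golf shop atlanta], t0)
p recent.words_for('cookie-1', t0 + 3600)     # one hour later: still used
p recent.words_for('cookie-1', t0 + 4 * 3600) # four hours later: expired
```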
Using signals from the referral URL is just one part of our teams' continuing efforts to deliver even better contextually matched ads on your website.
Cassandra is a hybrid non-relational database in the same class as Google's BigTable. It is more featureful than a key/value store like Dynomite, but supports fewer query types than a document store like MongoDB.
Cassandra was started by Facebook and later transferred to the open-source community. It is an ideal runtime database for web-scale domains like social networks.
This post is both a tutorial and a "getting started" overview. You will learn about Cassandra's features, data model, API, and operational requirements—everything you need to know to deploy a Cassandra-backed service.
Jan 8, 2010: post updated for Cassandra gem 0.7 and Cassandra version 0.5.
features
There are a number of reasons to choose Cassandra for your website. Compared to other databases, three big features stand out:
- Flexible schema: with Cassandra, like a document store, you don't have to decide what fields you need in your records ahead of time. You can add and remove arbitrary fields on the fly. This is an incredible productivity boost, especially in large deployments.
- True scalability: Cassandra scales horizontally in the purest sense. To add more capacity to a cluster, turn on another machine. You don't have to restart any processes, change your application queries, or manually relocate any data.
- Multi-datacenter awareness: you can adjust your node layout to ensure that if one datacenter burns in a fire, an alternative datacenter will have at least one full copy of every record.
Some other features that help put Cassandra above the competition:
- Range queries: unlike most key/value stores, you can query for ordered ranges of keys.
- List datastructures: super columns add a 5th dimension to the hybrid model, turning columns into lists. This is very handy for things like per-user indexes.
- Distributed writes: you can read and write any data to anywhere in the cluster at any time. There is never any single point of failure.
installation
You need a Unix system. If you are using Mac OS 10.5, all you need is Git. Otherwise, you need to install Java 1.6, Git 1.6, Ruby, and Rubygems in some reasonable way.
Start a terminal and run:
sudo gem install cassandra
If you are using Mac OS, you need to export the following environment variables:
export JAVA_HOME="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"
export PATH="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin:$PATH"
Now you can build and start a test server with cassandra_helper:
cassandra_helper cassandra
It runs!
live demo
The above script boots the server with a schema that we can interact with. Open another terminal window and start irb, the Ruby shell:
irb
In the irb prompt, require the library:
require 'rubygems'
require 'cassandra'
include Cassandra::Constants
Now instantiate a client object:
twitter = Cassandra.new('Twitter')
Let's insert a few things:
user = {'screen_name' => 'buttonscat'}
twitter.insert(:Users, '5', user)
tweet1 = {'text' => 'Nom nom nom nom nom.', 'user_id' => '5'}
twitter.insert(:Statuses, '1', tweet1)
tweet2 = {'text' => '@evan Zzzz....', 'user_id' => '5', 'reply_to_id' => '8'}
twitter.insert(:Statuses, '2', tweet2)
Notice that the two status records do not have all the same columns. Let's go ahead and connect them to our user record:
twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '1'}})
twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '2'}})
The UUID.new call creates a collation key based on the current time; our tweet ids are stored in the values.
Now we can query our user's tweets:
timeline = twitter.get(:UserRelationships, '5', 'user_timeline', :reversed => true)
timeline.map { |time, id| twitter.get(:Statuses, id, 'text') }
# => ["@evan Zzzz....", "Nom nom nom nom nom."]
Two tweet bodies, returned in recency order—not bad at all. In a similar fashion, each time a user tweets, we could loop through their followers and insert the status key into their follower's home_timeline relationship, for handling general status delivery.
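The follower fan-out idea can be sketched without a running server. The nested hash below is only a stand-in for the :UserRelationships column family, and the followers table, the deliver helper, and the integer timestamps are all hypothetical; the demo schema does not define them.

```ruby
# In-memory stand-in for :UserRelationships:
# user id => { relationship name => [[timestamp, status id], ...] }
relationships = Hash.new { |h, k| h[k] = Hash.new { |r, name| r[name] = [] } }

# Hypothetical follower lists: user '5' is followed by users '8' and '13'.
followers = { '5' => ['8', '13'] }

# Record the status on the author's own timeline, then copy its key
# into the home_timeline of every follower.
def deliver(relationships, followers, author_id, status_id, now)
  relationships[author_id]['user_timeline'] << [now, status_id]
  followers.fetch(author_id, []).each do |follower_id|
    relationships[follower_id]['home_timeline'] << [now, status_id]
  end
end

deliver(relationships, followers, '5', '1', 1)
deliver(relationships, followers, '5', '2', 2)

# Newest first, like :reversed => true in the demo
p relationships['8']['home_timeline'].sort.reverse.map { |_, id| id }
# => ["2", "1"]
```

Against the real client, each append would presumably become an insert shaped like the ones in the demo, for example twitter.insert(:UserRelationships, follower_id, {'home_timeline' => {UUID.new => status_id}}).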
the data model
Cassandra is best thought of as a 4 or 5 dimensional hash. The usual way to refer to a piece of data is as follows: a keyspace, a column family, a key, an optional super column, and a column. At the end of that chain lies a single, lonely value.
Let's break down what these layers mean.
- Keyspace (also confusingly called "table"): the outer-most level of organization. This is usually the name of the application. For example, 'Twitter' and 'Wordpress' are both good keyspaces. Keyspaces must be defined at startup in the storage-conf.xml file.
- Column family: a slice of data corresponding to a particular key. Each column family is stored in a separate file on disk, so it can be useful to put frequently accessed data in one column family, and rarely accessed data in another. Some good column family names might be :Posts, :Users and :UserAudits. Column families must be defined at startup.
- Key: the permanent name of the record. You can query over ranges of keys in a column family, like :start => '10050', :finish => '10070'; this is the only index Cassandra provides for free. Keys are defined on the fly.
After the column family level, the organization can diverge—this is a feature unique to Cassandra. You can choose either:
- A column: this is a tuple with a name and a value. Good columns might be 'screen_name' => 'lisa4718' or 'Google' => 'http://google.com'.
  It is common to not specify a particular column name when requesting a key; the response will then be an ordered hash of all columns. For example, querying for (:Users, '174927') might return:
  {'name' => 'Lisa Jones', 'gender' => 'f', 'screen_name' => 'lisa4718'}
  In this case, name, gender, and screen_name are all column names. Columns are defined on the fly, and different records can have different sets of column names, even in the same keyspace and column family. This lets you use the column name itself as either structure or data. Columns can be stored in recency order, or alphabetical by name, and all columns keep a timestamp.
- A super column: this is a named list. It contains standard columns, stored in recency order.
  Say Lisa Jones has bookmarks in several categories. Querying (:UserBookmarks, '174927') might return:
  {'work' => {'Google' => 'http://google.com', 'IBM' => 'http://ibm.com'}, 'todo' => {...}, 'cooking' => {...}}
  Here, work, todo, and cooking are all super column names. They are defined on the fly, and there can be any number of them per row. :UserBookmarks is the name of the super column family. Super columns are stored in alphabetical order, with their sub columns physically adjacent on the disk.
Super columns and standard columns cannot be mixed at the same (4th) level of dimensionality. You must define at startup which column families contain standard columns, and which contain super columns with standard columns inside them.
Super columns are a great way to store one-to-many indexes to other records: make the sub column names the foreign ids, and leave the values blank. We saw an example of this strategy in the demo, above.
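One way to picture the "4 or 5 dimensional hash" is as a plain nested Ruby hash. This is only a mental model, not how Cassandra stores anything; the lookup helper is hypothetical, and the sample rows reuse the Lisa Jones examples from above.

```ruby
store = {
  'Twitter' => {                          # keyspace
    :Users => {                           # standard column family
      '174927' => {                       # key
        'name' => 'Lisa Jones',           # column name => value
        'gender' => 'f',
        'screen_name' => 'lisa4718'
      }
    },
    :UserBookmarks => {                   # super column family
      '174927' => {                       # key
        'work' => {                       # super column
          'Google' => 'http://google.com',
          'IBM' => 'http://ibm.com'
        }
      }
    }
  }
}

# Walk the nested hash one dimension at a time.
def lookup(store, *path)
  path.inject(store) { |level, name| level.fetch(name) }
end

# Four dimensions: keyspace, column family, key, column
p lookup(store, 'Twitter', :Users, '174927', 'screen_name')
# => "lisa4718"

# Five dimensions: keyspace, column family, key, super column, column
p lookup(store, 'Twitter', :UserBookmarks, '174927', 'work', 'IBM')
# => "http://ibm.com"
```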
If this is confusing, don't worry. We'll now look at two example schemas in depth.
twitter schema
Here is the schema definition we used for the demo, above. It is based on Eric Florenzano's Twissandra:
What could be in StatusRelationships? Maybe a list of users who favorited the tweet? Having a super column family for both record types lets us index each direction of whatever many-to-many relationships we come up with.
Here's how the data is organized:
Cassandra lets you distribute the keys across the cluster either randomly, or in order, via the Partitioner option in the storage-conf.xml file.
For the Twitter application, if we were using the order-preserving partitioner, all recent statuses would be stored on the same node. This would cause hotspots. Instead, we should use the random partitioner.
Alternatively, we could preface the status keys with the user key, which has less temporal locality. If we used user_id:status_id as the status key, we could do range queries on the user fragment to get tweets-by-user, avoiding the need for a user_timeline super column.
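The compound-key idea can be tried out on an ordinary sorted list of keys. This sketch assumes string keys of the form "user_id:status_id" compared byte by byte, which is roughly what an order-preserving partitioner would give you; the tweets_by_user helper and the sample rows are made up.

```ruby
# Hypothetical status rows keyed as "user_id:status_id".
statuses = {
  '5:1'  => 'Nom nom nom nom nom.',
  '5:2'  => '@evan Zzzz....',
  '8:1'  => 'Hello.',
  '50:1' => 'Not user 5.'
}

# Range query over the sorted keys: everything from "5:" up to, but not
# including, "5;" belongs to user 5.
def tweets_by_user(statuses, user_id)
  start  = "#{user_id}:"
  finish = "#{user_id};" # ';' is the character right after ':' in ASCII
  keys = statuses.keys.sort.select { |key| key >= start && key < finish }
  keys.map { |key| statuses[key] }
end

p tweets_by_user(statuses, '5')
# => ["Nom nom nom nom nom.", "@evan Zzzz...."]
```

Note that plain string ordering would put '5:10' before '5:2', so real status ids would need zero-padding or a time-sortable encoding for the range to come back in posting order.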
multi-blog schema
Here's another schema, suggested to me by Jonathan Ellis, the primary Cassandra maintainer. It's for a multi-tenancy blog platform:
Imagine we have a blog named 'The Cutest Kittens'. We will insert a row when the first post is made as follows:
require 'rubygems'
require 'cassandra'
include Cassandra::Constants
multiblog = Cassandra.new('Multiblog')
multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new =>
    '{"title":"Say Hello to Buttons Cat","body":"Buttons is a cute cat."}' })
UUID.new generates a unique, sortable column name, and the JSON hash contains the post details. Let's insert another:
multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new =>
    '{"title":"Introducing Commie Cat","body":"Commie is also a cute cat"}' })
Now we can find the latest post with the following query:
post = multiblog.get(:Blogs, 'The Cutest Kittens', :reversed => true).to_a.first
On our website, we can build links based on the readable representation of the UUID:
guid = post.first.to_guid # => "b06e80b0-8c61-11de-8287-c1fa647fd821"
If the user clicks this string in a permalink, our app can find the post directly via:
multiblog.get(:Blogs, 'The Cutest Kittens', :start => UUID.new(guid), :count => 1)
For comments, we'll use the post UUID as the outermost key:
multiblog.insert(:Comments, guid,
  {UUID.new => 'I like this cat. - Evan'})
multiblog.insert(:Comments, guid,
  {UUID.new => 'I am cuter. - Buttons'})
Now we can get all comments (oldest first) for a post by calling:
multiblog.get(:Comments, guid)
We could paginate them by passing :start with a UUID. See this presentation to learn more about token-based pagination.
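Token-based pagination can be sketched on an in-memory row. Here integers stand in for the time-based UUID column names, the extra comments are invented, and the page helper is hypothetical; with the real client the same idea would presumably use the :start and :count options shown earlier.

```ruby
# Stand-in for one row of :Comments: column name (a sortable token) => value.
comments = {
  101 => 'I like this cat. - Evan',
  102 => 'I am cuter. - Buttons',
  103 => 'Made-up comment three',
  104 => 'Made-up comment four',
  105 => 'Made-up comment five'
}

# Returns one page plus the token that starts the next page (nil at the end).
def page(columns, start, count)
  names = columns.keys.sort
  names = names.select { |name| name >= start } if start
  window = names.first(count + 1) # fetch one extra to learn the next token
  next_token = window.length > count ? window.last : nil
  [window.first(count).map { |name| columns[name] }, next_token]
end

first_page, token = page(comments, nil, 2)
p first_page # => ["I like this cat. - Evan", "I am cuter. - Buttons"]
p token      # => 103

second_page, token = page(comments, token, 2)
p second_page # => ["Made-up comment three", "Made-up comment four"]
p token       # => 105
```

Because the next page starts at a token rather than at a numeric offset, new comments arriving while a reader pages through do not shift the pages they have already seen.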
We have sidestepped two problems with this data model: we don't have to maintain separate indexes for any lookups, and the posts and comments are stored in separate files, where they don't cause as much write contention. Note that we didn't need to use any super columns, either.
storage layout and api comparison
The storage strategy for Cassandra's standard model is the same as BigTable's. Here's a comparison chart:
| | multi-file | multi-file | per-file | intra-file | intra-file | intra-file | intra-file |
|---|---|---|---|---|---|---|---|
| Relational | server | database | table* | primary key | column value | ||
| BigTable | cluster | table | column family | key | column name | column value | |
| Cassandra, standard model | cluster | keyspace | column family | key | column name | column value | |
| Cassandra, super column model | cluster | keyspace | column family | key | super column name | column name | column value |
* With fixed column names.
Column families are stored in column-major order, which is why people call BigTable a column-oriented database. This is not the same as a column-oriented OLAP database like Sybase IQ—it depends on what you use the column names for.
In row-orientation, the column names are the structure, and you think of the column families as containing keys. This is the convention in relational databases.
In column-orientation, the column names are the data, and the column families are the structure. You think of the key as containing the column family, which is the convention in BigTable. (In Cassandra, super columns are also stored in column-major order—all the sub columns are together.)
In Cassandra's Ruby API, parameters are expressed in storage order, for clarity:
| Relational | SELECT `column` FROM `database`.`table` WHERE `id` = key; |
|---|---|
| BigTable | table.get(key, "column_family:column") |
| Cassandra: standard model | keyspace.get("column_family", key, "column") |
| Cassandra: super column model | keyspace.get("column_family", key, "super_column", "column") |
Note that Cassandra's internal Thrift interface mimics BigTable in some ways, but this is being changed.
going to production
Cassandra is an alpha product and could, theoretically, lose your data. In particular, if you change the schema specified in the storage-conf.xml file, you must follow these instructions carefully, or corruption will occur (this is going to be fixed). Also, the on-disk storage format is expected to change in version 0.4.0. After that things will be a bit more stable.
The biggest deployment is at Facebook, where hundreds of terabytes of token indexes are kept in about a hundred Cassandra nodes. However, their use case allows the data to be rebuilt if something goes wrong. Currently there are no known deployments of non-transient data. Proceed carefully, keep a backup in an unrelated storage engine...and submit patches if things go wrong.
That aside, here is a guide for deploying a production cluster:
Hardware: get a handful of commodity Linux servers. 16GB memory is good; Cassandra likes a big filesystem buffer. You don't need RAID. If you put the commitlog file and the data files on separate physical disks, things will go faster. Don't use EC2 or friends except for testing; the virtualized I/O is too slow.
Configuration: in the storage-conf.xml schema file, set the replication factor to 3. List the IP address of one of the nodes as the seed. Set the listen address to the empty string, so the hosts will resolve their own IPs. Now, adjust the contents of cassandra.in.sh for your various paths and JVM options; for a 16GB node, set the JVM heap to 4GB.
Deployment: build a package of Cassandra itself and your configuration files, and deliver it to all your servers (I use Capistrano for this). Start the servers by setting CASSANDRA_INCLUDE in the environment to point to your cassandra.in.sh file, and run bin/cassandra. At this point, you should see join notices in the Cassandra logs:
Cassandra starting up...
Node 10.224.17.13:7001 has now joined.
Node 10.224.17.14:7001 has now joined.
Congratulations! You have a cluster. Don't forget to turn off debug logging in the log4j.properties file.
Visibility: you can get a little more information about your cluster via the included tool bin/nodeprobe:
$ bin/nodeprobe --host 10.224.17.13 ring
Token(124007023942663924846758258675932114665) 3 10.224.17.13 |<--|
Token(106858063638814585506848525974047690568) 3 10.224.17.19 |   ^
Token(141130545721235451315477340120224986045) 3 10.224.17.14 |-->|
Cassandra also exposes various statistics over JMX.
Note that your client machines (not servers!) must have accurate clocks for Cassandra to resolve write conflicts properly. Use NTP.
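To see why client clocks matter, here is a toy model of timestamp-based conflict resolution. It assumes that the write carrying the later client-supplied timestamp wins; the Write struct and the resolve helper are illustrative names, not part of any Cassandra API.

```ruby
# A column value together with the timestamp its writer supplied.
Write = Struct.new(:value, :timestamp)

# The write with the later client-supplied timestamp wins.
def resolve(a, b)
  b.timestamp > a.timestamp ? b : a
end

# Real order: client A writes at second 1000, client B writes one second later.
# Client B's clock runs 5 seconds slow, so its stamp looks older.
from_a = Write.new('old value', 1000)
from_b = Write.new('new value', 1001 - 5)

p resolve(from_a, from_b).value
# => "old value"
```

The newer write is silently discarded because its timestamp is behind, which is the failure NTP is meant to prevent.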
conclusion
There is a misperception that if someone advocates a non-relational database, they either don't understand SQL optimization, or they are generally a hater. This is not the case.
It is reasonable to seek a new tool for a new problem, and database problems have changed with the rise of web-scale distributed systems. This does not mean that SQL as a general-purpose runtime and reporting tool is going away. However, at web-scale, it is more flexible to separate the concerns. Runtime object lookups can be handled by a low-latency, strict, self-managed system like Cassandra. Asynchronous analytics and reporting can be handled by a high-latency, flexible, un-managed system like Hadoop. And in neither case does SQL lend itself to sharding.
I think that Cassandra is the most promising current implementation of a runtime distributed database, but much work remains to be done. We're beginning to use Cassandra at Twitter, and here's what I would like to happen real-soon-now:
- Interface cleanup: the Thrift API for Cassandra is incomplete and inconsistent, which makes writing clients very irritating.
- Online migrations: restarting the cluster 3 times to add a column family is silly.
- ActiveModel or DataMapper adapter: for interaction with business objects in Ruby.
- Scala client: for interoperability with JVM middleware.
Go ahead and jump on any of those projects—it's a chance to get in on the ground floor.
Cassandra has excellent performance, and I hope to publish some robust benchmarks in a few weeks. For now there are a few numbers in Avinash Lakshman's slides.
further resources
- Cassandra wiki
- Presentation by Avinash Lakshman about Cassandra: slides, video
- The cassandra-user and cassandra-dev mailing lists
- The #cassandra IRC channel on irc.freenode.net
- Cassandra's bug tracker
- Twitter's Ruby client: docs, source
I've released Scribe 0.1, a Ruby client for the Scribe remote log server.
sudo gem install scribe
Usage is simple:
client = Scribe.new
client.log("I'm lonely in a crowded room.", "Rails")
Documentation is here.
about scribe
The primary benefit of Scribe over something like syslog-ng is increased scalability, because of Scribe's fundamentally distributed architecture. Scribe also does away with the legacy syslog alert levels, and lets you define more application-appropriate categories on the fly instead.
Dmytro Shteflyuk has a good article about installing the Scribe server itself on OS X. It would be nice if someone would put it in MacPorts, but it may be blocked on the release of Thrift.
We recently migrated Twitter from a custom Ruby 1.8.6 build to a Ruby Enterprise Edition release candidate, courtesy of Phusion. Our primary motivation was the integration of Brent's MBARI patches, which increase memory stability.
Some features of REE have no effect on our codebase, but we definitely benefit from the MBARI patchset, the Railsbench tunable GC, and the various leak fixes in 1.8.7p174. These are difficult to integrate and Phusion has done a fine job.
testing notes
I ran into an interesting issue. Ruby is faster if compiled with -Os (optimize for size) than with -O2 or -O3 (optimize for speed). Hongli pointed out that Ruby has poor instruction locality and benefits most from squeezing tightly into the instruction cache. This is an unusual phenomenon, although probably more common in interpreters and virtual machines than in "standard" C programs.
I also tested a build that included Joe Damato's heaped thread frames, but it would hang Mongrel in rb_thread_schedule() after the first GC run, which is not exactly what we want. Hopefully this can be integrated later.
benchmarks
I ran a suite of benchmarks via Autobench/httperf and plotted them with Plot. The hardware was a 4-core Xeon machine with RHEL5, running 8 Mongrels balanced behind Apache 2.2. I made a typical API request that is answered primarily from composed caches.

As usual, we see that tuning the GC parameters has the greatest impact on throughput, but there is a definite gain from switching to the REE bundle. It's also interesting how much the standard deviation is improved by the GC settings. (Some data points are skipped due to errors at high concurrency.)
upgrading
Moving from 1.8.6 to REE 1.8.7 was trivial, but moving to 1.9 will be more of an ordeal. It will be interesting to see what patches are still necessary on 1.9. Many of them are getting upstreamed, but some things (such as tcmalloc) will probably remain only available from 3rd parties.
All in all, good times in MRI land.
How many objects does a Rails request allocate? Here are Twitter's numbers:
- API: 22,700 objects per request
- Website: 67,500 objects per request
- Daemons: 27,900 objects per action
I want them to be lower. Overall, we burn 20% of our front-end CPU on garbage collection, which seems high. Each process handles ~29,000 requests before getting killed by the memory limit, and the GC is triggered about every 30 requests.
In memory-managed languages, you pay a performance penalty at object allocation time and also at collection time. Since Ruby lacks a generational GC (although there are patches available), the collection penalty is linear with the number of objects on the heap.
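A bit of napkin math puts those numbers in perspective. The sketch below uses the API figure as the per-request allocation count, which is an assumption, since the "every 30 requests" average presumably covers a mix of request types.

```ruby
api_objects_per_request = 22_700
requests_per_gc_run     = 30
requests_per_process    = 29_000

# Objects allocated between two collections
objects_between_gc_runs = api_objects_per_request * requests_per_gc_run

# Collections over the life of one process (integer division)
gc_runs_per_process = requests_per_process / requests_per_gc_run

puts objects_between_gc_runs # => 681000
puts gc_runs_per_process     # => 966
```

Under those assumptions, each process allocates several hundred thousand objects between collections and runs close to a thousand collections before it is killed.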
a note about structs and immediates
In Ruby 1.8, Struct instances use fewer bytes and allocate fewer objects than Hash and friends. This can be an optimization opportunity in circumstances where the Struct class is reusable.
A little bit of code shows the difference (you need REE or Sylvain Joyeux's patch to track allocations):
GC.enable_stats
def sizeof(obj)
  GC.clear_stats
  obj.clone
  puts "#{GC.num_allocations} allocations"
  GC.clear_stats
  obj.clone
  puts "#{GC.allocated_size} bytes"
end
Let's try it:
>> Struct.new("Test", :a, :b, :c)
>> struct = Struct::Test.new(1,2,3)
=> #<struct Struct::Test a=1, b=2, c=3>
>> sizeof(struct)
1 allocations
24 bytes
>> hash = {:a => 1, :b => 2, :c => 3}
>> sizeof(hash)
5 allocations
208 bytes
Watch out, though. The Struct class itself is expensive:
>> sizeof(Struct::Test)
29 allocations
1216 bytes
In my understanding, each key in a Hash is a VALUE pointer to another object, while each slot in a Struct is merely a named position.
Immediate types (Fixnum, nil, true, false, and Symbol) don't allocate, except for Symbol. Symbol is interned and keeps its string representations on a special heap that is not garbage-collected.
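The GC.enable_stats calls above only exist in REE or a patched 1.8 build. On a stock MRI of version 2.2 or later, a similar allocation count can be sketched with GC.stat, as below. This is a substitute technique, not the measurement used in this post; the allocations helper and the TestStruct name are hypothetical, and the counts will differ from the 1.8 numbers quoted here.

```ruby
# Counts objects allocated while the block runs.
# GC.stat(:total_allocated_objects) needs MRI 2.2 or later.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

TestStruct = Struct.new(:a, :b, :c)
struct = TestStruct.new(1, 2, 3)
hash = { :a => 1, :b => 2, :c => 3 }

puts "struct clone: #{allocations { struct.clone }} allocations"
puts "hash clone:   #{allocations { hash.clone }} allocations"
```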
your turn
If you have allocation counts from a production web application, I would be delighted to know them. I am especially interested in Python, PHP, and Java.
Python should be about the same as Ruby. PHP, though, discards the entire heap per-request in some configurations, so collection can be dramatically cheaper. And I would expect Java to allocate fewer objects and have a more efficient collection cycle.
One of the hardest gems to install is no more. It's now easy to install!
Memcached 0.15 features:
- Update to libmemcached 0.31.1
- Bundle libmemcached itself with the gem (antifuchs)
- UDP connection support
- Unix domain socket support (hellvinz)
- AUTO_EJECT_HOSTS bugfixes (mattknox)
Install with gem install memcached. Since libmemcached is bundled in, there are no longer any dependencies.
on coordination
Andreas Fuchs suggested several months ago that I include libmemcached itself in the gem, but at the time I resisted. I was wrong.
My opposition was based on the idea that libmemcached itself would be an integration point, so running multiple versions on a system would be bad.
In real life, the hash algorithm became the integration point, not the library itself. And since the library's ABI kept changing, the gem always required a very specific custom build. This annoyed the public and caused extra work for my operations team, who had to make sure to upgrade both the library and the gem at the same time.
Updates can come thick and fast now because I don't have to worry about publishing custom builds or waiting for the libmemcached developers to merge my patches.
In retrospect it seems obvious—it's always a win to remove coordination from a system.
linker woes
Unfortunately, it was easier to make that decision than it was to implement it. Linux and OS X link libraries differently, and I had a lot of trouble making sure that no system-installed version of libmemcached would get linked, instead of the custom one built during gem install.
When you link a shared object, OS X seems to maintain a reference to the original .dylib. Linux does not, and depends on ldconfig and LD_LIBRARY_PATH to find the object at runtime. Since you can't modify the shell environment from within a running process, there's no way to override LD_LIBRARY_PATH, so I needed to statically link libmemcached into the gem's own .so or .bundle.
The only way I could do this on both systems was to configure libmemcached with CFLAGS=-fPIC --disable-shared, rename the libmemcached.* static object files to libmemcached_gem.*, and pass -lmemcached_gem to the linker rather than -lmemcached. Otherwise the linker would prefer the system-installed dynamic objects, even with the correct paths and -static option set.
Note that you can check what objects a binary has linked to via otool -L on OS X, and ldd on Linux.
Feel free to look at the extconf.rb source and let me know if there's a better way to do this.
I've been reading a bunch of papers about distributed systems recently, in order to help systematize for myself the thing that we built over the last year. Many of them were originally passed to me by Toby DiPasquale. Here is an annotated list so everyone can benefit.
It helps if you have some algorithms literacy, or have built a system at scale, but don't let that stop you.
prelude
The Death of Architecture, Julian Browne, 2007.
First, a reminder of what it means to build a system that solves a business problem. Browne built the space-based billing system at Virgin Mobile, one of the most well-known SBAs outside the financial and research industries.
That lovely diagram showing clean service-oriented interfaces, between unified systems of record, holding clean data, performing well-defined business processes is never going to be....You have to roll up your sleeves, talk to the business analysts, developers, operations and make a contribution that makes those boxes and arrows real.
System failures in the web world are most often due to inflated technical requirements and integration risks, not an incorrect decision to skip two-phase commit.
constraints
The Case for Shared Nothing, Michael Stonebraker, 1985.
The source of the shared-nothing paradigm, and importantly, its alternatives. Shared-nothing is a nice hammer, but not every problem is a nail.
Harvest, Yield, and Scalable Tolerant Systems, Armando Fox and Eric Brewer, 1999.
The CAP theorem. Sometimes you just can't get what you want. (This is related to the Dynamo work, below.)
Distributed Computing Economics, Jim Gray, 2003.
How to predict the cost of the thing you want to build. Via some napkin math, Gray shows why making the cloud cost-efficient for current problems continues to be a struggle.
coordination
Guardians and Actions: Linguistic Support for Robust, Distributed Programs, Barbara Liskov and Robert Scheifler, 1983.
Two-phase commit. Making sure what you plan to do will get done.
Time, Clocks and the Ordering of Events in a Distributed System, Leslie Lamport, 1978.
Distributed systems are inherently relative; there is no privileged position that can describe all events exactly. Lamport clocks (and the closely related vector clocks) let participants agree on the order of events in the world, if you need to care about that.
Paxos Made Simple, Leslie Lamport, 2001.
The consensus problem: how can potentially faulty processes agree about an element of global state? The Paxos algorithm guarantees correctness during a failure of a minority of nodes. This paper is difficult, but important for the subtleties it reveals.
Also see Paxos Made Live for a discussion of the implementation in Google's Chubby v2 coordination server. Real life introduces many unfortunate kinks.
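For a feel of the Lamport clocks mentioned in this section, here is a minimal sketch of the two rules: tick on every local event or send, and on receive jump past the sender's stamp. The class and method names are mine, not the paper's.

```ruby
class LamportClock
  attr_reader :time

  def initialize
    @time = 0
  end

  # A local event or a send: advance the counter and use it as the stamp.
  def tick
    @time += 1
  end

  # On receive: move past the sender's stamp so the receive is ordered
  # after the send.
  def receive(sender_time)
    @time = [@time, sender_time].max + 1
  end
end

a = LamportClock.new
b = LamportClock.new
3.times { a.tick }   # a has seen three local events
stamp = a.tick       # a sends a message stamped 4
b.tick               # b has had one local event of its own
b.receive(stamp)     # b jumps from 1 to 5, after the send
p [stamp, b.time]
# => [4, 5]
```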
encapsulation
Generative Communication in Linda, David Gelernter, 1985.
Tuple spaces, a.k.a. the blackboard pattern, a.k.a. space-based architecture. Coordinating a system through the data, not through the behaviors. I will be writing a lot more about this in the future.
partitioning
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, Ion Stoica, et al., 2001.
The original distributed hash table paper. Introduced consistent hashing for robust pool rebalancing.
Frangipani: A Scalable Distributed File System, Chandramohan A. Thekkath, Timothy Mann and Edward K. Lee, 1998.
Classic paper regarding modern-style distributed filesystems.
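The consistent hashing idea mentioned in the Chord entry can be sketched in a few lines. This is a generic hash ring with virtual points, not Chord's lookup protocol; the HashRing class, the MD5-based position function, and the 100 points per node are all choices made for illustration.

```ruby
require 'digest/md5'

class HashRing
  def initialize(nodes, points_per_node = 100)
    @points_per_node = points_per_node
    @ring = {} # ring position => node name
    nodes.each { |node| add(node) }
  end

  def add(node)
    @points_per_node.times { |i| @ring[position("#{node}-#{i}")] = node }
    @sorted = @ring.keys.sort
  end

  def remove(node)
    @ring.delete_if { |_, owner| owner == node }
    @sorted = @ring.keys.sort
  end

  # The first point at or after the key's position owns it, wrapping around.
  def node_for(key)
    pos = position(key)
    point = @sorted.find { |p| p >= pos } || @sorted.first
    @ring[point]
  end

  private

  # Map a string to a 32-bit position on the ring.
  def position(string)
    Digest::MD5.hexdigest(string)[0, 8].to_i(16)
  end
end

keys = (1..1000).map { |i| "key-#{i}" }
ring = HashRing.new(%w[node-a node-b node-c])
before = keys.map { |k| ring.node_for(k) }
ring.remove('node-b')
after = keys.map { |k| ring.node_for(k) }

# Only keys that lived on the removed node change owner.
moved = keys.each_index.select { |i| before[i] != after[i] }
puts moved.all? { |i| before[i] == 'node-b' }
# => true
```

With a naive "hash modulo number of nodes" scheme, removing one node would instead reshuffle most keys, which is the rebalancing problem the ring avoids.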
systems integration
Dynamo: Amazon’s Highly Available Key-Value Store, Giuseppe DeCandia, et al., 2007.
A key-value store that spawned numerous clones, it integrated many fundamental ideas from the above works into an actual running system. Cassandra is the most featureful open-source version.
(Also see Tokyo Cabinet. Not a Dynamo clone, per se, but it's the next most practical alternative aside from MySQL. Tokyo is a networked BerkeleyDB replacement, so the domain code must handle the distributed aspects.)
SEDA: An Architecture for Well-Conditioned, Scalable Internet Services, Matt Welsh, David Culler, and Eric Brewer, 2001.
Great paper on managing the interactions between individual components, and ways to build a well-conditioned service under unpredictable load.
conclusion
That's all really. Focus on the ideas, not the implementations. Try to see the patterns in existing systems, rather than imagining how they "should" work if everything was perfect. Then you'll be able to scale to the moon.
The moon is above the cloud and therefore obviously more scalable.
ElasticSearch is an open source, distributed, RESTful search engine built on top of Lucene. Its features include:
- Distributed and Highly Available Search Engine.
- Each index is fully sharded with a configurable number of shards.
- Each shard can have zero or more replicas.
- Read / Search operations can be performed on any replica shard.
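One of the powerful new features in Win7 is the library. If you have not used it yet, be sure to read 《windows 7 library(库)以及状态栏的妙用》 ("Windows 7 libraries and handy uses of the status bar"). Libraries are useful, but you cannot give them custom icons or add network folders. Win7 Library Tool was developed to fill exactly these two gaps.
Usage is simple: first click the first button in the lower-left corner, "add all your existing libraries", then use the buttons next to it to customize your libraries. After selecting a library, click "edit properties of the selected library" to customize it easily (add or remove folders in the library).
Download (206.9 KB): skydrive | official site | via 小众软件 | uushare | dropbox | Go Aruna
© sfufoet for 小众软件, 2005-2010 | original link | 2 comments | contact us | submit | update list
Related articles: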
- The new YQL set of tables for Twitter enables any developer to use simple SQL-like queries to retrieve and post Twitter data. For simple user queries, getting a user's Twitter profile data is as simple as something like "SELECT * FROM twitter.status WHERE id='8036408424';".
- A VARCHAR2(0) column?
- We should stop using NVL altogether and get into the habit of using coalesce instead, regardless of how annoying it is to type.
- Interview with Sue Harper, Senior Principal Product Manager for Database Development Tools at Oracle. She has been at Oracle since 1992 and is currently based in London. Sue is a regular contributor to magazines, maintains a technical blog, and speaks at many conferences around the world.
- There are a growing number of people asking the question: how do you move a VMware virtual machine to VirtualBox? So it is about time the Fat Bloke rolled up his sleeves and showed us how.
- When Oracle acquired Sun, the database giant also acquired the Java technology that was Sun's lifeblood. Oracle Chairman and CEO Larry Ellison called Java the most important technology Oracle has ever acquired. With ownership and leadership comes responsibility. Java's future is now in Oracle's hands. This eWEEK slide show presents 15 ways Oracle can improve Java and boost its position in the Java community.
Related articles:
- Daily Roundup of News, Tips and Tricks for 2010-02-06
- Daily Roundup of News, Tips and Tricks for 2010-02-08
- Daily Roundup of News, Tips and Tricks for 2010-02-01
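When I read this article I was moved to tears more than once. I highly recommend it, and I am taking the opportunity to write up my own notes.
sys_getloadavg()
This function returns the system's current load averages (it is not available on Windows, of course); see the PHP documentation for details. The documentation includes a sample snippet that makes its purpose fairly clear.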
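<?php
$load = sys_getloadavg();
// $load[0] is the 1-minute load average.
if ($load[0] > 80) {
    header('HTTP/1.1 503 Too busy, try again later');
    die('Server too busy. Please try again later.');
}
PS: if, "unfortunately", your PHP environment does not have this function, you can consider the following code (via):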
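if (!function_exists('sys_getloadavg')) {
    function sys_getloadavg()
    {
        $loadavg_file = '/proc/loadavg';
        if (file_exists($loadavg_file)) {
            // Keep only the three load averages and return them as floats,
            // like the built-in function does.
            $fields = explode(' ', file_get_contents($loadavg_file));
            return array_map('floatval', array_slice($fields, 0, 3));
        }
        return array(0, 0, 0);
    }
}
Used properly, this feature can take some of the pressure off the server.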
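pack()
pack has a counterpart, unpack. pack packs data into a binary string, and the original author's example is very clear:
$pass_hash = pack("H*", md5("my-password"));
If you use PHP 5, you can do this directly:
$pass_hash = md5("my-password", true); // PHP 5+
One benefit of doing this is that the string takes less storage space (how much? That might be a topic for another article).
Here is another sample that packs an array (via):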
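<?php
// Pack the values in $a according to the format string $v.
function pack_array($v, $a) {
    return call_user_func_array('pack', array_merge(array($v), (array)$a));
}
cal_days_in_month()
This function returns the number of days in a given month directly, for example:
$days = cal_days_in_month(CAL_GREGORIAN, date("m"), date("Y")); // e.g. 31
I bet you have implemented a function like this yourself :^)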
_()
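Er, this really is a PHP function (and possibly the shortest built-in PHP function). _() is its "nickname"; its full name is gettext().
Anyone who has written a WordPress theme will know functions such as __() and _e(); in fact PHP has shipped this functionality itself for a long time.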
// Set language to German
setlocale(LC_ALL, 'de_DE');
// Specify location of translation tables
bindtextdomain("myPHPApp", "./locale");
// Choose domain
textdomain("myPHPApp");
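echo _("Have a nice day");
With gettext you can write multilingual applications. What you may be wondering now is how to write the locale files, but that is beyond the scope of this post; more information is available here.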
get_browser()
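Frankly, seeing this function moved me to tears on the spot. With it, you no longer have to parse the $_SERVER['HTTP_USER_AGENT'] string yourself.
More information is available here. Before using this function you may need a browscap.ini configuration file; I trust you can sort that out.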
debug_print_backtrace()
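To inspect the function call stack I used to rely on extensions such as xdebug, but since PHP 5 the relevant function has been built in.
While I am at it, here is a slightly silly trick: if you cannot remember the function's name, this code achieves the same thing (though remembering the function name looks like the more sensible option):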
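<?php
$e = new Exception();
print_r(str_replace('/path/to/code/', '', $e->getTraceAsString()));
natsort()
This function performs natural-order sorting, which most of us will need at some point. Here are the documentation link and sample code: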
$items = array("100 apples", "5 apples", "110 apples", "55 apples");
// normal sorting:
sort($items);
print_r($items);
# Outputs:
# Array
# (
# [0] => 100 apples
# [1] => 110 apples
# [2] => 5 apples
# [3] => 55 apples
# )
natsort($items);
print_r($items);
# Outputs:
# Array
# (
# [2] => 5 apples
# [3] => 55 apples
# [0] => 100 apples
# [1] => 110 apples
# )
For the rules of the natural sorting algorithm, see the documentation here.
glob()
What this function does is just as moving. Without further explanation, here is the sample code:
foreach (glob("*.php") as $file) {
    echo "$file\n";
}
You presumably understand what the function does by now, so there is more we can do with it, for example listing only directories (via):
$dirs = array_filter(glob($path.'*'), 'is_dir');
Of course, for recursing through files you can also consider the SPL extension.
PHP Filter
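If you are still validating strings with regular expressions, you really are "out". Since PHP 5.2 there has been a built-in PHP Filter module dedicated to validating things like email addresses and URLs. Sample code:
var_dump(filter_var('bob@example.com', FILTER_VALIDATE_EMAIL));
Because it is a young module there are still plenty of pitfalls, for example: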
filter_var('abc', FILTER_VALIDATE_BOOLEAN); // bool(false)
filter_var('0', FILTER_VALIDATE_BOOLEAN); // bool(false)
But that should not stop us from trying it. There is enough to say about PHP Filter to fill another article.
-- Split --
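Finally, I am struck that PHP is a tool that stays fresh with age; without paying attention we end up tragically reinventing the wheel. So reading the PHP documentation regularly brings something new every time.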
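NoSQL may not seem like something a DBA should care about; it looks more like an architect's concern. But as a DBA who models data with traditional relational thinking, I believe it is worth understanding how modeling is done in NoSQL.
There are many NoSQL databases. The kind I follow most closely is the BigTable type, because it is a highly available, scalable distributed computing platform for handling massive amounts of structured data. Databases also handle structured data, so apart from the lack of SQL, the data models have similarities. Cassandra was open-sourced by Facebook and can be regarded as an open-source version of BigTable; Twitter and digg.com currently use it. Let us try to understand Cassandra's data model from a DBA's point of view.
NoSQL should not be read simply as "No SQL"; its essence is "No Relational". That is, it is not built on relational theory, whereas all of our traditional databases grew out of that theory. So SQL is not the crux of the matter: some NoSQL databases provide SQL-style interfaces that let you access data with SQL-like syntax. FriendFeed did the opposite: it used the relational database MySQL with a de-relationalized design to implement its own key-value store. So the essence of NoSQL is No Relational.
Cassandra features:
1. Flexible schema: there is no need to design the schema up front as with a database, and adding or removing fields is very easy (on the fly).
2. Range queries: you can run range queries on the Key.
3. High availability and scalability: the failure of a single node does not affect the cluster's service, and the cluster scales linearly.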
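Keyspace
The largest organizational unit in Cassandra. It contains a set of Column families, and a Keyspace is usually named after the application. You can think of it as a schema in Oracle, containing a set of objects.
Column family (CF)
A CF is a collection of data for a particular Key, and each CF is physically stored in its own file. Conceptually, a CF is a bit like a Table in a database.
Key
Data must be accessed through the Key. Cassandra allows range queries, for example :start => '10050', :finish => '10070'
Column
In Cassandra the column is the smallest unit of data. A column and a value form a pair, for example name:"jacky", where the column is name and the value is jacky. Every column:value pair carries a timestamp.
Unlike a database, a row in Cassandra can have any number of columns, and the columns can differ from row to row. From a database-design point of view, you can think of the table as having two fields: the first is the Key and the second is a long text type that holds many columns. This is why Cassandra is said to have a very flexible schema.
Super column
A Super column is a special kind of column that can hold any number of ordinary columns. A CF can likewise contain any number of Super columns, but a CF must be defined to use either Columns or Super columns; the two cannot be mixed. Below is an example of a Super column: homeAddress has three fields, street, city and zip:
homeAddress: {street: "binjiang road", city: "hangzhou", zip: "310052"}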
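Sorting
Unlike a database, where you define the sort order with ORDER BY, Cassandra always returns data in a fixed order. The data is stored according to the defined rule when it is written, so the order in which it comes back is already determined; this is a huge performance advantage. Interestingly, Cassandra sorts by column name rather than column value. It defines the following options for how to sort by column name: BytesType, UTF8Type, LexicalUUIDType, TimeUUIDType, AsciiType, and LongType. In effect, the column name is interpreted as one of these types, which is what makes flexible sorting possible. UTF8Type treats the column name as a UTF-8 string for sorting, LongType treats it as a 64-bit long, and TimeUUIDType sorts by time-based UUID. For example:
Column names sorted as LongType: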
{name: 3, value: "jacky"},
{name: 123, value: "hellodba"},
{name: 976, value: "Cassandra"},
{name: 832416, value: "bigtable"}
Column names sorted as UTF8Type:
{name: 123, value: "hellodba"},
{name: 3, value: "jacky"},
{name: 832416, value: "bigtable"},
{name: 976, value: "Cassandra"}
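The two orderings above can be reproduced with ordinary sorting. This is only a sketch of the comparison rules, not Cassandra's comparator code; the column names are the ones from the example.

```python
names = ["3", "123", "976", "832416"]

# LongType-style: interpret each column name as an integer.
as_long = sorted(names, key=int)

# UTF8Type-style: compare the column names as UTF-8 encoded strings.
as_utf8 = sorted(names, key=lambda n: n.encode("utf-8"))
```

The same four names come back in two different orders because only the interpretation of the name changes.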
Now let us look at the Twitter schema:
<Keyspace Name="Twitter">
<ColumnFamily CompareWith="UTF8Type" Name="Statuses" />
<ColumnFamily CompareWith="UTF8Type" Name="StatusAudits" />
<ColumnFamily CompareWith="UTF8Type" Name="StatusRelationships"
CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
<ColumnFamily CompareWith="UTF8Type" Name="Users" />
<ColumnFamily CompareWith="UTF8Type" Name="UserRelationships"
CompareSubcolumnsWith="TimeUUIDType" ColumnType="Super" />
</Keyspace>
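We see a keyspace called Twitter containing several CFs. StatusRelationships and UserRelationships are defined as CFs that hold Super columns. CompareWith defines the sort rule for columns and CompareSubcolumnsWith defines the sort rule for subcolumns; two rules are used here, TimeUUIDType and UTF8Type. There are no column definitions anywhere, which means the columns can change freely.
To make this easier to follow I will describe the Twitter schema using relational modeling terms, but please do not mistake that for Cassandra's data model. In Cassandra the columns of each row can be arbitrary; they do not have to be created up front when the table is defined, as in a database.
The Users CF records user information, the Statuses CF records the content of tweets, the StatusRelationships CF records the tweets a user sees, and the UserRelationships CF records the followers a user sees. Note that the sort type is TimeUUIDType, a UUID field ordered by time. The column name is produced by a UUID function (the function returns a UUID that reflects the current time, so you can sort by it, somewhat like a timestamp), so results come back ordered by time. Anyone who has used Twitter knows that you always see your latest tweets or your latest friends.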
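A small sketch of why a time-aware comparator is needed for such column names: in a version-1 UUID the low bits of the timestamp come first, so plain byte order does not follow time order. The helper `make_time_uuid` is hypothetical and builds UUIDs with hand-picked timestamps so the result is deterministic; it is not how Cassandra or a client library generates them.

```python
import uuid


def make_time_uuid(timestamp):
    """Build a version-1 UUID carrying the given 60-bit timestamp."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000  # version 1
    # Clock sequence and node are arbitrary fixed values for the example.
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             0x80, 0x00, 0x010203040506))


older = make_time_uuid(0xFFFFFFFF)
newer = make_time_uuid(0x100000000)

# Raw byte comparison puts the newer UUID first ...
by_bytes = sorted([older, newer], key=lambda u: u.bytes)
# ... while comparing the embedded timestamps gives chronological order.
by_time = sorted([older, newer], key=lambda u: u.time)
```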
Storage
Cassandra is column-based storage (so is Bigtable), which is the same idea as a column-oriented database.
API
Below is a comparison of the database, Bigtable and Cassandra APIs:
Relational:                     SELECT `column` FROM `database`.`table` WHERE `id` = key;
BigTable:                       table.get(key, "column_family:column")
Cassandra, standard model:      keyspace.get("column_family", key, "column")
Cassandra, super column model:  keyspace.get("column_family", key, "super_column", "column")
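One way to picture the two Cassandra call shapes is as lookups in nested maps. The sketch below uses plain Python dictionaries; the `get` function, the "Addresses" column family and the sample rows are hypothetical and only mimic the shape of the API, not a real client.

```python
keyspace = {
    # Standard column family: row key -> column -> value.
    "Users": {
        "10050": {"name": "jacky"},
    },
    # Super column family: row key -> super column -> column -> value.
    "Addresses": {
        "10050": {"homeAddress": {"city": "hangzhou", "zip": "310052"}},
    },
}


def get(column_family, key, *path):
    """Follow column family -> row key -> (super column ->) column."""
    node = keyspace[column_family][key]
    for name in path:
        node = node[name]
    return node


standard = get("Users", "10050", "name")
nested = get("Addresses", "10050", "homeAddress", "city")
```

The super column model simply adds one more level of nesting between the row key and the column.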
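My understanding of the Cassandra data model:
1. The column name holds the real value and the value is empty. Because Cassandra sorts by column name and stores by column, the column name is often used to hold the real value while the value part is left empty. For example: "jacky":"null", "fenng":"null"
2. A Super column can be seen as an index, a bit like a foreign key in a relational database. A super column allows fast lookups because it returns a whole set of columns, already sorted.
3. The sort order is fixed at definition time, so data always comes back in a known order; this is a huge performance advantage.
4. The schema is very flexible and columns can be defined freely. In fact, in many cases the column name is the value (a little confusing, isn't it?).
5. I did not find a clear explanation of the timestamp attached to each column. My guess is that it is for multi-versioning of data, or information needed when the lower layers clean up data.
Finally, a word on architecture. I think the core of architecture is making trade-offs; CAP and BASE both express this principle. The beauty of architecture is that no single architecture solves every problem perfectly. Databases and NoSQL each have their use cases, and our job is to find the architecture that suits us.
–EOF–
For this post I referred to up and running with cassandra, and I also referred to the API provided by Twitter, which helped me understand the design of the Twitter schema. There are surely many things I have misunderstood in this post; I hope readers will correct me.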
This is the second time in a week or so that I have run into this misconception, so it is time to blog about it.
How are BLOBs stored in InnoDB? It depends on three factors: BLOB size, full row size, and InnoDB row format.
But before we look at how BLOBs are really stored, let's see what the misconception is. A lot of people seem to think that with the standard ("Antelope") format the first 768 bytes are stored in the row itself while the rest is stored in external pages, which would make such BLOBs really bad. I have even seen a solution that stores several smaller BLOB or VARCHAR fields which are then concatenated to get the real data. This is not exactly what happens.
With the COMPACT and REDUNDANT row formats (used before the InnoDB Plugin and named "Antelope" in the InnoDB Plugin and XtraDB), InnoDB tries to fit the whole row onto an InnoDB page. At least 2 rows have to fit on each page plus some page data, which makes the limit about 8000 bytes. If the row fits completely, InnoDB stores it on the page and does not use external BLOB storage pages. For example, a 7KB BLOB can be stored on the page. However, if the row does not fit on the page, for example because it contains two 7KB BLOBs, InnoDB has to pick some of them and store them in external BLOB pages. It will still keep at least 768 bytes from each of those BLOBs on the row page itself. With two 7KB BLOBs, one BLOB is stored on the page completely while the other has 768 bytes stored on the row page and the remainder on an external page.
The decision to store the first 768 bytes of the BLOB may look strange, especially as MySQL internally has no optimizations for reading portions of a BLOB. It is either read completely or not at all, so the 768 bytes on the row page are of little use: if the BLOB is accessed, the external page always has to be read. The decision seems to be rooted in a desire to keep the code simple while implementing initial BLOB support for InnoDB. A BLOB can have a prefix index, and it was easier to implement indexes on BLOBs if their prefix is always stored on the row page.
This decision also causes strange data storage "bugs": you can store a 200K BLOB easily, but you can't store 20 10K BLOBs. Why? Because each of them will try to store 768 bytes on the row page itself, and together they will not fit.
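The arithmetic behind that "bug" can be sketched as follows. This is a deliberately simplified model that I am assuming for illustration: it uses the approximate 8000-byte row limit and the 768-byte prefix quoted above, and ignores row headers, BLOB pointers and all other columns.

```python
ROW_LIMIT = 8000      # approximate per-row limit quoted above, in bytes
INLINE_PREFIX = 768   # bytes kept on the row page for an off-page BLOB


def min_inline_bytes(blob_sizes):
    """Smallest on-page footprint if every large BLOB is pushed off-page."""
    return sum(min(size, INLINE_PREFIX) for size in blob_sizes)


def fits(blob_sizes):
    return min_inline_bytes(blob_sizes) <= ROW_LIMIT


one_big = [200 * 1024]          # a single 200K BLOB
many_small = [10 * 1024] * 20   # twenty 10K BLOBs

one_big_inline = min_inline_bytes(one_big)        # one prefix only
many_small_inline = min_inline_bytes(many_small)  # twenty prefixes
```

One large BLOB leaves a single prefix on the page, while twenty smaller ones need twenty prefixes, which already exceeds the limit before any other column is counted.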
Another thing to beware of with InnoDB BLOB storage is that external BLOB pages are not shared among BLOBs. Each BLOB, even if it has only 1 byte that does not fit on the page, will have its own 16K page allocated. This can be pretty inefficient, so I'd recommend avoiding multiple large BLOBs per row when possible. A much better decision in many cases is to combine the data into a single large BLOB (and potentially compress it).
If all columns do not fit on the page completely, InnoDB will automatically choose some of them to be on the page and some to be stored externally. This is not clearly documented, and it can be neither hinted nor seen. Furthermore, depending on column sizes, it may vary for different rows. I wish InnoDB had some way to tune this, allowing me to force actively read columns to be stored inline while pushing others to external storage. Maybe one day we'll get to implementing this in XtraDB.
So BLOB storage was not very efficient in the REDUNDANT (MySQL 4.1 and below) and COMPACT (MySQL 5.0 and above) formats, and the fix comes with the InnoDB Plugin in the "Barracuda" format with ROW_FORMAT=DYNAMIC. In this format InnoDB stores either the whole BLOB on the row page or only a 20-byte BLOB pointer, giving preference to smaller columns when deciding what to store on the page, which is reasonable as you can store more of them. BLOBs can have a prefix index, but this no longer requires the column prefix to be stored on the page: you can build prefix indexes on BLOBs which are often stored outside the page.
The COMPRESSED row format is similar to DYNAMIC when it comes to handling BLOBs and uses the same strategy of storing BLOBs completely off page. It will, however, always compress BLOBs which do not fit on the row page, even if KEY_BLOCK_SIZE is not specified and compression for normal data and index pages is not enabled.
If you’re interested to learn more about Innodb row format check out this page in Innodb docs:
It is worth noting that I use BLOB here as a very general term. From a storage perspective, BLOB, TEXT and long VARCHAR columns are handled the same way by InnoDB. This is why the InnoDB manual calls them "long columns" rather than BLOBs.
Entry posted by peter | 6 comments
Our patches for 5.0 have attracted significant interest. You can read about SecondLife’s experience here, as well as what Flickr had to say on their blog. The main improvements come in both performance gains and improvements to diagnostics (such as the improvements to the slow log output, and INDEX_STATISTICS).
Despite having many requests to port these patches to 5.1, we simply haven't had the bandwidth, as our main focus has been on developing XtraDB and XtraBackup. Thankfully a customer (who prefers to stay unnamed) has stood up and sponsored the work to move the patches to 5.1.
To refresh, the most interesting patches are:
- Performance patches for InnoDB ®. Although many of these patches are present in XtraDB / InnoDB-plugin, the RC status of the plugin means some customers' policies do not allow installing it in production.
- Diagnostic patches. We provide many more statistics in slow.log, e.g. execution plan, InnoDB timing, profiling info.
- Various patches to help with day-to-day usage of MySQL ®.
Two new features which are not available for 5.0:
- In slow.log, for a Stored Procedure call you can see profiling for each individual query from the procedure, not just call storproc().
- With userstat you can get additional THREADS_STATISTICS, which show similar information to USER/CLIENT_STATISTICS but at per-THREAD granularity (useful if you have a connection pool).
At this stage the patches are available only as source code; you can get them from Launchpad https://code.launchpad.net/~percona-dev/percona-patches/5.1.43. Binaries are also on the way and will be ready soon. We are running intensive stress-testing loads on them to provide stable, quality packages.
To finish, here are results for a tpce-like benchmark, where I compare MySQL-5.1.43 vs percona-5.1.43.
The results are for a TPCE configuration with 2000 customers, 300 trade days and 16 concurrent users on our R900 server. The dataset is about 25GB, fully fitting into the buffer_pool, so disk does not really matter, but the data was stored on a FusionIO 320GB MLC card.
On the chart I show the number of TradeResults transactions per 10 seconds during a 3600-second session (more is better).

As you can see, with the percona patches you can get just about a 10x improvement.
Yeah, that sounds too cool, but let me explain where the difference comes from.
As I mentioned in the tpce workload details, the load is very SELECT intensive and these SELECTs are mainly scans by secondary keys (not Primary Keys), so it hits problems in the InnoDB rw-lock implementation and in buffer_pool mutex contention, which are already fixed in percona-patches (and in XtraDB and InnoDB-plugin too).
So you are welcome to try it!
Entry posted by Vadim | 4 comments
2010-02-09 Tue
2010-02-08 Mon
2010-02-07 Sun
2010-02-06 Sat
- Oracle Life
- DBA notes
- 对牛乱弹琴 | Playin' with IT
- Oracle Security Blog
- Google 黑板报 -- Google 中国的博客网志
- Taobao.com UED Team
- Movable Type
- Eddie Awad's Blog
- The Tom Kyte Blog
- 小众软件
- 分享网络2.0
- 博客@英特尔中国
- NinGoo.net
- AnySQL.net
- DBA Tools
- MySQL Performance Blog
- Chanel [K]
- DBA@Taobao
- OracleBlog.cn
- Hello DBA
- 云风的 BLOG
- 白鸦,Blog
- ESB zone
- Gracecode.com
- flypig.org
- 槽边往事
- Inside AdSense
- High Scalability
- 王建硕
- 玉面飞龙的BLOG
- Oracle database internals by Riyaj
- 南方公园
- All Things Distributed
- Snax



![Win7 Library Tool 增强 win7 的库[图] | 小众软件 > system Win7 Library Tool 增强 win7 的库[图] | 小众软件 > system](http://img1.appinn.com/2010/02/223053000.png)








