Over the past year, the genealogy site’s repository of family historical data has more than doubled in size. Here’s how Ancestry managed its growth.
Businesses often use, or overuse, the term “big data” to describe all kinds of data-related services, but the buzzword certainly applies to Ancestry.com, a popular genealogy service that helps people dig up their family roots.
A little over a year ago, Ancestry was managing about 4 petabytes of data, including more than 40,000 record collections with birth, census, death, immigration, and military documents, as well as photos, DNA test results, and other information. Today the collection has quintupled to more than 200,000 records, and Ancestry’s data stockpile has soared from 4 petabytes to 10 petabytes.
According to Bill Yetman, senior director of engineering at Ancestry.com, the big data explosion brought growing pains. “We measured every step in our process pipeline,” said Yetman in a phone interview with InformationWeek. “We started with academic algorithms that folks are using at universities, and they work great at smaller scales.”
But, he added, these algorithms were breaking down as the database got bigger and bigger. “There is a very specific algorithm we use in matching [DNA]. It’s called Germline, and it was created by some very, very bright people at Columbia University,” Yetman told us.
To analyze its growing stockpile of DNA data, Ancestry needed to re-implement Germline using Hadoop and HBase. This process involved storing the data in HBase, and then using two map functions to run comparisons in parallel. “There are two MapReduce steps we use, and then we use HBase to hold the results, which makes it easy for us to do the [DNA] comparisons. If we couldn’t run this stuff in parallel, we couldn’t get it done nearly as fast.”
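For readers unfamiliar with the pattern, here is a minimal sketch of how a two-step MapReduce comparison can be structured. It is not Ancestry’s code and not the Germline algorithm: the segment labels, the sample names, and the `run_mapreduce` helper are all hypothetical, and a plain in-memory dictionary stands in for both Hadoop and HBase. The first step groups samples by the segments they carry and emits candidate pairs; the second counts how many segments each pair shares.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job:
    map every record, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


# Step 1: index samples by segment, then emit every pair sharing that segment.
def map_segments(record):
    sample_id, segments = record
    for segment in segments:
        yield segment, sample_id


def reduce_to_pairs(segment, sample_ids):
    for a, b in combinations(sorted(set(sample_ids)), 2):
        yield (a, b), 1


# Step 2: sum the per-segment hits for each candidate pair.
def map_pairs(record):
    pair, count = record
    yield pair, count


def reduce_counts(pair, counts):
    yield pair, sum(counts)


def shared_segment_counts(samples):
    """Chain the two steps; the returned dict plays the role of the results table."""
    candidate_pairs = run_mapreduce(samples, map_segments, reduce_to_pairs)
    return dict(run_mapreduce(candidate_pairs, map_pairs, reduce_counts))


if __name__ == "__main__":
    # Made-up samples with made-up segment labels, for illustration only.
    samples = [
        ("alice", ["s1", "s2", "s3"]),
        ("bob", ["s2", "s3"]),
        ("carol", ["s3", "s9"]),
    ]
    print(shared_segment_counts(samples))
```

Because each key is reduced independently in both steps, a real cluster can spread the groups across nodes, which is the property Yetman describes when he talks about running the comparisons in parallel.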
Hadoop’s vaunted scalability also helped Ancestry manage its growth. “If I need to improve my [performance] times, I can scale horizontally,” said Yetman. “Just add more nodes to the cluster, and we can handle the growth.”
Future growth, however, will require more innovation to keep things flowing smoothly. “You can’t just say, ‘OK, I’ve gotten over this 200,000 hump, and I can make it to 5 million.’ I know there are going to be challenges all along the way, and I’m going to be looking for them.”
Obviously, hardware performance must be monitored closely. “We have to watch the memory in each node, how we’re using memory, and how we’re using the CPU.”
Ancestry.com is also in the process of optimizing its Germline implementation to greatly reduce its memory usage. And it may team up with a cloud provider to boost its processing capacity.
The cloud option gained credence when Ancestry.com recently updated the algorithm for its ethnicity test. “We had to go back to those 200,000 people to rerun their ethnicity,” said Yetman. “We did that using machines in our datacenter.” But local hardware won’t be enough as the number of users climbs to 500,000 or 1 million.
Ancestry.com is currently evaluating several cloud providers, but Yetman acknowledges that privacy issues add a degree of complexity to the move. “It gets really tricky because DNA data is so sensitive. That’s one of the things that we as an organization are careful with.”
One potential solution: “I’m looking at bursting to the cloud… to do these calculations,” Yetman said. But rather than leaving the data in the cloud, he might “pull all of it back” to local storage to ease customers’ privacy concerns.