Selling DNA Data

by William Wells

Incyte doesn't need to make drugs. It already makes millions of dollars selling a bunch of As, Cs, Gs and Ts.

Human cells are almost ridiculously tiny and efficient. Every one of them has an entire genetic instruction set - three billion As, Cs, Gs and Ts - packed into a nucleus one hundredth of a millimeter across. And a few cents' worth of carbon, nitrogen and phosphorus will keep millions of cells happy in a dish, reproducing themselves and their DNA data.

Incyte Pharmaceuticals, Inc. (Palo Alto, Calif.) also stores DNA data, but in a way that is somewhat less compact and far more likely to impress a computer geek. Ground zero for Incyte is a highly air-conditioned room filled with scores of black, towering supercomputers worth up to a million dollars each.

Nine of the ten largest pharmaceutical companies in the world each pay Incyte ~$5 million per year to look at the data on those computers. The money has brought an expanding workforce - from 160 to 675 employees in the last two and a half years - and a frantic effort to keep generating more and more information. "We're maybe a year away from going bankrupt if we don't pursue new technologies at any one point," says Tod Klingler, Director of Research Bioinformatics at Incyte.

The manpower and computer power are needed to tame an ever-growing mountain of DNA sequence information. "We now have 24 organisms completely sequenced - we have their entire genetic message written down," says Temple Smith of Boston University. "This information allows us to get information about one organism and use that to understand another."

Evolution, the original Xerox machine

Incyte's DNA databases are valuable because nature is thrifty. As single cells evolved to make mice and then men, many genes were kept on to do their old jobs. Incyte can line up the genes and recognize the similarities in the DNA sequences. (A DNA sequence is the order of the several thousand A, C, G and T nucleotides that make up a single gene.) If the function of a yeast gene is known, the function of the related human gene is, by implication, similar. And yeast genes are easier to study. "We don't put humans in Waring blenders and do experiments," says Smith, "but we do put yeast in Waring blenders."
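The idea of lining up two genes and scoring their similarity can be sketched in a few lines of Python. This is only an illustration, not Incyte's method: it assumes the two sequences have already been aligned to the same length, the function name `percent_identity` is hypothetical, and the two fragments are made up. Real search tools also have to cope with gaps and with finding the best local match.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two pre-aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    # Count the positions where both sequences carry the same nucleotide.
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Made-up toy fragments for illustration, not real gene sequences.
yeast_fragment = "ATGGCTAGCTAGGATCCA"
human_fragment = "ATGGCTAGTTAGGACCCA"

identity = percent_identity(yeast_fragment, human_fragment)
print(f"{identity:.1f}% identical")
```

A high score between a well-studied yeast gene and an unknown human gene is the kind of hint that suggests the two may do similar jobs.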

"People have been blown away in recent years by the sequence comparisons," says Smith. "For example, in the fly the homeobox genes control what end is the mouth and what end is the ass, which apparently is very important. The same genes are used in mice, in exactly the same order. Once biology found a way to get a front and a back, she never did it again."

Pharmaceutical companies are worried about drugs, not evolution. But drugs for complex diseases do not spring forth fully formed. Modern drug hunting means finding a chemical that jams a specific protein, and the first step is to find that protein target. Only then can you look through thousands of chemicals to find the one that turns the protein off. If the protein is needed for the virus to invade the cell, or for the cancer cell to multiply, then the chemical is your new drug. "In the past it might have taken a long time to find one drug target, but now you can quickly get multiple targets," says Klingler. "The priming of the pump is no longer a problem."

The most important proteins, like those that tell a cell when and how to grow, are often remarkably similar from yeast to man. In fact a number of human genes can replace their yeast counterparts and keep the yeast alive (some of these genes are listed on the Web). A drug company that searches the Incyte database and finds the human version of a yeast growth protein would count itself lucky: Chemicals that can turn off that protein might have anti-cancer activity.

The immune system has no counterpart in yeast cells, but targets for anti-inflammatory drugs can also be found in the Incyte databases. These searches rely on another result of nature's thriftiness: proteins come in families. Immune cells, for example, must send countless messages to each other. Rather than invent a completely new protein for each message, evolution has Xeroxed its first effort and made minor changes. Even if one of these messengers is useless as a drug target (perhaps turning it off shuts down the whole immune system), it may allow you to find its cousin, which is specific to the auto-immune disease lupus.

The power of DNA databases mushrooms with their size. Every new piece of DNA sequence means not only one more possible drug target, but also one more sequence to compare all the old sequences against. "The sequence comparison method is not new," says Smith. "What is new is how much data we have."
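A rough way to see why the power grows faster than the size: if every sequence can be compared with every other, the number of possible pairs rises roughly with the square of the database size. The sketch below assumes simple all-against-all pairing, and the function name `pairwise_comparisons` is hypothetical.

```python
def pairwise_comparisons(n_sequences: int) -> int:
    """Number of distinct pairs among n sequences: n * (n - 1) / 2."""
    if n_sequences < 0:
        raise ValueError("n_sequences must be non-negative")
    return n_sequences * (n_sequences - 1) // 2


# Adding one sequence to a database of 1,000 creates 1,000 new comparisons.
before = pairwise_comparisons(1000)
after = pairwise_comparisons(1001)
print(before, after, after - before)
```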

Sequence, sequence, sequence

In simpler days, before people made up words like genomics and bioinformatics, a single gene was enough to keep a graduate student occupied for all five years of a doctorate. Sequence data trickled onto GenBank one gene at a time, even as the methods for obtaining that sequence became simpler and simpler. Then a renegade called J. Craig Venter had an idea: stop thinking about what DNA you should sequence and start sequencing anything you can get your hands on. Reasoned thought was out; brute force was in.

But there was some logic to Venter's approach. Genes are mere islands in a cell's DNA, stranded among seas of nonsensical filler DNA. The Human Genome Project promised to (eventually) sequence everything. (A genome is the collection of all the DNA in a given cell.) Venter wanted to fish out the informative bits first - just enough of each gene to take a guess at its place in the running of the cell. It is proteins that do the work of a cell, but proteins are made only after genes are converted to mRNA, which is then converted to protein. Venter took the mRNA and transformed it back into DNA that was ready to sequence and devoid of non-gene junk. After sequencing at most a few hundred nucleotides of each piece of DNA, he had his expressed sequence tag (EST).

Venter founded The Institute for Genomic Research (TIGR; Rockville, Md.) in July 1992, with $85 million of funding promised over ten years by Human Genome Sciences, Inc. (HGS; Rockville, Md.). Within a year, TIGR claimed it had identified ESTs for over half the estimated 70,000 human genes. TIGR and HGS parted ways in 1997: TIGR is now a not-for-profit institute with government funding, and HGS has focused on patenting genes (several hundred applications so far, with over fifty patents allowed) and developing the corresponding proteins as drugs.

Incyte began as a traditional pharmaceutical company. But when the failure of its premier drug in clinical trials coincided with Venter's EST splash, Incyte decided to re-invent itself. "We became basically a factory for sequencing DNA," says Klingler.

Incyte's database

Four years, six months, and three million human ESTs later, the sequencing machines are still running. The sequencing room is a far cry from the deserted computer room: people scurry everywhere to tend to the rows of sequencing machines. This room generates all the data that makes the company run, but the work is repetitive and the workers - many of them college students - are expendable. "There is a whole new temporary biologist market," says Klingler. "I don't know how long the average technician stays, but it's not too long."

Incyte's human database is called LifeSeq. The ESTs come from the mRNA of 669 different tissue samples, some of them diseased and some of them not, and represent perhaps 90-95% of all human genes. Genes that are often made into mRNA have been sequenced thousands of times, but some genes that are rarely converted into mRNA have yet to be sequenced even once. Incyte is also using the short ESTs to find the entire length of every human gene, and is working out where each gene lies on the 24 human chromosomes.

Newer databases include PathoSeq, which has most of the genes from 32 bacterial species, and ZooSeq, which includes genes from mice, rats, monkeys, and soon dogs. The sequencing operation that feeds these databases generates ~200,000 pieces of sequence, or over 40 million DNA nucleotides, every single month.

Growth strategies

Incyte could never have grown to this size by itself. It has aggressively collaborated with, licensed from and acquired companies that can provide:

Many of these collaborations are aimed at adding more bells and whistles to the databases. Any researcher or high school student can compare his or her favorite gene with public databases like GenBank or dbEST, using a common search method called BLAST, so the Incyte database must stand out. For starters, says Klingler, "our customers can get the first look at 40,000 human genes that are in no other database. And GenBank is like a snapshot - the data may be true when it is entered but it never gets updated based on new information coming in."

Genome projects are pouring new sequences into public databases, so the advantage of having more genes will not last for long. "They can't keep all this data secret but they don't care, because patent protection is lead time," says Smith. "If they know something six months ahead that's enough - then you can tell everyone everything."

That game has an inevitable conclusion. As Mark Fishman, a biologist at Massachusetts General Hospital (Boston, Mass.), observed at a recent genomics meeting in San Francisco, "The problem with defining a target like sequencing the genome is that you might succeed and then be out of a job."

Counting the clones

Incyte's escape clause is called expression analysis. Just by sequencing an insane number of ESTs the company gets a rough sense of how often each gene is expressed, i.e., made into its corresponding protein. In the pancreas, for example, the insulin gene will be turned on to make mRNA and then insulin protein. There will be hundreds of insulin mRNAs, and so hundreds of insulin ESTs, from pancreatic tissue, but no insulin ESTs from skin tissue. The Incyte database has this sort of information for almost all genes in almost all tissues.
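The counting idea can be sketched as a simple tally. The example below assumes each EST has already been assigned to a gene and a tissue, which in practice is the hard part. The records, the gene names other than insulin, and the function name `count_ests` are all made up for illustration.

```python
from collections import Counter, defaultdict


def count_ests(est_records):
    """Tally how many ESTs were seen for each gene in each tissue."""
    counts = defaultdict(Counter)
    for tissue, gene in est_records:
        counts[tissue][gene] += 1
    return counts


# Made-up (tissue, gene) records standing in for sequenced ESTs.
records = [
    ("pancreas", "insulin"),
    ("pancreas", "insulin"),
    ("pancreas", "insulin"),
    ("pancreas", "actin"),
    ("skin", "actin"),
    ("skin", "keratin"),
]

counts = count_ests(records)
print(counts["pancreas"]["insulin"])  # insulin ESTs found in pancreas
print(counts["skin"]["insulin"])      # none found in skin
```

A gene with many ESTs in one tissue and none in another is a candidate for being switched on only where it is needed.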

"Now we can do biological research with a picture of the entire human genome," says Klingler. "The classical approach is to look at one gene at a time. Having a peek at all the human genes will change the way you look at a problem. You can take a disease tissue and find all the genes that you see only in asthmatic lungs. That's never been possible in the past."

Counting up ESTs is what Klingler calls "low resolution" information. The future lies in chips that can hold tens of thousands of genes arrayed in a neat grid. A chip with every one of the 6116 genes of brewer's yeast has just been made by Joe DeRisi and Patrick Brown of Stanford University, and any number of researchers and companies are busy lining up collections of human genes. Those who are keen (and have a lot of spare time and $25,000 for parts) can even make their own chips and chip-readers using DeRisi's instructions. The two leading chip companies are Affymetrix, which in a confusing turn of events is both collaborating with Incyte and suing it for patent infringement, and Synteni, which was bought by Incyte last January.

Researchers using the chips first collect mRNA from two different sources, such as diseased and non-diseased tissue, or normal and drug-treated cells. The mRNA from diseased tissue can be labeled with a green dye, and non-diseased mRNA with a red dye. The mRNAs are then allowed to stick to their corresponding genes on the chip. If there is far more mRNA from gene 216 in the diseased state, position 216 will light up green, but if there is more mRNA 216 in the normal state it will be red. Equal expression gives a yellow spot. With one experiment the researcher can tell how every gene has reacted to the change.
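The color logic described above can be sketched as a comparison of the two dye signals at each spot. The two-fold cutoff used here is an assumption chosen for illustration, not a figure from the chip makers, and the function name `spot_color` and the signal values are hypothetical.

```python
import math


def spot_color(diseased_signal: float, normal_signal: float,
               fold_threshold: float = 2.0) -> str:
    """Classify one chip position from its green (diseased) and red (normal) signals."""
    if diseased_signal <= 0 or normal_signal <= 0:
        raise ValueError("signals must be positive")
    # Work on a log scale so a doubling and a halving are treated symmetrically.
    log_ratio = math.log2(diseased_signal / normal_signal)
    cutoff = math.log2(fold_threshold)  # assumed cutoff, here two-fold
    if log_ratio >= cutoff:
        return "green"   # more mRNA in the diseased sample
    if log_ratio <= -cutoff:
        return "red"     # more mRNA in the normal sample
    return "yellow"      # roughly equal expression


# Made-up signal values for three positions on a chip.
for position, (diseased, normal) in {216: (800, 100), 217: (100, 800), 218: (100, 110)}.items():
    print(position, spot_color(diseased, normal))
```

Running the same rule over every position on the chip gives the one-experiment overview of how each gene has responded.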

The flood of data from these methods is just beginning. "Probably 99% of the data collected using this technique haven't been published yet," says Brown. "It's a fast-moving and exciting field." Brown is looking at how yeast coordinate switching hundreds of genes on or off when they have more or less food, but the pharmaceutical companies will be looking at their favorite drug target. If the gene they proposed as a breast cancer target is also turned on in pancreatic cancer they should expand their clinical trials. And, given a choice, they should opt for the target that is not made in the stomach or blood, to minimize the chances that their drug will cause digestive and immune problems.

Finding a made-to-order gene that is on in one situation and off in many others used to be either a fluke or impossible. The chips make it a matter of a few experiments. That makes researchers like Klingler ambitious. "Our real goal is to understand the molecular basis of human biology," he says. "That's not going to happen in a traditional molecular biological way, one gene at a time."

Originally published in the web magazine Access Excellence.
