Archive for the ‘open source’ tag
Build from source in Windows
While Bench Press is a big fan of open source, we realize that it can be intimidating for the lay-scientist (or layperson for that matter) to build code from an open source repository when asking a question might quickly get the asker labeled a n00b and not taken seriously. This problem is especially relevant to us Windows users who don’t have ready access to UNIX-style command line-fu and are dependent on kind open source community members to create Windows-specific installers.
This weekend, while working on integrating the open source database SQLite into some code I was writing for Benchside, I realized that the only way to use SQLite’s Full Text Search capability was to recompile it from source code, something I had never done before. As I ran a Windows system, I wasn’t able to use the UNIX command line instructions from Michael Trier’s post on how to integrate SQLite’s Full Text Search capability into a Python program.
A few hours of research and trial and error later, I finally worked out how to do it on Windows in a way which should generalize to building other open source projects out there. Hopefully I can lay out the steps so that other “n00bs” no longer need to feel left out when someone gives them some source code:
- Install MinGW and MSYS – MinGW is the Minimalist GNU for Windows software pack, which gives Windows computers the development tools (like gcc and make) that users of UNIX-style operating systems (Linux, Mac) take for granted. MSYS provides a command line interface which directly emulates the UNIX command line, at least to the extent needed to build from source code. Both packages are open source and have Windows installers for download at their respective pages (to keep things simple). These installers are probably a sub-version or two behind (as the updated tools need to be packaged together in Windows installers), but with MinGW and MSYS, it should be relatively easy to download the newer tools and build them from source! In any event, install MinGW first; afterwards, when you install MSYS, it will ask where you placed the MinGW install and link directly to those compiler tools.
- (optional step) Install GnuWin32 – GnuWin32 is, like MinGW, a set of tools which emulate many of the other commands that UNIX-style operating systems have. The primary use for these would be to do other command line tricks which the UNIX guys have access to (e.g. downloading and de-compressing source code files from the command line). This is unnecessary, as most commercial compression software (or open source software like 7zip) can handle most of the de-compressing that you need.
- (optional step) Install the Windows Open Command Window Here PowerToy which will let you right-click on a folder and open a command line window right there. Makes things more convenient many times.
- Add MinGW and MSYS to your system path – Now that you’ve installed MinGW and MSYS (and possibly GnuWin32), you need to make sure that your tools can be accessed from anywhere by command line and not just in the folders that you installed them to. To do this, pull up the “System” panel in your Control Panel. You can do this by clicking on “Start”, “Run”, and typing in “control sysdm.cpl” and hitting [ENTER]. A window should pop up called “System Properties”. Click on the “Advanced” tab and then click on the “Environment Variables” button. Another window should pop up. In the System variables panel (the one on the bottom, see image below), select the row that has “Path” in the Variable column and click on the “Edit” button. This should bring up a text-editing dialog box where you should add the full directory paths to the MinGW bin and MSYS 1.0 folders, separated by semicolons. In my case, for instance, the paths to the folders were “C:\MinGW\bin” and “C:\msys\1.0” and so I added “C:\MinGW\bin;C:\msys\1.0;” to the end of the Variable Value text box.
- Download and de-compress the source code – This usually comes in the form of a gzipped tarball file (*.tar.gz) which you can unpack with software like 7zip or at the command line (if you have the appropriate tools installed) using the gzip and tar commands.
- Run MSYS and use the cd command to go to the directory where you’ve de-compressed the source code – You should be able to do this, assuming you added MSYS to your path properly, from anywhere in the command line by just typing msys and hitting [ENTER]. Use the cd command (syntax: “cd <name of directory>”) to move to the directory where you unzipped the source code. In my case it was: “cd C:/sqlite3/sqlite-3.6.22/”
- Set any compiler flags that need to be set and run ./configure – For SQLite, I had to set the SQLITE_ENABLE_FTS3 and SQLITE_ENABLE_FTS3_PARENTHESIS flags before compiling. To do this, I simply typed in: CFLAGS="-DSQLITE_ENABLE_FTS3=1 -DSQLITE_ENABLE_FTS3_PARENTHESIS=1" ./configure and hit [ENTER]. What this does is, first, set the environment variable CFLAGS, which the build hands to the compiler to set any preprocessor flags it needs before building (-DXXX=Y sets the preprocessor variable XXX to the value Y), and second, call ./configure, which runs a quick diagnostic to see if your system has all the development tools it needs to compile. If an error pops up, this might be a sign that your MinGW installation is incomplete or that you did not set the system path correctly.
- Follow any instructions in the README – Source code packs usually come with a README text file which gives instructions for how to build the software package. You should definitely read those (as they may also tell you which compiler flags you might want to set in the previous step). These usually end with you entering the instruction “make” (and then [ENTER]) and subsequently “make install” (and then [ENTER]) to make sure that the compiled code is installed at its proper location on the system.
- Post-setup – Depending on the software, there may be further configuration/setup steps that need to take place. In the case of my custom SQLite build, I had to copy the Windows DLL file (libsqlite3-0.dll) from the ./lib folder that was created by the build process into my Python installation’s DLL folder and rename it sqlite3.dll to replace my old built-in Python-SQLite setup.
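One way to confirm that the swapped-in DLL actually took effect is to ask Python’s sqlite3 module to create and query an FTS3 table. The snippet below is a minimal sketch and not part of the original build recipe; the function name fts3_available and the probe table are made up for illustration.

```python
import sqlite3


def fts3_available():
    """Return True if the SQLite library behind Python's sqlite3 module
    accepts FTS3 virtual tables and full-text MATCH queries."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        # This statement fails with OperationalError ("no such module: fts3")
        # when SQLite was compiled without full text search support.
        conn.execute("CREATE VIRTUAL TABLE probe USING fts3(body)")
        conn.execute(
            "INSERT INTO probe (body) VALUES ('build from source on windows')"
        )
        rows = conn.execute(
            "SELECT body FROM probe WHERE probe MATCH 'source'"
        ).fetchall()
        return len(rows) == 1
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    # sqlite_version reports the version of the SQLite library that was loaded,
    # which should match the source you just built.
    print("SQLite library version:", sqlite3.sqlite_version)
    print("FTS3 available:", fts3_available())
```

If this reports False after the DLL has been replaced, Python is most likely still loading the old library, or the flags did not make it into the build.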
The process above is probably not 100% fool-proof and skims over some details which may be important for different source code types, but hopefully my painfully self-taught lessons in building from source in Windows-land will be helpful to those of you out there who find yourself needing to build something by compiling from source code.
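If ./configure complains about missing tools, it can help to check what the system path actually exposes before digging any deeper. This is a rough sketch using Python’s standard shutil.which; the helper name find_tools and the list of tool names are assumptions for illustration, and the result depends on which shell Python is launched from.

```python
import shutil


def find_tools(names):
    """Map each tool name to its full path on the system path, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # Tool names are examples; adjust to whatever the README says the build needs.
    for tool, location in find_tools(["gcc", "make"]).items():
        print(f"{tool}: {location or 'NOT FOUND - check your Path variable'}")
```

A tool reported as not found usually points back to the Path editing step, or to a MinGW/MSYS install that did not complete.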
Why Bio/Pharma Should Open Up
The Open Science movement is driven by the idea that collaboration and openness are good for innovation and discovery. After all, the logic goes, who is more likely to discover a cure for cancer: five research groups with different sets of skills and specializations who don’t share any information with one another, or five identical groups who actively pool their knowledge?
Ironically, that reasoning seems to have completely skipped over the biotech/pharmaceutical industry, which seems intent on pursuing the “divided we fall” approach despite the escalating costs of drug development. Openness is especially relevant here, as a significant piece of the $800 million – $1.2 billion price tag that goes with bringing one drug to market is the cost of failed R&D projects.
This problem is one that is not only a burden on the shareholders and executives at these companies, but also a burden on the healthcare systems of people around the world, who have to pay more and wait longer for drugs to make it through a company’s pipeline.
About a month ago, the New York Times cast a spotlight on this problem:
Although many companies have committed to publishing the results of clinical trials, whether or not they succeed, drug makers don’t typically publish information about projects that fail at an earlier stage. A result is that companies waste many millions going down experimental paths that their competitors have already found to be dead ends.
M.I.T. is proposing dead-end drug disclosure, a concept that goes by a euphemistic mouthful: “precompetitive information sharing.”
Drug makers may realize that the financial and medical value of sharing such information outweighs the competitive risk, said Dr. Gigi Hirsch, the executive director of the M.I.T. Center for Biomedical Innovation, the locus of the drug project. “There should be more information available about failed compounds in the interest of the greater good,” Dr. Hirsch said.
The traditional response from the pharmaceutical industry is one that is familiar to Open advocates – that intellectual property and proprietary platforms are necessary for the returns which drive investment in these spaces. But, this ignores two things.
First, the act of sharing information on failed assay hits helps to reduce the cost and time of development. The investment decision is driven by returns, and returns are driven by costs and the delay in achieving revenues. While being more open about internal failures and successes will do little to change the cost of marketing, clinical trials, and many aspects of development, it will reduce a large portion of the time and cost of initial R&D (because of the wider availability of information) and the cost of failure (as there would be no need to pursue avenues of research on paths that have already been deemed a failure). So even if the price premium and number of drugs sold diminish slightly due to the inability of a company to hold on to some early-stage proprietary advantage or the release of some of the details on their compound library, the time to market and cost of development diminish as well, helping to preserve the return on investment necessary to provide for the level of drug innovation and discovery that patients and doctors desire.
Secondly, a move to openness does not necessitate unprofitability. It wouldn’t be realistic to ask companies to pursue a path which destroys the shareholder value they’ve been entrusted to protect. But, the fact that technology companies like Google and Nokia have been able to push open standards and open source yet retain profitability and innovativeness should hopefully signal to bio/pharmaceutical companies that it is possible to pursue shareholder value and a degree of openness.
The path to profitable openness that technology industry practitioners like Google pursue is no different than the path that any company pursues: specialization. Google, for example, has specialized in high quality search results, effective advertisement targeting, and IT infrastructure management. As a result, there’s no reason for Google to prevent the broader technology industry from having access to the source code for its Chrome web browser and Android mobile operating system, and it provides APIs to access almost all of the information you can find on its websites. In fact, Google seems to have understood that keeping its information closed would reduce innovation on the web, which would in the long run hurt its own growth and profitability prospects.
I’d humbly wager that the value of protecting early stage failure and platform information is relatively minor in the grand scheme of pharmaceutical company value (and in fact some companies do publish failures, only years after the experiment), and that major drug companies probably have much more significant expertise and differentiation in the steps after initial R&D, such as in compound refinement, clinical trials operation, process development, and computational analysis, etc. Not to mention that healthcare worldwide and the state of drug development science in both academic and corporate settings would probably benefit significantly more from differentiation along these activities than any proprietary lock on compound failure information would enable.
Nobody is saying that this path will magically materialize and produce awe-inspiring levels of profitability and growth. But, when an industry is on the edge of a patent cliff (most blockbuster drugs are expected to become generic in the next couple of years), and its primary source of “value creation” seems to be in buying smaller companies, and nations around the world are struggling with healthcare costs, I’d assert that it needs to change its practices.
As for how – I would propose the following compromise.
- First, the NIH, PhRMA, or some other neutral authority should define a set of standards for what information should be contributed (balancing the desire to foster innovation through openness and the desire for companies to maintain the closedness they need to build proprietary advantages along other dimensions), create a standard for secure information sharing (which protects any individuals and patients and proprietary pipeline-related information) and govern compliance.
- This body should then set up an information exchange/database through which participating companies, academic institutions, government research centers, and medical institutions can share information, and should bar non-compliant companies from access (it’s not a perfect solution, but it could help reassure companies who worry that they will give up all of their information and receive none in return).
This road would likely be a long and difficult one, but given the stakes and the potential benefits, I think it is one well worth taking.
To Stimulate Open Science
A lot of scientific circles are talking about how best to spur collaboration, and that’s spawned a number of movements, such as “open access” and “open science” — both inspired by the “open source” movement in programming — that fight to end the fencing of science into proprietary, commercial enclaves that require fees to access. Clearly, in terms of fostering the trade of knowledge, an open, free highway is better than a highway with a large toll.
Although much of this movement towards open science has focused on journals and their large subscription fees, there’s another area of open science that’s drawn my attention: Gene Ontology (GO) annotations, which are a set of standardized annotations to classify genes according to their biological function, such as “amino acid metabolism.” These annotations are, as of now, curated by experts. What I’ve noticed in particular is that GO has thrived in one community and withered in another, and I’m curious as to why.
The yeast community is famous amongst all the molecular biology communities as being open and collaborative, to the extent that almost all gene names have been systematized, annotations for genes are very extensive and well-structured, a strain is available for the deletion of every gene, many genes are available fused to a fluorescent marker for easy microscopy, and so on. Just go to the Saccharomyces Genome Database, and there’s a wealth of all this sort of information at your fingertips, centralized, standardized, interconnected, and easy to use. In particular, the Gene Ontology annotations are considered superb and accurate, allowing for easy computational interpretation of large-scale experiments involving hundreds and thousands of genes and their interactions. Yeast genomicists use GO all the time, and contribute to its development very often.
In contrast, the human Gene Ontology annotations are considered sparse and relatively uninformative, and generally they aren’t quite as useful for interpreting things like gene expression microarrays. Instead, one of the most successful and popular sets of biological function annotations is called Ingenuity, which is a commercial software package, well developed by the large amount of money poured into it by pharmaceutical companies and other health science research and development.
Why did the two communities end up going in two directions, one towards a more collaborative, “open science”-friendly annotation system, and the other towards a proprietary, commercial annotation platform? Undoubtedly, part of the reason is the structure of financial incentives; human biology has unique opportunities for direct commercialization via drug or health research, and so people would naturally focus their efforts on things that can win them fortune. But the first yeast biology research done by Louis Pasteur was probably related to budding (pun intended) commercial R&D on reproducible bread/wine/beer recipes, so what prevented the yeast community from, say, balkanizing yeast research because of incentives from the beer brewing and bread-making industries?
Perhaps it is because the yeast community arrived at common standards and nomenclature for information sharing long before it got very large. After all, yeast doesn’t have nearly the same problem of multiple names for the same gene that humans do (just look at the gene RANKL, which is also known as OPGL, ODF, CD254, TNFSF11, TRANCE, and hRANKL2). Yeast also doesn’t have nearly as much of a problem with the explosion of gene database IDs (humans have, as a small sample: RefSeq, HGNC, Ensembl, EMBL/GenBank, Entrez, MIM, Unigene, UniProt/SwissProt, and UCSC). Perhaps having a common, universal standards-making institution is the answer, to make sure all the railroad tracks are the same width, to use an analogy.
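To make the naming problem concrete, here is a small sketch of the bookkeeping every analysis of human gene lists ends up doing: collapsing aliases onto one symbol before anything can be compared. The alias list is the one given above; treating TNFSF11 as the canonical symbol, and the helper names build_lookup and normalize, are assumptions of this sketch and not any database’s actual API.

```python
# Aliases for one human gene, taken from the list in the text.
# Choosing TNFSF11 as the canonical symbol is an assumption of this sketch.
ALIASES = {
    "TNFSF11": ["RANKL", "OPGL", "ODF", "CD254", "TRANCE", "hRANKL2"],
}


def build_lookup(aliases):
    """Build a case-insensitive map from every known name to its canonical symbol."""
    lookup = {}
    for canonical, names in aliases.items():
        lookup[canonical.upper()] = canonical
        for name in names:
            lookup[name.upper()] = canonical
    return lookup


def normalize(symbols, lookup):
    """Replace known aliases with the canonical symbol; leave unknown names as-is."""
    return [lookup.get(symbol.upper(), symbol) for symbol in symbols]


if __name__ == "__main__":
    lookup = build_lookup(ALIASES)
    # Three different labs' names for the same gene collapse to one symbol.
    print(normalize(["RANKL", "opgl", "TRANCE"], lookup))
```

With systematic names agreed on up front, as in yeast, this translation layer is barely needed; without them, every tool has to maintain its own table, and the tables rarely agree.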
Or perhaps it’s the size of the community. There are many, many more labs studying human biology than yeast biology, not only because of the financial incentives, but also because of the huge size of the human genome (roughly 250 times bigger than the yeast genome). Maybe it’s just easier to coordinate fewer people into one community.
I think as the scientific community moves forward, especially in embracing new collaborative methods on the internet, we should closely examine what’s worked so far and what hasn’t, so that we don’t end up fording through endless patents, fees, and proprietary, non-interoperable data structures to get what we need.