Slithering your way into bioinformatics with snakemake, round 2.
I was one of the reviewers of the Salmon paper by Patro et al., 2017, Salmon provides fast and bias-aware quantification of transcript expression, and I posted my review in large part because of Lior Pachter's blog post levying charges of intellectual theft and dishonesty against the Salmon authors. More …
I was one of the reviewers of the Salmon paper by Patro et al., 2017, Salmon provides fast and bias-aware quantification of transcript expression. I was asked to review the paper on September 14, 2016, and submitted my review (or at least stopped getting reminders :) soon after October 20th.
The …
This blog post stems from notes I made for a 12 minute talk at the Oregon State Microbiome Initiative, which followed from some previous thinking about data integration on my part -- in particular, Physics ain't biology (and vice versa) and What to do with lots of (sequencing) data.
My talk …
This is early draft text that Anita and I put together from a bunch of brainstorming done at the Imagining Tomorrow's University workshop. Comments welcome!
Communities are the fabric of open research, and serve as the basis for development and sharing of best practices, building effective open source tools, and …
This is our just-submitted proposal for the JGI-NERSC "Facilities Integrating Collaborations for User Science" call. Enjoy!
Abstract: Sourmash is a command-line tool and Python library that calculates and compares MinHash signatures from sequence data. Sourmash "compare" and "gather" functionality enables comparison and characterization of signatures …
Note: we were just awarded this allocation on Jetstream for DIBSI. Huzzah!
Large datasets have become routine in biology. However, performing a computational analysis of a large dataset can be overwhelming, especially for novices. From June 18 to July 21, 2017 (30 days), the Lab for Data Intensive Biology …
So I've been invited to Imagining Tomorrow's University, and they have this series of questions they'd like me to answer.
(Note that you can follow the conversation at #TomorrowsUni on Twitter.)
Conveniently I already answered many of these questions in my "What is Open Science?" blog post. I've copy/pasted …
As part of my Moore Foundation Data Driven Discovery grant, I have to put together annual reports. (This is more or less standard for grants. ;) You can read my annual report narrative, here, and my (ancillary, not required) breakdown of projects in the lab, here.
We are currently soliciting applications for computational postdoctoral fellows to undertake exciting projects in computational biology/bioinformatics jointly supervised by Dr. Titus Brown (http://ivory.idyll.org/lab/) and Dr. Fereydoun Hormozdiari (http://www.hormozdiarilab.org/) at UC Davis.
UC Davis is a world class research institution with a strong …
This is another blog post on MinHash sketches; see also:
Note: This is the fifth post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.
This post was put together after the event and benefited greatly from conversations with Victoria Stodden, Yolanda Gil, Monya Baker, Gail Peretsman-Clement, and Kristin Antelman!
Note: This is the fourth post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.
This is an outline of the talk I didn't give at Caltech, because I decided that Victoria Stodden and Yolanda Gil were going to cover most of …
Note: This is the third post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.
I've been struggling to put together an interesting talk for the workshop, and last night Gail Clement (our host, @Repositorian) and Justin Bois helped me convince myself …
Note: This is the second post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.
An important yet rarely articulated assumption of a lot of my work in biological data analysis is that data implies software: it's not much good gathering data …
Note: This is the first post in what I hope to be a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.
Even preprints go through some review before they're posted, just to make sure they're …
I'm writing this up for the rOpenSci call on Codes of Conduct that I'm participating in today.
My lab has a lab Code of Conduct.
We adapted it from https://github.com/confcodeofconduct/confcodeofconduct.com. So the "how" was easy enough :).
Key points I want to make:
One of the uses that we are most interested in MinHash sketches for is the indexing and search of large public, semi-public, and private databases. There are many specific use cases for this, but the basic goal is to be able to find data sets by content queries, using sequence …
This is an update to last week's blog post, "Efficiently searching MinHash Sketch collections".
Last week, Thanksgiving travel and post-turkey somnolescence gave me some time to work more with our combined MinHash/SBT implementation. One of the main things the last post contained was a collection of MinHash signatures of …
There is an update to this blog post: please see "Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists!"
Note: This blog post is based largely on work done by Luiz Irber. Camille Scott, Luiz Irber, Lisa Cohen, and Russell Neches all collaborated on …
Update: Zenodo will remove content upon request by the owner, and hence is not suitable for long-term archiving of published code and data. Please see my comment at the bottom (which is just a quote from an e-mail from a journal editor), and especially see "Ownership" and "Withdrawal" under Zenodo …
Our first JOSS submission (paper? package?) is about to be accepted and I wanted to enthuse about the process a bit.
JOSS, the Journal of Open Source Software, is a place to publish your research software packages. Quoting from the about page,
The Journal of Open Source Software (JOSS) is …
(Please contact us at bnsacks@ucdavis.edu if you are interested in access to any of this data. We're still working out how and when to make it public.)
The tule elk (Cervus elaphus nannodes) is a California-endemic subspecies that underwent a major genetic bottleneck when its numbers were reduced …
I just left Woods Hole, MA, where I spent the last 6 and a half weeks taking the Microbial Diversity course as a student. It was fun, exhausting, stimulating, and life changing!
The course had three components: a lecture series, in which world-class microbiologists gave 2-3 hrs of talks each …
As I wrote last week my latest enthusiasm is MinHash sketches, applied (for the moment) to RNAseq data sets. Briefly, these are small "signatures" of data sets that can be used to compare data sets quickly. In the previous blog post, I talked a bit about their effectiveness and showed …
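To make the "small signatures" idea concrete, here is a minimal sketch of a bottom-k MinHash in plain Python. It is only an illustration under stated assumptions: the function names (`kmers`, `sketch`, `similarity`), the use of MD5 as the hash, and the default k-mer and sketch sizes are all made up for this example and are not taken from any particular tool.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer in a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def sketch(seq, k=5, size=20):
    """Bottom-`size` MinHash sketch: keep only the smallest hash values.

    MD5 is an arbitrary choice for this illustration (hypothetical).
    """
    hashes = {int(hashlib.md5(km.encode()).hexdigest(), 16)
              for km in kmers(seq, k)}
    return set(sorted(hashes)[:size])


def similarity(a, b, size=20):
    """Estimate Jaccard similarity from two sketches.

    Take the `size` smallest hashes of the union and count how many
    of them appear in both sketches.
    """
    union = sorted(a | b)[:size]
    if not union:
        return 0.0
    shared = sum(1 for h in union if h in a and h in b)
    return shared / len(union)
```

Comparing two data sets then only needs the two small sketches, e.g. `similarity(sketch(seq1), sketch(seq2))`, rather than the full sequences.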
(I gave a talk on this on Monday, April 11th - you can see the slides here, on figshare.
This is a Reproducible Blog Post. You can regenerate all the figures and play with this software yourself on binder.)
So, my latest enthusiasm is MinHash sketches.
A few weeks back …
So, there's this fairly large collection of about 700 RNAseq samples, from 300 species in 40 or so phyla. It's called the Marine Microbial Eukaryotic Transcriptome Sequencing Project (MMETSP), and was funded by the Moore Foundation as a truly field-wide collaboration to improve our reference collection for genes (and more …
I'm writing a proposal to the Sloan Foundation for about $20k to support a workshop to hack on mybinder. Comments solicited. Note, it's, umm, due today ;).
(I know the section on "major related work" is weak. I could use some help there.)
If you're interested in participating and don't mind …
Preface from Titus: this is an e-mail written by Deniz Kural of Seven Bridges Genomics in response to concerns and accusations about their patents and patent applications on genomics workflow engines and graph genome analysis techniques. It was sent to a closed list initially, and I asked Deniz if we …
Over the last few months, I've been playing with hypothes.is and thinking about how to use it to further my scientific work. This resulted in some brainstorming with Jon Udell and Maryann Martone about, well, lots of things. And now we're putting in an open science prize entry!
tl …
If you haven't seen mybinder.org, you should go check it out. It's a site that runs IPython/Jupyter Notebooks from GitHub for free, and I think it's a solution to publishing reproducible computational work.
For a really basic example, take a look at my demo Software Carpentry lesson. Clicking …
CORRECTION: I mistakenly linked to Geoff Bilder, Jennifer Lin, and Cameron's piece on infrastructure in the first posted version, rather than Cameron's post on culture in data sharing. Both are worth reading but the latter is more relevant to this post, and I also wanted to make sure I correctly …
I wrote the below in response to someone who e-mailed me about trying out our partitioning approach for metagenome assembly.
yes, the original partitioning approach worked only on low coverage data sets. The main reason is that highly connected regions (repeats, from biology; and some kinds of sequencing errors) are …
A while back, Kai Blin (via Nick Loman) asked Michael Barton:
If we containerize all these things won't it just encourage worse software development practices; right now developers still need to consider someone other than themselves installing the software.
and Michael Barton's response, transcribed, was:
"It's a good point. Ultimately …
Recently I was asked by someone at a funding organization about the term "hardening software"; I wrote a blog post asking others what they thought, and this got a number of great comments (as well as spurring Dan Katz to write a blog post of his own). I'd already written …
I just received an e-mail from someone in the funding world who thinks a lot about software, and they were interested in any thoughts I might have on the term "software hardening", and its practice. To quote,
This is about making research software more robust, more easily usable and possibly …
As part of our Docker hands-on workshop earlier this month, I learned a lot about building Dockerfiles, running Docker containers on remote hosts with docker-machine, and using data volumes to manage data in remotely hosted Docker containers.
During and after the workshop, I put together Docker images (and, more importantly …
Pubwication. Pubwication is what bwings us togethew today. Pubwication, that bwessed awwangement, that dweam within a dweam. And authorship, twue authorship, wiww fowwow you fowevah and evah. So tweasuwe youw authorship.
Last week, our software paper on khmer 2.0 was published on F1000Research. We intend this paper to be …
Note: at the Lab for Data Intensive Biology, we're trying out a new journal club format where we summarize our thoughts on the paper in a blog post. For this blog post, Lisa Cohen wrote the majority of the text and the rest of us added questions and comments; Lisa …
Below is our first review of the paper Detection of low-abundance bacterial strains in metagenomic datasets by eigengenome partitioning by Brian Cleary et al., recently published in Nature Biotech.
All in all, this is one of the most technically interesting papers we've read in a while.
A few interesting tidbits …
I just heard the sad news that Eric Davidson, my PhD advisor, passed away.
Eric was a giant in the field of developmental biology and gene regulatory networks. His work spanned more than fifty years, and had an indelible impact on gene regulation studies. (You can read up on his …
Just as I was moving to UC Davis, a funding call for a training coordination center came out. I got partway down the path of applying for it before realizing that I was overwhelmed with the move, but I did generate some text that I thought was OK. Here it …
Note: this is a blog post from the DIB Lab journal club.
Jump to Questions and Comments.
The paper:
http://www.techfak.uni-bielefeld.de/~stoye/dropbox/wabi2015final.pdf
"Bloom Filter Trie: a data structure for pan-genome storage."
by Guillaume Holley, Roland Wittler, and Jens Stoye.
Note: at the Lab for Data Intensive Biology, we're trying out a new journal club format where we summarize our thoughts on the paper in a blog post. For this blog post, Luiz wrote the majority of the text and the rest of us added questions and comments.
The paper …
Note: A year ago, I wrote this in response to an editorial request. Ultimately they weren't interested in publishing it, and I got distracted and this languished on my hard disk. So when I remembered it recently, I decided to just push it out to my blog, where I should …
Note: Last week, I submitted my review of Stephen R. Piccolo, Adam B. Lee, and Michael B. Frampton's paper, Tools and techniques for computational reproducibility. Soon after, Dan Katz wrote a blog post about notebooks, and in a comment I mentioned Piccolo's paper; and, after dropping a note to Dr …
This is a response to (parts of) Dr. Lior Pachter's post, "The myths of bioinformatics software". (You can also see my post on bioinformatics software licensing for at least some of the background arguments.)
I agree with a lot of what Lior says: most bioinformatics software is not very good …
If a piece of bioinformatics software is not fully open source, my lab and I will generally seek out alternatives to it for research, teaching and training. This holds whether or not the software is free for academic use.
If a piece of bioinformatics software is only available under the …
(This is a review of Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees, Solomon and Kingsford, 2015.)
In this paper, Solomon and Kingsford present Sequence Bloom Trees (SBTs). SBT provides an efficient method for indexing multiple sequencing datasets and finding in which datasets a query sequence is present …
We just submitted our review of the paper Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees, by Brad Solomon and Carl Kingsford.
The paper outlines a fairly simple and straightforward way to query massive amounts of sequence data (5 TB of mRNAseq!) in very small disk (~70 GB …
I gave a presentation at the BEACON Center's coding group this past Monday; here are my notes and followup links. Thanks to Luiz Irber for scribing!
My short slideshow: here
The khmer project is on github, and we have a tutorial for people who want to try out our development …
On Tuesday, I wrote a draft blog post in response to Michael Eisen's blog post on how Lior Pachter's blog post was a model for post-publication peer review (PPPR). (My draft post suggested that scientific bloggers aim for inclusivity by adopting a code of conduct and posting explicit site …
I'm starting to work on a grant renewal for khmer, and with a lot of help from the community, including most especially Richard Unna-Smith, I've put together the following blurb. Suggestions for things to rearrange, highlight or omit welcome, as well as suggestions for things to add. I can't make …
Note: at the Lab for Data Intensive Biology, we're trying out a new journal club format where we summarize our thoughts on the paper in a blog post. For this blog post, Camille wrote the majority of the text and the rest of us added questions and comments.
Inanç Birol …
Last week we wrote five blog posts about some previously un-publicized features in the khmer software - most specifically, read-to-graph alignment and sparse graph labeling -- and what they enabled. We covered some half-baked ideas on graph-based error correction, variant calling, abundance counting, graph labeling, and assembly evaluation.
It was, to be …
One of our long-term interests has been in figuring out what the !$!$!#!#%! assemblers actually do to real data, given all their heuristics. A continuing challenge in this space is that short-read assemblers deal with really large amounts of noisy data, and it can be extremely hard to look at assembly …
So far, in this week of khmer blog posts (1, 2, 3), we've been focusing on the read-to-graph aligner ("graphalign"), which enables sequence alignments to a De Bruijn graph. One persistent challenge with this functionality as introduced is that our De Bruijn graph nodes are anonymous, so we have no …
De Bruijn graph alignment should also be useful for exploring concepts in transcriptomics/mRNAseq expression. As with variant calling, graphalign can also be used to avoid the mapping step in quantification; and, again, as with the variant calling approach, we can do so by aligning our reference sequences to the …
There's an interesting and intuitive connection between error correction and variant calling - if you can do one well, it lets you do (parts of) the other well. In the previous blog post on some new features in khmer, we introduced our new "graphalign" functionality, which lets us align short sequences …
One of the newer features in khmer that we're pretty excited about is the read-to-graph aligner, which gives us a way to align sequences to a De Bruijn graph; our nickname for it is "graphalign."
Briefly, graphalign uses a pair-HMM to align a sequence to a k-mer graph (aka De …
About a month ago, I took some time to try out Docker, a container technology that lets you bundle together, distribute, and execute applications in a lightweight Linux container. It seemed neat but I didn't apply it to any real problems. (Heng Li also tried it out, and came to …
After a fair amount of time thinking about software's place in science (see blog posts 1, 2, 3, and 4), and thinking about khmer's short- and long-term future, we're making some changes to our development process.
Semantic versioning: The first change, and most visible one, is that we are going …
I finally got a chance to more thoroughly read Mark Stalzer and Chris Mentzel's arXiv preprint, "A Preliminary Review of Influential Works in Data-Driven Discovery". This is a short review paper that discusses concepts highlighted by the 1,000+ "influential works" lists submitted to the Moore Foundation's Data Driven Discovery …
Note - this was an internal funding request solicited by the Center for Open Science. It's been funded!
Brief: We propose to integrate OSF into Galaxy as a data store. For this purpose, we request 3 months of funding (6 months, half-time) for one developer, plus travel.
Introduction and summary: Galaxy …
So I wrote this thing that got an awful lot of comments, many telling me that I'm just plain wrong. I think it's impossible to respond comprehensively :). But here are some responses.
In that blog post, I argued that software shouldn't …
Update - I've written Yet Another blog post, More on scientific software on this topic. I think this blog post is a mess so you should read that one first ;).
This blog post was spurred by a simple question from Pauline Barmby on Twitter. My response didn't, ahem, quite fit in …
I'm reading Galileo's Middle Finger by Dr. Alice Dreger (@alicedreger), and it's fantastic. It's a paean to evidence-based popular discourse on scientific issues -- something I am passionate about -- and it's very well written.
I bought the book because I ran across Dr. Dreger's excellent and hilarious live-tweeting of her son's …
tl;dr? A while back I wrote that there are three uses of research software: replication, reproduction, and reuse. The world of computational science would be better off if people clearly delineated whether or not they wanted anyone else to reuse their software, and I think it's a massive mistake …
(The below issues are very much on my mind as I think about how to apply for another NIH grant to fund continued development on the khmer project.)
Imagine that we have a graph of novel functionality versus software engineering effort for a particular project, cast in the shape of …
I'm at the PyCon 2015 sprints (day 2), and I took the opportunity to play around with Docker a bit.
First, I created a local docker container that contained an installed version of khmer. I ran a blank docker container:
docker run -it ubuntu
and then installed the khmer prereqs …
I'm at the PyCon 2015 sprints (day 2), and I took the opportunity to play around with named pipes.
I was reminded of named pipes by Vince Buffalo in this great blog post, and since we at the khmer project are very interested in streaming, and named pipes fit well …
Here are talk notes and links for my PyCon 2015 talk.
The talk slides are up on SlideShare.
You should definitely check out Mike Lin's great blog posts on "Blogging my genome".
I found SNPedia through this wonderful blog post on how to use 23andMe irresponsibly, on Slate …
Note: Turns out Nick Loman is a C programmer. Well, that's what happens when I make assumptions, folks ;).
Jared Simpson just posted a great blog entry on nanopolish, an HMM-based consensus caller for Oxford Nanopore data. In it he describes how he moved from a Python prototype to a standalone …
This is a stub blog post for the talk notes for my OpenCon talk on how to get tenure as an open scientist.
A few links --
More …
A few weeks back, a journalist contacted me about my old blog post comparing physics and biology, and amidst other conversation, I pointed them at my latest blog post on data and said that I thought a lot of (molecular) biologists were "culturally confused about data". The next question was …
We just posted a new preprint (well, ok, a few weeks back)! The preprint title is "Crossing the streams: a framework for streaming analysis of short DNA sequencing reads", by Qingpeng Zhang, Sherine Awad, and myself. Note that like our other recent papers, this paper is 100% reproducible, with all …
The other day I was contacted by someone whose student wants to attend the MSU NGS course in 2015, because they are interested in learning how to do data integration with (among other things) metagenome data. My response was "we don't cover that in the course", which isn't very helpful ;).
So …
On March 19th and 20th, the Center for Open Science hosted a small meeting in Charlottesville, VA, convened by COS and co-organized by Kaitlin Thaney (Mozilla Science Lab) and Titus Brown (UC Davis). People working across the open science ecosystem attended, including publishers, infrastructure non-profits, public policy experts, community builders …
On a recent west coast speaking junket where I spoke at OSU, OHSU, and VanBUG (Brown PNW '15!), I put together a new talk that tried to connect our past work on scaling metagenome assembly with our future work on driving data sharing and data integration. As you can maybe …
I'm returning from a small, excellent meeting on "Open Source, Open Science", held at the Center for Open Science in Charlottesville, VA. We'll post a brief meeting report soon, but I wanted to share my particular highlights --
First, I got a chance to really dig into what the Center for …
Two weeks ago, I ran a workshop at UC Davis on mRNAseq analysis for semi-model organisms, which focused on building new gene models ab initio -- with a reference genome. This was a milestone for me - the first time I taught a workshop at UC Davis as a professor there! My …
I've been putting together a streaming API for khmer that would let us use generators to do sequence analysis, and I'd be interested in thoughts on how to do it in a good Pythonic way.
Some background: a while back, Alex Jironkin asked us for high level APIs, which turned …
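As a rough illustration of what a generator-based streaming style can look like, here is a minimal sketch in plain Python. Everything in it is hypothetical: `read_fasta`, `min_length`, and `uppercase` are names invented for this example, and none of it is khmer's actual API.

```python
def read_fasta(lines):
    """Yield (name, sequence) records from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes out the previous record, if any.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def min_length(records, n):
    """Pass through only records whose sequence has at least n bases."""
    for name, seq in records:
        if len(seq) >= n:
            yield name, seq


def uppercase(records):
    """Normalize each sequence to upper case as it streams by."""
    for name, seq in records:
        yield name, seq.upper()
```

Because each stage is a generator, stages compose by nesting, e.g. `uppercase(min_length(read_fasta(open_file), 50))`, and only one record is held in memory at a time.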
A colleague who is starting their own computational lab just asked me for some advice on how to run software projects, and I wrote up the following. Comments welcome!
A brief summary of what we've converged on for our own needs is this:
everything's on github (you can have private …
Michael R. Crusoe and I are throwing a sprint!
Somewhat in the vein of last year's mini-Hackathon, Michael and I and other members of the lab are going to focus in on reviewing contributions and closing issues on the khmer project for a 5 day period.
We are pleased to announce that the Laboratory for Data Intensive Biology at UC Davis has joined the Software Carpentry Foundation as an Affiliate Member for three years, starting in January 2015.
"We've been long-term supporters of Software Carpentry, and Affiliate status lets us support the Software Carpentry Foundation in …
It may not surprise people to learn that I was one of the reviewers on the MEGAHIT metagenome assembly paper... which is now published!
Below is my review, edited to remove all of the stuff they addressed in their revision.
Please also see our first blog post on MEGAHIT and …
Today at 3pm EST, the Moore Data Driven Discovery Investigators will be answering questions on reddit, in the science "ask me anything (AMA)" series. This is an opportunity to ask us anything you want about our research, data-driven discovery more generally, or... well, you tell us!
I participated in my second Balti and Bioinformatics on Wednesday - unlike the first one, which ended with only slightly sketchy Indian food in Birmingham, this one was entirely online. The technology worked really well and I think this is a great way to do talks!
For those that haven't seen …
A while back, someone else's graduate student asked me (slightly edited to protect the innocent :) --
I already have two independent sets of de novo transcriptome assemblies and annotations of the NGS data [...] 1) from the company who did the sequencing and analysis, and 2) from our pipeline here. It would …
On December 10th, 2014, I was formally awarded tenure at UC Davis, where I will start as an Associate Professor in the School of Veterinary Medicine on January 5th, 2015. In my research statement for my job application, I wrote:
Open science and scientific reproducibility: I am a strong advocate …
I was a reviewer on Determining the quality and complexity of next-generation sequencing data without a reference genome by Anvar et al., PDF here. Here is the top bit of my review.
One interesting side note - the authors originally named their tool kMer and I complained about it in my …
The apocalypse is nigh. Soon, binary executables and containers in object stores will join the many Web-based pipelines and the several virtual machine images on the dystopic wasteland of "reproducible science."
Anyway.
I had a conversation a few weeks back with a senior colleague about container-based approaches (like Docker) wherein …
Dear <chairs>,
I am resigning my Assistant Professor position at Michigan State University effective January 2nd, 2015.
Sincerely,
CTB.
Anticipated FAQ:
Brian O'Shea (a physics prof at Michigan State) asked me the following, and I thought I'd post it on my blog to get a broader set of responses. I know the answer is "Python 3", but I would appreciate specific thoughts from people with experience either with the specific packages …
A few months ago, I wrote a short description of how we make our papers replicable in the lab. One problem with this process is that for complex pipelines, it's not always obvious how to connect a number in the paper to the steps in the pipeline that produced it …
As we think about the next few years of khmer development, it is helpful to explore what khmer is, and what our goals for khmer development are. This can provide guiding principles for development, refactoring, extension, funding requests, and collaborations.
Comments solicited!
Links:
Here's an excerpt from an e-mail to a student whose committee I'm on; they were asking me about a comment their advisor had made that they shouldn't put a result in a paper because "It'll confuse the reviewer."
One thing to keep in mind is that communicating the results _is …
Sean Eddy wrote an interesting blog post on how scripting is something every biologist should learn to do. This spurred a few discussions on Twitter and elsewhere, most of which devolved into the usual arguments about what, precisely, biologists should be taught.
I always find these discussions not merely predictable …
Since being chosen as a Moore Foundation Data Driven Discovery Investigator, I've been putting together the paperwork at UC Davis to actually receive the money. Part of that is putting together a budget and a Statement of Work to help guide the conversation between me, Davis, and the Moore Foundation …
Yesterday I gave my third keynote address ever, at the Australasian Genomics Technology Association's annual meeting in Melbourne (talk slides here). On my personal scale of talks, it was a 7 or 8 out of 10: I gave it a lot of energy, and I think the main messages got …
A few weeks back, Nick Loman (via Manoj Samanta) brought MEGAHIT to our attention on Twitter. MEGAHIT promised "an ultra-fast single-node solution for large and complex metagenome assembly" and they provided a preprint and some open source software. This is a topic near and dear to my heart (see Pell …
I am very, very happy to announce that I have been selected to be one of the fourteen Moore Data Driven Discovery Investigators.
This is a signal investment by the Moore Foundation into the burgeoning area of data-intensive science, and it is quite a career booster. It will provide my …
read moreThere are comments.
Note: the source data for this is available on github at https://github.com/ctb/dddi
Today, the Moore Foundation announced that they have selected fourteen Moore Data Driven Discovery Investigators.
In reverse alphabetical order, they are:
Dr. Ethan White, University of Florida
Proposal: Data-intensive forecasting and prediction for ecological …
read moreThere are comments.
In Extracting shotgun reads based on coverage in the data set, we showed how to get a read coverage spectrum for a shotgun data set. This is a useful diagnostic tool that can be used to estimate total genome size, average coverage, and repetitive content.
Uses for this recipe include …
read moreThere are comments.
Update 3/29/15: the CAMI FAQ now includes information on reproducibility measures, and looks very promising. The data sets they are producing also seem fascinating.
If you're into metagenomics, you may have heard of CAMI, the Critical Assessment of Metagenome Interpretation. I've spoken to several people about it in …
read moreThere are comments.
This is a recipe that provides a time- and memory-efficient way to loosely estimate the likely size of your assembled genome or metagenome from the raw reads alone. It does so by using digital normalization to assess the size of the coverage-saturated de Bruijn assembly graph given the reads …
read moreThere are comments.
This recipe provides a time-efficient way to determine whether you've saturated your sequencing depth, i.e. how much new information is likely to arrive with your next set of sequencing reads. It does so by using digital normalization to generate a "collector's curve" of information collection.
Uses for this recipe …
read moreThere are comments.
Inspired by Sarah Bisbing's excellent post on her first year as a faculty member, here are the questions I remember asking myself during my first six years:
Year 0: What science do I want to do?
Year 1: What the hell am I doing all day and why am I …
read moreThere are comments.
The below is a recipe for subsetting a high-coverage data set to a given average coverage. This differs from digital normalization because the relative abundances of reads should be maintained -- what changes is the average coverage across all the reads.
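As a rough illustration of the idea (not the recipe itself, which is elided in this excerpt), uniform random subsampling can be sketched in Python. The function name `subsample_reads`, its parameters, and the assumption that you already know the current average coverage are all hypothetical; the point is only that every read is kept with the same probability, so relative abundances are preserved on average.

```python
import random


def subsample_reads(reads, current_coverage, target_coverage, seed=1):
    """Keep each read independently with probability target/current.

    Hypothetical helper: assumes the current average coverage is already
    known (e.g. from a read coverage spectrum).
    """
    if current_coverage <= 0:
        raise ValueError("current_coverage must be positive")
    # Never keep more than 100% of the reads.
    keep_fraction = min(1.0, target_coverage / current_coverage)
    rng = random.Random(seed)  # fixed seed so the subsample is repeatable
    return [read for read in reads if rng.random() < keep_fraction]
```

Because the same keep probability applies to every read, a region at 200x and a region at 20x both shrink by the same factor, unlike digital normalization, which preferentially discards reads from high-coverage regions.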
Uses for this recipe include subsampling reads from a super-high …
read moreThere are comments.
In recent days, we've gotten several requests, including two or three on the khmer mailing list, for ways to extract shotgun reads based on their coverage with respect to the reference. This is fairly easy if you have an assembled genome, but what if you want to avoid doing an …
read moreThere are comments.
I just finished reading Svante Paabo's autobiography, Neanderthal Man: In Search of Lost Genomes. The book is perfect -- if you're a biologist of any kind, you'll understand most of it without any trouble, and even physicists can probably get a lot out of the story (heh).
The book describes Svante …
read moreThere are comments.
Every month, Bjorn Ostman finds another sucker^W^W^W organizes a Carnival of Evolution blog post, that does a roundup of blogs on evolution from a previous month. This month, I'm hosting it -- it's a bit late, due to some teaching duties, so apologies!
Trigger warning: This blog post …
read moreThere are comments.
This past weekend, I accepted an offer to join UC Davis as an Associate Professor of Genetics in the Department of Population Health and Reproduction, in the School of Veterinary Medicine. The appointment is still pending tenure review, but I expect to join Davis whether or not they give me …
read moreThere are comments.
The fifth annual Analyzing Next Generation Sequencing Data workshop just finished - #ngs2014. As usual the schedule and all of the materials are openly available.
tl; dr? Good stuff.
We've been running this thing since 2010, and we now have almost 120 alumni (5 classes of roughly 24 students each). The …
read moreThere are comments.
Here are my talk notes for the Data Driven Discovery grant competition ("cage match" round). Talk slides are on slideshare. You can see my full proposal here as well.
Hello, my name is Titus Brown, and I'm at Michigan State University where I run a biology group whose motto is …
read moreThere are comments.
Our lab is part of the ongoing online conversation about how to properly credit software and algorithms; as is my inclination, we're Just Trying Stuff (TM) to see what works. Here's an update on our latest efforts!
A while back (with release 1.0 of khmer) we added a CITATION …
read moreThere are comments.
Note to all: this is satire... As Marcia McNutt says below, please see Science Magazine's Contributors FAQ for more detailed information.
Recently I had some conversations with Science Magazine about preprints, and when they're counted as double publication (see: Ingelfinger Rule). Now, Science has an enlightened preprint policy:
...we do …
There are comments.
In September, I will be visiting the NIH to "chart the next 5 years of data science at the NIH." This meeting will use an open space approach, and we were asked to provide some suggested topics. Here are five topics that I suggested, and one that Jeramia Ory suggested …
read moreThere are comments.
Create a github repository named something like '2014-paper-xxxx'. Ask me for name suggestions.
In that github repo, do the following:
Write a Makefile or some other automated way of generating all results from data - see
https://github.com/ged-lab/2013-khmer-counting/blob/master/pipeline/Makefile
or ask Camille (@camille_codon) what …
There are comments.
These are the talk notes for my opening talk at the 2014 Bioinformatics Open Source Conference.
Normally my talk notes aren't quite so extensive, but for some reason I thought it would be a good idea to give an "interesting" talk, so my talk title was "A History of Bioinformatics …
read moreThere are comments.
I'm at the 2014 Marine Microbes Gordon Conference right now, and at the end of my talk, I brought up the point that the function of most genes is unknown. It's not a controversial point in any community that does environmental sequencing, but I feel it should be mentioned at …
read moreThere are comments.
We just released khmer v1.1, a minor version update from khmer v1.0.1 (minor version update: 220 commits, 370 files changed).
Cancel that -- _I_ just released khmer, because I'm the release manager for v1.1!
As part of an effort to find holes in our documentation, "surface" any …
read moreThere are comments.
Eli Kintisch (@elikint) just wrote a very nice article on "Sharing in Science" for Science Careers; his article contained quotes from my MSU colleague Ian Dworkin as well as from me.
When Eli sent me an e-mail with some questions about open science, I responded at some length (hey, I …
read moreThere are comments.
As part of the 2-day Mozilla Science Labs hackathon in late July, the khmer project will be providing a "mentored open source contributathon" experience. This will provide an opportunity for people interested in trying out our instance of the "github flow" model, in which contributions are submitted for review using …
read moreThere are comments.
(or, What I Did For One Day Of My Summer Vacation.)
tl;dr? I played around with building a CountMin Sketch that is dynamic in size, based on a scalable Bloom Filter approach. I'm not sure it worked. Thoughts, suggestions, help?
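For readers who haven't met the data structure, here is a minimal sketch of a plain, fixed-size Count-Min Sketch in Python. This is not the dynamically sized variant the post experiments with, and it is not khmer's implementation; the class name, the default `width` and `depth`, and the use of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size Count-Min Sketch (illustrative, not khmer's code)."""

    def __init__(self, width=1000, depth=4):
        self.width = width
        self.depth = depth
        # One row of counters per hash function.
        self.tables = [[0] * width for _ in range(depth)]

    def _positions(self, item):
        # Derive one column per row by salting the hash with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._positions(item):
            self.tables[row][col] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum across rows
        # is the tightest estimate; it never undercounts.
        return min(self.tables[row][col] for row, col in self._positions(item))
```

The counts are one-sided: `estimate` can overcount when items collide but never undercounts, and the error grows as the tables fill up. That is what motivates wanting a sketch whose size can grow with the data.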
In our research, we've made some hay …
There are comments.
About 10 days ago, I gave a talk in Manchester to Carole Goble's group, hosted by Aleksandra Pawlik. The talk title was "Six ways to Sunday: Approaches to computational reproducibility in non-model sequence analysis." I've posted the slides (here).
For the talk, I put together a list of five things …
read moreThere are comments.
I'm on a European trip that involves several plane flights accompanied by long airport stays, and I just used some of that time to do a bit of tedious coding on khmer.
The coding I did was to add proper exception handling to khmer's internal file loading routines (see the …
read moreThere are comments.
Earlier today, I posted our response to the reviewers' comments on our k-mer counting paper, "These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure."
A side note -- I was wondering how many public examples there are of the whole paper submission …
read moreThere are comments.
A few months back, we received some reviews for our paper on k-mer counting with khmer. After many months, we (mostly Qingpeng Zhang, the first author) have finished revising the paper. Here is our response to reviewers.
The latest (resubmitted) version of the paper is here, while the version the …
read moreThere are comments.
I've just made my full application to the Moore Foundation's Data Driven Discovery Investigator program available, but wanted to post an HTML version, too. You can also see a short sci-fi story about what I want to enable.
You might wonder why I'm posting this. Well, there's a snowflake's chance …
read moreThere are comments.
My second-round Data Driven Discovery application is due on Monday, and my first draft contained the following story. I don't think I'll include it in the actual application, but it was entertaining enough to write that I thought I'd post it here.
A vision of the future I would like …
read moreThere are comments.
So my daughter just participated in her first science fair, at the age of 6. ("Conclusion: science can be fun! and sticky!")
Over dinner, my wife and I came up with some ideas for her next fair. She was having trouble dissolving sugar in ice water, so we suggested maybe …
read moreThere are comments.
Links, software, thoughts -- all solicited! Add 'em below or send 'em to me, t@idyll.org.
---
Imagine... a rolling 48 hour hackathon, internationally teleconferenced, on reproducing analyses in preprints and papers. Each room of contributors could hack on things collaboratively while awake, then pass it on to others in overlapping …
read moreThere are comments.
I'm pleased to announce the publication of "Tackling soil diversity with the assembly of large, complex metagenomes", by Adina Howe, Janet Jansson, Stephanie Malfatti, Susannah Tringe, James Tiedje, and myself. The paper is openly available on the PNAS Web site here (open access).
External links:
read moreThere are comments.
Note: updated 2/18 with Benton Gravely's name -- he did the squid genome sequencing!
A few months back, I announced the khmer protocols project, an effort to write down an explicit, open protocol for transcriptome and metagenome assembly. This project was started during the summer of 2013 at the Woods …
read moreThere are comments.
A few months back, we submitted a paper, These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure, to PLoS One. We got the (signed) reviews back in December, and I asked the reviewers if I could post their reviews publicly. They …
read moreThere are comments.
On January 27th, 2014, the MSU BEACON Center graduate students held a panel on how to review. The panel was organized by Emily Weigel (@choosy_female) and Jory Schossau, Bjorn Ostman (@CarnyEvolution), Kristin Parent, Rich Lenski (@RELenski), and Arend Hintze were panel members, together with me.
I put together the …
read moreThere are comments.
I've just posted the narrative for a recently funded USDA grant on improving the quality of the chick genome assembly on the lab's research page. The issues are laid out in detail in the grant, but, basically, the question is: how can we improve the quality of the assembly? The …
read moreThere are comments.
In 1947 a Bedouin shepherd found a bunch of ancient scrolls in a cave near the Dead Sea. These scrolls, now known as the Dead Sea scrolls, included some of the oldest known Biblical texts as well as other Jewish religious writing. Over the next few decades, these scrolls - of …
read moreThere are comments.
Note: this post is a guest post by Jerome Kelleher. Please also see his letter to Bioinformatics on this topic.
There are comments.
As part of a February visit to the Whitney Marine Lab in Florida, I'm giving a talk for the public. I chose "The Genomic Revolution: How Sequencing Anything and Everything Is Changing the Way We Do Science" as the title. Basically, I want to talk about what the DNA sequencing …
read moreThere are comments.
Over the last year, digital normalization has occupied an increasingly privileged position in sequence analysis: it's a lightweight way to achieve an assembly, one that is computationally cheaper than almost anything else you can do; our software works reasonably well in practice; sequencing data generation capacity is only increasing; and …
read moreThere are comments.
(with Camille Scott, Michael Crusoe, and Leigh Sheneman; Josh Rosenthal contributed to eel-pond; and Adina Howe contributed to kalamazoo)
This summer, I spent a lot of time writing up computational protocols for both mRNAseq and metagenome assembly in the Amazon cloud.
I'm happy to announce that they are now available …
read moreThere are comments.
I've been using EBSeq for a few things lately, and have had trouble getting some of the dependencies installed -- in particular, gplots doesn't seem to be readily available for R 2.14, 2.15, etc. Judging by my Google searches, others have been having the same problems; see e.g …
read moreThere are comments.
Dear <student>,
I'd be happy to, but I do have a few conditions/requests based on prior experience with students!
First, please schedule all of your meetings at least 2 months in advance :)
Second, a condition for my signing off on your thesis will be that, for any paper for …
read moreThere are comments.
It's not often that someone perfectly and thoroughly summarizes the challenges inherent in data science being confronted by academic institutions, but that's just what Fernando Perez did in this blog post. Just... just go read it, trust me :)
The new data driven discovery centers being funded by Sloan & Moore are …
read moreThere are comments.
I'm on my way back from a great week in England. I spent most of the week in Norwich at The Genome Analysis Center (t-gaaaaaaaack), hosted by Vicky Schneider-Gricar. I gave a talk, taught two workshops together with Aleksandra Pawlik -- one for biologists and one for bioinformaticians -- and met quite …
read moreThere are comments.
I just read Scientific Data - ultimate salami slicing publishing, in which Pedro Beltrao argues that Nature's new journal is simply another venue for them to suck money out of scientists. Maybe. But I'm strongly considering sending a lot of stuff there, and I really think Pedro is missing something very …
read moreThere are comments.
This was some great advice from Thomas Wolff, one of our Associate Deans in the College of Engineering. He sent it out to the undergraduate students at MSU. I am reposting it here with his permission. --titus
Dear Spartan Engineering Students,
I am writing you today on a matter important …
read moreThere are comments.
A recent visiting speaker, Dr. Sinead Collins from Edinburgh, mentioned in passing during her talk that she was particularly interested in mentoring and empowering women in science. I am also interested in this, but as a male in a position of power I'm wary of preaching to women on the …
read moreThere are comments.
I recently had the pleasure of meeting with Randy LeVeque, Bill Howe, and Steven Roberts at UW, along with Jory Schossau, after the UW bootcamp that Jory and I ran. I already knew Bill from before (see our conversation on VMs and reproducibility) and Steven had taken the workshop, but …
read moreThere are comments.
I just finished my third workshop in two weeks. I taught 3.5 days of microbial bioinformatics at Caltech, 2 days of intro computing for biologists at MSU, and another 2-day intro computing for biologists workshop at UW. The Caltech workshop was sponsored by CEMI, the Caltech Environmental Microbial Initiative …
read moreThere are comments.
Erica Check Hayden at Nature News wrote this article about a Mozilla Science Lab effort to bring code review to scientific code. Code review is an important part of many open source, startup, and corporate software development cultures, and the goal of the Mozilla effort is to See What Happens …
read moreThere are comments.
Question: What do Nick Loman, Jared Simpson, Lex Nederbragt, and I all have in common?
Answer: We all spend way too much time thinking about assembly.
Question: What does Jonathan Eisen's lab do?
Answer: Sequence lots of really weird things that they'd like to assemble.
Motivated by …
There are comments.
We've just posted a new paper to arXiv: "These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure." We'll be submitting it to PLoS One after we wait a few days for comments from the Twittersphere and/or on Haldane's Sieve.
The …
read moreThere are comments.
The paper Howison et al., 2013, just appeared in early form in Bioinformatics. Here is my first round review, which they handily addressed in their revisions; since I was quite positive I felt I might as well post the whole thing, though.
Note that a relevant paper from Mihai Pop …
read moreThere are comments.
So, we've been running this course on NGS data analysis. And it's been fun and all. But a lot of work.
About a year ago, I thought hard about whether or not I wanted to apply for renewal, and ended up applying again. You can see the final grant if …
read moreThere are comments.
Two papers on the Haemonchus contortus genome just came out (Schwarz et al. and Laing et al.), and I'm an author on one of them (Schwarz et al.). H. contortus, or Haemonch (as I affectionately called it) is a nasty parasitic nematode that feasts on the mucosal blood of ruminants …
read moreThere are comments.
In my post on proselytizing version control, an underlying and implicit assumption was that version control was not fulfilling the function of a lab notebook. But I didn't make that explicit. And then someone asked in the comments. So now I'm making it explicit.
tl; dr? Version control's a really …
read moreThere are comments.
I gave a talk yesterday at the 2013 BEACON Congress titled "How to build an enduring online research presence using social networking and open science." The talk slides are here.
This talk was a combined survey of sites and personal perspective on how social media has helped shape the last …
read moreThere are comments.
Since I attended the entire STAMPS course at MBL this year, which was an entirely computational course, I had the opportunity to proselytize computational reproducibility and good practice to a number of people.
Now, with students I'm usually fairly gentle about this kind of thing, and try to get my …
read moreThere are comments.
Late last year, inspired by a review I did of a Science submission, I wrote a blog post asking what people thought of the Insight Journal. This was in response to the submission's mention of Image Processing On Line.
The Science paper is finally out -- actually, I missed it, it …
read moreThere are comments.
As the title says, I've got a new job.
But it's not really that exciting a switch, sorry :)
As of mid-August sometime, I will officially switch my appointment from 2/3 Computer Science and Engineering / 1/3 Microbiology and Molecular Genetics, to 2/3 Microbiology and Molecular Genetics, 1/3 …
read moreThere are comments.
(With very little apology whatsoever to Geoffrey North.)
The airplane age, in particular the advent of large, well-attended conferences, has created a brave new world of broadcasting instant criticism of scientific papers, for good or ill.
I think there is a clear "good" side, illustrated by cases where papers making …
read moreThere are comments.
So, I got this grant. And, um, it looks like khmer has a future, which means... so does my lab.
What is khmer?
khmer is my lab's software for doing various things to sequencing data, and is largely focused on providing good demo implementations of low-memory data structures …
There are comments.
I've been asked -- in several different contexts now -- whether or not the openness of my lab has had any specific impact. I blog about active research; we develop our code in the open; we post papers to arXiv; we emphasize remixability; I'm pushing open data in consortia; and we are trying …
read moreThere are comments.
At the "What to Teach Biologists about Computing" meeting (discussed here, a bit) we received a strong message from Our Dear MozSciLabLeader, Kaitlin Thaney. The message was this: if we want to maximize reuse and remixing of educational materials, we should explicitly license them under CC0. (See her talk and …
read moreThere are comments.
We all know that biology (along with other sciences) is becoming ever more data intensive. Biologists (among other scientists) are not terribly well prepared for this, because of a lack of computational culture, lack of computational training, and a lack of tools. What do they need to know?
This question …
read moreThere are comments.
(This blog post was mightily helped by Qingpeng Zhang, the first author of the paper; he wrote the pipeline. I just ran it a bunch :)
We have been benchmarking k-mer counters in a variety of ways, in preparation for an upcoming paper. As with the diginorm paper we are automating …
read moreThere are comments.
Leslie Babonis, an attendee at the 2013 NGS course, posted the following on facebook. I'm reposting with permission. --titus
an ode to my lab bench...:
i've returned, my dear friend, after a fortnight away delighted to find you, in just the same way your tube racks still brilliant in hues …
There are comments.
I just finished reading The Immortal Life of Henrietta Lacks, an excellent book about the HeLa cell line cultured from cancerous cells taken from Henrietta Lacks. In addition to raising some really interesting and astonishing questions about the appropriate (mis)use of patients' tissue samples, a section about George Gey …
read moreThere are comments.
My advice to graduate students: blog! post garnered some interesting comments here and there. Most of the responses were positive, but then again, most anyone who reads blogs probably doesn't need to be convinced that blogging is useful. In particular, note that some of the comments at the bottom of …
read moreThere are comments.
(with Adina Howe, James Tiedje, Titus Brown)
I have been working on the assembly of big shotgun metagenomic data from the ARMO (Amazon Rain Forest Microbial Observatory) project. The biggest challenge is the huge data size: 2TB in fastq and more than 6 billion reads after read trimming. One lucky thing …
read moreThere are comments.
Sometimes you've really got to wonder.
The Chronicle of Higher Ed just posted this article on a collaboration between A. Sean Pue, Tracy K. Teal, and myself. It's about bringing bioinformatics (or, really, CS and computational linguistics) to the study of Urdu poetic meter.
The article has two interesting flaws …
read moreThere are comments.
So, there's this guy, Matt Welsh. And he left Harvard to go to Google. OK.
Now he's baaaack, to point out that academia isn't that rosy.
Yep. He's not wrong.
Were I in a sarcastic mood, I would say something like "ohmigod, Matt Welsh is pointing out that …
There are comments.
I was a reviewer of the PLoS One paper, An Integrated Pipeline for de Novo Assembly of Microbial Genomes, and just recently came across the review again. I didn't post it at the time, but heck, why not now? ;)
Note that for our recent microbial genomes assembly workshop we wrote …
read moreThere are comments.
Software installation is a real problem.
I'm writing this as I return from my fourth Software Carpentry workshop, or -- if you count the one I ran at LLNL almost a decade ago -- my fifth one. This workshop was taught with Karen Cranston and Rich Enbody, both of them very experienced …
read moreThere are comments.
Is khmer evolving?
The khmer project is our software package to work with short reads, and it enables a lot of things like k-mer counting and de Bruijn graph exploration and modification. As data volume grows, interest in partitioning and digital normalization is also growing. But we haven't really talked …
read moreThere are comments.
On the angenmap mailing list, I wrote:
Overall, I see little justification for believing that our current system [ of peer review ] is particularly good. It's just comfortable, especially for the people who have been molded by it.
Chris Moran responded:
I frequently hear these assertions about weak correlations, false negative …
There are comments.
Continuing in the saga of "what do sequencing errors do to our de Bruijn graph density measure" (read the first post here), I have some new results.
The conclusion of the first post was that on random (non-real) genomes, both with and without repeats, we see that de Bruijn graph …
read moreThere are comments.
Are our reviewers correct or incorrect?
About two months ago we got back reviews for our assembly artifacts paper, in which we showed that there was a strong 3' bias in the reads towards higher graph connectivity. Since shotgun sequencing is supposed to be random, we asserted that this 3' …
read moreThere are comments.
A week or two ago, I posted a crazy idea about crowdsourcing a bioinformatics analysis pipeline. I may still try to do that. But in the meantime, here's another crazy idea.
First, some background.
I'm writing this as I fly back from …
There are comments.
I'm worried about our current mRNAseq analysis strategies.
I recently posted a draft paper of ours to arXiv entitled RNA-Seq Mapping Errors When Using Incomplete Reference Transcriptomes of Vertebrates; the link is to the Haldane's Sieve discussion of the paper. Graham Coop said I should write something quick, and so …
read moreThere are comments.
Over the last few weeks, I've been on a bit of a Cory Doctorow kick. I started by reading Homeland, a sequel to the excellent Little Brother; these two very important books about anti-terror-enabled government suppression of liberty and free speech are very well written and extremely timely. I then …
read moreThere are comments.
Or, "can we crowdsource BGI?" ;)
With all of the crazy need surrounding genomic analysis -- most of it on a shoestring budget -- I am thinking about a mildly crazy idea.
What if I offered to computationally analyze people's non-model transcriptomic and metagenomic data for them, in exchange for (a) non-exclusive access …
read moreThere are comments.
I just finished attending a 1-day workshop on Cyberinfrastructure for Marine 'Omics down in DC. It was a meeting organized by the Gordon and Betty Moore Foundation but attended by program managers from about a dozen different agencies and divisions (NSF BIO, NSF GEO, etc.); a bunch of pretty serious …
read moreThere are comments.
I'm on my way down to D.C. to attend another meeting about cyberinfrastructure, this time with a bent towards metagenomics pipelines. (At least, I'm pretty sure that's why I'm invited. It's getting hard to tell these days.)
Inspired by James Watters' blog post on his "fork you" shirt, I …
read moreThere are comments.
A short note -- the lamprey genome (P. marinus) paper is finally out! You can see the paper and the Michigan State University press release. (The press release isn't too bad, but I would like to point out that I had no part in the sentence talking about how this could …
read moreThere are comments.
Note: this is the general part of the submitted review; I left out the things that I expect might change if revisions are made.
Also see Thoughts on the Assemblathon 2 paper.
Re the Assemblathon 2 paper <http://arxiv.org/abs/1301.5406>,
Bradnam et …
There are comments.
(Also see Assemblathon 2 review, round 1, parts thereof)
I just finished reviewing the Assemblathon 2 paper, in which many of the extant de novo genome assembly pipelines were evaluated against three different organismal data sets. (I'll post the review when I can.) Good paper.
To me, the biggest outcome …
read moreThere are comments.
I received this letter in the mail the other day. Can anyone help?
---
Dear Dr. Abby,
I am at a top-50 R1 research institution, and we are currently conducting faculty hiring searches for a number of professors in biology. The applicant pool has been stunningly good this year, and we …
read moreThere are comments.
I was a reviewer on Boisvert et al., Ray Meta: scalable de novo metagenome assembly and profiling, and (as with DSK: k-mer counting with very low memory usage) I thought I'd share my review.
(Sorry, it's really short. My first round review had some comments that they handily addressed in …
read moreThere are comments.
Why do I blog?
I've been blogging now for almost 8 years, since around when Grig Gheorghiu started the Southern California Python Interest Group. Since then I've gotten a PhD, taken a postdoc, had one child, started a faculty position, had another child, and basically gotten way, way busier. Why …
read moreThere are comments.
One of my graduate students and I were reviewers on Rizk et al., DSK: k-mer counting with very low memory usage, and I thought I'd share our review. At the moment I cannot easily see the entire paper so I have not modified the review to account for post-review changes …
read moreThere are comments.
I just spent a really fun and exciting two hours installing a piece of software that I needed to run to do a paper review. The software itself downloaded, but failed routinely on their own test data; after delving through four layers of Perl and Python, I discovered that the …
read moreThere are comments.
The other day, I purchased a new car from the car company down the street. This was a small boutique shop, and their marketing brochure was slick -- 0-60 in 6 seconds, heated seats, a good safety rating -- and the technical reviews were amazing -- "Never seen anything like it! Really novel …
read moreThere are comments.
I just left the NAS meeting on Integrating Environmental Health Data to Advance Discovery, where I was an invited speaker. It was a pretty interesting meeting, with presentations from speakers who worked on chemotoxicity data, pollution data, exposure data, and electronic health records, as well as a few "outsiders" from …
read moreThere are comments.
For each of the last two summers, I've returned from co-teaching our Analyzing Next-Generation Sequencing Data course, slept for 48 hours straight, and then hunkered down and bunkered up to write grants. (To be clear, sometimes this bunkering up involves travelling out to California and sitting on my in-laws' beach …
read moreThere are comments.
We just posted yet another pre-submission paper to arXiv.org:
Assembling large, complex environmental metagenomes
Authors: Adina Chuang Howe, Janet Jansson, Stephanie A. Malfatti, Susannah Tringe, James M. Tiedje, and C. Titus Brown
Abstract:
The large volumes of sequencing data required to deeply sample …
There are comments.
We just posted another pre-submission paper to arXiv.org:
Illumina Sequencing Artifacts Revealed by Connectivity Analysis of Metagenomic Datasets
Authors: Adina Chuang Howe, Jason Pell, Rosangela Canino-Koning, Rachel Mackelprang, Susannah Tringe, Janet Jansson, James M. Tiedje, and C. Titus Brown
Abstract:
Sequencing errors and …read more
There are comments.
This post can be referenced and cited at the following DOI: http://dx.doi.org/10.6084/m9.figshare.98198.
For a few months, the Trinity list was awash with discussions about how to use digital normalization to lower the memory and compute requirements for mRNASeq assembly. At some point …
read moreThere are comments.
I recently had the pleasure of reviewing an excellent paper that used actual data (DATA!) to argue that source code needs to be part of the review process. (When it is published I will post again about it; for now, the process of secret handshakes in smoke-filled back rooms must …
read moreThere are comments.
I am just returning from a trip to Southern California that included, among other things, the teaching of a two day Software Carpentry workshop at The Scripps Research Institute. There were two instructors, myself and Tracy Teal, a research scientist at MSU; and two external TAs, Qingpeng Zhang (one of …
read moreThere are comments.
Just over a week ago, I posted a list of wanted tech that I thought would help further open science. One item that struck a chord with a number of people on Twitter and in the comments was the idea of giving blog entries a DOI:
- An easy way to …
read moreThere are comments.
I gave a talk last Wednesday at U. Michigan in the DCMB program where I included a slide estimating how much DNA sequencing (in base pairs) was needed for good de novo assembly of sequences from various biological environments or problems. The slide was there to motivate the challenges of …
read moreThere are comments.
This is one of a bunch of posts on science and the Web. Start here for an overview.
It's been fun to watch (and occasionally help drive) science moving online and taking advantage of the Web. Here are some of my favorite examples.
Simple, easy ways of sharing process abound …
There are comments.
This is one of a bunch of posts on science and the Web. Start here for an overview.
The web represents an opportunity for a phase transition in terms of connectedness and openness in scientific practice, as in software development, and we're not taking much advantage of it. Why?
There are comments.
This is one of a bunch of posts on science and the Web. Start here for an overview.
I've been reading Michael Nielsen's Reinventing Discovery, an awesome and inspirational book about (among other things) accelerating scientific discovery using the Internet. Highly recommended.
From my position within academia …
read moreThere are comments.
This is one of a bunch of posts on what I'm calling 'w4s' -- using the Web, and principles of the Web, to improve science. The others are:
The awesomeness we're experiencing, which provides some examples of current awesomeness in this area.
The challenges ahead, which covers some of the reasons …
read moreThere are comments.
This is one of a bunch of posts on science and the Web. Start here for an overview.
I don't think I can devote myself to any big projects, but I do have a bunch of ideas for relatively small projects that I think could lead to worthwhile change.
Here …
read moreThere are comments.
In his paper, Reproducible Research and Cloud Computing, Bill Howe asks:
What happens if you do all your work on a virtual machine hosted in the cloud? When it came time to publish, you might make a snapshot of the VM, make it public, and cite it in your paper …read more
There are comments.
An increasing number of people are asking about using our assembly approaches for things that we haven't yet written (or posted) papers about. Moreover, our assembly strategies themselves are also under constant evolution as we do more research and find ever-wider applicability of our approaches.
This has been moved to …
read moreThere are comments.
Note: this post is a guest post by Rohan Maddamsetti, posted by the regular blog author, Titus Brown. Typos are Titus's fault. Flaws in logic are Rohan's ;). See the paper on arXiv [1] and the discussion on Haldane's Sieve, also.
I recently wrote a short paper explaining some interesting results …
read moreThere are comments.
At BOSC 2012, we heard a report from Richard Holland on the Pistoia Alliance Sequence Squeeze competition. I'd run across this a couple of times before -- most notably in the Quip paper -- and was interested in hearing the results.
What was the problem being tackled? To quote,
The volume of …
read moreThere are comments.
After yet another round of futile Twittering on the subject of research software, I thought I'd share a deeply personal story -- a story that explains some of my rather adamant stance that most research scientists need to think more critically about their code, and should adopt at least some of …
read moreThere are comments.
One of my favorite in-class exercises is The Assembly Exercise, in which I provide "shotgun sequence" from some English text and ask the students to assemble it. Normally I provide a printout of about 10-20 pages of reads with a range of read lengths, error rates, and single/paired end sequences …
read moreThere are comments.
The IPython Notebook (or 'ipynb' for short) is one of the most exciting technologies for teaching and research that I've seen in recent years. It is a completely open source, well architected, and fairly stable system for scientific computing and data exploration.
I've now been using it for teaching for …
read moreThere are comments.
These talk notes are for my talk at the 2012 Argonne Soil Metagenomics Workshop.
The slides are available for viewing and download here, on Slideshare.
I'm going to be talking about our assembly pipeline for soil metagenomes.
Much of this work was done by …
There are comments.
I just finished reading The Idea Factory: Bell Labs and the Great Age of American Innovation, by Jon Gertner, an absolutely fabulous book on Bell Labs, and their invention of the transistor, the laser, and almost everything to do with modern telecommunications and computers ;).
The final chapter is about the …
read moreThere are comments.
We held the 2012 workshop on Analyzing Next Generation Sequencing Data from June 4 to June 15, at the Kellogg Biological Station in western Michigan, about 30 minutes north of Kalamazoo.
(This is a long delayed blog post. :)
The goal of the workshop is to take biologists with little in …
read moreThere are comments.
Try out this thought experiment.
Suppose you are a bio professor, and a grad student came to you and said, "I'm trying to figure out what classes to take, and there're all these math, modeling, and computational courses that I could take. But I just don't think that math or …
read moreThere are comments.
I'm giving a talk at XLDB 2012 tomorrow, and I thought I'd post a bunch of accompanying links and discussion, since this audience is pretty far away from my normal audience ;).
Here's the talk itself, on slideshare: Streaming and Compression Approaches for Terascale Biological Sequence Data Analysis
Acknowledgements (slide 3 …
read moreThere are comments.
I'm starting to notice that a lot of bioinformatics is anecdotal.
People publish software that "works for them." But it's not clear what "works" means -- all too often either the exact parameters or the specific evaluation procedure is not provided (and yes, there's a double standard here where experimental methods …
read moreThere are comments.
I've been invited to present at the Extremely Large Databases (XLDB) 2012 conference as a practicing biologist who occasionally speaks with physicists, and I'm trying to come up with something to say that will explain why physicists and biologists don't often collaborate all that well.
Here are some guesses.
---
There are comments.
[ Note: I wrote the following e-mail to the Microbiology (MMG) department faculty mailing list here at MSU. I'll post any interesting responses that I get. --titus ]
---
Hi all,
I'm an unabashed proponent of Open Access publishing, as well as the idea of decoupling correctness from estimated impact of a paper …
read moreThere are comments.
One of the biggest problems with basic sequence analysis -- some would say the biggest problem -- is the error rate. If our sequencing reads were error-free, both assembly and mapping would be much, much easier. Alas, Illumina reads have a 0.1-1% error rate per base, and PacBio has an error …
read moreThere are comments.
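To put the per-base error rates quoted above in perspective, here is a minimal back-of-the-envelope sketch. It assumes errors are independent and uniform along the read, and the 100-base read length is my own illustrative choice, not a number from the post.

```python
def error_free_fraction(per_base_error: float, read_length: int) -> float:
    """Probability that a read contains no errors at all.

    Assumes every base fails independently at the same rate, which is a
    simplification; real error profiles vary along the read.
    """
    return (1.0 - per_base_error) ** read_length


# The 100-base read length is an assumption for illustration only.
for rate in (0.001, 0.01):
    fraction = error_free_fraction(rate, 100)
    print(f"per-base error {rate:.1%}: {fraction:.1%} of 100-base reads are error-free")
```

Under these assumptions, roughly 90% of reads are clean at a 0.1% error rate, but only about 37% at 1%, which is why the error rate dominates so much of assembly and mapping.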
I'm at the MBL STAMPS course, "Strategies and Techniques for Analyzing Microbial Population Structure," and one of the things I needed to address in my morning talk was the role that the k parameter plays in de Bruijn graph assemblers.
In most de Bruijn assemblers that I have used -- Velvet …
read moreThere are comments.
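A toy illustration of why k matters: the sketch below builds a simple de Bruijn graph in which nodes are (k-1)-mers and each k-mer contributes one edge. This is one common textbook formulation, not necessarily how any particular assembler represents its graph, and it ignores reverse complements. The function names and the example sequence are made up for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every overlapping substring of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branch_points(graph):
    """Nodes with more than one outgoing edge, i.e. where the path is ambiguous."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


read = "ACGTACGA"  # toy sequence containing the repeat "ACG"
print(branch_points(de_bruijn_graph([read], 3)))
print(branch_points(de_bruijn_graph([read], 5)))
```

With k=3 the repeat "ACG" collapses onto itself and the node "CG" branches two ways; with k=5 each k-mer spans the repeat and the graph is a single unbranched path. The tradeoff is that a larger k needs longer error-free overlaps between reads.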
Suppose you have a community that has two organisms in it, at widely varying abundances. What can you do?
Partitioning takes this mixed distribution and, based on graph connectivity, splits the reads into two bins. Thus, you go from this:
to this:
where the reads are …
There are comments.
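The idea of splitting reads into bins by connectivity can be sketched with a union-find structure. This is a deliberately simplified stand-in: two reads are treated as connected if they share any exact k-mer, which only approximates connectivity in a real assembly graph. The function name and the toy reads are hypothetical.

```python
def partition_reads(reads, k):
    """Group reads into bins, joining any two reads that share an exact k-mer."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in first_seen:
                union(first_seen[kmer], idx)
            else:
                first_seen[kmer] = idx

    bins = {}
    for idx, read in enumerate(reads):
        bins.setdefault(find(idx), []).append(read)
    return list(bins.values())


# Two made-up "organisms" whose reads share no 4-mers with each other.
reads = ["ACGTACGTTG", "CGTTGAAC", "GGGCCCATAT", "CCATATTT"]
for partition in partition_reads(reads, 4):
    print(partition)
```

Each resulting bin can then be handled on its own, regardless of how abundant its organism was in the original mixture.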
This is the story behind our PNAS paper, "Scaling Metagenome Assembly with Probabilistic de Bruijn Graphs" (released from embargo this past Monday).
Why did we write it? How did it get started? Well, rewind the tape 2 years and more...
There we were in May 2010, sitting on 500 million …
read moreThere are comments.
I've just posted my 2nd try at the NSF CAREER award to the lab Web site, where it joins my recent NSF BIGDATA proposal, my Moore Foundation proposal, last year's (rejected) NSF CAREER proposal, my NGS course grant, and my one big funded grant, my USDA proposal from 2009. The …
read moreThere are comments.
Data and materials availability: All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. All computer codes involved in the creation or analysis of data must also be available to any reader of Science. After publication, all reasonable …read more
There are comments.
I recently attended an NSF BIO directorate meeting about cyberinfrastructure needs. Here's a list of training & education challenges identified at that meeting:
- development and adaptation of tools to archive data and metadata from diverse sources to enable data mining
- integration of structured and unstructured data from heterogeneous data sources
- discussion …
read moreThere are comments.
Here's a data analysis question for all you Big Data folk.
A beachcomber is interested in obtaining up to 10 examples of every type of shell present on a beach. The shells are individually easy to find, but some types are really rare and some are really abundant. The beachcomber …
read moreThere are comments.
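One naive baseline for the beachcomber, as a sketch only: walk the beach once and keep a shell whenever fewer than the cap of its type has been collected so far. This assumes each shell's type can be identified exactly on sight, which the truncated question above does not promise, and the function and variable names are mine.

```python
from collections import Counter


def collect_shells(beach, max_per_type=10):
    """Single pass over the beach, keeping at most max_per_type shells of each type."""
    counts = Counter()
    kept = []
    for shell in beach:
        if counts[shell] < max_per_type:
            counts[shell] += 1
            kept.append(shell)
    return kept


# A made-up beach: one very abundant type and one rare type.
beach = ["clam"] * 1000 + ["conch"] * 3
kept = collect_shells(beach)
print(Counter(kept))
```

The collection never grows beyond the cap times the number of types, no matter how many abundant shells are walked past, while every rare shell encountered is still kept.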
(Or, "A better way to publish bioinformatics.")
We just got word that our paper, "Scaling metagenome assembly with probabilistic de Bruijn graphs" [ arXiv ] [ github ] has been accepted for publication in PNAS. (Yay!) I just posted the final version to github, and the arXiv PDF should be updated to the third …
read moreThere are comments.
At our 2012 course on Analyzing Next-Generation Sequencing Data, we talked quite a bit about future sequencing technologies, as well as about what analyses are reasonably cookbook (and which ones aren't).
Here are my thoughts -- yours welcome!
The basic conclusions about sequencing tech were these:
There are comments.
As part of the 2012 Analyzing Next-Generation Sequencing Data course, I've been trying out ipython notebook for the tutorials.
In previous years, our tutorials all looked like this: Short read assembly with Velvet -- basically, reStructuredText files integrated with Sphinx. This had a lot of advantages, including Googleability and simplicity; but …
read moreThere are comments.
I just returned from a NESCent Catalysis meeting on Cephalopod Genomics. I was invited as a bioinformatics and genomics guy, and so I spent four days in North Carolina talking about the opportunities and challenges of sequencing cephalopods.
Cephalopods are a class of the molluscs, and include squid and octopus …
read moreThere are comments.
This is a draft proposal of a policy to encourage pre-publication data release and data sharing within a community. This policy is based on discussions at the Cephalopod Genomics Workshop (a Catalysis workshop sponsored by NESCent).
Note, this is made available under a CC-BY-SA license permitting use and re-use with …
read moreThere are comments.
Greg Wilson, Ethan White and I have been talking a bit about what Responsible Conduct of Research (RCR) standards would look like for computational science. I'm having trouble coming up with more than the below standards, which are largely related to publication.
Note, if you regard these as obvious, that's …
read moreThere are comments.
Brad Chapman (@chapmanb on twitter) wrote and signed a nice review of my submission to the Bioinformatics Open Source Conference. In his review, he said
My only small suggestion is to include some discussion about your reproducibility work during the talk: the Amazon AMI, documentation and reproducible ipython workflows. This …read more
There are comments.
I'm a pretty big advocate of anything open -- open source, open access, and open science, in particular. I always have been. And now that I'm a professor, I've been trying to figure out how to actually practice open science effectively.
What is open science? Well, I think of it as …
read moreThere are comments.
I'm going to pick on Mick Watson today. (It's OK. He's just a foil for this discussion, and I hope he doesn't take it too personally.)
Mick made the following comment on my earlier Big Data Biology blog post:
I do wonder whether there is just a bit too much …
read moreThere are comments.
I'm out at a Cloud Computing for the Human Microbiome Workshop and I've been trying to convince people of the importance of digital normalization. When I posted the paper the reaction was reasonably positive, but I haven't had much luck explaining why it's so awesome.
At the workshop, people were …
read moreThere are comments.
I'm pretty proud of our most recently posted paper, which is on a sequence analysis concept we call digital normalization. I think the paper is pretty kick-ass, but so is the way in which we're approaching replication. This blog post is about the latter.
(Quick note re "replication" vs "reproduction …
read moreThere are comments.
We just posted a pre-submission paper to arXiv.org:
A single pass approach to reducing sampling variation, removing errors, and scaling de novo assembly of shotgun sequences
Authors: C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, and Timothy H. Brom
Paper Web site, with source code …
read moreThere are comments.
The 2012 MSU Next-gen Sequence Analysis course application period just closed, and we received 168 applications. Last year, we received 133, and the year before that we received 33.
We can take 24.
I was also invited to go teach a ~1 week workshop at two other universities on these …
read moreThere are comments.
I'm putting together a computational pipeline for a paper - a Makefile that runs a ton of stuff and outputs files, combined with an ipython notebook file that takes those output files and turns them into figures for inclusion in a LaTeX file. (Yes, very 2000, except for the ipython notebook …
read moreThere are comments.
If you're like me, we pretend to care about the science in bioinformatics software. But what we really do is try to find reasons not to outright loathe the software -- because, lud knows, there are usually plenty of reasons to hate it.
In no particular order, here are the top …
read moreThere are comments.
(updated to point to http://arxiv.org/).
Authors: Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, C. Titus Brown
Abstract:
The memory requirements for de novo assembly of short-read shotgun sequencing data from complex microbial populations are an increasingly large practical barrier to environmental studies. Here we …read more
There are comments.
(and some related thoughts on reproducibility in computational science)
In a recent news article on the "data deluge" in biology, I was quoted as saying "It's not at all clear what you do with that data. Doing a comprehensive analysis of it is essentially impossible at the moment." So, naturally …
read moreThere are comments.
This blog post was inspired by two recent events.
First, in response to a NY Times article about the "data deluge" affecting biologists, one of my Facebook friends said something like "stop whining about how hard it is to analyze the data and do some good experiments instead!" I vehemently …
read moreThere are comments.
I'm just on my way back from a JGI workshop on metagenome informatics, and I thought I'd take the opportunity to write up a short review.
The workshop was, frankly, excellent. We saw a bunch of talks on metagenome assembly (my current interest) as well as single-cell sequencing approaches, and …
read moreThere are comments.
As sequencing gets cheaper and cheaper, one would expect the answer for how to best sequence (and assemble!) any given genome would change. Most biologists assume something along these lines: everyone else has achieved some standard coverage (say 10x, or 100x) for their genome, so all we need to do …
read moreThere are comments.
There's been a lot of hooplah in the last year or so about the fact that our ability to generate sequence has scaled faster than Moore's Law over the last few years, and the attendant challenges of scaling analysis capacity; see Figure 1a and 1b, this reddit discussion, and also …
read moreThere are comments.
For anyone who actually wants to know what it is I do, I've updated my lab Web site, http://ged.msu.edu/, to be a bit more representative of what it is we're doing these days. (I wrote it over three years ago, so it had become increasingly dated.) In …
read moreThere are comments.
I just flew back from Montreal, where I gave a talk at the International Tunicate Meeting on the Molgula project. This is a project wherein we are doing quantitative mRNA sequencing on two species of ascidians, or sea squirts -- specifically, on M. oculata (tailed) and M. occulta (tailless) -- and their hybrids …
read moreThere are comments.
First, I write a recipe file, 'metagenome.recipe', laying out my job description for, say, sequence trimming and assembly with Velvet:
fasta_file soil-data.fa
qc_filter min_length=50 remove_Ns=true
graph_filter min_length=400
velvet_assemble k=33 min_length=1000 scaffolding=True
Then I specify …
read moreThere are comments.
Unless you've been under a rock (or indulging in arsenic yourself), you've heard about NASA's "arsenic" article, claiming the discovery of a microbial species that can substitute arsenate for phosphate. The paper was pre-announced via a press conference that then announced the results.
Immediate blogtastrophe! The paper was critically reviewed …
read moreThere are comments.
In thinking about open science and open communication about science, I've always been frustrated by the people who claim that the risks outweigh the benefits. Their arguments seem sound if you buy into a certain kind of logic (the creationists will try to twist whatever you say! the climate change …
read moreThere are comments.
(with Billie Swalla)
I've spent the last two weeks out at the Station Biologique de Roscoff in Roscoff, France. This little port is on the northern coast of the French region of Bretagne, or Brittany. I'm here with Billie Swalla, a professor at UW Seattle, and Elijah Lowe, a Computer Science …
read moreThere are comments.
(with Adina Howe, Jason Pell, Rosangela Canino-Koning, and Arend Hintze).
A few weeks ago I blogged a bit about a k-mer filtering system, khmer, that we were using to reduce metagenomic data to a more tractable size by throwing out error-prone reads (see A memory efficient way to remove …
There are comments.
I'm a big believer in open science -- see this great polemic over at Mendeley for a good read -- but it's always interesting to think about how such things as "data release" can be perverted by clever scientists. I'm currently in France working on some ascidians with Billie Swalla -- more on …
read moreThere are comments.
(This project is a collaboration with Jason Pell and Adina Howe)
A few weeks ago I posted about a k-mer filtering approach that we were using to remove low-abundance k-mers from metagenomic data sets, prior to assembly. This technique is working well, and we've managed to do some assembly of …
read moreThere are comments.
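For readers who want the gist of low-abundance k-mer filtering, here is a minimal sketch. It counts k-mers exactly with an in-memory Counter for clarity, which is not the memory-efficient structure these posts are about, and the rule of dropping a whole read when any of its k-mers is rare is my simplifying assumption. Names and toy reads are hypothetical.

```python
from collections import Counter


def read_kmers(read, k):
    """All overlapping k-mers in a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def filter_reads(reads, k, min_abundance):
    """Keep only reads whose k-mers all occur at least min_abundance times overall."""
    # First pass: exact k-mer counts across the whole data set.
    counts = Counter()
    for read in reads:
        counts.update(read_kmers(read, k))

    # Second pass: drop reads containing any rare (likely erroneous) k-mer.
    kept = []
    for read in reads:
        found = read_kmers(read, k)
        if found and all(counts[kmer] >= min_abundance for kmer in found):
            kept.append(read)
    return kept


# Three copies of a "true" read plus one copy with a single-base error.
reads = ["ACGTACGT"] * 3 + ["ACGTTCGT"]
print(filter_reads(reads, 4, 2))
```

The read with the error introduces k-mers that appear only once, so it is discarded, while the three consistent copies survive.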
The Terabase Metagenomics meeting was good fun, but I most valued the computational component (because that's what I do). Rachel Mackelprang and Rob Knight and I wrote down a list of the computational issues involved in a petabase metagenomics project, and that list will help direct my future research. I'll …
read moreThere are comments.
I'm on my way back from the Terabase Metagenomics meeting in Snowbird, UT, and I'm buzzing with ideas about how to move forward in metagenomics and bioinformatics research. Metagenomics, the use of genomics approaches to study microbial communities, has been opening up as sequencing drops in price. With sequencing becoming …
read moreThere are comments.
I've spent the last few weeks working on a simple solution to a challenging problem in DNA sequence assembly, and I think we've got a nice simple theoretical solution with an actual implementation. I'd be interested in comments!
Briefly, the algorithmic challenge is this:
We have a bunch of …
There are comments.
After my recent next-gen sequencing course, which was supposed to tie into the whole software carpentry (SWC) effort but didn't really succeed in doing so the first time through, I started thinking about the Right Way to tie in the SWC material. In particular, how do you both motivate scientists …
read moreThere are comments.
So, I'm running this summer course and I am trying to figure out how to organize the notes for students. I'd like to mix curriculum-specific notes ("here's what we're doing today, and here are some problems to work on") with tutorials (material independent of a single course, like "here's how …
read moreThere are comments.
In conversation with a colleague the other day, I found myself making a surprising prediction: the age of the big sequencing centers (Broad Institute, WUSTL, Baylor, DOE JGI, etc.) is coming to an end. In 5 years they will no longer exist.
This prediction is obvious in hindsight.
That is …
read moreThere are comments.
Dear NSF,
I am happy to respond to your request for a 2-page Data Management Plan.
First of all, let me say how enthusiastic I am that you have embraced this new field of "large scale data analysis". Ever since I started working with large Avida data sets in 1993 …
read moreThere are comments.
Just got news that the BEACON NSF Science and Technology Center for the study of Evolution in Action funded Chris Adami to come do a sabbatical here at Michigan State University for the next year. This puts me, Chris, and Charles Ofria at the same institution (now MSU, then Caltech …
read moreThere are comments.
I've been doing some more focused bioinformatics programming recently, and as I'm thinking about how to teach biologists about data analysis, I realize more and more how much backstory goes into even relatively simple programming.
The problem: given a reference genome, and a very large set of short, error-prone, random …
read moreThere are comments.
These days, molecular biologists are dealing with lots and lots of sequences, largely due to next-gen sequencing technologies. For example, the Illumina GA2 is producing 100-200 million DNA sequences, each of 75-125 bases, per run; that works out to 20 gb of sequence data per run, not counting metadata such …
read moreThere are comments.
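The per-run figure above is easy to sanity-check. This sketch just multiplies out the quoted ranges; it counts bases only (no quality scores or metadata), and the function name is mine.

```python
def bases_per_run(n_reads: int, read_length: int) -> int:
    """Total bases produced by a run of n_reads sequences of read_length bases each."""
    return n_reads * read_length


low = bases_per_run(100_000_000, 75)    # fewest reads, shortest reads
high = bases_per_run(200_000_000, 125)  # most reads, longest reads
print(f"{low / 1e9:.1f} to {high / 1e9:.1f} billion bases per run")
```

The quoted 20 gb sits toward the upper end of that range, matching for instance 200 million reads of 100 bases each.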
Analyzing Next-Generation Sequencing Data
May 31 - June 11th, 2010
Kellogg Biological Station, Michigan State University
CSE 891 s431 / MMG 890 s433, 2 cr
Applications are due by midnight EST, April 9th, 2010.
Course sponsor: Gene Expression in Disease and Development Focus Group at Michigan State University.
Instructors: Dr. C. Titus …
read moreThere are comments.
The National Science Foundation just announced that the BEACON Science and Technology Center, centered at Michigan State University, has been funded. BEACON stands for "Bio/computational Evolution in Action Consortium" - you can check out the Web site here.
In my own nutshell, BEACON is focused on studying the evolution of …
read moreThere are comments.
Does anyone have any experience with CloudStore, formerly known as KosmosFS? From http://en.wikipedia.org/wiki/CloudStore:
CloudStore (KFS, previously Kosmosfs) is Kosmix's C++ implementation of Google File System. ... CloudStore supports incremental scalability, replication, checksumming for data integrity, client side fail-over and access from C++, Java and Python.
The …
read moreThere are comments.
My wife and I were talking with my USDA collaborator about some possible chicken research, and I asked about access to animals. His response? "Chickens are not a rate limiting factor."
Did you know that 1 million chickens are slaughtered per hour, on average, in the US? Wow.
--titus
read moreThere are comments.
OK, so you have a genome -- let's say it's about 1gb in size -- and you want to do ChIP-seq on a transcription factor that you think binds ~1000 places in the genome. You've measured the specificity of the transcription factor and it seems to enrich about 10-fold over background (an …
read moreThere are comments.
The last two weeks were pretty miserable, for some scientific/collaboration reasons as well as some personal reasons (visiting sick parents != fun). Two things that weren't miserable -- that were in fact quite fun -- were PyOhio and the Science 2.0 talks in Toronto.
PyOhio was a nice little community-based conference …
read moreThere are comments.
I'd like to find an MSU student to report semi-monthly on python-dev. The student would be responsible for monitoring the python-dev mailing list and active PEPs, summarizing substantive discussions in a public forum, and integrating feedback from the community. This would be a 1 credit CSE independent study course (CSE …
read moreThere are comments.
Just submitted this on Thursday:
Next generation sequencers are beginning to impact agricultural biology. Over the next few years, next generation sequencing will produce incredibly large datasets that will address structural (e.g., SNPs, CNVs, indels, methylation, translocations) and functional (e.g., RNA expression, transcription factor binding sites) variation in …read more
There are comments.
As part of a CiSE submission I'm working on, I interviewed the lead developer on a scientific software package today. This software package is mainly used for evolutionary studies, and has a small but devoted following - ~6 developers and ~12 users locally, plus a few dozen users outside of MSU …
read moreThere are comments.
John Gall apparently said:
A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with …read more
There are comments.
I recently had the pleasure of being the technical reviewer for a new Apress offering, Beginning Python Visualization, by Shai Vaingast.
To quote from the apress page,
What you'll learn:
- Write ten lines of code and present visual information instead of data soup.
- Set up an open source environment ready …
read moreThere are comments.
The decision of python-dev to deprecate bsddb has left us in a bit of a pickle (hah!) over in the pygr project. We're looking for a replacement for bsddb for default storage of infrequently- (or never-) changed pickled Python objects. Some of the parameters under consideration are:
- Python version availability …
read moreThere are comments.
The latest hot shit idea for making a protein-protein interaction database leaves me lukewarm.
A few months ago I met with a genomics group, and we had a back-and-forth about genome annotation. The conversation went something like this:
them: "We have to improve the tools for annotating un-annotated genes!" me …read more
There are comments.
We're going through the PyCon '09 review process, and participating in the process has been pretty interesting. (I joined the Program Committee in large part because I was told to put up or shut up after I critiqued PyCon '08. Ahh, the open source world... where you're encouraged to go …
read moreThere are comments.
As a new prof, I've been too busy to blog much. What am I doing?
Apart from all the normal academic crud (meeting with people, answering e-mail, doing paperwork, etc.) and parenting & home ownership stuff, I've been teaching my Intro to Database-Backed Web Programming course. This has been neither a …
read moreThere are comments.
My last post initiated a discussion on the biology-in-python mailing list about BioPython, among other things. (Here is a link to the discussion, which is kind of long and unfocused.)
I'm happy that the bip list is serving as a place for people to interact with the BioPython maintainers to …
read moreThere are comments.
Chris Lasher wrote a nice blog post naming me as a rabble rouser in the area of "Python in bioinformatics". His post raised a number of interesting points, some of which I'd like to discuss here on my blog.
First, why is Python not more dominant in bioinformatics? I really …
read moreThere are comments.
We have an opening for a project on which I'm collaborating:
Full-time 12 month appointment academic position for a genomics scientist. The incumbent will spend 50% time as the Associate Director of the Comparative Genomics Laboratory, with duties in directing daily activities, long-range planning and seeking extramural funding, and 50 …read more
There are comments.
I read things like this report on SciFoo and think, gawd! I'd have had a great time! I should try to beg/bully/buy/brown-nose my way into the next SciFoo so I can talk about Science 2.0 etc.!
And then I think back to the heady days of …
read moreThere are comments.
I finally got sick of manually schlepping BLAST files around, so I wrote something to do it for me. 'zounds' is a very simple server/client system for coordinating a bunch of 'worker' nodes through a central server; it does everything in Python with objects and pickling, so it's easy …
read moreThere are comments.
On Thursday, May 15th, I finished my post-doc position at Caltech.
On Friday, May 16th, I officially started as an Assistant Professor split between Computer Science & Engineering and Microbiology & Molecular Genetics at Michigan State University.
On Friday evening and Saturday, we hung out down at the Caltech Marine Lab and …
read moreThere are comments.
(pygr is a neat bioinformatics framework in Python.)
After some commenters on my last post seemed happy to hear that pygr was the focus of some summer work, I realized I had only discussed the pygr summer work in a post to the biology-in-python list.
Whoops.
So, here's the scoop …
read moreThere are comments.
Dear Lazyweb, help!
I'm embarking on a number of summer projects in my new lab at MSU, and several of them focus on using pygr to do cool genomic stuff. In particular, I'm planning to build a personal genome annotation system that will let people run their own full genome …
read moreThere are comments.
I spent some time over the last week adding fairly simple motif searching to Cartwheel, my bioinformatics site for biologists doing cis-regulatory analysis of genomic sequence. The new features include the ability to define and search with IUPAC and position-weight matrix (PWM) motifs, as well as visualization of motif search …
read moreThere are comments.
Via http://www.nodalpoint.org/2008/01/18/one_thousand_databases_high_and_rising, on the Nucleic Acids Res "database" issue:
As we pass the one thousand databases mark (1kDB) I wonder, what proportion of the data in these databases will never be used?
This is an unsettling thought for …
I just finished a chapter for a book, Methods in Avian Embryology, being edited by my boss, Marianne Bronner-Fraser. This chapter is intended for developmental biologists who are interested in locating regulatory modules and analyzing them for binding sites. It ended up being my outlet for a compilation of problems …
heh, this applies to many fields, I think...
Luis Ibanez: This presentation is a satire of the current obsession with intellectual property, innovation and originality that plagues the field of medical image analysis. The presentation makes the point that most Journals and Conferences focus on Originality and despise Reproducibility and …
My Computer Science department at Michigan State University is looking for an assistant professor! We are casting a fairly wide net (databases, graphics, medical imaging, and bioinformatics) but I'd really like to attract a bioinformatician.
The Computer Science department at MSU is a nice, small department …
I recently gave an informal talk on Software Carpentry for the Caltech e-Science 101 course. Since even "Intro Software Carpentry" is a whole course of study, I obviously couldn't cover much, but I tried to motivate people to get interested. And, of course, I pushed testing. TESTING, DAMNIT!
Anyway, here …
So, next May I'm starting as an assistant professor split between the Computer Science and Microbiology and Molecular Genetics departments at Michigan State U., and I'm interested in attracting as many good CS grad applicants as I can from the open source and bioinformatics communities. (I would also like to …
Rob Campbell found me via Google, and pointed me towards his blog, Science and Software. Funny, well written, and very apropos! Why isn't there more software, commercial or otherwise, for labs?
There has been a lot of local interest (i.e. two or three people have discussed it at various …
After our long software licensing discussion on the biology-in-python list, I realized that I wanted something different in a license for scientific software.
Specifically, I would like to attach the following clause to either a BSD or L/GPL style license:
Publications relying on derivative works of this software must …
This month the newly minted biology-in-python mailing list erupted into a discussion of licenses. There was some confusion about the goal of the discussion, for which I'm largely responsible: we didn't make it clear that we were talking about licenses for code and content posted on the bio.scipy.org …
In the spirit of cleaning up my desktop... here's a PDF of my talk on Cartwheel at SciPy 2007.
--titus
To get people talking, I've created a "biology-in-python" mailing list. You can subscribe here: http://lists.idyll.org/listinfo/biology-in-python, and you can post to it at bip@lists.idyll.org once you're a member.
This list is a tool/package/library-agnostic list, for people who use Python to work …
I've just put up a simple lab Web site for my future lab at Michigan State U.; I'm calling it the Lab of Genomics, Evolution, and Development.
--titus
Legacy Comments
Posted by Melissa on 2007-07-10 at 05:03.
Really cool Titus! The science you are doing is very interesting ... hmmm …
In the spirit of Greg's Not on the Shelves post about books he'd like to see, here's one class I'd like to see taught:
Test-Driven Web Development
This class will introduce students to test-driven software engineering through the development of a database-backed Web site. Student development will be driven by …
Spent a really frustrating hour or two this weekend figuring out why Apache 2.1 wasn't working on vallista.idyll.org.
The symptoms of the problem were that Apache would not serve static pages at all. I could serve dynamic pages (in fact, I "patched" my few static sites by …
Here's some pointlessly complex systems administration stuff.
I spent an hour or two today debugging my spam filtering setup. Most of my e-mail goes through Caltech, which does spam tagging nicely, but recently there's been a substantial increase in e-mail coming through various hosted domains. This bypasses Caltech's tagging, so …
http://www.youtube.com/watch?v=u9dhO0iCLww
(Things start to heat up in the 4th minute.)
corebio, the joint effort by a junta of California bioinformaticians to replace BioPython with something we like better, is proceeding interestingly. So far we have discussed the following issues:
- what license? (BSD)
- what focus? (sequence manipulation & parsing)
- what about binary extensions? (focus on API, provide fast implementations where appropriate, but …
I've said some mean things about BioPython in the past -- that it's broken, that it's crufty, etc. One prominent former BioPython developer responded with the very reasonable question of why I wasn't fixing it, if it was so broken. The answer, of course, is that I've been working on my …