  1. Advancing metagenome classification and comparison by MinHash fingerprinting of IMG/M data sets.

    This is our just-submitted proposal for the JGI-NERSC "Facilities Integrating Collaborations for User Science" call. Enjoy!


    1. Brief description: (Limit 1 page)

    Abstract: Sourmash is a command-line tool and Python library that calculates and compares MinHash signatures from sequence data. Sourmash "compare" and "gather" functionality enables comparison and characterization of signatures …

  2. Request for Compute Infrastructure to Support the Data Intensive Biology Summer Institute for Sequence Analysis at UC Davis

    Note: we were just awarded this allocation on Jetstream for DIBSI. Huzzah!


    Abstract:

    Large datasets have become routine in biology. However, performing a computational analysis of a large dataset can be overwhelming, especially for novices. From June 18 to July 21, 2017 (30 days), the Lab for Data Intensive Biology …

  3. Computational postdoc opening at UC Davis!

    We are currently soliciting applications for computational postdoctoral fellows to undertake exciting projects in computational biology/bioinformatics jointly supervised by Dr. Titus Brown (http://ivory.idyll.org/lab/) and Dr. Fereydoun Hormozdiari (http://www.hormozdiarilab.org/) at UC Davis.

    UC Davis is a world class research institution with a strong …

  4. Categorizing 400,000 microbial genome shotgun data sets from the SRA

    This is another blog post on MinHash sketches; see also:

  5. How I learned to stop worrying and love the coming archivability crisis in scientific software

    Note: This is the fifth post in a mini-series of blog posts inspired by the workshop Envisioning the Scientific Paper of the Future.

    This post was put together after the event and benefited greatly from conversations with Victoria Stodden, Yolanda Gil, Monya Baker, Gail Peretsman-Clement, and Kristin Antelman!


    Archivability is …

  6. Quickly searching all the microbial genomes, mark 2 - now with archaea, phage, fungi, and protists!

    This is an update to last week's blog post, "Efficiently searching MinHash Sketch collections".


    Last week, Thanksgiving travel and post-turkey somnolescence gave me some time to work more with our combined MinHash/SBT implementation. One of the main things the last post contained was a collection of MinHash signatures of …

  7. What is open science?

    Gabriella Coleman asked me for a short, general introduction to open science for a class, and I couldn't find anything that fit her needs. So I wrote up my own perspective. Feedback welcome!

    Some background: Science advances because we share ideas and methods

    Scientific progress relies on the sharing of …

  8. Increasing postdoc pay

    I just gave all of my postdocs a $10,000-a-year raise.

    Both of my current postdocs got a $10k raise over their current salary, and the four postdocs coming on board over the next 6 months will start at $10k over the NIH base salary we pay them already. (This …

  9. An #openscienceprize entry: Integrating server-side annotation into the hypothes.is ecosystem

    Over the last few months, I've been playing with hypothes.is and thinking about how to use it to further my scientific work. This resulted in some brainstorming with Jon Udell and Maryann Martone about, well, lots of things. And now we're putting in an open science prize entry!

    tl …

  10. Is mybinder 95% of the way to next-gen computational science publishing, or only 90%?

    If you haven't seen mybinder.org, you should go check it out. It's a site that runs IPython/Jupyter Notebooks from GitHub for free, and I think it's a solution to publishing reproducible computational work.

    For a really basic example, take a look at my demo Software Carpentry lesson. Clicking …

  11. Why software development practice matters, Containerization version

    A while back, Kai Blin (via Nick Loman) asked Michael Barton:

    If we containerize all these things won't it just encourage worse software development practices; right now developers still need to consider someone other than themselves installing the software.

    and Michael Barton's response, transcribed, was:

    "It's a good point. Ultimately …

  12. Transcriptomic analysis with Docker containers and data volumes

    As part of our Docker hands-on workshop earlier this month, I learned a lot about building Dockerfiles, running Docker containers on remote hosts with docker-machine, and using data volumes to manage data in remotely hosted Docker containers.

    During and after the workshop, I put together Docker images (and, more importantly …

  13. Pubwication of software papers, and authorship on them

    Pubwication. Pubwication is what bwings us togethew today. Pubwication, that bwessed awwangement, that dweam within a dweam. And authorship, twue authorship, wiww fowwow you fowevah and evah. So tweasuwe youw authorship.

    Last week, our software paper on khmer 2.0 was published on F1000Research. We intend this paper to be …

  14. jclub: Bloom Filter Trie - a data structure for pan-genome storage

    Note: this is a blog post from the DIB Lab journal club.

    Jump to Questions and Comments.


    The paper:

    http://www.techfak.uni-bielefeld.de/~stoye/dropbox/wabi2015final.pdf

    "Bloom Filter Trie: a data structure for pan-genome storage."

    by Guillaume Holley, Roland Wittler, and Jens Stoye.

    Background

    • Pan Genome: Represents genes …

  15. A review of "Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees"

    (This is a review of Large-Scale Search of Transcriptomic Read Sets with Sequence Bloom Trees, Solomon and Kingsford, 2015.)

    In this paper, Solomon and Kingsford present Sequence Bloom Trees (SBTs). SBT provides an efficient method for indexing multiple sequencing datasets and finding in which datasets a query sequence is present …

  16. Taking grad students to PyCon

    I am still up at PyCon 2015 in Montreal, and most of my lab is here with me.

    On Saturday, I told Terry Peppers and some others that PyCon had been one of my (limited) lifelines to (limited) sanity during my early tenure-track years. Whenever I was in danger of …

  17. "Open Source, Open Science" Meeting Report - March 2015

    On March 19th and 20th, the Center for Open Science hosted a small meeting in Charlottesville, VA, convened by COS and co-organized by Kaitlin Thaney (Mozilla Science Lab) and Titus Brown (UC Davis). People working across the open science ecosystem attended, including publishers, infrastructure non-profits, public policy experts, community builders …

  18. How we develop software (2015 version)

    A colleague who is starting their own computational lab just asked me for some advice on how to run software projects, and I wrote up the following. Comments welcome!


    A brief summary of what we've converged on for our own needs is this:

    • everything's on github (you can have private …

  19. Lab for Data Intensive Biology at UC Davis joins Software Carpentry as an Affiliate

    We are pleased to announce that the Laboratory for Data Intensive Biology at UC Davis has joined the Software Carpentry Foundation as an Affiliate Member for three years, starting in January 2015.

    "We've been long-term supporters of Software Carpentry, and Affiliate status lets us support the Software Carpentry Foundation in …

  20. Letter of resignation


    Dear <chairs>,

    I am resigning my Assistant Professor position at Michigan State University effective January 2nd, 2015.

    Sincerely,

    CTB.


    Anticipated FAQ:

    • Why? I'm moving to UC Davis.
    • Do you have an employment contract with UC Davis?? Nope. But I'm starting there in January, anyway. Or that's the plan. And yes …

  21. Introducing the Moore Foundation's Data Driven Discovery (DDD) Investigators

    Note: the source data for this is available on github at https://github.com/ctb/dddi

    Today, the Moore Foundation announced that they have selected fourteen Moore Data Driven Discovery Investigators.

    In reverse alphabetical order, they are:


    Dr. Ethan White, University of Florida

    Proposal: Data-intensive forecasting and prediction for ecological …

  22. The Critical Assessment of Metagenome Interpretation and why I'm not a fan

    Update 3/29/15: the CAMI FAQ now includes information on reproducibility measures, and looks very promising. The data sets they are producing also seem fascinating.

    If you're into metagenomics, you may have heard of CAMI, the Critical Assessment of Metagenome Interpretation. I've spoken to several people about it in …

  23. Preprints and double publication - when is some exposure too much?

    Note to all: this is satire... As Marcia McNutt says below, please see Science Magazine's Contributors FAQ for more detailed information.


    Recently I had some conversations with Science Magazine about preprints, and when they're counted as double publication (see: Ingelfinger Rule). Now, Science has an enlightened preprint policy:

    ...we do …

  24. A first science fair

    So my daughter just participated in her first science fair, at the age of 6. ("Conclusion: science can be fun! and sticky!")

    Over dinner, my wife and I came up with some ideas for her next fair. She was having trouble dissolving sugar in ice water, so we suggested maybe …

  25. Imagine...

    Links, software, thoughts -- all solicited! Add 'em below or send 'em to me, t@idyll.org.

    ---

    Imagine... a rolling 48 hour hackathon, internationally teleconferenced, on reproducing analyses in preprints and papers. Each room of contributors could hack on things collaboratively while awake, then pass it on to others in overlapping …

  26. The Story Behind "Tackling soil diversity with the assembly of large, complex metagenomes"

    I'm pleased to announce the publication of "Tackling soil diversity with the assembly of large, complex metagenomes", by Adina Howe, Janet Jansson, Stephanie Malfatti, Susannah Tringe, James Tiedje, and myself. The paper is openly available on the PNAS Web site here (open access).

    External links:

  27. Install gplots in R 2.1X

    I've been using EBSeq for a few things lately, and have had trouble getting some of the dependencies installed -- in particular, gplots doesn't seem to be readily available for R 2.14, 2.15, etc. Judging by my Google searches, others have been having the same problems; see e.g …

  28. Will you join my committee?

    Dear <student>,

    I'd be happy to, but I do have a few conditions/requests based on prior experience with students!

    First, please schedule all of your meetings at least 2 months in advance :)

    Second, a condition for my signing off on your thesis will be that, for any paper for …

  29. Is "Scientific Data" ever-finer salami-slicing, or is it reducing time to data publication?

    I just read Scientific Data - ultimate salami slicing publishing, in which Pedro Beltrao argues that Nature's new journal is simply another venue for them to suck money out of scientists. Maybe. But I'm strongly considering sending a lot of stuff there, and I really think Pedro is missing something very …

  30. I've got a new job

    As the title says, I've got a new job.

    But it's not really that exciting a switch, sorry :)

    As of mid-August sometime, I will officially switch my appointment from 2/3 Computer Science and Engineering / 1/3 Microbiology and Molecular Genetics, to 2/3 Microbiology and Molecular Genetics, 1/3 …

  31. A mildly crazy idea: crowdsourced -omic analysis with data privacy sunset?

    Or, "can we crowdsource BGI?" ;)

    With all of the crazy need surrounding genomic analysis -- most of it on a shoestring budget -- I am thinking about a mildly crazy idea.

    What if I offered to computationally analyze people's non-model transcriptomic and metagenomic data for them, in exchange for (a) non-exclusive access …

  32. Thinking about software architecture for heterogeneous data integration

    I just left the NAS meeting on Integrating Environmental Health Data to Advance Discovery, where I was an invited speaker. It was a pretty interesting meeting, with presentations from speakers who worked on chemotoxicity data, pollution data, exposure data, and electronic health records, as well as a few "outsiders" from …

  33. Assembling the heck out of soil - paper posted

    We just posted yet another pre-submission paper to arXiv.org:

    Assembling large, complex environmental metagenomes

    Authors: Adina Chuang Howe, Janet Jansson, Stephanie A. Malfatti, Susannah Tringe, James M. Tiedje, and C. Titus Brown

    arXiv link

    Paper repository on github

    Abstract:

    The large volumes of sequencing data required to deeply sample …

  34. Assembly artifacts paper posted

    We just posted another pre-submission paper to arXiv.org:

    Illumina Sequencing Artifacts Revealed by Connectivity Analysis of Metagenomic Datasets

    Authors: Adina Chuang Howe, Jason Pell, Rosangela Canino-Koning, Rachel Mackelprang, Susannah Tringe, Janet Jansson, James M. Tiedje, and C. Titus Brown

    arXiv link

    Paper repository on github

    Abstract:

    Sequencing errors and …

  35. Anecdotal science

    I'm starting to notice that a lot of bioinformatics is anecdotal.

    People publish software that "works for them." But it's not clear what "works" means -- all too often either the exact parameters or the specific evaluation procedure is not provided (and yes, there's a double standard here where experimental methods …

  36. What biologists need to know about cyberinfrastructure

    I recently attended an NSF BIO directorate meeting about cyberinfrastructure needs. Here's a list of training & education challenges identified at that meeting:

    • development and adaptation of tools to archive data and metadata from diverse sources to enable data mining
    • integration of structured and unstructured data from heterogeneous data sources
    • discussion …

  37. The Beachcomber's Dilemma

    Here's a data analysis question for all you Big Data folk.

    A beachcomber is interested in obtaining up to 10 examples of every type of shell present on a beach. The shells are individually easy to find, but some types are really rare and some are really abundant. The beachcomber …

  38. Some early experience in teaching using ipython notebook

    As part of the 2012 Analyzing Next-Generation Sequencing Data course, I've been trying out ipython notebook for the tutorials.

    In previous years, our tutorials all looked like this: Short read assembly with Velvet -- basically, reStructuredText files integrated with Sphinx. This had a lot of advantages, including Googleability and simplicity; but …

  39. DRAFT: A community-focused pre-publication data release and sharing policy for sequence data

    This is a draft proposal of a policy to encourage pre-publication data release and data sharing within a community. This policy is based on discussions at the Cephalopod Genomics Workshop (a Catalysis workshop sponsored by NESCent).

    Note, this is made available under a CC-BY-SA license permitting use and re-use with …

  40. A simple idea: standard but optional review criteria for bioinformatics papers

    Brad Chapman (@chapmanb on twitter) wrote and signed a nice review of my submission to the Bioinformatics Open Source Conference. In his review, he said

    My only small suggestion is to include some discussion about your
    reproducibility work during the talk: the Amazon AMI, documentation
    and reproducible ipython workflows. This …

  41. Paper draft: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

    (updated to point to http://arxiv.org/).

    Authors: Jason Pell, Arend Hintze, Rosangela Canino-Koning, Adina Howe, James M. Tiedje, C. Titus Brown

    Abstract:

    The memory requirements for de novo assembly of short-read shotgun sequencing data from complex microbial populations are an increasingly large practical barrier to environmental studies. Here we …

  42. A memory efficient way to remove low-abundance k-mers from large (metagenomic?) DNA data sets

    I've spent the last few weeks working on a simple solution to a challenging problem in DNA sequence assembly, and I think we've got a nice simple theoretical solution with an actual implementation. I'd be interested in comments!

    Introduction

    Briefly, the algorithmic challenge is this:

    We have a bunch of …

  43. Course announcement: Analyzing Next-Generation Sequencing Data

    Analyzing Next-Generation Sequencing Data

    May 31 - June 11th, 2010

    Kellogg Biological Station, Michigan State University

    CSE 891 s431 / MMG 890 s433, 2 cr

    Applications are due by midnight EST, April 9th, 2010.

    Course sponsor: Gene Expression in Disease and Development Focus Group at Michigan State University.

    Instructors: Dr. C. Titus …

  44. Lazyweb query: CloudStore (or KosmosFS)

    Does anyone have any experience with CloudStore, formerly known as KosmosFS? From http://en.wikipedia.org/wiki/CloudStore:

    CloudStore (KFS, previously Kosmosfs) is Kosmix's C++ implementation of
    Google File System. ... CloudStore supports incremental scalability,
    replication, checksumming for data integrity, client side fail-over and access
    from C++, Java and Python.
    

    The …

  45. Easily Accessible Web-Based Tools For Analyzing Next-Generation Sequencing Data From Agricultural Animals

    Just submitted this on Thursday:

    Next generation sequencers are beginning to impact agricultural biology. Over the next few years, next generation sequencing will produce incredibly large datasets that will address structural (e.g., SNPs, CNVs, indels, methylation, translocations) and functional (e.g., RNA expression, transcription factor binding sites) variation in …

  46. Software testing in science

    As part of a CiSE submission I'm working on, I interviewed the lead developer on a scientific software package today. This software package is mainly used for evolutionary studies, and has a small but devoted following - ~6 developers and ~12 users locally, plus a few dozen users outside of MSU …

  47. Pursuing simplicity

    John Gall apparently said:

    A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work. You have to start over, beginning with …

  48. PyCon review process

    We're going through the PyCon '09 review process, and participating in the process has been pretty interesting. (I joined the Program Committee in large part because I was told to put up or shut up after I critiqued PyCon '08. Ahh, the open source world... where you're encouraged to go …

  49. Off to MSU - Woo hoo!

    On Thursday, May 15th, I finished my post-doc position at Caltech.

    On Friday, May 16th, I officially started as an Assistant Professor split between Computer Science & Engineering and Microbiology & Molecular Genetics at Michigan State University.

    On Friday evening and Saturday, we hung out down at the Caltech Marine Lab and …

  50. Principles and Practices of Scientific Originology

    heh, this applies to many fields, I think...

    Luis Ibanez
    
    This presentation is a satire of the current obsession with
    intellectual property, innovation and originality that plagues
    the field of medical image analysis. The presentation makes the
    point that most Journals and Conferences focus on Originality and
    despise Reproducibility and …

  51. Darned Apache2

    Spent a really frustrating hour or two this weekend figuring out why Apache 2.1 wasn't working on vallista.idyll.org.

    The symptoms of the problem were that Apache would not serve static pages at all. I could serve dynamic pages (in fact, I "patched" my few static sites by …

  52. Exim, spamassassin and logrotate

    Here's some pointlessly complex systems administration stuff.

    I spent an hour or two today debugging my spam filtering setup. Most of my e-mail goes through Caltech, which does spam tagging nicely, but recently there's been a substantial increase in e-mail coming through various hosted domains. This bypasses Caltech's tagging, so …

  53. Installing Xen on Debian

    I just got two HP ML350 servers (very nice: 8 GB RAM, 600 GB 15k disk, 2x 3.6 GHz Xeon -- yes, we over-ordered) and I spent a few hours installing Xen-enabled Debian on them.

    Xen is a very nice virtualization system that works with Linux. It lets you do …

  54. corebio proceedeth

    corebio, the joint effort by a junta of California bioinformaticians to replace BioPython with something we like better, is proceeding interestingly. So far we have discussed the following issues:

    • what license? (BSD)
    • what focus? (sequence manipulation & parsing)
    • what about binary extensions? (focus on API, provide fast implementations where appropriate, but …

    Proudly powered by pelican, which uses python.

    The theme is subtly modified from one by Smashing Magazine, thanks!

    For more about this blog's author, see the main site or the lab site

    While the author is employed by the University of California, Davis, his opinions are his own and almost certainly bear no resemblance to what UC Davis's official opinion would be, had they any.