Brainstorming ideas

After a visit to Cornell, where I met with the Plant Pathology & Plant-Microbe Biology department, the Pseudomonas syringae groups, and SGN, and a breakfast with Tim Hubbard when he was in Berkeley, I had a few ideas.

  • We need to be able to put the power of annotation in the hands of more people.  Community-assisted annotation at the level of just function, linking to articles, and general curation should be accessible à la Wikipedia.
  • For genome annotation, though, there is a more specialized need to be able to incorporate data from different sources: a Git-like repository for genome annotation (in GFF) which can be served up to GBrowse. Edits can be saved to one's own branch.  (All of this assumes the same reference genome assembly, which is about the level I'm comfortable worrying about, though some of the genome projector type tools would seem to make it easy to lift annotation from one assembly to another.)
  • This would probably necessitate a GenomeAnnotationDiff tool.  It might already be accomplished by tools that the Yandell lab has produced, described in a publication by Eilbeck et al.
  • Gene pages with community annotation tools at SGN are ready to go, and they have VMs to avoid having to install all the software. I even saw a cool on-the-fly QTL calculation.  The challenge I see in our data is always linking the data from one context to another and making that useful. I will have to try transforming some of the different data we have here.
  • The SGN approach is to use aspects of Chado for the schema that deals with ontologies/controlled vocabularies but to also have domain specific databases for annotation and related info rather than the giant “everything is a feature” that is the Chado-way and doesn’t seem to scale.
  • It is about time to try out Hadoop/MapReduce on our big datasets and to start running the automated all-vs-all ortholog prediction scripts on our genomes in earnest. There are just too many times when it seems important to have an updated dataset. This is something to deploy on the new hardware environment this summer.
  • No one has figured out how to interface with NCBI/GenBank/EMBL to deal with the updating of genomes in a sensible way. Basically, all the really complicated systems keep the bulk of the data in their own domain-specific databases and feed that data back in at appointed times, but often this is a huge process and only works where there is a real effort from both NCBI/GenBank/EMBL and the group.  E.g. Ensembl has the CCDS and RefSeq projects that can take the output from Ensembl and feed that back into the system.
    What would a system for comparative reannotation of X fungal genomes be able to do with the data?
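To make the GenomeAnnotationDiff idea above a bit more concrete, here is a minimal sketch in Python. It is not the Yandell lab tooling and not an existing tool; the function names (parse_gff3, diff_annotations) are made up for illustration. It assumes both annotation sets are GFF3 on the same reference assembly and that every feature of interest carries a stable ID attribute. It only compares coordinates, type and strand, so a real tool would need to handle much more (parent/child relationships, attribute changes, features without IDs).

```python
from typing import Dict, List, Tuple

# Core columns kept per feature: (seqid, type, start, end, strand)
Feature = Tuple[str, str, int, int, str]


def parse_gff3(text: str) -> Dict[str, Feature]:
    """Map each feature's ID attribute to its core columns.

    Hypothetical helper: skips comments, blank lines, malformed rows,
    and features that have no ID attribute.
    """
    features: Dict[str, Feature] = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        attrs = dict(
            pair.split("=", 1) for pair in cols[8].split(";") if "=" in pair
        )
        feature_id = attrs.get("ID")
        if feature_id is None:
            continue
        features[feature_id] = (
            cols[0], cols[2], int(cols[3]), int(cols[4]), cols[6]
        )
    return features


def diff_annotations(old_text: str, new_text: str) -> Dict[str, List[str]]:
    """Report feature IDs added, removed, or changed between two GFF3 texts."""
    old, new = parse_gff3(old_text), parse_gff3(new_text)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(
            fid for fid in old.keys() & new.keys() if old[fid] != new[fid]
        ),
    }
```

Keying on the ID attribute is the main design assumption here: it is what lets an edit on someone's own branch be recognized as a change to an existing gene model rather than as one deletion plus one addition.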

On the Plant Path & fungal side of discussions

  • Looking at multiple genotypes of both the host and pathogen seems like a really smart way to start exploring the effects of mutations. With so many more tools now available in both systems, this seems like the next logical set of experimental designs.
  • I really need to get some movies made of Bd (Chytrid) zoospores swimming around; they would make for better introductions to talks. I had to settle for showing oomycete zoospores, which are cool but not the same.
  • There need to be new/better population genetics tools for systems where the populations are clearly not in Hardy-Weinberg equilibrium, such as newly introduced pathogens.
  • Closeup pictures of fungi are really cool, especially through the borescope.
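As a small illustration of the Hardy-Weinberg point above, this is a rough Python sketch of the standard chi-square check for departure from Hardy-Weinberg proportions at a single biallelic locus. The function name hwe_chi_square and the genotype counts in the usage note are made up for illustration, and the check is only meant for diploid genotype counts. It shows how one might flag a population as out of equilibrium; it is not one of the new tools the bullet asks for.

```python
def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic (1 degree of freedom) for deviation from
    Hardy-Weinberg proportions, given observed genotype counts at one
    biallelic locus."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes observed")
    # Allele frequency of A estimated from the genotype counts
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    # Skip classes with zero expectation (monomorphic locus)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
```

For example, hwe_chi_square(50, 0, 50) describes a sample with no heterozygotes at all, the kind of pattern a clonally spreading introduced pathogen could produce, and it gives a statistic far above 3.84, the usual 5% cutoff for one degree of freedom.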

2 thoughts on “Brainstorming ideas”

  1. This is a really interesting post. I heartily agree that we need more community assisted annotation. I’m getting incredibly frustrated about not being able to fix EMBL annotation directly – it is much too painful contacting the original authors about updates. Also, the published sequences that don’t make it into EMBL are a further source of pain.

    I’m surprised TH didn’t mention DAS – it’s rather similar in spirit to your gff/git idea. Except I like your concept better as I’m damned if I can figure DAS out yet.

  2. Actually I was implicitly thinking of DAS as the way the annotation goes from Git (or whatever repository) into the genome browser/Apollo view/etc., with some simple data storage on the user's side and DAS as the middleware to connect things. One can then have these simple local servers advertised through the DAS registry. This was the sort of thing Tim and I were talking about at breakfast, but I didn't quite put it down in the text there. Thanks for reminding me!

    I really think we’re not going to be able to map the curation resources to the data deluge without some sort of community annotation work and completely echo your frustration with what to do when something needs to be corrected in an EMBL/GenBank database.
