Why GitHub isn’t enough: persistent research materials

Once upon a time the internet didn’t exist. People corresponded by sending written or typed letters to one another through the postal service. Knowledge was transferred mostly through reports, articles and books. Important letters and books were collected and placed in collections in libraries, which one would have to make the effort to visit in order to learn what they had to tell.

Today things are different. Online collections and websites can be visited in comfort from anywhere with an internet connection. Online means of collecting, transferring and communicating knowledge are increasingly supplanting traditional print publishing. Figures, data and code are now research outputs in their own right: they can stand alone, they may be essential for interpreting and communicating results, and they may be shared before any peer-reviewed publication. On the whole, all of this is taken for granted. However, not everything is as rosy as it could be. Specifically, I want to tackle how software and code are released.

some example code

It was once the case that the easiest way for researchers to make software they had created available to others was to release it on their personal websites (or those managed by their university). This is not ideal, as these websites may disappear if the domain is not maintained, e.g. if a researcher leaves the university (by changing jobs, retiring – or worse: if they die) or, more commonly for personal websites, if domain payments are not kept up. Software and code released in these ways are not persistent and may not stand the test of time, which creates a problem for people trying to access these materials. Another commonly seen phrase in manuscripts has been something along the lines of:

 

“available from authors upon request”

 

This is also unsatisfying, as these materials will only be available for as long as the authors are able to correspond (i.e. their current contact details can be found and they reply) and for as long as they keep a copy of the original code or software. As the time since release increases, the probability of being able to access the relevant resources in these ways is likely to decrease.

Things are getting better. Nowadays various online repositories exist that provide a space specifically for uploading software code, e.g. GitHub and Bitbucket. This is great as it encourages researchers to upload the code they have used, which allows others to view the methods used, rather than just an application that is said to perform certain operations (in an unseen way).

However, some of the problems that affected previous methods of releasing software and code also apply to these sites. The companies that run these repositories could collapse, change their funding structure such that payment is required to access the material, or change their website address. In these cases the original software would become harder to access or, in the worst case, inaccessible.

To quote the editor for my latest paper in Royal Society Open Science:

 

“Whilst github is a useful platform it does not provide a final version of record”

 

On top of potential problems from a business perspective, other things need to be considered. Code or software within a GitHub repository may continue to be altered after a paper using it has been published. This could be good (when it fixes bugs that had not previously been detected) or bad (when the repository changes so much that it no longer reflects the original code, or is deleted altogether!).

Fortunately, there is a solution for making code persistently accessible for future users: archiving the code and assigning it a DOI. DOIs (digital object identifiers) are permanent, unique identifiers that can be assigned to almost any type of resource. A consortium of agencies supports the DOI system and works collaboratively to ensure that each DOI continues to point to its resource, regardless of how the URL may change.
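As a small illustration of why a DOI is easier to cite than a raw web address, here is a minimal Python sketch that turns a DOI written in a few common forms into a single resolver link. The function name `doi_to_url` is made up for this example, and the choice of `https://doi.org/` as the resolver is an assumption about current practice rather than anything specific to Zenodo or figshare.

```python
def doi_to_url(doi: str) -> str:
    """Return a resolver URL for a DOI given in one of a few common forms.

    Hypothetical helper for illustration only: it tidies the text and does
    not check that the DOI is actually registered.
    """
    cleaned = doi.strip()
    # Prefixes people commonly paste in front of the bare identifier.
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    lowered = cleaned.lower()
    for prefix in prefixes:
        if lowered.startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    # Every DOI begins with the directory indicator "10."
    if not cleaned.startswith("10."):
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


if __name__ == "__main__":
    # The DOI of the figshare copy of one of the posts on this blog.
    print(doi_to_url("http://dx.doi.org/10.6084/m9.figshare.757780"))
```

Whichever form the DOI was written in, the link stays the same even if the repository later moves the underlying files.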

In addition, I think it looks much nicer to cite an online resource through the repository and its DOI than by inserting a website address into your manuscript! This makes it easier to cite and give credit to code and software materials.

Both Zenodo (https://zenodo.org/) and figshare (https://figshare.com/) are examples of current repositories that allow a DOI to be assigned to code (and both include options to integrate with GitHub). There is a great how-to guide for making your code citable here.

I encourage you to archive your code and software with a DOI.

Modularity in weighted bipartite networks

My paper on weighted modularity in bipartite networks has just been published online in the journal Royal Society Open Science. The code used to perform the analysis (in R) and the core algorithms (in R, Julia and MATLAB/Octave languages) are up on GitHub (archived on Zenodo*).

Example of a bipartite network, which represents the interactions between two classes of nodes (red and blue).

The aim is to be able to classify a bipartite network (made from interactions between two classes of nodes) into distinct communities aka modules, based on the strength of interactions between nodes (as opposed to just knowing which nodes are interacting or not – which is more commonly used). I use an example dataset of 23 plant-pollinator interaction networks (which show the number of visitations of pollinator species to plant species) to pit three modularity maximising algorithms (QuanBiMo, LPAwb+ and DIRTLPAwb+) against one another. Both binary and weighted forms of these networks were tested (making a total of 46 networks). QuanBiMo is available in the R library bipartite (as the function computeModules) and is based on simulated annealing approaches, whereas LPAwb+ and DIRTLPAwb+ (available as accompanying software to this paper) are based on label propagation methods.
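To make the quantity being maximised a little more concrete, here is a minimal Python sketch of a weighted bipartite modularity score. It assumes the commonly used Barber-style definition, in which each within-module interaction weight is compared with the weight expected from the row and column totals; it is not the LPAwb+ or QuanBiMo code, and the function and argument names are made up for this example.

```python
def weighted_bipartite_modularity(weights, row_labels, col_labels):
    """Modularity of a weighted bipartite network for a given module assignment.

    weights[i][j] is the interaction strength between row node i and column
    node j. row_labels and col_labels give the module of each node.
    Assumed definition: Q = (1/M) * sum over (i, j) in the same module of
    (weights[i][j] - row_total[i] * col_total[j] / M), with M the total weight.
    """
    total = sum(sum(row) for row in weights)
    if total == 0:
        raise ValueError("the network has no interactions")
    row_totals = [sum(row) for row in weights]
    col_totals = [sum(col) for col in zip(*weights)]

    q = 0.0
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            if row_labels[i] == col_labels[j]:
                # Observed weight minus the weight expected by chance.
                q += w - row_totals[i] * col_totals[j] / total
    return q / total


if __name__ == "__main__":
    # Toy network: two plants, two pollinators, visits only within pairs.
    visits = [[3, 0],
              [0, 2]]
    print(weighted_bipartite_modularity(visits, [0, 1], [0, 1]))
```

Replacing every non-zero weight with 1 gives the binary version of the same score, which is one way to see why binary and weighted analyses of the same network can disagree.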

Whilst the dataset is composed of plant-pollinator networks, these algorithms could be applied to any type of bipartite network in ecology, sociology or elsewhere. There are several known challenges associated with identifying community structures in networks, some of which I raise in my discussion. Finding appropriate methods for this problem is therefore important.

 

Each of the three algorithms (QuanBiMo, LPAwb+, DIRTLPAwb+) was run on each network 100 times. The maximum modularity score (from all algorithms) is plotted against the median modularity score for both binary (left) and weighted (right) versions of the 23 plant-pollinator networks. The dotted line indicates the ‘perfect case’ when a modularity algorithm is able to find the maximum modularity score every time it is used.

I find:

– Modular structures in binary networks can appear very different to the modular structures identified using weighted interactions — there may be value in evaluating both of these structures

– In general all three algorithms had strong agreement on smaller networks, but QuanBiMo performed less well (with the default arguments) on larger networks — it is hard to relate appropriate input parameters for QuanBiMo to the properties of the network under investigation

– DIRTLPAwb+ identified the community structure with the highest modularity score in all but 2 networks — DIRTLPAwb+ uses multiple initialisations of LPAwb+ from different initial modular configurations to achieve this

– There was more variability in the modularity scores returned by QuanBiMo than LPAwb+ or DIRTLPAwb+ — it may be necessary to run modularity algorithms multiple times on a network to achieve a robust result as modularity maximising algorithms tend to be stochastic

– LPAwb+ is very fast in comparison to the alternative algorithms and returns modularity scores close to the maximum found — it would be good for exploratory research, especially in large networks, or sanity checking QuanBiMo modularity scores
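The point about running stochastic algorithms several times can be sketched in Python as below. The `toy_algorithm` function is a purely hypothetical stand-in for a modularity maximiser (it just returns a noisy score), and `summarise_runs` is a made-up helper; the idea is only to show the maximum and median being collected over repeated runs, as in the figure above.

```python
import random
import statistics


def summarise_runs(run_once, n_runs=100, seed=0):
    """Run a stochastic algorithm n_runs times; return (maximum, median) score."""
    rng = random.Random(seed)  # seeded so the summary is repeatable
    scores = [run_once(rng) for _ in range(n_runs)]
    return max(scores), statistics.median(scores)


def toy_algorithm(rng):
    """Hypothetical stand-in for a modularity algorithm: a noisy score <= 0.5."""
    return 0.5 - 0.1 * rng.random()


if __name__ == "__main__":
    best, typical = summarise_runs(toy_algorithm)
    # A large gap between the two suggests a single run is not trustworthy.
    print(f"maximum {best:.3f}, median {typical:.3f}")
```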

 

LPAwb+ and DIRTLPAwb+ provide alternatives to QuanBiMo for calculating weighted bipartite modularity – and highlight the potential dangers of relying on one method. While I expect there may be other, potentially better, ways of maximising weighted modularity for bipartite networks (some of which may be simple extensions of existing methods for binary networks), I hope LPAwb+ and DIRTLPAwb+ can be useful tools for those wishing to perform network analysis. You can read the paper here.

 

 

* giving code and other supplementary materials a DOI is critical to making them a persistent resource – it also allows these materials to be citable. If you aren’t doing this, you should be! I think I’ll write a blog post on this topic soon.

Network structure in coevolving communities of bacteria and phage

This blog was originally posted on the Earth System Science @Exeter blog in November 2013.


 

Marine bacteria are a key component of oceanic ecosystems and are important drivers of primary productivity and nutrient cycling. Phages (viruses of bacteria) play a key role in their hosts’ ecology. In addition to aiding the transfer of genes between bacteria, they are also a major cause of mortality: they infect and reproduce inside bacterial cells, which can eventually lead to the cells bursting open, killing them and releasing new phages into the environment. This process is important as it recycles nutrients and may also aid with transport of some material to the deep ocean. Both bacteria and phage evolve on rapid timescales to attempt to evade or exploit the other – however, the basic mode of coevolution between bacteria and phage is unclear. Understanding how these communities interact and respond to each other is therefore an important step towards unravelling the ecological and evolutionary processes in these systems and towards greater biological realism for ocean carbon modelling.

Cross infection data from the Moebus and Nattkemper 1981 North Atlantic dataset reanalysed by Flores et al. 2013. Each spot indicates the presence of infection between a bacterial host isolate (rows) and a phage isolate (columns). Squares highlight the large scale modular pattern, whilst many of the modules contain a nested pattern within them. White spots indicate an infection between two different module classes.

One type of data exploring the community structure between bacteria and phage comes in the form of a binary infection network – formed by checking which bacterial isolates each phage isolate can infect – where presence of infection is indicated by a 1 and absence of infection by a 0. The largest dataset of this kind was collected by Moebus and Nattkemper 1981 in the North Atlantic, and reanalysis by Flores et al. 2013 showed that it displayed what we term a nested-modular structure. This means that there exist specific groupings of phage which can infect specific groupings of bacteria, with few infections between different groupings (modular), but also that within each of these modules there is a gradient in susceptibility to infection among the within-module bacteria and a gradient in ability to infect among the within-module phage (nested). The size of this dataset makes it useful as a means to look for signals of the coevolutionary processes that led to its formation.
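As a rough illustration of what "modular" means for a binary infection matrix, the Python sketch below counts how many infections fall inside a module and how many cross between modules. The tiny matrix and the module labels are invented for this example (they are not taken from the Moebus and Nattkemper data), and the function name is hypothetical.

```python
def count_infections_by_module(infections, host_modules, phage_modules):
    """Count infections within and between modules of a binary infection matrix.

    infections[i][j] is 1 if phage j infects bacterial host i, otherwise 0.
    host_modules and phage_modules give the module of each row and column.
    Returns a (within, between) pair of counts.
    """
    within = 0
    between = 0
    for i, row in enumerate(infections):
        for j, infected in enumerate(row):
            if not infected:
                continue
            if host_modules[i] == phage_modules[j]:
                within += 1
            else:
                between += 1
    return within, between


if __name__ == "__main__":
    # Invented example: two modules of two hosts and two phages each.
    # Inside each module the second host is infected by a subset of the
    # phages that infect the first (a nested pattern).
    infections = [
        [1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 1, 1],
        [0, 1, 1, 0],  # the 1 in the second column crosses modules
    ]
    host_modules = [0, 0, 1, 1]
    phage_modules = [0, 0, 1, 1]
    print(count_infections_by_module(infections, host_modules, phage_modules))
```

A strongly modular network has many within-module infections and few between-module ones.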

We recently published a paper in Interface Focus exploring interaction networks formed by models of coevolutionary dynamics between bacteria and phage, motivated by the question: what mechanisms are required to promote a nested-modular community structure? In this paper we explore what we term the relaxed lock-and-key model, which represents the fitting of phage tail fibres (keys) to bacterial cell receptors (locks) and sustains high diversity of bacteria and phage, and we compare the structural properties of the networks formed in our model with those in the Moebus and Nattkemper 1981 data.

We find that our model networks can create high diversity communities of phage and bacteria with nested-modular structures and that the relaxed lock-and-key mode of coevolution provides a plausible explanation for these features being found in ocean samples. We also highlight how it can be difficult to directly compare experimental and model data, and suggest that a productive avenue for future research will be to look at other large scale cross-infection datasets to see if the nested-modular structures observed in the North Atlantic are characteristic of a coevolutionary signal across phage-host systems. In addition, developing experimental techniques to gain quantitative information about the interaction strengths in these types of data, and analytical techniques with which to analyse them, will be useful for understanding phage-bacteria coevolution.

Online tools for researchers

There are a whole bunch of useful tools and websites out there on the internet that can be useful both for research itself and for sharing research outputs, present and past, with like-minded others! I thought I’d gather some of my favourites that are applicable to the majority of researchers, regardless of field, into this blog post. I feel the internet is a rich resource of information that can (and should) be accessed, added to and interacted with. I have no doubt that there are many other useful things that I’ve forgotten or don’t know about, so tell me what I’ve missed in the comments and I’ll add it to this post!

Literature

If you are like me, searching for papers used to be as simple as going to Google and selecting ‘Scholar’ from the drop-down menu. However, despite some anger from researchers when this happened, this option is now banished deep in the depths of Google’s ‘Even more’ product menu. What is really infuriating is that once you finally navigate to Scholar you have to enter the search terms a second time! I am using the bookmarklet application scholarfy, which makes searching for papers a one-click process once more, which is very helpful!

Social media – sharing and discovering literature

You may have a university profile page that lists what you’re doing, but these can be poorly updated and may be difficult for others to find without a direct link. There are many ways to interact with your research community and profile your work online, but where the majority of discussions and interactions take place probably depends on what you are researching. To enjoy using social media you need to find a site which shares information in a way you are comfortable with, and you should also think about whether you want to separate your personal and research feeds (e.g. Facebook is not something I would myself use for research). Personally I find Twitter very useful for gaining lots of snippets of information from many different users (in many different fields!); much of my reading is supplied by the people I follow, and I like the short format and the ability to follow and unfollow people without the usual social obligations. However, it isn’t for everyone and it isn’t where everyone is! I know some people are active on Google+, and other sites exist. There is also blogging (as you can see), which allows for more extended commentary and provides a good stage for discussion. There are several blogging providers out there, but as this post is on WordPress I will recommend that! Indeed you may want to combine the two – advertising a new blog post on Twitter, for example!

But you may also question how successful your social media strategy is. You could share your content using tools such as Google’s URL shortener (there are many more), which provides a shortened version of a web address. This is particularly handy on Twitter, where space is limited to 140 characters per tweet, and it also lets you see analytics about the users who click your link, such as the number of clicks, their geographic location and where they clicked the link! Another handy tool is Altmetric, which is built on the belief that, in a socially connected internet, the sharing of research articles is a type of citation metric. It uses the DOI (digital object identifier) of an article and collates links to it from across several social networks, giving each DOI a score based on both how many times it has been shared and on how many different social media platforms. They also provide widgets that can be dynamically embedded on websites and blogs.

Landing page

With all the social media there is(!!!), it is a good idea to have a centralised landing page where users can see all the different online profiles and pages that you use for research purposes and navigate to the ones they prefer. This also allows you to be passive on many of these sites: people can discover you and the link to your landing page there, but you don’t need to actively push information to them. It’s also worth mentioning here that some blogs, e.g. WordPress, allow you to automatically push new blog posts to several different accounts. You might want to create a landing page for yourself (I have!), or you could quickly create something that looks quite snazzy on something like about.me, for instance. Not only does this allow you to gather all your links into one place, but it also confirms to others that you are the you that they think they know!

Names

On a similar note, for many people it is very likely that there is another researcher with the same name as you (there is a Stephen Beckett studying the science of chocolate, for example! Yes, I am jealous.), or if not there could be in the future. For some unlucky people there may even be two people with the same name researching in the same field! Having an ORCID profile and ID makes it possible to distinguish which research articles belong to you and which do not. ORCID is also integrated with figshare (see Storage) as well as many journals; this could be useful if you do become particularly prolific, and it is something that should be linked with Google Scholar profiles.

Storage

We’ve all heard the stories of people losing all their work just before an important deadline, but things don’t have to be like that anymore – make sure you back up your important files somewhere! That somewhere could be a USB or external hard drive, but there are plenty of places online that allow you to store your data in the cloud for free. Dropbox (2GB + 500MB per referral), Google’s Drive (15GB) and Microsoft’s Skydrive (7GB) are all potential places to look after your files, and they also allow you to share folders with other selected users. Indeed there may be files in a completed state that haven’t been published in a journal but may be of benefit to your research community. Things like datasets, code, posters, presentations, images and media are all valid research outputs that you could be sharing. figshare allows you to upload these kinds of files to the web and gives each a DOI, or a permanent website address, which isn’t strictly necessary for citations, but it does help make them more widely available and accessible! In this way figshare also acts as a preprint server in a similar way to arXiv, but it is not limited to articles, nor to particular types of article. They also have a great widget which allows you to dynamically embed your uploads in your own website or blog, such that they can be previewed in the browser. To prove its effectiveness I shall archive this post on figshare.

Versioning

I’m also starting to get excited about GitHub, which is an online repository primarily used for code, but I’ve also seen examples of people using it to write articles. GitHub tracks changes line by line between different uploads of your files (the different versions), which allows you to see what’s changed and also allows several people to work on the same project, presumably at the same time, without worrying about overwriting someone else’s edits. It is also possible to copy (or fork) other people’s projects to make your own edits, which could be merged with the original at a later date or treated as a separate branch. I’ve only just started trying it out, but it seems really cool!
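To give a feel for what tracking changes line by line looks like, here is a minimal Python sketch using the standard library’s `difflib` module. This is only an illustration of the idea of a line-based diff; it is not how Git or GitHub work internally, and the helper name `line_changes` is made up for this example.

```python
import difflib


def line_changes(old_text, new_text):
    """Return the removed ('-') and added ('+') lines between two versions."""
    diff = list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile="old",
        tofile="new",
        lineterm="",
    ))
    # The first two entries are the '--- old' / '+++ new' file headers;
    # after that, keep only the lines marked as removed or added.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


if __name__ == "__main__":
    version_1 = "x = 1\ny = 2\nprint(x + y)"
    version_2 = "x = 1\ny = 3\nprint(x + y)"
    for change in line_changes(version_1, version_2):
        print(change)
```

Only the line that differs between the two versions is reported, which is what makes it practical for several people to review each other’s edits.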

Online Documents

It is also now possible to work on spreadsheets, presentations and other documents on the web. Analysing the pros and cons of these might be a blog post in itself, so I will just quickly mention that both Google and Microsoft offer online Office tools and that others exist, including the popular zoomable presentation tool Prezi, which is certainly worth a look. I also wanted to mention writeLaTeX, which is based on LaTeX, software popular mostly amongst mathematically minded researchers because of how nicely it formats equations. writeLaTeX adds a dynamically updating previewer and automatically debugs the article code, which makes it much improved from when I last used LaTeX in an offline state! In addition, it has many journal templates ready to use, includes options to collaborate on articles online with others (at the same time), generates PDFs very quickly and allows you to push your articles to figshare (which I have done with this post) or F1000 Research very easily.

Conclusions

To reiterate what I said at the top, this is in no way a complete list of the types of tools that are available on the internet that could be useful to researchers. But if I have missed something glaringly obvious or you have other comments, please comment below!

 

 

UPDATE: I forgot to include the link to my figshare version of this document, it can be found here:

Online tools for researchers. Stephen Beckett. figshare.
http://dx.doi.org/10.6084/m9.figshare.757780

Uncertainty In Interaction Networks

I was across in Bath last week for a meeting titled ‘Uncertainty in Interaction Networks’, a launch event for Bath’s new centre of network science (and collective behaviour!). I had the chance to show off my latest poster and I thought the quality and diversity of the speakers was excellent. There were participants from a wide range of fields and disciplines all interested in using networks as part of their research; whether it be transmitting data across the internet in an efficient way, understanding the processes leading to animal group social behaviour, or analysing how financial markets think and interact.

The field of network science is relatively new; and interdisciplinary at heart. It was great to hear many of the speakers mention this and how approaching networks from slightly different disciplinary angles with different objectives has really helped push the field to where it is today and provides a great opportunity to collaborate on interdisciplinary problems.

My particular highlight was Sheri Markose’s talk on financial markets and how they should be regulated. It turns out that most banks have been thinking about risk and uncertainty only in connection to themselves rather than at the scale of the larger collective network. This led to the creation of a few large, interconnected hubs, which meant that when one bank failed the rest of the system was highly susceptible to the resulting aftershocks in a ‘financial contagion’, as witnessed in the recent recession. It also turns out that standard measures are not good at providing early warning signals – a story that is ringing true for a variety of systems. However, looking at eigenvector and eigenvalue ideas, which are closely tied to mathematical ideas of stability, may provide such an early warning system and a means of regulating members of the industry, such that banks with more dominant eigenvalues should be taxed more. Robert May, famed for forging progress in many areas of mathematical and community ecology, also spoke of market stability and how the perception of sharing risk across the market was not quite what it seemed. Derivatives may be a risky business! He also gave some advice for people entering a new field of research or starting on a new problem: one should think first, then draw up an idealised schematic or toy model of the problem, before jumping into the literature and asking the experts what they think.

Several people were also using Twitter as a large social media dataset. Yamir Moreno analysed data related to the 15-M Spanish revolution (related to the Occupy and Arab Spring movements of 2010-11), while John Bryden showed that, by taking Twitter as a whole network, it can be decomposed into closely interacting groups that share a common sociolect which reflects their shared interests – whether that be the education profession or sharing their love of J. Bieber!

It was a well organised meeting with lots of inspiring talks, which are meant to become available online at some point in the near future. Thanks very much to the organisers for putting together such a great event and helping me see networks everywhere!

Phage therapy and ‘sloppy journalism’?

I recently read this article on the potential of phage therapy in medicine. At face value, it seems a nice piece giving a little background to the concepts and gives a human perspective through the quotes of several scientists. However, I was really struck by how badly the science of bacteria-viral interactions was reported. As I am actively researching in this area it could well be that I am excessively biased and nit-picking to what I consider bad descriptions, but there are parts of this article which made me squirm!

“patients are treated for all kinds of bacterial infections with viruses called phages.”

Phages (short for bacteriophages) are the viruses that infect bacteria. That is definitely me being picky, but further down in the article it says:

“Dr Hoyle says that one of the advantages of the viruses over antibiotics is that they target only the harmful bacteria.”

By itself this is an incredibly misleading sentence and implies that all viruses only infect bacteria which are harmful! If we look at what Dr Hoyle actually says:

“It doesn’t have the side-effects or the negative aspects of antibiotics, like diarrhoea, because of its high specificity. It’s not the silver bullet that antibiotics are, but it has its advantages as it works well on chronic infections. It enters the site and continues to do its work even after application.”

In fact Dr Hoyle is describing the effects of a particular solution of many types of phage, known as a phage cocktail, which has been specifically tailored for a certain type of bacterial infection. The preceding sentence written by the journalists is misleading and falsely generalises Dr Hoyle’s words into a claim that was not made. This can be understood by getting to the essence of why these treatments use phage cocktails in the first place – why not just use one type of phage?

By Dr Graham Beards (en:Image:Phage.jpg) [Public domain], via Wikimedia Commons

Populations of phage and bacteria are not genetically identical – some genetic diversity exists. This leads to cohabiting communities of bacterial hosts and their infecting phage becoming drawn into a coevolutionary arms race, where bacteria can mutate and evolve greater resistance to the phage in one way or another (there are a multitude of potential ways this can happen, which I will save for another post) and phage, which also mutate, are able to respond to these changes and increase their ability to infect. By picking the right phage for use in a phage cocktail it is possible to deal with the inherent genetic diversity of the harmful bacteria and limit the possibility of the bacteria gaining resistance. The point in treating any bacterial infection is that the bacteria are eliminated or sufficiently suppressed before they can gain resistance to the treatment through mutation, which is why there is now growing concern about the lack of new antibiotics. It is the phages’ ability to create these coevolutionary arms races with bacteria which makes them appear an attractive treatment strategy, one that is regaining interest in medical, agricultural, military and industrial professions: unlike antibiotics, they can react to genetic changes in the bacterial population. This should have been mentioned somewhere in the article, as it is the main reason why phage therapy has potential!

However, this ability is double-edged; coevolutionary arms races can lead to diversification of bacteria and phage communities and, as evolution is undirected, this could lead to bacterial species that are more harmful to human health.

Treating a bacterial infection effectively with phage, then, is best done using a phage cocktail of many types of phage that can infect the infecting bacteria. However, this does not guarantee success. Indeed there is still much research being done to assess the potential performance of phage therapy against particular bacterial infections and to investigate its potentially confounding effects. Another factor to consider, as pointed out to me by @NotQuitePhil, is the potential for toxic shock, caused by an increased concentration of toxins produced by ‘stressed out’ bacterial cells in response to the treatment (being infected essentially starves the cells – their resources are rerouted into creating phage progeny; though this is not necessarily the driver for the response). This is also a common problem when considering the use of antibiotics.

It may not seem like such a big error to you, but what I see as being a false attribution makes Dr Hoyle look bad; and in turn it reflects poorly on the article, its authors and the other quoted sources. Furthermore it provides the audience with false information! This is bad for a feature article. I know it is not always going to be possible, especially when reporting breaking news, but I really think journalists should ask for an expert proofread of their article before it goes to the presses – who better than one of the quoted sources themselves, who are quoted as experts and will not want their views or their field of study miscommunicated to the general public? The point is that sloppy facts should be avoided where possible; and they can be easily avoided. This approach could improve scientist-media relations as well as the strength, communication and content of such articles themselves!

The other option of course is to allow the audience to interact with the article and alert the authors if errors are found. I haven’t had this blog checked before pressing send, so if I have factually/grammatically or otherwise gone wrong please let me know!