Monday, November 30, 2009

DRAMBORA

In April 2008, DRAMBORA Interactive was released. Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) is a methodology and online tool that includes a form-based interface, peer comparisons, reporting mechanisms, and maturity tracking. The tool requires repository personnel to describe their repository's characteristics. The goal is to create an ever-evolving ontology of repository attributes through which repositories can be compared to similar repositories. The toolkit was developed by the Digital Curation Centre (DCC) and DigitalPreservationEurope (DPE). It is not intended to compete with audit checklists such as nestor or TRAC; rather, repositories are encouraged to use those checklists in conjunction with DRAMBORA.

DRAMBORA is a two-phase self-assessment tool. The first phase results in a comprehensive organizational overview. The second phase involves risk identification, assessment, and management. Through this bottom-up approach, repositories evaluate themselves within their own contextual environment and can then compare their characteristics to those of other repositories. According to this article, DRAMBORA’s creators are also developing “key lines of enquiry,” sets of questions that will guide auditors within an organization to focus on significant issues or risk factors.

The overall idea is that a successful digital repository is one that plans for uncertainties, converts them into risks, and then manages those risks. This is a cyclical process in which a repository reduces its level of risk with each iteration. Participation is not rewarded with a certification or endorsement; repositories benefit by strengthening their self-awareness and their ability to identify and manage risks. This enables them to present information about their repository that makes them appear more approachable and trustworthy. The creators argue that DRAMBORA “offers benefits to repositories both individually and collectively” in that it opens up lines of communication between repositories. DRAMBORA also facilitates the classification of digital repositories, so that services and characteristics are more easily communicated to an audience.
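To make the "convert uncertainties into risks, then manage them" idea a bit more concrete, here is a minimal sketch of the kind of scoring a self-audit cycle like this involves. It assumes a simple probability-times-impact severity score, a common risk-assessment convention; the scales, categories, and example risks are placeholders, not DRAMBORA's actual ones.

```python
# A minimal sketch of risk scoring in a self-audit cycle. The scales and the
# example risks are illustrative placeholders, not DRAMBORA's actual values.
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    probability: int   # e.g. 1 (rare) .. 6 (almost certain)
    impact: int        # e.g. 1 (negligible) .. 6 (catastrophic)

    @property
    def severity(self) -> int:
        # Severity as probability x impact, a common risk-assessment convention.
        return self.probability * self.impact

risks = [
    Risk("Loss of key technical staff", probability=3, impact=4),
    Risk("Storage media obsolescence", probability=4, impact=5),
    Risk("Funding shortfall", probability=2, impact=6),
]

# Each audit iteration: rank risks by severity, apply treatments, then re-score.
for risk in sorted(risks, key=lambda r: r.severity, reverse=True):
    print(f"{risk.name:35s} severity = {risk.severity}")
```

After treatments are applied, the repository would re-estimate probability and impact and repeat the cycle, which is the iterative risk reduction described above.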

The website states that the purpose of the DRAMBORA toolkit is to facilitate the auditor in:
- Defining the mandate and scope of functions of the repository
- Identifying the activities and assets of the repository
- Identifying the risks and vulnerabilities associated with the mandate, activities and assets
- Assessing and calculating the risks
- Defining risk management measures
- Reporting on the self-audit

There is an offline version of DRAMBORA that you can download from the website, but both options require registration. While it is a European initiative, DRAMBORA has been implemented at numerous repositories in the US and Europe.

Tuesday, November 24, 2009

Two Views on Digital Curation

The two articles I looked at for this post both concern digital curation. One is an article in Advertising Age called “Digital Curation Is a Key Service in Attention-Strapped Economy”; the other is a response to it at a blog called Resource Shelf, which describes itself as a “daily newsletter with resources of interest to information professionals, educators and journalists.”

The first article is written by Steve Rubel, who is a marketing strategist and a blogger. In his article he talks about how the internet offers an endless supply of choices that far exceeds our ability to pay attention to it all (basically, he is saying that supply is outstripping demand). He mentions how dominant both Facebook and Google have become. Rubel then goes on to say that no matter how dominant these two sites are, they can’t hold our complete attention, in part because they “are often a mile wide and half-an-inch deep.” This leads him into bringing up digital curation and discussing how brands (like IBM, UPS, and Microsoft, some of the example sites and "curation" projects he gives) are starting to curate digital information to help people "find the good stuff." He doesn't spend much time discussing his thoughts on brands being curators, though, which is disappointing because the idea is intriguing.

Rubel closes by saying that both human-powered and automated digital curation “will be the next big thing to shake the web.”

I enjoyed seeing a non-information professional’s take on the internet and curation. And it was especially interesting to me how Rubel was using the term digital curation. He never really gives a definition, and seems to make it sound like a fairly simple thing, when in fact it is pretty confusing and complex. I also find it perplexing that in Rubel’s take on digital curation it is journalists who will be playing a key part in it all. I have a journalism background (and a degree in the subject that I will never use), and I can’t imagine journalists doing half the things we have discussed in class about curating information. Rubel says that journalists won’t be the only ones taking part, but he completely fails to mention information professionals at all in his article, which is what the Resource Shelf blog entry responded to.

The Resource Shelf writer says they were sad to see that librarians and information professionals were left out of Rubel’s article. Sadder still, Resource Shelf says that librarians and information professionals are often forgotten by those outside of our field when it comes to discussions like these – which seems ridiculous, because who would be better at curating information than information professionals? The response also goes on to talk about how librarians have been “curating” digital information for years (which we’ve learned) and about how collection development will become a form of digital curation in the future, which I hadn't given much thought to before.

Apparently the blogger at Resource Shelf actually emailed Rubel about his article to invite him on “a virtual tour of some of the resources librarians have been curating for years.” Hopefully Rubel responds, because that could result in an enlightening discussion.

Anyway, it was just interesting to read these two different takes on digital curation. Both writers agree that digital curation is a worthy goal to work towards.

Shared Names Project: Linking Biomedical Databases

I stumbled across an interview with science blogger Walter Jessen that is full of interesting gems related to this class. I'm going to pick the Shared Names Project to talk about in depth here. The Shared Names Project is attempting to assign URIs to publicly available biomedical database records and publish RDF documentation about those records. In doing this, they hope to make it easier to link data sets and tools across projects. Without shared URIs, a lot of mapping is necessary to get two data sets to link up. With shared URIs, pieces of one data set can easily be extracted into another.

Each URI will point to an RDF document that is hosted on servers maintained by the project. Within the RDF document, there will be documentation about what the URI denotes (what database it is), links to various versions of the records (XML, ASN, HTML), links to corresponding resources that use the shared naming scheme, links to external resources ("For example, the RDF for PubMed record 15456405 could link out to the iHOP page for the article described by the PubMed record."), and other information that might be useful to humans or computers.
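To give a concrete feel for the kind of RDF document described above, here is a minimal sketch built with Python's rdflib. The base namespace and the choice of predicates are invented placeholders, not the project's actual naming scheme; the PubMed record number 15456405 comes from the quoted example.

```python
# A minimal sketch of the kind of RDF description discussed above, built with
# rdflib. All URIs and predicates here are illustrative placeholders, not the
# Shared Names Project's actual scheme.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDFS

SN = Namespace("http://sharedname.example.org/record/")   # hypothetical base
g = Graph()

record = SN["pubmed/15456405"]           # shared URI for the PubMed record
g.add((record, RDFS.label, Literal("PubMed record 15456405")))
g.add((record, RDFS.comment, Literal("Denotes a record in the PubMed database")))
# Link to a rendering of the record itself (XML, HTML, etc.)
g.add((record, RDFS.seeAlso,
       URIRef("https://pubmed.ncbi.nlm.nih.gov/15456405/")))
# Link out to an external resource about the same article (e.g. iHOP)
g.add((record, RDFS.seeAlso,
       URIRef("http://www.ihop-net.org/")))

print(g.serialize(format="turtle"))
```

The point of the shared URI is that any data set that wants to talk about that PubMed record can use the same identifier, so linking data sets becomes a matter of matching URIs rather than building custom mappings.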

The project has a wiki that discusses the cyberinfrastructure (technical and administrative) issues they're working out. One technical issue I think I teased out from the Use Cases page is that some "records" may be contained in other records (just as an image may be contained inside an article but is still a resource in itself), so how do you indicate the level of granularity you want? They also discuss what metadata they should provide for each URI. I don't fully understand how one would use these shared names in a practical sense, but I thought it might be worth pointing out since we've talked a bit (albeit vaguely) about data sets talking to each other. Similar projects mentioned on the Shared Names wiki are MIRIAM and Integr8, though I don't know how this project and those are similar or different.

Monday, November 23, 2009

Digital Curation & User Testing

Marchionni, Paola. “Why Are Users So Useful?: User Engagement and the Experience of the JISC Digitisation Programme.” Ariadne, no. 61 (October 2009). http://www.ariadne.ac.uk/issue61/marchionni/.


This recent article by JISC's Paola Marchionni refocuses attention on the purpose of curated digital collections: that is, their use by different user groups. Marchionni begins by noting that many digitization projects are still not paying enough attention to their users and their users' needs. They become so caught up in trying to make their content accessible online that they don't adequately research their key users. As a result, many publicly-funded projects are going un- (or under-) used. Marchionni illustrates the insight users can provide by presenting two case studies of projects that incorporated users into their development process: the British Library's Archival Sound Recordings 2 (ASR2) project (a collection consisting of over 25,000 recordings) and Oxford University's First World War Poetry Digital Archive (WW1PDA) (a collection that contains over 7,000 items pertaining to WWI poets, including digitized images of materials held at UT's own Harry Ransom Center).

Though Marchionni's article helpfully reminds digitization (and digital curation) projects to keep their electronic eyes on the prize and really take their users into account, I'm not sure that much of what Marchionni presents in her list of suggestions for user engagement is particularly surprising. She recommends first recognizing the importance of interacting with users and even having an "Engagement Officer" position as the ASR2 project did. She also advises establishing an early and ongoing relationship with users. The WW1PDA project, for example, developed a typology of users, with a steering committee of scholars in the field of WWI literature advising which materials should be digitized and participating in quality control, and a separate group of secondary school and higher education instructors helping to develop and offer feedback on the education section of the project. Marchionni also emphasizes the importance of knowing what to do with user feedback. When users expressed anxieties about the integration of Web 2.0 tools out of fear that they might undermine the authority of the WW1PDA archive, the project decided to integrate such functionality in a way that made the lines between the archivists' and the users' contributions clearer.

Some of the more interesting lessons regarding users came from the WW1PDA's approach to educational resources. The project held workshops for teachers in order to discover what functionality this group would like to see on the site. In a rather ballsy move, the project then asked the workshop members to help author a number of learning resources for the website. Though this did result in the creation of some resources, ultimately the project realized that perhaps it had overreached in what it was asking their busy users to produce. (Frankly, I would be a little annoyed if I agreed to participate in a workshop on a new resource and then came away having been assigned the time-consuming "homework" of creating a bunch of resources for that project).

Though asking users to create lessons plans and other teaching materials was not as successful as the WW1PDA project might have hoped, users were willing (and excited!) to contribute materials from their own familial archives to the project. In fact, the project received such a high level of response to their requests that they held extra workshops to help the public digitize their items.

Some of Marchionni's suggestions seem to blend user engagement and marketing. For example, the WW1PDA's teachers' workshops seem to have functioned in part as a source of user feedback, but also as a forum for promoting and publicizing the resource. Teachers were seen as the key to two user groups: teachers and students. Similarly, Marchionni also suggests targeting any information dissemination activities at specific user groups. The ASR2 project, for example, publicized its Holocaust collection by contacting networks for historians and those in the field of Jewish and Theological Studies. Though this may seem more like advertising than user engagement, it's nevertheless important to remember that we sometimes need to market our resources if we want them to be used. Finally, after highlighting the importance of user engagement, Marchionni ends with a reminder not to lose sight of the project's mission: though it's important to listen to user feedback, we shouldn't be bullied by it. Focus on the needs of one's primary users and keep in mind that you can't satisfy everyone.

One thing that I wish Marchionni had addressed in greater detail is the expense involved in maintaining a high level of user engagement. Obviously it's more expensive in the long run to pour money into a resource that doesn't get used than to devote some money to engaging users, but nonetheless creating sustained relationships with users can be a drain on already strained budgets and staff schedules. I'd love to hear more about how small projects, or ones with meager financial resources, might effectively develop ongoing relationships with users.

Friday, November 20, 2009

The Relevance of Twitter

This has been a Twitter heavy semester for me, and admittedly for a while I was thinking something along the lines of "you know, does anyone actually use this tool for anything beyond mindless banter?" A few people have questioned the relevance of Twitter to my face. The usual refrain goes something like "you know Twitter is talked about a lot, but I don't think anyone actually uses it for anything useful." Then there might be a mention of the much lower retention rate of Twitter vis-a-vis Facebook (the two are always compared though I'm not entirely sure why considering that Facebook is a social networking tool and Twitter is a micro-blogging tool. Different things.)

Leslie Carr, on his blog RepositoryMan, recently confessed to having similar doubts about the utility of Twitter, asking himself if it wasn't just "some gratuitous teenager technology?" So he conducted a study. At a recent CETIS conference, Carr used the Twitter API to aggregate all of the tweets from the conferees. From these tweets he wanted to determine how many were, on the one hand, "technical/academic/professional," and on the other, "personal/informal/gossipy." Although he created other categories for the tweets, Carr was clearly interested in quantitatively studying the relative "significance" of the informational value of the tweets from the CETIS conference.

From his analysis, Carr determined that 70% of the tweets provided the sort of "informational" value that he was looking for and that about 41% of the conference attendees contributed tweets that were either "entirely" or "mainly" informational. Carr doesn't go into much depth about the criteria he used to determine the relative informational value of the tweets, though he did admit that "useful information" was information that was useful "to him."
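For a sense of what this kind of tally looks like in practice, here is a minimal sketch of the sort of counting Carr describes. The tweets, author names, and category labels are made up for illustration; they are not his data or his coding scheme.

```python
# A minimal sketch of tallying tweet categories, in the spirit of Carr's
# analysis. The tweets, authors, and labels below are invented placeholders.
from collections import Counter, defaultdict

# (author, category) pairs, where each tweet has been hand-labelled.
labelled_tweets = [
    ("alice", "informational"), ("alice", "informational"),
    ("bob",   "informational"), ("bob",   "gossip"),
    ("carol", "gossip"),        ("carol", "gossip"),
    ("dave",  "informational"),
]

# Share of tweets that are informational.
counts = Counter(category for _, category in labelled_tweets)
total = sum(counts.values())
print(f"informational tweets: {counts['informational'] / total:.0%}")

# Share of attendees whose tweets are mainly informational.
by_author = defaultdict(list)
for author, category in labelled_tweets:
    by_author[author].append(category)

mainly_informational = sum(
    1 for cats in by_author.values()
    if cats.count("informational") / len(cats) > 0.5
)
print(f"attendees mainly informational: {mainly_informational / len(by_author):.0%}")
```

The hard, subjective part is of course the hand-labelling step, which is exactly where Carr's "useful to him" criterion comes in.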

It's an interesting study, though after reading this post, I have to say that it seems that even the tweets which Carr did not think had "informational value" (e.g., tweets about the poor quality of wireless connectivity at the conference) could be very useful for other audiences. The conference organizers might have found the gripey tweets about the wireless issues very useful, as do businesses which are now mining Twitter for information regarding reactions to their products. The take-away from reading this study, for me, was actually that tweets can provide useful information to any Twitter user given the right circumstances and conditions. This is not to say that I think every tweet is useful. Tweets about Megan Fox can almost always be ignored. Although, I don't know. Maybe in twenty years, a Media Studies or Gender Studies researcher might even find the Megan Fox tweets to be of some value.

Thursday, November 19, 2009

Google Swirl

I am consistently excited by new developments in visual browsing and searching on the web. Google's newest development to come out of Google Labs, Google Image Swirl, is very provocative and is very close to something that I have been imagining a need for. It would be fascinating if the application could be employed on pre-curated collections of images (i.e., to "Google Swirl" a collection of visual art, such as ARTstor). But first, what is Google Swirl?

A bridging of Picasa's face recognition and Google Similar Images, Image Swirl produces search results that depend upon both image metadata and computer vision research. There are comparisons to Google's Wonder Wheel (which displays search results graphically) and Visual Thesaurus. One enters a search term and 12 groupings of images appear, visualized as photo stacks. One chooses a particular image and the experimental Flash interface "swirls" to display that image and branches out to numerous other images with varying degrees of relationship to it.

Wednesday, November 18, 2009

Automated Data Processing: Too Big for Our Puny Brains

Over my past few blog entries I've been thinking more and more about automated processing of scientific data, as well as distributed efforts to handle massive backlogs of observations. Science is now generating such huge data sets that intelligently querying them in any manual fashion is impossible. For example, CERN's Large Hadron Collider will be producing 40 TB of data every single day. For these reasons, if we hope to make any progress with scientific data, we'll need to employ AI (artificial intelligence) as a new kind of scientific tool. Maybe this topic seems to be at the far reaches of what we've been discussing in class, but I think it is a useful look at just how these datasets will actually be used. Instead of human searchers crawling the databases, we'll see automated systems trying to distill knowledge from data bits.

An article from this month's Communications of the ACM describes the efforts of two computer scientists from Cornell University to build a new machine learning system. Whereas older machine learning systems sought to build predictions based on data, this new system looks for basic invariant relationships, such as the conservation of energy, that hold constant across observations. This kind of system could formulate scientific laws as basic as the law of gravitation.

The breakthrough in this case comes from giving computers relatively little starting information and instead allowing them to derive rules and theories as they proceed, testing different ones and weighing their success in describing the situation. Other scientists have expressed interest in this system, because it is also scalable across domains. The ACM article on this mentions that this system automates the final part of the traditional scientific paradigm: from observational data to model formulation, to predictions, to laws, to explanatory theories.
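To illustrate the general idea of "deriving rules from data and weighing their success," here is a minimal sketch, not the Cornell system itself: it simulates a pendulum, proposes a few candidate expressions, and keeps the one whose value stays nearly constant across the observations (an invariant, here the energy). The candidate expressions and the scoring rule are my own placeholders.

```python
# A minimal sketch of hunting for invariant relationships in raw observations.
# Not the Cornell system: the data, candidates, and scoring are illustrative.
import numpy as np

# Simulated observations: angle (theta) and angular velocity (omega) of a
# frictionless pendulum with length L and gravity g (small-angle solution).
g, L = 9.81, 1.0
theta0 = 0.5                                  # initial angle (radians)
t = np.linspace(0, 10, 1000)
theta = theta0 * np.cos(np.sqrt(g / L) * t)
omega = -theta0 * np.sqrt(g / L) * np.sin(np.sqrt(g / L) * t)

# Candidate expressions a "machine scientist" might propose.
candidates = {
    "omega**2 - g*theta": omega**2 - g * theta,
    "theta * omega": theta * omega,
    # Energy per unit mass (kinetic + potential, up to a constant offset).
    "0.5*L**2*omega**2 - g*L*cos(theta)":
        0.5 * L**2 * omega**2 - g * L * np.cos(theta),
}

# Score each candidate by how little it varies relative to its magnitude;
# a true invariant (like total energy) should have near-zero variation.
for name, values in candidates.items():
    spread = np.std(values) / (np.abs(np.mean(values)) + 1e-9)
    print(f"{name:40s} relative spread = {spread:.4f}")
```

The real system searches an enormous space of candidate expressions rather than three hand-picked ones, but the scoring idea, keeping expressions that remain invariant over the data, is the same in spirit.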

Interestingly, the end of the article hints that the results of this system may be beyond human understanding. Interpretation of the results and their meaning may not be possible in human terms. This raises the question, with all this data, what good is it? How does it further our knowledge? How do we even know if the results are correct, if we can't understand them?

Digital Scholarly Communication Projects List

I mentioned a couple of weeks ago that I did an independent research project last spring about new forms of scholarly communication. In the spirit of academic sharing, here's my googledocs spreadsheet with about 75 or so different projects I found and related notes. Halfway through the semester I decided that there was too much to talk about with any depth, so I narrowed my project to be about emerging forms of peer review. I discussed things like rating systems, online comments, different kinds of blinded review, and legitimization and trust. I think the conclusion of the paper was a little weak because it turns out (gasp!) that the promotion and tenure system isn't really capable of handling new forms of peer review. I came to realize that the projects in the spreadsheet containing some element of peer review, whether formal (formal as in, I guess, journal sanctioned) or informal, are really just isolated experiments without any measurable influence in the larger scheme of things. It would have been more useful if I had been able to talk to some of the people that contribute to these projects to see if they found them valuable or not. Here are a few selected projects you guys might be interested in.
  • gpeerreview: Google's unfinished answer to peer review, involving getting endorsements from "endorsement organizations," graphing those endorsements, and then providing some kind of credibility ranking
  • Faculty of 1000: online research tool that highlights the most interesting papers in biology, based on the recommendations of over 1000 "leading scientists"
  • MONK: an open source "digital environment" that humanities scholars can use to analyze patterns in texts
  • www.myexperiment.org: scientists contribute their scientific workflows (I assume this is like sharing their lab notebooks) so that others can use them, also see UsefulChem
  • SciLink: Facebook for scientists, but instead of using email contacts to mine your connections like Facebook does, it uses bibliographies from articles

PLANETS: integrated services for digital preservation

PLANETS (Preservation and Long-term Access through NETworked Services) is a European initiative to provide long-term access to large digital collections. I originally came across an article and followed up by looking at the website for more information. PLANETS seems to be predominantly about tool development with the goal of creating a sustainable framework for “increasing Europe's ability to ensure access in perpetuity to its digital information.” The project began in June 2006 and is funded by the European Union under the Sixth Framework Programme.

Planets is not a repository project but expects each participating institution to maintain storage for their digital data. The goal is to work toward preserving entire collections, not just creating stand-alone applications that can handle one aspect of data preservation, such as migration or emulation. The project is a collaborative effort based on the idea that no single institution is going to be able to handle the level of development needed. The initiative is drawing from the expertise and experience of numerous partners in different countries.

The website explains the deliverables:

- Preservation Planning services that empower organisations to define, evaluate, and execute preservation plans
- Methodologies, tools and services for the Characterisation of digital objects
- Innovative solutions for Preservation Actions: tools which will transform and emulate obsolete digital assets
- An Interoperability Framework to seamlessly integrate tools and services in a distributed service network
- A Testbed to provide a consistent and coherent evidence-base for the objective evaluation of different protocols, tools, services and complete preservation plans
- A comprehensive Dissemination and Takeup program to ensure vendor adoption and effective user training

After hearing a presentation on digital preservation initiatives in the US by classmates in another course, I am interested in the differences between the US and the EU that make this kind of large-scale collaborative project possible in Europe. Reflecting on the Larsen article, On the Threshold of Cyberscholarship, I realize that PLANETS falls clearly into the research stage of activity, where tool development is key to the success of a collective infrastructure for access to digital materials. I get the impression that European institutions, and maybe scholars, are more likely to achieve success in the area of cyberinfrastructure development. Is this because of funding, social behavior, or scholarly expectations?

Free Access to the Web

For my final blog post, I found an old (by internet standards) article from 2000 that views the entire web as one low-cost library and discusses the marvel that much of the access to it is free. Arms, writing at Cornell, envisions the web as replacing libraries in many people's lives. Whereas libraries, and especially research libraries, are incredibly expensive and limit access to their members, the web is self-funding (in that individual "publishers" pay for that privilege) with free access for anyone with an internet connection. (Of course, one could argue that this doesn't really constitute free...) Furthermore, Arms concludes that although the expected model of information provision on the internet was fee-based subscriptions, it turned out that there is enough free information of quality out there to obtain genuine substitutes. The example he gives is that Cornell provides legal sources via the web, including case reports, hitherto only available with an exorbitantly expensive Westlaw subscription, and one need not be affiliated with Cornell to access it.

Another point that Arms makes is that digital libraries can be nearly completely automated. He asserts that a brute force search, such as that provided by Google, with enough information and in the hands of a good researcher, can actually be much more powerful than an intelligent search by trained librarians. (While I don't like the implications for library services, it does seem like a lot of the focus on the need for reference librarians is in terms of novice or amateur researchers...) This automation actually increases access, as you no longer need to work through a small group of homogeneously trained elites.

Arms identifies two issues or potential problems for further research. One is ensuring the quality of information. This role was traditionally performed by the publishing process, but with the self-publishing afforded by the web, we can no longer count on good publishing practices. The second is permanence. Flip a switch on a server, and its information vanishes.

I found this article interesting mostly because of its now somewhat historic outlook. Some things have not turned out as Arms saw them in 2000, namely the level of free access. As we see with the Google Books project, proprietary interests are finding their way into the new cyber-reality, and the Great Copyright War has yet to be fought. Interestingly, though, Arms did identify two key issues that continue to be relevant: trusting found information, and ensuring its permanence. I suspect that the best answer to the former is via education of the public. At some point, the onus has to be on the searcher. The second problem, in my mind, is much more problematic, and it is one that plagues the physical as well as virtual information worlds. All in all, I found this early article quite interesting.

Tuesday, November 17, 2009

Aquatic and Riparian Effectiveness Monitoring Program

After talking about the problems facing ecological data, I wanted to read a little about the data collecting work my friend did over the summer for the Aquatic and Riparian Effectiveness Monitoring Program (AREMP) in Oregon and see how that data fits in with the Long Term Ecological Research program. AREMP surveys 250 watersheds in the northwest, and its collection practices (including what photos to take, how to record coordinates, and how to use site markers) are described on its site. To collect the data, AREMP has to enlist a number of people like my friend to collect watershed samples using GIS. Although 250 is a lot of watersheds, it represents only about 10% of the watersheds in the area being sampled.

The goal of AREMP is to use a decision support model to evaluate watersheds for overall watershed condition. A number of attributes are assigned to each watershed and once all the attributes for a watershed are sampled the data is aggregated to determine a watershed score. To aggregate the data and find a score, AREMP uses software called Ecosystem Management Decision Support (EMDS) which creates the model and then assesses the condition of the watersheds based on the data. AREMP says that they would be happy to share their data with anybody who would like to see it.

EMDS is pretty interesting; the EMDS document says, "EMDS does contain tools for conducting “what if” scenarios. For example, one can estimate how watershed condition will improve if 500 pieces of large wood were added to the stream."
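As a rough illustration of the aggregation step, here is a minimal sketch of rolling attribute scores up into a single watershed score with a weighted average, including a "what if" tweak like the large-wood example above. The attribute names, weights, and values are invented placeholders; EMDS itself builds a far more elaborate decision model than a simple weighted sum.

```python
# A minimal sketch of aggregating watershed attributes into one score.
# The attributes, weights, and sample values are invented placeholders; the
# real EMDS model is considerably more elaborate.

# Normalized attribute scores for one watershed (0 = poor, 1 = good).
attributes = {
    "water_temperature": 0.7,
    "large_wood_pieces": 0.4,
    "fine_sediment":     0.8,
    "pool_frequency":    0.6,
}

# Relative importance of each attribute (weights sum to 1).
weights = {
    "water_temperature": 0.3,
    "large_wood_pieces": 0.3,
    "fine_sediment":     0.2,
    "pool_frequency":    0.2,
}

watershed_score = sum(attributes[name] * weights[name] for name in attributes)
print(f"overall watershed condition score: {watershed_score:.2f}")

# "What if" scenario: adding large wood to the stream improves that attribute.
attributes["large_wood_pieces"] = 0.7
print(f"score after adding large wood:     "
      f"{sum(attributes[n] * weights[n] for n in attributes):.2f}")
```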

I wasn't able to find anything connecting AREMP with LTER but I did discover more problems with ecological data. For instance, the EPA and AREMP both use probability sampling designs but, "indicator and sampling methods differ from those used by the EPA, and these differences hinder collaboration and data comparison" (Hughes, 2008, p. 853).

I would still like to know more about AREMP's data and how their data collection methods compare to other ecological studies.

Hughes, R., & Peck, D. (2008). Acquiring data for large aquatic resource surveys: The art of compromise among science, logistics, and reality. Journal of the North American Benthological Society, 27(4), 837-859.

OCRIS: Online Catalogue and Repository Interoperability Study

JISC: Online Catalogue and Repository Interoperability Study (OCRIS): Final Report

This study reviewed Library Management Systems (and their associated OPACs) alongside the Institutional Repositories (IRs) of Higher Education Institutions in the UK.

The goals: determine whether repository content falls within the scope of the institutional OPAC (and the extent to which it is recorded there); examine the interoperability of OPAC and repository software; list the services offered by OPACs and repositories; identify potential for improvement in links to other institutional services; and make recommendations for the development of further links between OPACs and repositories.

The primary findings are distressing. Only 2 percent of the study respondents stated that their systems were definitely interoperable, and 14 percent stated that interoperability was pending. There was an 81 percent overlap in scope for items held in IRs and OPACs; generally, OPACs contained bibliographic data while IRs contained the full text.

Clearly, the boundaries of scope and policy between the two systems are not well defined, and the result is either uncoordinated effort, which hinders interoperability, or duplication of effort and redundant information. In order to provide a more feasible and appropriate long-term vision for IRs and OPACs, institutions should take a structured look at the goals of each service (IR vs. OPAC) and coordinate efforts to best provide interoperability and reduce duplication of effort.

This was interesting... OPACs and IRs arguably have different intents - generically speaking, circulating collections versus long-term preservation and access - but they are really not so different. Even where one does not store the information itself, each provides a service for locating it. In light of the volume of money spent on IR development, OPACs, and interoperability, and the number of established systems available to study, it is a very good time to step back and consider how these services might be better managed and coordinated to provide the best available service to the end user and the institution. I would like to see a corresponding study of American institutions.

Competing Requirements for Self-archiving

Stevan Harnad wrote a blog post in response to a letter to the editor that appeared in D-Lib Magazine. The original letter to the editor was a hypothetical dialog between an author whose work was recently accepted and an open-access consultant. In the dialog, the author is unsure about how to deal with a funder who requires the article to be deposited in an online repository and a publisher who may or may not allow the article to be deposited in a location it has no control over. In his post, Harnad takes excerpts from the dialog and adds more information about what the open-access consultant is saying, sometimes correcting him.
First, Harnad clarifies the opt-out clauses in self-archiving mandates. The opt-out clause concerns whether or not you need to persuade the journal to accept your addendum, thereby formalizing your right to deposit your article. While this is worthwhile, it is not essential, so authors can choose to opt out if they cannot persuade the journal or simply don't wish to try. Regardless of the presence of an opt-out clause, authors should still deposit their articles immediately. It is not necessary to find another publisher if the publisher denies your request.
Second, if the publisher has an embargo period, and the author wishes to honor it, they simply deposit the article as closed access for the appropriate period of time. There is no need to talk to the publisher about the embargo period.
Harnad suggests that depositing an article is not as confusing as it seems. An author simply needs to deposit all drafts as soon as an article is accepted for publication. According to his post, 63% of the top journals allow such a deposit. If your journal is one of the other 37%, simply set the article to closed access.
The issues surrounding depositing a published article are complex and probably result in many authors choosing to do nothing rather than try to navigate through all the competing requirements. Though closed access is not ideal, having a copy in a repository is better than not having anything stored. It seems that the more an author knows about the process of depositing an article and what they are allowed to do, the more likely it is that they will go to the trouble of depositing their article. The way things currently stand, most authors choose not to mess with online repositories because they don't even know where to find out what they are allowed to do.

Monday, November 16, 2009

JoVE (Journal of Visualized Experiments) is a new approach to peer-reviewed journals: specifically devoted to the biological sciences (life sciences), JoVE is indexed by PubMed. The goal of JoVE is to aid the transmission of information - particularly information that is not sufficiently represented by static text and images. JoVE claims that their approach - using video publishing - "promotes efficiency and performance" by eliminating time wasted on learning and perfecting new techniques based on traditionally published works.

The Editorial Board for JoVE boasts members of the scientific community from some of the best institutions in the world - Harvard, Mount Sinai School of Medicine, Princeton, the University of Zurich - an impressive number of highly qualified board members. The project was begun at Harvard in 2006 by a post-doc, Moishe Pritsker, now CEO and editor-in-chief of JoVE.

Access:
Initially, JoVE was conceived of as an open-access project; however, that model proved unsustainable, given the high costs of producing the videos. According to Pritsker,

"The reason is simple: we have to survive. To cover costs of our operations, to break even, we have to charge $6,000 per video article. This is to cover costs of the video-production and technological infrastructure for video-publication, which are higher than in traditional text-only publishing. Academic labs cannot pay $6,000 per article, and therefore we have to find other sources to cover the costs." (http://scholarlykitchen.sspnet.org/2009/04/06/jove/)

Thus, a pricing structure was created to cover these costs: "$1,000 for small colleges to $2,400 for PhD-granting institutions, prices which are in league with other commercial scientific journals. In addition, authors are charged $1,500 per article for video production services ($500 without), and there are open access options: $3,000/article with production services ($2,000 without)." (http://scholarlykitchen.sspnet.org/2009/04/06/jove/)

Taking a look at JoVE's Press section, I was surprised to find they had posted articles criticizing their decision to go closed-access. While these criticisms are valid, so are JoVE's explanations of why open access wasn't a possibility. Given the newness of this "product" and the fact that there is no existing model, it makes sense that charging for access is the only way to offset the price of video production without all of the costs falling on the research institutions.

How it looks:
Videos are very high quality and accompanied by a complete scholarly article that acts as a transcript of sorts to the video.

Overall, I'm very impressed by JoVE, and while it is closed access for now, it seems that open access may become possible in the future as more stakeholders buy in and the model becomes more widely accepted.

Monkeying Around with Twitter Data

This week I was looking for articles about access to data and found an interesting new development. An Austin-based company called InfoChimps.org, which offers data sets for download, announced last week that it is selling Twitter datasets. For quite a bit of money.

InfoChimps' mission "is to increase the world's access to structured data". The company appears to offer much data for free and prefers to be a platform where people may "post data under an open license". The data is available for browsing, and if the site doesn't actually have the data, it will point the user to where they can get the data for free. This is excellent for data sharing and access to large data sets. The site's homepage lists "Interesting Datasets" for perusal. I clicked on the first one, which was College Enrollment of Recent High School Completers 1960-2005. There is an intro paragraph to the data, where it's from, and an example of it. This set was prefaced with the caveat that the "files have data mixed with notes and references, multiple tables per sheet, and, worst of all, the table headers are not easily matched to their rows and columns." Kind of funny, yet good information to know!

So, InfoChimps offers free datasets - fabulous. But their recent announcement that they would sell Twitter data was met with some skeptical questions. The Read Write Web blog discusses this development in depth, describing the data, which isn't the full tweets, but hashtags, RTs, @ messages, and other associated info. This is apparently really useful and great information to have, and the developers at InfoChimps are hoping that people create interesting apps with this data. InfoChimps mined this data themselves, by hitting the Twitter Developer API 20,000 times per hour (I almost know what this means). That's a LOT of data. Marshall Kirkpatrick, the author at RWW, questions the complete legality of selling this data, and worries Twitter is going to come a-knockin'. Many commenters on the blog entry thought so too. InfoChimps (last week) swore they were on solid legal ground, but a new post from yesterday on their blog revealed that Twitter had asked them to remove the datasets. While InfoChimps swears their data had nothing personally revealing, privacy concerns came from commenters and apparently from Twitter, who claims they just want to prevent any 'malicious use' of the data.
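For anyone else who "almost knows what this means": hitting an API 20,000 times per hour works out to roughly one request every 0.18 seconds, around the clock. Here is a rough sketch of what that kind of paced harvesting loop looks like; the endpoint URL and parameters are placeholders, not the real Twitter API, and the pacing logic is my own illustration.

```python
# A rough sketch of what "hitting an API 20,000 times per hour" looks like in
# practice: a polling loop that spaces requests out to stay at that rate.
# The endpoint and parameters here are placeholders, not the real Twitter API.
import time
import requests

REQUESTS_PER_HOUR = 20_000
SECONDS_BETWEEN_REQUESTS = 3600 / REQUESTS_PER_HOUR   # ~0.18 s per request

def poll_once(query: str) -> dict:
    """Fetch one page of results from a (hypothetical) search endpoint."""
    resp = requests.get(
        "https://api.example.com/search",       # placeholder URL
        params={"q": query, "result_type": "recent"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

def harvest(query: str, n_requests: int = 100) -> list:
    """Collect n_requests pages, pacing calls to respect the hourly budget."""
    results = []
    for _ in range(n_requests):
        results.append(poll_once(query))
        time.sleep(SECONDS_BETWEEN_REQUESTS)
    return results
```

Run continuously, a loop like that returns hundreds of thousands of responses a day, which is how a harvest like InfoChimps' adds up to "a LOT of data" so quickly.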

I gleaned from the blog entries related to this issue that Twitter isn't very forthcoming with their data, which is making them a new Bad Guy of social media and data sharing. It's interesting that people are upset that Twitter won't share, but probably don't care as much if a university won't share some research datasets? Shouldn't data be data? Who cares if one is more sexy than another? Access is still important. InfoChimps sounds a little naive to think that selling Twitter datasets for $9000 wouldn't cause a stir, but maybe that's what they actually intended! At least it brought some mild attention to the issues at play here.

Sunday, November 15, 2009

Gordon

On November 4th, the San Diego Supercomputing Center (SDSC) announced that they have been awarded $20 million by the National Science Foundation to develop a new supercomputer aimed at "solving critical science and societal problems now overwhelmed by the avalanche of data generated by the digital devices of our era." This new computer is known as Gordon.

The SDSC has been part of the National Science Foundation's plan for cyberinfrastructure in the sciences for a number of years, as we read earlier in the semester. The development of Gordon is part of the NSF's continuing effort to keep up with the computing needs of the sciences. According to Jose L. Munoz, the deputy director and senior science advisor for the NSF's Office of Cyberinfrastructure, "'Gordon will do for data-driven science what tera-/peta-scale systems have done for the simulation and modeling communities, and provides a new tool to conduct transformative research.'"

Gordon, which is scheduled to be installed by Appro International, Inc. in 2011, will employ flash memory to speed solutions to data-intensive problems, such as the analysis of individual genomes, much faster than spinning disk technology allows. Gordon is the follow-up to the Dash system, another SDSC project, which was the first computer to use flash devices. Gordon will feature 245 teraflops of total compute power, 64 terabytes of DRAM, 256 terabytes of flash memory, and four petabytes of disk storage.

Another feature of Gordon will be 32 so-called "supernodes," each consisting of 32 compute nodes capable of 240 gigaflops/node and 64 gigabytes of DRAM. Linked together by virtual shared memory, these "supernodes" each have the potential of 7.7 teraflops of compute power and 10 terabytes of memory. The "supernodes" will be linked together with an InfiniBand network capable of 16 gigabits per second of bidirectional bandwidth, which, apparently, is eight times faster than some of the most recently developed supercomputers. Gordon will be made available to researchers via an open-access national grid.
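The supernode figures can be sanity-checked from the per-node numbers above; here is a quick back-of-the-envelope calculation (assuming, as the press release here doesn't spell out, that the 10 TB memory figure counts each supernode's share of flash alongside its DRAM).

```python
# Back-of-the-envelope check of the supernode figures quoted above.
nodes_per_supernode = 32
gflops_per_node = 240
dram_gb_per_node = 64

supernode_tflops = nodes_per_supernode * gflops_per_node / 1000
supernode_dram_tb = nodes_per_supernode * dram_gb_per_node / 1024

print(f"compute per supernode: {supernode_tflops:.2f} TF")   # ~7.7 TF, as quoted
print(f"DRAM per supernode:    {supernode_dram_tb:.1f} TB")  # 2 TB of DRAM; the
# quoted 10 TB presumably also counts the supernode's share of flash memory.
```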

The point of Gordon is to make possible the complex applications of data-driven science. According to the press release, Gordon will have potential benefits in both the academic and industrial settings. Gordon should also be useful for predictive science, which seeks to make models of real-life phenomena. Because of the large-scale memory on a single node and the consequent increase in computation speeds, Gordon should allow for the creation of models that more accurately mimic these phenomena. With such capabilities, Gordon will be the next step in the development of data-driven science.