Let non-author users provide cleaned-up versions of datasets

One of Dryad's philosophies is that it should be as easy as possible for users to contribute data. This is valuable for getting data archived, but it typically makes the data much more difficult to work with, especially in automated ways. Users can clean the data up themselves, but this means that every user has to repeat the same work, and it still doesn't help efforts to automate the acquisition, cleaning, and use of data.

I propose providing a mechanism by which someone other than the authors of a data set can provide a restructured/cleaned-up version of the data that is more usable (e.g., a set of CSV files instead of a multi-sheet Excel file). Credit would likely not be necessary for this work, though a mechanism for blame (in the event that the work were to introduce errors) might be useful.

21 votes
Ethan White shared this idea

    1 comment

      • Tim Lucas commented

        This could definitely work, especially if the cleaned-up version were held to a strict set of standards. While there's a lot to be said for placing very few restrictions on original data (better that it's hosted and dirty than clean and unavailable), the same is not true of cleaned-up data (better unduplicated and dirty than duplicated and still kinda dirty).

        I'd say a minimum would be a reproducible pipeline for going from the dirty data to the clean data, plus stricter format controls for common data types (i.e. flat data frames).
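A reproducible pipeline of the kind described here could be as small as a single deterministic script kept alongside the cleaned files. The sketch below is only an illustration under stated assumptions: the function names `clean_table` and `to_csv`, the cleaning rules, and the provenance fields are all hypothetical, not anything Dryad provides. It takes the raw text of a messy CSV export, fingerprints it, and either produces a flat rectangular table or fails loudly.

```python
import csv
import hashlib
import io


def clean_table(raw_text):
    """Deterministically turn a messy CSV export into a flat table.

    Returns (header, rows, provenance). All rules here are illustrative
    assumptions, not an existing Dryad standard.
    """
    # Fingerprint the untouched input so the cleaned file can be traced
    # back to the exact deposited version it was derived from.
    source_sha256 = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()

    # Strip whitespace from every cell and drop rows that are entirely blank.
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        cells = [cell.strip() for cell in record]
        if any(cells):
            rows.append(cells)
    if not rows:
        raise ValueError("no rows found in input")

    # Normalize header names: lower case, underscores instead of spaces.
    header = [name.lower().replace(" ", "_") for name in rows[0]]

    # Enforce a flat, rectangular shape: every row must match the header width.
    body = rows[1:]
    for index, row in enumerate(body, start=1):
        if len(row) != len(header):
            raise ValueError(
                f"data row {index} has {len(row)} cells, expected {len(header)}"
            )

    provenance = {"source_sha256": source_sha256, "rows_kept": len(body)}
    return header, body, provenance


def to_csv(header, rows):
    """Serialize the cleaned table as CSV text with plain newline endings."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```

Because the script has no hidden state, anyone can rerun it on the original deposit and check that they get the same cleaned file, and the recorded checksum gives the "mechanism for blame" something concrete to point at if an error turns up.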
