Sunday 4 March 2018

Digest of reproducibility activities from the non-profit cTuning foundation and dividiti in 2017

Last year was very intense for the cTuning foundation and dividiti - we continued working closely with AI, ML and systems communities to automate experimentation while improving reproducibility and reusability of results based on our long-term vision.

First of all, we had a chance to apply the new ACM artifact reviewing and badging policy (which we had co-authored a year earlier as a part of the new ACM taskforce on reproducibility based on our prior Artifact Evaluation experience) at the ACM CGO’18 and PPoPP’18. The good news is that we had yet another record number of submissions: nearly half of accepted papers submitted artifacts for validation totaling to 30! We also switched from EasyChair to HotCRP for artifact submission and evaluation since the last one provided a more convenient and anonymous communication mechanism between authors and reviewers during evaluation to continuously solve issues! This also motivates us to remove a “technical clarification” session during Artifact Evaluation at the future conferences since our reviewers already communicate with authors during evaluation!

We also noticed that our Artifact Appendix Template which we had prepared several years before in an attempt to unify CGO,PPoPP and PACT AE is now used at other conferences including SuperComputing (see Artifact Description in this SC’17 paper which will be used for CLUSTER competition at SC’18)! We plan to collaborate with CLUSTER colleagues as a part of the ACM pilot projects to automate artifact evaluation and introduce workflow frameworks to convert these artifacts to our Collective Knowledge format (see SC’16 paper and CGO’17 article with artifacts and workflows shared in the CK format).

Finally, we got a very positive feedback about our open artifact evaluation by the community from the past CGO/PPoPP’17 which we consider using even more in the future (see our motivation):

At the same time, we noticed several ambiguities with the new policy for “artifacts available” and “artifacts reusable” badges.

After consulting with our ACM colleagues, we updated reviewing criteria for “artifacts available” badge at http://cTuning.org/ae/reviewing.html:

The author-created artifacts relevant to this paper will receive an ACM "artifacts available" badge only if they have been placed on a publicly accessible archival repository such as Zenodo, FigShare or Dryad. A DOI will be then assigned to their artifacts and must be provided in the Artifact Appendix! The authors can also share their artifact via ACM DL - in such case they should contact AE chairs to obtain DOI (not yet automated unlike above repositories).

Criteria for “artifacts reusable” badge turned out to be even more vague particularly for systems research where experimental workflows often involve very complex and continuously changing algorithm/software/hardware stack. Many authors considered that having their artifact public with a ReadME, a few ad-hoc scripts to build and run experiments, and a Docker image is enough to get “artifacts reusable” badge.

Each year we see all the burden and suffering of our evaluators to figure out how to deal with numerous ad-hoc, non-portable and often failing scripts, non-unified data formats, and ad-hoc validation. That is why we argue that some sort of common workflow frameworks combined with cross-platform package managers and common APIs must be used to make an artifact easily reusable, portable and customizable (see Collective Knowledge, spack, easybuild , etc.). Such automation and unification can help to make Artifact Evaluation sustainable in a longer term particularly when more artifacts are submitted.

However, since we do not want to enforce our views and didn’t yet manage to reach a satisfying conclusion, we started discussing these issues at the open Artifact Evaluation discussion session at CGO/PPoPP’18. We also described some of these issues and possible solutions in our CNRS’17 presentation “Enabling open and reproducible computer systems research: the good, the bad and the ugly”.

At the same time, we continue working with ACM, Raspberry Pi foundation and the community to improve automation and experiment crowdsourcing using our open-source Collective Knowledge platform as well as sharing of artifacts and workflows as portable, customizable and reusable components with a common Python API and JSON meta-information.

You can see practical examples of such “plug&play” artifacts and workflows in our recent interactive and reproducible CK-based article “A Collective Knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques”. It presents our long-term educational initiative to teach students how to benchmark and co-design software and hardware stack for self-optimizing computer systems in a collaborative and reproducible way. It has all workflows, artifacts and results shared as portable, customizable and reusable CK components via GitHub and FigShare to let the community validate, reuse, improve and build upon them while crowdsource experiments via our public CK repository. Furthermore, we want researchers to quickly reuse and compare against shared performance results for common benchmarks and datasets on specific platforms and compilers during feedback-directed compilation and autotuning, rather than spending considerable effort rebuilding and rerunning such experiments!

We are also very excited to organize the 1st ACM ReQuEST tournament  based on Collective Knowledge platform with a consortium of leading universities (Cornell, Washington, Toronto, EPFL, Cambridge) and a strong advisory board: http://cKnowledge.org/request . This novel competition series focus on reproducible and Pareto-efficient co-design and optimization of the whole application/software/hardware stack for AI, ML, deep learning and other emerging workloads in terms of speed, accuracy, power and costs. More importantly, a growing number of participants will continue improving the common and CK-based optimization workflow and sharing portable and customizable AI/ML blocks optimized across diverse models, data sets and platforms from IoT to HPC. The benchmarking results and winning SW/HW/model configurations will be visualized on a public interactive dashboard and grouped according to certain categories (e.g. embedded vs. server). They can be also reproduced, reused, improved and compared against, thanks to the common CK framework. Our eventual goal is to share all winning algorithms and related artifacts as “plug&play” CK components with a common API to let the community immediately validate, customize, reuse and build upon them thus removing technology transfer gap and enabling open systems/AI/ML research!


The first edition of ReQuEST will serve mainly as a testbed for our approach, framework and repository, so we decided to limit submissions only to deep learning algorithms for image classification. It will be collocated with ASPLOS'18 - ACM conference on Architectural Support for Programming Languages and Operating Systems, which is the premier forum for multidisciplinary systems research spanning computer architecture and hardware, programming languages and compilers, operating systems and networking. Authors of Pareto-efficient or original submissions will be invited to present their findings at the associated ACM workshop. At the end of the tournament, we will provide a report to our advisory board presenting the outcome of the tournament, issues, possible solutions and next steps.

We look forward to collaborating with all of you in 2018 to automate research and experimentation, improve reproducibility of published results, develop efficient systems for AI and other emerging workloads, accelerate AI/ML/systems research, make a breakthrough in AI and enable intelligent systems everywhere!

Miscellaneous resources (2017)

Events

Initiatives 

Repositories

Tools

Articles


See all related resources here.