Call for Healthcare Data Analytics Challenge

Healthcare Data Analytics Challenge Chair: Vinod Vydiswaran, University of Michigan, USA


For the second year, the IEEE ICHI conference will feature a Healthcare Data Analytics challenge as a platform for health data scientists, researchers, and students to participate in a grand challenge related to healthcare. This year’s challenge is a healthcare forum post classification task.

Task Description

Patients and caregivers often share their health-related concerns in community forums and discussion boards such as ehealthforums, medhelp, and patientslikeme. On such discussion forums, members post questions and other members respond with suggestions or answers to the leading question. This provides a rich environment for healthcare researchers to study question-answering challenges in healthcare. One of the important tasks towards automated question answering is to identify the type of questions being asked by the patients and their caregivers.

In this year’s healthcare data analytics challenge, you will be provided with real forum messages posted on health discussion forums. The forums have been classified into one of the following seven categories:

  • Demographic (DEMO): Forums targeted towards specific demographic sub-groups characterized by age, gender, profession, ethnicity, etc.
  • Disease (DISE): Forums related to a specific disease
  • Treatment (TRMT): Forums related to a specific treatment or procedure
  • Goal-oriented (GOAL): Forums related to achieving a health goal, such as weight management, exercise regimen, etc.
  • Pregnancy (PREG): Forums related to pregnancy, including forums on difficulties with conception and concerns about the health of the mother and unborn child during pregnancy
  • Family support (FMLY): Forums related to issues of a caregiver (rather than a patient), such as support for an ill child or spouse.
  • Socializing (SOCL): Forums related to socializing, including hobbies and recreational activities, rather than a specific health-related issue.

Although the purpose of a health discussion forum might be relatively clear, it is challenging to classify messages posted in such forums without knowledge of the target forum. In this challenge, your goal is to classify each forum message into one of the above forum categories based on just the title and text of the message.

Challenge Specification

You will be provided with a training dataset consisting of 8,000 questions (each with a post title and message text), each labeled with one of the seven categories described above. You will also be provided with a test set of 3,000 unlabeled questions. Your task is to label each question in the test set with ONE of the category labels (the four-letter code).
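As an illustration of the task only, the following is a minimal baseline sketch, not a recommended or official approach. It trains a bag-of-words Naive Bayes classifier with add-one smoothing on (title, text, label) tuples. The tuple layout, the tokenizer, and the function names `tokenize`, `train_naive_bayes`, and `predict` are assumptions made for this example; the actual layout of the challenge data files is not specified here.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title, text):
    """Lowercase the title and message text and split them into word tokens."""
    return re.findall(r"[a-z']+", f"{title} {text}".lower())


def train_naive_bayes(examples):
    """Count documents and words per label.

    examples: iterable of (title, text, label) tuples (an assumed layout).
    """
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for title, text, label in examples:
        tokens = tokenize(title, text)
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, title, text):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    # Words never seen in training are ignored.
    tokens = [tok for tok in tokenize(title, text) if tok in vocab]
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            # Add-one (Laplace) smoothing over the vocabulary.
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

A model built with `train_naive_bayes` on the labeled training questions could then be applied to each test question through `predict(model, title, text)`.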

You are allowed five submission runs. Each submitted run result should be a comma-separated text file (CSV file) consisting of a header line “ID,Category” followed by exactly 3,000 lines, each with a test question ID (1 through 3000) and a category code (one of {DEMO, DISE, TRMT, GOAL, PREG, FMLY, SOCL}).
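The run file format described above can be sketched as follows. This is an illustrative helper, not an official tool: the function names `write_run` and `validate_run` are hypothetical, and the check that every question ID appears exactly once is an assumption about how "exactly 3,000 lines" would be interpreted.

```python
import csv

VALID_CODES = {"DEMO", "DISE", "TRMT", "GOAL", "PREG", "FMLY", "SOCL"}
N_TEST = 3000  # number of questions in the challenge test set


def write_run(path, predictions, n_questions=N_TEST):
    """Write a run file from a dict mapping question ID to category code."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "Category"])
        for qid in range(1, n_questions + 1):
            writer.writerow([qid, predictions[qid]])


def validate_run(path, n_expected=N_TEST):
    """Return a list of format problems; an empty list means none were found."""
    problems = []
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows or rows[0] != ["ID", "Category"]:
        problems.append("first line must be the header ID,Category")
    body = rows[1:]
    if len(body) != n_expected:
        problems.append(f"expected {n_expected} data lines, found {len(body)}")
    seen_ids = set()
    for line_no, row in enumerate(body, start=2):
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 fields")
            continue
        qid, code = row
        is_number = qid.isascii() and qid.isdigit()
        if not is_number or not 1 <= int(qid) <= n_expected:
            problems.append(f"line {line_no}: bad question ID {qid!r}")
        elif int(qid) in seen_ids:
            problems.append(f"line {line_no}: duplicate question ID {qid}")
        else:
            seen_ids.add(int(qid))
        if code not in VALID_CODES:
            problems.append(f"line {line_no}: unknown category code {code!r}")
    return problems
```

Running `validate_run` on a file before uploading it is one way to catch a missing header, a wrong line count, or a misspelled category code.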

You can use any external data in your approach (dictionaries, ontologies etc.) provided you reveal all the data sources used in your approach in the report.

System Evaluation and Submission Guidelines

The submitted runs will be evaluated on classification accuracy, which is the number of test questions assigned the correct category label divided by the total number of test questions (3,000 in the challenge test set).
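The accuracy measure defined above amounts to the following computation. This is a small sketch under the assumption that predictions and gold labels are both held as dicts keyed by question ID; the function name `classification_accuracy` is hypothetical and this is not the organizers' scoring script.

```python
def classification_accuracy(predicted, gold):
    """Fraction of questions whose predicted label equals the gold label.

    predicted, gold: dicts mapping question ID to category code.
    """
    if not gold:
        raise ValueError("gold labels must not be empty")
    if set(predicted) != set(gold):
        raise ValueError("predicted and gold must cover the same question IDs")
    correct = sum(1 for qid, label in gold.items() if predicted[qid] == label)
    return correct / len(gold)
```

For example, a run that labels 2,250 of the 3,000 test questions correctly would score 2250 / 3000 = 0.75.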

The participating teams are also required to submit a write-up summarizing their approach and describing the variations among their submitted runs. Based on the overall classification accuracy and the techniques used, three to five finalists will be chosen to present their approach at the conference.

All participating teams will receive a summary of their classification accuracy. The finalists will be asked to submit a final version of their reports that can include the performance results of their runs. In addition, finalists will need to submit the code and models used for their approach. The code will be run against an unseen dataset of additional forum questions. The final performance results and challenge winners will be announced at the conference.

To submit via the Precision Conference System, log in and follow the steps below:

  • To submit your runs, click on “new submissions”, and then “Submit to Doctoral Consortium, Industry Track, and Data Challenge”, and then add a new paper.
  • Specify the title of your technical paper. Start the title with your team name, followed by a colon, and then the actual title of the report.
  • Add all authors to the paper and also list them in the author list.
  • Abstract: Include a brief summary of approaches tried and runs being submitted.
  • Select the subcommittee as "Healthcare Data Analytics Challenge"
  • You may ignore the keywords section.
  • Upload the technical report as a pdf in the document section.
  • Upload a zip file containing all runs (up to 5 runs allowed per team) and a readme file. You need not include the technical report pdf in this zip file.
  • Acknowledge that all names and affiliations are on the paper.
  • You may update the runs and the pdf until the deadline.

How to participate

  • Register your team by sending the name of the leading team member to the Challenge Chair via email. In your email, please specify (a) the subject as “ICHI Healthcare Data Challenge registration”, (b) the name and email address of the “corresponding” / leading team member, (c) the name of the institution, and (d) the desired name for the participating team. You will use the participating team name when you submit runs. You do not need to mention the names of all team members in your initial request, but please include all names in your challenge report.
  • As a response to your registration request, the Challenge Chair will approve the (unique) team name and send you the link to the training and test datasets.
  • Submit your classification runs on the test set by the submission deadline (June 30, 2016). You can submit up to five classification runs on the test set. Name each submission run file uniquely as follows: <teamname>-run<N>.csv, where <teamname> is the unique team name registered with the Challenge Chair, and <N> is a number between 1 and 5. For example, if your registered team name is “zenith” and it is the second run, the name of the submission file will be zenith-run2.csv
  • All participating teams will need to submit the following as a part of the challenge participation (as a zip file): (a) Up to five submission runs on the test dataset in the specified format, and (b) A technical report (up to four pages in IEEE conference proceedings format). The technical report should include (i) the names of all participating team members, and (ii) details of your approach, including any databases, ontologies, etc. used.
  • All teams will receive a summary of their runs by the notification date (July 15, 2016). Based on the submitted runs and report, three to five finalists will be selected.
  • Finalists will also have to submit (a) A revised technical report with summary of performance results based on the test set; (b) The system code and model used to generate the best performing run; (c) A readme file containing clear instructions on how to compile, deploy and run your system; and (d) Any knowledge bases / other files needed to run the system.
  • Before the conference, the final submitted systems will be tested against an unseen dataset that has not been shared with the participants. Results from this dataset will decide the final winners of the challenge.
  • Finalists will be invited to give a short presentation outlining their approach at the conference. The winners of the challenge will also be announced then.
  • The papers from all finalists will be compiled into a technical notebook accompanying the conference proceedings.

Rules

  • Each participating group can have as many people as needed, but each person can be part of no more than two groups.
  • You can submit up to five classification runs on the test set. If you submit multiple runs with the same run ID (1 to 5), the previous runs will be overwritten by the subsequent runs.
  • Systems can be developed in one of the standard programming languages (C, C++, C#, Java, Python, Perl, R, Matlab). If you intend to use any other language, you will need to obtain prior permission from the ICHI 2016 Healthcare Data Analytics Challenge Chair.
  • You can use databases, ontologies, dictionaries, etc. in your system provided you reveal all such knowledge bases in your report.
  • At least one member of each finalist group is required to register for the conference and be present at the conference.
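The run file naming convention from the participation steps (the registered team name, then "-run", then a run ID from 1 to 5, as in zenith-run2.csv) can be checked with a short helper. This is a sketch only: the function names `run_filename` and `parse_run_filename` are hypothetical, and the ".csv" extension is an assumption taken from the zenith-run2.csv example.

```python
import re

# Team name, then "-run", then a single run ID from 1 to 5, then ".csv".
RUN_NAME = re.compile(r"^(?P<team>.+)-run(?P<run>[1-5])\.csv$")


def run_filename(team, run_id):
    """Build the file name for a given team name and run ID (1 to 5)."""
    if run_id not in range(1, 6):
        raise ValueError("run ID must be a number between 1 and 5")
    return f"{team}-run{run_id}.csv"


def parse_run_filename(name):
    """Return (team, run_id) if the name follows the convention, else None."""
    match = RUN_NAME.match(name)
    if match is None:
        return None
    return match.group("team"), int(match.group("run"))
```

Because a later upload with the same run ID replaces the earlier one, building names through a single helper like `run_filename` makes it harder to overwrite a run by accident.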


    • Challenge Solution and Paper Submission Deadline: June 30, 2016
    • Finalist Decision Notification: July 18, 2016
    • Finalist Code Submission and Final reports Due: July 29, 2016

    Please submit electronically via the Precision Conference System. Please select the track for Data Challenge.

    Please contact the Healthcare Data Analytics Challenge Chair, Dr. V.G. Vinod Vydiswaran, if you have any questions or need additional information about the challenge.

