Background:
Breast cancer starts when cells in the
breast begin to grow out of control. These cells usually form a tumor that can
often be seen on an x-ray or felt as a lump. The tumor is malignant (cancer) if
the cells can grow into (invade) surrounding tissues or spread (metastasize) to
distant areas of the body. Breast cancer occurs almost entirely in women. but
men can get breast cancer too.
The most common symptom of breast cancer
is a new lump or mass. A painless, hard mass that has irregular edges is
more likely to be cancer, but breast cancers can be tender, soft, or
rounded. They can even be painful. For this reason, it is important to
have any new breast mass, lump, or breast change checked by a health care
professional experienced in diagnosing breast diseases.
However, not all are malignant! Help predict
whether the cancer is benign or malignant.
Relevant Papers:
First Usage:
W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction for
breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
Electronic Imaging: Science and Technology, volume 1905, pages 861-870, San
Jose, CA, 1993.
[Web
Link]
Medical literature:
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to
diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
163-171.
[Web
Link]
The evaluation of this dataset is done using Area Under the ROC curve (AUC).
An example of its application are ROC curves. Here, the true positive rates are plotted against false positive rates. An example is below. The closer AUC for a model comes to 1, the better it is. So models with higher AUCs are preferred over those with lower AUCs.
Please note, there are also other methods than ROC curves but they are also related to the true positive and false positive rates, e. g. precision-recall, F1-Score or Lorenz curves.
AUC is used most of the time to mean AUROC, AUC is ambiguous (could be any curve) while AUROC is not.
The AUROC has several equivalent interpretations:
Assume we have a probabilistic, binary classifier such as logistic regression.
Before presenting the ROC curve (= Receiver Operating Characteristic curve), the concept ofconfusion matrix must be understood. When we make a binary prediction, there can be 4 types of outcomes:
To get the confusion matrix, we go over all the predictions made by the model, and count how many times each of those 4 types of outcomes occur:
In this example of a confusion matrix, among the 50 data points that are classified, 45 are correctly classified and the 5 are misclassified.
Since to compare two different models it is often more convenient to have a single metric rather than several ones, we compute two metrics from the confusion matrix, which we will later combine into one:
To combine the FPR and the TPR into one single metric, we first compute the two former metrics with many different threshold (for example
The following figure shows the AUROC graphically:
In this figure, the blue area corresponds to the Area Under the curve of the Receiver Operating Characteristic (AUROC). The dashed line in the diagonal we present the ROC curve of a random predictor: it has an AUROC of 0.5. The random predictor is commonly used as a baseline to see whether the model is useful.
If you want to get some first-hand experience:
You cannot sign up from multiple accounts and therefore you cannot submit from multiple accounts.
Privately sharing code or data outside of teams is not permitted. It's okay to share code if made available to all participants on the forums.
You may submit a maximum of 5 entries per day.
You may select up to 2 final submissions for judging.
Rank | Team | Score | Count | Submitted Date |
---|