Multi-dataset Time Series Anomaly Detection

Anomaly Detection in Time Series

Important Note:

At the beginning of Phase II we will clear the leaderboard. We must do this because we have reasonable doubt that there are many duplicate accounts and instances of hand labeling.

We will also be requesting two new things:

1. Affiliations - We ask you to provide your school or work email address as your affiliation. You may also participate un-affiliated; however, you cannot change from un-affiliated to a school or work affiliation.

2. Submission of code - We will require you to submit your code along with each submission you make in a day.

Please note: We reserve the right to remove participants whom we reasonably suspect of hand labeling or of holding multiple accounts. It is against the spirit of the competition. (See the Rules page.)

In recent years there has been an explosion of papers on time series anomaly detection appearing in SIGKDD and other data mining, machine learning and database conferences. Most of these papers test on one or more of a handful of benchmark datasets, including datasets created by NASA, Yahoo, Numenta, Tsinghua-OMNI (Pei’s lab), and others.

While the community should greatly appreciate the efforts of these teams to share data, a handful of recent papers [a] have suggested that these are unsuitable datasets for gauging progress in anomaly detection.

In brief, the two most compelling arguments against using these datasets are:

  • Triviality: Almost all of the benchmark datasets mentioned above can be perfectly solved, without the need to look at any training data, and with decade-old algorithms.
  • Mislabeling: The possibility of mislabeling in anomaly detection benchmarks can never be completely eliminated. However, some of the datasets mentioned above seem to have a significant number of false positives and false negatives in the ground truth. Papers have been published arguing that method A is better than method B because it is 5% more accurate on benchmark X. However, a careful examination of benchmark X suggests that more than 25% of the labels are wrong, a number that dwarfs the claimed difference between the algorithms being compared.

Beyond the issues listed above, and the possibility of file drawer effect [b] and/or cherry-picking [c], we believe that the community has been left with a set of unsuitable benchmarks. With this in mind, we have created new benchmarks for time series anomaly detection as part of this contest.

The benchmark datasets created for this contest are designed to mitigate this problem. It is important to note our claim is “mitigate”, not “solve”. We think it would be wonderful for a large and diverse group of researchers to address this issue, much in the spirit of CASP [d].

In the meantime, the 250 datasets that are part of this challenge reflect more than 20 years work surveying the time series anomaly detection literature and collecting datasets. Beyond the life of this competition, we hope that they can serve as a resource for the community for years to come, and to inspire deeper introspection about the evaluation of anomaly detection.

Further, in order to keep the spirit of the competition high, we would like to thank Hexagon-ML for not only sponsoring the competition but also providing the prizes:

  • First Prize : $2000 USD
  • Second Prize : $1000 USD
  • Third Prize : $500 USD
  • For the top 15 participants we will provide a certificate with rank.
  • For all other participants we will provide a participation certificate.

We hope you will enter the contest, and have lots of fun!

Best wishes, 

 

Prof. Eamonn Keogh, UC Riverside and Taposh Roy, Kaiser Permanente


Cite this competition:

Keogh, E., Dutta Roy, T., Naik, U. & Agrawal, A. (2021). Multi-dataset Time-Series Anomaly Detection Competition, SIGKDD 2021. https://compete.hexagon-ml.com/practice/competition/39/


References

[a] https://arxiv.org/abs/2009.13807 Current Time Series Anomaly Detection Benchmarks are Flawed and are Creating the Illusion of Progress. Wu and Keogh

[b] https://en.wikipedia.org/wiki/Publication_bias

[c] https://en.wikipedia.org/wiki/Cherry_picking

[d] https://en.wikipedia.org/wiki/CASP


Evaluation

  • Evaluation for this competition will be done based on the outcomes of Phase II only.

  • There will be a public leaderboard showcasing the results instantly as you submit a submission file. 

  • The private leaderboard showcasing the rank and winners will be released one week after the competition ends on April 16th, 2021. This will be the final leaderboard.

  • The evaluation metric is a percentage: the fraction of the 250 time series in which the anomaly is correctly located.

  • Every time series has exactly one anomaly.

  • For every correct identification of the location of the anomaly you will get 1 point, and 0 points for every incorrect one.

  • We allow a tolerance of +/- 100 locations on either side of the anomaly range when judging a correct answer.


    Example




    There are 250 files; for every correct answer you will get 1 point and 0 for an incorrect one. The maximum score you can obtain is 100% (as long as you do this in code using an algorithm, with no hand labeling). We reserve the right to disqualify any participant we suspect of hand labeling. Please see the rules for participating in the competition.
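The tolerant scoring rule above can be sketched as a small function. This is a hypothetical illustration only; the official scorer and submission file format may differ, and `score_predictions` is a name invented here.

```python
def score_predictions(predicted, true_ranges, tolerance=100):
    """Award 1 point when a predicted location falls within the true
    anomaly range extended by +/- `tolerance` positions, else 0 points.
    Returns the final score as a percentage of the number of files."""
    correct = 0
    for loc, (begin, end) in zip(predicted, true_ranges):
        if begin - tolerance <= loc <= end + tolerance:
            correct += 1
    return 100.0 * correct / len(true_ranges)
```

With 250 files, each correctly located anomaly contributes 0.4 percentage points to the total score.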

Rules

We reserve the right to change the rules and data sets if deemed necessary.


Updates for Phase II (from description page):


Q) Why must you submit code with every attempt?

(A) Recall the goal of the contest is not to find the anomalies. The goal of the contest is to produce a single algorithm that can find the anomalies [*]. If your submission turns out to be competitive, your submitted code allows you to create a post-hoc demonstration that it was the result of a clever algorithm.

(Q) Why must you use an official university or company email address to register?

(A) Experience in similar contests suggests that otherwise individuals may register multiple times to glean an advantage. It is hard to prevent multiple registrations, but this policy goes some way toward limiting the utility of an unfair advantage.

[*] Of course, the “single algorithm” can be an ensemble or a meta-algorithm that automatically chooses which algorithm and which parameters to use. However, having a human decide which algorithm or which parameters to use on a case-by-case basis is not allowed. This is not a test of human skill; this is a test of algorithms.

 


Note:

The spirit of the contest is to create a general-purpose algorithm for anomaly detection. It is against the spirit of this goal (and explicitly against the rules) to embed a human’s intelligence into the algorithm based on a human inspection of the contest data. For example, a human might look at the data and notice: “When the length of the training data is odd, algorithms that look at the maximum value are best. But when the length of the training data is even, algorithms that look at the variance are best.” The human might then code up a meta-algorithm like “If odd(length(training data)) then invoke … ”.

There is a simple test for the example above: if we simply duplicated the first datapoint, would the outcomes be essentially identical? Of course, an algorithm can be adaptive. If the representational power of your algorithm is able to discover and exploit some regularity, then that is fine. However, an algorithm that effectively memorizes which of the datasets it is looking at, and changes its parameters/behavior based on that observation (not on the intrinsic properties of the data it observes), is cheating. Our code review of the best-performing algorithms will attempt to discover any such deliberate overfitting to the contest problems.
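The “duplicate the first datapoint” test mentioned above can be sketched as a quick sanity check. Here `detect` is any detector function you supply that maps a series to a predicted anomaly location; the helper name and default slack are invented for illustration.

```python
import numpy as np

def passes_duplication_test(detect, series, slack=1):
    """Duplicating the first datapoint shifts every true index by
    exactly one, so an honest detector's answer should shift by one
    position too (within a small slack). A detector keyed to, say,
    the parity of the dataset length will typically fail this test."""
    original = detect(series)
    perturbed = detect(np.concatenate(([series[0]], series)))
    return abs((perturbed - 1) - original) <= slack
```

An adaptive algorithm passes this check naturally; one that branches on incidental properties like series length does not.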



Announcements:

To receive announcements and be informed of any change in the rules, participants must provide a valid email to the challenge platform.

 

Conditions of participation:

Participation requires complying with the rules of the Competition. Prize eligibility is restricted by US government export regulations. The organizers, sponsors, their students, close family members (parents, sibling, spouse or children) and household members, as well as any person having had access to the truth values or to any information about the data or the Competition design giving him (or her) an unfair advantage are excluded from participation. A disqualified person may submit one or several entries in the Competition and request to have them evaluated, provided that they notify the organizers of their conflict of interest. If a disqualified person submits an entry, this entry will not be part of the final ranking and does not qualify for prizes. The participants should be aware that the organizers reserve the right to evaluate for scientific purposes any entry made in the challenge, whether or not it qualifies for prizes.

Dissemination:

The Winners will be invited to attend a remote webinar organized by the hosts and present their method.

Registration:

The participants must register to the platform and provide a valid email address. Teams must register only once and provide a group email, which is forwarded to all team members. Teams or solo participants registering multiple times to gain an advantage in the competition may be disqualified.

 

Intellectual Policy:

All work is open source. 

 

 

Anonymity:

The participants who do not present their results at the webinar can elect to remain anonymous by using a pseudonym. Their results will be published on the leaderboard under that pseudonym, and their real name will remain confidential. However, the participants must disclose their real identity to the organizers to claim any prize they might win. 


One account per participant:
You cannot sign up from multiple accounts and therefore you cannot submit from multiple accounts.  

 

Team Size:

Max team size of 4.

 

No private sharing outside teams:
Privately sharing code or data outside of teams is not permitted. It's okay to share code if made available to all participants on the forums.

 

Submission Limits:
You may submit a maximum of 1 entry per day.

 

Specific Understanding:

  1. Use of external data is not permitted. This includes use of pre-trained models.

  2. Hand-labeling is not permitted and will be grounds for disqualification.

  3. Knowledge must not be shared between any two files; any violation of this will lead to disqualification.

  4. Submissions should be reasonably constrained to standard open-source libraries (Python, R, Julia and Octave).

  5. If submitted code cannot be run, the team may be contacted; if minor remediation fails or sufficient information to run the code is not provided, the submission will be removed.

  6. If an algorithm is stochastic, please make sure you save the random seeds.
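For rule 6, a minimal sketch of fixing seeds at the top of a script so a stochastic pipeline reproduces exactly across runs (the seed value shown is arbitrary):

```python
import random
import numpy as np

SEED = 42  # record this value alongside your submitted code

random.seed(SEED)
np.random.seed(SEED)

# any stochastic step from here on reproduces exactly across runs
sample = np.random.rand(3)
```

If your method uses other frameworks with their own random state, seed those as well and note the versions used.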

Leaderboard

Rank Team Percentile Count Submitted Date
1 gen 80.40000000 18 May 2, 2021, 3:36 a.m.
2 DBAI 79.20000000 15 May 5, 2021, 9:29 p.m.
3 MDTS 78.00000000 9 May 3, 2021, 9:33 p.m.
4 HU WBI 76.80000000 15 April 30, 2021, 8:02 a.m.
5 poteman 76.00000000 13 April 23, 2021, 3:55 a.m.
6 eiji 74.00000000 10 May 6, 2021, 4:23 a.m.
7 TAL_AI_NLP 73.20000000 14 April 28, 2021, 11:56 p.m.
8 ralgond 71.20000000 28 May 6, 2021, 12:01 a.m.
9 Old Captain 70.80000000 10 May 1, 2021, 1:41 a.m.
10 LUMEN 70.00000000 13 May 6, 2021, 3:19 p.m.
11 insight 68.80000000 9 April 28, 2021, 4:11 a.m.
12 MSD 66.80000000 2 May 1, 2021, 3:12 a.m.
13 willxu 66.40000000 7 April 22, 2021, 12:50 a.m.
14 HI 66.40000000 10 April 24, 2021, 6:09 a.m.
15 kddi_research 66.00000000 8 April 27, 2021, 6:20 p.m.
16 NVIDIA Giba 65.60000000 4 April 17, 2021, 10:15 a.m.
17 yu 64.00000000 6 April 12, 2021, 1:37 a.m.
18 MeisterMorxrc 63.20000000 16 April 24, 2021, 9:19 p.m.
19 lansy 62.80000000 3 April 18, 2021, 8:33 p.m.
20 huangguo 62.40000000 24 April 28, 2021, 6:19 a.m.
21 PaulyCat 62.00000000 11 April 26, 2021, 2:38 a.m.
22 runningz 60.80000000 4 May 1, 2021, 7:17 a.m.
23 JulienAu 60.80000000 1 May 6, 2021, 3:07 a.m.
24 CASIA 60.40000000 2 April 8, 2021, 7:31 p.m.
25 Limos Team 60.40000000 8 April 15, 2021, 12:54 a.m.
26 OWLs 60.40000000 1 April 22, 2021, 10:40 a.m.
27 Gidora 59.60000000 16 April 24, 2021, 10:20 p.m.
28 Alibey 59.20000000 4 April 21, 2021, 8:21 p.m.
29 AIG_Mastercard 58.80000000 10 April 26, 2021, 9:57 p.m.
30 Kubota 58.00000000 5 April 22, 2021, 4:02 p.m.
31 jin 58.00000000 4 April 22, 2021, 7:27 p.m.
32 FirstDan 57.60000000 3 April 11, 2021, 8:58 p.m.
33 walyc 57.60000000 10 April 13, 2021, 6:10 a.m.
34 BigPicture 57.20000000 7 April 27, 2021, 6:23 p.m.
35 Wakamoto 57.20000000 14 May 6, 2021, 9 a.m.
36 NONE 56.80000000 3 April 25, 2021, 8:15 p.m.
37 Songpeix 56.80000000 7 May 3, 2021, 10:53 p.m.
38 varlam 56.40000000 1 April 18, 2021, 9:44 p.m.
39 xuesheng 56.40000000 26 April 19, 2021, 12:05 a.m.
40 haizhan 55.60000000 24 April 21, 2021, 6:05 a.m.
41 Newborn Calves 55.60000000 13 April 24, 2021, 5:14 p.m.
42 kdd_gcc 55.20000000 3 April 17, 2021, 7:37 a.m.
43 syin1 54.40000000 3 April 12, 2021, 12:39 a.m.
44 Pooja 54.00000000 8 April 25, 2021, 9:28 a.m.
45 kris13 53.20000000 2 April 14, 2021, 4:52 a.m.
46 tang 53.20000000 5 April 30, 2021, 5:18 p.m.
47 fizzer 52.80000000 3 April 21, 2021, 11:28 p.m.
48 KP 52.40000000 8 April 15, 2021, 2:48 a.m.
49 hpad 52.00000000 3 April 11, 2021, 1:30 a.m.
50 whatsup 52.00000000 1 April 11, 2021, 9:09 a.m.
51 exp234 52.00000000 6 April 12, 2021, 1:45 a.m.
52 JJ 51.60000000 4 April 26, 2021, 5:38 a.m.
53 88aaattt 50.80000000 5 April 29, 2021, 10:06 a.m.
54 DayDayUp 50.80000000 1 May 4, 2021, 6:10 a.m.
55 Snowman 49.60000000 1 April 22, 2021, 4:39 a.m.
56 zzl 49.60000000 9 May 5, 2021, 7:29 p.m.
57 Jim 49.20000000 2 April 12, 2021, 11:33 p.m.
58 Hello 49.20000000 12 May 2, 2021, 7:56 p.m.
59 linytsysu 48.00000000 3 April 29, 2021, 1:20 a.m.
60 daintlab 47.20000000 6 May 3, 2021, 4:35 a.m.
61 xus 47.20000000 1 May 6, 2021, 1:21 a.m.
62 sion 46.80000000 8 April 24, 2021, 12:57 a.m.
63 UCM/INNOVA-TSN 46.80000000 9 April 26, 2021, 2:30 p.m.
64 yanxinyi 46.40000000 1 April 19, 2021, 6:49 p.m.
65 AOLeaf 46.00000000 1 April 15, 2021, 9:06 p.m.
66 hren927 46.00000000 3 May 3, 2021, 3:49 a.m.
67 sakami 45.60000000 1 April 29, 2021, 7:44 a.m.
68 demo_user 45.20000000 1 April 14, 2021, 6:27 a.m.
69 166 44.40000000 2 April 23, 2021, 6:56 a.m.
70 wenj 44.00000000 3 April 12, 2021, 11:35 p.m.
71 lzc775269512 44.00000000 3 May 5, 2021, 9:46 p.m.
72 void 43.20000000 1 April 19, 2021, 7:59 a.m.
73 xiaoqiangteam 43.20000000 2 May 5, 2021, 12:50 a.m.
74 LQKK 42.80000000 1 April 8, 2021, 8:05 a.m.
75 Ida 42.80000000 1 April 11, 2021, 5:44 a.m.
76 Anony 42.80000000 3 April 20, 2021, 5:43 a.m.
77 ML_Noob 42.80000000 8 April 24, 2021, 3:14 p.m.
78 AD 42.80000000 9 April 26, 2021, 6:23 p.m.
79 lyf 41.60000000 1 May 6, 2021, 1:05 a.m.
80 Hector 41.20000000 3 April 28, 2021, 12:25 a.m.
81 takuji 40.80000000 17 April 18, 2021, 12:48 a.m.
82 Anony 40.00000000 4 April 13, 2021, 1:42 a.m.
83 yuanliu 39.20000000 1 April 10, 2021, 7:57 a.m.
84 Liu 39.20000000 1 April 10, 2021, 8:10 a.m.
85 Splunk Applied Research 38.80000000 4 May 5, 2021, 9:53 a.m.
86 Giba 38.40000000 4 April 15, 2021, 7:25 a.m.
87 Anony 38.00000000 3 April 13, 2021, 2:06 a.m.
88 Anony 37.20000000 1 April 19, 2021, 1:45 a.m.
89 katsuhito 37.20000000 5 April 28, 2021, 11:52 p.m.
90 UMAC 36.80000000 5 May 5, 2021, 1:06 a.m.
91 mzrske 34.40000000 5 April 14, 2021, 6:06 p.m.
92 Emmitt 34.40000000 2 April 24, 2021, 11:56 p.m.
93 yuanCheng 33.60000000 5 April 11, 2021, 10:33 a.m.
94 BMul 33.20000000 5 April 21, 2021, 8:03 a.m.
95 fred 33.20000000 2 May 5, 2021, 1:56 a.m.
96 Unicorns 32.40000000 1 April 22, 2021, 12:24 a.m.
97 zsyjy 32.00000000 6 April 11, 2021, 6:36 p.m.
98 WJH 30.80000000 4 April 17, 2021, 5:01 a.m.
99 itouchz.me 30.40000000 5 April 17, 2021, 7:03 a.m.
100 truck 28.00000000 3 April 30, 2021, 3:03 p.m.
101 HL 26.80000000 4 May 2, 2021, 3:50 p.m.
102 mouyitian 26.00000000 2 April 19, 2021, 10:53 p.m.
103 BTDLOZC 25.60000000 1 April 20, 2021, 6:26 p.m.
104 hector 24.40000000 2 April 19, 2021, 9:04 p.m.
105 jarvus 24.40000000 4 April 27, 2021, 1:24 a.m.
106 rushin 24.40000000 2 April 29, 2021, 2:33 a.m.
107 penguin 24.40000000 7 May 2, 2021, 12:07 a.m.
108 Monkey D. Luffy 23.20000000 5 May 4, 2021, 12:48 p.m.
109 KeepItUp 22.80000000 1 April 12, 2021, 10:51 p.m.
110 AnomalyDetection 22.80000000 2 April 17, 2021, 10:41 p.m.
111 ZJU_Control 22.80000000 2 May 6, 2021, 4:34 a.m.
112 Andre 22.00000000 1 May 3, 2021, 11:20 p.m.
113 dspreit 21.60000000 1 April 21, 2021, 9:45 p.m.
114 sdl-team 21.20000000 4 April 10, 2021, 6:29 a.m.
115 hot 18.00000000 2 April 17, 2021, 3:58 a.m.
116 Anony 17.60000000 1 April 19, 2021, 11:32 a.m.
117 kmskonilg 17.20000000 3 April 21, 2021, 12:42 a.m.
118 donaldxu 17.20000000 3 April 21, 2021, 8:30 a.m.
119 Rush B 16.40000000 3 April 26, 2021, 3:47 a.m.
120 WintoMT 15.20000000 1 May 5, 2021, 9:15 a.m.
121 tEST 12.40000000 2 April 14, 2021, 8:52 p.m.
122 Prarthi 12.00000000 1 April 11, 2021, 10:02 a.m.
123 Seemandhar 11.60000000 1 April 11, 2021, 9:46 a.m.
124 qustslxysdxx 10.40000000 1 April 24, 2021, 3:37 a.m.
125 piticli 9.60000000 3 May 4, 2021, 4:16 p.m.
126 hg2 9.20000000 11 April 30, 2021, midnight
127 Zoey 9.20000000 1 May 5, 2021, 3:28 a.m.
128 axioma 8.40000000 5 May 3, 2021, 9:31 a.m.
129 chunli 8.00000000 2 May 3, 2021, 6:24 a.m.
130 ceshi 7.20000000 1 April 24, 2021, 3:58 a.m.
131 zyz 6.80000000 3 April 18, 2021, 6:35 p.m.
132 patpat 3.20000000 1 April 15, 2021, 5:54 a.m.
133 nickil21 2.80000000 1 May 5, 2021, 1:23 a.m.
134 Host 1.20000000 1 April 7, 2021, 11:10 p.m.
135 ereshkigal 1.20000000 1 April 28, 2021, 9:18 p.m.
136 Support 0.40000000 1 April 18, 2021, 4:14 a.m.
137 baozi 0.40000000 1 April 23, 2021, 8:39 p.m.
138 iyoad 0.40000000 1 April 28, 2021, 12:35 a.m.
139 uday 0.00000000 1 April 7, 2021, 10:07 p.m.
140 Competition Host 0.00000000 1 April 7, 2021, 11:39 p.m.
141 finlayliu 0.00000000 1 April 14, 2021, 12:25 a.m.
142 LEARNING 0.00000000 1 May 3, 2021, 10:11 a.m.

Getting Started



Overview of the Time Series Anomaly Detection Competition


Detecting anomalies in univariate time series is a challenge that has been studied for more than 50 years. Several attempts have been made, but there is still no robust general solution. This year, as part of KDD Cup 2021, Prof. Eamonn Keogh and Taposh Roy are hosting the multi-dataset time series anomaly detection competition. The goal of this competition is to encourage industry and academia to find a solution for univariate time-series anomaly detection. Prof. Keogh has provided 250 datasets collected over 20 years of research to further this area. Please review the brief overview video developed by Prof. Keogh.






Here is a simple example showcasing how to find the anomaly in a single provided time series file.


# import the required libraries
import pandas as pd
import matrixprofile as mp

# read the dataset (each competition file is a single column of values)
df = pd.read_csv('/Users/code/timeseries/ucr_competition_data/005_UCR_Anomaly_4000.txt', names=['values'])

# set the subsequence (window) size
window_size = 100

# calculate the matrix profile with the chosen window size
profile = mp.compute(df['values'].values, window_size)

# discover the top discord (the most anomalous subsequence)
profile = mp.discover.discords(profile, k=1)
print(profile['discords'])
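If the `matrixprofile` package is unavailable, a rough pure-NumPy baseline can be sketched as follows. This is a crude discord proxy for illustration only, not a competitive solution; `simple_discord` is a name invented here, and `sliding_window_view` requires NumPy 1.20 or later.

```python
import numpy as np

def simple_discord(ts, window_size=100):
    """Return the start index of the sliding window that is farthest
    (in Euclidean distance) from the average window. Windows overlapping
    an unusual region tend to stand out; this is only a rough stand-in
    for the matrix profile's nearest-neighbor discord definition."""
    windows = np.lib.stride_tricks.sliding_window_view(ts, window_size)
    dists = np.linalg.norm(windows - windows.mean(axis=0), axis=1)
    return int(np.argmax(dists))
```

For example, on a sine wave with an injected level shift, the returned index lands near the start of the shifted region.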

475 Teams · 578 Competitors · 1,173 Submissions