Anomaly Detection in Log Files - Use Doc2Vec?

Hi there,

I’m tasked with finding (unlabelled) anomalies in log files and figured Doc2Vec followed by DBSCAN might be a good choice. To verify that this strategy works, I came up with an extremely simplified example which wouldn’t really require language processing, but in reality the messages are more complex (typically a message is made up of multiple words; some messages are similar and differ in only a few words, while others are completely different). Despite the simplification, I’m failing to conclude whether the approach makes sense, and the examples I found here seem to be different or a bit too complex to get started with. So maybe someone’s got time to take a look at this and share his/her thoughts?

Logs typically contain messages for the different operations a machine carries out while processing a product. In my example there are only three messages (a, b and c), and they always occur in this sequence. However, sometimes products are more difficult and some tasks need multiple attempts, but b should always come after n times a, and c after n times b (where n is greater than or equal to 1). I limited my example to 6 scenarios:
  • 22 normal products (a-b-c)
  • 22 difficult products (a-a-b-b-c-c)
  • 22 very difficult products (a-a-a-b-b-b-c-c-c)
  • 3 anomalies (a-a-a, a-b-c-a-b-c and c-c-c-a-b-a-a-b-b)
Doc2Vec_MockUp_DataSet.xlsx (15.9 KB)
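For reference, here is a quick sketch that reproduces the 69 windows from the attached file (the window IDs in this sketch are made up; the “Window” column in the file itself is authoritative):

```python
# Rebuild the mock-up dataset: 66 normal/difficult/very difficult
# products plus the 3 anomalous windows described above.
normal = list("abc")
difficult = list("aabbcc")
very_difficult = list("aaabbbccc")
anomalies = [list("aaa"), list("abcabc"), list("cccabaabb")]

windows = []  # list of (window_id, message_sequence)
for i in range(22):
    windows.append((f"normal_{i}", normal))
    windows.append((f"difficult_{i}", difficult))
    windows.append((f"very_difficult_{i}", very_difficult))
for i, seq in enumerate(anomalies):
    windows.append((f"anomaly_{i}", seq))

print(len(windows))  # 69 windows in total
```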

My goal is to find the three product clusters and the 3 anomalies, either as separate clusters or as noise. I figured I could use the product ID (called “Window” in the attached file) as document title and the messages (a, b or c) as text, then train a Doc2Vec model on the data, apply some distance metric and use DBSCAN to find the clusters. I’m not getting anything close to what I expect, no matter which distance metric or DBSCAN settings I use, so I’m wondering whether my understanding and usage of Doc2Vec is correct in the first place. Any thoughts on this are highly appreciated!
Doc2Vec_Test.knwf (111.5 KB)

Thanks in advance,

Hi Mark

First of all, I’m in no way an expert on machine learning, so please take my impressions below with a grain of salt… Also, the devil is in the details, and perhaps I’m completely missing the point by looking only at the mockup scenario.

Generally speaking, in many use cases the definition of “anomaly” is tricky, and it typically has to be built on top of statistical concepts. In the scenario you are describing, though, you describe an anomaly in terms of a clear, specific rule: products must be processed in a well-defined order, and anything that doesn’t follow that order or doesn’t have all the required messages is an anomaly.

If the above is true, I feel that your approach might be overkill, and perhaps other ideas could be tried first. As an example, please find attached a minimal workflow that solves the problem for your mockup dataset (it outputs the window IDs containing an anomaly). Please note that the concept could be customized to handle situations in which the same product generates slightly different logs, as long as the basic log structure is kept as you described.
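To illustrate the idea outside KNIME: a minimal sketch (my own illustration, not the attached workflow itself) that flags every window whose message sequence isn’t “n times a, then n times b, then n times c”:

```python
import re

def is_anomaly(messages):
    """Flag a window whose sequence is not n*a + n*b + n*c (n >= 1)."""
    m = re.fullmatch(r"(a+)(b+)(c+)", "".join(messages))
    if m is None:
        return True
    # All three runs must have the same length n.
    return not (len(m.group(1)) == len(m.group(2)) == len(m.group(3)))

print(is_anomaly(list("abc")))     # False: a normal product
print(is_anomaly(list("aabbcc")))  # False: a difficult product
print(is_anomaly(list("abcabc")))  # True: the order repeats
print(is_anomaly(list("aaa")))     # True: b and c are missing
```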

Regarding Doc2Vec, I feel that it might not be a suitable approach in your case anyway, because it relies on item co-occurrence without paying attention to the order in which items appear…
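You can see this directly in your mockup data: the anomaly a-b-c-a-b-c has exactly the same message counts as a normal difficult product a-a-b-b-c-c, so any purely count- or co-occurrence-based representation maps both to (nearly) the same point:

```python
from collections import Counter

difficult_product = list("aabbcc")  # a normal "difficult" window
anomaly = list("abcabc")            # same messages, wrong order

# The order-insensitive count vectors are identical...
print(Counter(difficult_product) == Counter(anomaly))  # True

# ...even though the sequences clearly differ.
print(difficult_product == anomaly)  # False
```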

If a more advanced approach is actually needed (for instance, detect uncommon combinations, even when they follow a conceptually correct order), I would explore architectures capable of modelling a sequence in time (recurrent neural networks, 1D CNN…)

Please excuse me if my interpretation of your email is too simplistic; I hope this helps.

Best Regards,


kforum.knwf (21.1 KB)

Hello Jorge,

if you’re not an expert on ML, what does that make me? :cry:
Anyways, thanks for taking the time to produce the workflow and the detailed response! I thought about it intensely and here are my conclusions:

For the sake of simplicity I had provided an extremely simplified log and made some statements which are not 100% true in reality. Specifically, there are many more different messages (some of them merely different flavours of the ones given in the example, like a*, a’ or â instead of a, but also hundreds of totally different ones like d, e, f, …), and more importantly, the sequence is not really always exactly the same: depending on the product made, or when e.g. a daily auto-tuning routine kicks in, sequences can vary quite a bit. That said, I’m relatively sure overkill is not my problem.

However, your comment regarding the suitability of Doc2Vec made me think, and I realized this is probably where I was wrong. I had started off with a Bag of Words approach combined with slicing the logs into time windows, to see whether the (relative) frequency of certain messages is unusual in some windows. This completely ignores the sequence of messages within those windows, so I thought using Doc2Vec would overcome this issue while still being easy to implement in an unsupervised scenario. But I guess it doesn’t really overcome the issue completely; instead it only tells me whether e.g. message a typically appears close to message b, without allowing a conclusion like “message a typically occurs directly before (and not after) message b”. If this information is really relevant, it seems like I need to dive into RNNs or even LSTMs, and I already found a good starting point. What I still need to get my head around is how to use these networks in an unsupervised scenario. Maybe something with reconstruction errors can be used?

Long story short, my understanding has changed to this:

  • Detect unusual (relative) frequencies in certain time windows -> use BoW w/ windowing
  • Detect messages in an unusual context if the exact sequence doesn’t matter -> use Doc2Vec
  • Detect unusual sequences of messages -> use RNNs/LSTMs
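As a stepping stone before full RNNs, even a simple first-order check can catch the order anomalies in the mockup data: learn which transitions (and which start/end messages) ever occur in normal windows, then flag any window containing something never seen. A sketch of the idea on the mockup sequences (my own simplification, not a replacement for an LSTM on the real data):

```python
def learn_normal(normal_windows):
    """Collect the start messages, end messages and transitions seen in normal data."""
    starts, ends, transitions = set(), set(), set()
    for seq in normal_windows:
        starts.add(seq[0])
        ends.add(seq[-1])
        transitions.update(zip(seq, seq[1:]))
    return starts, ends, transitions

def is_unusual_sequence(seq, model):
    """Flag a window that starts, ends or transitions in a way never seen in training."""
    starts, ends, transitions = model
    if seq[0] not in starts or seq[-1] not in ends:
        return True
    return any(t not in transitions for t in zip(seq, seq[1:]))

model = learn_normal([list("abc"), list("aabbcc"), list("aaabbbccc")])
print(is_unusual_sequence(list("aaa"), model))        # True: ends in a
print(is_unusual_sequence(list("abcabc"), model))     # True: c->a never seen
print(is_unusual_sequence(list("cccabaabb"), model))  # True: starts with c
print(is_unusual_sequence(list("aabbcc"), model))     # False
```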

Is this about right?

Thanks a lot,

Hi Mark,

For me it’s hard to tell if that approach would work (each particular case has lots of nuances to be studied, and seriously I’m not an ML expert!), however, at least in theory, I guess it could…

In any case my suggestion is to start as simple as possible, trying more complex algorithms as needed. Experimentation is your friend, and I’ve found KNIME to be really helpful also for this.

It would be great if you can share your progress!

Best Regards,



Hi Jorge,

thanks for your interest! So here’s my progress:

The original log file mock-up I provided above consisted of three different sequences mimicking the manufacturing of a simple, a normal and a difficult product (abc, aabbcc and aaabbbccc). In the original mock-up the production sequence was always a simple product, followed by a normal one, followed by a difficult one, and then starting over with a simple one and so on. I.e. the log was effectively multiple copies of just one sequence: abcaabbccaaabbbccc. I changed that so that the manufacturing order of simple, normal and difficult products is now random, and created a much longer log file:
LSTM_MockUp_Log.xlsx (383.6 KB)

With this as my input data I adapted the fairy tale example such that I use the initial 5,000 rows of the log for training and then loop through the remaining rows, using batches of 20 rows each to predict the 21st one (i.e. rows 5,001-5,020 predicting 5,021, then 5,002-5,021 predicting 5,022, etc.).
LSTM_Test 1.knwf (1.5 MB)
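In code terms, the rolling-window setup boils down to building overlapping (20 messages, next message) pairs over the log; a minimal sketch of just that windowing step (the actual training happens in the Keras nodes of the attached workflow, and the log here is a stand-in):

```python
def make_windows(messages, lag=20):
    """Build (20 preceding messages, next message) pairs for training/prediction."""
    samples = []
    for i in range(len(messages) - lag):
        samples.append((messages[i:i + lag], messages[i + lag]))
    return samples

log = list("abc" * 100)  # stand-in for the real log messages
samples = make_windows(log)
print(len(samples))      # 280 pairs from 300 messages
print("".join(samples[0][0]), "->", samples[0][1])
```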

Due to the randomization one can’t really expect a perfect prediction, despite the simplicity of the example. For example, if an input sequence ends with ccc you know for sure the next message must be a, but you have no idea whether that’s an a from a simple, normal or difficult product. Consequently, if the input sequence ends with cca you can’t possibly know whether to predict a or b as the next message; you only know it can’t be c. That said, I have done a rough estimate: if simple, normal and difficult products had equal shares (which they don’t; it’s 40%, 30% and 30%, respectively), I concluded you can’t beat an accuracy of 88.9%. I’m getting 87.1% correct predictions, and the error pattern also looks reasonable (i.e. I can only find mistakes where one can’t possibly know what to predict). So I guess the whole thing works pretty well, even though it is overkill for this simple example.

So I figured it was time to increase complexity, and I added “noise”: I simply allowed a random message X to be inserted between products with a 10% probability. Actually, I started by even allowing X to appear inside a product sequence (e.g. aabXbcc), but quickly decided to take smaller steps. Since I thought this was a small modification and I should still be OK using 20 preceding messages to predict the next one, I only changed the Keras Input Layer shape from ?,3 to ?,4 (now having 4 unique messages: a, b, c and X) and the output layer dimensionality accordingly, but the result was disappointing. I played with more lag columns, more training etc., but this small change in the input data makes the whole thing fail big time:
LSTM_Test 2.knwf (1.6 MB)
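For reference, the shape change from ?,3 to ?,4 corresponds to one-hot encoding over the enlarged alphabet; a minimal sketch:

```python
ALPHABET = ["a", "b", "c", "X"]  # 4 unique messages after adding noise

def one_hot(message):
    """Encode one message as a length-4 one-hot vector."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(message)] = 1
    return vec

print(one_hot("a"))  # [1, 0, 0, 0]
print(one_hot("X"))  # [0, 0, 0, 1]
```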

I understand the accuracy must go down, since there are more scenarios where a prediction is simply impossible. However, a good model should “know” that the messages highlighted below could in no case occur at that position in the sequence, since this was certainly never the case in the training data:

Long story short: in the simplest case it seems to work pretty well, but just a little noise makes it fail completely. I haven’t got my head around why that is, so any input is highly welcome…


Mark, you moron! If you add a fourth message to the set of possible messages, you should also look at the softmax output for that fourth message when deciding which message was predicted with the highest probability. If you do that, the LSTM works very well even with the “noisy” data… problem solved.
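In code terms, the bug and the fix look like this (the probability values are hypothetical, just for illustration):

```python
classes = ["a", "b", "c", "X"]
softmax_out = [0.10, 0.25, 0.05, 0.60]  # hypothetical network output

# Buggy version: argmax over only the first three outputs,
# so the noise message X can never be predicted.
wrong = max(range(3), key=softmax_out.__getitem__)
print(classes[wrong])  # b

# Fixed version: argmax over all four outputs, including X.
right = max(range(len(classes)), key=softmax_out.__getitem__)
print(classes[right])  # X
```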

