Example of Splunk Dedup command executionīy using Splunk Dedup command, the user can specify the counts of duplication with respect to events to keep either for every value of single filed or for combinations of each value among various fields.Different functions of Splunk Dedup filtering commands.Differentiation between Uniq and Splunk Dedup commands.The Splunk Dedup command will return the first key value found for that particular search keyword/field. ![]() The Dedup command in Splunk removes duplicate values from the result and displays only the most recent log for a particular incident. to get the fields you want displayed how you want.Splunk Dedup command removes all the events that presumes an identical combination of values for all the fields the user specifies. The secondary contributing factor – important, but not as much a factor as what I covered above – is that dedup must process the whole event, whereas stats chucks everything that isn’t part of what it’s counting (so if an event is 1kb in size, dedup has to return the whole kb, while stats is only looking at maybe 1/10 the total (if you include a coupld extra fields)).Īnother neat aspect of using |stats is that it creates a table for you – if you’re running |dedup, you then have to |table. dedup at a minimum is going to do ~54 million comparison (and probably a lot more – given it doesn’t merely take 13x the time to run, but closer to 25x). – it still has to run through all 4 million events, but all it is doing is adding one to the list for every value that shows up in that particular field – IOW, it “only” has to do a total of 4 million things (because it does need to look at every event returned). If the list is being created in a slighly more efficient manner (say a heap or binary search tree), it will still take ~43 million comparisons ( 3.2 million * log 2(11,000)).)Ĭompare this to the relative complexity of using |stats count by. (Because linear searching finds what it’s looking for on average by the time it has traversed half the list. Those last 3.2 million events each need to check against the list of >10,500 entries – which means, roughly, 16,8 billion comparisons still need to be made! That initial list of length m fills pretty quickly (it is, after all, only going to get to ~11,000 total entries (in this case)), but as it grows to its max, it gets progressively harder and hard to check whether or not the next event has already been dedup’d.Īt ~750,000 events returned (roughly 1/5 my total), the list is unique field values was 98% complete – yet there were still ~3.2 million events left to go (to find just 2% more unique field values). ![]() ![]() To use Big-O Notation, this is an O(n*m) operation (bordering on O(n 2))! In the particular case I was seeing yesterday, that means that every single event in the list of 4,000,000 events returned by the search has to be compared one at a time to a list (that I know is going to top out at about 11,000). ![]() In contrast, dedup must compare every individual returned event’s field that matches what you’re trying to dedup to it’s growing list of unique entries for that field. is that stats can hand-off the counting process to something else (though, even if it doesn’t, incrementing a hashtable entry by 1 every time you encounter an instance isn’t terribly computationally complex) and keep going. (A coworker added some more context, but it’s a smaller reason of why one is faster then the other.) Īfter some further reflection over dinner, I figured out the major portion of why this is – and I feel a little dumb for not having thought of it before. is substantially faster than running dedup. Yesterday I wrote-up a neat little find in Splunk wherein running stats count by.
0 Comments
Leave a Reply. |