# Hugging Face

Hugging Face hosts around 100,000 pre-trained language models that can be used for various NLP tasks. The Hugging Face transformers library, a popular choice for NLP tasks such as text classification and machine translation, currently supports over 100 model architectures, including popular models such as BERT, GPT-2, and RoBERTa. In addition, Hugging Face provides tools and libraries that allow users to fine-tune and customize these models for specific tasks or datasets.
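As a quick illustration of how little code is needed to use one of these pre-trained models, here is a minimal sketch using the transformers pipeline API (the model is downloaded from the Hub on first use; the default model chosen for the task may vary by library version):

```python
from transformers import pipeline

# Load a default pre-trained model for sentiment analysis.
# The first call downloads the model weights from the Hugging Face Hub.
classifier = pipeline("sentiment-analysis")

# Run inference on a sample sentence.
result = classifier("Hugging Face makes it easy to use pre-trained models.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```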

A Hugging Face Course – https://github.com/huggingface/course

CEO Clement Delangue calls it the “GitHub of machine learning.” Its emphasis on an open, collaborative approach is what made investors confident in the company’s $2 billion valuation, he said. “That’s what is really important to us, makes us successful and makes us different from others in the space.”

DistilBERT is a smaller, faster, and cheaper version of the BERT language model, developed by Hugging Face by controlling the loss function during the training of a ‘student model’ from a ‘teacher model’. It bucks the trend towards larger models and instead focuses on training a more efficient model. It has been “distilled” to reduce its size and computational requirements, making it faster to train and more efficient to run. Despite being smaller than BERT, DistilBERT achieves similar or even slightly better performance on many NLP tasks. The triple loss function is devised to include a distillation loss, a training loss, and a cosine-distance loss.

Examples of generative models available on the Hugging Face platform include:

1. GPT-2 (Generative Pre-training Transformer 2): a large-scale language model developed by OpenAI that can be used for tasks such as language translation and text generation.
2. BERT (Bidirectional Encoder Representations from Transformers): a language model developed by Google that can be used for tasks such as language translation and text classification.
3. RoBERTa (Robustly Optimized BERT Approach): a language model developed by Facebook, based on the BERT model, that can be used for tasks such as language translation and text classification.
4. T5 (Text-To-Text Transfer Transformer): a language model developed by Google that can be used for tasks such as language translation and text summarization.
5. DistilBERT, described above.

To generate text with DistilBERT, you would typically fine-tune the model on a specific task, such as machine translation or language generation, using a dataset that is relevant to the task. Once the model has been fine-tuned, you can use it to generate text by providing it with a prompt or seed text and letting it predict the next word or sequence of words.

Docs on text generation – https://huggingface.co/transformers/v3.1.0/main_classes/model.html?highlight=generate

Here’s an example of using transformers to generate some text:

```python
from transformers import AutoTokenizer, AutoModelWithLMHead

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelWithLMHead.from_pretrained('distilgpt2')

# Encode the prompt
input_context_prompt = "Men on the moon "
input_ids = tokenizer.encode(input_context_prompt, return_tensors='pt')

# Generate text by sampling candidate outputs
outputs = model.generate(input_ids=input_ids,
                         max_length=40,
                         temperature=0.9,
                         num_return_sequences=10,
                         do_sample=True)

# Decode and print the sampled candidates
for i in range(10):  # 10 output sequences were generated
    print('Generated {}: {}'.format(i, tokenizer.decode(outputs[i], skip_special_tokens=True)))
```

Note the temperature parameter passed to model.generate(). A temperature close to zero means the generation process will almost always choose the most likely next word; a higher temperature allows less likely words to be included in the generation process.
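For comparison, the sampling-free limit that the temperature note alludes to is greedy decoding, which can be selected by turning sampling off. A minimal sketch, reusing the model, tokenizer, and input_ids from the example above:

```python
# Greedy decoding: with do_sample=False, the most likely next token is chosen
# at every step, so temperature plays no role and one sequence is returned.
greedy_output = model.generate(input_ids=input_ids, max_length=40, do_sample=False)
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
```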
# Machine Learning Security

Seven security concerns in Machine Learning (ML):

1. Data privacy and security: ML requires large amounts of data for training, and this data may contain sensitive or personal information. Appropriate measures need to be put in place to prevent the data from being accessed by unauthorized parties.
2. Notebook security: ML typically requires Jupyter or similar notebooks to be served for data scientists to work on data, code, and models, both individually and collaboratively. These notebooks need to be access controlled and protected from unauthorized access. This includes the code and the git repos that host the code, and the model artifacts that the notebook uses or creates.
3. Model serving and inference security: ML models in production are commonly served and accessed over inference endpoints, and such endpoints need authentication, authorization, and encryption for protection against misuse. During model upgrades to an endpoint, or changes to an endpoint and its configuration, a number of attacks are possible that are typical of a devops/devsecops pipeline. These need to be protected against.
4. Model security: Models can be vulnerable to attacks such as adversarial inputs, where an attacker intentionally manipulates the input to the model in order to cause it to make incorrect predictions. Another example is when the model makes an egregiously bad decision on an input, for example a self-driving car hitting an obstacle instead of avoiding it. It is important to harden the model and bound the decisions that come from its use.
5. Misuse: Even if a model works as designed, it can be misused, for example by generating fake or misleading content. It is important to consider the potential unintended consequences of using models and to put safeguards in place to prevent their misuse.
6. Bias: ML models can sometimes exhibit biases due to the data they are trained on. There should be a plan to identify biases in a model and take steps to mitigate them.
7. Intellectual property: ML models may be protected by intellectual property laws, and it is important to respect these laws and obtain the appropriate licenses when using language models developed by others.

# Multimodal neurons and typographic attacks

https://openai.com/blog/multimodal-neurons/

Training ML models on images and text together leads to certain neurons holding information from both images and text – multimodal neurons. When the type of a detected object can be changed by tricking the model into recognizing a textual description instead of a visual one, that can be called a typographic attack. These are intriguing concepts, indicating that a fluid crossover from text to images and back is almost here.

# Processors for Deep Learning: Nvidia Ampere GPU, Tesla Dojo, AWS Inferentia, Cerebras

The NVidia Volta-100 GPU, released in Dec 2017, was the first microprocessor with dedicated cores purely for matrix computations, called Tensor Cores. The Ampere-100 GPU, released in May 2020, is its successor; the A100 enables 108 Streaming Multiprocessors (SMs) with 4 Tensor Cores (TCs) each, for a total of 432 TCs. Tensor Cores reduce the cycle time for matrix multiplications; Volta’s operate on 4×4 matrices of 16-bit floating point numbers. These GPUs are aimed at Deep Learning use cases, which consist of a pipeline of matrix operations. Here’s an article on choosing the right EC2 instance type for DL – https://towardsdatascience.com/choosing-the-right-gpu-for-deep-learning-on-aws-d69c157d8c86 (G4 for inferencing, P4 for training).

How did the need for specialized DL chips arise, and why are Tensors important in DL? In math, we have Scalars and Vectors: Scalars encode magnitude, while Vectors encode magnitude and direction.
To transform Vectors, one applies Linear Transformations in the form of Matrices. The Matrices of Linear Transformations have Eigenvectors and Eigenvalues, which describe the invariants of the transformation. A Tensor in math and physics is a concept that exhibits certain types of invariance during transformations. In 3 dimensions, a Stress Tensor has 9 components, which can be represented as a 3×3 matrix; under a change of basis the components of the tensor change, but the tensor itself does not.

In Deep Learning applications, a Tensor is basically a Matrix. The Generalized Matrix Multiplication (GEMM) operation, D = A×B + C, is at the heart of Deep Learning, and Tensor Cores are designed to speed it up (a small PyTorch sketch appears at the end of this section). In Deep Learning, multilinear maps are interleaved with non-linear transforms to model arbitrary transformations of input to output, and a specific model is arrived at by a process of error reduction during training on actual data. This PyTorch Deep Learning page is an excellent resource for transitioning from traditional linear algebra to deep learning software – https://pytorch.org/tutorials/beginner/nlp/deep_learning_tutorial.html .

Tesla Dojo is a plan to build a processor/computer dedicated to Deep Learning, to train on vast amounts of video data. It was launched on Tesla AI Day, Aug 2021; a video is at https://www.youtube.com/watch?v=DSw3IwsgNnc . AWS Inferentia is a chip for deep learning inferencing, with four Neuron Cores. AWS Trainium is an ML chip for training. Generally speaking, the desire in the deep learning community is to have simpler processing units in larger numbers.

Updates: Cerebras announced a chip which can handle neural networks with 120 trillion parameters, with 850,000 AI-optimized cores per chip. SambaNova, Anton, Cerebras and Graphcore presentations are at https://www.anandtech.com/show/16908/hot-chips-2021-live-blog-machine-learning-graphcore-cerebras-sambanova-anton . SambaNova is building 400,000 AI cores per chip.
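Here is the promised sketch of the GEMM operation in PyTorch. Shapes are chosen to mirror the 4×4 Tensor Core operands; float32 is used so the sketch runs on any machine, whereas Tensor Cores would consume FP16 inputs on a GPU:

```python
import torch

# GEMM: D = A x B + C, the operation at the heart of deep learning.
A = torch.randn(4, 4)
B = torch.randn(4, 4)
C = torch.randn(4, 4)

D = torch.matmul(A, B) + C  # what a Tensor Core computes on FP16 operands
print(D)
```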
# Accuracy vs Recall vs Precision vs F1 in Machine Learning

We want to walk through some common metrics in classification problems – such as accuracy, precision and recall – to get a feel for when to use which metric. Say we are looking for a needle in a haystack. There are very few needles in a large haystack full of straws. An automated machine is sifting through the objects in the haystack and predicting for each object whether it is a straw or a needle. A reasonable predictor will predict a small number of objects as needles and a large number as straws. A prediction has two attributes – positive/negative and accurate/inaccurate.

Positive Prediction: the object at hand is predicted to be a needle. A small number.

Negative Prediction: the object at hand is predicted not to be a needle. A large number.

True_Positive: of the total number of predictions, the number of predictions that were positive and correct. Correctly predicted Positives (needles). A small number.

True_Negative: of the total number of predictions, the number of predictions that were negative and correct. Correctly predicted Negatives (straws). A large number.

False_Positive: of the total number of predictions, the number of predictions that were positive but incorrect. Incorrectly predicted Positives (straw predicted as needle). Could be large, as the number of straws is large; but assuming the total number of predicted needles is small, this is less than or equal to the predicted needles, hence small.

False_Negative: of the total number of predictions, the number of predictions that were negative but incorrect. Incorrectly predicted Negatives (needle predicted as straw). Is this a large number? It is unknown – this class is not large just because the class of negatives is large; it depends on the predictor, and a “reasonable” predictor which predicts most objects as straws could also predict many needles as straws. It is less than or equal to the total number of needles, hence small.

Predicted_Positives = True_Positives + False_Positives = total number of objects predicted as needles.

Actual_Positives = the actual number of needles, which is independent of the number of predictions either way; Actual_Positives = True_Positives + False_Negatives.

Accuracy = nCorrect_Predictions / nTotal_Predictions = (nTrue_Positives + nTrue_Negatives) / (nPredicted_Positives + nPredicted_Negatives). The “reasonable predictor” assumption above is equivalent to a high accuracy: most predictions will be straws, and will be correct simply because of the skewed distribution. Accuracy does not shed light on FP or FN.

Precision = nTrue_Positives / nPredicted_Positives # correctly_identified_needles / predicted_needles. This sheds light on FP: Precision = 1 => FP = 0 => all predictions of needles are in fact needles. A precision less than 1 means we got a bunch of straw with the needles – which gives hope that with further sifting the straw can be removed. Precision is also called Positive Predictive Value; it quantifies the absence of False Positives, or incorrect diagnoses. (It is sometimes confused with Specificity, which is a different metric: TN/(TN+FP).)

Recall = nTrue_Positives / nActual_Positives = TP/(TP+FN) # correctly_identified_needles / all_needles. This sheds light on FN: Recall = 1 => FN = 0. A recall less than 1 is awful, as some needles are left out in the sifting process. Recall is also called Sensitivity.

Precision > Recall => FN is higher than FP. Precision < Recall => FN is lower than FP.

If at least one needle is correctly identified as a needle, both precision and recall will be positive; if zero needles are correctly identified, both precision and recall are zero.

F1 Score is the harmonic mean of Precision and Recall: 1/F1 = (1/2)(1/P + 1/R), so F1 = 2PR/(P+R). F1 = 0 if P = 0 or R = 0; F1 = 1 if P = 1 and R = 1.

ROC/AUC rely on Recall (= TP/(TP+FN)) and another metric, the False Positive Rate, defined as FP/(FP+TN) = straw_falsely_identified_as_needles / total_straw. As TN >> FP, this will be close to zero, so it does not appear to be a useful metric in the context of needles in a haystack – nor, therefore, do ROC/AUC. Note that the denominators differ between Recall and FPR: total needles and total straw respectively.

There’s a bit of semantic confusion in saying True Positive or False Positive. These shorthands can be interpreted as: it was known that an instance was a Positive, and a label of True or False was applied to that instance. But what we mean is that it was not known whether the instance was a Positive; a determination was made that it was a Positive, and this determination was later found to be correct (True) or incorrect (False). Mentally replace True/False with ‘Correctly/Incorrectly identified as’ to remove this confusion.
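A minimal numeric sketch of these formulas, using made-up needle/haystack counts for illustration:

```python
# Made-up confusion-matrix counts for the needle-in-a-haystack example.
tp = 8     # needles correctly identified as needles
fn = 2     # needles missed (predicted as straw)
fp = 3     # straws incorrectly flagged as needles
tn = 987   # straws correctly identified as straw

accuracy = (tp + tn) / (tp + tn + fp + fn)   # high simply due to class skew
precision = tp / (tp + fp)                   # sheds light on FP
recall = tp / (tp + fn)                      # sheds light on FN
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)                         # near zero because TN >> FP

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f} fpr={fpr:.3f}")
```

With these counts, accuracy comes out at 0.995 even though a fifth of the needles were missed – which is exactly why precision and recall, not accuracy, are the metrics to watch here.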
Normalization: scaling to a 0-1 range, or to unit norm; useful for dot products when calculating similarity.

Standardization: subtract the mean and divide by the standard deviation; useful for neural network/classifier inputs.

Regularization: used to reduce sensitivity to certain features in regression. L1: Lasso regression. L2: Ridge regression.

Confusion matrix: holds the number of predicted values vs the known truth; a square matrix with size n equal to the number of categories.

Bias, Variance and their tradeoff: we want both to be low. When going from a simple model to a complex one, one often goes from a high-bias to a high-variance scenario. https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

# ML Transformer and GPT-2 Meetup

“The attention mechanism allows the model to create the context vector as a weighted sum of the hidden states of the encoder RNN at each previous timestamp.”

“Transformer is a type of model based entirely on attention, and does not require recurrent or convolutional layers.”

The context vector is the output of the Encoder in an Encoder-Decoder network (EDN). EDNs struggle to retain all the information the decoder needs to decode accurately; attention is a mechanism to solve this problem. “Attention mechanisms let a model directly look at, and draw from, the state at any earlier point in the sentence. The attention layer can access all previous states and weighs them according to some learned measure of relevancy to the current token, providing sharper information about far-away relevant tokens.”

GPT: Generative Pre-Trained Transformer. Unlike BERT, it is generative and not geared to comprehension, translation or summarization tasks, but instead to writing or generative tasks. It uses unsupervised learning to train a deep neural network with a seq2seq model. It does not use reinforcement learning (feedback from the environment) or supervised learning. It uses “masked self-attention” to predict the next text during training on its dataset.

The term “generative” is used to emphasize GPT’s ability to generate new, original text, rather than just processing or analyzing text that already exists. A generative model is a type of machine learning model that is trained to produce data, such as text, images, or music, that is similar to the data it was trained on. GPT is a generative model because it is trained on a large corpus of text data and can then generate new text that is similar to the text in its training data. This allows GPT to produce human-like text on a wide range of topics, which can be useful for applications such as language translation, text summarization, and question answering.

A “transformer” is a type of neural network architecture introduced in 2017. It is a deep learning model used for natural language processing tasks such as language translation and text summarization. A transformer consists of two main components: an encoder, which processes the input text, and a decoder, which generates the output text. The encoder and decoder are connected by a series of attention mechanisms, which allow the model to focus on different parts of the input text as it generates the output. This architecture allows the model to process input text in a parallel, rather than sequential, manner, which makes it more efficient and effective than previous models. The transformer architecture has been widely adopted in natural language processing and has been shown to be highly effective for many tasks.

In a transformer, the “attention” mechanism allows the model to focus on different parts of the input text at different times as it generates the output text. This is different from previous neural network models, which processed the input text sequentially, one word at a time. The attention mechanism in a transformer works by calculating a weight for each word in the input text.
This weight represents the importance of that word in the context of the current output word the model is generating. The model then uses these weights to decide which words in the input text to focus on as it generates the output. This allows the model to selectively focus on the most relevant words in the input text.

The Transformer (https://en.m.wikipedia.org/wiki/Transformer_(machine_learning_model)) was initially released June 2017 by the Google Brain team. GPT was released June 2018 by OpenAI. BERT was released Oct 2018 by Google. GPT-2 was announced Feb 2019 by OpenAI, trained on 40GB of text. GPT-3 was introduced May 2020 and was in beta testing in July 2020, trained on 10x the data, or 400GB. BERT is a response to GPT, and GPT-2 is in turn a response to BERT.

This attention concept looks akin to a Fourier or Laplace transform, which encodes the entire input signal in a lossless manner, in a way that allows sections or bands of it to be referred to later. Although implemented differently, it is a way to keep track of and refer to global state.

AutoML and Transformer – http://ai.googleblog.com/2019/06/applying-automl-to-transformer.html

BERT and GPT are both based on the Transformer ideas. BERT is bidirectional and better at comprehending meaning from a whole sentence/phrase, whereas GPT is better at generating text.

https://jalammar.github.io/illustrated-transformer/

https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

Bahdanau, 2014 introduced the concept of Attention – https://arxiv.org/abs/1409.0473 : “The most important distinguishing feature of this approach from the basic encoder–decoder is that it does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation. This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector. We show this allows a model to cope better with long sentences.” This description makes it more like a wavelet transform, which does auto-correlations of a signal at different levels of granularity to make sense of it.

Conceptual progression:

1. Input -> Encoder -> Decoder -> Output.
2. The Encoder maintains Hidden States to parse/grok the input. These are vectors. Once it goes through the input, it passes the final Hidden State, called the Context, forward to the Decoder.
3. This Context is the bottleneck in the operation of the Decoder.
4. The Attention concept introduced by Bahdanau and others was to overcome the bottleneck in the Context.
5. With Attention, the entire set of intermediate Hidden States is passed on to the Decoder, not just the final Context.
6. The Decoder does a couple of additional steps: a) it assigns a score to each Hidden State; b) it multiplies each Hidden State by its score. This set of scored vectors is then used by the Decoder to produce the Output (see the sketch below).
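A minimal NumPy sketch of steps 5–6, assuming simple dot-product scoring between the decoder state and each encoder hidden state (Bahdanau’s original alignment model is a small learned network, so the scoring function here is a simplification for illustration):

```python
import numpy as np

def attention_context(decoder_state, encoder_hidden_states):
    """Weighted sum of encoder hidden states (steps 5-6 above)."""
    # a) score each encoder hidden state against the current decoder state
    scores = encoder_hidden_states @ decoder_state     # shape: (seq_len,)
    # softmax turns the scores into weights that sum to 1
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # b) scale each hidden state by its weight and sum into a context vector
    return weights @ encoder_hidden_states             # shape: (hidden_dim,)

# Toy example: 5 input tokens, hidden dimension 4 (made-up values).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))   # encoder hidden states, one per input token
s = rng.normal(size=4)        # current decoder state
print(attention_context(s, H))
```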
# NVidia Volta GPU vs Google TPU

A Graphics Processing Unit (GPU) allows multiple hardware processors to act in parallel on a single array of data, allowing a divide-and-conquer approach to large computational tasks such as video frame rendering, image recognition, and various types of mathematical analysis including convolutional neural networks (CNNs). The GPU is typically placed on a larger chip which includes CPU(s) to direct data to the GPUs. This trend is making supercomputing tasks much cheaper than before.

The Tesla V100 is a System on Chip (SoC) which contains the Volta GPU, which contains Tensor Cores designed specifically for accelerating deep learning by accelerating the matrix operation D = A×B + C, each input being a 4×4 matrix. More on Volta at https://devblogs.nvidia.com/parallelforall/inside-volta/ . It is helpful to read about the architecture of the previous Pascal P100 chip, which contains the GP100 GPU, described here – http://wccftech.com/nvidia-pascal-specs/ . Background on why NVidia builds chips this way (SIMD < SIMT < SMT) is here – http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html .

Volta GV100 GPU = 6 GraphicsProcessingClusters × 7 TextureProcessingClusters/GraphicsProcessingCluster × 2 StreamingMultiprocessors/TextureProcessingCluster × (64 FP32 units + 64 INT32 units + 32 FP64 units + 8 TensorCore units + 4 Texture units)

The FP32 cores are referred to as CUDA cores, which means 84×64 = 5376 CUDA cores per Volta GPU. The Tesla V100, the first product (SoC) to use the Volta GPU, uses only 80 of the 84 SMs, or 80×64 = 5120 cores. The frequency of the chip is 1.455GHz. The Fused Multiply-Add (FMA) instruction does a multiplication and addition in a single instruction (a×b + c), resulting in 2 FP operations per instruction, giving 1.455 × 2 × 5120 = 14.9 TeraFLOPS from the CUDA cores alone. The Tensor Cores do a 3D multiply-and-add with 7×4×4 + 4×4 = 128 FP ops per cycle each (112 ops for the 4×4 matrix product plus 16 for the addition of C), for a total of 1.455 × 80 × 8 × 128 ≈ 120 TFLOPS for deep learning apps (the arithmetic is checked in the sketch below). The Volta GPU uses a 12nm manufacturing process, down from 16nm for Pascal. For comparison, the Jetson TX1 claims 1 TFLOPS, and the TX2 twice that (or the same performance at half the power of the TX1). Volta will be available on Azure, AWS, and platforms such as Facebook; there are several applications at Amazon, and the MS Cognitive Toolkit will use it.

For comparison, the Google TPU runs at 700MHz and is manufactured with a 28nm process. Instead of FP operations, it uses quantization to integers and a systolic array approach to minimize the watts per matrix multiplication, and it optimizes for neural network calculations rather than more general GPU operations. The TPU uses a design based on an array of 256×256 multiply-accumulate (MAC) units, resulting in 92 Tera integer ops/second. Given that NVidia is targeting additional use cases such as computer vision and graphics rendering along with neural network use cases, the TPU approach would not make sense for NVidia.
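The peak-throughput arithmetic above, as a quick Python check:

```python
clock_ghz = 1.455                       # V100 boost clock in GHz
cuda_cores = 80 * 64                    # 80 SMs x 64 FP32 cores = 5120
tensor_cores = 80 * 8                   # 80 SMs x 8 Tensor Cores = 640
ops_per_tc_cycle = 7 * 4 * 4 + 4 * 4    # 128 FP ops per Tensor Core per cycle

cuda_tflops = clock_ghz * 2 * cuda_cores / 1000     # FMA = 2 ops/instruction
tensor_tflops = clock_ghz * tensor_cores * ops_per_tc_cycle / 1000

print(f"CUDA core peak:   {cuda_tflops:.1f} TFLOPS")    # ~14.9
print(f"Tensor core peak: {tensor_tflops:.1f} TFLOPS")  # ~119.2
```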

Miscellaneous conference notes:

Nvidia DGX-1, a “Personal Supercomputer” for $69,000, was announced. It contains eight Tesla V100 accelerators connected over NVLink.

Tesla FHHL (Full Height, Half Length) card, for inferencing. Volta is ideal for inferencing, not just training, and also for data centers. Power and cooling use 40% of the datacenter.

As AI data floods the data centers, Volta can replace 500 CPUs with 33 GPUs.
Nvidia GPU Cloud: download the container of your choice. The first hybrid deep learning cloud network. Nvidia.com/cloud . Private beta extended to GTC attendees.

Containerization with GPU support: the host has the right NVidia driver, and the Docker image from the GPU cloud adapts to the host's driver version – a single Docker image. The nvidia-docker tool initializes the drivers.

Moore's law comes to an end. Need AI at the edge, far from the data center. Need it to be compact and cheap.

The Jetson board has a Tegra SoC chip, with 6 CPUs and a Pascal GPU.

AWS Device Shadows vs GE Digital Twins: a different focus – availability+connectivity vs operational efficiency, a manufacturing perspective vs an operational perspective. A locomotive may be simulated when disconnected.

DeepInstinct analysed malware data using convolutional neural networks on GPUs, to better detect malware and its variations.

Omni.ai – deep learning on time-series data to detect anomalous conditions in field sensors, such as pressure in a gas pipeline.

GAN applications to various problems – these will be refined in the next few years.

GeForce 960 video card: an older but popular card for gamers; it used the Maxwell GPU, which predates the Pascal GPU.

Cooperative Groups in CUDA 9. More on CUDA 9.

# Neural Network Training and Inferencing on Nvidia

Nvidia just announced the Tesla P40 and P4 cards for neural network inferencing applications. A review is at http://www.anandtech.com/show/10675/nvidia-announces-tesla-p40-tesla-p4 . Compared to the Tesla P100 released earlier this year, the P40 is targeted at inferencing applications, whereas the P100 was targeted at the more demanding training phase of neural networks. The P40 comes with the TensorRT (real-time) library for fast inferencing (e.g. real-time detection of objects).

Some of the best solutions to hard problems in machine learning come from neural networks, whether in computer vision, voice recognition, games such as Go, or other domains. Nvidia and other hardware vendors are accelerating AI applications with these releases.

What happens if the neural network draws a bad inference in a critical AI application? Bad inferences have been discussed in the literature, for example in the paper “Intriguing properties of neural networks”.

There are ways to minimize bad inferences in the training phase, but they are not foolproof – in fact the paper above notes that bad inferences are low-probability yet dense.

Level 5 autonomous driving is where the vehicle can handle unknown terrain. Most current systems are targeting Level 2 or 3 autonomy. The Tesla Model S’ Autopilot is Level 2.

One answer is to pair the network with a regular program that checks certain safety constraints. This would make it safer, but this alone is likely insufficient either for achieving Level 5 operations, or for providing safety for them.

# Cassandra and the Internet of Boilers

A fascinating story about the use of Cassandra by British Gas, for analyzing sensor data from boilers in UK homes to predict their failures, appeared here.

The design of Cassandra is intuitively clear to me in its use of a single primary index to distribute the query load among a set of nodes that can be scaled up linearly. It uses a ring architecture based on consistent hashing. It emphasizes Availability and Partition-Tolerance over Consistency in the CAP theorem.
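A toy sketch of the consistent-hashing idea behind the ring, assuming one token per node and MD5 hashing (in the style of Cassandra's RandomPartitioner); real Cassandra assigns many virtual-node tokens per physical node:

```python
import bisect
import hashlib

def token(key: str) -> int:
    """Hash a key onto the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# One token per node for simplicity.
nodes = {token(n): n for n in ["node-a", "node-b", "node-c"]}
ring = sorted(nodes)

def owner(row_key: str) -> str:
    """A row is owned by the first node clockwise from its token."""
    idx = bisect.bisect(ring, token(row_key)) % len(ring)
    return nodes[ring[idx]]

print(owner("boiler-12345"))  # the same key always routes to the same node
```

Adding a node inserts one more token into the ring and moves only the keys between that token and its predecessor – which is what allows the linear scaling noted above.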

The data structure is a two level hash table, with the first level key being the row key, and the second level key being the column key.
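In Python terms, a minimal model of that two-level structure (purely illustrative; names are made up):

```python
# Two-level map: row key -> (column key -> value)
table = {}
table.setdefault("boiler-12345", {})["2015-11-01T10:00"] = {"pressure": 1.3}
table.setdefault("boiler-12345", {})["2015-11-01T10:05"] = {"pressure": 1.4}

# In Cassandra, all columns for a row key are colocated on the owning node
# and stored sorted by column key; the inner dict plays that role here.
print(table["boiler-12345"])
```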

Where Cassandra differs from a SQL DB is in the flexibility of the data model. In SQL one can model complex relationships, which allow complex queries to be done using joins. Cassandra supports CQL (Cassandra Query Language), which is like SQL but does not support joins or transactions. The impact is that queries in CQL cannot be as flexible (or ad hoc) as those in SQL: the kinds of queries to be done have to be planned in advance, and other queries would be inefficient. However this drawback is mitigated by using Spark along with Cassandra. In my understanding, the Spark cluster is run parallel to the Cassandra cluster.
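A sketch of this query-first modeling using the Python cassandra-driver. The keyspace, table, and column names are made up for illustration; the table is keyed so that the one planned query (readings for a boiler over a time range) hits a single partition:

```python
from cassandra.cluster import Cluster

# Connect to a local cluster (contact point and keyspace are assumptions).
session = Cluster(["127.0.0.1"]).connect("monitoring")

# Table designed around the planned query: partition by boiler, cluster by time.
session.execute("""
    CREATE TABLE IF NOT EXISTS boiler_readings (
        boiler_id text,
        read_at   timestamp,
        pressure  double,
        PRIMARY KEY (boiler_id, read_at)
    )
""")

# The planned query: one partition, rows ordered by the clustering key.
rows = session.execute(
    "SELECT read_at, pressure FROM boiler_readings "
    "WHERE boiler_id = %s AND read_at >= %s",
    ("boiler-12345", "2015-11-01"))
for row in rows:
    print(row.read_at, row.pressure)
```

A query not anticipated by the primary key (say, all boilers above a given pressure) would require scanning every partition – which is where Spark comes in.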

Why are joins important? It goes back to relationships in an E-R diagram. Can't we just model entities? When we store Employees in one table and Departments in another in a SQL DB, each row has an id which is a shorthand for the employee or the department. This simplification forces us to look up both tables again via a join in a query – say when asking for all employees belonging to (only) the finance department. But tables like Departments may be small in size, so they could be replicated in memory for quickly recovering associations. And tables like Employees can be naturally partitioned by the employee id, which is unique. This means that SQL and complex relationships may not be needed for a number of use cases. If ACID compliance is also not a requirement, then NoSQL is a good bet. Cassandra differs from MongoDB in that it can scale much better.

Quote from British Gas: “We’re dealing largely with time series data, and Spark is 10 to 100 times quicker as it is operating on data in-memory…Cassandra delivers what we need today and if you look at the Internet of Things space; that is what is really useful right now.”

Here’s a blog that triggered this thought, along with a talk by Rachel@datastax, who also assured me that Cassandra has been hardened for security and has Kerberos support in the free version.

British Gas operates Hive, a competitor to Nest for thermostats. Note that a couple of months back British Gas reported that 2,200 of its accounts were compromised.

# “Computer Detective in the Cloud”

Although light on details, this is an application of AI for securing against credit card fraud in real time using cloud computing.

AI has been in the news a few times this month – Google (TensorFlow), Facebook (new milestones in AI), Microsoft releasing Cortana (Nadella welcomes our AI overlords) and mention of an AI spring from IBM and Salesforce.

Machine learning has also been applied to spam detection, intrusion detection, malicious file detection, malicious url detection, insurance claims leakage detection, activity/behaviour based authentication, threat detection and data loss prevention.

Worth noting that these successes are typically in narrow domains with narrow variations of what is being detected. Intrusion detection is a fairly hard problem for machine learning because the number of variations of attacks is high. As someone said, we’ll be using signatures for a long time.

The previous burst of activity around neural networks in the late '80s and early '90s had subsided around the same time as the rise of the internet in the mid-to-late '90s. Around 2009, as GPUs made parallel processing more mainstream, there was a resurgence in activity – deeper, multilayer networks looking at overlapping regions of images (similar to wavelets) led to convolutional neural networks being developed. These have had successes in image and voice recognition. A few resources – GPU Gems for general-purpose computing, visualizing convolutional nets, and the Caffe deep learning framework.