

This guide is a very basic introduction to some of the approaches used in cleaning text data. But why do we need to clean text at all? Can we not just eat it straight out of the tin? The answer is yes: if you want to, you can use the raw data exactly as you've received it. However, cleaning your data will increase the accuracy of your model.

The simplest way to turn documents into numbers is to count words: phrases can be broken down into vector representations with a simple measure, a count of the number of times each word appears in each document (phrase).
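
The original example phrases do not survive in this excerpt, so the sketch below uses two made-up phrases to illustrate the same idea: each phrase becomes a vector of counts over a shared vocabulary.

```python
# A minimal sketch of the count-vector idea. The two phrases are
# hypothetical stand-ins; the article's original examples are not
# preserved in this excerpt.
from collections import Counter

phrases = [
    "the quick brown fox",
    "the lazy brown dog",
]

# Tokenise naively on whitespace and build a shared vocabulary so
# both vectors have the same dimensions in the same order.
tokens = [phrase.split() for phrase in phrases]
vocabulary = sorted({word for words in tokens for word in words})

# One count vector per phrase: entry i is how often vocabulary[i] appears.
vectors = [[Counter(words)[word] for word in vocabulary] for words in tokens]

print(vocabulary)  # ['brown', 'dog', 'fox', 'lazy', 'quick', 'the']
print(vectors[0])  # [1, 0, 1, 0, 1, 1]
print(vectors[1])  # [1, 1, 0, 1, 0, 1]
```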

These two vectors could now be used as input into your data mining model.

A more sophisticated way to analyse text is to use a measure called Term Frequency – Inverse Document Frequency (TF-IDF). Term Frequency (TF) is the number of times a word appears in a document, so the more times a word appears in a document, the larger its value for TF will get. Inverse Document Frequency (IDF) then shows the importance of a word within the entire collection of documents, or corpus. The nature of the IDF value is such that terms which appear in a lot of documents will have a lower score or weight, while terms that only appear in a single document, or in a small percentage of the documents, will receive a higher score. This higher score makes such a word a good discriminator between documents. The TF-IDF weight for a word i in document j is given as:

w_{i,j} = tf_{i,j} × log(N / df_i)

where tf_{i,j} is the number of times word i appears in document j, N is the total number of documents in the corpus, and df_i is the number of documents that contain word i.
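
As a rough, hand-rolled illustration of this formula (not the article's own code), the sketch below computes the weight for a few hypothetical one-line documents:

```python
# A minimal sketch of the TF-IDF weighting above, using
# w_ij = tf_ij * log(N / df_i). The three tiny "documents" are
# hypothetical examples, not drawn from the article.
import math
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

tokenised = [doc.split() for doc in documents]
n_docs = len(tokenised)

# df_i: the number of documents in which word i appears at least once.
doc_freq = Counter(word for words in tokenised for word in set(words))

def tf_idf(word, words):
    tf = words.count(word)                   # raw term frequency tf_ij
    idf = math.log(n_docs / doc_freq[word])  # inverse document frequency
    return tf * idf

# "sat" appears in two of the three documents, so it scores lower than
# "cat", which appears in only one and so discriminates better.
print(tf_idf("sat", tokenised[0]))  # 1 * log(3/2) ≈ 0.405
print(tf_idf("cat", tokenised[0]))  # 1 * log(3/1) ≈ 1.099
```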

A detailed background and explanation of TF-IDF, including some Python examples, is given here: Analyzing Documents with TF-IDF. Suffice it to say that TF-IDF will assign a value to every word in every document you want to analyse, and the higher the TF-IDF value, the more important or predictive the word will typically be. However, before you can use TF-IDF you need to clean up your text data.
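
This excerpt does not yet spell out the clean-up steps, so the sketch below is only an assumption about what they typically include: lower-casing, stripping punctuation, and removing common stop words.

```python
# A minimal sketch of typical pre-TF-IDF clean-up: lower-casing,
# punctuation removal, and stop-word filtering. The stop-word list is
# a small hand-picked stand-in, not a definitive one.
import string

STOP_WORDS = {"the", "a", "an", "and", "is", "it", "of", "to", "in", "on"}

def clean(text):
    # Lower-case so "The" and "the" count as the same word.
    text = text.lower()
    # Strip punctuation so "dog!" and "dog" count as the same word.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stop words, which carry little discriminating power.
    return [word for word in text.split() if word not in STOP_WORDS]

print(clean("The cat sat on the mat, and the dog barked!"))
# ['cat', 'sat', 'mat', 'dog', 'barked']
```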
