The challenge with text data and machine learning is that heavy pre-processing or cleaning is required to remove as much noise as possible so that the model can pick up on the signal in the data. We're going to very quickly cover three pre-processing steps that will help a machine learning model more easily pick up on that signal: removing punctuation, tokenization, and removing stop words. For more details on these steps, feel free to revisit "NLP with Python for Machine Learning: The Essentials."

So let's start by reading in our data and cleaning up the columns. One note I'll make is that we are adjusting the width of each column that pandas will display, so we can see more of the text message to ensure our cleaning steps are having the intended effect. So we'll run that, and you can see the same data frame that we were looking at in the last video.

The first step we're going to take to remove the noise is to clean out all the punctuation. In order to remove the punctuation, we have to have a way to show Python what punctuation looks like. Luckily, the built-in string module contains a list of punctuation in it. So we'll import string, and here you can see all kinds of punctuation and special characters in this list.

But you may be asking yourself, why does this really matter? Why do we need to remove punctuation? The reason we care is that periods and parentheses look like just another character to Python, but realistically, a period doesn't help pull out the meaning of a sentence. Let's test this theory by asking Python to compare "This message is spam" to "This message is spam." So of course, Python tells us that these two strings or phrases are not equal. And it isn't saying that "This message is spam" is different from "This message is spam." in the sense that they're really close, but one has a period and one doesn't. It knows they're different without any ability to understand how different. To Python, this might as well be "This message is spam" versus "This message is not spam." So we want to clean this up so Python can understand that these two phrases are identical.
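As a minimal sketch of the comparison and the punctuation clean-up described here, the snippet below uses only Python's standard library. The helper name `remove_punct` and the two sample variables are illustrative choices for this sketch, not necessarily the names used in the video.

```python
import string

# string.punctuation is a plain str holding the ASCII punctuation characters.
print(string.punctuation)

# Two phrases that differ only by a trailing period.
with_period = "This message is spam."
without_period = "This message is spam"

# Python compares strings character by character, so these are not equal.
print(with_period == without_period)  # False


def remove_punct(text):
    """Return text with every character found in string.punctuation dropped.

    Hypothetical helper for illustration only.
    """
    return "".join(char for char in text if char not in string.punctuation)


# After cleaning, the two phrases are identical.
print(remove_punct(with_period) == remove_punct(without_period))  # True
```

Checking membership in `string.punctuation` character by character keeps the idea easy to follow; on a data frame, the same function could be applied to the text column one row at a time.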