How to Analyze Tweet Sentiments with PHP Machine Learning
This article was peer reviewed by Wern Ancheta. Thanks to all of SitePoint’s peer reviewers for making SitePoint content the best it can be!
As of late, it seems everyone and their proverbial grandma is talking about Machine Learning. Your social media feeds are inundated with posts about ML, Python, TensorFlow, Spark, Scala, Go and so on; and if you are anything like me, you might be wondering, what about PHP?
Yes, what about Machine Learning and PHP? Fortunately, someone was crazy enough not only to ask that question, but also to develop a generic machine learning library that we can use in our next project. In this post we are going to take a look at PHP-ML – a machine learning library for PHP – and we’ll write a sentiment analysis class that we can later reuse for our own chat or tweet bot. The main goals of this post are:
- Explore the general concepts around Machine learning and Sentiment Analysis
- Review the capabilities and shortcomings of PHP-ML
- Define the problem we are going to work on
- Prove that trying to do Machine learning in PHP isn’t a completely crazy goal (optional)
What is Machine Learning?
Machine learning is a subset of Artificial Intelligence that focuses on giving “computers the ability to learn without being explicitly programmed”. This is achieved by using generic algorithms that can “learn” from a particular set of data.
For example, one common usage of machine learning is classification. Classification algorithms are used to put data into different groups or categories. Some examples of classification applications are:
- Email spam filters
- Market segmentation
- Fraud detection
Machine learning is something of an umbrella term that covers many generic algorithms for different tasks, and there are two main algorithm types classified on how they learn – supervised learning and unsupervised learning.
Supervised Learning
In supervised learning, we train our algorithm using labelled data in the form of an input object (vector) and a desired output value; the algorithm analyzes the training data and produces what is referred to as an inferred function which we can apply to a new, unlabelled dataset.
For the remainder of this post we will focus on supervised learning, simply because it’s easier to see and validate the relationship between input and output. Keep in mind that both approaches are equally important and interesting; one could argue that unsupervised learning is more useful because it removes the need for labelled data.
Unsupervised Learning
This type of learning on the other hand works with unlabelled data from the get-go. We don’t know the desired output values of the dataset and we are letting the algorithm draw inferences from datasets; unsupervised learning is especially handy when doing exploratory data analysis to find hidden patterns in the data.
PHP-ML
Meet PHP-ML, a library that claims to be a fresh approach to Machine Learning in PHP. The library implements algorithms, neural networks, and tools to do data pre-processing, cross validation, and feature extraction.
I’ll be the first to admit PHP is an unusual choice for machine learning, as the language’s strengths are not that well suited for Machine Learning applications. That said, not every machine learning application needs to process petabytes of data and do massive calculations – for simple applications, we should be able to get away with using PHP and PHP-ML.
The best use case that I can see for this library right now is the implementation of a classifier, be it something like a spam filter or even sentiment analysis. We are going to define a classification problem and build a solution step by step to see how we can use PHP-ML in our projects.
The Problem
To exemplify the process of implementing PHP-ML and adding some machine learning to our applications, I wanted to find a fun problem to tackle, and what better way to showcase a classifier than building a tweet sentiment analysis class?
One of the key requirements needed to build successful machine learning projects is a decent starting dataset. Datasets are critical since they will allow us to train our classifier against already classified examples. As there has recently been significant noise in the media around airlines, what better dataset to use than tweets from customers to airlines?
Fortunately, a dataset of tweets is already available to us thanks to Kaggle. The Twitter US Airline Sentiment dataset can be downloaded from their site.
The Solution
Let’s begin by taking a look at the dataset we will be working on. The raw dataset has the following columns:
- tweet_id
- airline_sentiment
- airline_sentiment_confidence
- negativereason
- negativereason_confidence
- airline
- airline_sentiment_gold
- name
- negativereason_gold
- retweet_count
- text
- tweet_coord
- tweet_created
- tweet_location
- user_timezone
And it looks like the following example (side-scrollable table):
| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 570306133677760513 | neutral | 1.0 | | | Virgin America | | cairdin | | 0 | @VirginAmerica What @dhepburn said. | | 2015-02-24 11:35:52 -0800 | | Eastern Time (US & Canada) |
| 570301130888122368 | positive | 0.3486 | | 0.0 | Virgin America | | jnardino | | 0 | @VirginAmerica plus you’ve added commercials to the experience… tacky. | | 2015-02-24 11:15:59 -0800 | | Pacific Time (US & Canada) |
| 570301083672813571 | neutral | 0.6837 | | | Virgin America | | yvonnalynn | | 0 | @VirginAmerica I didn’t today… Must mean I need to take another trip! | | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 570301031407624196 | negative | 1.0 | Bad Flight | 0.7033 | Virgin America | | jnardino | | 0 | @VirginAmerica it’s really aggressive to blast obnoxious "entertainment" in your guests’ faces & they have little recourse | | 2015-02-24 11:15:36 -0800 | | Pacific Time (US & Canada) |
| 570300817074462722 | negative | 1.0 | Can’t Tell | 1.0 | Virgin America | | jnardino | | 0 | @VirginAmerica and it’s a really big bad thing about it | | 2015-02-24 11:14:45 -0800 | | Pacific Time (US & Canada) |
| 570300767074181121 | negative | 1.0 | Can’t Tell | 0.6842 | Virgin America | | jnardino | | 0 | @VirginAmerica seriously would pay $30 a flight for seats that didn’t have this playing. it’s really the only bad thing about flying VA | | 2015-02-24 11:14:33 -0800 | | Pacific Time (US & Canada) |
| 570300616901320704 | positive | 0.6745 | | 0.0 | Virgin America | | cjmcginnis | | 0 | @VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :) | | 2015-02-24 11:13:57 -0800 | San Francisco CA | Pacific Time (US & Canada) |
| 570300248553349120 | neutral | 0.634 | | | Virgin America | | pilot | | 0 | @VirginAmerica Really missed a prime opportunity for Men Without Hats parody, there. https://t.co/mWpG7grEZP | | 2015-02-24 11:12:29 -0800 | Los Angeles | Pacific Time (US & Canada) |
The file contains 14,640 tweets, so it’s a decent dataset for us to work with. The raw file has far more columns than we need for our example; for practical purposes we only care about the following two:
- text
- airline_sentiment
Here, text will become our feature and airline_sentiment will become our target. The rest of the columns can be discarded, as they will not be used in our exercise. Let’s start by creating the project and initializing Composer using the following file:
{
    "name": "amacgregor/phpml-exercise",
    "description": "Example implementation of a Tweet sentiment analysis with PHP-ML",
    "type": "project",
    "require": {
        "php-ai/php-ml": "^0.4.1"
    },
    "license": "Apache-2.0",
    "authors": [
        {
            "name": "Allan MacGregor",
            "email": "amacgregor@allanmacgregor.com"
        }
    ],
    "autoload": {
        "psr-4": {"PhpmlExercise\\": "src/"}
    },
    "minimum-stability": "dev"
}
composer install
If you need an introduction to Composer, see here.
To make sure we are set up correctly, let’s create a quick script that will load our Tweets.csv data file and make sure it has the data we need. Copy the following code as reviewDataset.php in the root of our project:
<?php
namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Dataset\CsvDataset;

$dataset = new CsvDataset('datasets/raw/Tweets.csv', 1);

foreach ($dataset->getSamples() as $sample) {
    print_r($sample);
}
Now, run the script with php reviewDataset.php, and let’s review the output:
Array( [0] => 569587371693355008 )
Array( [0] => 569587242672398336 )
Array( [0] => 569587188687634433 )
Array( [0] => 569587140490866689 )
Now that doesn’t look useful, does it? Let’s take a look at the CsvDataset class to get a better idea of what’s happening internally:
<?php
public function __construct(string $filepath, int $features, bool $headingRow = true)
{
    if (!file_exists($filepath)) {
        throw FileException::missingFile(basename($filepath));
    }

    if (false === $handle = fopen($filepath, 'rb')) {
        throw FileException::cantOpenFile(basename($filepath));
    }

    if ($headingRow) {
        $data = fgetcsv($handle, 1000, ',');
        $this->columnNames = array_slice($data, 0, $features);
    } else {
        $this->columnNames = range(0, $features - 1);
    }

    while (($data = fgetcsv($handle, 1000, ',')) !== false) {
        $this->samples[] = array_slice($data, 0, $features);
        $this->targets[] = $data[$features];
    }

    fclose($handle);
}
The CsvDataset constructor takes 3 arguments:
- A file-path to the source CSV
- An integer that specifies the number of features in our file
- A boolean to indicate whether the first row is a header
If we look a little closer we can see that the class is mapping out the CSV file into two internal arrays: samples and targets. Samples contains all the features provided by the file and targets contains the known values (negative, positive, or neutral).
Based on the above, our CSV file needs to have the following format:
| feature_1 | feature_2 | feature_n | target |
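To see why the first run printed only tweet IDs, here is a minimal sketch of the slicing the constructor applies to each row. The splitRow helper and the sample rows are hypothetical and exist only for illustration; only the array_slice logic mirrors the constructor shown above.

```php
<?php
// Hypothetical helper mirroring the per-row logic of the constructor above:
// the first $features columns become the sample, the next column the target.
function splitRow(array $row, int $features): array
{
    return [
        'sample' => array_slice($row, 0, $features),
        'target' => $row[$features],
    ];
}

// Abbreviated, made-up row in the raw column order: tweet_id, airline_sentiment, ...
$rawRow = ['570306133677760513', 'neutral', '1.0'];

// Row in the cleaned column order: text, airline_sentiment.
$cleanRow = ['@VirginAmerica What @dhepburn said.', 'neutral'];

$raw = splitRow($rawRow, 1);
$clean = splitRow($cleanRow, 1);

print_r($raw);   // the sample holds the tweet_id
print_r($clean); // the sample holds the tweet text
```

With one feature and the raw column order, the sample ends up being the tweet_id, which is not what we want to train on. Putting the text first and the sentiment second gives the layout the class expects.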
We will need to generate a clean dataset with only the columns we need to continue working. Let’s call this script generateCleanDataset.php:
<?php
namespace PhpmlExercise;

require __DIR__ . '/vendor/autoload.php';

use Phpml\Exception\FileException;

$sourceFilepath      = __DIR__ . '/datasets/raw/Tweets.csv';
$destinationFilepath = __DIR__ . '/datasets/clean_tweets.csv';

$rows = [];
$rows = getRows($sourceFilepath, $rows);
writeRows($destinationFilepath, $rows);

/**
 * @param $filepath
 * @param $rows
 * @return array
 */
function getRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath);

    while (($data = fgetcsv($handle, 1000, ',')) !== false) {
        $rows[] = [$data[10], $data[1]];
    }

    fclose($handle);

    return $rows;
}

/**
 * @param $filepath
 * @param string $mode
 * @return bool|resource
 * @throws FileException
 */
function checkFilePermissions($filepath, $mode = 'rb')
{
    if (!file_exists($filepath)) {
        throw FileException::missingFile(basename($filepath));
    }

    if (false === $handle = fopen($filepath, $mode)) {
        throw FileException::cantOpenFile(basename($filepath));
    }

    return $handle;
}

/**
 * @param $filepath
 * @param $rows
 * @internal param $list
 */
function writeRows($filepath, $rows)
{
    $handle = checkFilePermissions($filepath, 'wb');

    foreach ($rows as $row) {
        fputcsv($handle, $row);
    }

    fclose($handle);
}
Nothing too complex, just enough to do the job. Let’s execute it with php generateCleanDataset.php.
Now, let’s go ahead and point our reviewDataset.php script at the clean dataset (datasets/clean_tweets.csv) and run it again:
Array
(
[0] => @AmericanAir That will be the third time I have been called by 800-433-7300 an hung on before anyone speaks. What do I do now???
)
Array
(
[0] => @AmericanAir How clueless is AA. Been waiting to hear for 2.5 weeks about a refund from a Cancelled Flightled flight & been on hold now for 1hr 49min
)
BAM! This is data we can work with! So far, we have been creating simple scripts to manipulate the data. Next, we are going to create a new class in src/Classification/SentimentAnalysis.php (the directory name has to match the namespace for PSR-4 autoloading to find it).
<?php
namespace PhpmlExercise\Classification;

/**
 * Class SentimentAnalysis
 * @package PhpmlExercise\Classification
 */
class SentimentAnalysis
{
    public function train() {}
    public function predict() {}
}
Our sentiment analysis class will need two functions:
- A train function, which will take our dataset training samples and labels and some optional parameters.
- A predict function, which will take an unlabelled dataset and assign a set of labels based on the training data.
In the root of the project, create a script called classifyTweets.php. We will use this script to instantiate and test our sentiment analysis class. Here is the template that we will use:
<?php
namespace PhpmlExercise;
use PhpmlExercise\Classification\SentimentAnalysis;
require __DIR__ . '/vendor/autoload.php';
// Step 1: Load the Dataset
// Step 2: Prepare the Dataset
// Step 3: Generate the training/testing Dataset
// Step 4: Train the classifier
// Step 5: Test the classifier accuracy
Step 1: Load the Dataset
We already have the basic code that we can use for loading a CSV into a dataset object from our earlier examples. We are going to use the same code with a few tweaks:
<?php
...
use Phpml\Dataset\CsvDataset;
...
$dataset = new CsvDataset('datasets/clean_tweets.csv', 1);

$samples = [];
foreach ($dataset->getSamples() as $sample) {
    $samples[] = $sample[0];
}
This generates a flat array with only the features – in this case the tweet text – which we are going to use to train our classifier.
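As a side note, the same flattening can be done with PHP’s built-in array_column. The sketch below uses a small stand-in array in place of the real getSamples() result, so the variable names and tweets are placeholders only.

```php
<?php
// Stand-in for $dataset->getSamples(): one single-element array per tweet.
$nested = [
    ['@VirginAmerica What @dhepburn said.'],
    ['@VirginAmerica and it is a really big bad thing about it'],
];

// Pull element 0 out of every row, giving a flat list of strings.
$samples = array_column($nested, 0);

print_r($samples);
```

Either form works; the explicit loop is easier to extend if you later want to clean each tweet before storing it.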
Step 2: Prepare the Dataset
Now, having the raw text and passing that to a classifier wouldn’t be useful or accurate since every tweet is essentially different. Fortunately, there are ways of dealing with text when trying to apply classification or machine learning algorithms. For this example, we are going to make use of the following two classes:
- Token Count Vectorizer: This transforms a collection of text samples into vectors of token counts. Essentially, every word in our tweets is mapped to a unique index, and each vector records how many times that word occurs in a specific text sample.
- Tf-idf Transformer: Short for term frequency–inverse document frequency, tf-idf is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
Let’s start with our text vectorizer:
<?php
...
use Phpml\FeatureExtraction\TokenCountVectorizer;
use Phpml\Tokenization\WordTokenizer;
...
$vectorizer = new TokenCountVectorizer(new WordTokenizer());
$vectorizer->fit($samples);
$vectorizer->transform($samples);
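To make the token-count idea concrete, here is a dependency-free sketch of what such a vectorizer does conceptually. It is not PHP-ML’s implementation: the tokenize, buildVocabulary, and countTokens functions are hypothetical, and the tokenizer is deliberately simplistic.

```php
<?php
// Simplified tokenizer (an assumption): lowercase the text, then split on
// anything that is not a letter or a digit.
function tokenize(string $text): array
{
    return preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
}

// Build a vocabulary: each distinct token gets the next free column index.
function buildVocabulary(array $texts): array
{
    $vocabulary = [];
    foreach ($texts as $text) {
        foreach (tokenize($text) as $token) {
            if (!isset($vocabulary[$token])) {
                $vocabulary[$token] = count($vocabulary);
            }
        }
    }
    return $vocabulary;
}

// Turn one text into a vector of counts, one column per vocabulary entry.
function countTokens(string $text, array $vocabulary): array
{
    $vector = array_fill(0, count($vocabulary), 0);
    foreach (tokenize($text) as $token) {
        if (isset($vocabulary[$token])) {
            $vector[$vocabulary[$token]]++;
        }
    }
    return $vector;
}

// Two made-up texts standing in for tweets.
$texts = ['bad flight bad seats', 'great flight'];
$vocabulary = buildVocabulary($texts);

$vectors = [];
foreach ($texts as $text) {
    $vectors[] = countTokens($text, $vocabulary);
}

print_r($vocabulary); // bad => 0, flight => 1, seats => 2, great => 3
print_r($vectors);    // [2, 1, 1, 0] and [0, 1, 0, 1]
```

Every text ends up as a vector of the same length, which is what lets a classifier compare tweets that share no exact wording.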
Next, apply the Tf-idf Transformer:
<?php
...
use Phpml\FeatureExtraction\TfIdfTransformer;
...
$tfIdfTransformer = new TfIdfTransformer();
$tfIdfTransformer->fit($samples);
$tfIdfTransformer->transform($samples);
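For intuition, the sketch below applies one common tf-idf variant, count × log(N / df), to a pair of count vectors. The tfIdf function and the numbers are illustrative assumptions; PHP-ML’s exact weighting (for example, its choice of logarithm base) may differ.

```php
<?php
// Weight each count by log(N / df), where N is the number of documents and
// df is the number of documents that contain the term at least once.
function tfIdf(array $countVectors): array
{
    $documents = count($countVectors);

    // Count, per column, how many documents contain that term.
    $documentFrequency = [];
    foreach ($countVectors as $vector) {
        foreach ($vector as $index => $count) {
            if ($count > 0) {
                $documentFrequency[$index] = ($documentFrequency[$index] ?? 0) + 1;
            }
        }
    }

    $weighted = [];
    foreach ($countVectors as $vector) {
        $row = [];
        foreach ($vector as $index => $count) {
            $row[$index] = $count > 0
                ? $count * log($documents / $documentFrequency[$index])
                : 0.0;
        }
        $weighted[] = $row;
    }
    return $weighted;
}

// Made-up counts for two short texts; columns: bad, flight, seats, great.
$counts = [[2, 1, 1, 0], [0, 1, 0, 1]];
$weighted = tfIdf($counts);

print_r($weighted);
```

In this toy data the "flight" column appears in both texts, so its weight drops to zero, while words unique to one text keep a positive weight. That is the effect we want: words that appear everywhere tell us little about sentiment.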
Our samples array is now in a format that can easily be understood by our classifier. We are not done yet, though: we still need to label each sample with its corresponding sentiment.
Step 3: Generate the Training Dataset
Fortunately, PHP-ML has this need already covered and the code is quite simple:
<?php
...
use Phpml\Dataset\ArrayDataset;
...
$dataset = new ArrayDataset($samples, $dataset->getTargets());
We could go ahead and use this dataset and train our classifier. We are missing a testing dataset to use as validation, however, so we are going to “cheat” a little bit and split our original dataset into two: a training dataset and a much smaller dataset that will be used for testing the accuracy of our model.
<?php
...
use Phpml\CrossValidation\StratifiedRandomSplit;
...
$randomSplit = new StratifiedRandomSplit($dataset, 0.1);
$trainingSamples = $randomSplit->getTrainSamples();
$trainingLabels = $randomSplit->getTrainLabels();
$testSamples = $randomSplit->getTestSamples();
$testLabels = $randomSplit->getTestLabels();
This approach is a simple form of cross-validation, sometimes called the holdout method. The term comes from statistics and can be defined as follows:
Cross-validation, sometimes called rotation estimation, is a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. — Wikipedia.com
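The idea behind a stratified split can be sketched in plain PHP: shuffle the samples within each label, then hold back the same fraction of every label for testing. The stratifiedSplit function, its seed parameter, and the sample data are hypothetical and are not taken from PHP-ML.

```php
<?php
// Split samples into train and test sets, taking roughly $testSize of each
// label so that both sets keep the original label proportions.
function stratifiedSplit(array $samples, array $labels, float $testSize, int $seed): array
{
    mt_srand($seed); // seed the generator so the sketch is repeatable on current PHP versions

    // Group sample indexes by label.
    $byLabel = [];
    foreach ($labels as $index => $label) {
        $byLabel[$label][] = $index;
    }

    $split = [
        'trainSamples' => [], 'trainLabels' => [],
        'testSamples'  => [], 'testLabels'  => [],
    ];

    foreach ($byLabel as $indexes) {
        shuffle($indexes);
        $testCount = (int) round(count($indexes) * $testSize);

        foreach ($indexes as $position => $index) {
            $target = $position < $testCount ? 'test' : 'train';
            $split[$target . 'Samples'][] = $samples[$index];
            $split[$target . 'Labels'][]  = $labels[$index];
        }
    }

    return $split;
}

// Made-up data: ten negative and ten positive "tweets".
$samples = [];
$labels = [];
for ($i = 0; $i < 10; $i++) {
    $samples[] = "negative tweet $i";
    $labels[]  = 'negative';
    $samples[] = "positive tweet $i";
    $labels[]  = 'positive';
}

$split = stratifiedSplit($samples, $labels, 0.2, 42);

print_r(array_count_values($split['testLabels'])); // two of each label
```

Stratifying matters for a dataset like ours, where one sentiment can be much more common than the others; a purely random split could leave the test set with too few examples of the rarer labels.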
Step 4: Train the Classifier
Finally, we are ready to go back and implement our SentimentAnalysis class. If you haven’t noticed by now, a huge part of machine learning is about gathering and manipulating the data; the actual implementation of the Machine learning models tends to be a lot less involved.
To implement our sentiment analysis class, we have three classification algorithms available:
- Support Vector Classification
- KNearestNeighbors
- NaiveBayes
For this exercise we are going to use the simplest of them all, the NaiveBayes classifier, so let’s go ahead and update our class to implement the train method:
<?php
namespace PhpmlExercise\Classification;

use Phpml\Classification\NaiveBayes;

class SentimentAnalysis
{
    protected $classifier;

    public function __construct()
    {
        $this->classifier = new NaiveBayes();
    }

    public function train($samples, $labels)
    {
        $this->classifier->train($samples, $labels);
    }
}
As you can see, we are letting PHP-ML do all the heavy lifting for us. We are just creating a nice little abstraction for our project. But how do we know if our classifier is actually training and working? Time to use our testSamples and testLabels.
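If you are curious what a Naive Bayes text classifier does under the hood, the toy class below shows one textbook variant: multinomial Naive Bayes with add-one smoothing over raw words. TinyNaiveBayes is a hypothetical illustration and not PHP-ML’s implementation, which works on the numeric vectors we prepared earlier and may use a different probability model.

```php
<?php
// Toy multinomial Naive Bayes over whitespace-separated words.
class TinyNaiveBayes
{
    private $wordCounts = [];  // label => [word => count]
    private $totalWords = [];  // label => total number of words seen
    private $labelCounts = []; // label => number of training texts
    private $vocabulary = [];  // word => true

    public function train(array $texts, array $labels)
    {
        foreach ($texts as $i => $text) {
            $label = $labels[$i];
            $this->labelCounts[$label] = ($this->labelCounts[$label] ?? 0) + 1;
            foreach ($this->words($text) as $word) {
                $this->wordCounts[$label][$word] = ($this->wordCounts[$label][$word] ?? 0) + 1;
                $this->totalWords[$label] = ($this->totalWords[$label] ?? 0) + 1;
                $this->vocabulary[$word] = true;
            }
        }
    }

    public function predict(string $text): string
    {
        $bestLabel = '';
        $bestScore = -INF;
        $totalTexts = array_sum($this->labelCounts);
        $vocabularySize = count($this->vocabulary);

        foreach ($this->labelCounts as $label => $count) {
            // Log prior plus the sum of log likelihoods, with add-one smoothing
            // so unseen words do not zero out the whole score.
            $score = log($count / $totalTexts);
            foreach ($this->words($text) as $word) {
                $seen = $this->wordCounts[$label][$word] ?? 0;
                $score += log(($seen + 1) / (($this->totalWords[$label] ?? 0) + $vocabularySize));
            }
            if ($score > $bestScore) {
                $bestScore = $score;
                $bestLabel = $label;
            }
        }
        return $bestLabel;
    }

    private function words(string $text): array
    {
        return preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
    }
}

// Made-up training data.
$model = new TinyNaiveBayes();
$model->train(
    ['late flight lost bag', 'terrible delay rude crew', 'great crew smooth flight', 'thanks lovely service'],
    ['negative', 'negative', 'positive', 'positive']
);

$first = $model->predict('rude crew and late flight');
$second = $model->predict('lovely service thanks');
echo $first . PHP_EOL . $second . PHP_EOL;
```

The "naive" part is the assumption that every word contributes independently to the score, which keeps training down to simple counting.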
Step 5: Test the Classifier’s Accuracy
Before we can proceed with testing our classifier, we do have to implement the prediction method:
<?php
...
class SentimentAnalysis
{
    ...
    public function predict($samples)
    {
        return $this->classifier->predict($samples);
    }
}
And again, PHP-ML is doing us a solid and doing all the heavy lifting for us. Let’s update our classifyTweets.php script accordingly:
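<?php
...
// Train on the training split, then predict labels for the test samples
$classifier = new SentimentAnalysis();
$classifier->train($trainingSamples, $trainingLabels);
$predictedLabels = $classifier->predict($testSamples);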
Finally, we need a way to test the accuracy of our trained model; thankfully PHP-ML has that covered too, and they have several metrics classes. In our case, we are interested in the accuracy of the model. Let’s take a look at the code:
<?php
...
use Phpml\Metric\Accuracy;
...
echo 'Accuracy: '.Accuracy::score($testLabels, $predictedLabels);
We should see something along the lines of:
Accuracy: 0.73651877133106
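The score is a fraction between 0 and 1 rather than a percentage, so a value of about 0.74 means roughly 74% of the test tweets were labelled correctly. As a rough sketch of how such a score can be computed, the hypothetical accuracy function below counts matching labels; it is a plain-PHP illustration with made-up labels, not the library’s code.

```php
<?php
// Fraction of predictions that match the known labels.
function accuracy(array $actual, array $predicted): float
{
    if (count($actual) !== count($predicted) || count($actual) === 0) {
        throw new InvalidArgumentException('Both arrays must be non-empty and the same length.');
    }

    $correct = 0;
    foreach ($actual as $index => $label) {
        if ($label === $predicted[$index]) {
            $correct++;
        }
    }

    return $correct / count($actual);
}

// Made-up labels: three of the four predictions are correct.
$actual    = ['negative', 'negative', 'positive', 'neutral'];
$predicted = ['negative', 'positive', 'positive', 'neutral'];

$score = accuracy($actual, $predicted);

echo 'Accuracy: ' . $score . PHP_EOL;                       // as a fraction
echo 'Accuracy: ' . round($score * 100, 1) . '%' . PHP_EOL; // as a percentage
```

Accuracy alone can be misleading when one label dominates the dataset, so it is worth looking at the other metrics classes as well once the basics work.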
Conclusion
This article fell a bit on the long side, so let’s do a recap of what we’ve learned so far:
- Having a good dataset from the start is critical for implementing machine learning algorithms.
- The difference between supervised learning and unsupervised learning.
- The meaning and use of cross-validation in machine learning.
- That vectorization and transformation are essential to prepare text datasets for machine learning.
- How to implement a Twitter sentiment analysis by using PHP-ML’s NaiveBayes classifier.
This post also served as an introduction to the PHP-ML library and hopefully gave you a good idea of what the library can do and how it can be embedded in your own projects.
Finally, this post is by no means comprehensive and there is plenty to learn, improve and experiment with; here are some ideas to get you started on how to improve things further:
- Replace the NaiveBayes algorithm with the Support Vector Classification algorithm.
- If you tried running against the full dataset (14,000 rows) you’d probably notice how memory intensive the process can be. Try implementing model persistence so it doesn’t have to be trained on each run.
- Move the dataset generation to its own helper class.
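For the persistence idea, one generic starting point is PHP’s built-in serialize and unserialize. The sketch below is an assumption-laden illustration: StandInModel, saveModel, and loadModel are hypothetical names, and StandInModel merely takes the place of a trained classifier. PHP-ML may provide its own helper for this, so check its documentation first.

```php
<?php
// Hypothetical stand-in for a trained classifier; not a PHP-ML class.
class StandInModel
{
    private $weights;

    public function __construct(array $weights)
    {
        $this->weights = $weights;
    }

    public function score(string $word): float
    {
        return (float) ($this->weights[$word] ?? 0.0);
    }
}

// Write the serialized object to disk.
function saveModel($model, string $path)
{
    if (file_put_contents($path, serialize($model)) === false) {
        throw new RuntimeException("Could not write model to $path");
    }
}

// Read the file back and rebuild the object.
function loadModel(string $path)
{
    $contents = file_get_contents($path);
    if ($contents === false) {
        throw new RuntimeException("Could not read model from $path");
    }

    // Only unserialize files you created yourself, never untrusted input.
    return unserialize($contents);
}

$path = sys_get_temp_dir() . '/sentiment-model.ser';

saveModel(new StandInModel(['great' => 1.0, 'delay' => -1.0]), $path);
$restored = loadModel($path);

echo $restored->score('delay') . PHP_EOL;

unlink($path); // clean up the temporary file
```

With something like this in place, the training script and the prediction script can be separated, and only the former needs to load the full dataset.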
I hope you found this article useful. If you have some application ideas regarding PHP-ML or any questions, don’t hesitate to drop them below into the comments area!