By Andrew Johnston
Did you know that you have a twin? Well, in a way at least. The twin I’m speaking of isn’t an exact genetic copy of yourself; it’s a digital version of yourself. This digital twin may not be entirely accurate, but it is portraying you across the world wide web and influencing how the world interacts with you: companies predict your purchases, and strangers may “get to know you”, all (possibly) without your knowledge or consent.
One thing is clear about customer data — there’s a lot of it. In 2009, Facebook had 30,000 servers. It now has an estimated 180,000. In 2010, Wal-Mart stored transaction data in a 2.5 million gigabyte database … that’s equivalent to about 750 million songs. What do these organizations do with all of your information?
Enter data mining, a field that combines computing and statistics. Its goal is to find patterns in big data sets and extract useful information. Facebook, for example, collects data and analyzes it to determine which advertisements are relevant to its users.
Data mining has its roots in probability and statistics. Early methods, such as regression analysis, were developed in the 18th and 19th centuries to extract basic information about data and make predictions. Today, more sophisticated tools are used, such as decision trees and neural networks. The former uses graphs to model possible decisions, and the latter simulates the learning process of neurons in the brain.
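To make the idea of regression concrete, here is a minimal sketch of a least-squares line fit in plain Python. The numbers are invented for illustration (monthly store visits against dollars spent), and `fit_line` is a hypothetical helper name, not something taken from any retailer’s system.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both left unnormalized,
    # since the 1/n factors cancel in the ratio).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: store visits per month vs. dollars spent.
visits = [1, 2, 3, 4, 5]
spend = [12.0, 19.0, 31.0, 42.0, 51.0]

slope, intercept = fit_line(visits, spend)

# Use the fitted line to guess the spend of a shopper who visits 6 times.
predicted_spend = slope * 6 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted_spend:.2f}")
```

The fitted line summarizes the past data, and plugging in a new value gives a prediction. That is the same basic move the more sophisticated tools make, only with a simpler model.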
Although it is a difficult process, there are cases in which data mining has been very successful, especially for large companies. In a New York Times article titled “How Companies Learn Your Secrets,” Andrew Pole from commercial retailer Target explains that this company (like many others) often collects information about their consumers. For example, he explains that Target analyzes the combination of products bought by women to assign them a “pregnancy prediction” score. Then, Target can send coupons to the woman for baby products like diapers and cribs to get her routinely visiting the store.
In one case, Target predicted that a teenage girl was pregnant. Later, her irate father approached a Target manager with coupons he had received in the mail for baby clothes. He voiced his outrage at receiving such an inappropriate advertisement. The surprised manager called the man a few days later to apologize. Now it was the father who was embarrassed; he had spoken with his daughter only to discover that Target was right: his baby girl was pregnant. With Target coming to Canada in 2013, it is important for you to be aware of the data that corporations collect … or else you may learn something about yourself that you didn’t know before.
David Skillicorn, a professor at the Queen’s University School of Computing, notes that in the consumer realm, data mining can be a double-edged sword. “Dealing with a company that knows a lot about you means that the company can offer you something that you’re likely to want,” he said. In the case of Target, it’s possible that pregnant women would appreciate the discounts.
Commercial retailers aren’t the only ones doing this though — many online companies monitor your searches and then respond in line with what you typed. “The point about Google’s targeted advertising is that they’re trying to show you ads for things you might actually care about — and that’s better than just bombarding you with junk,” Skillicorn said. On the other hand, people are often unaware that Google is monitoring their searches at all. “People often find it a little bit creepy,” Skillicorn said with a laugh.
Making predictions about what people are interested in can be a tricky business. Dr. Skillicorn has found that prediction algorithms are often inaccurate; one example is the online retailer Amazon. He said that people may receive recommendations for products that they could conceive of wanting, but that it’s ultimately something they’re not interested in. “It actually doesn’t work very well to be almost right, because that in some ways is more annoying than being totally off,” he said.
One reason that the predictions may be off is that they don’t take into account who you actually are; they only incorporate your previous searches into the prediction algorithm. However, one website knows us all very well, and it often replaces “real-life” social interactions with quips, comments and photos. Facebook is really your digital twin.
According to a recent study by Veronika Lukacs, 88% of people “creep” (monitor in great depth) an ex’s Facebook profile after a breakup. She also found that 74% try to creep an ex’s new partner. Keeping a close watch on your ex-significant other is possible because so many of us publish personal data online. “People have really gotten comfortable not only sharing more information and different kinds, but more openly and with more people,” Mark Zuckerberg, CEO and founder of Facebook, said in a live interview in 2010. This may have something to do with Facebook itself. In 2009, it changed the privacy settings on its website so that users were encouraged to share more information publicly. Zuckerberg himself accidentally published pictures of himself having a Star Wars battle with his girlfriend and hugging a teddy bear in
If you have a Facebook account, you probably share photos, messages and events with your friends. But you and your friends are not the only ones interested in this data. If you post a video, Facebook may track the time, date and place you recorded it. The company can also document how you interact with other users and the device you use to log in. For example, if you visit the page from a mobile phone, it may collect GPS information to determine what city you’re in and whether your friends are nearby.
More than one tech blogger has wondered what Facebook plans to do with all of this very specific data. The company derives 82% of its revenue from advertising, and buyers are starting to question whether they are getting an appropriate bang for their marketing buck. Recently, General Motors announced that it would no longer buy any Facebook ads.
After a disappointing Initial Public Offering (IPO), Facebook will be looking for new ways to capitalize on its mountain of data. Currently, it employs a team of 12 researchers, known inside the company as the Data Science Team, to mine the database for usable insights.
“Two big problems people think about in data mining are clustering and prediction,” Skillicorn said. “Clustering is really a way of understanding the structures in the data.” As an example, he cites telephone carriers that analyze the patterns in people’s calls. They identify and study different types of callers in order to construct and improve the various plans that they offer. For example, if a company notices that most of its customers are texting, it may offer more “unlimited texting” plans.
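The telephone example can be sketched with k-means, one common clustering method. Nothing in the interview says carriers actually use it. The customer figures below (call minutes and texts per month) are invented, the starting centroids are picked by hand to keep the run repeatable, and `kmeans` is a hypothetical helper name.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(p, centroids[i]),
            )
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty.
        centroids = [
            tuple(sum(v) / len(members) for v in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up customers: (call minutes per month, texts per month).
customers = [
    (600, 50), (550, 80), (620, 30),     # mostly talk
    (60, 900), (90, 1100), (40, 1000),   # mostly text
]

# Hand-picked starting guesses so the example is deterministic.
centroids, clusters = kmeans(customers, centroids=[(600, 50), (40, 1000)])

for centre, members in zip(centroids, clusters):
    print(f"centre={centre}, customers={members}")
```

The two groups that come out correspond to “talkers” and “texters”, which is the kind of structure a carrier could use when designing plans.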
Prediction involves analyzing historical data and trying to guess what new data will look like. As an example, Dr. Skillicorn notes that Facebook and LinkedIn use your current data to suggest people that you may know.
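One simple way to illustrate this kind of suggestion is to count mutual friends. This is only a sketch under that assumption; the article does not describe how Facebook or LinkedIn actually generate suggestions, and the names, the friendship graph and the `suggest_friends` helper are all made up.

```python
from collections import Counter


def suggest_friends(graph, user, top_n=3):
    """Rank people the user is not yet connected to by the number of
    friends they have in common with the user."""
    counts = Counter()
    for friend in graph[user]:
        for candidate in graph[friend]:
            # Skip the user and anyone they are already friends with.
            if candidate != user and candidate not in graph[user]:
                counts[candidate] += 1
    return counts.most_common(top_n)


# Made-up friendship graph: each person maps to the set of their friends.
friends = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "cy", "dee"},
    "cy": {"ana", "ben", "dee", "eli"},
    "dee": {"ben", "cy"},
    "eli": {"cy"},
}

suggestions = suggest_friends(friends, "ana")
print(suggestions)  # [('dee', 2), ('eli', 1)]
```

Here “dee” shares two friends with “ana” and “eli” shares one, so “dee” is suggested first: existing data is used to guess at a connection that has not been recorded yet.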
Ubiquitous internet companies and retail giants have access to a startling amount of data, but there is also a potential for data mining on a smaller scale. “Queen’s, for example, could do a lot more with analyzing grade patterns … to predict students who are in trouble,” observes Dr. Skillicorn. “Students who have very odd patterns of marks might be cheating.” However, Skillicorn remarked that validation is always the big challenge in data mining — by simply looking at marks, we can identify a student as an outlier and infer that they are cheating, but they might not be.
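The outlier idea can be sketched with a z-score, which measures how many standard deviations a mark sits from the class average. The marks, the cut-off of 2 and the `flag_outliers` helper are assumptions made for illustration, not a method described by Skillicorn.

```python
import statistics


def flag_outliers(marks, threshold=2.0):
    """Return the students whose mark is more than `threshold`
    standard deviations away from the class mean."""
    mean = statistics.mean(marks.values())
    spread = statistics.pstdev(marks.values())
    return [
        student
        for student, mark in marks.items()
        if abs(mark - mean) / spread > threshold
    ]


# Made-up final marks for a small class.
marks = {
    "s1": 72, "s2": 68, "s3": 75, "s4": 70,
    "s5": 74, "s6": 69, "s7": 71, "s8": 98,
}

flagged = flag_outliers(marks)
print(flagged)  # ['s8']
```

The flag only says that “s8” is unusual. As the validation point above makes clear, that student may simply be very strong, so an outlier is a reason to look closer and not evidence of cheating.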
David Skillicorn will be teaching CISC 333 – Introduction to Data Mining this year.