Is Your Personal Data at Risk?
As our digital world grows, more and more personal data is being collected and stored by schools, businesses, government agencies, research groups, online services, and others. While the efficiency of the Internet provides an excellent backbone for amassing and accessing this wealth of information, it also comes with an unfortunate drawback. Specifically, the potentially high value of that data and the ease of access that the Internet provides also attract and motivate thieves, hackers, and other malicious parties who would like to gain access to this rich supply of personal data.
There are three types of statistical analysis commonly used by scientists, mathematicians, politicians, and other professionals across the globe:
- Descriptive analytics provide information about collected data via statistics that you are probably familiar with, such as mean, median, mode, and range. They tend to "describe" circumstances, but don't offer conjectures about unknowns.
  - e.g., What percentage of computer science graduates are paid salaries of $100,000 or more within five years of graduating?
- Predictive analytics may provide information about future (or merely unobserved or unknown) events based on previously collected and analyzed data.
  - e.g., How likely is it that I will be able to find a high-paying job if I choose to major in computer science vs. biology?
- Prescriptive analytics may provide information to maximize the chances of a future event occurring, based on comparing the predictive analyses of multiple options.
  - e.g., Which major should I choose in order to maximize my chances of making the highest starting salary after graduation?
Each of these types of analysis is based on building models from data. The better a model fits the data, the greater its power (and thus, its utility). Generally, more data leads to greater confidence in the model. This is why big data can be so powerful.
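To make the descriptive case concrete, here is a minimal Python sketch that computes the statistics named in the list above and answers the "what percentage earn $100,000 or more" question. The salary figures and the `describe` helper are invented for illustration; they are not real survey data.

```python
from statistics import mean, median, mode

# Hypothetical salaries (in dollars) of ten computer science graduates,
# five years after graduating. These numbers are made up for illustration.
salaries = [72_000, 85_000, 85_000, 98_000, 104_000,
            110_000, 85_000, 125_000, 140_000, 96_000]


def describe(values, threshold=100_000):
    """Return basic descriptive statistics for a list of numbers."""
    at_or_above = sum(1 for v in values if v >= threshold)
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        # Share of the sample at or above the threshold, as a percentage.
        "percent_at_or_above": 100 * at_or_above / len(values),
    }


summary = describe(salaries)
for name, value in summary.items():
    print(f"{name}: {value}")
```

With this sample, 4 of the 10 salaries are at or above the threshold, so the reported percentage is 40.0. Note that the output only describes the ten graduates in the sample; saying anything about graduates in general would be a predictive question.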
Google's searches are often effective because the company's data set is huge. Google has a ton of data from which to conduct descriptive, predictive, and prescriptive analyses, and it then uses those analyses to improve user experiences.
Data Mining
Data mining is the discovery of patterns in large data sets. Like ore mining, data mining begins with an exploration (analysis) of a resource pool (data), and proceeds to determine whether usable resources exist (correlations) and to what degree (how strong they are). Not all data miners "strike it rich." Like ore mining, data mining can turn up no useful patterns at all. Sometimes, however, it leads to a bonanza of useful information.
In data mining, the emphasis is on the discovery of new knowledge. Data miners want to find new patterns that were previously unobserved. They use statistical analysis of big data to discover what the human eye can't see, just like an ore miner might use a pick, dynamite, or lab test to uncover ore that was not visible to the naked eye before. This is a form of exploratory data analysis rather than statistical hypothesis testing.
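One simple way to picture this kind of exploration is to scan every pair of columns in a data set and measure how strongly each pair is correlated. The sketch below uses Pearson's correlation coefficient, which is only one of many possible measures and is chosen here as an assumption. The column names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical per-user data; each list holds one value per user.
data = {
    "hours_watched": [2, 4, 6, 8, 10],
    "movies_rated": [1, 2, 3, 4, 5],
    "account_age": [5, 1, 4, 2, 3],
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Explore every pair of columns and rank the pairs by correlation strength.
results = [
    (a, b, pearson_r(data[a], data[b]))
    for a, b in combinations(data, 2)
]
results.sort(key=lambda row: abs(row[2]), reverse=True)

for a, b, r in results:
    print(f"{a} vs {b}: r = {r:+.2f}")
```

Here the strongest pattern is between `hours_watched` and `movies_rated`, while `account_age` is only weakly (and negatively) related to either. Nobody stated a hypothesis in advance; the pattern emerged from scanning the data, which is the exploratory spirit described above.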
Data Mining Strategies
Anomaly detection (outlier/change/deviation detection): The identification of unusual data records that might be interesting, or that might simply be data errors requiring further investigation.
- e.g., Movie X is unlike any of the other movies in User Y's data set, so remove it from our calculations. (For instance, The Texas Chainsaw Massacre is on a list that mostly contains titles such as Teletubbies, Barney and Friends, and Clifford.)
Association rule learning (dependency modeling): Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
- e.g., Recommender systems: users who like Movie X tend to also like Movie Y.
Clustering: The task of discovering groups and structures in the data that are in some way "similar," without using known structures in the data.
- e.g., Dynamically grouped movie categories: "Romantic Comedies in Paris starring former professional football players."
Classification: The task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam."
- e.g., Movie X is a romantic comedy.
Regression: Attempts to find a function that models the data with the least error.
- e.g., Type X users typically increase their movie consumption rate by four movies per year.
Summarization: Providing a more compact representation of the data set, including visualization and report generation.
- e.g., What type of movie does User X typically like? (i.e., sum up User X's preferences in Y words)
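The supermarket example for association rule learning can be sketched in a few lines of Python. This is a simplified illustration that only looks at pairs of items and uses two common measures: support (how often a pair appears together) and confidence (how often the second item appears given the first). The baskets, the `pair_rules` helper, and the `min_support` cutoff are all hypothetical; production algorithms such as Apriori handle larger item sets and much more data.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]


def pair_rules(baskets, min_support=0.4):
    """Return (lhs, rhs, support, confidence) rules for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair
        for basket in baskets
        for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # The pair is too rare to be interesting.
        # Each frequent pair yields two candidate rules: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            rules.append((lhs, rhs, support, confidence))
    # Highest-confidence rules first.
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data set, every basket containing butter also contains bread, so the rule "butter -> bread" has a confidence of 1.00. A store might use a rule like that to decide which products to shelve or promote together.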
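The regression strategy can be illustrated with an ordinary least-squares line fit, which finds the straight line that minimizes the squared error against the data. A straight line is an assumption made here for simplicity; regression in general can fit other kinds of functions. The viewing figures and the `fit_line` helper below are hypothetical.

```python
# Hypothetical data: the number of movies one type of user watched
# in each of their first five years on a service.
years = [1, 2, 3, 4, 5]
movies_watched = [11, 12, 19, 22, 26]


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(years, movies_watched)
print(f"movies = {slope:.1f} * year + {intercept:.1f}")

# Use the fitted line to estimate a year that has not been observed yet.
predicted_year_6 = slope * 6 + intercept
print(f"Estimated movies in year 6: {predicted_year_6:.0f}")
```

The fitted slope for this sample is 4.0, which corresponds to a statement like "these users increase their movie consumption by about four movies per year." The individual data points do not sit exactly on the line; the fit is simply the line with the least total squared error.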