SOCIAL MEDIA DATA MINING METHODS
Data mining techniques applied to social media are a relatively recent development compared to social network analysis, which dates back to the 1930s. Despite their novelty, these techniques are already used commercially. For instance, companies specializing in "Social Media Analytics" monitor social media platforms to report how products and services are perceived and discussed. These organizations use text mining algorithms and propagation models to analyze data from blogs and other social media channels, aiming to understand how information moves and what impact it has.
Applying data mining techniques to social media sites yields deeper insights and makes the data usable for analytics, research, and business strategy. Key areas of application include detecting and analyzing communities or groups, tracking information diffusion and influence propagation, detecting and tracking topics, analyzing individual and group behavior, and conducting market research for organizations.
Representation of Data
As with other kinds of social network data, social media datasets are commonly studied using a graph representation. In this context, a graph consists of vertices (or nodes) and edges (or links). Typically, users are represented as nodes, while the relationships or interactions between them are represented as edges connecting these nodes.
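A minimal sketch of this representation in Python, assuming an undirected friendship graph stored as an adjacency map. The user names and the helper build_graph are hypothetical and serve only as illustration.

```python
from collections import defaultdict

def build_graph(edges):
    """Build an undirected adjacency map from (user, user) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

# Hypothetical friendship pairs, for illustration only.
friendships = [("ann", "bob"), ("bob", "cat"), ("ann", "cat"), ("cat", "dan")]
graph = build_graph(friendships)

print(sorted(graph["cat"]))  # ['ann', 'bob', 'dan']
```

Storing neighbors in sets keeps duplicate interactions from creating duplicate edges. A directed relationship (for example, "follows") would add the edge in one direction only.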
Graph-based depictions are frequently employed to analyze data from social networking sites where people connect with friends, family, and business associates. This approach helps create visual representations of social networks. However, the application of graph structures to other types of online social media platforms, such as blogs, wikis, and opinion mining sites, is less immediately apparent.
For example, in the context of blogs, two types of graph representations can be used. One approach treats individual blogs as nodes, creating what is known as a "blog network." Another approach considers blog posts as nodes, resulting in a "post-network." In this latter model, edges are established when one blog post references another. Additionally, more sophisticated techniques, such as Internet Online Analytical Processing (iOLAP), integrate nodes representing individuals, relationships, content, and time to provide a comprehensive view of blog networks.
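The two blog representations can be sketched as follows. The post identifiers, blog names, and citation pairs are hypothetical, and deriving blog-level edges by collapsing post-level citations is an assumption made for this example; other definitions of a blog network are possible.

```python
# Hypothetical data: each post belongs to a blog and may cite other posts.
post_blog = {"p1": "blogA", "p2": "blogA", "p3": "blogB", "p4": "blogC"}
citations = [("p3", "p1"), ("p4", "p1"), ("p4", "p3"), ("p2", "p1")]

def post_network(citations):
    """Post-network: directed edges (citing post, cited post)."""
    return set(citations)

def blog_network(citations, post_blog):
    """Blog network: collapse post-level edges to blog level (assumed rule)."""
    edges = set()
    for src, dst in citations:
        a, b = post_blog[src], post_blog[dst]
        if a != b:  # ignore a blog citing its own posts
            edges.add((a, b))
    return edges

post_edges = post_network(citations)
blog_edges = blog_network(citations, post_blog)
print(sorted(blog_edges))
# [('blogB', 'blogA'), ('blogC', 'blogA'), ('blogC', 'blogB')]
```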
Wikis can be analyzed by treating authors as nodes and creating edges whenever an author contributes to an object or page. This method captures the collaborative nature of wiki contributions.
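As a rough sketch, an edit log can be turned into author-to-page edges as described above. The edit log below is hypothetical, and the co-author projection (linking two authors who edited the same page) is an added assumption, shown only as one way to study collaboration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical edit log: (author, page). Repeated edits are allowed.
edits = [("ann", "Graph"), ("bob", "Graph"), ("ann", "Tree"),
         ("cat", "Tree"), ("ann", "Graph")]

def author_page_edges(edits):
    """One edge per (author, page) pair; repeated edits collapse."""
    return set(edits)

def coauthor_edges(edits):
    """Assumed projection: link authors who contributed to the same page."""
    authors_by_page = defaultdict(set)
    for author, page in edits:
        authors_by_page[page].add(author)
    pairs = set()
    for authors in authors_by_page.values():
        pairs.update(combinations(sorted(authors), 2))
    return pairs

contrib = author_page_edges(edits)
collab = coauthor_edges(edits)
print(sorted(collab))  # [('ann', 'bob'), ('ann', 'cat')]
```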
The graphical representation of social media data facilitates the application of classic mathematical graph theory and traditional analysis techniques. However, working with large graphs representing extensive social media networks poses challenges, including limitations related to computer memory and processing speeds. Handling these large-scale datasets often requires advanced methods to manage and analyze the data effectively. Additionally, automated procedures must address challenges such as filtering out spam, accommodating diverse formats within the same category of social media, and adapting to the constantly evolving content and structure of social media platforms.
Data Mining - A Process
When studying social media, several fundamental considerations are crucial to achieving meaningful outcomes. Each type of social media and its associated data mining objectives may require different methods and algorithms to effectively leverage the data. The choice of tools depends on the nature of the data and the specific goals of the analysis.
For instance, if the goal is to organize the data into predefined categories, classification tools may be appropriate. Conversely, if the data’s content is understood but trends and patterns are not clear, clustering tools might be more effective.
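The difference between the two tool families can be illustrated with a small sketch. It uses a nearest-centroid classifier and a basic k-means loop as stand-ins chosen for brevity, not as recommended methods. The feature vectors (posts per day, share of posts containing links) and the labels "casual" and "promoter" are hypothetical.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(p, centers):
    """Return the key of the center closest to p."""
    return min(centers, key=lambda k: dist2(p, centers[k]))

# Classification: categories are predefined and learned from labeled examples.
labeled = {"casual": [(1, 0.1), (2, 0.2)], "promoter": [(9, 0.9), (8, 0.8)]}
centers = {label: centroid(pts) for label, pts in labeled.items()}
print(nearest((7, 0.7), centers))  # promoter

# Clustering: no labels are given; groups emerge from the data.
def kmeans(points, k, rounds=10):
    centers = {i: points[i] for i in range(k)}  # naive init: first k points
    for _ in range(rounds):
        groups = {i: [] for i in range(k)}
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group ends up empty.
        centers = {i: centroid(g) if g else centers[i] for i, g in groups.items()}
    return groups

points = [(1, 0.1), (9, 0.9), (2, 0.2), (8, 0.8)]
groups = kmeans(points, k=2)
print(groups[0])  # [(1, 0.1), (2, 0.2)]
```

In the classification half the category names exist before any data is scored. In the clustering half the groups carry only numeric identifiers, and interpreting them is left to the analyst.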
Understanding the problem at hand is essential for selecting the right approach. Thorough knowledge of the data and available data mining tools is necessary before applying any techniques. Consulting subject analysts or experts might also be beneficial for a deeper understanding of the dataset. Numerous texts and resources on data mining and machine learning can provide valuable insights into various techniques and algorithms.
Once the issues are well-understood and an appropriate data mining approach is chosen, it is important to consider data preprocessing. This may involve organizing the data to ensure manageable processing times and implementing privacy protection measures. Despite the vast amount of publicly accessible data on social media platforms, safeguarding individual rights and securing platform copyrights are critical. Addressing issues such as spam and incorporating temporal aspects of the data are also important.
The impact of time must also be considered. Depending on the research question, results may vary significantly over time: in subject detection, influence propagation, and network development, for example, temporal factors can influence outcomes. Network structure, group behavior, and marketing trends all change, so what is relevant or popular now may not remain so. Analyses of social media data should therefore account for these temporal dynamics.
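One simple way to account for time, sketched here under the assumption that each interaction carries a date, is to split the data into per-period snapshots and analyze each snapshot separately. The interactions and the monthly window are hypothetical choices.

```python
from collections import defaultdict
from datetime import date

# Hypothetical timestamped interactions: (user, user, date).
interactions = [
    ("ann", "bob", date(2024, 1, 5)),
    ("bob", "cat", date(2024, 1, 20)),
    ("ann", "cat", date(2024, 2, 3)),
    ("ann", "bob", date(2024, 2, 10)),
]

def monthly_snapshots(interactions):
    """Group undirected edges by (year, month) of the interaction."""
    snapshots = defaultdict(set)
    for a, b, day in interactions:
        snapshots[(day.year, day.month)].add(frozenset((a, b)))
    return dict(snapshots)

snaps = monthly_snapshots(interactions)
for period in sorted(snaps):
    print(period, len(snaps[period]), "edges")
# (2024, 1) 2 edges
# (2024, 2) 2 edges
```

Comparing snapshots shows which ties persist (ann and bob appear in both months) and which are new or gone, which a single aggregated graph would hide.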
When dealing with data represented as a graph, the process typically begins with a selection of initial nodes, known as seeds. From these seed nodes, the graph is traversed to collect data and analyze the structure. This process, known as network crawling, involves extending from the seed nodes based on the link structure to gather new information and update the network structure.
Crawling the network entails addressing challenges posed by dynamic social media platforms, such as restricted access, format changes, and structural errors like invalid links. As the crawler discovers new data, it stores this information in a repository for further analysis and continuously updates the network structure.
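The crawling process can be sketched as a breadth-first traversal from the seed nodes. This is a simplified illustration, not a production crawler: the in-memory LINKS table and the fetch_links function are hypothetical stand-ins for real network requests, and an invalid link is modeled as a missing key.

```python
from collections import deque

# Hypothetical link structure. "zed" is referenced but does not exist.
LINKS = {"a": ["b", "c"], "b": ["c", "zed"], "c": ["d"], "d": [], "e": ["a"]}

def fetch_links(node):
    """Stand-in for a real fetch; raises KeyError for an invalid link."""
    return LINKS[node]

def crawl(seeds, max_nodes=100):
    """Breadth-first crawl from seeds, storing results in a repository."""
    repository = {}          # node -> links found on that node
    queue = deque(seeds)
    seen = set(seeds)
    while queue and len(repository) < max_nodes:
        node = queue.popleft()
        try:
            links = fetch_links(node)
        except KeyError:
            continue         # skip invalid links instead of failing
        repository[node] = list(links)
        for nxt in links:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return repository

repo = crawl(["a"])
print(sorted(repo))  # ['a', 'b', 'c', 'd']
```

Node "e" is never collected because nothing reachable from the seed links to it, which shows how the choice of seeds shapes the sample. The max_nodes parameter is one simple way to bound the scope of collection.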
Social media platforms like Facebook, Twitter, and Technorati offer Application Programming Interfaces (APIs) that facilitate direct interaction with data sources. However, these APIs often impose limits on the number of transactions per day, based on the user's affiliation with the platform. In some cases, it is possible to collect data without APIs, but due to the vast amount of available data, it may be necessary to limit the crawler’s data collection scope. After data collection, postprocessing is essential to validate and clean the data. Traditional analysis methods, such as centrality measures and group structure analysis, can then be applied. Additionally, more advanced text and data mining techniques may be used to uncover deeper insights related to nodes and links.
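As an example of a traditional measure applied after collection, the sketch below computes degree centrality on a small undirected graph. It assumes the common normalization of dividing each node's degree by n - 1, where n is the number of nodes. The graph data is hypothetical.

```python
def degree_centrality(graph):
    """Degree of each node divided by (n - 1), for an undirected adjacency map."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}

# Hypothetical collected network: user -> set of connected users.
graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"cat"},
}

scores = degree_centrality(graph)
most_central = max(scores, key=scores.get)
print(most_central, scores[most_central])  # cat 1.0
```

Degree centrality is only one of several centrality measures. It counts direct connections and ignores a node's position in the wider network.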
To illustrate the application of data mining techniques in social media, consider two key areas: social media platforms and blogs. Both are rich data sources, and analyzing them can offer insights into social dynamics and market trends that benefit the scientific community and commercial organizations alike.
Social Media Platform: Illustrative Examples
Social media platforms such as Facebook and LinkedIn consist of interconnected users, each with unique profiles. Users can interact with friends and family and share a variety of content including news, photos, stories, videos, and favorite links. While users have the ability to customize their profiles based on personal preferences, certain common data fields might include relationship status, birthday, email address, and hometown. Users also have control over how much information they include in their profiles and can manage who has access to this data. However, the extensive amount of data available on these platforms has raised significant security concerns and has become a prominent societal issue.
The figure illustrates a hypothetical graph structure for a typical social media platform. Arrows in the diagram indicate connections that extend to other parts of the graph.
Protecting personal identity is crucial when handling data from social media platforms. Recent reports underscore the importance of privacy protection, as even anonymized data can expose individual identities when subjected to advanced data analysis techniques. Security settings on social media platforms can also limit the effectiveness of data mining applications, although malicious actors may use sophisticated methods to circumvent these measures.