Social Spam Detection with Event Identification and Spam Manipulator Extraction
✅ Paper Type: Free Essay | ✅ Subject: Information Systems |
✅ Wordcount: 3842 words | ✅ Published: 8th Feb 2020 |
According to a report, 62% of US adults currently get news and information from social networks [1]. With billions of connections among their users, social networks have become the dominant channel for news and public events. This is why Donald Trump publishes details of his policy on Twitter and not only through the White House spokesperson [2]. The proliferation of social networks built on activities and comments shared by public users has also turned them into serious vehicles for spreading fake news, running fraudulent advertising, propagating political rumors, biasing perceptions of products’ value, and even inducing democratic chaos. Examples such as fake news on Facebook [3], fraudulent reviews on Yelp [4], the distribution of ISIS propaganda [5], and the West Virginia chaos [6] illustrate the serious consequences caused by biased and spam-laden social media.
The “Internet water army” plays a central role in this ecosystem. One of its main characteristics is that a single real person can manipulate thousands of “zombie” accounts to carry out a plan. This has become a worldwide phenomenon, exemplified by companies in the USA, China, the UK, and Australia [7]. For instance, the Subvert and Profit website claims “25,000 users who earn money by viewing, voting, fanning, rating, or posting assigned task”, with payments ranging from $0.40 (e.g., Facebook) to $1.00 (e.g., Digg) [7].
Another characteristic of spam manipulators is that their activities normally target the most current topics and events. One month of social rumor tracking shows that a rumor lasts only days after it is published [8]. Moreover, spammers typically exploit the most current event to enlarge their influence. In the aftermath of the Boston explosions, a hacked official Twitter account of the Associated Press claimed that two new explosions had occurred in the White House and that the President had been injured. That rumor spread to millions of users and caused immediate, serious public panic and a crash of the stock market [8]. The close connection between spam manipulators and social events makes spam posts highly deceptive. Several earlier works have warned that the ability of social spam to evolve along with new event topics greatly increases the difficulty of detection [9, 10].
Problem statement. Prior research and many observations show that fast-changing social spam that evolves with popular events is a trend in current social networks. How to effectively detect, online, the spam manipulators entangled with social events is therefore an emerging problem in real-world settings.
Challenges. Several challenges make the online detection of social spam manipulators very difficult to implement. The first challenge is to detect event spam in chaotic online social data. Online social data are not clearly labeled; they normally mix heterogeneous topics and events. With offline pre-processing, social data can easily be separated by specific event or topic and then processed further for spam manipulator identification. Online social data, however, arrive entangled across multiple events and topics. Identifying social event spam in online social data is therefore difficult.
Another challenge is to detect spam manipulators efficiently while processing the least amount of data. Traditional spammer detection often works on the entire dataset, but this ignores the fact that real-world social data arrive online in huge volumes and are mixed across multiple social events or topics, so identifying a single spam account in such messy data is extremely difficult. For example, Twitter services handle more than 2.8 billion requests and store 4.5 petabytes of time series data every minute [17]. Filtering a spammer or spam post out of this amount of data with the original feature set has dim prospects; for example, normalizing social data into the format required by the feature set can be extremely time-consuming. Identifying social event spam from as little data as possible is therefore key in real-world scenarios.
Limitations. Although much prior work presents various methodologies for detecting social spam manipulators, these works mainly focus on offline data processing. For example, they carry out sophisticated feature engineering to extract a feature set and then achieve good spam detection performance on offline data. However, this approach has several limitations when dealing with online social spam data.
First, a static feature set derived from a specific data source is difficult to extend to data from new sources (e.g., new topics or different platforms). For example, user features work well on the Social Honeypot dataset, with an F1 score near 94%, but reach an F1 score of only 79% on the 1KS-10KN dataset; N-gram features achieve an F1 score above 82% on the 1KS-10KN dataset, but only somewhat above 70% on the Social Honeypot dataset [14]. Other work focuses on more complicated features, e.g., anomalous patterns of pronouns, conjunctions, and emotional words [15], or user credibility [16]. However, these features are too specific and are difficult to extend to new data, since values for them may not be extractable from the new data.
Second, online spam detection requires relatively short delays, but traditional feature engineering can take days or even months to complete, which makes it unsuitable for online social spam detection. Some work has explored real-time spam detection []. However, it focuses specifically on URL detection, which limits its use in real-world situations []. Most successful applications of online or real-time processing in social media are limited to topic or event recognition []. Online detection of social event spam therefore requires processing with some degree of automation, especially in place of labor-intensive feature engineering.
Figure 1: The system workflows
In this paper, we introduce Sifter, a system that unmasks online social event spam with minimal data processing. As Figure 1 shows, Sifter feeds historical event spam data to an RNN to automatically extract the common patterns of spam events and, at the same time, the features of spam manipulators. When processing online social data, we first use the common event spam patterns to filter out spam events, which greatly reduces the amount of data in the subsequent processing. We then apply the features of spam manipulators to the newly identified event spam to detect new spammers. By jointly using social event spam patterns and spam manipulator features, Sifter can identify newly emerging event spam and manipulators in social data with minimal data size and high efficiency.
2. Recurrent Neural Network and LSTM
The Recurrent Neural Network (RNN) is a rapidly emerging architecture derived from the traditional artificial neural network (ANN) [18]. What differentiates an RNN from other neural networks is that the connections between nodes in the hidden layer form a directed graph along a sequence. Numerous results show that RNNs are very good at predicting the next character in a text or the next word in a sequence, and that they can be used for more complex tasks [18]. Consider the “fill-in-the-blank” problem of choosing the word for “Tom leaves Washington in July, he is now in _”. A traditional ANN working on an N-gram (the n words before the blank, where n is typically 3 or 4) sees only “he is now in” and has no memory of the earlier clause, so its answer may include “Washington”. In contrast, an RNN can use its memory of earlier words (e.g., “leaves Washington”) to give a correct prediction that excludes “Washington”.
Figure 3 shows an example of an RNN that correctly predicts the vital information in a sentence after training on a set of word sequences, such as “arrive Washington in July” or “leave Washington in July”. In the input layer, each word is a high-dimensional vector connected to the hidden unit. The hidden unit normally applies a simple operation, e.g., a hyperbolic tangent nonlinearity, to both the vector from the input layer and the memory of the former output, and sends the result to the output and to the next hidden unit. The training rationale is straightforward: for any given training sequence, the expected result should have the maximal probability in the output layer. If the network yields a different result, the weights and vectors are updated by back-propagation through time [22]. Afterwards, the updated vectors on the input words are copied to the output layer. In this way, the RNN is trained to predict the probability of results using its memory of historical inputs.
Figure 3. A Recurrent Neural Network example for predicting the word ‘leave’ or ‘arrive’ before ‘Washington’. Training uses ‘leave Washington in July’ and ‘arrive Washington in July’ (only ‘leave’ is shown in the first unit). Each output Y(t) corresponds to one of the words and depends on both the current input X(t) and the previous output Y(t-1). After training, the RNN remembers the different values of the words ‘leave’ and ‘arrive’ with ‘Washington’. At test time, given the test words ‘Tom is not in’, the RNN predicts a word that cannot be “Washington”.
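The recurrence described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the weight matrices, the two-dimensional word vectors, and the function names `rnn_step` and `run_rnn` are hypothetical and are not taken from the system in this paper. It shows how the hidden state after reading ‘Washington’ differs depending on whether ‘leave’ or ‘arrive’ came first.

```python
import math

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    """One recurrent step: h_t = tanh(W_xh * x_t + W_hh * h_{t-1} + b_h)."""
    h = []
    for j in range(len(b_h)):
        s = b_h[j]
        s += sum(W_xh[j][i] * x[i] for i in range(len(x)))
        s += sum(W_hh[j][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(s))
    return h

def run_rnn(sequence, W_xh, W_hh, b_h):
    """Feed a sequence of word vectors through the RNN, starting from a zero state."""
    h = [0.0] * len(b_h)
    for x in sequence:
        h = rnn_step(x, h, W_xh, W_hh, b_h)
    return h

# Toy, untrained weights and word vectors (assumptions for illustration only).
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]
leave, arrive, washington = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

h_leave = run_rnn([leave, washington], W_xh, W_hh, b_h)
h_arrive = run_rnn([arrive, washington], W_xh, W_hh, b_h)
print(h_leave, h_arrive)  # the two states differ: the earlier word is remembered
```

Because the same word ‘Washington’ yields different hidden states in the two runs, a trained output layer on top of this state could, in principle, assign different probabilities to the next word.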
To overcome the short-term memory limitation of RNNs, long short-term memory (LSTM) networks use special hidden units whose natural behavior is to remember inputs for a long time [18]. A special unit called the memory cell acts like an accumulator or a gate: it has a connection to itself at the next time step with a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory [18]. As Figure 4 shows, the LSTM uses gates (i.e., the forget gate, input gate, and output gate) to control which part of the former memory is used in the current computation and to decide which part of the output is sent to the next computation.
Figure 4. An LSTM example and its architecture. σ is the logistic sigmoid function. The input gate i_t determines the degree to which new memory is added to the memory cell. The forget gate f_t decides the extent to which the existing memory is forgotten. The memory c_t is updated by forgetting part of the existing memory and adding new memory. The output gate o_t determines the amount of memory that is output. Y(t) is the output at time slot t.
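The gate behavior in the caption can be made concrete with a scalar LSTM step. The sketch below follows the standard LSTM formulation, which is an assumption here since the paper does not list its exact equations; the parameter names (`w_*`, `u_*`, `b_*`) and the two example settings `KEEP` and `CLEAR` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p maps parameter names to floats."""
    i = sigmoid(p["w_i"] * x + p["u_i"] * h_prev + p["b_i"])    # input gate i_t
    f = sigmoid(p["w_f"] * x + p["u_f"] * h_prev + p["b_f"])    # forget gate f_t
    o = sigmoid(p["w_o"] * x + p["u_o"] * h_prev + p["b_o"])    # output gate o_t
    g = math.tanh(p["w_c"] * x + p["u_c"] * h_prev + p["b_c"])  # candidate memory
    c = f * c_prev + i * g   # forget part of the old memory, add new memory
    h = o * math.tanh(c)     # expose part of the memory as the output
    return h, c

# Hypothetical settings: the forget gate is held open (KEEP) or shut (CLEAR),
# and the input gate is held almost closed in both cases.
KEEP = dict(w_i=0.0, u_i=0.0, b_i=-10.0, w_f=0.0, u_f=0.0, b_f=10.0,
            w_o=0.0, u_o=0.0, b_o=0.0, w_c=0.0, u_c=0.0, b_c=0.0)
CLEAR = dict(KEEP, b_f=-10.0)

_, c_keep = lstm_step(1.0, 0.0, 1.0, KEEP)    # memory stays close to 1.0
_, c_clear = lstm_step(1.0, 0.0, 1.0, CLEAR)  # memory is wiped to about 0.0
print(c_keep, c_clear)
```

With the forget gate near one, the cell copies its own state across time steps; with the forget gate near zero, the stored content is cleared, which is the behavior the memory cell description refers to.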
In this study, inspired by the success of RNNs in time-sequential applications, e.g., speech recognition [19] and fake news detection [10], we propose to use an RNN to process social spam posts. An RNN is well suited to social spam detection for two reasons. First, social network data are based on time sequences, i.e., posts are sequential in nature. Second, the training data are of variable length, i.e., the number of posts varies across time slots, and an RNN can conveniently handle variable-length input. An RNN can thus exploit the time sequence of posts held in its memory to achieve good performance. In addition, neural networks avoid labor-intensive feature engineering, since they can automatically extract suitable features during training.
3. System Details
In this section, we introduce the Sifter system, discuss each of its functional components, and outline its workflow.
3.1 Sifter Overview
Sifter is a distributed system consisting of multiple nodes. Each node is a basic functional cell responsible for a specific task during processing.
Sifter Node: Each Sifter node is assigned a unique 128-bit nodeId in a circular nodeId space ranging from 0 to 2^128 - 1. NodeIds are uniformly distributed. Given a message and a key, the message is guaranteed to be routed to the node whose nodeId is numerically closest to that key within log_{2^b} N steps, where N is the number of nodes and b is a configuration parameter with a typical value of 4. In addition, each node maintains a topic table, which stores the topics and events collected from the local social media data sources. For example, if node 54ad22 gathers social data mainly from the five topics ‘MeToo’, ‘LaHaya’, ‘EXO’, ‘FelizLunes’, and ‘Emmy Award’, it saves these five topic instances in its topic table. The topic table is used to create Sifter groups and for topic-based spam identification in the subsequent processing.
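The two routing properties stated above, delivery to the numerically closest nodeId and a bound of log_{2^b} N steps, can be illustrated with a short sketch. It does not implement Pastry's routing tables; it only computes the destination and the step bound. The function names are hypothetical, and treating "closest" as distance with wraparound in the circular space is an assumption.

```python
ID_BITS = 128
ID_SPACE = 1 << ID_BITS  # nodeIds live in 0 .. 2^128 - 1

def circular_distance(a, b):
    """Distance between two ids in the circular id space (assumed wraparound)."""
    d = abs(a - b) % ID_SPACE
    return min(d, ID_SPACE - d)

def closest_node(key, node_ids):
    """The node a message with this key should end up at; ties go to the smaller id."""
    return min(node_ids, key=lambda n: (circular_distance(n, key), n))

def max_routing_steps(num_nodes, b=4):
    """Smallest integer s with (2^b)^s >= num_nodes, i.e. ceil(log_{2^b} N)."""
    steps, reach = 0, 1
    while reach < num_nodes:
        reach *= 2 ** b
        steps += 1
    return steps

nodes = [10, 2 ** 127, ID_SPACE - 5]
print(closest_node(2, nodes) == ID_SPACE - 5)  # True: distance 7 beats distance 8
print(max_routing_steps(1000))                 # 3
print(max_routing_steps(65536))                # 4
```

The step bound is computed with integer arithmetic to avoid floating-point rounding at exact powers of 2^b.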
Sifter Group: To detect social spam events, Sifter supports multiple groups in the system, each of which targets one specific topic or event from real-world social communities. At the beginning, Sifter allows nodes to create topic- or event-based groups from their topic tables. Group management is handled by Scribe, an application-level group communication system built upon Pastry [20, 21]. Sifter uses a pseudorandom Pastry key, called the groupId, to name a group. Usually, the groupId is the hash of the topic or event’s textual name concatenated with its creator’s name. Multiple nodes join a group and maintain a spanning tree via Scribe. Sifter requires that only nodes that collect data for a specific topic can join the corresponding topic-based group. For example, as Figure 5 shows, node ea2df identifies that it receives social data on the topic ‘Emmy Award’. It then routes a JOIN message towards the group with the groupId ‘Emmy Award’. The message continues to be routed until it reaches node d25ac in that group.
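The groupId construction described above can be sketched as follows. The choice of SHA-1 and the truncation of the digest to 128 bits (to match the nodeId space) are assumptions made for illustration; the paper only states that the groupId is a hash of the topic name concatenated with the creator's name. The function name `group_id` is hypothetical.

```python
import hashlib

ID_BITS = 128

def group_id(topic, creator):
    """Hash 'topic + creator' into a 128-bit key in the nodeId space.

    Assumption: SHA-1 digest, truncated to its first 16 bytes.
    """
    digest = hashlib.sha1((topic + creator).encode("utf-8")).digest()
    return int.from_bytes(digest[: ID_BITS // 8], "big")

gid = group_id("Emmy Award", "ea2df")
print(format(gid, "032x"))  # 32 hex digits, usable as a routing key
```

Since the key is derived deterministically, any node that knows the topic name and creator can compute the same groupId and route a JOIN message towards it without a central directory.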
Each Sifter group is responsible for the data of one social event. For example, nodes in the group ‘Carnival’ deal with the social data under this topic (e.g., posts with the ‘Carnival’ hashtag). The purpose of the event-based group is to achieve good spam detection performance with minimal related data, since detecting event spam in messy data from multiple mixed topics is inefficient. By isolating topics, Sifter allows each group to focus on its own topic and to mine the inner connections of social event spam. An event-based group is also useful for filtering out spam manipulators in the next step: mining a spam manipulator from a wide range of social communities is very difficult, whereas identifying a particular account within the limited scope of a group is much easier.
Figure 5. Group management in Sifter.
Group Root: The root node uses historical social spam event data to extract the common patterns of spam events and to identify the characteristics of spam manipulators. It multicasts these patterns and characteristics to all other nodes in the group, so that each node has a global view of spam event patterns and spam manipulator characteristics. The root also orchestrates the whole workflow of parent nodes and leaf nodes by multicasting commands to them. For example, the root node decides when leaf nodes exchange their features, maintains model consistency across the whole system, and makes the final decisions on spam events.
In a Sifter group, the group root is responsible for the consistency of the RNN models. It communicates with all other group members to acquire the features of their RNN models, and then determines the final optimal model by evaluating all of these features. Finally, the group root sends the optimal model features to all other members and lets them update their models. With consistent models, each group can effectively handle the social data of its specific event.
Leaf nodes: As shown in Figure 5, multiple leaf nodes in Sifter collect social data from various local data servers through social network APIs. Each leaf node promptly updates its topic table based on the currently popular topics on its local data server. Since a topic typically lasts for a few days, topic updates are not very frequent. In addition, we set a threshold on the topic update frequency to prevent a node from joining or leaving a group too often when updates are frequent. Each leaf node also maintains its own Recurrent Neural Network, which it uses to process social data, identify spam events, and filter out spam manipulators. Together with the topic table, the RNN lets a leaf node process topic-based social data effectively.
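One way to picture the update threshold is as a minimum interval between membership changes. The sketch below is a guess at such a mechanism, not the paper's implementation: the class name `TopicTable`, the `min_interval_s` parameter, and the policy of ignoring updates that arrive too soon are all assumptions.

```python
class TopicTable:
    """Topic table that limits how often group memberships may change."""

    def __init__(self, min_interval_s):
        self.min_interval_s = min_interval_s  # hypothetical threshold, in seconds
        self.topics = set()
        self.last_change = None

    def update(self, trending, now):
        """Return (groups_to_join, groups_to_leave) for this update."""
        trending = set(trending)
        if trending == self.topics:
            return set(), set()
        too_soon = (self.last_change is not None
                    and now - self.last_change < self.min_interval_s)
        if too_soon:
            return set(), set()  # keep current memberships, skip this update
        joined = trending - self.topics
        left = self.topics - trending
        self.topics = trending
        self.last_change = now
        return joined, left

table = TopicTable(min_interval_s=3600)
first = table.update({"MeToo", "EXO"}, now=0)   # joins both groups
second = table.update({"MeToo"}, now=600)       # ignored: only 10 minutes later
third = table.update({"MeToo"}, now=4000)       # applied: leaves 'EXO'
print(first, second, third)
```

Under this policy a node never issues JOIN or LEAVE messages more than once per interval, at the cost of reacting late to short-lived topics.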
RNN in leaf node. Sifter equips each leaf node with a separate RNN for social data processing. To achieve good performance and to detect social spam in long logs, we implement an LSTM in each leaf node. Because treating each timestamp as a separate input to a cell would be extremely inefficient and would reduce utility [9], we partition the data into segments, each of which is the input to one cell. We apply a natural partitioning by changing the temporal granularity from seconds to hours. In each interval, we use the tf*idf values of the vocabulary terms as input. We prune the vocabulary by keeping the top-K terms according to their tf*idf values, so the input dimension is K.
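The input construction described above can be sketched in plain Python. Several details are assumptions, since the text does not fix them: each hourly interval is treated as one document, idf is log(N / df) over intervals, terms are ranked by their maximum tf*idf across intervals, and tokenization is a simple lowercase whitespace split. The function name `hourly_tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter, defaultdict

def hourly_tfidf_vectors(posts, k):
    """posts: iterable of (unix_seconds, text). Returns (vocab, one K-dim vector per hour)."""
    buckets = defaultdict(Counter)
    for ts, text in posts:
        buckets[ts // 3600].update(text.lower().split())  # seconds -> hourly interval
    hours = sorted(buckets)
    n = len(hours)
    df = Counter()
    for h in hours:
        df.update(buckets[h].keys())  # number of intervals containing each term
    tfidf = {h: {t: c * math.log(n / df[t]) for t, c in buckets[h].items()}
             for h in hours}
    best = Counter()
    for h in hours:
        for t, v in tfidf[h].items():
            best[t] = max(best[t], v)
    # Keep the top-K terms; ties are broken alphabetically for determinism.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [t for t, _ in ranked[:k]]
    vectors = [[tfidf[h].get(t, 0.0) for t in vocab] for h in hours]
    return vocab, vectors

posts = [(0, "free prize click"), (100, "free prize now"),
         (3700, "emmy award tonight"), (7300, "emmy winners free")]
vocab, vectors = hourly_tfidf_vectors(posts, k=3)
print(vocab)         # ['prize', 'award', 'click']
print(len(vectors))  # 3 hourly intervals, each a vector of length K = 3
```

Each resulting vector would then be fed to one LSTM cell, so the sequence length equals the number of hourly intervals and not the number of posts.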
[1] Alessandro Bessi and Emilio Ferrara. “Social bots distort the 2016 US Presidential election online discussion”. In: (2016).
[2] Former staffer reveals how Trump fell in love with Twitter. https://www.businessinsider.com/trump-loves-twitter-mentions-print-out-tweets-2018-8
[3] Allcott, Hunt, and Matthew Gentzkow. “Social media and fake news in the 2016 election.” Journal of Economic Perspectives 31.2 (2017): 211-36.
[4] Luca, Michael, and Georgios Zervas. “Fake it till you make it: Reputation, competition, and Yelp review fraud.” Management Science 62.12 (2016): 3412-3427.
[5] Shane, Scott, and Ben Hubbard. “ISIS displaying a deft command of varied media.” New York Times 30 (2014).
[6] So Much Trump Chaos. https://www.nytimes.com/2018/03/02/opinion/trump-gun-control-nra.html
[7] Internet Water Army. https://en.wikipedia.org/wiki/Internet_Water_Army
[8] Zhao, Zhe, Paul Resnick, and Qiaozhu Mei. “Enquiring minds: Early detection of rumors in social media from enquiry posts.” In Proceedings of the 24th International Conference on World Wide Web, pp. 1395-1405. International World Wide Web Conferences Steering Committee, 2015.
[9] Ruchansky, Natali, Sungyong Seo, and Yan Liu. “Csi: A hybrid deep model for fake news detection.” In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 797-806. ACM, 2017.
[10] Ma, Jing, et al. “Detecting Rumors from Microblogs with Recurrent Neural Networks.” IJCAI. 2016.
[11] Chakraborty, Manajit, Sukomal Pal, Rahul Pramanik, and C. Ravindranath Chowdary. “Recent developments in social spam detection and combating techniques: A survey.” Information Processing & Management 52, no. 6 (2016)
[12] Cao, Cheng, and James Caverlee. “Detecting spam urls in social media via behavioral analysis.” In European Conference on Information Retrieval, pp. 703-714. Springer, Cham, 2015.
[13] Wu, Fangzhao, Jinyun Shu, Yongfeng Huang, and Zhigang Yuan. “Co-detecting social spammers and spam messages in microblogging via exploiting social contexts.” Neurocomputing 201 (2016): 51-65.
[14] Wang, Bo, Arkaitz Zubiaga, Maria Liakata, and Rob Procter. “Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter.” In Workshop on Making Sense of Microposts, vol. 1395, pp. 10-16. 2015.
[15] Zhang, Qiang, et al. “Spam comments detection with self-extensible dictionary and text-based features.” Computers and Communications (ISCC), 2017 IEEE Symposium on. IEEE, 2017.
[16] Viviani, Marco, and Gabriella Pasi. “Credibility in social media: opinions, news, and health information—a survey.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 7.5 (2017): e1209.
[17] Twitter – Statistics & Facts. https://www.statista.com/topics/737/twitter/
[18] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. “Deep learning.” Nature 521, no. 7553 (2015): 436.
[19] Miao, Yajie, Mohammad Gowayyed, and Florian Metze. “EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding.” Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on. IEEE, 2015.