<p><em>Digital Citizens: a New Media MA independent research blog concerning citizen media and democracy in the digital age, by Karl Sims.</em></p>
<h1>Finishing summary</h1>
<p><em>2018-05-08</em></p>
<p>This final article concludes by summarising the project and discussing some of the overarching issues and processes that have concerned it along the way.</p>
<p>Over the course of the project, aims and ideas have naturally evolved. Reflecting on the proposed research question, some evaluations of the project’s trajectory can be made. As proposed, the question reads as follows.</p>
<blockquote>
<p>Democratic Citizenship Amongst Automated Agents and Environments; What Challenges and Opportunities Exist for Citizens?</p>
</blockquote>
<p>This is most accurately described as a guide to the kind of project pursued: understandably broad and a little vague. As the project progressed, a key concentration became the Facebook and Cambridge Analytica scandal, perceived as an interesting, still-evolving piece of current affairs that addressed related themes, specifically issues of personalised media content, data privacy and their place within elections.</p>
<p>Considering this original question, the opportunities seem to lie in the hands of campaigners rather than citizens (or at least they did). Citizens appear to face a greater mass of challenges in this environment: in particular, deciphering political messages with more effective persuasive qualities may put them in a position where common truths are less available. How this issue evolves in the future will present a more detailed picture.</p>
<p>The near-avoidance here of most deliberative theory, theory that is often central and constitutive to discussions of the public sphere, should be noted. Perhaps Habermasian notions of the public sphere are not reflected online, or that ideal is not my ideal. Major social media platforms have been considered here because they are the sites that retain the most users and thus the most citizens. If there is power in the internet for bringing about democratic change, it doesn’t seem to operate through long, considered, rational public deliberation.</p>
<p><strong>Conclusion</strong></p>
<p>At times, it seems research priorities were found more in the exploration of methods than in the topics they were supposedly studying. Looking at the project overall, attention has been given to the development of skills, particularly those related to natural language processing. This has been formative and has fed back into more issue-centred investigations. Adopting the structure of a blog over a more traditional academic format has provided opportunities for experimentation and has been a rewarding experience overall. Any good practices built here will be taken forward into future research projects.</p>
<h1>Content analysis of YouTube comments on Zuckerberg’s congressional hearing</h1>
<p><em>2018-05-06</em></p>
<p>Noting this project’s complete reliance so far on computational methods, this post details steps taken in working with more traditional forms of content analysis. Staying with the theme of the Facebook and Cambridge Analytica scandal, it looks at comments made on YouTube in response to Mark Zuckerberg’s live hearing with Congress.</p>
<h2 id="approach">Approach</h2>
<p><strong>Pretext</strong></p>
<p>Having learnt computer programming before more traditional content analysis, I hadn’t taken much of an interest in the latter. Systematically processing media in accordance with a specific set of instructions sounds a lot like the work of machines; moreover, the interpretation of such ‘code books’ is a lot more subjective than a computer executing source code. However, after recently reading several essays on the subject, an interest was sparked in utilising it in a manner that combines the capacities of human and machine labour effectively. Zamith and Lewis (2015) provided comparisons between algorithmic and human approaches to coding data, which informed strategies here. The work of Graham (2008) was used as a starting point for an investigation of civic discussions online, branching off into its citations for further reference.</p>
<p><strong>Investigation</strong></p>
<p>Sticking with the theme of the Facebook and Cambridge Analytica scandal, it was decided that an analysis of YouTube comment threads from videos broadcasting the US congressional hearing of Mark Zuckerberg would be performed. These were collected using YouTube’s Data API, equating to roughly <code class="highlighter-rouge">2000</code> threads across 2 videos (<code class="highlighter-rouge">hJdxOqnCNp8</code>, <code class="highlighter-rouge">6ValJMOpt7s</code>), including any replies within each thread. These sought to provide evidence mapping discussions within the topic. Though YouTube comments can be just as inflammatory as Twitter posts, their visibility poses more opportunities for responses from other users over a more sustained period. With fewer restrictions on content length, sustained and lengthy arguments are possible through this medium. As threads contain responses from a variety of different actors, interactions can be graphed effectively. This was to be considered in a developing framework.</p>
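<p>The collection scripts actually used are linked under supporting resources below. Purely as an illustration, a minimal sketch of this kind of collection against the YouTube Data API v3 <code class="highlighter-rouge">commentThreads</code> endpoint might look as follows (the helper names and the flattened row format are my own, not the project’s):</p>

```python
import urllib.parse

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"

def comment_threads_url(video_id, api_key, page_token=None):
    """Build a request URL for the YouTube Data API v3 commentThreads endpoint."""
    params = {
        "part": "snippet,replies",  # top-level comments plus bundled replies
        "videoId": video_id,
        "maxResults": 100,          # API maximum per page
        "key": api_key,
    }
    if page_token:                  # follow `nextPageToken` to paginate
        params["pageToken"] = page_token
    return API_URL + "?" + urllib.parse.urlencode(params)

def flatten_threads(response):
    """Flatten one API response page into (thread_id, author, text) rows,
    including any replies bundled with each top-level comment."""
    rows = []
    for item in response.get("items", []):
        top = item["snippet"]["topLevelComment"]["snippet"]
        rows.append((item["id"], top["authorDisplayName"], top["textOriginal"]))
        for reply in item.get("replies", {}).get("comments", []):
            snip = reply["snippet"]
            rows.append((item["id"], snip["authorDisplayName"], snip["textOriginal"]))
    return rows
```

<p>Note that the <code class="highlighter-rouge">replies</code> field only bundles a subset of replies per thread; a fuller collection would also page through the <code class="highlighter-rouge">comments</code> endpoint per thread.</p>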
<h2 id="abandonment">Abandonment</h2>
<p>Ultimately, to make this process worthwhile, a longer and more in-depth period of reflection was needed prior to analysing the content. Developing a framework to apply to the data was obviously the most time-consuming part. The general strategy was to build an interface that optimised analysis by automating things like conditional logic (e.g. discussions of data regulation -&gt; political talk) as well as any tasks that can be done by machines (e.g. constructing conversation graphs). In this process, there was a tendency to design the interface server in the most generalisable way possible, which was another thing taking up too much time.</p>
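<p>As a sketch of the kind of conditional logic mentioned above, with entirely hypothetical category names, automatically expanding a coder’s manual codes with their implied codes could be as simple as:</p>

```python
# Hypothetical coding rules: assigning one code implies further codes,
# mirroring conditional logic like "data regulation -> political talk".
IMPLIES = {
    "data_regulation": {"political_talk"},
    "election_interference": {"political_talk"},
    "political_talk": {"civic_content"},
}

def expand_codes(assigned):
    """Expand a set of manually assigned codes with every implied code,
    following implication chains transitively."""
    result = set(assigned)
    queue = list(assigned)
    while queue:
        code = queue.pop()
        for implied in IMPLIES.get(code, ()):
            if implied not in result:
                result.add(implied)
                queue.append(implied)
    return result
```

<p>Here coding “data_regulation” automatically yields “political_talk” and, through it, “civic_content”, saving the human coder repeated decisions the rule book already determines.</p>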
<p><img src="/digitalcitizens/assets/imgs/content-analysis-of-youtube-comments/interface.jpg" /></p>
<figcaption>Screenshot of interface currently developed</figcaption>
<h2 id="conclusion">Conclusion</h2>
<p>This process wasn’t essential to the investigation of the Facebook and Cambridge Analytica scandal; instead it was something of interest. It sought to address concerns that the research tools being used were having too formative an effect on the types of investigations being made, i.e. only analyses that machines can perform well. Though not complete, the work done provides a starting point that may be used in another project.</p>
<h3 id="references">References</h3>
<ul>
<li>Graham, T. 2008. Needle in a Haystack. <em>Javnost – The Public</em>. <strong>15</strong>(2). pp.17-36.</li>
<li>Zamith, R. and Lewis, S. 2015. Content Analysis and the Algorithmic Coder: What Computational Social Science Means for Traditional Modes of Media Analysis. <em>The ANNALS of the American Academy of Political and Social Science</em>. [Online]. <strong>659</strong>(1). pp.307-318. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/0002716215570576#articleCitationDownloadContainer">http://journals.sagepub.com/doi/abs/10.1177/0002716215570576#articleCitationDownloadContainer</a></li>
</ul>
<h2 id="supporting-resources">Supporting resources</h2>
<ul>
<li><a href="https://github.com/winstonjay/digitalcitizens/tree/master/api_tools/youtube">digitalcitizens/api_tools/youtube at master · winstonjay/digitalcitizens · GitHub</a></li>
<li><a href="https://github.com/winstonjay/digitalcitizens/tree/master/content_analyser">digitalcitizens/content_analyser at master · winstonjay/digitalcitizens · GitHub</a></li>
</ul>
<h1>Data, politics and democracy part 4: Reflections</h1>
<p><em>2018-05-01</em></p>
<p>Investigations here have predominantly focused on what has been shared, not how, or by which individuals specifically. This translated into analysis of both the initial news coverage from the Guardian and the posts made via Twitter. Now this event will be discussed in relation to theories on digital privacy, also reviewing the regulatory conditions for Facebook and the underlying logic of web 2.0. It will finish by assessing some normative implications for democracies and the roles different forms of media play in addressing such issues.</p>
<h2 id="privacy-paradox">Privacy Paradox?</h2>
<p>It’s clear from both the news coverage and the data collected here that there was an apparent breach of trust in how data was shared, and that through strategies such as <code class="highlighter-rouge">#deleteFacebook</code>, users sought to express this. What is less clear is how much of an effect this has had on long-term privacy attitudes, and whether this incident will lead to any meaningful actions by either Facebook users or the site itself.</p>
<p><strong>Privacy attitudes</strong></p>
<p>What people say about their privacy and what they actually do often don’t add up. The <em>privacy paradox</em> describes the disconnect between people’s willingness to disclose personal information online and the levels of concern they express (Young, A. and Quan-Haase, A. 2013. p.479). Generally, people say they value things like privacy, freedom, and security, but despite this, there are many situations in which they are willing to waive certain rights in varying forms of exchange. These rights are at times negotiated, at others eroded by changes in societal norms, or given up through more tacit personal omissions.</p>
<p>A longitudinal study examining privacy attitudes and self-disclosure patterns of Facebook users over the last 5 years found that although ‘heavy users’ had seen a marked increase in concern, the opinions of ‘light users’ remained approximately the same (Tsay-Vogel, M., Shanahan, J. and Signorielli, N. 2018). Not only that, but increases in concern seem to be plateauing for heavy users. The authors argue that this supports the hypothesis that the ongoing exposure and habitual nature of social media use has affected perceptions not only of what levels of self-disclosure are normal, but of what is expected.</p>
<p><strong>Sharing in changing contexts</strong></p>
<p>Despite this, studies have also shown that individuals do take active steps to negotiate their privacy within the constraints of Facebook’s available settings (Marwick, A. and Boyd, D. 2014; Young, A. and Quan-Haase, A. 2013). As a prerequisite, Facebook requires users to share at least some information in order to connect with other users. Helen Nissenbaum describes the importance of the context in which information is disclosed: essentially, that privacy strategies should be coherent with, and informed by, the conditions in which data was intended to be used (2010). As discussed in part 2 of this series, Facebook’s default settings dictate that most posted information is shared between ‘friends’, which to a large extent sets the tone of social exchange. Through tacit knowledge it is understood that this information is correspondingly used by Facebook as it sees fit: personalising content, selling ads, and so on. It is in this context that privacy is negotiated, not only between general end users but equally with developers, advertisers and the company itself. Given this, it is not surprising that there are disjunctions in perceived acceptable standards.</p>
<p>To place this issue fully, privacy considerations need to take in the scope and topology of how data flows. The idea of <em>networked privacy</em> recognises that information is often intertwined within relationships between users, making it difficult for individuals to fully negotiate how information may be shared (Marwick, A. and Boyd, D. 2014). This was explicit within this scandal, as the original dataset collected by Aleksandr Kogan relied on the previously relaxed conditions that enabled Facebook users to share the personal data of all their friends by taking an online survey. Naturally, one could not argue that any of them could have predicted where this data would end up, especially its use with technology that didn’t then exist. Marwick and Boyd argue that these changing and co-constructed contexts framed by networked privacy can collapse the rules of contextual integrity that Nissenbaum describes (2014. pp.1063-1064). As context is interpretable in several ways, initial consent loses its meaning once data has been exchanged multiple times.</p>
<p><strong>Established concerns</strong></p>
<p>Establishing privacy online as a paradox doesn’t explain the contradictions it claims. As has been discussed, individuals take active steps to mitigate data collection, and though increases in levels of concern are stagnating, apprehension very much exists. Positioning the Cambridge Analytica scandal against the background of stories on privacy over the last 8 years (part 3: fig 5), the idea that various institutions are gathering large amounts of data on them is not alien to citizens. While the Guardian typically addresses a specific kind of left-leaning reader, these stories have often led to widespread coverage across differing media. Some people may not be concerned with their privacy; overall, however, the picture is of sustained and growing attention to these issues across social and broadcast media.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-4/privacy_opinions.png" /></p>
<figcaption>xkcd: Privacy Opinions. from: <a href="https://xkcd.com/1269/">https://xkcd.com/1269/</a></figcaption>
<h2 id="the-regulation-of-and-by-facebook">The regulation of and by Facebook</h2>
<p>Tarleton Gillespie has recently considered the <em>‘regulation of and by platforms’</em> (2018), in particular how user-generated content is managed. With reference to Section 230 of the US Communications Decency Act (1996), an argument is made that current content regulation, whilst protecting social media providers, was in fact designed for Internet service providers and search engines (pp.255-260). Stating the impossibility of platform impartiality, an emphasis is placed on the importance of platforms’ own governance and curation of online spaces (p.262). Regulations for tech companies are outdated, ill-suited and sometimes non-existent with respect to contemporary concerns. Be it the beta testing of self-driving cars on public roads, deciding how data is collected and sold, or the negotiation of speech limitations within platforms, the underlying theme expressed by law, particularly in the US, is that tech companies can simply regulate themselves and innovation should not be stifled. The issue raised by this scandal concerns the regulation of both data and content produced for political contexts, along with their underlying strategies.</p>
<p><strong>Political regulation</strong></p>
<p>With the implementation of the EU’s General Data Protection Regulation (GDPR), for European users at least, Facebook will supposedly have to uphold the new standards presented. These include attempts to make terms and conditions more transparent, the right to be forgotten, rights for users to access data concerning themselves, and several other safeguards for citizens [<a href="https://ico.org.uk/for-organisations/guide-to-the-general-data-protection-regulation-gdpr/">ref</a>]. Though Facebook claims to be on board with the regulation, <a href="https://www.reuters.com/article/us-facebook-privacy-eu-exclusive/exclusive-facebook-to-put-1-5-billion-users-out-of-reach-of-new-eu-privacy-law-idUSKBN1HQ00P">reports</a> claim the company is shifting the data of over a billion users into more liberal regulatory environments, obvious counter-evidence to its claims.</p>
<p><strong>Self-regulation</strong></p>
<p>Even without intervention by political institutions, Facebook has strong incentives to adapt its policies to meet user preferences. As Gillespie also notes, even with their level of monopolisation, companies like Facebook still don’t want to lose large numbers of users to competitors (2018, p.262). To do this it needs to keep users happy. The trouble is, it is not just the end users that it needs to satisfy.</p>
<p>In analysing Facebook’s attitude to regulation, it is important to note the multitude of actors it aims to satisfy. Bucher and Helmond (2018) consider the affordances of social media platforms in relation to their different types of consumers, including advertisers, investors, end users, developers, and the platforms themselves. Looking at a variety of definitions of affordance, they emphasise the inherent reciprocity involved: not just what technology affords users but equally what users afford technology. On Facebook, an obvious example would be user actions providing data points for the site’s personalisation algorithms. In trying to satisfy the combination of its ethos, end users, developers, advertisers and investors, conflicts have naturally arisen. Facebook may want users to feel their data is shared minimally, whilst assuring advertisers that maximum access is granted. Users themselves may also be conflicted in their relationship to data practices; they might not want to share their data yet prefer more personalised content.</p>
<p><strong>Democratic Governance</strong></p>
<p>Though Facebook may have enjoyed praise for its apparent democratising potential (e.g., superficially, the Arab Spring), the weight of claims that it facilitates fake news and political polarisation, this scandal, and its apparent failure to take meaningful responsibility have led to discontent among users, politicians and investors alike. Facebook is a place where a combination of public, private and corporate interests compete for the social and connective assets that the site commands (Van Dijck, J. 2012). The problems of this online space reflect the issues faced by democracies in general. Could democratic input from users produce an environment of greater accountability and increased satisfaction? Well, it turns out that in 2009 the Mark Zuckerberg of a 200-million-user Facebook thought so…</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/2GuHVZx4OwU" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen=""></iframe>
<p>Reportedly, the experiment didn’t work out, though it is not clear it can be said they really tried. The company cites low turnout, but <a href="https://www.theverge.com/2018/4/5/17176834/mark-zuckerberg-facebook-democracy-governance-vote-failure">critics have argued</a> that users were only given a limited choice between two slightly different terms of service, not a say in how the company was structurally run.</p>
<h2 id="standards-for-democracy">Standards for democracy</h2>
<p>A key argument of this scandal, stated even by the whistleblower themselves, was that democracy had in some way been undermined. This idea will now be examined briefly.</p>
<p>Using Jesper Strömbäck’s four models of democracy, <em>procedural</em>, <em>competitive</em>, <em>participatory</em> and <em>deliberative</em> (2005), as reference standards with normative implications, we can position the levels at which democracy might be being disrupted. Arguably, the discussion caused by the Cambridge Analytica scandal concerns the least publicly involved end of the spectrum (the order of perceived public involvement running from procedural to deliberative). Considerations here will predominantly concern democracy as styled within the competitive standard.</p>
<p><strong>Competitive democracy</strong></p>
<p>The standard of competitive democracy holds that there be proper competition and choice between political elites, enabling thorough scrutiny by an informed electorate. In this sense, <em>‘it is the political elites that act, whereas the citizens react’</em> (2005, p.334). Citizens therefore select which of the political elites they think will give them the best product, <em>’as in a marketplace of goods’</em> (2005, p.334). Strömbäck goes on to state that for this to be so, fact must be distinguishable from fiction, along with the purposes of different kinds of media content (2005, p.334).</p>
<p><strong>Personalised campaigning</strong></p>
<p>Key arguments surrounding the personalisation of media can be found in the works of writers like Eli Pariser (2011) and Cass Sunstein (2001). One shared theme is the erosion of a shared reality, replaced instead by polarised, homogenised spaces (filter bubbles / echo chambers). Helen Nissenbaum posits that, considering individualised voter targeting, an argument for could be maintaining the freedom of competition within political campaigns, while an argument against is that personalisation would distort the decision-making processes of voters (2010). At Mark Zuckerberg’s congressional hearing, US Senator Chuck Grassley made a point of stressing that campaigns from both sides of the aisle have progressively made use of the latest technologies to achieve the upper hand. This is true in other countries also.
Most election campaigns require candidates to debate publicly on a common set of issues. Short of the personalisation of all media, though the background of these issues may be distorted, there remains at least some shared reality through which candidates’ campaigns can gain momentum amongst undecided voters. Though probably not ideal, a case could be made that this falls within the standard of competitive democracy. That is, of course, only if truth is still upheld, something that seems tenuous here, posing more involved epistemic questions.</p>
<p><strong>Roles of different media</strong></p>
<p>Between social and broadcast media, there appear to be differences in the types of issues that can be effectively communicated. Social media seem effective at communicating shared experiences less reliant on concrete facts. Contrastingly, the inherent structure and elevated broadcasting ability of traditional media render them more adept at co-ordinating and presenting more complex stories. An implication for journalism within the idea of competitive democracy is that it can act as a watchdog and hold the political elites to account (Strömbäck, J. 2005. p.341).</p>
<p>In the example given by this issue, The Guardian/Observer could piece together and co-ordinate actors in a considered manner. Responses made on Twitter with <code class="highlighter-rouge">#deleteFacebook</code> established communication channels for users to share sentiments, which then extended and added to the initial story. We could ask: without the social media response, would there have been as much pressure on the US Congress or Facebook to respond? Perhaps other actors would have pursued the issue; it’s hard to tell. The question here is whether the current media ecosphere supports the needs and desires of citizens, and whether actors have been held to account. Do scandals indicate functionality beyond the exposed dysfunction? The fact there are scandals often means there are attempts to address their issues. Apologies have been made and steps laid down to address the issue, but Facebook has so far managed to get away without fines or punishment.</p>
<h2 id="concluding-discussion">Concluding discussion</h2>
<p>In some respects, it’s unconvincing that Facebook’s services were used in an unintended manner. Facebook makes money by providing tools to communicate with curated audiences; this is what Cambridge Analytica was doing. The only differences were that 1) overtly political content was used in targeting instead of commercial content and 2) the data used originated from a time of different internal policies. Facebook has recognised this was unsavoury and is offering ways to address the issue, though there is bound to be more data like this out there, and similar situations may happen again. The tone has been set for what users currently think is acceptable.
The control and accumulation of data is key here, and it generalises to the underlying logic of the web 2.0 business model: a cyclic processing and archiving of data performed in collaboration between users and platforms, what Robert Gehl has described as ‘affective processing’ (2011). We could cite the often-referenced Tim Berners-Lee as someone calling for the re-decentralisation of the internet. Work like that done at the <a href="https://theodi.org/">Open Data Institute</a> is clearly needed; there is a strong case for letting users have more control over their data. This is particularly important regarding data’s increasing use within AI, something that magnifies the inequalities of data access.</p>
<p><strong>Conclusion</strong></p>
<p>The methodologies used in this series, first with Twitter and then with the articles published by The Guardian, provided perspective on subtopics within the issue. The t-SNE visualisations and word co-occurrences used with the Twitter data mapped related terms, whilst hashtag counts more concretely identified trends. Comparisons with The Guardian articles over the corresponding timeline helped offer context from other recent personal data scandals. This exploration was beneficial in establishing a starting point for more in-depth discussion. The main points discussed here include standards for democracy, negotiations of data and privacy, and the underpinning strategies of Facebook as a platform. Though what is presented may be incomplete in parts, it provides research and experience that may be helpful for future personal projects.</p>
<p>Considering the Cambridge Analytica scandal as it has evolved has shed light on numerous cultural foundations. Issues of privacy, data and election regulation in relation to technology have been recurrent in recent history. As technologies evolve or attitudes change, renegotiations consistently need to take place for aspirations of consensus to be approached, yet arguably never attained. Research considering this may well prove important in the decisions that get made.</p>
<h3 id="references">References</h3>
<ul>
<li>Bucher, T. and Helmond, A. 2018. The Affordances of Social Media Platforms. In: Burgess, J., Marwick, A. and Poell, T. eds. <em>The SAGE Handbook of Social Media</em>. London: SAGE Publications. pp.233-253.</li>
<li>Gehl, R. 2011. The archive and the processor: The internal logic of Web 2.0. <em>New Media & Society</em>. [Online]. <strong>13</strong>(8). pp.1228-1244. [Accessed 5 May 2018]. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/1461444811401735">http://journals.sagepub.com/doi/abs/10.1177/1461444811401735</a></li>
<li>Gillespie, T. 2018. Regulation of and by Platforms. In: Burgess, J., Marwick, A. and Poell, T. eds. <em>The SAGE Handbook of Social Media</em>. London: SAGE Publications. pp.254-278.</li>
<li>Marwick, A. and Boyd, D. 2014. Networked privacy: How teenagers negotiate context in social media. <em>New Media & Society</em>. [Online]. <strong>16</strong>(7). pp.1051-1067. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/1461444814543995">http://journals.sagepub.com/doi/abs/10.1177/1461444814543995</a></li>
<li>Nissenbaum, H. 2010. <em>Privacy in Context: Technology, Policy, and the Integrity of Social Life</em>. California: Stanford University Press.</li>
<li>Tsay-Vogel, M., Shanahan, J. and Signorielli, N. 2018. Social media cultivating perceptions of privacy: A 5-year analysis of privacy attitudes and self-disclosure behaviours among Facebook users. <em>New Media & Society</em>. [Online]. <strong>20</strong>(1). pp.141-161. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/1461444816660731">http://journals.sagepub.com/doi/abs/10.1177/1461444816660731</a></li>
<li>Strömbäck, J. 2005. In Search of a Standard: Four models of democracy and their normative implications for journalism. <em>Journalism Studies</em>. <strong>6</strong>(3). pp.331-345.</li>
<li>Van Dijck, J. 2012. Facebook as a Tool for Producing Sociality and Connectivity. <em>Television & New Media</em>. [Online]. <strong>13</strong>(2). pp.160-176. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/1527476411415291">http://journals.sagepub.com/doi/abs/10.1177/1527476411415291</a></li>
<li>Young, A. and Quan-Haase, A. 2013. Privacy protection strategies on Facebook. <em>Information, Communication & Society</em>. [Online]. <strong>16</strong>(4). pp.479-500. Available from: <a href="https://www.tandfonline.com/doi/abs/10.1080/1369118X.2013.777757">https://www.tandfonline.com/doi/abs/10.1080/1369118X.2013.777757</a></li>
</ul>
<h1>Data, politics and democracy part 3: Analysis of The Guardian’s content</h1>
<p><em>2018-04-12</em></p>
<p>This post details investigations into content created by the Guardian, a key player in the dissemination of the Facebook/Cambridge Analytica story. It uses computational methods such as keyword extraction, also making comparisons with the previously collected Twitter data.</p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/latest.js?config=TeX-MML-AM_CHTML" async=""></script>
<h2 id="strategy">Strategy</h2>
<p>Seeing that The Guardian was one of the main news organisations to break the initial story, it is arguably relevant to look at patterns within its coverage. Acting as <em>The Fourth Estate</em>, news organisations are commonly expected to hold governments and other public entities to account; how has this been done here? Taking into consideration the results of the Twitter data analyses, comparisons between the two content types will also be made. As part of a series, a picture is gradually being built up from a variety of different perspectives. Later, this information will be used to discuss the content of the two media types, and the topic more generally.</p>
<p>Making use of the Guardian’s <a href="http://open-platform.theguardian.com/">Open Platform API</a>, all articles between <code class="highlighter-rouge">17/03/2018</code> - <code class="highlighter-rouge">24/03/2018</code> were collected for analysis. This collection period starts on the date of the initial story and finishes when the previously sampled Twitter data ends its collection span. Symmetrical query terms were likewise used, here translated into the terms of the Guardian’s search API as <code class="highlighter-rouge">facebook AND "cambridge analytica"</code>. The content collected was then analysed using a variety of methods, looking at both manual and machine-generated structures within the data. Keywords were generated automatically using the RAKE algorithm; other approaches such as TF-IDF scores over n-grams and the topic generation discussed in previous articles were also utilised.</p>
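<p>For illustration, a sketch of building such a request against the Guardian Open Platform content-search endpoint might look as follows (the helper name and exact parameter choices are assumptions rather than the project’s actual collection code):</p>

```python
import urllib.parse

SEARCH_URL = "https://content.guardianapis.com/search"

def guardian_search_url(query, from_date, to_date, api_key, page=1):
    """Build a Guardian Open Platform content-search request URL."""
    params = {
        "q": query,                 # e.g. 'facebook AND "cambridge analytica"'
        "from-date": from_date,     # ISO dates, e.g. "2018-03-17"
        "to-date": to_date,
        "show-fields": "bodyText",  # include the full article text in results
        "page-size": 50,
        "page": page,               # step through result pages
        "api-key": api_key,
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)
```

<p>Each response page reports a <code class="highlighter-rouge">pages</code> total, so collection simply loops the <code class="highlighter-rouge">page</code> parameter until exhausted.</p>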
<p><strong>RAKE algorithm</strong></p>
<p>Presented in Rose, S., Engel, D., Cramer, N. and Cowley, W. (2010), the <em>Rapid Automatic Keyword Extraction</em> (RAKE) algorithm does as its title describes, generating keyword phrases from individual documents – ideally short texts like abstracts. A characteristic of this algorithm is that it weights longer sequences more heavily, producing greedier results. More details about how this is implemented can be found in the <a href="https://winstonjay.github.io/digitalcitizens/methods/">methods page of this blog</a>.</p>
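<p>As a rough illustration of the mechanics (not the repository implementation), candidate phrases are formed by splitting text at punctuation and stop words, then scored by summing each member word’s degree-to-frequency ratio. A condensed sketch with a toy stop word list:</p>

```python
import re

# Toy stop word list; the real implementation uses a much larger one.
STOPWORDS = {"the", "of", "a", "an", "and", "is", "in", "to", "for", "with", "on"}

def candidate_phrases(text):
    # Split on punctuation first, then break each fragment at stop words.
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

def rake_scores(text):
    # Score each word by degree/frequency, then sum scores over phrase members.
    phrases = candidate_phrases(text)
    freq, degree = {}, {}
    for phrase in phrases:
        words = phrase.split()
        for w in words:
            freq[w] = freq.get(w, 0) + 1
            degree[w] = degree.get(w, 0) + len(words) - 1  # co-occurrences only
    return {p: sum((degree[w] + freq[w]) / freq[w] for w in p.split())
            for p in phrases}
```

<p>Because longer phrases sum over more member words, they naturally accumulate higher scores – the greediness noted above.</p>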
<p>To depict keywords across the whole corpus, a method described in the authors’ paper for finding the most ‘essential’ terms was used in conjunction. The calculation can be summarised as follows:</p>
<script type="math/tex; mode=display">essentiality = (\frac{\text{edf}_k}{\text{rdf}_k}) \text{edf}_k</script>
<p>Where the <em>edf</em> (extraction document frequency) is the number of documents the candidate was extracted from as a keyword and <em>rdf</em> (reference document frequency) is the number of times a candidate appeared across the collection. With this approach, perhaps we will be able to get an alternative but reasonable portrayal of the keywords within these articles.</p>
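<p>Sketched in code, the score needs just those two counts per candidate. The inputs below are hypothetical stand-ins for RAKE output and the raw documents:</p>

```python
from collections import Counter

def essentiality_scores(extracted, documents):
    """Rank candidate keywords across a corpus by essentiality.

    extracted: one set of extracted keywords per document (e.g. RAKE output)
    documents: the raw text of each document
    """
    candidates = set().union(*extracted)
    # edf: number of documents the candidate was *extracted* from as a keyword.
    edf = Counter(kw for keywords in extracted for kw in keywords)
    # rdf: number of documents the candidate *appears* in at all.
    rdf = Counter()
    for doc in documents:
        text = doc.lower()
        for kw in candidates:
            if kw in text:
                rdf[kw] += 1
    return {kw: (edf[kw] / rdf[kw]) * edf[kw] for kw in candidates if rdf[kw]}

scores = essentiality_scores(
    [{"data privacy"}, {"data privacy"}, set()],
    ["Data privacy matters", "New data privacy rules",
     "Talking about data privacy anyway"])
```

<p>Here <code class="highlighter-rouge">data privacy</code> appears in three documents but was extracted from only two, giving a score of (2/3) × 2.</p>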
<p>Due to its relative simplicity, the RAKE algorithm and other related functions were implemented as needed in the Python programming language and can be found in the <code class="highlighter-rouge">text_tools</code> section of this blog’s GitHub repository (<a href="https://github.com/winstonjay/digitalcitizens/blob/master/text_tools/rake.py">digitalcitizens/rake</a>).</p>
<h2 id="analysis">Analysis</h2>
<h3 id="article-keywords">Article Keywords</h3>
<p><strong>Manually generated</strong></p>
<p>Looking at the human generated keywords tells us not only about the articles but also about the internal practices of the organisation. Being the query terms given, <code class="highlighter-rouge">Cambridge Analytica</code> and <code class="highlighter-rouge">Facebook</code> naturally top the chart; most of the other top items are meta keywords that describe overarching collections of content like <code class="highlighter-rouge">Technology</code> or <code class="highlighter-rouge">UK news</code>. The geographic categorisation of news types can tell us more about the priorities of the content producers (e.g. ‘US news’). This can be described here as <em>UK news</em> <code class="highlighter-rouge">></code> <em>World news</em> <code class="highlighter-rouge">></code> <em>US news</em>. Because of this, although the US 2016 elections were a key part of the topic, we are more likely to see British issues like Brexit reflected.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/tags.png" /></p>
<figcaption> <strong>Figure 1:</strong> Guardian Keywords tagged by Organisation.</figcaption>
<p>With knowledge of what this data concerns, we could use these tags to inversely query content over a longer period within the paper. Possible topics of interest include <code class="highlighter-rouge">Data protection</code>, <code class="highlighter-rouge">privacy</code> and <code class="highlighter-rouge">social media</code>. Seeing how these topics have evolved over time might be an interesting line to follow in establishing publishing patterns. This will be discussed at a later point in this article.</p>
<p><strong>RAKE results</strong></p>
<p>Comparing the RAKE key phrases with the human generated ones, the differences in style are apparent. The human keywords provide quite clear and considered metadata whose core function is to group content systematically. Here, though following qualitatively similar topics, we find structure in the natural language of the news reporters. For such a simple approach, it does appear to give good results. One noticeable caveat is its failure to capture single word key phrases well – questionably, <code class="highlighter-rouge">Brexit</code> is missing here. We can also see its tendency to be greedy and present longer phrases such as <code class="highlighter-rouge">50m Facebook profiles</code> over simply <code class="highlighter-rouge">Facebook</code>.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/rakekw.png" /></p>
<figcaption><strong>Figure 2:</strong> Guardian Keywords generated by RAKE algorithm.</figcaption>
<p>Questions also arose over how much of the input text the algorithm should be applied to. Though its best use cases are often described as shorter texts, like abstracts, the <a href="https://www.collinsdictionary.com/dictionary/english/standfirst"><code class="highlighter-rouge">standfirst</code></a> content used to describe articles has a tendency to cram words together without using many stop words. A purely constructed example of this might be: <em>Canadian Whistle blower breaks Cambridge Analytica data scandal story</em>. As by many standards this doesn’t contain any stop words, it would form a single keyword candidate, ultimately leading to unhelpfully long results. Because of this, full articles were given as inputs instead, which performed more successfully.</p>
<h3 id="comparisons-with-twitter-data">Comparisons with Twitter data</h3>
<p>Generating TF-IDF scores for unigrams and bigrams across both datasets, comparisons of the content can be made (<span style="color:#018ed5"> █ The Guardian</span>, <span style="color:#e91f63;"> █ Twitter</span>). With similar pre-processing steps taken for each, an initial observation finds differences in structure and consistency. The Twitter data has far more low information words and media specific language (e.g. retweeted). Query terms within the Twitter data are also far more impactful to the scores generated: as an item of content only needs one occurrence of a query term to be collected, the ratio of query terms to non-query terms is obviously higher in shorter texts. A possible way to combat this might be to normalise scores based on content length, though this has not been applied here.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/ngrams1.png" /></p>
<figcaption><strong>Figure 3:</strong> Top 20 unigrams and bigrams for Guardian articles.</figcaption>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/twitter.png" /></p>
<figcaption><strong>Figure 4:</strong> Top 20 unigrams and bigrams for Twitter dataset.</figcaption>
<p>As noted previously, the Guardian content is more concerned with UK affairs, e.g. the <code class="highlighter-rouge">vote leave</code> campaign. Perhaps because Twitter has more American users, Brexit related affairs do not show up at all within the Twitter rankings displayed. Though there might not have been time for stories to circulate, and the articles here do not represent total media coverage, one of the key actors, Aleksandr Kogan, seems not to have received as much attention on Twitter. To some extent we could argue that the main issue is the general social practices Facebook facilitates between data and political organisations. Facebook is the only actor here that has a consistent relationship with society, and naturally receives the most attention.</p>
<p>Though better representations could have been achieved with the Twitter results, it was decided not to pursue even more cleaning steps. As noted before, this dataset was especially messy. Working with the Guardian articles, though some pre-processing was still needed, was a real breath of fresh air.</p>
<h3 id="time-series-data-over-the-last-8-years">Time series data over the last 8 years</h3>
<p>Working with some of the top keywords generated by the paper, additional data was collected from the Guardian’s API, querying posts between <code class="highlighter-rouge">2010</code> and the present. A time series of articles per query is detailed below.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/timeseries.png" /></p>
<figcaption><strong>Figure 5:</strong> Time series of query result counts</figcaption>
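<p>The bucketing behind a chart like this is simple to sketch, assuming each API result carries an ISO publication date (the Guardian’s results include a <code class="highlighter-rouge">webPublicationDate</code> field); the dates below are illustrative:</p>

```python
from collections import Counter

def monthly_counts(publication_dates):
    # "2018-03-17T12:00:00Z"[:7] -> "2018-03", i.e. one bucket per month.
    return Counter(date[:7] for date in publication_dates)

counts = monthly_counts([
    "2013-06-10T08:00:00Z",  # around the time of the Snowden leak
    "2013-06-21T09:30:00Z",
    "2018-03-17T12:00:00Z",  # the initial Cambridge Analytica story
])
```

<p>Plotting one such counter per query term produces the overlaid series shown above.</p>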
<p>Using the titles of the articles to determine subtopics within each query, the most related bi-grams can also be shown below. Only <code class="highlighter-rouge">privacy</code> and <code class="highlighter-rouge">data protection</code> were displayed, as the others were more generalised and of less interest.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/privacy.png" /></p>
<figcaption><strong>Figure 6:</strong> Most common bi-grams per query (privacy, data protection)</figcaption>
<p>It seems ‘phone hacking’ and ‘Edward Snowden’ are top of the list. Looking back at the time series data, there only seems to be a small peak in 2011 at the time of the phone hacking scandal, whilst around the time of the Snowden leak there is a greater peak in data protection, privacy and the internet. Facebook seems to have received most attention in mid-2016, then peaked again recently. It’s unclear to my knowledge what happened in 2015 regarding data protection, but it seems like someone must have had a wild month or two. Another point of interest is that since 2016 Facebook has maintained more coverage than the internet more generally, something that feeds into the narrative that the internet is becoming merely the walled gardens of the giant social media platforms.</p>
<hr />
<h4 id="read-next-part-4-reflections">Read Next: <a href="/digitalcitizens/posts/2018/data-politics-and-democracy-part-4">Part 4: Reflections</a></h4>
<hr />
<h3 id="references">References</h3>
<ul>
<li>Rose, S., Engel, D., Cramer, N., and Cowley, W. 2010. Automatic Keyword Extraction from Individual Documents. In: Berry, M.W. and Kogan, J. ed. <em>Text Mining: Applications and Theory</em>. UK: Wiley-Blackwell. pp.1-20. <a href="https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents">Also available online</a></li>
</ul>
<h3 id="supporting-resources">Supporting resources</h3>
<ul>
<li><a href="https://github.com/winstonjay/digitalcitizens/tree/master/api_tools/the_guardian">digitalcitizens/api_tools/the_guardian at master · winstonjay/digitalcitizens · GitHub</a></li>
<li><a href="https://github.com/winstonjay/digitalcitizens/blob/master/text_tools/rake.py">digitalcitizens/rake.py at master · winstonjay/digitalcitizens · GitHub</a></li>
<li><a href="https://github.com/winstonjay/digitalcitizens/blob/master/notebooks/guardian_fb_ca.ipynb">digitalcitizens/guardian_fb_ca.ipynb at master · winstonjay/digitalcitizens · GitHub</a></li>
</ul>Karl SimsThis post will detail investigations into content created by the Guardian, a key player in the dissemination of the Facebook/Cambridge Analytica story. It will use computational methods such as keyword extraction, also making comparisons with the previously collected data from Twitter.Data, politics and democracy part 2: Twitter reactions to Facebook’s so-called ‘data leak’2018-04-12T22:22:45+00:002018-04-12T22:22:45+00:00/digitalcitizens/posts/2018/data-politics-and-democracy-part-2<p>Using data collected whilst the Facebook/Cambridge Analytica story was first gaining momentum, this post looks at the responses made to it via Twitter. Experimenting with word embeddings and other computational methods, it aims to map key dimensions that highlight the contextual relationships between different sentiments across the dataset.</p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/latest.js?config=TeX-MML-AM_CHTML" async=""></script>
<h2 id="introduction">Introduction</h2>
<p>This section of the series can be outlined as follows: first, a brief rationale is given on why Twitter is being used as a site of research rather than Facebook; second, the analytical methods are introduced; third, findings are discussed; and finally a brief conclusion is made.</p>
<h3 id="twitter-over-facebook">Twitter over Facebook</h3>
<p>It is probably fitting to answer why Twitter is being used to explore an issue most concerning Facebook users. One answer is simply that it is easier and cheaper. Another is that Twitter is arguably far more clearly configured for the expression of public sentiment.</p>
<p><strong>Collection cost</strong></p>
<p>It is cheaper to collect the kind of public response data in mind on Twitter rather than Facebook due to the design of their data collection services. For aggregating posts in real time, Twitter has its <a href="https://developer.twitter.com/en/docs/tweets/filter-realtime/overview">Streaming API</a>, which is open for anyone to use. The closest thing Facebook has to this is its <a href="https://developers.facebook.com/docs/public_feed/">Public Feed API</a>, access to which is restricted to a limited set of pre-approved ‘media publishers’. To access Facebook user data at scale, you must either pay, have special institutional privileges, provide a widely-used service or pretty much trick users into sharing it with you.</p>
<p><strong>What’s being shared</strong></p>
<p>Regarding the character of the content created within each site, there are specific design aspects to each that could have a formative effect on what is shared. This presents differences in the usefulness of each for this investigation. On Facebook, a user principally addresses their ‘friends’; on Twitter, their ‘followers’. This is reflected in, among other things, the default post visibility settings. Facebook asks, ‘what’s on your mind?’; Twitter asks, ‘what’s happening?’. Ultimately, the difference in what each site purveys as its use is that Facebook is about connecting people, while Twitter is about connecting people to current affairs.</p>
<p><strong>Short-comings</strong></p>
<p>Though it may provide a means for investigating sentiment on an international scale, public posts made on Twitter are undoubtedly not a representative sample of all public opinion, or even of the opinions of all Twitter users. Though it provides large amounts of data, its quality is often hard to determine; as we will see in this section, it can be especially messy at times. These themes and other critical engagements are discussed in more detail in a previous post (<a href="#TODO">Notes on Digital Methods</a>).</p>
<p>For these reasons, though not perfect, Twitter seems like the more appropriate tool for learning about public interaction with current affairs.</p>
<h2 id="approach">Approach</h2>
<p>Approximately 500,000 tweets were collected using Twitter’s <a href="https://developer.twitter.com/en/docs/tweets/filter-realtime/overview">streaming API</a> between the 20th and the 23rd of March 2018, filtering for the query terms <code class="highlighter-rouge">Facebook</code> and <code class="highlighter-rouge">Cambridge Analytica</code>. This was just after the Guardian’s <a href="https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election">initial story</a> was released and was gaining traction across social and broadcast media.
Along with counting hashtag frequencies and word co-occurrences, visualisations generated from word embeddings will be used to form a distant reading of the semantic relationships within the dataset. This experimentation provides a contextual overview of the response to help identify specific attributes for moving forward.</p>
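<p>As a point of reference for how that filtering behaves, the streaming API’s <code class="highlighter-rouge">track</code> parameter treats each comma-separated phrase as an OR, and the words within a phrase as case-insensitive ANDs matched anywhere in the tweet. That logic can be mimicked locally (tweets here are plain strings):</p>

```python
import re

def matches_track(text, track=("facebook", "cambridge analytica")):
    # A tweet matches if, for any phrase, all of its words appear in the text.
    words = set(re.findall(r"\w+", text.lower()))
    return any(all(w in words for w in phrase.split()) for phrase in track)
```

<p>Note the words of a phrase need not be adjacent or in order, which is part of why the collected data casts such a wide net.</p>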
<h3 id="word-embeddings">Word embeddings</h3>
<p>Vector representations of words, or <a href="https://en.wikipedia.org/wiki/Vector_space_model">vector space models</a>, aim to map the semantic similarity of words in continuous vector space. This has advantages over the more traditional bag-of-words model as it provides denser representations of terms. Instead of treating individual terms as unique identifiers, we can embed contextual information within them, for example the similarities that cats and kittens do and don’t have. These come in two essential styles: count based and neural embeddings. Within this investigation they will be used to compare related terms within the dataset.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/dataspace.jpg" /></p>
<p>The intuitions behind word embeddings depend on the <a href="https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_Hypothesis">distributional hypothesis</a>, which implies that semantically similar words occur in similar contexts. As J.R. Firth summarises, <em>‘you shall know a word by the company it keeps’</em> (1957; cited in Jurafsky, D. and James, M. 2009. p692).</p>
<p>Though the definition of what constitutes a context can vary, in this article it will be employed in two distinct ways. One assumes context is created by a window of neighbouring words, for instance two either side; this will be used to build a neural model. The other assumes all words within a tweet share a context, and will be used to measure co-occurrence in a count based manner.</p>
<p><strong>Count based methods</strong></p>
<p>For my own notes, an illustrative example is given, a slight variation on that provided in Grefenstette, E. 2017. The technique for building vector representations shown here also forms the basis for the neural embeddings described later.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \text{... the cute kitten purred ...}\\
& \text{... the old furry cat meowed and purred ...}\\
& \text{... the small furry kitten meowed ...}\\
& \text{... a loud furry old dog barked ...}\\
\end{align} %]]></script>
<p>Say we target the words <code class="highlighter-rouge">kitten</code>, <code class="highlighter-rouge">cat</code>, and <code class="highlighter-rouge">dog</code>. Using the examples above and ignoring stop words (low information words like: ‘the’, ‘a’), we can list the witnessed context words for each as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \textbf{kitten}: & & cute, purred, small, furry, meowed\\
& \textbf{cat} : & & old, furry, meowed, purred\\
& \textbf{dog} : & & loud, furry, old, barked \\
\end{align} %]]></script>
<p>After this small example our complete set of context vocabulary would be: <code class="highlighter-rouge">{cute, purred, small, furry, meowed, old, loud, barked}</code>. Using this generated vocabulary, one way we can create a vector representation for each of our target words is as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{kitten}&=\left[ \begin{array} &1& 1& 1& 1& 1& 0& 0& 0 \end{array} \right]\\
\text{cat} &=\left[ \begin{array} &0& 1& 0& 1& 1& 1& 0& 0 \end{array} \right]\\
\text{dog} &=\left[ \begin{array} &0& 0& 0& 1& 0& 1& 1& 1 \end{array} \right]\\
\end{align} %]]></script>
<p>To do this we denote the presence of the context words, in the order described above, with either a <code class="highlighter-rouge">0</code> (false) or <code class="highlighter-rouge">1</code> (true), depending on whether they appear in the same context as our target word. This is useful as we can now compute the similarity between each pair of words, for instance with cosine similarity.</p>
<script type="math/tex; mode=display">cosine(\pmb u, \pmb v) = \frac {\pmb u \cdot \pmb v}{||\pmb u|| \cdot ||\pmb v||}</script>
<p><small>(The numerator of the equation here is the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> of the two vectors and the denominator is the product of the two <a href="https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm">Euclidean norms</a>.)</small></p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
cosine(\text{kitten}, \text{dog}) & \approx 0.22\\
cosine(\text{cat}, \text{dog}) & \approx 0.50\\
cosine(\text{cat}, \text{kitten}) & \approx 0.67\\
\end{align} %]]></script>
<p>Computing this, as expected from this completely constructed example, <code class="highlighter-rouge">kitten</code> is most like <code class="highlighter-rouge">cat</code>. WOW! how did that happen? Also, because both <code class="highlighter-rouge">cat</code> and <code class="highlighter-rouge">dog</code> have <code class="highlighter-rouge">old</code> and <code class="highlighter-rouge">furry</code> in their contexts, <code class="highlighter-rouge">dog</code> is more like <code class="highlighter-rouge">cat</code> than <code class="highlighter-rouge">kitten</code>.</p>
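<p>The whole toy example can be checked mechanically. A short script building the binary context vectors straight from the example sentences and comparing them:</p>

```python
import math

def cosine(u, v):
    # Dot product over the product of the two Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Vocabulary order: cute, purred, small, furry, meowed, old, loud, barked
kitten = [1, 1, 1, 1, 1, 0, 0, 0]
cat    = [0, 1, 0, 1, 1, 1, 0, 0]
dog    = [0, 0, 0, 1, 0, 1, 1, 1]

print(round(cosine(kitten, dog), 2))  # 0.22
print(round(cosine(cat, dog), 2))     # 0.5
print(round(cosine(cat, kitten), 2))  # 0.67
```

<p>Even at this scale the ranking is driven entirely by which context words are shared, which is the whole point of the distributional approach.</p>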
<p><strong>Neural embeddings</strong></p>
<p>Beyond count based methods, neural embeddings have also been widely employed to predict vector representations. Using this method, embeddings are normally represented by a matrix of target and context words. Two of the main modelling strategies here are the Continuous Bag-of-Words model (CBOW) and the Skip-gram model. They function in pretty much opposite ways. Here is an illustration comparing both:</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/skipgram.jpg" />
<small>Image from: (Mikolov, T. Chen, K. et al. 2013) <a href="https://arxiv.org/abs/1301.3781">https://arxiv.org/abs/1301.3781</a></small></p>
<p>The CBOW model tries to predict the target word from a given set of context words and the Skip-gram model tries to predict the context words given a target word. In this post the Skip-gram model is used, implemented with the machine learning framework <a href="https://www.tensorflow.org/">Tensorflow</a>. This was done with reference to their demonstration <a href="https://github.com/tensorflow/tensorflow/blob/r1.7/tensorflow/examples/tutorials/word2vec/word2vec_basic.py">word2vec_basic.py</a>. Alterations to the original file have been made to pre-process the data differently and carry out some additional steps.</p>
<p><strong>Visualising vector representations</strong></p>
<p>Word vectors produce high dimensional data. To make sense of the representations visually we can project them into lower dimensional space. This will be done here using t-distributed stochastic neighbour embedding (t-SNE) implemented using the Python package <a href="http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html">Scikit-Learn</a>.</p>
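<p>A minimal sketch of that projection step, with random vectors standing in for the learned embeddings (the real inputs were the trained word vectors):</p>

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the learned embedding matrix: 250 words, 16 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(250, 16)).astype(np.float32)

# Project the high-dimensional vectors down to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(embeddings)
print(coords.shape)
```

<p>Each row of <code class="highlighter-rouge">coords</code> is then scattered and labelled with its word to produce the visualisations below.</p>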
<p><strong>Counting Co-occurrences</strong></p>
<p>The idea of a word co-occurrence matrix is widely used in natural language processing. As a simple extra piece of analysis, the top co-occurrences of some terms of interest from within the dataset will be presented. The terms of interest are: <code class="highlighter-rouge">data</code>, <code class="highlighter-rouge">privacy</code>, <code class="highlighter-rouge">people</code>, <code class="highlighter-rouge">delete</code>, <code class="highlighter-rouge">trust</code>, <code class="highlighter-rouge">users</code>. This is presented in a collection of bar charts.</p>
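<p>With tweets treated as single shared contexts, the counting itself is straightforward. A sketch with made-up tweets:</p>

```python
from collections import Counter

def co_occurrences(tweets, targets):
    # Treat all words in a tweet as sharing one context with each target term.
    counts = {t: Counter() for t in targets}
    for tweet in tweets:
        words = set(tweet.lower().split())
        for target in targets:
            if target in words:
                counts[target].update(words - {target})
    return counts

co = co_occurrences(
    ["delete facebook now", "facebook data breach", "no trust after this breach"],
    ["facebook", "trust"])
```

<p>Taking each counter’s <code class="highlighter-rouge">most_common</code> entries gives one bar chart per term of interest.</p>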
<h3 id="noise-within-the-data">Noise within the data</h3>
<p>As discussed in a previous post, <a href="/digitalcitizens/posts/2018/twitter-spam-and-ham">Twitter Spam and Ham</a>, the dataset collected here was especially noisy. As a reminder, this came in the form of spam targeting the trending topic, but also tweets in a wide variety of languages. The topic followed was an internationally discussed issue; however, it didn’t help that the query terms were organisation names instead of words belonging to the English language.</p>
<h3 id="approach-summary">Approach summary</h3>
<p>The methods for investigating the data here are experimental, in a way that tries to learn about methodological approaches and the data simultaneously. The main strategy being employed is to use distributional semantics to link common terms and, in doing so, see what more can be understood about the discussions within the sample.</p>
<h2 id="findings">Findings</h2>
<h3 id="hashtags">Hashtags</h3>
<p>Looking at hashtag frequencies, we can see that <code class="highlighter-rouge">deleteFacebook</code> was the most popular – something that was also widely reported by news organisations. Whilst filtering spam it was noted that many tweets contained nothing but this tag repeated. Some of these tweets looked like they came from automated accounts, others more natural. This posed some questions of trend manipulation, but after counting document frequencies rather than repeats within tweets, the trend did not change. Below, the top ten tags can be seen with the query terms (‘Facebook’, ‘CambridgeAnalytica’) filtered from the collection. Taking this approach, there doesn’t seem to be anything too overt to help explore the topic in any new directions.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/tags.png" /></p>
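<p>The document-frequency counting described above can be sketched as follows; the example tweets are invented:</p>

```python
import re
from collections import Counter

def top_hashtags(tweets, n=10, exclude=("facebook", "cambridgeanalytica")):
    tags = Counter()
    for tweet in tweets:
        # Count each tag once per tweet (document frequency, not raw repeats),
        # and drop the query terms themselves.
        for tag in set(re.findall(r"#(\w+)", tweet.lower())):
            if tag not in exclude:
                tags[tag] += 1
    return tags.most_common(n)

top = top_hashtags([
    "#DeleteFacebook #Facebook scandal",
    "#deletefacebook again",
    "thoughts on #privacy",
])
```

<p>Using a set per tweet is what guards the chart against the repeated-tag spam noted above.</p>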
<h3 id="word-embeddings-1">Word embeddings</h3>
<p>Running the Tensorflow model described previously for 10,000 iterations, it finished with an average loss of around 4% in predicting contexts of given terms within the training data. How well the model performs in practice is obviously harder to determine. The size and quality of the data put into the model will not give a good representation of the English language as a whole; it may, however, have the potential to map corpus-specific terms and ideas. Below is the t-SNE visualisation of the word embeddings created. What we should hope to see is that similar terms end up clustering near each other. What is projected is the <code class="highlighter-rouge">most common 250 terms</code> within the dataset after the removal of stopwords.</p>
<p><img class="big-img" src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/tweets4_tsne.png" /></p>
<p>It is argued that t-SNE visualisations lend themselves to being easily misread (Wattenberg, et al., 2016); hopefully that won’t be the case here. Running the visualisation program several times, the intuitive understanding is that while global positioning within the graph tends to vary somewhat, locally relevant terms produce more repeatable results. For example, the small group of un-filtered French stop words <code class="highlighter-rouge">vous, mais, pour, dans</code> always clustered. Logically, French words are not likely to appear in the same context as English ones, so that seems correct. Common bigrams such as <code class="highlighter-rouge">fake news</code>, <code class="highlighter-rouge">social platforms</code>, <code class="highlighter-rouge">public security</code> seem to have clustered also, as have variations in tense, pluralisation etc.</p>
<p><strong>Annotating the Space</strong></p>
<p>Zooming in on the bottom left of the graph locates the main areas of interest in this inquiry. Qualitative annotations have been made to bring further structure to the space; this is employed as a method of communication, not classification. Being an issue centred on the use of data, several dimensions can be said to emerge: the business or economics of data; its politics, or how data is used and regulated; and the individual securities and privacies of users. These obviously overlap and are in an active state of interplay.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/annotated.jpg" /></p>
<p>The annotation of <code class="highlighter-rouge">sociality</code> was included as it points to language use that is clustered because the corpus comes from social media. The words <code class="highlighter-rouge">like</code>, <code class="highlighter-rouge">share</code>, <code class="highlighter-rouge">follow</code>, <code class="highlighter-rouge">post</code>, though they have become more common in general language use, would not be as present within a book, for instance. Recalling this, and the way the dataset was especially subject to spam, is a reminder that the logic social media platforms operate on doesn’t stop. Even during expressions of dissent or outrage, not only are the platforms themselves profiting from these expressions, users are also incorporating their logics to promote their own positions.</p>
<h3 id="co-occurrences">Co-occurrences</h3>
<p>The co-occurrences presented here are as expected. Utterances of <code class="highlighter-rouge">trust</code> are most common with utterances of <code class="highlighter-rouge">breach</code>. This method simply provides another perspective for visualising points of interest.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/cooccurances.png" /></p>
<h2 id="summary">Summary</h2>
<p>Beyond being a DIY exercise, investigating a topic through Twitter that has already been covered extensively within the news doesn’t yield much new information. We can see people are talking about a breach of trust, #deleteFacebook and the data scandal in relation to politics – as was reported. Evaluating the methods used has also provided challenges, particularly with the neural embeddings, as the data in reduced-dimensional visualisations is subject to compression. Overall, this has provided a way to think about the initial reactions to the topic visually, and differently to other approaches that have been taken within this blog. Its results will be used for further discussion in a later post.</p>
<hr />
<h4 id="read-next-part-3-analysis-of-the-guardians-content">Read Next: <a href="/digitalcitizens/posts/2018/data-politics-and-democracy-part-3">Part 3: Analysis of The Guardian’s content</a></h4>
<hr />
<h3 id="references">References</h3>
<ul>
<li>Grefenstette, E. 2017. <em>Lecture 2a- Word Level Semantics</em>. [Online]. Available from: <a href="https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/Lecture%202a-%20Word%20Level%20Semantics.pdf">lectures/Lecture 2a- Word Level Semantics.pdf at master · oxford-cs-deepnlp-2017/lectures · GitHub</a></li>
<li>Jurafsky, D. and James, M. 2009. <em>Speech and Language Processing</em>. Second Ed. London, UK: Pearson Education Ltd. (<a href="https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf">Third Ed is available here online</a>)</li>
<li>Maaten, L. van der and Hinton, G. 2008. Visualizing data using t-SNE. <em>Journal of Machine Learning Research</em>. 9, pp.2579-2605. [Online]. Available from: <a href="http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf</a></li>
<li>Mikolov, T. Chen, K. et al. 2013. <em>Efficient Estimation of Word Representations in Vector Space</em>. [Online]. Available from: <a href="https://arxiv.org/abs/1301.3781">https://arxiv.org/abs/1301.3781</a></li>
<li>Wattenberg, et al., 2016. <em>How to Use t-SNE Effectively</em>, Distill. [Online]. Available from: <a href="http://doi.org/10.23915/distill.00002">http://doi.org/10.23915/distill.00002</a></li>
<li><a href="https://www.tensorflow.org/tutorials/word2vec">Vector Representations of Words - TensorFlow</a></li>
<li><a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick</a></li>
</ul>
<h3 id="supporting-resources">Supporting resources:</h3>
<ul>
<li><a href="https://github.com/winstonjay/digitalcitizens/blob/master/text_tools/word2vec_basic.py">digitalcitizens/word2vec_basic.py at master · winstonjay/digitalcitizens · GitHub</a></li>
<li><a href="https://github.com/winstonjay/digitalcitizens/blob/master/notebooks/ca_fb_notes.ipynb">digitalcitizens/ca_fb_notes.ipynb at master · winstonjay/digitalcitizens · GitHub</a></li>
</ul>
<h3 id="additional-reading">Additional reading:</h3>
<ul>
<li><a href="http://annabellelukin.edublogs.org/files/2013/08/Firth-JR-1962-A-Synopsis-of-Linguistic-Theory-wfihi5.pdf">Firth, John R. “A synopsis of linguistic theory, 1930-1955.” (1957): 1-32.</a></li>
<li><a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.</a></li>
<li><a href="https://www.datacamp.com/community/tutorials/lda2vec-topic-model">LDA2vec: Word Embeddings in Topic Models (article) - DataCamp</a></li>
<li><a href="https://github.com/MaxwellRebo/awesome-2vec">GitHub - MaxwellRebo/awesome-2vec: Curated list of 2vec-type embedding models</a></li>
<li><a href="http://nlp.town/blog/anything2vec/">Anything2Vec, or How Word2Vec Conquered NLP</a></li>
</ul>Karl SimsUsing data collected whilst the Facebook/Cambridge Analytica story was first gaining momentum, this post looks at the responses made to it via Twitter. Experimenting with word embeddings and other computational methods, it aims to map key dimensions that highlight the contextual relationships between different sentiments across the dataset.Data, politics and democracy part 1: Introduction2018-04-12T22:22:17+00:002018-04-12T22:22:17+00:00/digitalcitizens/posts/2018/data-politics-and-democracy-part-1<p>Presenting a series of forthcoming posts related to the recent Facebook and Cambridge Analytica scandal, this introduction sets out the investigation and records its aims.</p>
<p>The scandal involving Cambridge Analytica’s apparent misuse of Facebook data is an especially relevant piece of current affairs for an investigation of contemporary civic issues. It ties into the wider issues surrounding data regulation, the economic foundations of web 2.0 and normative democratic ideals.</p>
<p>I won’t go over all the exact story details here; for a point of reference, <a href="https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election">here</a> is probably a good start, with a whole collection of related articles <a href="https://www.theguardian.com/news/series/cambridge-analytica-files">here</a>.</p>
<h2 id="the-plan">The plan</h2>
<p>Instead of cramming everything into a single post, as has been done previously, this investigation will be split into four parts. This allows a more in-depth, sustained inquiry, with at least one full article given over to reflection.</p>
<p>The sections planned, including this current post, can be outlined and summarised as follows:</p>
<p><strong>Part 1: Introduction</strong></p>
<p>You are here. This will be whatever this page is right now.</p>
<p><strong>Part 2: Twitter reactions to Facebook’s ‘data leak’</strong></p>
<p>Initial reactions made via Twitter will be studied whilst experimenting with word embeddings. This aims to map contextual relationships within the response.</p>
<p><strong>Part 3: Analysis of The Guardian’s content</strong></p>
<p>This will detail investigations into content created by the Guardian, a key player in the dissemination of the Facebook/Cambridge Analytica story. It will use computational methods such as keyword extraction, also making comparisons with the previously collected data from Twitter.</p>
<p><strong>Part 4: Reflections</strong>
Discussions will be made in relation to theories of digital privacy, also reviewing the regulatory conditions for Facebook and the underlying logic of web 2.0. It will finish by assessing some normative implications for democracies and the roles different forms of media play in addressing such issues.</p>
<hr />
<h4 id="read-next-part-2-twitter-reactions-to-facebooks-data-leak">Read Next: <a href="/digitalcitizens/posts/2018/data-politics-and-democracy-part-2">Part 2: Twitter reactions to Facebook’s ‘data leak’</a></h4>
<hr />
<h3 id="additional-resources">Additional resources</h3>
<ul>
<li><a href="https://www.youtube.com/watch?v=X5g6IJm7YJQ">Cambridge Analytica whistleblower Christopher Wylie appears before MPs - watch live - YouTube</a></li>
<li><a href="https://www.youtube.com/watch?v=6ValJMOpt7s">Mark Zuckerberg testifies on Capitol Hill (full Senate hearing) - YouTube</a></li>
<li><a href="https://www.youtube.com/watch?v=N_zlN7BXFm8">Facebook’s F8 developer conference 2018 replay - YouTube</a></li>
</ul>Karl SimsPresenting a series of forthcoming posts related to the recent Facebook and Cambridge Analytica scandal, this introduction sets out the investigation and records its aims.Twitter Spam and Ham2018-04-01T17:00:43+00:002018-04-01T17:00:43+00:00/digitalcitizens/posts/2018/twitter-spam-and-ham<p>Noting the wrangling needed to make Twitter data more usable and some make-do solutions employed. This was supposed to be part of an upcoming post but has been separated to make the rest of the original article flow better.</p>
<p><img src="/digitalcitizens/assets/imgs/twitter-spam-and-ham/tweep.jpg" /></p>
<p>Though Twitter data is generally quite noisy and requires a lot of cleaning for analysis, what was collected for an upcoming post, <a href="#TODO">Part 2: Twitter reactions to Facebook’s ‘data leak’</a>, was initially pretty much unusable. This was because the collection had been heavily targeted by spam, and also because the query terms were not rooted in any particular language. This post is mainly recorded for future reference when working with Twitter data.</p>
<h3 id="spam-tweets-within-the-dataset">Spam tweets within the dataset</h3>
<p><strong>The problem:</strong></p>
<p>Initial analysis revealed that the tweets filtered for had been significantly targeted by spam and automated accounts. Without building a comprehensive spam-filtering model, this was most concretely revealed through examination of hashtag co-occurrences. An illustrative example of this kind of spam could be 1,000 tweets with exactly the hashtags:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#foo #tweet4tweet #bar #MontyPython
#fooBar #flyingCircus #CambridgeAnalytica
</code></pre></div></div>
<p>One set like this contained over 2,000 tweets with the same 6 hashtags and only 100 unique words between them. This was particularly disruptive to any computational analysis using any kind of frequency measure.</p>
<p><strong>Make do solution:</strong></p>
<p>Applying some simple conditional logic, a tweet can be judged as spam if it either has too many hashtags, or has more than a threshold number of hashtags whose exact set appears more than a given limit of times within the tweet collection. This can be expressed in Python as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def is_spam(tags: tuple, ceil=10, floor=4, limit=40):
    # Spam if the tag set is too large, or if it is sizeable and the
    # exact same set recurs too often across the whole collection.
    return len(tags) > ceil or (len(tags) > floor and collection_freq[tags] > limit)
</code></pre></div></div>
<p>The parameters given to the algorithm ended up filtering <code class="highlighter-rouge">7%</code> of tweets. <a href="https://arxiv.org/pdf/1703.03107.pdf">This paper</a>, studying bots on Twitter in more detail, estimates that around <code class="highlighter-rouge">9-15%</code> of active Twitter accounts are bots, so while almost in the same ballpark, the filtering here could have been more aggressive. In total 36,925 tweets were removed, spanning only 27 distinct tag sets regarded as spam.</p>
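<p>For future reference, the pieces above can be sketched end to end. This assumes <code class="highlighter-rouge">collection_freq</code> is simply a frequency count over each tweet’s (sorted) hashtag tuple; the sample tag sets below are made up for illustration.</p>

```python
from collections import Counter

# Hypothetical sample: each tweet reduced to a sorted tuple of its hashtags.
tag_sets = [("cambridgeanalytica", "facebook")] * 3 + \
           [("bar", "cambridgeanalytica", "flyingcircus", "foo",
             "fooBar", "montypython", "tweet4tweet")] * 50

# How often each exact tag set occurs across the whole collection.
collection_freq = Counter(tag_sets)

def is_spam(tags, ceil=10, floor=4, limit=40):
    # Spam if too many hashtags, or a sizeable tag set that repeats too often.
    return len(tags) > ceil or (len(tags) > floor and collection_freq[tags] > limit)

# Keep only the tweets whose tag sets were not flagged.
kept = [tags for tags in tag_sets if not is_spam(tags)]
```

<p>Here the 50 tweets sharing the same 7-tag set are dropped, while the 3 ordinary tweets survive.</p>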
<p>Whether the parameters or method used were truly effective is hard to say, but they did render the dataset a lot more usable and seemingly less noisy. Spam versus not-spam is one of the classic examples given when teaching classification algorithms, and merely using something like a simple logistic regression model, more effective results could be achieved. However, because Twitter’s API terms of service restrict the sharing of datasets, it’s quite hard to find labelled data to train a model on. There may be pre-built solutions, though the problem may be very corpus specific.</p>
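<p>As a rough sketch of what that classifier route might look like, here is a minimal bag-of-words logistic regression trained by gradient descent. The labelled tweets are entirely made up (as noted above, real labelled data is hard to come by), and a library implementation would normally be used instead.</p>

```python
import math

# Hypothetical labelled tweets: 1 = spam, 0 = not spam.
data = [
    ("win free followers tweet4tweet follow back", 1),
    ("free gift card click now follow", 1),
    ("follow back free followers now", 1),
    ("cambridge analytica harvested facebook data", 0),
    ("facebook faces questions over data privacy", 0),
    ("regulators discuss data privacy rules", 0),
]

vocab = sorted({word for text, _ in data for word in text.split()})
idx = {word: i for i, word in enumerate(vocab)}

def featurise(text):
    # Bag-of-words count vector over the training vocabulary.
    x = [0.0] * len(vocab)
    for word in text.split():
        if word in idx:
            x[idx[word]] += 1.0
    return x

# Plain stochastic gradient descent on the logistic loss.
w, b, lr = [0.0] * len(vocab), 0.0, 0.3
for _ in range(300):
    for text, y in data:
        x = featurise(text)
        p = 1.0 / (1.0 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))
        g = p - y  # gradient of the logistic loss w.r.t. the logit
        b -= lr * g
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]

def predict(text):
    z = b + sum(wi * xi for wi, xi in zip(w, featurise(text)))
    return int(1.0 / (1.0 + math.exp(-z)) > 0.5)
```

<p>On this toy data the model separates the two classes easily; in practice the hard part is, as said, obtaining labels rather than fitting the model.</p>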
<h3 id="language-barriers">Language Barriers</h3>
<p><strong>The problem:</strong></p>
<p>Only being fluent in English, and knowing just enough French and Spanish to identify things like stop words (common low-information words), tweets in other languages are not much use to me. Forgetting to collect the language field at collection time caused more work here.</p>
<p><strong>Make do solution:</strong></p>
<p>Depending on how aggressive an approach was needed, several techniques worked reasonably well. An initial step was to remove certain character ranges: pretty much everything outside ASCII was removed with the regular expression <code class="highlighter-rouge">[^\x00-\x7F]+</code>. This meant removing emoji and all sorts of other things, but they were not of interest here.</p>
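<p>That stripping step amounts to a one-line substitution. In this sketch each non-ASCII run is replaced with a space rather than deleted outright (an assumption, so that adjacent words don’t get fused together), followed by a whitespace tidy-up:</p>

```python
import re

# Matches any run of characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(text):
    # Replace each non-ASCII run (emoji, accented letters, other scripts)
    # with a space, then collapse any repeated whitespace.
    return " ".join(NON_ASCII.sub(" ", text).split())
```
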
<p>Another strategy was simply adding more items to the stop word list, including words from different languages. The Python package <a href="https://www.nltk.org/">NLTK</a> provides around 2,400 stop words for 11 languages, so this wasn’t too time-consuming to employ.</p>Karl SimsNoting the wrangling needed to make Twitter data more usable and some make-do solutions employed. This was supposed to be part of an upcoming post but has been separated to make the rest of the original article flow better.Visualising Topics2018-03-27T21:59:16+00:002018-03-27T21:59:16+00:00/digitalcitizens/posts/2018/visualising-topics<p>This post experiments with different ways of visualising topics within a dataset.</p>
<h2 id="introduction">Introduction</h2>
<p>In <em>Theme Detection in Social Media</em>, Daniel Angus (2017) presents tools for visualising textual data with the two software packages <a href="https://info.leximancer.com/">Leximancer</a> and <a href="http://www.discursis.com/">Discursis</a>. Both tools seemed interesting; however, both are closed-source commercial projects, which was a problem. Leximancer was especially interesting, as it uses a probabilistic model for generating networks of ‘concepts’ for analysis. However, as it only provides a 7-day free trial, I was not prepared to commit an extended period of time to understanding the software in depth. Instead it seemed more fruitful to build upon previously used ‘topic extraction’ methods.</p>
<p>Before continuing, the use of the terms <em>‘topics’</em>, <em>‘concepts’</em> and <em>‘themes’</em> should probably be clarified. The literature supporting Latent Dirichlet allocation (LDA) (Blei, D.M., Ng, A.Y. and Jordan, M.I. 2003), and previous analysis within this blog, use the term ‘topic’ to describe a probabilistic distribution across words, with each word in the corpora belonging to each topic set to a varying degree. Throughout this article LDA is used, selecting the top <code class="highlighter-rouge">n</code> scoring words in each topic set to create topic representations. However, Leximancer’s supporting paper describes its seemingly similar representations as ‘concepts’ - though looking at Daniel Angus’s table made with the software (fig 31.1. p.536), I’m not sure how the name ‘Tony Abbot’ is a concept.</p>
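<p>The top-<code class="highlighter-rouge">n</code> selection itself is just an argsort over each topic’s word weights. A minimal sketch, using a mock topic-word matrix (in practice this would come from a fitted LDA model, e.g. scikit-learn’s <code class="highlighter-rouge">LatentDirichletAllocation.components_</code>; the vocabulary and weights here are made up):</p>

```python
import numpy as np

# Mock topic-word weight matrix: rows are topics, columns are vocabulary terms.
vocab = ["citizen", "vote", "gun", "right", "law", "immigr", "daca"]
topic_word = np.array([
    [5.0, 3.0, 0.1, 0.2, 0.1, 4.0, 2.5],
    [6.0, 2.0, 4.5, 3.5, 3.0, 0.1, 0.1],
])

def top_terms(topic_word, vocab, n=3):
    # argsort ascending, keep the last n indices, reverse for descending weight.
    return [[vocab[i] for i in row.argsort()[-n:][::-1]] for row in topic_word]
```

<p>Each returned list is one topic’s representation, ordered from most to least prominent term.</p>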
<p><img src="/digitalcitizens/assets/imgs/visualising-topics/angus.png" /></p>
<p>Both Angus and the Leximancer paper use the term ‘theme’ to describe the annotations that the researcher gives the presented concept sets. So, the ‘theme’ of the example above is government. This seems more fitting than what I described previously as ‘qualitative descriptions’ and will be used in the future.</p>
<p>For final clarification between the terms ‘topics’ and ‘concepts’… they will be used interchangeably because of how Leximancer has described its product, but it’s probably better to think of both as ‘topics’. This clarification has mainly been for myself, as it was getting confusing reading the different uses between papers.</p>
<p>The main body of this article will now consider ways of generating presentable topic visualisations through my own experimentation.</p>
<h3 id="topic-table">Topic Table</h3>
<p>In a previous post, <a href="https://winstonjay.github.io/digitalcitizens/posts/2018/finding-civic-discussions-on-twitter">Finding civic discussions on Twitter - Digital Citizens</a>, generated topics were represented in a table with descriptions of the term groups. This was straight to the point, but it didn’t articulate all the available information efficiently.</p>
<p><img src="/digitalcitizens/assets/imgs/visualising-topics/topics0.png" /></p>
<p>For a start, we could use the weights of each term within the topic to indicate local importance. This brings us on to our first more graphic visualisation.</p>
<h3 id="word-clouds">Word Clouds</h3>
<p>Though they are probably a bit overused and, to be honest, I don’t like them, word clouds do a good job of highlighting specific terms within a body of text. Using the feature weights from the LDA model we can scale the size and adjust the opacity of each topic’s terms to show the most prominent. This was done with a simple Python program that just generates some HTML for display. For something more transportable, generating an SVG would maybe be better and not much harder.</p>
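<p>The HTML generation is little more than string formatting. A sketch of one such function follows; the scaling constants (an 8–32px size range, 0.5–1.0 opacity) are assumptions chosen to resemble the clouds below, not the exact values the original program used.</p>

```python
def topic_to_html(terms):
    # terms: list of (word, weight) pairs with weights normalised to [0, 1].
    spans = []
    for word, weight in terms:
        size = 8 + 24 * weight                    # map weight onto 8-32px
        opacity = min(1.0, 0.5 + 0.5 * weight)    # and onto 0.5-1.0 opacity
        spans.append(
            '<span style="font-size:{:.1f}px; opacity:{:.2f}">{}</span>'
            .format(size, opacity, word))
    return "<p>\n" + "\n".join(spans) + "\n</p>"
```

<p>Calling it once per topic and concatenating the results yields markup of the same shape as the clouds shown here.</p>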
<div class="topics">
<p>
<span style="font-size:12.8px; opacity:0.6989157915826636">want</span>
<span style="font-size:14.8px; opacity:0.7845941476540668">legal</span>
<span style="font-size:10.4px; opacity:0.6006399574753432">card</span>
<span style="font-size:12.1px; opacity:0.6705163172527103">million</span>
<span style="font-size:10.5px; opacity:0.605197795958624">russian</span>
<span style="font-size:11.2px; opacity:0.6340726192400996">obama</span>
<span style="font-size:17.1px; opacity:0.879902191607669">vote</span>
<span style="font-size:18.6px; opacity:0.9434409614054629">immigr</span>
<span style="font-size:12.1px; opacity:0.6702907403521112">deport</span>
<span style="font-size:10.2px; opacity:0.5918130898153987">resid</span>
<span style="font-size:10.8px; opacity:0.6176086017222846">question</span>
<span style="font-size:13.0px; opacity:0.709607902893066">citizen</span>
<span style="font-size:13.8px; opacity:0.740644194118733">path</span>
<span style="font-size:19.5px; opacity:0.9796645503441791">illeg</span>
<span style="font-size:10.7px; opacity:0.6115169199001079">appli</span>
<span style="font-size:12.0px; opacity:0.6683611318636999">dreamer</span>
<span style="font-size:12.7px; opacity:0.6954606675409953">ask</span>
<span style="font-size:27.4px; opacity:1">citizenship</span>
<span style="font-size:10.1px; opacity:0.586188464000722">provid</span>
<span style="font-size:11.1px; opacity:0.6297640970635148">presid</span>
<span style="font-size:12.3px; opacity:0.6786408853090833">democrat</span>
<span style="font-size:16.0px; opacity:0.8344898409523911">trump</span>
<span style="font-size:10.9px; opacity:0.6223836545013627">tax</span>
<span style="font-size:14.6px; opacity:0.7737654338936838">american</span>
<span style="font-size:14.3px; opacity:0.7623223771372962">daca</span>
</p>
<p>
<span style="font-size:14.2px; opacity:0.7589629017380537">privat</span>
<span style="font-size:11.2px; opacity:0.6319205800646737">unit</span>
<span style="font-size:12.8px; opacity:0.6984587489565326">presid</span>
<span style="font-size:8.6px; opacity:0.5263714657278553">order</span>
<span style="font-size:9.0px; opacity:0.5419088646816945">district</span>
<span style="font-size:8.7px; opacity:0.5291984613427372">associ</span>
<span style="font-size:8.9px; opacity:0.5378016079352352">press</span>
<span style="font-size:10.3px; opacity:0.5946661300805604">target</span>
<span style="font-size:12.2px; opacity:0.6737478932078246">major</span>
<span style="font-size:8.0px; opacity:0.5">polici</span>
<span style="font-size:9.2px; opacity:0.5480288168070815">pull</span>
<span style="font-size:14.7px; opacity:0.7795117564859888">today</span>
<span style="font-size:9.5px; opacity:0.5635463881694369">polic</span>
<span style="font-size:12.4px; opacity:0.6844247371354533">attorney</span>
<span style="font-size:14.4px; opacity:0.7650056026047374">new</span>
<span style="font-size:13.2px; opacity:0.7146095426142192">report</span>
<span style="font-size:9.9px; opacity:0.5784586290179466">repres</span>
<span style="font-size:13.6px; opacity:0.7353676304824931">general</span>
<span style="font-size:13.7px; opacity:0.7380326496673724">offic</span>
<span style="font-size:15.6px; opacity:0.8170507135710767">pleas</span>
<span style="font-size:18.9px; opacity:0.9540452656858309">state</span>
<span style="font-size:9.0px; opacity:0.5426181323933845">join</span>
<span style="font-size:12.4px; opacity:0.6835757083002943">lawsuit</span>
<span style="font-size:14.7px; opacity:0.7795329904110836">trump</span>
<span style="font-size:23.0px; opacity:1">citizen</span>
</p>
<p>
<span style="font-size:15.6px; opacity:0.8159796525700841">amend</span>
<span style="font-size:18.0px; opacity:0.9181275058641637">american</span>
<span style="font-size:17.9px; opacity:0.9131965637256256">arm</span>
<span style="font-size:14.7px; opacity:0.778624034217763">militari</span>
<span style="font-size:20.6px; opacity:1">need</span>
<span style="font-size:21.2px; opacity:1">everi</span>
<span style="font-size:15.3px; opacity:0.8030989601617486">respons</span>
<span style="font-size:19.2px; opacity:0.9660185842038498">weapon</span>
<span style="font-size:14.4px; opacity:0.7678360942488228">pay</span>
<span style="font-size:16.0px; opacity:0.8322542159957396">state</span>
<span style="font-size:14.3px; opacity:0.7625383382582072">assault</span>
<span style="font-size:15.7px; opacity:0.8194277479159369">kill</span>
<span style="font-size:16.2px; opacity:0.843014792021333">elect</span>
<span style="font-size:15.7px; opacity:0.8208733348945001">want</span>
<span style="font-size:18.5px; opacity:0.9357030320397816">protect</span>
<span style="font-size:15.8px; opacity:0.8256631009543693">constitut</span>
<span style="font-size:17.3px; opacity:0.8887011432879071">govern</span>
<span style="font-size:14.8px; opacity:0.7839367739795386">democraci</span>
<span style="font-size:16.4px; opacity:0.850280338420814">nra</span>
<span style="font-size:19.5px; opacity:0.9807894790617788">vote</span>
<span style="font-size:19.4px; opacity:0.9731766768911939">peopl</span>
<span style="font-size:32.0px; opacity:1">citizen</span>
<span style="font-size:16.1px; opacity:0.8389928389615045">use</span>
<span style="font-size:25.5px; opacity:1">gun</span>
<span style="font-size:24.9px; opacity:1">right</span>
</p>
<p>
<span style="font-size:12.5px; opacity:0.6891106657334628">wrong</span>
<span style="font-size:15.5px; opacity:0.8131877116063265">live</span>
<span style="font-size:13.5px; opacity:0.7309697508942592">year</span>
<span style="font-size:13.5px; opacity:0.7275532469582027">world</span>
<span style="font-size:22.9px; opacity:1">citizenship</span>
<span style="font-size:11.8px; opacity:0.6590709283374999">born</span>
<span style="font-size:12.1px; opacity:0.6701786860608457">american</span>
<span style="font-size:12.5px; opacity:0.6890431919784514">peopl</span>
<span style="font-size:14.6px; opacity:0.7737449032703531">nation</span>
<span style="font-size:11.7px; opacity:0.6541161386924143">class</span>
<span style="font-size:15.8px; opacity:0.8240753961725249">like</span>
<span style="font-size:13.1px; opacity:0.7106928350890896">home</span>
<span style="font-size:12.3px; opacity:0.6806412650223472">right</span>
<span style="font-size:13.4px; opacity:0.7237478607594958">uk</span>
<span style="font-size:15.9px; opacity:0.8272329098283686">eu</span>
<span style="font-size:11.8px; opacity:0.6574449832427551">great</span>
<span style="font-size:11.7px; opacity:0.6531275630706719">india</span>
<span style="font-size:19.3px; opacity:0.9709444447257627">countri</span>
<span style="font-size:15.7px; opacity:0.8223020046012834">work</span>
<span style="font-size:28.7px; opacity:1">citizen</span>
<span style="font-size:12.8px; opacity:0.7020638309814173">make</span>
<span style="font-size:11.6px; opacity:0.6514125523627093">know</span>
<span style="font-size:12.2px; opacity:0.6761524513240597">british</span>
<span style="font-size:11.8px; opacity:0.6597917797095625">help</span>
<span style="font-size:12.1px; opacity:0.6727695845936965">dual</span>
</p>
<p>
<span style="font-size:29.7px; opacity:1">citizen</span>
<span style="font-size:20.5px; opacity:1">abid</span>
<span style="font-size:14.5px; opacity:0.7701850241780777">good</span>
<span style="font-size:16.0px; opacity:0.8316605188115855">american</span>
<span style="font-size:11.7px; opacity:0.6535585731441063">time</span>
<span style="font-size:13.5px; opacity:0.7305268089925516">countri</span>
<span style="font-size:16.1px; opacity:0.8385672622419913">peopl</span>
<span style="font-size:14.3px; opacity:0.7613909364010201">senior</span>
<span style="font-size:13.5px; opacity:0.7273005886820005">tri</span>
<span style="font-size:14.4px; opacity:0.7681973203648307">want</span>
<span style="font-size:11.8px; opacity:0.6594057177579501">come</span>
<span style="font-size:26.3px; opacity:1">law</span>
<span style="font-size:12.3px; opacity:0.6808409963545365">yes</span>
<span style="font-size:13.0px; opacity:0.7097775148895489">need</span>
<span style="font-size:12.1px; opacity:0.6722177673111994">everi</span>
<span style="font-size:14.4px; opacity:0.765452086471164">make</span>
<span style="font-size:14.9px; opacity:0.7883644282798128">say</span>
<span style="font-size:16.2px; opacity:0.8427772042655686">know</span>
<span style="font-size:15.2px; opacity:0.7996592433665178">gun</span>
<span style="font-size:14.9px; opacity:0.7857205978130463">think</span>
<span style="font-size:14.0px; opacity:0.7520767399907083">stop</span>
<span style="font-size:13.6px; opacity:0.7332726073184557">right</span>
<span style="font-size:12.3px; opacity:0.6804387513574146">becom</span>
<span style="font-size:14.4px; opacity:0.7681012738368265">crimin</span>
<span style="font-size:18.5px; opacity:0.9376426614101575">like</span>
</p>
</div>
<p>Each topic set is bounded by its own box. I haven’t added annotations to describe the topics but this could also easily be done. Straight away you can see not only the most prominent terms but also more clearly the ones that occur across different topics.</p>
<h3 id="gephi-graphs">Gephi Graphs</h3>
<p>To represent these connections between the topics further, we can construct a network of topics linked by their shared terms. Topic roots have been labeled A through E respectively. I don’t think I’ve done that great a job, but here is a visualisation of this in action.</p>
<p><img src="/digitalcitizens/assets/imgs/visualising-topics/gelphi.jpg" /></p>
<p>Using in-degree to increase the scale of nodes and edges differentiates the more common terms across topics. However, beyond the small directional arrows this is not fully communicating the fact that the terms belong to topics.</p>
<p>A property of the network that has been constructed is that (if my graph theory knowledge isn’t failing me) it is a bipartite graph. This means its nodes can be split into two distinct sets: topics can only connect to terms and vice versa. This is represented below in the disjoint red and green nodes.</p>
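<p>Bipartiteness is easy to check programmatically: a graph is bipartite if and only if it can be two-coloured so that no edge joins two nodes of the same colour. A small self-contained sketch (the toy topic-term edges are illustrative, not the real network):</p>

```python
from collections import deque

def is_bipartite(adj):
    # Two-colour the graph with BFS; fail if any edge joins same-coloured nodes.
    colour = {}
    for start in adj:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return False
    return True

# Toy topic-term network: topic roots connect only to terms, never to each other.
edges = [("A", "citizen"), ("A", "vote"), ("B", "citizen"), ("B", "gun")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
```

<p>The two colour classes recovered by the traversal correspond exactly to the topic and term node sets.</p>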
<center>
<img src="/digitalcitizens/assets/imgs/visualising-topics/bipartide.png" />
</center>
<p>It may also be useful to represent the different types of nodes more effectively within the final network visualisation.</p>
<p>Though Gephi makes a wide variety of graph algorithms and visualisations available at the touch of a button, I’m not aware of any way to share these interactively. A big problem with complex graphs is that they’re often quite messy and things overlap, so the ability to adjust and move things about is often helpful.</p>
<h3 id="interactive-d3js-graphs">Interactive D3.js graphs</h3>
<p>Because of advances in web technology it is now easier than ever to make all sorts of dynamic content for websites. <a href="https://d3js.org/">D3.js</a> is quite a popular JavaScript framework designed for manipulating HTML documents based on data. Though it does not provide tools to do everything for you, it focuses on providing a strong foundation for creating your own ‘data driven documents’. After exporting data created in a previous Python program into JSON format, we have an easy and relatively fast way of creating visual representations in the browser.</p>
<div class="big-img graph-box" id="graph1"></div>
<script src="https://d3js.org/d3.v3.min.js"></script>
<script>
var json = {"links":[{"group":0,"source":0,"target":1,"weight":0.8097809886258567},{"group":0,"source":0,"target":2,"weight":0.4796645503441791},{"group":0,"source":0,"target":3,"weight":0.44344096140546285},{"group":0,"source":0,"target":4,"weight":0.3799021916076691},{"group":0,"source":0,"target":5,"weight":0.33448984095239104},{"group":0,"source":0,"target":6,"weight":0.2845941476540668},{"group":0,"source":0,"target":7,"weight":0.2737654338936838},{"group":0,"source":0,"target":8,"weight":0.26232237713729617},{"group":0,"source":0,"target":9,"weight":0.24064419411873295},{"group":0,"source":0,"target":10,"weight":0.20960790289306602},{"group":0,"source":0,"target":11,"weight":0.1989157915826636},{"group":0,"source":0,"target":12,"weight":0.1954606675409953},{"group":0,"source":0,"target":13,"weight":0.17864088530908329},{"group":0,"source":0,"target":14,"weight":0.17051631725271035},{"group":0,"source":0,"target":15,"weight":0.17029074035211123},{"group":0,"source":0,"target":16,"weight":0.16836113186369991},{"group":0,"source":0,"target":17,"weight":0.13407261924009956},{"group":0,"source":0,"target":18,"weight":0.12976409706351483},{"group":0,"source":0,"target":19,"weight":0.12238365450136267},{"group":0,"source":0,"target":20,"weight":0.1176086017222846},{"group":0,"source":0,"target":21,"weight":0.11151691990010791},{"group":0,"source":0,"target":22,"weight":0.10519779595862402},{"group":0,"source":0,"target":23,"weight":0.1006399574753431},{"group":0,"source":0,"target":24,"weight":0.09181308981539874},{"group":0,"source":0,"target":25,"weight":0.08618846400072204},{"group":1,"source":26,"target":10,"weight":0.6242404910757975},{"group":1,"source":26,"target":27,"weight":0.454045265685831},{"group":1,"source":26,"target":28,"weight":0.3170507135710768},{"group":1,"source":26,"target":5,"weight":0.2795329904110836},{"group":1,"source":26,"target":29,"weight":0.2795117564859888},{"group":1,"source":26,"target":30,"weight":0.2650056026047373},{"group":1,"sou
rce":26,"target":31,"weight":0.2589629017380537},{"group":1,"source":26,"target":32,"weight":0.2380326496673724},{"group":1,"source":26,"target":33,"weight":0.2353676304824931},{"group":1,"source":26,"target":34,"weight":0.2146095426142191},{"group":1,"source":26,"target":18,"weight":0.19845874895653254},{"group":1,"source":26,"target":35,"weight":0.18442473713545324},{"group":1,"source":26,"target":36,"weight":0.18357570830029438},{"group":1,"source":26,"target":37,"weight":0.17374789320782455},{"group":1,"source":26,"target":38,"weight":0.13192058006467372},{"group":1,"source":26,"target":39,"weight":0.09466613008056034},{"group":1,"source":26,"target":40,"weight":0.07845862901794665},{"group":1,"source":26,"target":41,"weight":0.06354638816943693},{"group":1,"source":26,"target":42,"weight":0.048028816807081494},{"group":1,"source":26,"target":43,"weight":0.04261813239338457},{"group":1,"source":26,"target":44,"weight":0.04190886468169454},{"group":1,"source":26,"target":45,"weight":0.03780160793523521},{"group":1,"source":26,"target":46,"weight":0.029198461342737236},{"group":1,"source":26,"target":47,"weight":0.02637146572785526},{"group":1,"source":26,"target":48,"weight":0.0},{"group":2,"source":49,"target":10,"weight":1.0},{"group":2,"source":49,"target":50,"weight":0.729330256231359},{"group":2,"source":49,"target":51,"weight":0.7041287140725272},{"group":2,"source":49,"target":52,"weight":0.549031934191717},{"group":2,"source":49,"target":53,"weight":0.5270004038233025},{"group":2,"source":49,"target":4,"weight":0.4807894790617789},{"group":2,"source":49,"target":54,"weight":0.4731766768911939},{"group":2,"source":49,"target":55,"weight":0.46601858420384973},{"group":2,"source":49,"target":56,"weight":0.43570303203978156},{"group":2,"source":49,"target":7,"weight":0.4181275058641637},{"group":2,"source":49,"target":57,"weight":0.41319656372562563},{"group":2,"source":49,"target":58,"weight":0.388701143287907},{"group":2,"source":49,"target":59,"weight":0.3
5028033842081396},{"group":2,"source":49,"target":60,"weight":0.343014792021333},{"group":2,"source":49,"target":61,"weight":0.3389928389615045},{"group":2,"source":49,"target":27,"weight":0.33225421599573957},{"group":2,"source":49,"target":62,"weight":0.32566310095436934},{"group":2,"source":49,"target":11,"weight":0.32087333489450004},{"group":2,"source":49,"target":63,"weight":0.31942774791593687},{"group":2,"source":49,"target":64,"weight":0.3159796525700841},{"group":2,"source":49,"target":65,"weight":0.3030989601617486},{"group":2,"source":49,"target":66,"weight":0.2839367739795387},{"group":2,"source":49,"target":67,"weight":0.278624034217763},{"group":2,"source":49,"target":68,"weight":0.2678360942488228},{"group":2,"source":49,"target":69,"weight":0.26253833825820716},{"group":3,"source":70,"target":10,"weight":0.8613922505351445},{"group":3,"source":70,"target":1,"weight":0.6227735677869802},{"group":3,"source":70,"target":71,"weight":0.47094444472576275},{"group":3,"source":70,"target":72,"weight":0.32723290982836867},{"group":3,"source":70,"target":73,"weight":0.3240753961725249},{"group":3,"source":70,"target":74,"weight":0.3223020046012835},{"group":3,"source":70,"target":75,"weight":0.3131877116063265},{"group":3,"source":70,"target":76,"weight":0.27374490327035306},{"group":3,"source":70,"target":77,"weight":0.23096975089425925},{"group":3,"source":70,"target":78,"weight":0.22755324695820264},{"group":3,"source":70,"target":79,"weight":0.22374786075949585},{"group":3,"source":70,"target":80,"weight":0.21069283508908968},{"group":3,"source":70,"target":81,"weight":0.2020638309814174},{"group":3,"source":70,"target":82,"weight":0.18911066573346283},{"group":3,"source":70,"target":54,"weight":0.18904319197845149},{"group":3,"source":70,"target":51,"weight":0.18064126502234726},{"group":3,"source":70,"target":83,"weight":0.17615245132405963},{"group":3,"source":70,"target":84,"weight":0.17276958459369654},{"group":3,"source":70,"target":7,"weight":0.170
1786860608457},{"group":3,"source":70,"target":85,"weight":0.15979177970956246},{"group":3,"source":70,"target":86,"weight":0.15907092833749994},{"group":3,"source":70,"target":87,"weight":0.15744498324275513},{"group":3,"source":70,"target":88,"weight":0.15411613869241422},{"group":3,"source":70,"target":89,"weight":0.15312756307067182},{"group":3,"source":70,"target":90,"weight":0.1514125523627093},{"group":4,"source":91,"target":10,"weight":0.9035680967930848},{"group":4,"source":91,"target":92,"weight":0.7609625143679942},{"group":4,"source":91,"target":93,"weight":0.5201978300890095},{"group":4,"source":91,"target":73,"weight":0.4376426614101575},{"group":4,"source":91,"target":90,"weight":0.3427772042655685},{"group":4,"source":91,"target":54,"weight":0.33856726224199124},{"group":4,"source":91,"target":7,"weight":0.3316605188115856},{"group":4,"source":91,"target":50,"weight":0.29965924336651784},{"group":4,"source":91,"target":94,"weight":0.28836442827981285},{"group":4,"source":91,"target":95,"weight":0.28572059781304626},{"group":4,"source":91,"target":96,"weight":0.27018502417807766},{"group":4,"source":91,"target":11,"weight":0.2681973203648306},{"group":4,"source":91,"target":97,"weight":0.26810127383682647},{"group":4,"source":91,"target":81,"weight":0.26545208647116403},{"group":4,"source":91,"target":98,"weight":0.26139093640102},{"group":4,"source":91,"target":99,"weight":0.25207673999070834},{"group":4,"source":91,"target":51,"weight":0.23327260731845567},{"group":4,"source":91,"target":71,"weight":0.23052680899255162},{"group":4,"source":91,"target":100,"weight":0.22730058868200048},{"group":4,"source":91,"target":53,"weight":0.20977751488954885},{"group":4,"source":91,"target":101,"weight":0.18084099635453657},{"group":4,"source":91,"target":102,"weight":0.1804387513574146},{"group":4,"source":91,"target":52,"weight":0.17221776731119942},{"group":4,"source":91,"target":103,"weight":0.1594057177579501},{"group":4,"source":91,"target":104,"weight":
0.15355857314410631}],"nodes":[{"name":"A","root":true,"weight":0.0},{"name":"citizenship","root":false,"weight":0.8879713373824474},{"name":"illeg","root":false,"weight":0.7442065079898236},{"name":"immigr","root":false,"weight":0.7326398876832487},{"name":"vote","root":false,"weight":0.793293448002927},{"name":"trump","root":false,"weight":0.7529209189860497},{"name":"legal","root":false,"weight":0.6819182216252548},{"name":"american","root":false,"weight":0.8176092676471843},{"name":"daca","root":false,"weight":0.6748065819266462},{"name":"path","root":false,"weight":0.667884481735625},{"name":"citizen","root":false,"weight":1.0},{"name":"want","root":false,"weight":0.7768118866947804},{"name":"ask","root":false,"weight":0.6534568473860606},{"name":"democrat","root":false,"weight":0.6480860920187518},{"name":"million","root":false,"weight":0.6454918213620816},{"name":"deport","root":false,"weight":0.6454197919901832},{"name":"dreamer","root":false,"weight":0.6448036451814404},{"name":"obama","root":false,"weight":0.6338549176253343},{"name":"presid","root":false,"weight":0.7075269421636929},{"name":"tax","root":false,"weight":0.6301224929575675},{"name":"question","root":false,"weight":0.6285977621382969},{"name":"appli","root":false,"weight":0.6266526160958992},{"name":"russian","root":false,"weight":0.6246348450952127},{"name":"card","root":false,"weight":0.6231794733692361},{"name":"resid","root":false,"weight":0.6203609501441716},{"name":"provid","root":false,"weight":0.6185649406080931},{"name":"B","root":true,"weight":0.0},{"name":"state","root":false,"weight":0.7820565431873914},{"name":"pleas","root":false,"weight":0.6922819869344433},{"name":"today","root":false,"weight":0.680295354023977},{"name":"new","root":false,"weight":0.6756633675486157},{"name":"privat","root":false,"weight":0.6737338617045263},{"name":"offic","root":false,"weight":0.6670505847097469},{"name":"general","root":false,"weight":0.666199612569096},{"name":"report","root":false,"weight
":0.6595713096400557},{"name":"attorney","root":false,"weight":0.6499329443032753},{"name":"lawsuit","root":false,"weight":0.6496618393561032},{"name":"major","root":false,"weight":0.6465237017792291},{"name":"unit","root":false,"weight":0.6331677460678959},{"name":"target","root":false,"weight":0.6212719596391811},{"name":"repres","root":false,"weight":0.6160967128896992},{"name":"polic","root":false,"weight":0.6113350580435514},{"name":"pull","root":false,"weight":0.6063801140060568},{"name":"join","root":false,"weight":0.6046524184892539},{"name":"district","root":false,"weight":0.6044259409215366},{"name":"press","root":false,"weight":0.6031144452724365},{"name":"associ","root":false,"weight":0.6003673588297989},{"name":"order","root":false,"weight":0.599464665699612},{"name":"polici","root":false,"weight":0.5910439448395668},{"name":"C","root":true,"weight":0.0},{"name":"gun","root":false,"weight":0.842367427907458},{"name":"right","root":false,"weight":0.8436320394462936},{"name":"everi","root":false,"weight":0.7881178520800938},{"name":"need","root":false,"weight":0.7854537130795702},{"name":"peopl","root":false,"weight":0.8053841665884455},{"name":"weapon","root":false,"weight":0.7398491896322031},{"name":"protect","root":false,"weight":0.7301690753326737},{"name":"arm","root":false,"weight":0.7229824938519646},{"name":"govern","root":false,"weight":0.7151608164380466},{"name":"nra","root":false,"weight":0.7028925991025831},{"name":"elect","root":false,"weight":0.7005726242192335},{"name":"use","root":false,"weight":0.699288367046025},{"name":"constitut","root":false,"weight":0.6950320240708577},{"name":"kill","root":false,"weight":0.69304100211014},{"name":"amend","root":false,"weight":0.6919399844911012},{"name":"respons","root":false,"weight":0.6878270270857746},{"name":"democraci","root":false,"weight":0.6817083144372711},{"name":"militari","root":false,"weight":0.6800118938060612},{"name":"pay","root":false,"weight":0.6765671770022691},{"name":"assault"
,"root":false,"weight":0.6748755408663798},{"name":"D","root":true,"weight":0.0},{"name":"countri","root":false,"weight":0.7742877722605799},{"name":"eu","root":false,"weight":0.6955332826079442},{"name":"like","root":false,"weight":0.7778661103579779},{"name":"work","root":false,"weight":0.6939587862590073},{"name":"live","root":false,"weight":0.6910484847255536},{"name":"nation","root":false,"weight":0.6784539295968762},{"name":"year","root":false,"weight":0.6647953176098859},{"name":"world","root":false,"weight":0.663704387501576},{"name":"uk","root":false,"weight":0.6624892826881136},{"name":"home","root":false,"weight":0.6583206586106471},{"name":"make","root":false,"weight":0.7296687116203519},{"name":"wrong","root":false,"weight":0.6514292167176017},{"name":"british","root":false,"weight":0.6472915056206054},{"name":"dual","root":false,"weight":0.646211316269847},{"name":"help","root":false,"weight":0.642067349702604},{"name":"born","root":false,"weight":0.6418371733351484},{"name":"great","root":false,"weight":0.6413179898389446},{"name":"class","root":false,"weight":0.6402550504091011},{"name":"india","root":false,"weight":0.6399393865227763},{"name":"know","root":false,"weight":0.7383774901647662},{"name":"E","root":true,"weight":0.0},{"name":"law","root":false,"weight":0.8340282743636811},{"name":"abid","root":false,"weight":0.75714926353461},{"name":"say","root":false,"weight":0.6831221168141335},{"name":"think","root":false,"weight":0.682277910481741},{"name":"good","root":false,"weight":0.6773172181057171},{"name":"crimin","root":false,"weight":0.6766518519805734},{"name":"senior","root":false,"weight":0.6745091618848935},{"name":"stop","root":false,"weight":0.6715350288269791},{"name":"tri","root":false,"weight":0.6636237107261994},{"name":"yes","root":false,"weight":0.648788613492759},{"name":"becom","root":false,"weight":0.6486601719084698},{"name":"come","root":false,"weight":0.6419440755561171},{"name":"time","root":false,"weight":0.64007701313609
06}]};
</script>
<script src="/digitalcitizens/assets/js/topicgraph0.js"></script>
<p>Arguably we can now see example information more clearly, such as that topics C and E are related through terms linked to gun control, and that topic B is the least connected to the rest of the topics. Differentiating between node types, as discussed previously, has helped with this. The obvious downside to this approach is that you must be able to code the visualisations yourself.</p>
<p>As it stands, you can merely drag nodes around a bit; it might be better to add other forms of user interactivity. Future time could be invested in creating something like this: <a href="http://bl.ocks.org/NPashaP/cd80ab54c52f80c4d84cad0ba9da72c2">bP Example - Double Vertical bP with labels - bl.ocks.org</a> as an improvement to the graphs constructed here. Combinations of tools could also be used together; for example, if specific graph algorithms are difficult to implement or find, Gephi could be used to generate them.</p>
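<p>As a minimal sketch of the kind of summary that motivates observations like “topic B is the least connected”, the connectivity of each topic group can be computed directly from the links data. The records below imitate the shape of the graph data embedded above (<code class="highlighter-rouge">group</code>, <code class="highlighter-rouge">source</code>, <code class="highlighter-rouge">target</code>, <code class="highlighter-rouge">weight</code>); the sample values are illustrative, not taken from the real dataset, and summing weights is only one crude choice of connectedness score.</p>

```python
from collections import defaultdict

# Illustrative sample in the same shape as the embedded graph data:
# each link joins a topic group's root node to a term node with a weight.
links = [
    {"group": 2, "source": 26, "target": 27, "weight": 0.78},
    {"group": 2, "source": 26, "target": 28, "weight": 0.69},
    {"group": 3, "source": 49, "target": 50, "weight": 0.84},
]

def connectivity_by_group(links):
    """Sum link weights per topic group as a crude connectedness score."""
    totals = defaultdict(float)
    for link in links:
        totals[link["group"]] += link["weight"]
    return dict(totals)

print(connectivity_by_group(links))
```

<p>A lower total for a group suggests a topic that is weakly tied into the rest of the graph; counting only links whose targets are shared with other groups would give a stricter inter-topic measure.</p>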
<h3 id="considerations">Considerations</h3>
<p><strong>TODO</strong></p>
<h2 id="conclusion">Conclusion</h2>
<p>This has been a start to thinking about representing topics/themes within text visually. It is by no means comprehensive, and the methods shown here can be improved on. These approaches would be beneficial in communicating data in future projects; how this will be done obviously depends both on the nature of the data analysed and on the idea being communicated.</p>
<h3 id="references">References</h3>
<ul>
<li>Angus, D. 2017. Theme Detection in Social Media. In: Sloan, L. and Quan-Haase, A. ed. <em>The SAGE Handbook of Social Media Research Methods</em>. London U.K: SAGE Publications. pp.530-544.</li>
<li>Smith, A. E. and Humphreys, M. S. 2006. Evaluation of Unsupervised Semantic Mapping of Natural Language with Leximancer Concept Mapping. <em>Behaviour Research Methods</em>, <strong>38</strong>(2), pp.262-279. <a href="https://info.leximancer.com/science/">Science — Leximancer</a></li>
<li>Blei, D. M., Ng, A. Y. and Jordan, M. I. 2003. Latent Dirichlet Allocation. <em>Journal of Machine Learning Research</em>, <strong>3</strong>, pp.993-1022. Available from: <a href="http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf</a></li>
</ul>
<h3 id="further-resources">Further Resources</h3>
<ul>
<li><a href="https://www.youtube.com/watch?v=fThhbt23SGM&t=1847s">Mike Bostock - Design is a Search Problem - YouTube</a></li>
<li><a href="https://gephi.org/">Gephi - The Open Graph Viz Platform</a></li>
<li><a href="https://d3js.org/">D3.js - Data-Driven Documents</a></li>
</ul>Karl SimsThis post experiments with different ways of visualising topics within a dataset.Notes on Digital Methods2018-03-24T16:19:27+00:002018-03-24T16:19:27+00:00/digitalcitizens/posts/2018/notes-on-digital-methods<p>The following article notes some considerations of social research within digital environments.</p>
<p>Online mediums have brought with them different possibilities for social research, particularly new opportunities for quantification. Driven by changes in the scale and topology of network interactions, the tools available and the nature of research possible, research landscapes are changing. The social sciences have seen a push to become more empirical through these approaches, whilst the computational sciences have seen a softening of theirs and a more natural embrace of imprecision, particularly in systems for social settings. What follows are predominantly unstructured notes on ideas; the methods page in the tool bar serves as a repository for some specific approaches of interest.</p>
<hr />
<h4 id="methods-page-repository-of-some-collected-methods">Methods page: <a href="/digitalcitizens/methods/">Repository of some collected methods</a></h4>
<hr />
<p><strong>Methods for conducting social research online must evolve their sensibilities to match the technologies they are inquiring into.</strong> Richard Rogers, director of the Digital Methods Initiative, notes the turn in research seen in the transition from web 1.0 to web 2.0. Whereas research in the web 1.0 era mostly involved scrapers and link analysis, web 2.0 research has produced predominantly API-based research centred around the dominant platforms (2018, pp.93-94). This periodised trend is reflective of wider user migrations from an open information network to more centralised social networks.</p>
<p><strong>The affordances of social media sites each configure users’ capacities for action differently</strong> (Bucher, T. and Helmond, A. 2018). Likewise: <em>‘Platforms don’t just mediate public discourse, they constitute it’</em> (Gillespie, T. 2018).</p>
<p><strong>Social interaction online takes place principally in automated environments where human and non-human agency is in an active state of interplay.</strong> The dynamics and capacity for non-human influence vary from site to site. Facebook, for instance, has markedly more algorithmic curation when compared to Twitter. Twitter, though providing more control over content curation, makes it easier to create bots; one recent study estimated that somewhere between 9% and 15% of active Twitter accounts may be automated.</p>
<h2 id="sampling-methods">Sampling methods</h2>
<p><strong>Random:</strong> Select x% of the total population at random.</p>
<p><strong>Snowball:</strong> Iteratively build a sample by following connections out from an initial set. Network/graph based.</p>
<p><strong>Topic-based:</strong> Filter for specific conditions (Keywords, users, hashtags).</p>
<p><strong>Marker-based:</strong> Filter for specific meta-data such as location, language.</p>
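<p>The first two sampling strategies can be sketched in a few lines. This is a toy illustration over a hypothetical follower graph, not code from any of the studies cited here; real platform sampling would work against an API rather than an in-memory dictionary.</p>

```python
import random

# Toy follower graph: user -> set of connections (hypothetical data).
graph = {
    "a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"},
    "d": {"b", "e"}, "e": {"d"}, "f": set(),
}

def random_sample(users, fraction, seed=0):
    """Random: select a fixed fraction of the total population."""
    rng = random.Random(seed)  # seeded for reproducibility
    k = max(1, int(len(users) * fraction))
    return set(rng.sample(sorted(users), k))

def snowball_sample(graph, seeds, depth):
    """Snowball: iteratively add the connections of the current sample."""
    sample = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        frontier = {n for u in frontier for n in graph.get(u, set())} - sample
        sample |= frontier
    return sample

print(snowball_sample(graph, {"a"}, 2))  # "a", its neighbours, then theirs
```

<p>Note how the two produce structurally different samples: the random sample can include isolated users like <code class="highlighter-rouge">f</code>, while the snowball sample can only ever reach users connected to the seeds.</p>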
<h2 id="references">References</h2>
<ul>
<li>Gerlitz, C. and Rieder, B. 2013. Mining One Percent of Twitter: Collections, Baselines, Sampling. M/C Journal. [Online]. 16(2). [Accessed 17 March 2018]. Available from: http://journal.media-culture.org.au/index.php/mcjournal/article/view/620</li>
<li>Rogers, R. 2018. Digital methods for cross-platform analysis. In: Burgess, J., Marwick, A. and Poell, T. ed. The SAGE Handbook of Social Media. London: SAGE Publications. pp.91-110.</li>
<li>Bucher, T. and. Helmond, A. 2018. The Affordances of Social Media Platforms. In: Burgess, J., Marwick, A. and Poell, T. ed. The SAGE Handbook of Social Media. London: SAGE Publications. pp.233-253.</li>
<li>Boyd, D. and Crawford, K. 2012. Critical questions for big data. Information, Communication & Society. [Online]. 15(5). pp.662-679. [Accessed 10 April, 2018]. Available from: https://doi.org/10.1080/1369118X.2012.678878</li>
<li>Neff, G. and Nagy, P. 2016. Automation, Algorithms, and Politics: Symbiotic Agency and the Case of Tay. International Journal of Communication. [Online]. 10(1), 4915–4931. [Accessed 10 November 2016] Available from: http://ijoc.org/index.php/ijoc/article/view/6277</li>
</ul>Karl SimsThe following article notes some considerations of social research within digital environments.Mark Zuckerberg personality insights2018-03-21T19:47:54+00:002018-03-21T19:47:54+00:00/digitalcitizens/posts/2018/mark-zuckerberg-personality-insights<p>Considering the attention drawn to data privacy after the recent Facebook and Cambridge Analytica fiasco, it seems relevant to explore some available tools for gathering insights on online publics. This article will experiment with IBM’s off-the-shelf personality insights tool to illustrate the kinds of features that can be constructed from user data. It will use Mark Zuckerberg’s response to Cambridge Analytica’s apparent misuse of Facebook data as a sample source.</p>
<h2 id="introduction">Introduction</h2>
<p>The combination of modern psychology, big data and deep learning has opened possibilities for advertisers, political campaigns and others to personally target individuals on a massive scale. Research such as that done by Michal Kosinski and others has continued to show a variety of ways social media data can be used to predict personal attributes such as sexual orientation (Wang, Y. and Kosinski, M. 2018), age, gender and personality (<a href="http://www.michalkosinski.com/home/publications">more citations?</a>). Though Cambridge Analytica’s impact on the 2016 U.S. election may be overstated, questions still arise as to the effects micro-targeting can have on the functioning of democracies and to what extent users have consented to this subjection.</p>
<p>In this article IBM’s services are used as an example to demonstrate some generic models for creating insights from personal data. Mark Zuckerberg’s recent PR response, posted publicly on Facebook, serves as the basis for the personality insights. Here is the sample text for your reference, if you would like to read it:</p>
<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fzuck%2Fposts%2F10104712037900071&width=500" width="500" height="294" style="border:none;overflow:hidden;margin-bottom:24px;" scrolling="no" frameborder="0" allowtransparency="true"></iframe>
<p>Beyond being a bit of fun, the example here aims to show how cheaply available insight tools are becoming, whilst questioning their increasing use in society.</p>
<h2 id="natural-language-understanding">Natural Language Understanding</h2>
<p>Before going on to the results of the personality insights, we can also quickly use IBM’s <a href="https://natural-language-understanding-demo.ng.bluemix.net/">Natural Language Understanding</a> API to get a brief outline of the document. Copying and pasting Mark’s post into the demo site, we get the following:</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/toshare.png" /></p>
<p>As well as summarising the object of the text sample, we can see <code class="highlighter-rouge">I want to share</code> as the key subject/action. This describes the semantic roles of the document; what about the emotional content?</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/emotion.png" /></p>
<p>Interestingly, the emotion is put forward as a mixture of joy and sadness. Perhaps sadness because of the news, but optimistic joy about the prospects of your future with Facebook 💁.</p>
<h2 id="personality-insights">Personality Insights</h2>
<p>Personality insights can arguably be used to gauge what kind of consumer you will be, the kinds of products you will be more likely to buy and, increasingly, what political messages may sway you. IBM’s <a href="https://personality-insights-demo.ng.bluemix.net/">personality insights demo page</a> describes the service as follows:</p>
<blockquote>
<p>Gain insight into how and why people think, act, and feel the way they do. This service applies linguistic analytics and personality theory to infer attributes from a person’s unstructured text.</p>
</blockquote>
<p>Again, simply copying and pasting Mark’s post into the demo yields a range of results; the first thing we are greeted by is the high-level summary shown below:</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/watson1.png" /></p>
<p>I always thought Mark was unlikely to be influenced by social media during product purchases, but how did the application know that? Well, according to what is described in the science behind the service, the specific personality profiles that are constructed suggest certain consumer behaviours. We can consider what this means a bit more with some of the other data the demo provides.</p>
<h3 id="personality-needs-values">Personality, Needs, Values</h3>
<p>Diving into some of the data we see personality represented in three main categories:</p>
<p><strong>Big Five model</strong>: This is one of the most widely studied personality models in clinical psychology. It describes a person in terms of <em>openness, conscientiousness, extraversion, agreeableness, and neuroticism</em> - It is sometimes referred to as the OCEAN model. Here neuroticism has been renamed in the service as emotional range as it was thought more ‘generally applicable’ (IBM Cloud Docs, 2017).</p>
<p><strong>Needs</strong>: <em>‘The twelve categories of needs that are reported by the service are described in marketing literature as desires that a person hopes to fulfil when considering a product or service’</em> (IBM Cloud Docs, 2017). (They are referring to: Kotler, P. and Armstrong, G. 2013. Principles of Marketing; Ford, K. 2005. Brands Laid Bare: Using Market Research for Evidence-Based Brand Management.)</p>
<p><strong>Values</strong>: <em>‘computes the five basic human values proposed by <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.220.3674&rep=rep1&type=pdf">Schwartz</a> and validated in more than twenty countries’</em> (IBM Cloud Docs, 2017).</p>
<p>You can see the results of these fields below.</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/types.png" /></p>
<h3 id="a-more-detailed-look-at-the-big-five">A more detailed look at the Big Five</h3>
<p>The service goes on to break down each of the Big Five dimensions into 10 features, making it a 50-feature model. Because the more features the better, right?</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/big5.png" /></p>
<p>If my graphs aren’t nice enough for you, here’s a nice 👌 ‘sunburst’ visualisation of all the data shown above, generated by the site.</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/sunburst.png" /></p>
<h3 id="consumer-preferences">Consumer Preferences</h3>
<p>As stated previously, inferences can be made about the kinds of choices individuals are likely to make based on specific personality traits. Using the features described above, IBM has created models that fit specific consumption preferences to specific personality types. Though not shown overtly on the demo page, there is the option to download a JSON file with all the generic tests carried out. These can be listed as follows:</p>
<p><strong>Shopping</strong></p>
<ul>
<li><span style="opacity:1">Likely to be sensitive to ownership cost when buying automobiles: 1.</span></li>
<li><span style="opacity:0.25">Likely to prefer safety when buying automobiles: 0.</span></li>
<li><span style="opacity:1">Likely to prefer quality when buying clothes: 1.</span></li>
<li><span style="opacity:0.25">Likely to prefer style when buying clothes: 0.</span></li>
<li><span style="opacity:1">Likely to prefer comfort when buying clothes: 1.</span></li>
<li><span style="opacity:0.25">Likely to be influenced by brand name when making product purchases: 0.</span></li>
<li><span style="opacity:1">Likely to be influenced by product utility when making product purchases: 1.</span></li>
<li><span style="opacity:0.25">Likely to be influenced by online ads when making product purchases: 0.</span></li>
<li><span style="opacity:0.25">Likely to be influenced by social media when making product purchases: 0.</span></li>
<li><span style="opacity:0.25">Likely to be influenced by family when making product purchases: 0.</span></li>
<li><span style="opacity:0.25">Likely to indulge in spur of the moment purchases: 0.</span></li>
<li><span style="opacity:1">Likely to prefer using credit cards for shopping: 1.</span></li>
</ul>
<p><strong>Health and activity</strong></p>
<ul>
<li><span style="opacity:0.25">Likely to eat out frequently: 0.</span></li>
<li><span style="opacity:0.25">Likely to have a gym membership: 0.</span></li>
<li><span style="opacity:1">Likely to like outdoor activities: 1.</span></li>
</ul>
<p><strong>Environmental concern</strong></p>
<ul>
<li><span style="opacity:1">Likely to be concerned about the environment: 1.</span></li>
</ul>
<p><strong>Entrepreneurship</strong></p>
<ul>
<li><span style="opacity:0.5">Likely to consider starting a business in next few years: 0.5.</span></li>
</ul>
<p><strong>Movie</strong></p>
<ul>
<li><span style="opacity:0.25">Likely to like romance movies: 0.</span></li>
<li><span style="opacity:1">Likely to like adventure movies: 1.</span></li>
<li><span style="opacity:0.25">Likely to like horror movies: 0.</span></li>
<li><span style="opacity:0.25">Likely to like musical movies: 0.</span></li>
<li><span style="opacity:1">Likely to like historical movies: 1.</span></li>
<li><span style="opacity:1">Likely to like science-fiction movies: 1.</span></li>
<li><span style="opacity:1">Likely to like war movies: 1.</span></li>
<li><span style="opacity:0.25">Likely to like drama movies: 0.</span></li>
<li><span style="opacity:1">Likely to like action movies: 1.</span></li>
<li><span style="opacity:1">Likely to like documentary movies: 1.</span></li>
</ul>
<p><strong>Music</strong></p>
<ul>
<li><span style="opacity:0.25">Likely to like rap music: 0.</span></li>
<li><span style="opacity:0.5">Likely to like country music: 0.5.</span></li>
<li><span style="opacity:0.5">Likely to like R&B music: 0.5.</span></li>
<li><span style="opacity:0.25">Likely to like hip hop music: 0.</span></li>
<li><span style="opacity:0.25">Likely to attend live musical events: 0.</span></li>
<li><span style="opacity:0.25">Likely to have experience playing music: 0.</span></li>
<li><span style="opacity:1">Likely to like Latin music: 1.</span></li>
<li><span style="opacity:1">Likely to like rock music: 1.</span></li>
<li><span style="opacity:1">Likely to like classical music: 1.</span></li>
</ul>
<p><strong>Reading</strong></p>
<ul>
<li><span style="opacity:1">Likely to read often: 1.</span></li>
<li><span style="opacity:0.25">Likely to read entertainment magazines: 0.</span></li>
<li><span style="opacity:1">Likely to read non-fiction books: 1.</span></li>
<li><span style="opacity:1">Likely to read financial investment books: 1.</span></li>
<li><span style="opacity:0.25">Likely to read autobiographical books: 0.</span></li>
</ul>
<p><strong>Volunteering</strong></p>
<ul>
<li><span style="opacity:1">Likely to volunteer for social causes: 1.</span></li>
</ul>
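<p>Lists like the ones above can be reduced from the downloaded JSON in a few lines. The field names below (<code class="highlighter-rouge">consumption_preferences</code>, <code class="highlighter-rouge">name</code>, <code class="highlighter-rouge">score</code>) are an assumption modelled loosely on the demo’s downloadable output, not a documented contract, and the sample values are illustrative only.</p>

```python
# Hypothetical profile shaped like the demo's downloadable JSON;
# field names and values are assumptions for illustration.
profile = {
    "consumption_preferences": [
        {
            "name": "Shopping",
            "consumption_preferences": [
                {"name": "Likely to prefer quality when buying clothes", "score": 1.0},
                {"name": "Likely to prefer style when buying clothes", "score": 0.0},
                {"name": "Likely to consider starting a business", "score": 0.5},
            ],
        }
    ]
}

def likely_preferences(profile, threshold=0.5):
    """Return the names of preferences whose score meets the threshold."""
    return [
        pref["name"]
        for category in profile["consumption_preferences"]
        for pref in category["consumption_preferences"]
        if pref["score"] >= threshold
    ]

print(likely_preferences(profile))
```

<p>Filtering at a threshold of 0.5 keeps both the confident predictions and the uncertain middle scores, which is roughly how the dimmed/undimmed rendering above works.</p>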
<p>Some of these may seem silly, but in a real-world scenario you would create your own models to cater for your specific needs. That part isn’t as cheap an endeavour: without existing datasets, you would need to gather the data yourself.</p>
<h2 id="conclusion">Conclusion</h2>
<p><strong>How accurate is this?</strong></p>
<p>Short answer: In this example, not at all.</p>
<p>As the demo website states, the sample is far too short to create an accurate analysis. However, given a complete user profile, the results are argued to become more effective. I found a few references to horoscopes in online discussions concerning the service (<a href="https://www.quora.com/How-accurate-is-IBMs-Watson-Personality-Insights-application">Quora</a>).</p>
<p>Looking for other examples of this kind of service to get a comparison, the website created by Cambridge University <a href="https://applymagicsauce.com/">https://applymagicsauce.com/</a> is probably the most similar available online. However, I didn’t try that one out in the end because it wanted access to my social media data.</p>
<p>With this kind of service, accuracy will always be hard to measure. Even though it relies on numerical computations, we are still receiving qualitative results; agreeableness, for example, is much more a relative measure than a count of objects. However, using tried-and-tested psychological frameworks means the service probably does have some merit.</p>
<p><strong>Final thoughts</strong></p>
<p>Whether this is an accurate description of Mark’s personality, or even of the text, is not the argument. This example was used to start thinking about how tools trying to ascertain behavioural insights are integrating with society. The demo page emphasises that a representative sample be used. A person’s social media data tends to be thought of as representative, perhaps because of its capacity for eclectic expression. This service, and those like it, will always assess a representation of a person, not the person themselves. The affordances of these services to those seeking insights depend on their accuracy, but for those who are the subject of inquiry this is not always the case. For example, in the field of recruitment, services like that provided by <a href="https://www.hirevue.com/">hirevue</a> use machine learning to gain insights on candidates, scoring them via various metrics. Here, for the subject of inquiry, accuracy is less important than a favourable outcome (i.e. getting the job). In this way, there is the possibility for these tools to shape the behaviour of individuals. What happens when this becomes more widely used in society for decision-making processes and is further democratised? This could make for interesting future inquiry, especially with regard to more civic matters.</p>
<h3 id="references">References</h3>
<ul>
<li>Wang, Y. and Kosinski, M. 2018. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. <em>Journal of Personality and Social Psychology</em>. [Online]. [Accessed 22 March 2018]. <strong>114</strong>(2), pp.246-257. Available from: <a href="https://psyarxiv.com/hv28a/">https://psyarxiv.com/hv28a/</a></li>
<li>IBM Cloud Docs, 2017. <em>The science behind the service</em>. [Online]. [Accessed 22 March 2018]. Available from: <a href="https://console.bluemix.net/docs/services/personality-insights/science.html#science">https://console.bluemix.net/docs/services/personality-insights/science.html#science</a></li>
<li>IBM Watson Developer Cloud, 2017. <em>Personality Insights</em>. [Online]. [Accessed 22 March 2018]. Available from: <a href="https://personality-insights-demo.ng.bluemix.net/">https://personality-insights-demo.ng.bluemix.net/</a></li>
</ul>
<h3 id="other-resources">Other resources</h3>
<ul>
<li><a href="https://console.bluemix.net/docs/services/personality-insights/references.html#fast2008">IBM research reference list</a></li>
<li>Costa, P. and McCrae, R. 2008. Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Manual. Odessa, FL: Psychological Assessment Resources (1992). Available from: <a href="https://www.researchgate.net/publication/285086638_The_revised_NEO_personality_inventory_NEO-PI-R">https://www.researchgate.net/publication/285086638_The_revised_NEO_personality_inventory_NEO-PI-R</a></li>
<li><a href="https://applymagicsauce.com/">Apply Magic Sauce personality insight app</a></li>
<li><a href="https://www.youtube.com/watch?v=DYhAM34Hhzc">Michal Kosinski - The End of Privacy, Keynote at CeBIT’17</a></li>
<li><a href="http://www.michalkosinski.com/home/publications">http://www.michalkosinski.com/home/publications</a></li>
<li><a href="https://youtu.be/n8Dd5aVXLCc">The Power of Big Data and Psychographics</a></li>
</ul>Karl SimsConsidering the attention drawn to data privacy after the recent Facebook and Cambridge Analytica fiasco, it seems relevant to explore some available tools for gathering insights on online publics. This article will experiment with IBM’s off-the-shelf personality insights tool to illustrate the kinds of features that can be constructed from user data. It will use Mark Zuckerberg’s response to Cambridge Analytica’s apparent misuse of Facebook data as a sample source.