<p><em>Digital Citizens: a New Media MA independent research blog concerning citizen media and democracy in the digital age, by Karl Sims.</em></p>
<h1>Finishing summary</h1>
<p><em>2018-05-08</em></p>
<p>This final article concludes by summarising the project and discussing some of the overarching issues and processes that have concerned it along the way.</p>
<p>Over the course of the project, aims and ideas have naturally evolved. Reflecting on the proposed research question, some evaluations of the project’s trajectory can be made. As proposed, the question reads as follows.</p>
<blockquote>
<p>Democratic Citizenship Amongst Automated Agents and Environments; What Challenges and Opportunities Exist for Citizens?</p>
</blockquote>
<p>This is most accurately described as a guide to the kind of project pursued: understandably broad and a little vague. As the project progressed, a key concentration became the Facebook and Cambridge Analytica scandal, perceived as an interesting, still-evolving piece of current affairs that addressed related themes, specifically issues of personalised media content, data privacy and their place within elections.</p>
<p>Considering this original question, the opportunities seem to lie in the hands of campaigners rather than citizens (or at least they did). Citizens appear to face a greater mass of challenges in this environment: in particular, deciphering political messages with more effective persuasive qualities may put them in a position where common truths are less available. How this issue evolves in the future will present a more detailed picture.</p>
<p>The near-avoidance here of most deliberative theory, theory that is often central and constitutive to discussions of the public sphere, should be noted. Perhaps Habermasian notions of the public sphere are not reflected online, or that ideal is not my ideal. Major social media platforms have been considered here because they are the sites that retain the most users and thus the most citizens. If there is power in the internet for bringing about democratic change, it doesn’t seem to operate through long, considered, rational public deliberation.</p>
<p><strong>Conclusion</strong></p>
<p>At times, it seems research priorities were found more in the exploration of methods than in the topics they were supposedly studying. Looking at the project overall, attention has been given to the development of skills, particularly those related to natural language processing. This has been formative and has fed back into more issue-centred investigations. Adopting the structure of a blog over a more traditional academic format has provided opportunities for experimentation and has been a rewarding experience overall. Any good practices built here will be taken forward into future research projects.</p>
<h1>Content analysis of YouTube comments on Zuckerberg’s congressional hearing</h1>
<p><em>2018-05-06</em></p>
<p>Noting this project’s complete reliance so far on computational methods, this post details steps taken in working with more traditional forms of content analysis. Staying with the theme of the Facebook and Cambridge Analytica scandal, it looks at comments made on YouTube in response to Mark Zuckerberg’s live hearing with Congress.</p>
<h2 id="approach">Approach</h2>
<p><strong>Pretext</strong></p>
<p>Having learnt computer programming before more traditional content analysis, I hadn’t taken much of an interest in the latter. Systematically processing media in accordance with a specific set of instructions sounds a lot like the work of machines; moreover, the interpretation of such ‘code books’ is a lot more subjective than a computer executing source code. However, after recently reading several essays on the subject, an interest was sparked in utilising it in a manner that combines the capacities of human and machine labour effectively. Zamith and Lewis (2015) provided comparisons between algorithmic and human approaches to coding data, which informed strategies here. The work of Graham (2008) was used as a starting point for an investigation of civic discussions online, branching off into its citations for further reference.</p>
<p><strong>Investigation</strong></p>
<p>Sticking with the theme of the Facebook and Cambridge Analytica scandal, it was decided that an analysis of YouTube comment threads from videos broadcasting the US congressional hearing of Mark Zuckerberg would be performed. These were collected using YouTube’s Data API, equating to roughly <code class="highlighter-rouge">2000</code> threads across 2 videos (<code class="highlighter-rouge">hJdxOqnCNp8</code>, <code class="highlighter-rouge">6ValJMOpt7s</code>), including any replies within each thread. These sought to provide evidence mapping discussions within the topic. Though YouTube comments can be just as inflammatory as Twitter posts, their visibility poses more opportunities for responses from other users over a more sustained period. With fewer restrictions on content length, sustained and lengthy arguments are possible through this medium. As threads contain responses from a variety of different actors, interactions can be graphed effectively. This was to be considered in a developing framework.</p>
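<p>The collection scripts actually used are linked under supporting resources below. Purely as an illustration, a minimal sketch of this kind of collection against the YouTube Data API v3 <code class="highlighter-rouge">commentThreads</code> endpoint might look as follows (the helper names and the flattened row format are my own, not the project’s):</p>

```python
import urllib.parse

API_URL = "https://www.googleapis.com/youtube/v3/commentThreads"

def comment_threads_url(video_id, api_key, page_token=None):
    """Build a request URL for the YouTube Data API v3 commentThreads endpoint."""
    params = {
        "part": "snippet,replies",  # top-level comments plus bundled replies
        "videoId": video_id,
        "maxResults": 100,          # API maximum per page
        "key": api_key,
    }
    if page_token:                  # follow `nextPageToken` to paginate
        params["pageToken"] = page_token
    return API_URL + "?" + urllib.parse.urlencode(params)

def flatten_threads(response):
    """Flatten one API response page into (thread_id, author, text) rows,
    including any replies bundled with each top-level comment."""
    rows = []
    for item in response.get("items", []):
        top = item["snippet"]["topLevelComment"]["snippet"]
        rows.append((item["id"], top["authorDisplayName"], top["textOriginal"]))
        for reply in item.get("replies", {}).get("comments", []):
            snip = reply["snippet"]
            rows.append((item["id"], snip["authorDisplayName"], snip["textOriginal"]))
    return rows
```

<p>Note that the <code class="highlighter-rouge">replies</code> field only bundles a subset of replies per thread; a fuller collection would also page through the <code class="highlighter-rouge">comments</code> endpoint per thread.</p>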
<h2 id="abandonment">Abandonment</h2>
<p>Ultimately, to make this process worthwhile, a longer and more in-depth period of reflection was needed prior to analysing the content. Developing a framework to apply to the data was obviously the most time-consuming part. The general strategy was to build an interface that optimised analysis by automating things like conditional logic (e.g. discussions of data regulation -&gt; political talk) as well as any tasks that can be done by machines (e.g. constructing conversation graphs). In this process, there was a tendency to design the interface server in the most generalisable way possible, which was another thing taking up too much time.</p>
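<p>As a sketch of the kind of conditional logic mentioned above, with entirely hypothetical category names, automatically expanding a coder’s manual codes with their implied codes could be as simple as:</p>

```python
# Hypothetical coding rules: assigning one code implies further codes,
# mirroring conditional logic like "data regulation -> political talk".
IMPLIES = {
    "data_regulation": {"political_talk"},
    "election_interference": {"political_talk"},
    "political_talk": {"civic_content"},
}

def expand_codes(assigned):
    """Expand a set of manually assigned codes with every implied code,
    following implication chains transitively."""
    result = set(assigned)
    queue = list(assigned)
    while queue:
        code = queue.pop()
        for implied in IMPLIES.get(code, ()):
            if implied not in result:
                result.add(implied)
                queue.append(implied)
    return result
```

<p>Here coding “data_regulation” automatically yields “political_talk” and, through it, “civic_content”, saving the human coder repeated decisions the rule book already determines.</p>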
<p><img src="/digitalcitizens/assets/imgs/content-analysis-of-youtube-comments/interface.jpg" /></p>
<figcaption>Screenshot of interface currently developed</figcaption>
<h2 id="conclusion">Conclusion</h2>
<p>This process wasn’t essential to the investigation of the Facebook and Cambridge Analytica scandal; instead it was something of interest. It sought to address concerns that the research tools being used were having too formative an effect on the types of investigations being made, i.e. only analyses that machines can perform well. Though not complete, the work done provides a starting point that may be used in another project.</p>
<h3 id="references">References</h3>
<ul>
<li>Graham, T. 2008. Needle in a Haystack. <em>Javnost – The Public</em>. <strong>15</strong>(2). pp.17-36.</li>
<li>Zamith, R. and Lewis, S. 2015. Content Analysis and the Algorithmic Coder: What Computational Social Science Means for Traditional Modes of Media Analysis. <em>The ANNALS of the American Academy of Political and Social Science</em>. [Online]. <strong>659</strong>(1). pp.307-318. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/0002716215570576#articleCitationDownloadContainer">http://journals.sagepub.com/doi/abs/10.1177/0002716215570576#articleCitationDownloadContainer</a></li>
</ul>
<h2 id="supporting-resources">Supporting resources</h2>
<ul>
<li><a href="https://github.com/winstonjay/digitalcitizens/tree/master/api_tools/youtube">digitalcitizens/api_tools/youtube at master · winstonjay/digitalcitizens · GitHub</a></li>
<li><a href="https://github.com/winstonjay/digitalcitizens/tree/master/content_analyser">digitalcitizens/content_analyser at master · winstonjay/digitalcitizens · GitHub</a></li>
</ul>
<h1>Data, politics and democracy part 4: Reflections</h1>
<p><em>2018-05-01</em></p>
<p>Investigations here have predominantly focused on what has been shared, not how, or by which individuals specifically. This translated into analysis of both the initial news coverage from the Guardian and the posts made via Twitter. Now this event will be discussed in relation to theories on digital privacy, also reviewing the regulatory conditions for Facebook and the underlying logic of web 2.0. It will finish by assessing some normative implications for democracies and the roles different forms of media play in addressing such issues.</p>
<h2 id="privacy-paradox">Privacy Paradox?</h2>
<p>It’s clear from both the news coverage and the data collected here that there was an apparent breach of trust in how data was shared, and that through strategies such as <code class="highlighter-rouge">#deleteFacebook</code>, users sought to express this. What is less clear is how much of an effect this has had on long-term privacy attitudes, and whether this incident will lead to any meaningful actions by either Facebook users or the site itself.</p>
<p><strong>Privacy attitudes</strong></p>
<p>What people say about their privacy and what they actually do often don’t add up. The <em>privacy paradox</em> describes the disconnect between people’s willingness to disclose personal information online and the levels of concern they express (Young, A. and Quan-Haase, A. 2013. p.479). Generally, people say they value things like privacy, freedom, and security, but despite this, there are many situations in which they are willing to waive certain rights in varying forms of exchange. These rights are at times negotiated, at others eroded by changes in societal norms, or given up through more tacit personal omissions.</p>
<p>A longitudinal study examining privacy attitudes and self-disclosure patterns of Facebook users over the last 5 years found that although ‘heavy users’ had seen a marked increase in concern, the opinions of ‘light users’ remained approximately the same (Tsay-Vogel, M., Shanahan, J. and Signorielli, N. 2018). Not only that, but increases in concern seem to be plateauing for heavy users. The authors argue that this supports the hypothesis that the ongoing exposure and habitual nature of social media use has affected perceptions not only of what levels of self-disclosure are normal, but of what is expected.</p>
<p><strong>Sharing in changing contexts</strong></p>
<p>Despite this, studies have also shown that individuals do take active steps to negotiate their privacy within the constraints of Facebook’s available settings (Marwick, A. and Boyd, D. 2014; Young, A. and Quan-Haase, A. 2013). As a prerequisite, Facebook requires users to share at least some information in order to connect with other users. Helen Nissenbaum describes the importance of the context in which information is disclosed: essentially, that privacy strategies should be coherent with, and informed by, the conditions in which data was intended to be used (2010). As discussed in part 2 of this series, Facebook’s default settings dictate that most posted information is shared between ‘friends’, which to a large extent sets the tone of social exchange. Through tacit knowledge it is understood that this information is correspondingly used by Facebook as it sees fit: personalising content, selling ads, and so on. It is in this context that privacy is negotiated, not only between general end users but equally with developers, advertisers and the company itself. Given this, it is not surprising that there are disjunctions in perceived acceptable standards.</p>
<p>To place this issue fully, privacy considerations need to take in the scope and topology of how data flows. The idea of <em>networked privacy</em> recognises that information is often intertwined within relationships between users, making it difficult for individuals to fully negotiate how information may be shared (Marwick, A. and Boyd, D. 2014). This was explicit within this scandal, as the original dataset collected by Aleksandr Kogan relied on the previously relaxed conditions that enabled Facebook users to share the personal data of all their friends by taking an online survey. Naturally, one could not argue that any of them could have predicted where this data would end up, especially its use with technology that didn’t then exist. Marwick and Boyd argue that these changing and co-constructed contexts framed by networked privacy can collapse the rules of contextual integrity that Nissenbaum describes (2014. pp.1063-1064). As context is interpretable in several ways, initial consent loses its meaning once data has been exchanged multiple times.</p>
<p><strong>Established concerns</strong></p>
<p>Establishing privacy online as a paradox doesn’t explain the contradictions it claims. As has been discussed, individuals take active steps to mitigate data collection, and though increases in levels of concern are stagnating, apprehension very much exists. Positioning the Cambridge Analytica scandal against the background of stories on privacy over the last 8 years (part 3: fig 5), the idea that various institutions are gathering large amounts of data on them is not alien to citizens. While the Guardian typically addresses a specific kind of left-leaning reader, these stories have often led to widespread coverage across differing media. Some people may not be concerned with their privacy; overall, however, the picture is of sustained and growing attention to these issues across social and broadcast media.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-4/privacy_opinions.png" /></p>
<figcaption>xkcd: Privacy Opinions. from: <a href="https://xkcd.com/1269/">https://xkcd.com/1269/</a></figcaption>
<h2 id="the-regulation-of-and-by-facebook">The regulation of and by Facebook</h2>
<p>Tarleton Gillespie has recently considered the <em>‘regulation of and by platforms’</em> (2018), in particular how user-generated content is managed. With reference to Section 230 of the US Communications Decency Act (1996), an argument is made that current content regulation, whilst protecting social media providers, was in fact designed for Internet service providers and search engines (pp.255-260). Stating the impossibility of platform impartiality, an emphasis is placed on the importance of platforms’ own governance and curation of online spaces (p.262). Regulations for tech companies are outdated, ill-suited and sometimes non-existent with respect to contemporary concerns. Be it the beta testing of self-driving cars on public roads, deciding how data is collected and sold, or the negotiation of speech limitations within platforms, the underlying theme expressed by law, particularly in the US, is that tech companies can simply regulate themselves and innovation should not be stifled. The issue raised by this scandal concerns the regulation of both data and content produced for political contexts, along with their underlying strategies.</p>
<p><strong>Political regulation</strong></p>
<p>With the implementation of the EU’s General Data Protection Regulation (GDPR), for European users at least, Facebook will supposedly have to uphold the new standards presented. These include attempts to make terms and conditions more transparent, the right to be forgotten, rights for users to access data concerning themselves, and several other safeguards for citizens [<a href="https://ico.org.uk/for-organisations/guide-to-the-general-data-protection-regulation-gdpr/">ref</a>]. Though Facebook claims to be on board with the regulation, <a href="https://www.reuters.com/article/us-facebook-privacy-eu-exclusive/exclusive-facebook-to-put-1-5-billion-users-out-of-reach-of-new-eu-privacy-law-idUSKBN1HQ00P">reports</a> claim the company is shifting the data of over a billion users into more liberal regulatory environments, obvious counter-evidence to its claims.</p>
<p><strong>Self-regulation</strong></p>
<p>Even without intervention by political institutions, Facebook has strong incentives to adapt its policies to meet user preferences. As Gillespie also notes, even with their level of monopolisation, companies like Facebook still don’t want to lose large numbers of users to competitors (2018, p.262). To do this it needs to keep users happy. The trouble is, it is not just the end users that it needs to satisfy.</p>
<p>In analysing Facebook’s attitude to regulation, it is important to note the multitude of actors it aims to satisfy. Bucher and Helmond (2018) consider the affordances of social media platforms in relation to their different types of consumers, including advertisers, investors, end users, developers, and the platforms themselves. Looking at a variety of definitions of affordance, they emphasise the inherent reciprocity involved: not just what technology affords users but equally what users afford technology. On Facebook, an obvious example would be user actions providing data points for the site’s personalisation algorithms. In trying to satisfy the combination of its ethos, end users, developers, advertisers and investors, conflicts have naturally arisen. Facebook may want users to feel their data is shared minimally, whilst assuring advertisers that maximum access is granted. Users themselves may also be conflicted in their relationship to data practices; they might not want to share their data yet prefer more personalised content.</p>
<p><strong>Democratic Governance</strong></p>
<p>Though Facebook may have enjoyed praise for its apparent democratising potential (e.g., superficially, the Arab Spring), the weight of claims that it facilitates fake news and political polarisation, this scandal, and its apparent failure to take meaningful responsibility have led to discontent among users, politicians and investors alike. Facebook is a place where a combination of public, private and corporate interests compete for the social and connective assets that the site commands (Van Dijck, J. 2012). The problems of this online space reflect the issues faced by democracies in general. Could democratic input from users produce an environment of greater accountability and increased satisfaction? Well, it turns out that in 2009 the Mark Zuckerberg of a 200-million-user Facebook thought so…</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/2GuHVZx4OwU" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen=""></iframe>
<p>Reportedly, the experiment didn’t work out, though it is not clear it can be said they really tried. The company cites low turnout, but <a href="https://www.theverge.com/2018/4/5/17176834/mark-zuckerberg-facebook-democracy-governance-vote-failure">critics have argued</a> that users were only given a limited choice between two slightly different terms of service, not a say in how the company was structurally run.</p>
<h2 id="standards-for-democracy">Standards for democracy</h2>
<p>A key argument of this scandal, stated even by the whistleblower themselves, was that democracy had in some way been undermined. This idea will now be examined briefly.</p>
<p>Using Jesper Strömbäck’s four models of democracy, <em>procedural</em>, <em>competitive</em>, <em>participatory</em> and <em>deliberative</em> (2005), as reference standards with normative implications, we can position the levels at which democracy might be being disrupted. Arguably, the discussion caused by the Cambridge Analytica scandal concerns the least publicly involved end of the spectrum (the order of perceived public involvement running from procedural to deliberative). Considerations here will predominantly concern democracy as styled within the competitive standard.</p>
<p><strong>Competitive democracy</strong></p>
<p>The standard of competitive democracy holds that there be proper competition and choice between political elites, enabling thorough scrutiny by an informed electorate. In this sense, <em>‘it is the political elites that act, whereas the citizens react’</em> (2005, p.334). Citizens therefore select which of the political elites they think will give them the best product, <em>’as in a marketplace of goods’</em> (2005, p.334). Strömbäck goes on to state that for this to be so, fact must be distinguishable from fiction, along with the purposes of different kinds of media content (2005, p.334).</p>
<p><strong>Personalised campaigning</strong></p>
<p>Key arguments surrounding the personalisation of media can be found in the works of writers like Eli Pariser (2011) and Cass Sunstein (2001). One shared theme is the erosion of a shared reality, replaced instead by polarised, homogenised spaces (filter bubbles / echo chambers). Helen Nissenbaum posits that, considering individualised voter targeting, an argument for could be maintaining the freedom of competition within political campaigns, while an argument against is that personalisation would distort the decision-making processes of voters (2010). At Mark Zuckerberg’s congressional hearing, US Senator Chuck Grassley made a point of stressing that campaigns from both sides of the aisle have progressively made use of the latest technologies to achieve the upper hand. This is true in other countries also.
Most election campaigns require candidates to debate publicly on a common set of issues. Short of the personalisation of all media, though the background of these issues may be distorted, there remains at least some shared reality through which candidates’ campaigns can gain momentum amongst undecided voters. Though probably not ideal, a case could be made that this falls within the standard of competitive democracy. That is, of course, only if truth is still upheld, something that seems tenuous here, posing more involved epistemic questions.</p>
<p><strong>Roles of different media</strong></p>
<p>Between social and broadcast media, there appear to be differences in the types of issues that can be effectively communicated. Social media seem effective at communicating shared experiences less reliant on concrete facts. Contrastingly, the inherent structure and elevated broadcasting ability of traditional media render them more adept at co-ordinating and presenting more complex stories. An implication for journalism within the idea of competitive democracy is that it can act as a watchdog and hold the political elites to account (Strömbäck, J. 2005. p.341).</p>
<p>In the example given by this issue, The Guardian/Observer could piece together and co-ordinate actors in a considered manner. Responses made on Twitter with <code class="highlighter-rouge">#deleteFacebook</code> established communication channels for users to share sentiments, which then extended and added to the initial story. We could ask: without the social media response, would there have been as much pressure on the US Congress or Facebook to respond? Perhaps other actors would have pursued the issue; it’s hard to tell. The question here is whether the current media ecosphere supports the needs and desires of citizens, and whether actors have been held to account. Do scandals indicate functionality beyond the exposed dysfunction? The fact there are scandals often means there are attempts to address their issues. Apologies have been made and steps laid down to address the issue, but Facebook has so far managed to get away without fines or punishment.</p>
<h2 id="concluding-discussion">Concluding discussion</h2>
<p>In some respects, it’s unconvincing that Facebook’s services were used in an unintended manner. Facebook makes money by providing tools to communicate with curated audiences; this is what Cambridge Analytica was doing. The only differences were that 1) overtly political content was used in targeting instead of commercial content and 2) the data used originated from a time of different internal policies. Facebook has recognised this was unsavoury and is offering ways to address the issue, though there is bound to be more data like this out there, and similar situations may happen again. The tone has been set for what users currently think is acceptable.
The control and accumulation of data is key here, and it generalises to the underlying logic of the web 2.0 business model: a cyclic processing and archiving of data performed in collaboration between users and platforms, what Robert Gehl has described as ‘affective processing’ (2011). We could cite the often-referenced Tim Berners-Lee as someone calling for the re-decentralisation of the internet. Work like that done at the <a href="https://theodi.org/">Open Data Institute</a> is clearly needed; there is a strong case for letting users have more control over their data. This is particularly important regarding data’s increasing use within AI, something that magnifies the inequalities of data access.</p>
<p><strong>Conclusion</strong></p>
<p>The methodologies used in this series, first with Twitter and then with the articles published by The Guardian, provided perspective on subtopics within the issue. The t-SNE visualisations and word co-occurrences used with the Twitter data mapped related terms, whilst hashtag counts more concretely identified trends. Comparisons with The Guardian articles over the corresponding timeline helped offer context from other recent personal data scandals. This exploration was beneficial in establishing a starting point for more in-depth discussion. The main points discussed here include standards for democracy, negotiations of data and privacy, and the underpinning strategies of Facebook as a platform. Though what is presented may be incomplete in parts, it provides research and experience that may be helpful for future personal projects.</p>
<p>Considering the Cambridge Analytica scandal as it has evolved has shed light on numerous cultural foundations. Issues of privacy, data and election regulation in relation to technology have been recurrent in recent history. As technologies evolve or attitudes change, renegotiations consistently need to take place for aspirations of consensus to be approached, yet arguably never attained. Research considering this may well prove important in the decisions that get made.</p>
<h3 id="references">References</h3>
<ul>
<li>Bucher, T. and Helmond, A. 2018. The Affordances of Social Media Platforms. In: Burgess, J., Marwick, A. and Poell, T. eds. <em>The SAGE Handbook of Social Media</em>. London: SAGE Publications. pp.233-253.</li>
<li>Gehl, R. 2011. The archive and the processor: The internal logic of Web 2.0. <em>New Media & Society</em>. [Online]. <strong>13</strong>(8). pp.1228-1244. [Accessed 5 May 2018]. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/1461444811401735">http://journals.sagepub.com/doi/abs/10.1177/1461444811401735</a></li>
<li>Gillespie, T. 2018. Regulation of and by Platforms. In: Burgess, J., Marwick, A. and Poell, T. eds. <em>The SAGE Handbook of Social Media</em>. London: SAGE Publications. pp.254-278.</li>
<li>Marwick, A. and Boyd, D. 2014. Networked privacy: How teenagers negotiate context in social media. <em>New Media & Society</em>. [Online]. <strong>16</strong>(7). pp.1051-1067. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/1461444814543995">http://journals.sagepub.com/doi/abs/10.1177/1461444814543995</a></li>
<li>Nissenbaum, H. 2010. <em>Privacy in Context: Technology, Policy, and the Integrity of Social Life</em>. California: Stanford University Press.</li>
<li>Tsay-Vogel, M., Shanahan, J. and Signorielli, N. 2018. Social media cultivating perceptions of privacy: A 5-year analysis of privacy attitudes and self-disclosure behaviours among Facebook users. <em>New Media & Society</em>. [Online]. <strong>20</strong>(1). pp.141-161. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/1461444816660731">http://journals.sagepub.com/doi/abs/10.1177/1461444816660731</a></li>
<li>Strömbäck, J. 2005. In Search of a Standard: Four models of democracy and their normative implications for journalism. <em>Journalism Studies</em>. <strong>6</strong>(3). pp.331-345.</li>
<li>Van Dijck, J. 2012. Facebook as a Tool for Producing Sociality and Connectivity. <em>Television & New Media</em>. [Online]. <strong>13</strong>(2). pp.160-176. Available from: <a href="http://journals.sagepub.com/doi/abs/10.1177/1527476411415291">http://journals.sagepub.com/doi/abs/10.1177/1527476411415291</a></li>
<li>Young, A. and Quan-Haase, A. 2013. Privacy protection strategies on Facebook. <em>Information, Communication & Society</em>. [Online]. <strong>16</strong>(4). pp.479-500. Available from: <a href="https://www.tandfonline.com/doi/abs/10.1080/1369118X.2013.777757">https://www.tandfonline.com/doi/abs/10.1080/1369118X.2013.777757</a></li>
</ul>
<h1>Data, politics and democracy part 3: Analysis of The Guardian’s content</h1>
<p><em>2018-04-12</em></p>
<p>This post details investigations into content created by the Guardian, a key player in the dissemination of the Facebook/Cambridge Analytica story. It uses computational methods such as keyword extraction, also making comparisons with the previously collected Twitter data.</p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/latest.js?config=TeX-MML-AM_CHTML" async=""></script>
<h2 id="strategy">Strategy</h2>
<p>Seeing that The Guardian was one of the main news organisations to break the initial story, it is arguably relevant to look at patterns within its coverage. Acting as <em>The Fourth Estate</em>, news organisations are commonly expected to hold governments and other public entities to account; how has this been done here? Taking into consideration the results of the Twitter data analyses, comparisons between the two content types will also be made. As part of a series, a picture is gradually being built up from a variety of different perspectives. Later, this information will be used to discuss the content of the two media types, and the topic more generally.</p>
<p>Making use of the Guardian’s <a href="http://open-platform.theguardian.com/">Open Platform API</a>, all articles between <code class="highlighter-rouge">17/03/2018</code> - <code class="highlighter-rouge">24/03/2018</code> were collected for analysis. This collection period starts on the date of the initial story and finishes when the previously sampled Twitter data ends its collection span. Symmetrical query terms were likewise used, here translated into the terms of the Guardian’s search API as <code class="highlighter-rouge">facebook AND "cambridge analytica"</code>. The content collected was then analysed using a variety of methods, looking at both manual and machine-generated structures within the data. Keywords were generated automatically using the RAKE algorithm; other approaches such as TF-IDF scores over n-grams and the topic generation discussed in previous articles were also utilised.</p>
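<p>For illustration, a sketch of building such a request against the Guardian Open Platform content-search endpoint might look as follows (the helper name and exact parameter choices are assumptions rather than the project’s actual collection code):</p>

```python
import urllib.parse

SEARCH_URL = "https://content.guardianapis.com/search"

def guardian_search_url(query, from_date, to_date, api_key, page=1):
    """Build a Guardian Open Platform content-search request URL."""
    params = {
        "q": query,                 # e.g. 'facebook AND "cambridge analytica"'
        "from-date": from_date,     # ISO dates, e.g. "2018-03-17"
        "to-date": to_date,
        "show-fields": "bodyText",  # include the full article text in results
        "page-size": 50,
        "page": page,               # step through result pages
        "api-key": api_key,
    }
    return SEARCH_URL + "?" + urllib.parse.urlencode(params)
```

<p>Each response page reports a <code class="highlighter-rouge">pages</code> total, so collection simply loops the <code class="highlighter-rouge">page</code> parameter until exhausted.</p>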
<p><strong>RAKE algorithm</strong></p>
<p>Presented in Rose, S., Engel, D., Cramer, N. and Cowley, W. (2010), the <em>Rapid Automatic Keyword Extraction</em> (RAKE) algorithm does as its title describes, generating keyword phrases from individual documents – ideally short texts like abstracts. A characteristic of this algorithm is that it weights longer sequences more heavily, producing greedier results. More details about how this is implemented can be found in the <a href="https://winstonjay.github.io/digitalcitizens/methods/">methods page of this blog</a>.</p>
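<p>As a rough illustration of the mechanics (not the repository implementation), candidate phrases are formed by splitting text at punctuation and stop words, then scored by summing each member word’s degree-to-frequency ratio. A condensed sketch with a toy stop word list:</p>

```python
import re

# Toy stop word list; the real implementation uses a much larger one.
STOPWORDS = {"the", "of", "a", "an", "and", "is", "in", "to", "for", "with", "on"}

def candidate_phrases(text):
    # Split on punctuation first, then break each fragment at stop words.
    phrases = []
    for fragment in re.split(r"[.,;:!?()\n]", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(" ".join(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(" ".join(current))
    return phrases

def rake_scores(text):
    # Score each word by degree/frequency, then sum scores over phrase members.
    phrases = candidate_phrases(text)
    freq, degree = {}, {}
    for phrase in phrases:
        words = phrase.split()
        for w in words:
            freq[w] = freq.get(w, 0) + 1
            degree[w] = degree.get(w, 0) + len(words) - 1  # co-occurrences only
    return {p: sum((degree[w] + freq[w]) / freq[w] for w in p.split())
            for p in phrases}
```

<p>Because longer phrases sum over more member words, they naturally accumulate higher scores – the greediness noted above.</p>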
<p>To depict keywords across the whole corpus, a method described in the authors’ paper for finding the most ‘essential’ terms was used in conjunction. The calculation can be summarised as follows:</p>
<script type="math/tex; mode=display">essentiality = (\frac{\text{edf}_k}{\text{rdf}_k}) \text{edf}_k</script>
<p>Where the <em>edf</em> (extraction document frequency) is the number of documents the candidate was extracted from as a keyword and <em>rdf</em> (reference document frequency) is the number of times a candidate appeared across the collection. With this approach, perhaps we will be able to get an alternative but reasonable portrayal of the keywords within these articles.</p>
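<p>Sketched in code, the score needs just those two counts per candidate. The inputs below are hypothetical stand-ins for RAKE output and the raw documents:</p>

```python
from collections import Counter

def essentiality_scores(extracted, documents):
    """Rank candidate keywords across a corpus by essentiality.

    extracted: one set of extracted keywords per document (e.g. RAKE output)
    documents: the raw text of each document
    """
    candidates = set().union(*extracted)
    # edf: number of documents the candidate was *extracted* from as a keyword.
    edf = Counter(kw for keywords in extracted for kw in keywords)
    # rdf: number of documents the candidate *appears* in at all.
    rdf = Counter()
    for doc in documents:
        text = doc.lower()
        for kw in candidates:
            if kw in text:
                rdf[kw] += 1
    return {kw: (edf[kw] / rdf[kw]) * edf[kw] for kw in candidates if rdf[kw]}

scores = essentiality_scores(
    [{"data privacy"}, {"data privacy"}, set()],
    ["Data privacy matters", "New data privacy rules",
     "Talking about data privacy anyway"])
```

<p>Here <code class="highlighter-rouge">data privacy</code> appears in three documents but was extracted from only two, giving a score of (2/3) × 2.</p>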
<p>Due to its relative simplicity, the RAKE algorithm and other related functions were implemented as needed in the Python programming language and can be found in the <code class="highlighter-rouge">text_tools</code> section of this blog’s GitHub repository (<a href="https://github.com/winstonjay/digitalcitizens/blob/master/text_tools/rake.py">digitalcitizens/rake</a>).</p>
<h2 id="analysis">Analysis</h2>
<h3 id="article-keywords">Article Keywords</h3>
<p><strong>Manually generated</strong></p>
<p>Looking at the human generated keywords tells us not only about the articles but also about the internal practices of the organisation. Being the query terms given, <code class="highlighter-rouge">Cambridge Analytica</code> and <code class="highlighter-rouge">Facebook</code> naturally top the chart; most of the other top items are meta keywords that describe overarching collections of content like <code class="highlighter-rouge">Technology</code> or <code class="highlighter-rouge">UK news</code>. The geographic categorisation of news types can tell us more about the priorities of the content producers (e.g. ‘US news’). This can be described here as <em>UK news</em> <code class="highlighter-rouge">></code> <em>World news</em> <code class="highlighter-rouge">></code> <em>US news</em>. Because of this, although the US 2016 elections were a key part of the topic, we are more likely to see British issues like Brexit reflected.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/tags.png" /></p>
<figcaption> <strong>Figure 1:</strong> Guardian Keywords tagged by Organisation.</figcaption>
<p>With knowledge of what this data concerns, we could use these tags to inversely query content over a longer period within the paper. Possible topics of interest include <code class="highlighter-rouge">Data protection</code>, <code class="highlighter-rouge">privacy</code> and <code class="highlighter-rouge">social media</code>. Seeing how these topics have evolved over time might be an interesting line to follow in establishing publishing patterns. This will be discussed at a later point in this article.</p>
<p><strong>RAKE results</strong></p>
<p>Comparing the RAKE key phrases with the human generated ones, the differences in style are apparent. The human keywords provide quite clear and considered metadata whose core function is to group content systematically. Here, though following qualitatively similar topics, we find structure in the natural language of the news reporters. For such a simple approach, it does appear to give good results. One noticeable caveat is its failure to capture single word key phrases well – questionably, <code class="highlighter-rouge">Brexit</code> is missing here. We can also see its tendency to be greedy and present longer phrases such as <code class="highlighter-rouge">50m Facebook profiles</code> over simply <code class="highlighter-rouge">Facebook</code>.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/rakekw.png" /></p>
<figcaption><strong>Figure 2:</strong> Guardian Keywords generated by RAKE algorithm.</figcaption>
<p>Questions also arose over how much of the input text the algorithm should be applied to. Though its best use cases are often described as shorter texts, like abstracts, the <a href="https://www.collinsdictionary.com/dictionary/english/standfirst"><code class="highlighter-rouge">standfirst</code></a> content used to describe articles has a tendency to cram words together without using many stop words. A purely constructed example of this might be: <em>Canadian Whistle blower breaks Cambridge Analytica data scandal story</em>. As by many standards this doesn’t contain any stop words, it would form a single keyword candidate, ultimately leading to unhelpfully long results. Because of this, full articles were given as inputs instead, which performed more successfully.</p>
<h3 id="comparisons-with-twitter-data">Comparisons with Twitter data</h3>
<p>Generating TF-IDF scores for unigrams and bigrams across both datasets, comparisons of the content can be made (<span style="color:#018ed5"> █ The Guardian</span>, <span style="color:#e91f63;"> █ Twitter</span>). With similar pre-processing steps taken for each, an initial observation finds differences in structure and consistency. The Twitter data has far more low information words and media specific language (e.g. retweeted). Query terms within the Twitter data are also far more impactful to the scores generated: as an item of content only needs one occurrence of a query term to be collected, the ratio of query terms to non-query terms is obviously higher in shorter texts. A possible way to combat this might be to normalise scores based on content length, though this has not been applied here.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/ngrams1.png" /></p>
<figcaption><strong>Figure 3:</strong> Top 20 unigrams and bigrams for Guardian articles.</figcaption>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/twitter.png" /></p>
<figcaption><strong>Figure 4:</strong> Top 20 unigrams and bigrams for Twitter dataset.</figcaption>
<p>As noted previously, the Guardian content is more concerned with UK affairs, e.g. the <code class="highlighter-rouge">vote leave</code> campaign. Perhaps because Twitter has more American users, Brexit related affairs do not show up at all within the Twitter rankings displayed. Though there might not have been time for stories to circulate, and the articles here do not represent total media coverage, one of the key actors, Aleksandr Kogan, seems not to have received as much attention on Twitter. To some extent we could argue that the main issue is the general social practices Facebook facilitates between data and political organisations. Facebook is the only actor here that has a consistent relationship with society, and naturally receives the most attention.</p>
<p>Though better representations could have been achieved with the Twitter results, it was decided not to pursue even more cleaning steps. As noted before, this dataset was especially messy. Working with the Guardian articles, though some pre-processing was still needed, was a real breath of fresh air.</p>
<h3 id="time-series-data-over-the-last-8-years">Time series data over the last 8 years</h3>
<p>Working with some of the top keywords generated by the paper, additional data was collected from the Guardian’s API, querying posts between <code class="highlighter-rouge">2010</code> and the present. A time series of articles per query is detailed below.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/timeseries.png" /></p>
<figcaption><strong>Figure 5:</strong> Time series of query result counts</figcaption>
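<p>The bucketing behind a chart like this is simple to sketch, assuming each API result carries an ISO publication date (the Guardian’s results include a <code class="highlighter-rouge">webPublicationDate</code> field); the dates below are illustrative:</p>

```python
from collections import Counter

def monthly_counts(publication_dates):
    # "2018-03-17T12:00:00Z"[:7] -> "2018-03", i.e. one bucket per month.
    return Counter(date[:7] for date in publication_dates)

counts = monthly_counts([
    "2013-06-10T08:00:00Z",  # around the time of the Snowden leak
    "2013-06-21T09:30:00Z",
    "2018-03-17T12:00:00Z",  # the initial Cambridge Analytica story
])
```

<p>Plotting one such counter per query term produces the overlaid series shown above.</p>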
<p>Using the titles of the articles to determine subtopics within each query, the most related bi-grams can also be shown below. Only <code class="highlighter-rouge">privacy</code> and <code class="highlighter-rouge">data protection</code> were displayed, as the others were more generalised and of less interest.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-3/privacy.png" /></p>
<figcaption><strong>Figure 6:</strong> Most common bi-grams per query (privacy, data protection)</figcaption>
<p>It seems ‘phone hacking’ and ‘Edward Snowden’ are top of the list. Looking back at the time series data, there only seems to be a small peak in 2011 at the time of the phone hacking scandal, whilst around the time of the Snowden leak there is a greater peak in data protection, privacy and the internet. Facebook seems to have received most attention in mid-2016, then peaked again recently. It’s unclear to my knowledge what happened in 2015 regarding data protection, but it seems like someone must have had a wild month or two. Another point of interest is that since 2016 Facebook has maintained more coverage than the internet more generally, something that feeds into the narrative that the internet is becoming merely the walled gardens of the giant social media platforms.</p>
<hr />
<h4 id="read-next-part-4-reflections">Read Next: <a href="/digitalcitizens/posts/2018/data-politics-and-democracy-part-4">Part 4: Reflections</a></h4>
<hr />
<h3 id="references">References</h3>
<ul>
<li>Rose, S., Engel, D., Cramer, N., and Cowley, W. 2010. Automatic Keyword Extraction from Individual Documents. In: Berry, M.W. and Kogan, J. ed. <em>Text Mining: Applications and Theory</em>. UK: Wiley-Blackwell. pp.1-20. <a href="https://www.researchgate.net/publication/227988510_Automatic_Keyword_Extraction_from_Individual_Documents">Also available online</a></li>
</ul>
<h3 id="supporting-resources">Supporting resources</h3>
<ul>
<li><a href="https://github.com/winstonjay/digitalcitizens/tree/master/api_tools/the_guardian">digitalcitizens/api_tools/the_guardian at master · winstonjay/digitalcitizens · GitHub</a></li>
<li><a href="https://github.com/winstonjay/digitalcitizens/blob/master/text_tools/rake.py">digitalcitizens/rake.py at master · winstonjay/digitalcitizens · GitHub</a></li>
<li><a href="https://github.com/winstonjay/digitalcitizens/blob/master/notebooks/guardian_fb_ca.ipynb">digitalcitizens/guardian_fb_ca.ipynb at master · winstonjay/digitalcitizens · GitHub</a></li>
</ul>Karl SimsThis post will detail investigations into content created by the Guardian, a key player in the dissemination of the Facebook/Cambridge Analytica story. It will use computational methods such as keyword extraction, also making comparisons with the previously collected data from Twitter.Data, politics and democracy part 2: Twitter reactions to Facebook’s so-called ‘data leak’2018-04-12T22:22:45+00:002018-04-12T22:22:45+00:00/digitalcitizens/posts/2018/data-politics-and-democracy-part-2<p>Using data collected whilst the Facebook/Cambridge Analytica story was first gaining momentum, this post looks at the responses made to it via Twitter. Experimenting with word embeddings and other computational methods, it aims to map key dimensions that highlight the contextual relationships between different sentiments across the dataset.</p>
<script src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.4/latest.js?config=TeX-MML-AM_CHTML" async=""></script>
<h2 id="introduction">Introduction</h2>
<p>This section of the series can be outlined as follows: first, a brief rationale is given on why Twitter is being used as a site of research rather than Facebook; second, the analytical methods are introduced; third, findings are discussed; and finally a brief conclusion is made.</p>
<h3 id="twitter-over-facebook">Twitter over Facebook</h3>
<p>It is probably fitting to answer why Twitter is being used to explore an issue most concerning Facebook users. One answer is simply that it is easier and cheaper. Another is that Twitter is arguably far more clearly configured for the expression of public sentiment.</p>
<p><strong>Collection cost</strong></p>
<p>It is cheaper to collect the kind of public response data in mind on Twitter rather than Facebook due to the design of their data collection services. For aggregating posts in real time, Twitter has its <a href="https://developer.twitter.com/en/docs/tweets/filter-realtime/overview">Streaming API</a>, which is open for anyone to use. The closest thing Facebook has to this is its <a href="https://developers.facebook.com/docs/public_feed/">Public Feed API</a>, access to which is restricted to a limited set of pre-approved ‘media publishers’. To access Facebook user data at scale, you must either pay, have special institutional privileges, provide a widely-used service or pretty much trick users into sharing it with you.</p>
<p><strong>What’s being shared</strong></p>
<p>Regarding the character of the content created within each site, there are specific design aspects to each that could have a formative effect on what is shared. This presents differences in the usefulness of each for this investigation. On Facebook, a user principally addresses their ‘friends’; on Twitter, their ‘followers’. This is reflected in, among other things, the default post visibility settings. Facebook asks, ‘what’s on your mind?’; Twitter asks, ‘what’s happening?’. Ultimately, the difference in what each site purveys as its use is that Facebook is about connecting people, while Twitter is about connecting people to current affairs.</p>
<p><strong>Short-comings</strong></p>
<p>Though it may provide a means for investigating sentiment on an international scale, public posts made on Twitter are undoubtedly not a representative sample of all public opinion, or even of the opinions of all Twitter users. Though it provides large amounts of data, its quality is often hard to determine; as we will see in this section, it can be especially messy at times. These themes and other critical engagements are discussed in more detail in a previous post (<a href="#TODO">Notes on Digital Methods</a>).</p>
<p>For these reasons, though not perfect, Twitter seems like the more appropriate tool for learning about public interaction with current affairs.</p>
<h2 id="approach">Approach</h2>
<p>Approximately 500,000 tweets were collected using Twitter’s <a href="https://developer.twitter.com/en/docs/tweets/filter-realtime/overview">streaming API</a> between the 20th and the 23rd of March 2018, filtering for the query terms <code class="highlighter-rouge">Facebook</code> and <code class="highlighter-rouge">Cambridge Analytica</code>. This was just after the Guardian’s <a href="https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election">initial story</a> was released and was gaining traction across social and broadcast media.
Along with counting hashtag frequencies and word co-occurrences, visualisations generated from word embeddings will be used to form a distant reading of the semantic relationships within the dataset. This experimentation provides a contextual overview of the response to help identify specific attributes for moving forward.</p>
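<p>As a point of reference for how that filtering behaves, the streaming API’s <code class="highlighter-rouge">track</code> parameter treats each comma-separated phrase as an OR, and the words within a phrase as case-insensitive ANDs matched anywhere in the tweet. That logic can be mimicked locally (tweets here are plain strings):</p>

```python
import re

def matches_track(text, track=("facebook", "cambridge analytica")):
    # A tweet matches if, for any phrase, all of its words appear in the text.
    words = set(re.findall(r"\w+", text.lower()))
    return any(all(w in words for w in phrase.split()) for phrase in track)
```

<p>Note the words of a phrase need not be adjacent or in order, which is part of why the collected data casts such a wide net.</p>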
<h3 id="word-embeddings">Word embeddings</h3>
<p>Vector representations of words, or <a href="https://en.wikipedia.org/wiki/Vector_space_model">vector space models</a>, aim to map the semantic similarity of words in continuous vector space. This has advantages over the more traditional bag-of-words model as it provides denser representations of terms. Instead of treating individual terms as unique identifiers, we can embed contextual information within them, for example the similarities that cats and kittens do and don’t have. These come in two essential styles: count based and neural embeddings. Within this investigation they will be used to compare related terms within the dataset.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/dataspace.jpg" /></p>
<p>The intuitions behind word embeddings depend on the <a href="https://en.wikipedia.org/wiki/Distributional_semantics#Distributional_Hypothesis">distributional hypothesis</a>, which implies that semantically similar words occur in similar contexts. As J.R. Firth summarises, <em>‘you shall know a word by the company it keeps’</em> (1957; cited in Jurafsky, D. and James, M. 2009. p692).</p>
<p>Though the definition of what constitutes a context can vary, in this article it will be employed in two distinct ways. One assumes context is created by a window of neighbouring words, for instance two either side; this will be used to build a neural model. The other assumes all words within a tweet share a context, and will be used to measure co-occurrence in a count based manner.</p>
<p><strong>Count based methods</strong></p>
<p>For my own notes, an illustrative example is given, a slight variation on that provided in Grefenstette, E. 2017. The technique for building vector representations shown here also forms the basis for the neural embeddings described later.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \text{... the cute kitten purred ...}\\
& \text{... the old furry cat meowed and purred ...}\\
& \text{... the small furry kitten meowed ...}\\
& \text{... a loud furry old dog barked ...}\\
\end{align} %]]></script>
<p>Say we target the words <code class="highlighter-rouge">kitten</code>, <code class="highlighter-rouge">cat</code>, and <code class="highlighter-rouge">dog</code>. Using the examples above and ignoring stop words (low information words like: ‘the’, ‘a’), we can list the witnessed context words for each as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \textbf{kitten}: & & cute, purred, small, furry, meowed\\
& \textbf{cat} : & & old, furry, meowed, purred\\
& \textbf{dog} : & & loud, furry, old, barked \\
\end{align} %]]></script>
<p>After this small example our complete set of context vocabulary would be: <code class="highlighter-rouge">{cute, purred, small, furry, meowed, old, loud, barked}</code>. Using this generated vocabulary, one way we can create a vector representation for each of our target words is as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{kitten}&=\left[ \begin{array} &1& 1& 1& 1& 1& 0& 0& 0 \end{array} \right]\\
\text{cat} &=\left[ \begin{array} &0& 1& 0& 1& 1& 1& 0& 0 \end{array} \right]\\
\text{dog} &=\left[ \begin{array} &0& 0& 0& 1& 0& 1& 1& 1 \end{array} \right]\\
\end{align} %]]></script>
<p>To do this we denote the presence of the context words, in the order described above, with either a <code class="highlighter-rouge">0</code> (false) or <code class="highlighter-rouge">1</code> (true), depending on whether they appear in the same context as our target word. This is useful as we can now compute the similarity between each pair of words, for instance with cosine similarity.</p>
<script type="math/tex; mode=display">cosine(\pmb u, \pmb v) = \frac {\pmb u \cdot \pmb v}{||\pmb u|| \cdot ||\pmb v||}</script>
<p><small>(The numerator of the equation here is the <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a> of the two vectors and the denominator is the product of the two <a href="https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm">Euclidean norms</a>.)</small></p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
cosine(\text{kitten}, \text{dog}) & \approx 0.22\\
cosine(\text{cat}, \text{dog}) & \approx 0.50\\
cosine(\text{cat}, \text{kitten}) & \approx 0.67\\
\end{align} %]]></script>
<p>Computing this, as expected from this completely constructed example, <code class="highlighter-rouge">kitten</code> is most like <code class="highlighter-rouge">cat</code>. WOW! how did that happen? Also, because both <code class="highlighter-rouge">cat</code> and <code class="highlighter-rouge">dog</code> have <code class="highlighter-rouge">old</code> and <code class="highlighter-rouge">furry</code> in their contexts, <code class="highlighter-rouge">dog</code> is more like <code class="highlighter-rouge">cat</code> than <code class="highlighter-rouge">kitten</code>.</p>
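<p>The whole toy example can be checked mechanically. A short script building the binary context vectors straight from the example sentences and comparing them:</p>

```python
import math

def cosine(u, v):
    # Dot product over the product of the two Euclidean norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Vocabulary order: cute, purred, small, furry, meowed, old, loud, barked
kitten = [1, 1, 1, 1, 1, 0, 0, 0]
cat    = [0, 1, 0, 1, 1, 1, 0, 0]
dog    = [0, 0, 0, 1, 0, 1, 1, 1]

print(round(cosine(kitten, dog), 2))  # 0.22
print(round(cosine(cat, dog), 2))     # 0.5
print(round(cosine(cat, kitten), 2))  # 0.67
```

<p>Even at this scale the ranking is driven entirely by which context words are shared, which is the whole point of the distributional approach.</p>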
<p><strong>Neural embeddings</strong></p>
<p>Beyond count based methods, neural embeddings have also been widely employed to predict vector representations. Using this method, embeddings are normally represented by a matrix of target and context words. Two of the main modelling strategies here are the Continuous Bag-of-Words model (CBOW) and the Skip-gram model. They function in pretty much opposite ways. Here is an illustration comparing both:</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/skipgram.jpg" />
<small>Image from: (Mikolov, T. Chen, K. et al. 2013) <a href="https://arxiv.org/abs/1301.3781">https://arxiv.org/abs/1301.3781</a></small></p>
<p>The CBOW model tries to predict the target word from a given set of context words and the Skip-gram model tries to predict the context words given a target word. In this post the Skip-gram model is used, implemented with the machine learning framework <a href="https://www.tensorflow.org/">Tensorflow</a>. This was done with reference to their demonstration <a href="https://github.com/tensorflow/tensorflow/blob/r1.7/tensorflow/examples/tutorials/word2vec/word2vec_basic.py">word2vec_basic.py</a>. Alterations to the original file have been made to pre-process the data differently and carry out some additional steps.</p>
<p><strong>Visualising vector representations</strong></p>
<p>Word vectors produce high dimensional data. To make sense of the representations visually we can project them into lower dimensional space. This will be done here using t-distributed stochastic neighbour embedding (t-SNE) implemented using the Python package <a href="http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html">Scikit-Learn</a>.</p>
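<p>A minimal sketch of that projection step, with random vectors standing in for the learned embeddings (the real inputs were the trained word vectors):</p>

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the learned embedding matrix: 250 words, 16 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(250, 16)).astype(np.float32)

# Project the high-dimensional vectors down to 2-D for plotting.
coords = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(embeddings)
print(coords.shape)
```

<p>Each row of <code class="highlighter-rouge">coords</code> is then scattered and labelled with its word to produce the visualisations below.</p>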
<p><strong>Counting Co-occurrences</strong></p>
<p>The idea of a word co-occurrence matrix is widely used in natural language processing. As a simple extra piece of analysis, the top co-occurrences of some terms of interest from within the dataset will be presented. The terms of interest are: <code class="highlighter-rouge">data</code>, <code class="highlighter-rouge">privacy</code>, <code class="highlighter-rouge">people</code>, <code class="highlighter-rouge">delete</code>, <code class="highlighter-rouge">trust</code>, <code class="highlighter-rouge">users</code>. This is presented in a collection of bar charts.</p>
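<p>With tweets treated as single shared contexts, the counting itself is straightforward. A sketch with made-up tweets:</p>

```python
from collections import Counter

def co_occurrences(tweets, targets):
    # Treat all words in a tweet as sharing one context with each target term.
    counts = {t: Counter() for t in targets}
    for tweet in tweets:
        words = set(tweet.lower().split())
        for target in targets:
            if target in words:
                counts[target].update(words - {target})
    return counts

co = co_occurrences(
    ["delete facebook now", "facebook data breach", "no trust after this breach"],
    ["facebook", "trust"])
```

<p>Taking each counter’s <code class="highlighter-rouge">most_common</code> entries gives one bar chart per term of interest.</p>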
<h3 id="noise-within-the-data">Noise within the data</h3>
<p>As discussed in a previous post, <a href="/digitalcitizens/posts/2018/twitter-spam-and-ham">Twitter Spam and Ham</a>, the dataset collected here was especially noisy. As a reminder, this came in the form of spam targeting the trending topic, but also tweets in a wide variety of languages. The topic followed was an internationally discussed issue; however, it didn’t help that the query terms were organisation names instead of words belonging to the English language.</p>
<h3 id="approach-summary">Approach summary</h3>
<p>The methods for investigating the data here are experimental, in a way that tries to learn about methodological approaches and the data simultaneously. The main strategy being employed is to use distributional semantics to link common terms and, in doing so, see what more can be understood about the discussions within the sample.</p>
<h2 id="findings">Findings</h2>
<h3 id="hashtags">Hashtags</h3>
<p>Looking at hashtag frequencies, we can see that <code class="highlighter-rouge">deleteFacebook</code> was the most popular – something that was also widely reported by news organisations. Whilst filtering spam it was noted that many tweets contained nothing but this tag repeated. Some of these tweets looked like they came from automated accounts, others more natural. This posed some questions of trend manipulation, but after counting document frequencies rather than repeats within tweets, the trend did not change. Below, the top ten tags can be seen with the query terms (‘Facebook’, ‘CambridgeAnalytica’) filtered from the collection. Taking this approach, there doesn’t seem to be anything too overt to help explore the topic in any new directions.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/tags.png" /></p>
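<p>The document-frequency counting described above can be sketched as follows; the example tweets are invented:</p>

```python
import re
from collections import Counter

def top_hashtags(tweets, n=10, exclude=("facebook", "cambridgeanalytica")):
    tags = Counter()
    for tweet in tweets:
        # Count each tag once per tweet (document frequency, not raw repeats),
        # and drop the query terms themselves.
        for tag in set(re.findall(r"#(\w+)", tweet.lower())):
            if tag not in exclude:
                tags[tag] += 1
    return tags.most_common(n)

top = top_hashtags([
    "#DeleteFacebook #Facebook scandal",
    "#deletefacebook again",
    "thoughts on #privacy",
])
```

<p>Using a set per tweet is what guards the chart against the repeated-tag spam noted above.</p>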
<h3 id="word-embeddings-1">Word embeddings</h3>
<p>Running the Tensorflow model described previously for 10,000 iterations, it finished with an average loss of around 4% in predicting contexts of given terms within the training data. How well the model performs in practice is obviously harder to determine. The size and quality of the data put into the model will not give a good representation of the English language as a whole; it may, however, have the potential to map corpus-specific terms and ideas. Below is the t-SNE visualisation of the word embeddings created. What we should hope to see is that similar terms end up clustering near each other. What is projected is the <code class="highlighter-rouge">most common 250 terms</code> within the dataset after the removal of stopwords.</p>
<p><img class="big-img" src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/tweets4_tsne.png" /></p>
<p>It is argued that t-SNE visualisations lend themselves to being easily misread (Wattenberg, et al., 2016); hopefully that won’t be the case here. Running the visualisation program several times, the intuitive understanding is that while global positioning within the graph tends to vary somewhat, locally relevant terms produce more repeatable results. For example, the small group of un-filtered French stop words <code class="highlighter-rouge">vous, mais, pour, dans</code> always clustered. Logically, French words are not likely to appear in the same context as English ones, so that seems correct. Common bigrams such as <code class="highlighter-rouge">fake news</code>, <code class="highlighter-rouge">social platforms</code>, <code class="highlighter-rouge">public security</code> seem to have clustered also, as have variations in tense, pluralisation etc.</p>
<p><strong>Annotating the Space</strong></p>
<p>Zooming in on the bottom left of the graph locates the main areas of interest in this inquiry. Qualitative annotations have been made to bring further structure to the space; this is employed as a method of communication, not classification. Being an issue centred on the use of data, several dimensions can be said to emerge: the business or economics of data; its politics, or how data is used and regulated; and the individual securities and privacies of users. These obviously overlap and are in an active state of interplay.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/annotated.jpg" /></p>
<p>The annotation of <code class="highlighter-rouge">sociality</code> was included as it points to language use that is clustered because the corpus comes from social media. The words <code class="highlighter-rouge">like</code>, <code class="highlighter-rouge">share</code>, <code class="highlighter-rouge">follow</code>, <code class="highlighter-rouge">post</code>, though they have become more common in general language use, would not be as present within a book, for instance. Recalling this, and the way the dataset was especially subject to spam, is a reminder that the logic social media platforms operate on doesn’t stop. Even during expressions of dissent or outrage, not only are the platforms themselves profiting from these expressions, users are also incorporating their logics to promote their own positions.</p>
<h3 id="co-occurrences">Co-occurrences</h3>
<p>The co-occurrences presented here are as expected. Utterances of <code class="highlighter-rouge">trust</code> are most common with utterances of <code class="highlighter-rouge">breach</code>. This method simply provides another perspective for visualising points of interest.</p>
<p><img src="/digitalcitizens/assets/imgs/data-politics-and-democracy-part-2/cooccurances.png" /></p>
<h2 id="summary">Summary</h2>
<p>Beyond being a DIY exercise, investigating a topic through Twitter that has already been covered extensively within the news doesn’t yield much new information. We can see people are talking about a breach of trust, #deleteFacebook and the data scandal in relation to politics – as was reported. Evaluating the methods used has also provided challenges, particularly with the neural embeddings, as the data in reduced-dimensional visualisations is subject to compression. Overall, this has provided a way to think about the initial reactions to the topic visually, and differently to other approaches that have been taken within this blog. Its results will be used for further discussion in a later post.</p>
<hr />
<h4 id="read-next-part-3-analysis-of-the-guardians-content">Read Next: <a href="/digitalcitizens/posts/2018/data-politics-and-democracy-part-3">Part 3: Analysis of The Guardian’s content</a></h4>
<hr />
<h3 id="references">References</h3>
<ul>
<li>Grefenstette, E. 2017. <em>Lecture 2a- Word Level Semantics</em>. [Online]. Available from: <a href="https://github.com/oxford-cs-deepnlp-2017/lectures/blob/master/Lecture%202a-%20Word%20Level%20Semantics.pdf">lectures/Lecture 2a- Word Level Semantics.pdf at master · oxford-cs-deepnlp-2017/lectures · GitHub</a></li>
<li>Jurafsky, D. and James, M. 2009. <em>Speech and Language Processing</em>. Second Ed. London, UK: Pearson Education Ltd. (<a href="https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf">Third Ed is available here online</a>)</li>
<li>Maaten, L. van der and Hinton, G. 2008. Visualizing data using t-SNE. <em>Journal of Machine Learning Research</em>. 9, pp.2579-2605. [Online]. Available from: <a href="http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf">http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf</a></li>
<li>Mikolov, T. Chen, K. et al. 2013. <em>Efficient Estimation of Word Representations in Vector Space</em>. [Online]. Available from: <a href="https://arxiv.org/abs/1301.3781">https://arxiv.org/abs/1301.3781</a></li>
<li>Wattenberg, et al., 2016. <em>How to Use t-SNE Effectively</em>, Distill. [Online]. Available from: <a href="http://doi.org/10.23915/distill.00002">http://doi.org/10.23915/distill.00002</a></li>
<li><a href="https://www.tensorflow.org/tutorials/word2vec">Vector Representations of Words - TensorFlow</a></li>
<li><a href="http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/">Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick</a></li>
</ul>
<h3 id="supporting-resources">Supporting resources:</h3>
<ul>
<li><a href="https://github.com/winstonjay/digitalcitizens/blob/master/text_tools/word2vec_basic.py">digitalcitizens/word2vec_basic.py at master · winstonjay/digitalcitizens · GitHub</a></li>
<li><a href="https://github.com/winstonjay/digitalcitizens/blob/master/notebooks/ca_fb_notes.ipynb">digitalcitizens/ca_fb_notes.ipynb at master · winstonjay/digitalcitizens · GitHub</a></li>
</ul>
<h3 id="additional-reading">Additional reading:</h3>
<ul>
<li><a href="http://annabellelukin.edublogs.org/files/2013/08/Firth-JR-1962-A-Synopsis-of-Linguistic-Theory-wfihi5.pdf">Firth, John R. “A synopsis of linguistic theory, 1930-1955.” (1957): 1-32.</a></li>
<li><a href="http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf">Mikolov, Tomas, et al. “Distributed representations of words and phrases and their compositionality.” Advances in neural information processing systems. 2013.</a></li>
<li><a href="https://www.datacamp.com/community/tutorials/lda2vec-topic-model">LDA2vec: Word Embeddings in Topic Models (article) - DataCamp</a></li>
<li><a href="https://github.com/MaxwellRebo/awesome-2vec">GitHub - MaxwellRebo/awesome-2vec: Curated list of 2vec-type embedding models</a></li>
<li><a href="http://nlp.town/blog/anything2vec/">Anything2Vec, or How Word2Vec Conquered NLP</a></li>
</ul>Karl SimsUsing data collected whilst the Facebook/Cambridge Analytica story was first gaining momentum, this post looks at the responses made to it via Twitter. Experimenting with word embeddings and other computational methods, it aims to map key dimensions that highlight the contextual relationships between different sentiments across the dataset.Data, politics and democracy part 1: Introduction2018-04-12T22:22:17+00:002018-04-12T22:22:17+00:00/digitalcitizens/posts/2018/data-politics-and-democracy-part-1<p>Presenting a series of forthcoming posts related to the recent Facebook and Cambridge Analytica scandal, this introduction sets out the investigation and records its aims.</p>
<p>The scandal involving Cambridge Analytica’s apparent misuse of Facebook data is an especially relevant piece of current affairs for an investigation of contemporary civic issues. It ties into the wider issues surrounding data regulation, the economic foundations of web 2.0 and normative democratic ideals.</p>
<p>I won’t go over all the exact story details here; for a point of reference, <a href="https://www.theguardian.com/news/2018/mar/17/cambridge-analytica-facebook-influence-us-election">here</a> is probably a good start, with a whole collection of related articles <a href="https://www.theguardian.com/news/series/cambridge-analytica-files">here</a>.</p>
<h2 id="the-plan">The plan</h2>
<p>Instead of cramming everything into a single post, as has been done previously, this investigation will be split into four parts. This allows a more in-depth, sustained inquiry, with at least one full article given over to reflection.</p>
<p>The sections planned, including this current post, can be outlined and summarised as follows:</p>
<p><strong>Part 1: Introduction</strong></p>
<p>You are here. This will be whatever this page is right now.</p>
<p><strong>Part 2: Twitter reactions to Facebook’s ‘data leak’</strong></p>
<p>Initial reactions made via Twitter will be studied whilst experimenting with word embeddings. This aims to map contextual relationships within the response.</p>
<p><strong>Part 3: Analysis of The Guardian’s content</strong></p>
<p>This will detail investigations into content created by the Guardian, a key player in the dissemination of the Facebook/Cambridge Analytica story. It will use computational methods such as keyword extraction, also making comparisons with the previously collected data from Twitter.</p>
<p><strong>Part 4: Reflections</strong>
Discussions will be made in relation to theories of digital privacy, also reviewing the regulatory conditions for Facebook and the underlying logic of web 2.0. It will finish by assessing some normative implications for democracies and the roles different forms of media play in addressing such issues.</p>
<hr />
<h4 id="read-next-part-2-twitter-reactions-to-facebooks-data-leak">Read Next: <a href="/digitalcitizens/posts/2018/data-politics-and-democracy-part-2">Part 2: Twitter reactions to Facebook’s ‘data leak’</a></h4>
<hr />
<h3 id="additional-resources">Additional resources</h3>
<ul>
<li><a href="https://www.youtube.com/watch?v=X5g6IJm7YJQ">Cambridge Analytica whistleblower Christopher Wylie appears before MPs - watch live - YouTube</a></li>
<li><a href="https://www.youtube.com/watch?v=6ValJMOpt7s">Mark Zuckerberg testifies on Capitol Hill (full Senate hearing) - YouTube</a></li>
<li><a href="https://www.youtube.com/watch?v=N_zlN7BXFm8">Facebook’s F8 developer conference 2018 replay - YouTube</a></li>
</ul>Karl SimsPresenting a series of forthcoming posts related to the recent Facebook and Cambridge Analytica scandal, this introduction sets out the investigation and records its aims.Twitter Spam and Ham2018-04-01T17:00:43+00:002018-04-01T17:00:43+00:00/digitalcitizens/posts/2018/twitter-spam-and-ham<p>Noting the wrangling needed to make Twitter data more usable and some make-do solutions employed. This was supposed to be part of an upcoming post but has been separated to make the rest of the original article flow better.</p>
<p><img src="/digitalcitizens/assets/imgs/twitter-spam-and-ham/tweep.jpg" /></p>
<p>Though Twitter data is generally quite noisy and requires a lot of cleaning for analysis, what was collected for an upcoming post, <a href="#TODO">Part 2: Twitter reactions to Facebook’s ‘data leak’</a>, was initially pretty much unusable. This was because the collection had been heavily targeted by spam, and also because the query terms were not rooted in any particular language. This post is mainly recorded for future reference when working with Twitter data.</p>
<h3 id="spam-tweets-within-the-dataset">Spam tweets within the dataset</h3>
<p><strong>The problem:</strong></p>
<p>Initial analysis revealed that the tweets filtered for had been significantly targeted by spam and automated accounts. Without building a comprehensive spam-filtering model, this was most concretely revealed through examination of hashtag co-occurrences. An illustrative example of this kind of spam could be 1,000 tweets with exactly the hashtags:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#foo #tweet4tweet #bar #MontyPython
#fooBar #flyingCircus #CambridgeAnalytica
</code></pre></div></div>
<p>One set like this contained over 2,000 tweets with the same 6 hashtags and only 100 unique words between them. This was particularly disruptive to any computational analysis using any kind of frequency measure.</p>
<p><strong>Make do solution:</strong></p>
<p>Applying some simple conditional logic, a tweet can be judged as spam if it either has too many hashtags, or has more than a threshold number of hashtags whose exact set appears more than a given limit of times within the tweet collection. This can be expressed in Python as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def is_spam(tags: tuple, ceil=10, floor=4, limit=40):
    # Spam if the tag set is too large, or if it is sizeable and the
    # exact same set recurs too often across the whole collection.
    return len(tags) > ceil or (len(tags) > floor and collection_freq[tags] > limit)
</code></pre></div></div>
<p>The parameters given to the algorithm ended up filtering <code class="highlighter-rouge">7%</code> of tweets. <a href="https://arxiv.org/pdf/1703.03107.pdf">This paper</a>, studying bots on Twitter in more detail, estimates that around <code class="highlighter-rouge">9-15%</code> of active Twitter accounts are bots, so while almost in the same ballpark, the filtering here could have been more aggressive. In total 36,925 tweets were removed, spanning only 27 distinct tag sets regarded as spam.</p>
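<p>For future reference, the pieces above can be sketched end to end. This assumes <code class="highlighter-rouge">collection_freq</code> is simply a frequency count over each tweet’s (sorted) hashtag tuple; the sample tag sets below are made up for illustration.</p>

```python
from collections import Counter

# Hypothetical sample: each tweet reduced to a sorted tuple of its hashtags.
tag_sets = [("cambridgeanalytica", "facebook")] * 3 + \
           [("bar", "cambridgeanalytica", "flyingcircus", "foo",
             "fooBar", "montypython", "tweet4tweet")] * 50

# How often each exact tag set occurs across the whole collection.
collection_freq = Counter(tag_sets)

def is_spam(tags, ceil=10, floor=4, limit=40):
    # Spam if too many hashtags, or a sizeable tag set that repeats too often.
    return len(tags) > ceil or (len(tags) > floor and collection_freq[tags] > limit)

# Keep only the tweets whose tag sets were not flagged.
kept = [tags for tags in tag_sets if not is_spam(tags)]
```

<p>Here the 50 tweets sharing the same 7-tag set are dropped, while the 3 ordinary tweets survive.</p>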
<p>Whether the parameters or method used were truly effective is hard to say, but they did render the dataset a lot more usable and seemingly less noisy. Spam versus not-spam is one of the classic examples given when teaching classification algorithms, and merely using something like a simple logistic regression model, more effective results could be achieved. However, because Twitter’s API terms of service restrict the sharing of datasets, it’s quite hard to find labelled data to train a model on. There may be pre-built solutions, though the problem may be very corpus specific.</p>
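<p>As a rough sketch of what that classifier route might look like, here is a minimal bag-of-words logistic regression trained by gradient descent. The labelled tweets are entirely made up (as noted above, real labelled data is hard to come by), and a library implementation would normally be used instead.</p>

```python
import math

# Hypothetical labelled tweets: 1 = spam, 0 = not spam.
data = [
    ("win free followers tweet4tweet follow back", 1),
    ("free gift card click now follow", 1),
    ("follow back free followers now", 1),
    ("cambridge analytica harvested facebook data", 0),
    ("facebook faces questions over data privacy", 0),
    ("regulators discuss data privacy rules", 0),
]

vocab = sorted({word for text, _ in data for word in text.split()})
idx = {word: i for i, word in enumerate(vocab)}

def featurise(text):
    # Bag-of-words count vector over the training vocabulary.
    x = [0.0] * len(vocab)
    for word in text.split():
        if word in idx:
            x[idx[word]] += 1.0
    return x

# Plain stochastic gradient descent on the logistic loss.
w, b, lr = [0.0] * len(vocab), 0.0, 0.3
for _ in range(300):
    for text, y in data:
        x = featurise(text)
        p = 1.0 / (1.0 + math.exp(-(b + sum(wi * xi for wi, xi in zip(w, x)))))
        g = p - y  # gradient of the logistic loss w.r.t. the logit
        b -= lr * g
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]

def predict(text):
    z = b + sum(wi * xi for wi, xi in zip(w, featurise(text)))
    return int(1.0 / (1.0 + math.exp(-z)) > 0.5)
```

<p>On this toy data the model separates the two classes easily; in practice the hard part is, as said, obtaining labels rather than fitting the model.</p>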
<h3 id="language-barriers">Language Barriers</h3>
<p><strong>The problem:</strong></p>
<p>Only being fluent in English, and knowing just enough French and Spanish to identify things like stop words (common low-information words), tweets in other languages are not much use to me. Forgetting to collect the language field at collection time caused more work here.</p>
<p><strong>Make do solution:</strong></p>
<p>Depending on how aggressive an approach was needed, several techniques worked reasonably well. An initial step was to remove certain character ranges: pretty much everything outside ASCII was removed with the regular expression <code class="highlighter-rouge">[^\x00-\x7F]+</code>. This meant removing emoji and all sorts of other things, but they were not of interest here.</p>
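<p>That stripping step amounts to a one-line substitution. In this sketch each non-ASCII run is replaced with a space rather than deleted outright (an assumption, so that adjacent words don’t get fused together), followed by a whitespace tidy-up:</p>

```python
import re

# Matches any run of characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def strip_non_ascii(text):
    # Replace each non-ASCII run (emoji, accented letters, other scripts)
    # with a space, then collapse any repeated whitespace.
    return " ".join(NON_ASCII.sub(" ", text).split())
```
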
<p>Another strategy was simply adding more items to the stop word list, including words from different languages. The Python package <a href="https://www.nltk.org/">NLTK</a> provides around 2,400 stop words for 11 languages, so this wasn’t too time-consuming to employ.</p>Karl SimsNoting the wrangling needed to make Twitter data more usable and some make-do solutions employed. This was supposed to be part of an upcoming post but has been separated to make the rest of the original article flow better.Visualising Topics2018-03-27T21:59:16+00:002018-03-27T21:59:16+00:00/digitalcitizens/posts/2018/visualising-topics<p>This post experiments with different ways of visualising topics within a dataset.</p>
<h2 id="introduction">Introduction</h2>
<p>In <em>Theme Detection in Social Media</em>, Daniel Angus (2017) presents tools for visualising textual data with the two software packages <a href="https://info.leximancer.com/">Leximancer</a> and <a href="http://www.discursis.com/">Discursis</a>. Both tools seemed interesting; however, both are closed-source commercial projects, which was a problem. Leximancer was especially interesting, as it uses a probabilistic model for generating networks of ‘concepts’ for analysis. However, as it only provides a 7-day free trial, I was not prepared to commit an extended period of time to understanding the software in depth. Instead it seemed more fruitful to build upon previously used ‘topic extraction’ methods.</p>
<p>Before continuing, the use of the terms <em>‘topics’</em>, <em>‘concepts’</em> and <em>‘themes’</em> should probably be clarified. The literature supporting Latent Dirichlet allocation (LDA) (Blei, D.M., Ng, A.Y. and Jordan, M.I. 2003), and previous analysis within this blog, use the term ‘topic’ to describe a probabilistic distribution across words, with each word in the corpora belonging to each topic set to a varying degree. Throughout this article LDA is used, selecting the top <code class="highlighter-rouge">n</code> scoring words in each topic set to create topic representations. However, Leximancer’s supporting paper describes its seemingly similar representations as ‘concepts’ - though looking at Daniel Angus’s table made with the software (fig 31.1. p.536), I’m not sure how the name ‘Tony Abbot’ is a concept.</p>
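<p>The top-<code class="highlighter-rouge">n</code> selection itself is just an argsort over each topic’s word weights. A minimal sketch, using a mock topic-word matrix (in practice this would come from a fitted LDA model, e.g. scikit-learn’s <code class="highlighter-rouge">LatentDirichletAllocation.components_</code>; the vocabulary and weights here are made up):</p>

```python
import numpy as np

# Mock topic-word weight matrix: rows are topics, columns are vocabulary terms.
vocab = ["citizen", "vote", "gun", "right", "law", "immigr", "daca"]
topic_word = np.array([
    [5.0, 3.0, 0.1, 0.2, 0.1, 4.0, 2.5],
    [6.0, 2.0, 4.5, 3.5, 3.0, 0.1, 0.1],
])

def top_terms(topic_word, vocab, n=3):
    # argsort ascending, keep the last n indices, reverse for descending weight.
    return [[vocab[i] for i in row.argsort()[-n:][::-1]] for row in topic_word]
```

<p>Each returned list is one topic’s representation, ordered from most to least prominent term.</p>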
<p><img src="/digitalcitizens/assets/imgs/visualising-topics/angus.png" /></p>
<p>Both Angus and the Leximancer paper use the term ‘theme’ to describe the annotations that the researcher gives the presented concept sets. So, the ‘theme’ of the example above is government. This seems more fitting than what I described previously as ‘qualitative descriptions’ and will be used in the future.</p>
<p>For final clarification between the terms ‘topics’ and ‘concepts’… they will be used interchangeably because of how Leximancer has described its product, but it’s probably better to think of both as ‘topics’. This clarification has mainly been for myself, as it was getting confusing reading the different uses between papers.</p>
<p>The main body of this article will now consider ways of generating presentable topic visualisations through my own experimentation.</p>
<h3 id="topic-table">Topic Table</h3>
<p>In a previous post, <a href="https://winstonjay.github.io/digitalcitizens/posts/2018/finding-civic-discussions-on-twitter">Finding civic discussions on Twitter - Digital Citizens</a>, generated topics were represented in a table with descriptions of the term groups. This was straight to the point, but it didn’t articulate all the available information efficiently.</p>
<p><img src="/digitalcitizens/assets/imgs/visualising-topics/topics0.png" /></p>
<p>For a start, we could use the weights of each term within the topic to indicate local importance. This brings us on to our first more graphic visualisation.</p>
<h3 id="word-clouds">Word Clouds</h3>
<p>Though they are probably a bit overused and, to be honest, I don’t like them, word clouds do a good job of highlighting specific terms within a body of text. Using the feature weights from the LDA model we can scale the size and adjust the opacity of each topic’s terms to show the most prominent. This was done with a simple Python program that just generates some HTML for display. For something more transportable, generating an SVG would maybe be better and not much harder.</p>
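<p>The HTML generation is little more than string formatting. A sketch of one such function follows; the scaling constants (an 8–32px size range, 0.5–1.0 opacity) are assumptions chosen to resemble the clouds below, not the exact values the original program used.</p>

```python
def topic_to_html(terms):
    # terms: list of (word, weight) pairs with weights normalised to [0, 1].
    spans = []
    for word, weight in terms:
        size = 8 + 24 * weight                    # map weight onto 8-32px
        opacity = min(1.0, 0.5 + 0.5 * weight)    # and onto 0.5-1.0 opacity
        spans.append(
            '<span style="font-size:{:.1f}px; opacity:{:.2f}">{}</span>'
            .format(size, opacity, word))
    return "<p>\n" + "\n".join(spans) + "\n</p>"
```

<p>Calling it once per topic and concatenating the results yields markup of the same shape as the clouds shown here.</p>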
<div class="topics">
<p>
<span style="font-size:12.8px; opacity:0.6989157915826636">want</span>
<span style="font-size:14.8px; opacity:0.7845941476540668">legal</span>
<span style="font-size:10.4px; opacity:0.6006399574753432">card</span>
<span style="font-size:12.1px; opacity:0.6705163172527103">million</span>
<span style="font-size:10.5px; opacity:0.605197795958624">russian</span>
<span style="font-size:11.2px; opacity:0.6340726192400996">obama</span>
<span style="font-size:17.1px; opacity:0.879902191607669">vote</span>
<span style="font-size:18.6px; opacity:0.9434409614054629">immigr</span>
<span style="font-size:12.1px; opacity:0.6702907403521112">deport</span>
<span style="font-size:10.2px; opacity:0.5918130898153987">resid</span>
<span style="font-size:10.8px; opacity:0.6176086017222846">question</span>
<span style="font-size:13.0px; opacity:0.709607902893066">citizen</span>
<span style="font-size:13.8px; opacity:0.740644194118733">path</span>
<span style="font-size:19.5px; opacity:0.9796645503441791">illeg</span>
<span style="font-size:10.7px; opacity:0.6115169199001079">appli</span>
<span style="font-size:12.0px; opacity:0.6683611318636999">dreamer</span>
<span style="font-size:12.7px; opacity:0.6954606675409953">ask</span>
<span style="font-size:27.4px; opacity:1">citizenship</span>
<span style="font-size:10.1px; opacity:0.586188464000722">provid</span>
<span style="font-size:11.1px; opacity:0.6297640970635148">presid</span>
<span style="font-size:12.3px; opacity:0.6786408853090833">democrat</span>
<span style="font-size:16.0px; opacity:0.8344898409523911">trump</span>
<span style="font-size:10.9px; opacity:0.6223836545013627">tax</span>
<span style="font-size:14.6px; opacity:0.7737654338936838">american</span>
<span style="font-size:14.3px; opacity:0.7623223771372962">daca</span>
</p>
<p>
<span style="font-size:14.2px; opacity:0.7589629017380537">privat</span>
<span style="font-size:11.2px; opacity:0.6319205800646737">unit</span>
<span style="font-size:12.8px; opacity:0.6984587489565326">presid</span>
<span style="font-size:8.6px; opacity:0.5263714657278553">order</span>
<span style="font-size:9.0px; opacity:0.5419088646816945">district</span>
<span style="font-size:8.7px; opacity:0.5291984613427372">associ</span>
<span style="font-size:8.9px; opacity:0.5378016079352352">press</span>
<span style="font-size:10.3px; opacity:0.5946661300805604">target</span>
<span style="font-size:12.2px; opacity:0.6737478932078246">major</span>
<span style="font-size:8.0px; opacity:0.5">polici</span>
<span style="font-size:9.2px; opacity:0.5480288168070815">pull</span>
<span style="font-size:14.7px; opacity:0.7795117564859888">today</span>
<span style="font-size:9.5px; opacity:0.5635463881694369">polic</span>
<span style="font-size:12.4px; opacity:0.6844247371354533">attorney</span>
<span style="font-size:14.4px; opacity:0.7650056026047374">new</span>
<span style="font-size:13.2px; opacity:0.7146095426142192">report</span>
<span style="font-size:9.9px; opacity:0.5784586290179466">repres</span>
<span style="font-size:13.6px; opacity:0.7353676304824931">general</span>
<span style="font-size:13.7px; opacity:0.7380326496673724">offic</span>
<span style="font-size:15.6px; opacity:0.8170507135710767">pleas</span>
<span style="font-size:18.9px; opacity:0.9540452656858309">state</span>
<span style="font-size:9.0px; opacity:0.5426181323933845">join</span>
<span style="font-size:12.4px; opacity:0.6835757083002943">lawsuit</span>
<span style="font-size:14.7px; opacity:0.7795329904110836">trump</span>
<span style="font-size:23.0px; opacity:1">citizen</span>
</p>
<p>
<span style="font-size:15.6px; opacity:0.8159796525700841">amend</span>
<span style="font-size:18.0px; opacity:0.9181275058641637">american</span>
<span style="font-size:17.9px; opacity:0.9131965637256256">arm</span>
<span style="font-size:14.7px; opacity:0.778624034217763">militari</span>
<span style="font-size:20.6px; opacity:1">need</span>
<span style="font-size:21.2px; opacity:1">everi</span>
<span style="font-size:15.3px; opacity:0.8030989601617486">respons</span>
<span style="font-size:19.2px; opacity:0.9660185842038498">weapon</span>
<span style="font-size:14.4px; opacity:0.7678360942488228">pay</span>
<span style="font-size:16.0px; opacity:0.8322542159957396">state</span>
<span style="font-size:14.3px; opacity:0.7625383382582072">assault</span>
<span style="font-size:15.7px; opacity:0.8194277479159369">kill</span>
<span style="font-size:16.2px; opacity:0.843014792021333">elect</span>
<span style="font-size:15.7px; opacity:0.8208733348945001">want</span>
<span style="font-size:18.5px; opacity:0.9357030320397816">protect</span>
<span style="font-size:15.8px; opacity:0.8256631009543693">constitut</span>
<span style="font-size:17.3px; opacity:0.8887011432879071">govern</span>
<span style="font-size:14.8px; opacity:0.7839367739795386">democraci</span>
<span style="font-size:16.4px; opacity:0.850280338420814">nra</span>
<span style="font-size:19.5px; opacity:0.9807894790617788">vote</span>
<span style="font-size:19.4px; opacity:0.9731766768911939">peopl</span>
<span style="font-size:32.0px; opacity:1">citizen</span>
<span style="font-size:16.1px; opacity:0.8389928389615045">use</span>
<span style="font-size:25.5px; opacity:1">gun</span>
<span style="font-size:24.9px; opacity:1">right</span>
</p>
<p>
<span style="font-size:12.5px; opacity:0.6891106657334628">wrong</span>
<span style="font-size:15.5px; opacity:0.8131877116063265">live</span>
<span style="font-size:13.5px; opacity:0.7309697508942592">year</span>
<span style="font-size:13.5px; opacity:0.7275532469582027">world</span>
<span style="font-size:22.9px; opacity:1">citizenship</span>
<span style="font-size:11.8px; opacity:0.6590709283374999">born</span>
<span style="font-size:12.1px; opacity:0.6701786860608457">american</span>
<span style="font-size:12.5px; opacity:0.6890431919784514">peopl</span>
<span style="font-size:14.6px; opacity:0.7737449032703531">nation</span>
<span style="font-size:11.7px; opacity:0.6541161386924143">class</span>
<span style="font-size:15.8px; opacity:0.8240753961725249">like</span>
<span style="font-size:13.1px; opacity:0.7106928350890896">home</span>
<span style="font-size:12.3px; opacity:0.6806412650223472">right</span>
<span style="font-size:13.4px; opacity:0.7237478607594958">uk</span>
<span style="font-size:15.9px; opacity:0.8272329098283686">eu</span>
<span style="font-size:11.8px; opacity:0.6574449832427551">great</span>
<span style="font-size:11.7px; opacity:0.6531275630706719">india</span>
<span style="font-size:19.3px; opacity:0.9709444447257627">countri</span>
<span style="font-size:15.7px; opacity:0.8223020046012834">work</span>
<span style="font-size:28.7px; opacity:1">citizen</span>
<span style="font-size:12.8px; opacity:0.7020638309814173">make</span>
<span style="font-size:11.6px; opacity:0.6514125523627093">know</span>
<span style="font-size:12.2px; opacity:0.6761524513240597">british</span>
<span style="font-size:11.8px; opacity:0.6597917797095625">help</span>
<span style="font-size:12.1px; opacity:0.6727695845936965">dual</span>
</p>
<p>
<span style="font-size:29.7px; opacity:1">citizen</span>
<span style="font-size:20.5px; opacity:1">abid</span>
<span style="font-size:14.5px; opacity:0.7701850241780777">good</span>
<span style="font-size:16.0px; opacity:0.8316605188115855">american</span>
<span style="font-size:11.7px; opacity:0.6535585731441063">time</span>
<span style="font-size:13.5px; opacity:0.7305268089925516">countri</span>
<span style="font-size:16.1px; opacity:0.8385672622419913">peopl</span>
<span style="font-size:14.3px; opacity:0.7613909364010201">senior</span>
<span style="font-size:13.5px; opacity:0.7273005886820005">tri</span>
<span style="font-size:14.4px; opacity:0.7681973203648307">want</span>
<span style="font-size:11.8px; opacity:0.6594057177579501">come</span>
<span style="font-size:26.3px; opacity:1">law</span>
<span style="font-size:12.3px; opacity:0.6808409963545365">yes</span>
<span style="font-size:13.0px; opacity:0.7097775148895489">need</span>
<span style="font-size:12.1px; opacity:0.6722177673111994">everi</span>
<span style="font-size:14.4px; opacity:0.765452086471164">make</span>
<span style="font-size:14.9px; opacity:0.7883644282798128">say</span>
<span style="font-size:16.2px; opacity:0.8427772042655686">know</span>
<span style="font-size:15.2px; opacity:0.7996592433665178">gun</span>
<span style="font-size:14.9px; opacity:0.7857205978130463">think</span>
<span style="font-size:14.0px; opacity:0.7520767399907083">stop</span>
<span style="font-size:13.6px; opacity:0.7332726073184557">right</span>
<span style="font-size:12.3px; opacity:0.6804387513574146">becom</span>
<span style="font-size:14.4px; opacity:0.7681012738368265">crimin</span>
<span style="font-size:18.5px; opacity:0.9376426614101575">like</span>
</p>
</div>
<p>Each topic set is bounded by its own box. I haven’t added annotations to describe the topics but this could also easily be done. Straight away you can see not only the most prominent terms but also more clearly the ones that occur across different topics.</p>
<h3 id="gephi-graphs">Gephi Graphs</h3>
<p>To represent these connections between the topics further, we can construct a network of topics linked by their shared terms. Topic roots have been labeled A through E respectively. I don’t think I’ve done that great a job, but here is a visualisation of this in action.</p>
<p><img src="/digitalcitizens/assets/imgs/visualising-topics/gelphi.jpg" /></p>
<p>Using in-degree to increase the scale of nodes and edges differentiates the more common terms across topics. However, beyond the small directional arrows this is not fully communicating the fact that the terms belong to topics.</p>
<p>A property of the network that has been constructed is that (if my graph theory knowledge isn’t failing me) it is a bipartite graph. This means its nodes can be split into two distinct sets: topics can only connect to terms and vice versa. This is represented below in the disjoint red and green nodes.</p>
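<p>Bipartiteness is easy to check programmatically: a graph is bipartite if and only if it can be two-coloured so that no edge joins two nodes of the same colour. A small self-contained sketch (the toy topic-term edges are illustrative, not the real network):</p>

```python
from collections import deque

def is_bipartite(adj):
    # Two-colour the graph with BFS; fail if any edge joins same-coloured nodes.
    colour = {}
    for start in adj:
        if start in colour:
            continue
        colour[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in colour:
                    colour[v] = 1 - colour[u]
                    queue.append(v)
                elif colour[v] == colour[u]:
                    return False
    return True

# Toy topic-term network: topic roots connect only to terms, never to each other.
edges = [("A", "citizen"), ("A", "vote"), ("B", "citizen"), ("B", "gun")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)
```

<p>The two colour classes recovered by the traversal correspond exactly to the topic and term node sets.</p>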
<center>
<img src="/digitalcitizens/assets/imgs/visualising-topics/bipartide.png" />
</center>
<p>It may also be useful to represent the different types of nodes more effectively within the final network visualisation.</p>
<p>Though Gephi makes a wide variety of graph algorithms and visualisations available at the touch of a button, I’m not aware of any way to share these interactively. A big problem with complex graphs is that they’re often quite messy and things overlap, so the ability to adjust and move things about is often helpful.</p>
<h3 id="interactive-d3js-graphs">Interactive D3.js graphs</h3>
<p>Because of advances in web technology it is now easier than ever to make all sorts of dynamic content for websites. <a href="https://d3js.org/">D3.js</a> is quite a popular JavaScript framework designed for manipulating HTML documents based on data. Though it does not provide tools to do everything for you, it focuses on providing a strong foundation for creating your own ‘data driven documents’. After exporting data created in a previous Python program into JSON format, we have an easy and relatively fast way of creating visual representations in the browser.</p>
<div class="big-img graph-box" id="graph1"></div>
<script src="https://d3js.org/d3.v3.min.js"></script>
<script>
var json = {"links":[{"group":0,"source":0,"target":1,"weight":0.8097809886258567},{"group":0,"source":0,"target":2,"weight":0.4796645503441791},{"group":0,"source":0,"target":3,"weight":0.44344096140546285},{"group":0,"source":0,"target":4,"weight":0.3799021916076691},{"group":0,"source":0,"target":5,"weight":0.33448984095239104},{"group":0,"source":0,"target":6,"weight":0.2845941476540668},{"group":0,"source":0,"target":7,"weight":0.2737654338936838},{"group":0,"source":0,"target":8,"weight":0.26232237713729617},{"group":0,"source":0,"target":9,"weight":0.24064419411873295},{"group":0,"source":0,"target":10,"weight":0.20960790289306602},{"group":0,"source":0,"target":11,"weight":0.1989157915826636},{"group":0,"source":0,"target":12,"weight":0.1954606675409953},{"group":0,"source":0,"target":13,"weight":0.17864088530908329},{"group":0,"source":0,"target":14,"weight":0.17051631725271035},{"group":0,"source":0,"target":15,"weight":0.17029074035211123},{"group":0,"source":0,"target":16,"weight":0.16836113186369991},{"group":0,"source":0,"target":17,"weight":0.13407261924009956},{"group":0,"source":0,"target":18,"weight":0.12976409706351483},{"group":0,"source":0,"target":19,"weight":0.12238365450136267},{"group":0,"source":0,"target":20,"weight":0.1176086017222846},{"group":0,"source":0,"target":21,"weight":0.11151691990010791},{"group":0,"source":0,"target":22,"weight":0.10519779595862402},{"group":0,"source":0,"target":23,"weight":0.1006399574753431},{"group":0,"source":0,"target":24,"weight":0.09181308981539874},{"group":0,"source":0,"target":25,"weight":0.08618846400072204},{"group":1,"source":26,"target":10,"weight":0.6242404910757975},{"group":1,"source":26,"target":27,"weight":0.454045265685831},{"group":1,"source":26,"target":28,"weight":0.3170507135710768},{"group":1,"source":26,"target":5,"weight":0.2795329904110836},{"group":1,"source":26,"target":29,"weight":0.2795117564859888},{"group":1,"source":26,"target":30,"weight":0.2650056026047373},{"group":1,"sou
rce":26,"target":31,"weight":0.2589629017380537},{"group":1,"source":26,"target":32,"weight":0.2380326496673724},{"group":1,"source":26,"target":33,"weight":0.2353676304824931},{"group":1,"source":26,"target":34,"weight":0.2146095426142191},{"group":1,"source":26,"target":18,"weight":0.19845874895653254},{"group":1,"source":26,"target":35,"weight":0.18442473713545324},{"group":1,"source":26,"target":36,"weight":0.18357570830029438},{"group":1,"source":26,"target":37,"weight":0.17374789320782455},{"group":1,"source":26,"target":38,"weight":0.13192058006467372},{"group":1,"source":26,"target":39,"weight":0.09466613008056034},{"group":1,"source":26,"target":40,"weight":0.07845862901794665},{"group":1,"source":26,"target":41,"weight":0.06354638816943693},{"group":1,"source":26,"target":42,"weight":0.048028816807081494},{"group":1,"source":26,"target":43,"weight":0.04261813239338457},{"group":1,"source":26,"target":44,"weight":0.04190886468169454},{"group":1,"source":26,"target":45,"weight":0.03780160793523521},{"group":1,"source":26,"target":46,"weight":0.029198461342737236},{"group":1,"source":26,"target":47,"weight":0.02637146572785526},{"group":1,"source":26,"target":48,"weight":0.0},{"group":2,"source":49,"target":10,"weight":1.0},{"group":2,"source":49,"target":50,"weight":0.729330256231359},{"group":2,"source":49,"target":51,"weight":0.7041287140725272},{"group":2,"source":49,"target":52,"weight":0.549031934191717},{"group":2,"source":49,"target":53,"weight":0.5270004038233025},{"group":2,"source":49,"target":4,"weight":0.4807894790617789},{"group":2,"source":49,"target":54,"weight":0.4731766768911939},{"group":2,"source":49,"target":55,"weight":0.46601858420384973},{"group":2,"source":49,"target":56,"weight":0.43570303203978156},{"group":2,"source":49,"target":7,"weight":0.4181275058641637},{"group":2,"source":49,"target":57,"weight":0.41319656372562563},{"group":2,"source":49,"target":58,"weight":0.388701143287907},{"group":2,"source":49,"target":59,"weight":0.3
5028033842081396},{"group":2,"source":49,"target":60,"weight":0.343014792021333},{"group":2,"source":49,"target":61,"weight":0.3389928389615045},{"group":2,"source":49,"target":27,"weight":0.33225421599573957},{"group":2,"source":49,"target":62,"weight":0.32566310095436934},{"group":2,"source":49,"target":11,"weight":0.32087333489450004},{"group":2,"source":49,"target":63,"weight":0.31942774791593687},{"group":2,"source":49,"target":64,"weight":0.3159796525700841},{"group":2,"source":49,"target":65,"weight":0.3030989601617486},{"group":2,"source":49,"target":66,"weight":0.2839367739795387},{"group":2,"source":49,"target":67,"weight":0.278624034217763},{"group":2,"source":49,"target":68,"weight":0.2678360942488228},{"group":2,"source":49,"target":69,"weight":0.26253833825820716},{"group":3,"source":70,"target":10,"weight":0.8613922505351445},{"group":3,"source":70,"target":1,"weight":0.6227735677869802},{"group":3,"source":70,"target":71,"weight":0.47094444472576275},{"group":3,"source":70,"target":72,"weight":0.32723290982836867},{"group":3,"source":70,"target":73,"weight":0.3240753961725249},{"group":3,"source":70,"target":74,"weight":0.3223020046012835},{"group":3,"source":70,"target":75,"weight":0.3131877116063265},{"group":3,"source":70,"target":76,"weight":0.27374490327035306},{"group":3,"source":70,"target":77,"weight":0.23096975089425925},{"group":3,"source":70,"target":78,"weight":0.22755324695820264},{"group":3,"source":70,"target":79,"weight":0.22374786075949585},{"group":3,"source":70,"target":80,"weight":0.21069283508908968},{"group":3,"source":70,"target":81,"weight":0.2020638309814174},{"group":3,"source":70,"target":82,"weight":0.18911066573346283},{"group":3,"source":70,"target":54,"weight":0.18904319197845149},{"group":3,"source":70,"target":51,"weight":0.18064126502234726},{"group":3,"source":70,"target":83,"weight":0.17615245132405963},{"group":3,"source":70,"target":84,"weight":0.17276958459369654},{"group":3,"source":70,"target":7,"weight":0.170
1786860608457},{"group":3,"source":70,"target":85,"weight":0.15979177970956246},{"group":3,"source":70,"target":86,"weight":0.15907092833749994},{"group":3,"source":70,"target":87,"weight":0.15744498324275513},{"group":3,"source":70,"target":88,"weight":0.15411613869241422},{"group":3,"source":70,"target":89,"weight":0.15312756307067182},{"group":3,"source":70,"target":90,"weight":0.1514125523627093},{"group":4,"source":91,"target":10,"weight":0.9035680967930848},{"group":4,"source":91,"target":92,"weight":0.7609625143679942},{"group":4,"source":91,"target":93,"weight":0.5201978300890095},{"group":4,"source":91,"target":73,"weight":0.4376426614101575},{"group":4,"source":91,"target":90,"weight":0.3427772042655685},{"group":4,"source":91,"target":54,"weight":0.33856726224199124},{"group":4,"source":91,"target":7,"weight":0.3316605188115856},{"group":4,"source":91,"target":50,"weight":0.29965924336651784},{"group":4,"source":91,"target":94,"weight":0.28836442827981285},{"group":4,"source":91,"target":95,"weight":0.28572059781304626},{"group":4,"source":91,"target":96,"weight":0.27018502417807766},{"group":4,"source":91,"target":11,"weight":0.2681973203648306},{"group":4,"source":91,"target":97,"weight":0.26810127383682647},{"group":4,"source":91,"target":81,"weight":0.26545208647116403},{"group":4,"source":91,"target":98,"weight":0.26139093640102},{"group":4,"source":91,"target":99,"weight":0.25207673999070834},{"group":4,"source":91,"target":51,"weight":0.23327260731845567},{"group":4,"source":91,"target":71,"weight":0.23052680899255162},{"group":4,"source":91,"target":100,"weight":0.22730058868200048},{"group":4,"source":91,"target":53,"weight":0.20977751488954885},{"group":4,"source":91,"target":101,"weight":0.18084099635453657},{"group":4,"source":91,"target":102,"weight":0.1804387513574146},{"group":4,"source":91,"target":52,"weight":0.17221776731119942},{"group":4,"source":91,"target":103,"weight":0.1594057177579501},{"group":4,"source":91,"target":104,"weight":
0.15355857314410631}],"nodes":[{"name":"A","root":true,"weight":0.0},{"name":"citizenship","root":false,"weight":0.8879713373824474},{"name":"illeg","root":false,"weight":0.7442065079898236},{"name":"immigr","root":false,"weight":0.7326398876832487},{"name":"vote","root":false,"weight":0.793293448002927},{"name":"trump","root":false,"weight":0.7529209189860497},{"name":"legal","root":false,"weight":0.6819182216252548},{"name":"american","root":false,"weight":0.8176092676471843},{"name":"daca","root":false,"weight":0.6748065819266462},{"name":"path","root":false,"weight":0.667884481735625},{"name":"citizen","root":false,"weight":1.0},{"name":"want","root":false,"weight":0.7768118866947804},{"name":"ask","root":false,"weight":0.6534568473860606},{"name":"democrat","root":false,"weight":0.6480860920187518},{"name":"million","root":false,"weight":0.6454918213620816},{"name":"deport","root":false,"weight":0.6454197919901832},{"name":"dreamer","root":false,"weight":0.6448036451814404},{"name":"obama","root":false,"weight":0.6338549176253343},{"name":"presid","root":false,"weight":0.7075269421636929},{"name":"tax","root":false,"weight":0.6301224929575675},{"name":"question","root":false,"weight":0.6285977621382969},{"name":"appli","root":false,"weight":0.6266526160958992},{"name":"russian","root":false,"weight":0.6246348450952127},{"name":"card","root":false,"weight":0.6231794733692361},{"name":"resid","root":false,"weight":0.6203609501441716},{"name":"provid","root":false,"weight":0.6185649406080931},{"name":"B","root":true,"weight":0.0},{"name":"state","root":false,"weight":0.7820565431873914},{"name":"pleas","root":false,"weight":0.6922819869344433},{"name":"today","root":false,"weight":0.680295354023977},{"name":"new","root":false,"weight":0.6756633675486157},{"name":"privat","root":false,"weight":0.6737338617045263},{"name":"offic","root":false,"weight":0.6670505847097469},{"name":"general","root":false,"weight":0.666199612569096},{"name":"report","root":false,"weight
":0.6595713096400557},{"name":"attorney","root":false,"weight":0.6499329443032753},{"name":"lawsuit","root":false,"weight":0.6496618393561032},{"name":"major","root":false,"weight":0.6465237017792291},{"name":"unit","root":false,"weight":0.6331677460678959},{"name":"target","root":false,"weight":0.6212719596391811},{"name":"repres","root":false,"weight":0.6160967128896992},{"name":"polic","root":false,"weight":0.6113350580435514},{"name":"pull","root":false,"weight":0.6063801140060568},{"name":"join","root":false,"weight":0.6046524184892539},{"name":"district","root":false,"weight":0.6044259409215366},{"name":"press","root":false,"weight":0.6031144452724365},{"name":"associ","root":false,"weight":0.6003673588297989},{"name":"order","root":false,"weight":0.599464665699612},{"name":"polici","root":false,"weight":0.5910439448395668},{"name":"C","root":true,"weight":0.0},{"name":"gun","root":false,"weight":0.842367427907458},{"name":"right","root":false,"weight":0.8436320394462936},{"name":"everi","root":false,"weight":0.7881178520800938},{"name":"need","root":false,"weight":0.7854537130795702},{"name":"peopl","root":false,"weight":0.8053841665884455},{"name":"weapon","root":false,"weight":0.7398491896322031},{"name":"protect","root":false,"weight":0.7301690753326737},{"name":"arm","root":false,"weight":0.7229824938519646},{"name":"govern","root":false,"weight":0.7151608164380466},{"name":"nra","root":false,"weight":0.7028925991025831},{"name":"elect","root":false,"weight":0.7005726242192335},{"name":"use","root":false,"weight":0.699288367046025},{"name":"constitut","root":false,"weight":0.6950320240708577},{"name":"kill","root":false,"weight":0.69304100211014},{"name":"amend","root":false,"weight":0.6919399844911012},{"name":"respons","root":false,"weight":0.6878270270857746},{"name":"democraci","root":false,"weight":0.6817083144372711},{"name":"militari","root":false,"weight":0.6800118938060612},{"name":"pay","root":false,"weight":0.6765671770022691},{"name":"assault"
,"root":false,"weight":0.6748755408663798},{"name":"D","root":true,"weight":0.0},{"name":"countri","root":false,"weight":0.7742877722605799},{"name":"eu","root":false,"weight":0.6955332826079442},{"name":"like","root":false,"weight":0.7778661103579779},{"name":"work","root":false,"weight":0.6939587862590073},{"name":"live","root":false,"weight":0.6910484847255536},{"name":"nation","root":false,"weight":0.6784539295968762},{"name":"year","root":false,"weight":0.6647953176098859},{"name":"world","root":false,"weight":0.663704387501576},{"name":"uk","root":false,"weight":0.6624892826881136},{"name":"home","root":false,"weight":0.6583206586106471},{"name":"make","root":false,"weight":0.7296687116203519},{"name":"wrong","root":false,"weight":0.6514292167176017},{"name":"british","root":false,"weight":0.6472915056206054},{"name":"dual","root":false,"weight":0.646211316269847},{"name":"help","root":false,"weight":0.642067349702604},{"name":"born","root":false,"weight":0.6418371733351484},{"name":"great","root":false,"weight":0.6413179898389446},{"name":"class","root":false,"weight":0.6402550504091011},{"name":"india","root":false,"weight":0.6399393865227763},{"name":"know","root":false,"weight":0.7383774901647662},{"name":"E","root":true,"weight":0.0},{"name":"law","root":false,"weight":0.8340282743636811},{"name":"abid","root":false,"weight":0.75714926353461},{"name":"say","root":false,"weight":0.6831221168141335},{"name":"think","root":false,"weight":0.682277910481741},{"name":"good","root":false,"weight":0.6773172181057171},{"name":"crimin","root":false,"weight":0.6766518519805734},{"name":"senior","root":false,"weight":0.6745091618848935},{"name":"stop","root":false,"weight":0.6715350288269791},{"name":"tri","root":false,"weight":0.6636237107261994},{"name":"yes","root":false,"weight":0.648788613492759},{"name":"becom","root":false,"weight":0.6486601719084698},{"name":"come","root":false,"weight":0.6419440755561171},{"name":"time","root":false,"weight":0.64007701313609
06}]};
</script>
<script src="/digitalcitizens/assets/js/topicgraph0.js"></script>
<p>Arguably we can now see example information more clearly, such as that topics C and E are related through terms linked to gun control, and that topic B is the least connected to the rest of the topics. Differentiating between node types, as discussed previously, has helped with this. The obvious downside to this approach is that you must be able to code the visualisations yourself.</p>
<p>As it stands, you can merely drag nodes around a bit; it might be better to add other forms of user interactivity. Future time could be invested in creating something like this: <a href="http://bl.ocks.org/NPashaP/cd80ab54c52f80c4d84cad0ba9da72c2">bP Example - Double Vertical bP with labels - bl.ocks.org</a> as an improvement to the graphs constructed here. Combinations of tools could also be used together; for example, if specific graph algorithms are difficult to implement or find, Gephi could be used to generate them.</p>
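<p>As a minimal sketch of the kind of summary that motivates observations like “topic B is the least connected”, the connectivity of each topic group can be computed directly from the links data. The records below imitate the shape of the graph data embedded above (<code class="highlighter-rouge">group</code>, <code class="highlighter-rouge">source</code>, <code class="highlighter-rouge">target</code>, <code class="highlighter-rouge">weight</code>); the sample values are illustrative, not taken from the real dataset, and summing weights is only one crude choice of connectedness score.</p>

```python
from collections import defaultdict

# Illustrative sample in the same shape as the embedded graph data:
# each link joins a topic group's root node to a term node with a weight.
links = [
    {"group": 2, "source": 26, "target": 27, "weight": 0.78},
    {"group": 2, "source": 26, "target": 28, "weight": 0.69},
    {"group": 3, "source": 49, "target": 50, "weight": 0.84},
]

def connectivity_by_group(links):
    """Sum link weights per topic group as a crude connectedness score."""
    totals = defaultdict(float)
    for link in links:
        totals[link["group"]] += link["weight"]
    return dict(totals)

print(connectivity_by_group(links))
```

<p>A lower total for a group suggests a topic that is weakly tied into the rest of the graph; counting only links whose targets are shared with other groups would give a stricter inter-topic measure.</p>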
<h3 id="considerations">Considerations</h3>
<p><strong>TODO</strong></p>
<h2 id="conclusion">Conclusion</h2>
<p>This has been a start to thinking about representing topics/themes within text visually. It is by no means comprehensive, and the methods shown here can be improved on. These approaches would be beneficial in communicating data in future projects; how this will be done obviously depends both on the nature of the data analysed and on the idea being communicated.</p>
<h3 id="references">References</h3>
<ul>
<li>Angus, D. 2017. Theme Detection in Social Media. In: Sloan, L. and Quan-Haase, A. ed. <em>The SAGE Handbook of Social Media Research Methods</em>. London U.K: SAGE Publications. pp.530-544.</li>
<li>Smith, A. E. and Humphreys, M. S. 2006. Evaluation of Unsupervised Semantic Mapping of Natural Language with Leximancer Concept Mapping. <em>Behaviour Research Methods</em>, <strong>38</strong>(2), pp.262-279. <a href="https://info.leximancer.com/science/">Science — Leximancer</a></li>
<li>Blei, D. M., Ng, A. Y. and Jordan, M. I. 2003. Latent Dirichlet Allocation. <em>Journal of Machine Learning Research</em>, <strong>3</strong>, pp.993-1022. Available from: <a href="http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf</a></li>
</ul>
<h3 id="further-resources">Further Resources</h3>
<ul>
<li><a href="https://www.youtube.com/watch?v=fThhbt23SGM&t=1847s">Mike Bostock - Design is a Search Problem - YouTube</a></li>
<li><a href="https://gephi.org/">Gephi - The Open Graph Viz Platform</a></li>
<li><a href="https://d3js.org/">D3.js - Data-Driven Documents</a></li>
</ul>Karl SimsThis post experiments with different ways of visualising topics within a dataset.Notes on Digital Methods2018-03-24T16:19:27+00:002018-03-24T16:19:27+00:00/digitalcitizens/posts/2018/notes-on-digital-methods<p>The following article notes some considerations of social research within digital environments.</p>
<p>Online mediums have brought with them different possibilities for social research, particularly new opportunities for quantification. Driven by changes in the scale and topology of network interactions, the tools available and the nature of research possible, research landscapes are changing. The social sciences have seen a push to become more empirical through these approaches, whilst the computational sciences have seen a softening of theirs and a more natural embrace of imprecision, particularly in systems for social settings. What follows are predominantly unstructured notes on ideas; the methods page in the tool bar serves as a repository for some specific approaches of interest.</p>
<hr />
<h4 id="methods-page-repository-of-some-collected-methods">Methods page: <a href="/digitalcitizens/methods/">Repository of some collected methods</a></h4>
<hr />
<p><strong>Methods for conducting social research online must evolve their sensibilities to match the technologies they are inquiring into.</strong> Richard Rogers, director of the Digital Methods Initiative, notes the turn in research seen in the transition from web 1.0 to web 2.0. Whereas research in the web 1.0 era mostly involved scrapers and link analysis, web 2.0 research has produced predominantly API-based research centred around the dominant platforms (2018, pp.93-94). This periodised trend is reflective of wider user migrations from an open information network to more centralised social networks.</p>
<p><strong>The affordances of social media sites each configure users’ capacities for action differently</strong> (Bucher, T. and Helmond, A. 2018). Likewise: <em>‘Platforms don’t just mediate public discourse, they constitute it’</em> (Gillespie, T. 2018).</p>
<p><strong>Social interaction online takes place principally in automated environments where human and non-human agency is in an active state of interplay.</strong> The dynamics and capacity for non-human influence vary from site to site. Facebook, for instance, has markedly more algorithmic curation when compared to Twitter. Twitter, though providing more control over content curation, makes it easier to create bots; one recent study estimated that somewhere between 9% and 15% of active Twitter accounts may be automated.</p>
<h2 id="sampling-methods">Sampling methods</h2>
<p><strong>Random:</strong> Select x% of the total population at random.</p>
<p><strong>Snowball:</strong> Iteratively build a sample by following connections out from an initial set. Network/graph based.</p>
<p><strong>Topic-based:</strong> Filter for specific conditions (Keywords, users, hashtags).</p>
<p><strong>Marker-based:</strong> Filter for specific meta-data such as location, language.</p>
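<p>The first two sampling strategies can be sketched in a few lines. This is a toy illustration over a hypothetical follower graph, not code from any of the studies cited here; real platform sampling would work against an API rather than an in-memory dictionary.</p>

```python
import random

# Toy follower graph: user -> set of connections (hypothetical data).
graph = {
    "a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"},
    "d": {"b", "e"}, "e": {"d"}, "f": set(),
}

def random_sample(users, fraction, seed=0):
    """Random: select a fixed fraction of the total population."""
    rng = random.Random(seed)  # seeded for reproducibility
    k = max(1, int(len(users) * fraction))
    return set(rng.sample(sorted(users), k))

def snowball_sample(graph, seeds, depth):
    """Snowball: iteratively add the connections of the current sample."""
    sample = set(seeds)
    frontier = set(seeds)
    for _ in range(depth):
        frontier = {n for u in frontier for n in graph.get(u, set())} - sample
        sample |= frontier
    return sample

print(snowball_sample(graph, {"a"}, 2))  # "a", its neighbours, then theirs
```

<p>Note how the two produce structurally different samples: the random sample can include isolated users like <code class="highlighter-rouge">f</code>, while the snowball sample can only ever reach users connected to the seeds.</p>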
<h2 id="references">References</h2>
<ul>
<li>Gerlitz, C. and Rieder, B. 2013. Mining One Percent of Twitter: Collections, Baselines, Sampling. M/C Journal. [Online]. 16(2). [Accessed 17 March 2018]. Available from: http://journal.media-culture.org.au/index.php/mcjournal/article/view/620</li>
<li>Rogers, R. 2018. Digital methods for cross-platform analysis. In: Burgess, J., Marwick, A. and Poell, T. ed. The SAGE Handbook of Social Media. London: SAGE Publications. pp.91-110.</li>
<li>Bucher, T. and. Helmond, A. 2018. The Affordances of Social Media Platforms. In: Burgess, J., Marwick, A. and Poell, T. ed. The SAGE Handbook of Social Media. London: SAGE Publications. pp.233-253.</li>
<li>Boyd, D. and Crawford, K. 2012. Critical questions for big data. Information, Communication & Society. [Online]. 15(5). pp.662-679. [Accessed 10 April, 2018]. Available from: https://doi.org/10.1080/1369118X.2012.678878</li>
<li>Neff, G. and Nagy, P. 2016. Automation, Algorithms, and Politics: Symbiotic Agency and the Case of Tay. International Journal of Communication. [Online]. 10(1), 4915–4931. [Accessed 10 November 2016] Available from: http://ijoc.org/index.php/ijoc/article/view/6277</li>
</ul>Karl SimsThe following article notes some considerations of social research within digital environments.Mark Zuckerberg personality insights2018-03-21T19:47:54+00:002018-03-21T19:47:54+00:00/digitalcitizens/posts/2018/mark-zuckerberg-personality-insights<p>Considering the attention drawn to data privacy after the recent Facebook and Cambridge Analytica fiasco, it seems relevant to explore some available tools for gathering insights on online publics. This article will experiment with IBM’s off-the-shelf personality insights tool to illustrate the kinds of features that can be constructed from user data. It will use Mark Zuckerberg’s response to Cambridge Analytica’s apparent misuse of Facebook data as a sample source.</p>
<h2 id="introduction">Introduction</h2>
<p>The combination of modern psychology, big data and deep learning has opened possibilities for advertisers, political campaigns and others to personally target individuals on a massive scale. Research such as that done by Michal Kosinski and others has continued to show a variety of ways social media data can be used to predict personal attributes such as sexual orientation (Wang, Y. and Kosinski, M. 2018), age, gender and personality (<a href="http://www.michalkosinski.com/home/publications">more citations?</a>). Though Cambridge Analytica’s impact on the 2016 U.S. election may be overstated, questions still arise as to the effects micro-targeting can have on the functioning of democracies and to what extent users have consented to this subjection.</p>
<p>In this article IBM’s services are used as an example to demonstrate some generic models for creating insights from personal data. Mark Zuckerberg’s recent PR response, posted publicly on Facebook, serves as the basis for the personality insights. Here is the sample text for your reference, if you would like to read it:</p>
<iframe src="https://www.facebook.com/plugins/post.php?href=https%3A%2F%2Fwww.facebook.com%2Fzuck%2Fposts%2F10104712037900071&width=500" width="500" height="294" style="border:none;overflow:hidden;margin-bottom:24px;" scrolling="no" frameborder="0" allowtransparency="true"></iframe>
<p>Beyond being a bit of fun, the example here aims to show how cheaply available insight tools are becoming, whilst questioning their increasing use in society.</p>
<h2 id="natural-language-understanding">Natural Language Understanding</h2>
<p>Before going on to the results of the personality insights, we can also quickly use IBM’s <a href="https://natural-language-understanding-demo.ng.bluemix.net/">Natural Language Understanding</a> API to get a brief outline of the document. Copying and pasting Mark’s post into the demo site, we get the following:</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/toshare.png" /></p>
<p>As well as summarising the object of the text sample, we can see <code class="highlighter-rouge">I want to share</code> as the key subject/action. This describes the semantic roles of the document; what about the emotional content?</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/emotion.png" /></p>
<p>Interestingly, the emotion is put forward as a mixture of joy and sadness. Perhaps sadness because of the news, but optimistic joy about the prospects of your future with Facebook 💁.</p>
<h2 id="personality-insights">Personality Insights</h2>
<p>Personality insights can arguably be used to gauge what kind of consumer you will be, the kinds of products you will be more likely to buy and, increasingly, what political messages may sway you. IBM’s <a href="https://personality-insights-demo.ng.bluemix.net/">personality insights demo page</a> describes the service as follows:</p>
<blockquote>
<p>Gain insight into how and why people think, act, and feel the way they do. This service applies linguistic analytics and personality theory to infer attributes from a person’s unstructured text.</p>
</blockquote>
<p>Again, simply copying and pasting Mark’s post into the demo yields a range of results; the first thing we are greeted by is the high-level summary shown below:</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/watson1.png" /></p>
<p>I always thought Mark was unlikely to be influenced by social media during product purchases, but how did the application know that? Well, according to what is described in the science behind the service, the specific personality profiles that are constructed suggest certain consumer behaviours. We can consider what this means a bit more with some of the other data the demo provides.</p>
<h3 id="personality-needs-values">Personality, Needs, Values</h3>
<p>Diving into some of the data we see personality represented in three main categories:</p>
<p><strong>Big Five model</strong>: This is one of the most widely studied personality models in clinical psychology. It describes a person in terms of <em>openness, conscientiousness, extraversion, agreeableness, and neuroticism</em> - It is sometimes referred to as the OCEAN model. Here neuroticism has been renamed in the service as emotional range as it was thought more ‘generally applicable’ (IBM Cloud Docs, 2017).</p>
<p><strong>Needs</strong>: <em>‘The twelve categories of needs that are reported by the service are described in marketing literature as desires that a person hopes to fulfil when considering a product or service’</em> (IBM Cloud Docs, 2017). (They are referring to: Kotler, P. and Armstrong, G. 2013. Principles of Marketing; Ford, K. 2005. Brands Laid Bare: Using Market Research for Evidence-Based Brand Management.)</p>
<p><strong>Values</strong>: <em>‘computes the five basic human values proposed by <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.220.3674&rep=rep1&type=pdf">Schwartz</a> and validated in more than twenty countries’</em> (IBM Cloud Docs, 2017).</p>
<p>You can see the results of these fields below.</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/types.png" /></p>
<h3 id="a-more-detailed-look-at-the-big-five">A more detailed look at the Big Five</h3>
<p>The service goes on to break down each of the Big Five dimensions into 10 features, making it a 50-feature model. Because the more features the better, right?</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/big5.png" /></p>
<p>If my graphs aren’t nice enough for you, here’s a nice 👌 ‘sunburst’ visualisation of all the data shown above, generated by the site.</p>
<p><img src="/digitalcitizens/assets/imgs/mark-zuckerberg-personality-insights/sunburst.png" /></p>
<h3 id="consumer-preferences">Consumer Preferences</h3>
<p>As stated previously, inferences can be made about the kinds of choices individuals are likely to make based on specific personality traits. Using the features described above, IBM has created models that fit specific consumption preferences to specific personality types. Though not shown overtly on the demo page, there is the option to download a JSON file with all the generic tests carried out. These can be listed as follows:</p>
<p><strong>Shopping</strong></p>
<ul>
<li><span style="opacity:1">Likely to be sensitive to ownership cost when buying automobiles: 1.</span></li>
<li><span style="opacity:0.25">Likely to prefer safety when buying automobiles: 0.</span></li>
<li><span style="opacity:1">Likely to prefer quality when buying clothes: 1.</span></li>
<li><span style="opacity:0.25">Likely to prefer style when buying clothes: 0.</span></li>
<li><span style="opacity:1">Likely to prefer comfort when buying clothes: 1.</span></li>
<li><span style="opacity:0.25">Likely to be influenced by brand name when making product purchases: 0.</span></li>
<li><span style="opacity:1">Likely to be influenced by product utility when making product purchases: 1.</span></li>
<li><span style="opacity:0.25">Likely to be influenced by online ads when making product purchases: 0.</span></li>
<li><span style="opacity:0.25">Likely to be influenced by social media when making product purchases: 0.</span></li>
<li><span style="opacity:0.25">Likely to be influenced by family when making product purchases: 0.</span></li>
<li><span style="opacity:0.25">Likely to indulge in spur of the moment purchases: 0.</span></li>
<li><span style="opacity:1">Likely to prefer using credit cards for shopping: 1.</span></li>
</ul>
<p><strong>Health and activity</strong></p>
<ul>
<li><span style="opacity:0.25">Likely to eat out frequently: 0.</span></li>
<li><span style="opacity:0.25">Likely to have a gym membership: 0.</span></li>
<li><span style="opacity:1">Likely to like outdoor activities: 1.</span></li>
</ul>
<p><strong>Environmental concern</strong></p>
<ul>
<li><span style="opacity:1">Likely to be concerned about the environment: 1.</span></li>
</ul>
<p><strong>Entrepreneurship</strong></p>
<ul>
<li><span style="opacity:0.5">Likely to consider starting a business in next few years: 0.5.</span></li>
</ul>
<p><strong>Movie</strong></p>
<ul>
<li><span style="opacity:0.25">Likely to like romance movies: 0.</span></li>
<li><span style="opacity:1">Likely to like adventure movies: 1.</span></li>
<li><span style="opacity:0.25">Likely to like horror movies: 0.</span></li>
<li><span style="opacity:0.25">Likely to like musical movies: 0.</span></li>
<li><span style="opacity:1">Likely to like historical movies: 1.</span></li>
<li><span style="opacity:1">Likely to like science-fiction movies: 1.</span></li>
<li><span style="opacity:1">Likely to like war movies: 1.</span></li>
<li><span style="opacity:0.25">Likely to like drama movies: 0.</span></li>
<li><span style="opacity:1">Likely to like action movies: 1.</span></li>
<li><span style="opacity:1">Likely to like documentary movies: 1.</span></li>
</ul>
<p><strong>Music</strong></p>
<ul>
<li><span style="opacity:0.25">Likely to like rap music: 0.</span></li>
<li><span style="opacity:0.5">Likely to like country music: 0.5.</span></li>
<li><span style="opacity:0.5">Likely to like R&B music: 0.5.</span></li>
<li><span style="opacity:0.25">Likely to like hip hop music: 0.</span></li>
<li><span style="opacity:0.25">Likely to attend live musical events: 0.</span></li>
<li><span style="opacity:0.25">Likely to have experience playing music: 0.</span></li>
<li><span style="opacity:1">Likely to like Latin music: 1.</span></li>
<li><span style="opacity:1">Likely to like rock music: 1.</span></li>
<li><span style="opacity:1">Likely to like classical music: 1.</span></li>
</ul>
<p><strong>Reading</strong></p>
<ul>
<li><span style="opacity:1">Likely to read often: 1.</span></li>
<li><span style="opacity:0.25">Likely to read entertainment magazines: 0.</span></li>
<li><span style="opacity:1">Likely to read non-fiction books: 1.</span></li>
<li><span style="opacity:1">Likely to read financial investment books: 1.</span></li>
<li><span style="opacity:0.25">Likely to read autobiographical books: 0.</span></li>
</ul>
<p><strong>Volunteering</strong></p>
<ul>
<li><span style="opacity:1">Likely to volunteer for social causes: 1.</span></li>
</ul>
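<p>Lists like the ones above can be reduced from the downloaded JSON in a few lines. The field names below (<code class="highlighter-rouge">consumption_preferences</code>, <code class="highlighter-rouge">name</code>, <code class="highlighter-rouge">score</code>) are an assumption modelled loosely on the demo’s downloadable output, not a documented contract, and the sample values are illustrative only.</p>

```python
# Hypothetical profile shaped like the demo's downloadable JSON;
# field names and values are assumptions for illustration.
profile = {
    "consumption_preferences": [
        {
            "name": "Shopping",
            "consumption_preferences": [
                {"name": "Likely to prefer quality when buying clothes", "score": 1.0},
                {"name": "Likely to prefer style when buying clothes", "score": 0.0},
                {"name": "Likely to consider starting a business", "score": 0.5},
            ],
        }
    ]
}

def likely_preferences(profile, threshold=0.5):
    """Return the names of preferences whose score meets the threshold."""
    return [
        pref["name"]
        for category in profile["consumption_preferences"]
        for pref in category["consumption_preferences"]
        if pref["score"] >= threshold
    ]

print(likely_preferences(profile))
```

<p>Filtering at a threshold of 0.5 keeps both the confident predictions and the uncertain middle scores, which is roughly how the dimmed/undimmed rendering above works.</p>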
<p>Some of these may seem silly, but in a real-world scenario you would create your own models to cater for your specific needs. That part isn’t as cheap an endeavour: without existing datasets, you would need to gather the data yourself.</p>
<h2 id="conclusion">Conclusion</h2>
<p><strong>How accurate is this?</strong></p>
<p>Short answer: In this example, not at all.</p>
<p>As the demo website states, the sample is far too short to create an accurate analysis. However, given a complete user profile, the results are argued to become more effective. I found a few references to horoscopes in online discussions concerning the service (<a href="https://www.quora.com/How-accurate-is-IBMs-Watson-Personality-Insights-application">Quora</a>).</p>
<p>Looking for other examples of this kind of service to get a comparison, the website created by Cambridge University <a href="https://applymagicsauce.com/">https://applymagicsauce.com/</a> is probably the most similar available online. However, I didn’t try that one out in the end because it wanted access to my social media data.</p>
<p>With this kind of service, accuracy will always be hard to measure. Even though it relies on numerical computations, we are still receiving qualitative results; agreeableness, for example, is much more a relative measure than a count of objects. However, using tried-and-tested psychological frameworks means the service probably does have some merit.</p>
<p><strong>Final thoughts</strong></p>
<p>Whether this is an accurate description of Mark’s personality, or even of the text, is not the argument. This example was used to start thinking about how tools trying to ascertain behavioural insights are integrating with society. The demo page emphasises that a representative sample be used. A person’s social media data tends to be thought of as representative, perhaps because of its capacity for eclectic expression. This service, and those like it, will always assess a representation of a person, not the person themselves. The affordances of these services to those seeking insights depend on their accuracy, but for those who are the subject of inquiry this is not always the case. For example, in the field of recruitment, services like that provided by <a href="https://www.hirevue.com/">hirevue</a> use machine learning to gain insights on candidates, scoring them via various metrics. Here, for the subject of inquiry, accuracy is less important than a favourable outcome (i.e. getting the job). In this way, there is the possibility for these tools to shape the behaviour of individuals. What happens when this becomes more widely used in society for decision-making processes and is further democratised? This could make for interesting future inquiry, especially with regard to more civic matters.</p>
<h3 id="references">References</h3>
<ul>
<li>Wang, Y. and Kosinski, M. 2018. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. <em>Journal of Personality and Social Psychology</em>. [Online]. [Accessed 22 March 2018]. <strong>114</strong>(2), pp.246-257. Available from: <a href="https://psyarxiv.com/hv28a/">https://psyarxiv.com/hv28a/</a></li>
<li>IBM Cloud Docs, 2017. <em>The science behind the service</em>. [Online]. [Accessed 22 March 2018]. Available from: <a href="https://console.bluemix.net/docs/services/personality-insights/science.html#science">https://console.bluemix.net/docs/services/personality-insights/science.html#science</a></li>
<li>IBM Watson Developer Cloud, 2017. <em>Personality Insights</em>. [Online]. [Accessed 22 March 2018]. Available from: <a href="https://personality-insights-demo.ng.bluemix.net/">https://personality-insights-demo.ng.bluemix.net/</a></li>
</ul>
<h3 id="other-resources">Other resources</h3>
<ul>
<li><a href="https://console.bluemix.net/docs/services/personality-insights/references.html#fast2008">IBM research reference list</a></li>
<li>Costa, P. and McCrae, R. 2008. Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) Manual. Odessa, FL: Psychological Assessment Resources (1992). Available from: <a href="https://www.researchgate.net/publication/285086638_The_revised_NEO_personality_inventory_NEO-PI-R">https://www.researchgate.net/publication/285086638_The_revised_NEO_personality_inventory_NEO-PI-R</a></li>
<li><a href="https://applymagicsauce.com/">Apply Magic Sauce personality insight app</a></li>
<li><a href="https://www.youtube.com/watch?v=DYhAM34Hhzc">Michal Kosinski - The End of Privacy, Keynote at CeBIT’17</a></li>
<li><a href="http://www.michalkosinski.com/home/publications">http://www.michalkosinski.com/home/publications</a></li>
<li><a href="https://youtu.be/n8Dd5aVXLCc">The Power of Big Data and Psychographics</a></li>
</ul>Karl SimsConsidering the attention drawn to data privacy after the recent Facebook and Cambridge Analytica fiasco, it seems relevant to explore some available tools for gathering insights on online publics. This article will experiment with IBM’s off-the-shelf personality insights tool to illustrate the kinds of features that can be constructed from user data. It will use Mark Zuckerberg’s response to Cambridge Analytica’s apparent misuse of Facebook data as a sample source.