Geo-tagged Twitter collection and visualization system.
Mobile social media are generating valuable data for analyzing human behavior and events in the real world. Twitter is one of the most popular social media for sharing short text messages called tweets. According to a report published in July 2012 there are over 500 million Twitter user accounts and more than 400 million tweets are posted every day. Although only 0.77% of all tweets are geo-tagged with coordinates, mobile devices are used in 60% of all instances of access to Twitter.
To analyze human behavior and events in the real world using geo-tweet data, it is necessary to collect a certain amount of data. However, we cannot collect a sufficient amount of data through commonly used data-collecting methods, as will be discussed below. Furthermore, the continuous collection of large amounts of data is expensive. In addition, having many researchers collect the same geo-tweet data is not efficient. Therefore, the final goal of this study is to develop a system for archiving geo-tweet data and sharing it among researchers.
In this research study, we developed a distributed system for collecting tweet data aggregated to geographic grids. We also developed a geo-tweet data spatio-temporal visualization tool and conducted a data-collection and visualization experiment.
This paper is organized as follows. Related studies are introduced first. A method for collecting data is then proposed, followed by a description of the implementation of the data-collection and visualization systems. The results of the data-collection experiment using the proposed system are described next. Concluding remarks and suggestions for future research are presented in the last section.
Many previous studies have used data collected from Twitter. Most of these studies have focused on social networks and communications; only a few focus on geographic locations. A major concern in these studies is the relationship between tweets and events in the real world. For instance, Sankaranarayanan et al. (2009) proposed a method for capturing tweets about news in the real world and developed a system for visualizing those news tweets on maps. Sakaki, Okazaki, and Matsuo (2010) analyzed the spatio-temporal variation of tweets about earthquakes and typhoons and, based on the analysis, proposed a method for detecting such events and predicting where they occurred in real time. Fujisaka, Lee, and Sumiya (2010) proposed a method for detecting local events based on the increase and decrease in the number of tweets in geographic grid cells. Becker, Naaman, and Gravano (2011) proposed a method to distinguish between tweets about real-world events and non-event tweets based on the text of the tweets. Van Liere (2010) focused on the spatial relationships of the locations of re-tweets (i.e., tweets that quote other users' tweets) and discussed the patterns of information diffusion.
Other studies focus on developing applications. Field and O'Brien (2010) proposed a map-based system using Twitter which supports collaborative real-time mapping and the organization and display of information for mass user events. MacEachren et al. (2011) proposed a map-based, interactive web application that enables information foraging and sense-making using tweet indexing and displays based on place, time, and concept characteristics. Nakaji and Yanai (2012) proposed a method for selecting representative photos of real-world events from photographs attached to geo-located tweets, and, in the process, realized a map-based visualization of tweets along with those photographs.
Table 1 shows the number of tweets acquired and used in some of the previous studies and in this study. Even though a simple comparison is impossible because Sakaki et al. included only geo-tweets containing specific words (earthquake, typhoon), the tabulated data indicate that the method proposed in this paper results in a greater number of geo-tweets collected in an area with a high spatio-temporal density.
Methods and materials
We used the Twitter Application Programming Interface (API) to collect geo-tweets. In this section, we describe some commonly used methods for collecting geo-tweets which also utilize the Twitter API and present some of their disadvantages. We then propose a new method for collecting geo-tweets using the API and evaluate the method by comparing its performance with that of the commonly used methods.
Twitter provides many kinds of Web APIs. These APIs generate tweet text and various attributes of the text, as listed below:
* Tweet text;
* Tweet ID;
* User ID;
* Destination user ID (only for tweets with "@user ID");
* User profile (including location name input by user); and
* Location coordinates (only for tweets tagged with the location coordinates).
The two APIs commonly used to collect public tweets are the Streaming API and the Search API.
Streaming API:
* About 10% of all public tweets are sent continuously in real time while connected to the API.
* By setting a spatial filter with the geographic coordinates of an area, tweets within the specified area are acquired.
* Target tweets for spatial filtering are those that have GPS- or WiFi-based location information.
* The location coordinates of each tweet are always acquired as a result of an API request with a spatial filter.
Search API:
* By setting spatial and/or temporal search conditions, tweets within a specified area and/or within a specified period are acquired.
* The target tweets for a spatial search are those whose user profiles have a geocoded location name; this also includes tweets that have GPS- or WiFi-based location information.
* The location coordinates of each tweet are not always acquired as a result of an API location-based search request.
* The search period is limited to the 5 days before the current date.
These APIs are at the heart of three commonly used methods for collecting geo-tweets:
(1) Caching data in real time by using streaming API with a spatial filter.
The advantages of this method are that it collects the location coordinates of all tweets and the data are collected in real time. The disadvantage is that the number of target tweets is relatively small.
(2) Caching data in real time by using streaming API without a spatial filter and geocoding the location names included in the user profile information.
The advantage of this method is that the number of target tweets is large compared to that obtained using an API with a spatial filter. The disadvantage is that the location names of user profiles in tweet data are often difficult to geocode using commonly used geocoding APIs, such as the Google Maps geocoding API. The reason for this is that the location names are filled out by the users without any constraint, and they are often very approximate, not formal, or do not exist in the real world, as described by Hecht et al. (2011).
(3) Collecting data by accessing Search API at certain intervals with spatial and temporal search conditions.
The advantage of this method is that the number of target tweets is relatively large compared to that of the previous two methods. A major disadvantage is that the location coordinates of each tweet are not always acquired. Another disadvantage is that, due to the limitations of the Twitter API, it is impossible to collect all the tweets in areas with a large number of geo-tweets, such as a city center. This is because the number of tweets acquired under the same search condition is limited to 1500, while the minimum radius of a spatial filter is 1 km, and the minimum period of a temporal filter is 1 day. In many city centers, the number of tweets under the minimum spatio-temporal condition is often far more than 1500.
We propose a method for collecting general public geo-tweet data which specifies a geographic area and a date. Our method extends the commonly used method (3) described in the previous section. To recap, this method has the following disadvantages:
* Even though the acquired tweets are from a geographically specific area, their location coordinates are not always collected.
* It is impossible to collect all the tweets in an area such as a city center where the number of geo-tweets is large.
To solve these problems, we propose the following:
(1) Divide the target geographic area into small areas and collect tweets using Search API in each small area. In this research, we use the Double Grid Square of the Japanese Standard Grid Square, which is one of the standardized geographic grids used for collecting public statistics in Japan. The width and height of each cell of the grid are 1.5' and 1', respectively, so each cell covers about 2 km x 2 km. Aggregating geo-tweets to the Standard Grid Square facilitates overlaying them with other spatial data in Japan.
(2) Use the tweet ID instead of a date-time period as a condition of the Search API. A tweet ID is an integer ID attached to all tweets in ascending sequence since the Twitter service began.
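Step (1) can be illustrated with a minimal Python sketch that maps a latitude/longitude pair to a Double Grid Square code. The encoding is an assumption based on the usual Japanese Standard Grid Square scheme: first-level cells of 40' x 1 degree, second-level cells of 5' x 7.5', then a 5 x 5 subdivision into 1' x 1.5' cells numbered with even digits and a trailing 5. The function name and the sample coordinates are illustrative and are not taken from the system described in this paper.

```python
import math


def double_grid_square_code(lat, lon):
    """Return the 9-digit Double Grid Square code for a point in Japan.

    Assumed encoding: 4 digits for the first-level cell, 2 digits for the
    second-level cell, 2 even digits for the 1' x 1.5' cell, then a final 5.
    """
    # First level: 40' of latitude (1/1.5 degree) by 1 degree of longitude.
    p = math.floor(lat * 1.5)
    u = math.floor(lon) - 100
    lat_rem = lat - p / 1.5
    lon_rem = lon - (u + 100)

    # Second level: 8 x 8 subdivision (5' = 1/12 degree, 7.5' = 1/8 degree).
    q = math.floor(lat_rem * 12)
    v = math.floor(lon_rem * 8)
    lat_rem -= q / 12
    lon_rem -= v / 8

    # Double Grid Square: 5 x 5 subdivision (1' = 1/60 degree, 1.5' = 1/40 degree),
    # numbered 0, 2, 4, 6, 8 from the south-west corner.
    r = 2 * math.floor(lat_rem * 60)
    w = 2 * math.floor(lon_rem * 40)

    return f"{p:02d}{u:02d}{q}{v}{r}{w}5"


# Approximate coordinates of Tokyo Station (illustrative values).
print(double_grid_square_code(35.6812, 139.7671))
```

For a point near Tokyo Station, this sketch yields 533946005, which is the cell code quoted for the experiments in this paper. Points that fall exactly on a cell boundary may need extra care because of floating-point rounding.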
The procedures defined below take into account the following:
* The Twitter Search API returns only 100 tweets as "1 page" at a time from all the tweets that meet the specified search conditions;
* The rest of the tweets are acquired by accessing Search API, specifying a page number; and
* The maximum page number is 15. In other words, the number of tweets collected under the same search condition is only 1500.
We assume that the user specified the following values for data collection:
* Target date: a date (1 day) of tweets to collect; and
* Target area: geographic area in which to collect tweets.
In other words, in the proposed method, data are collected for a specific date. For data collection over multiple days, the process is repeated by specifying each date. Two procedures were developed for collecting geo-tweets based on these strategies:
Procedure A: Acquiring a tweet ID of almost the last tweet of the next day of the target date
Input: target date d_q, target cell of a grid c_q
Output: the tweet ID of almost the last tweet of the next day of the target date, id_q
  access Twitter Search API specifying date as the next day of d_q and area as c_q, and acquire a list of tweets T in descending sequence of tweet IDs (100 tweets are acquired according to the specification of the API)
  set id_max to the tweet ID of the last item in T (i.e. the smallest tweet ID in T)
  while true
    access Twitter Search API specifying max tweet ID as id_max and area as c_q, and acquire a list of tweets T in descending sequence of tweet IDs (100 tweets are acquired according to the specification of the API)
    for each tweet in T
      if the date of the tweet is d_q
        return id_max as id_q, and exit the procedure
      else
        set id_max to the tweet ID of the tweet
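Procedure A can be sketched in Python as follows. This is a minimal illustration, not the authors' PHP/Perl implementation: `make_fake_search` is a hypothetical in-memory stand-in for the Search API that returns at most 100 tweets, newest first, and the dictionary fields (`id`, `date`, `cell`) are assumed names. The sketch also assumes that the max-ID condition is inclusive, so it subtracts 1 to avoid fetching the same tweet twice.

```python
from datetime import date, timedelta


def make_fake_search(tweets, page_size=100):
    """Hypothetical stand-in for the Search API: newest tweets first."""
    ordered = sorted(tweets, key=lambda t: t["id"], reverse=True)

    def search(cell, day=None, max_id=None):
        hits = [t for t in ordered
                if t["cell"] == cell
                and (day is None or t["date"] == day)
                and (max_id is None or t["id"] <= max_id)]
        return hits[:page_size]

    return search


def find_start_id(search, target_date, cell):
    """Walk back from the next day until a tweet of the target date appears."""
    page = search(cell=cell, day=target_date + timedelta(days=1))
    if not page:
        return None  # no tweets on the next day in this cell
    id_max = page[-1]["id"]  # smallest tweet ID on the first page
    while True:
        # Assumption: max_id is inclusive, so skip the tweet already seen.
        page = search(cell=cell, max_id=id_max - 1)
        if not page:
            return id_max
        for tweet in page:
            # The procedure tests for equality; "<=" is a defensive choice
            # for cells with no tweets on the target date.
            if tweet["date"] <= target_date:
                return id_max
            id_max = tweet["id"]


CELL = "533946005"
TARGET = date(2011, 7, 25)
tweets = (
    [{"id": i, "date": TARGET, "cell": CELL} for i in range(1, 301)]
    + [{"id": i, "date": TARGET + timedelta(days=1), "cell": CELL}
       for i in range(301, 651)]
)
start_id = find_start_id(make_fake_search(tweets), TARGET, CELL)
print(start_id)
```

In this toy data set the returned ID marks the boundary between the target date and the following day, which is the value that Procedure B takes as its starting point.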
Because the result of the API is sampled, we cannot acquire exactly the last tweet of the next day of the target date. However, we can acquire and use almost the last tweet. Before performing Procedure B, we divide the target area into Double Grid Squares of the Japanese Standard Grid Square. Procedure B is then executed for every cell of the grid. In the following, the target cell refers to one of these grid cells.
Procedure B: Data collection using tweet ID
Input: target date d_q, target cell of a grid c_q, the tweet ID of almost the last tweet of the next day of the target date id_q
Output: a set of tweets T_r
  set id_max to id_q
  while true
    access Twitter Search API specifying max tweet ID as id_max and area as c_q, and acquire a list of tweets T in descending sequence of tweet IDs (100 tweets are acquired according to the specification of the API)
    for each tweet in T
      if the date of the tweet is the day before d_q
        return T_r and exit the process
      else if the date of the tweet equals d_q
        add the tweet to T_r
    set id_max to the tweet ID of the last item in T
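A corresponding sketch of Procedure B is given below, again in Python and again under stated assumptions: `make_fake_search` is a hypothetical in-memory replacement for the Search API (at most 100 tweets per request, newest first), the field names are invented for illustration, and the max-ID condition is assumed to be inclusive. The toy data set places 2000 tweets in one cell on the target date to show that paging by tweet ID is not bound by the 1500-tweet cap that applies to a single search condition.

```python
from datetime import date, timedelta


def make_fake_search(tweets, page_size=100):
    """Hypothetical stand-in for the Search API: newest tweets first."""
    ordered = sorted(tweets, key=lambda t: t["id"], reverse=True)

    def search(cell, max_id):
        hits = [t for t in ordered if t["cell"] == cell and t["id"] <= max_id]
        return hits[:page_size]

    return search


def collect_day(search, target_date, cell, start_id):
    """Collect all tweets of target_date in one cell by paging on tweet ID."""
    collected = []
    id_max = start_id
    while True:
        page = search(cell=cell, max_id=id_max)
        if not page:
            return collected  # nothing older is available
        for tweet in page:
            if tweet["date"] < target_date:
                return collected  # reached the day before the target date
            if tweet["date"] == target_date:
                collected.append(tweet)
        # Assumption: max_id is inclusive, so step past the last tweet seen.
        id_max = page[-1]["id"] - 1


CELL = "533946005"
TARGET = date(2011, 7, 25)
ONE_DAY = timedelta(days=1)
tweets = (
    [{"id": i, "date": TARGET - ONE_DAY, "cell": CELL} for i in range(1, 11)]
    + [{"id": i, "date": TARGET, "cell": CELL} for i in range(11, 2011)]
    + [{"id": i, "date": TARGET + ONE_DAY, "cell": CELL}
       for i in range(2011, 2021)]
)
collected = collect_day(make_fake_search(tweets), TARGET, CELL, start_id=2015)
print(len(collected))
```

Each request changes the max-ID condition, so every page counts as a new search condition and the loop can continue until the previous day is reached.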
We conducted an experiment to compare the performance of commonly used methods and the proposed method for collecting geo-tweet data under the following conditions:
* Area: about 2 km x 2 km around Tokyo Station (Japanese Standard Grid Square Code: 533946005)
* Period: 1 day
In the experiment, we conducted a data re-collection process (described in the next section) twice. As shown in Table 2, the proposed method collected more than three times as much data as the commonly used methods.
Proposed data-collection and visualization systems
We implemented data-collection and visualization systems based on the method proposed in the previous section.
The data-collection system was implemented as a distributed system using PHP and Perl. The system architecture and the processes executed on each server are as follows.
Single management server:
* Recording tweet IDs of the boundary between days;
* Pilot data collection for monitoring Twitter API status (described below).
Multiple data-collection servers:
* Data-collection process and data re-collection process (described below).
Using the Twitter API involves some practical issues:
(1) Rate limitation of Twitter Search API. The access rate from the same IP address is limited, and a connection is refused when the limit is exceeded.
(2) Stability of Twitter Search API. Twitter API is provided as a best-effort service, and it often becomes unstable. In addition, when the API is unstable, it often returns no explicit error, but the number of tweets in the returned result becomes much smaller than usual. Therefore, we cannot determine the status of the API just by detecting errors from the API.
To solve problem (1), we chose the following strategy:
* Data collection by distributed system; and
* Accessing Search API from multiple servers with multiple different IP addresses. Each server collects the data of multiple grid cells allocated by the management server, and sends the collected data to the management server periodically.
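The allocation of grid cells to data-collection servers can be pictured with the short Python sketch below. The round-robin policy and the server names are assumptions made for illustration; the paper does not state how the management server assigns cells. The cell codes are generated from the four second-level grid square codes listed in Table 3, assuming the Double Grid Square encoding (two even digits followed by a 5).

```python
from itertools import product


def allocate_cells(cells, servers):
    """Assign grid cells to servers in round-robin order."""
    allocation = {server: [] for server in servers}
    for index, cell in enumerate(cells):
        allocation[servers[index % len(servers)]].append(cell)
    return allocation


# Second-level grid squares covering the target area (from Table 3).
second_level = ["533935", "533936", "533945", "533946"]

# Assumed encoding: 25 Double Grid Squares per second-level square.
cells = [f"{base}{r}{w}5"
         for base in second_level
         for r, w in product("02468", repeat=2)]

# Hypothetical server names; the experiment used four machines.
servers = ["collector-1", "collector-2", "collector-3", "collector-4"]

allocation = allocate_cells(cells, servers)
for server, assigned in allocation.items():
    print(server, len(assigned))
```

Under these assumptions the target area contains 100 cells, so each of the four servers handles 25 cells and issues its requests from its own IP address.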
To solve problem (2), we applied the following process.
* Pilot data collection for monitoring Twitter API status. In order to determine the status of Twitter Search API, especially that of a location-based search, we continuously monitored the number of geo-tweets collected in a certain grid cell. A grid cell in which the number of geo-tweets is consistently large was considered suitable for monitoring. According to our experiment, when Twitter API is not working normally, the number of geo-tweets a day becomes far smaller than 10% of the average value. Therefore, data collection was halted, and an alert email was sent automatically when the number of tweets collected at a given time in the grid cell for monitoring was smaller than 10% of the average value at the same time of the previous days. The data-collection process of the data-collection servers was restarted when the API returned to a stable state.
* Re-collection of data that the system failed to collect. Check the posting times of the tweets collected for the specified date and grid cell; if there are periods in which no tweets were collected at all because the Twitter API was not stable, collect the data for those periods again.
* Repeat request when an API error occurred. When receiving an error from the API, repeat the same request up to a specified number of times.
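The three safeguards above can be sketched in Python as follows. The 10% threshold comes from the text; everything else (function names, the 30-minute minimum gap, the exception handling) is an assumption chosen for illustration and does not describe the actual implementation.

```python
from datetime import datetime, timedelta
from statistics import mean


def is_api_unstable(current_count, previous_counts, ratio=0.10):
    """Flag the API as unstable when the monitored cell yields fewer tweets
    than `ratio` times the average for the same time on previous days."""
    if not previous_counts:
        return False  # no baseline yet
    return current_count < ratio * mean(previous_counts)


def find_gaps(timestamps, day_start, day_end, min_gap=timedelta(minutes=30)):
    """Return (start, end) periods with no collected tweets.

    min_gap is a hypothetical tuning parameter.
    """
    points = [day_start] + sorted(timestamps) + [day_end]
    return [(a, b) for a, b in zip(points, points[1:]) if b - a >= min_gap]


def call_with_retries(request, retries=2):
    """Repeat a failing request up to `retries` additional times."""
    last_error = None
    for _ in range(retries + 1):
        try:
            return request()
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    raise last_error


day_start = datetime(2011, 7, 25, 0, 0)
stamps = [day_start + timedelta(minutes=m) for m in (10, 20, 120)]
gaps = find_gaps(stamps, day_start, day_start + timedelta(hours=3))
print(is_api_unstable(50, [1000, 1200, 800]), len(gaps))
```

The periods returned by `find_gaps` are the candidates for re-collection, and `is_api_unstable` is the check that would trigger the halt and the alert email.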
Data visualization system
We developed a data visualization system that aggregates and visualizes the collected geo-tweets. The data aggregation system was implemented in PHP and Perl, while the graph and animation drawing interfaces were implemented in Flash/Flex and used the Google Maps API.
(1) Data aggregation:
* Counting the number of tweets according to a specified spatial and/or temporal unit.
(2) Graph drawing:
* Drawing graphs of the daily or hourly variation in the number of tweets in each geographic grid cell;
* Overlaying the graph images on the corresponding areas on a map.
(3) Animation drawing:
* Creating an animated version of the density grid map of the hourly variation in the number of tweets.
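The aggregation step (1) that feeds the graphs and the animation can be sketched in Python as below. The field names `cell` and `posted_at` are hypothetical, and the actual system was written in PHP and Perl; the sketch only shows the idea of counting tweets per grid cell and per temporal unit.

```python
from collections import Counter
from datetime import datetime


def count_by_cell_and_hour(tweets):
    """Count tweets per (grid cell, hour)."""
    counts = Counter()
    for tweet in tweets:
        hour = tweet["posted_at"].replace(minute=0, second=0, microsecond=0)
        counts[(tweet["cell"], hour)] += 1
    return counts


def daily_totals(hourly_counts):
    """Roll hourly counts up to (grid cell, date)."""
    totals = Counter()
    for (cell, hour), n in hourly_counts.items():
        totals[(cell, hour.date())] += n
    return totals


sample = [
    {"cell": "533946005", "posted_at": datetime(2011, 7, 25, 12, 5)},
    {"cell": "533946005", "posted_at": datetime(2011, 7, 25, 12, 40)},
    {"cell": "533946005", "posted_at": datetime(2011, 7, 25, 23, 59)},
    {"cell": "533946025", "posted_at": datetime(2011, 7, 26, 0, 1)},
]
hourly = count_by_cell_and_hour(sample)
daily = daily_totals(hourly)
print(hourly[("533946005", datetime(2011, 7, 25, 12))])
```

The hourly counts correspond to the line charts and the density-map animation, while the daily totals correspond to the daily-variation graphs.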
We conducted a data-collection experiment using the proposed system. Four server machines were used to check the operations of the distributed system, even though a single server would have sufficed for the purpose of this experiment. The pilot data collection took place in an area of 2 km x 2 km near Tokyo Station (Double Grid Square Code 533946005 of the Japanese Standard Grid Square). The number of geo-tweets in this area is usually the largest in Japan. The re-collection process was repeated twice, and the number of times a request was repeated when an API error occurred was set to two. The access rate to Twitter Search API was 300 requests per hour per server. Table 3 describes the data collected.
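A rough back-of-the-envelope check of the single-server remark can be written in Python. It combines the 100-tweets-per-request limit, the rate of 300 requests per hour per server, and the weekday average of 174,983 geo-tweets per day reported for this experiment. The result is only a lower bound on the number of requests: it assumes every page is full and ignores Procedure A, pilot monitoring, retries, and re-collection.

```python
import math

PAGE_SIZE = 100            # tweets returned per Search API request
REQUESTS_PER_HOUR = 300    # access rate per server used in the experiment
WEEKDAY_TWEETS = 174_983   # average geo-tweets per weekday in the target area

# Minimum number of requests needed to fetch one weekday of data.
min_requests = math.ceil(WEEKDAY_TWEETS / PAGE_SIZE)

# Requests a single server can issue in 24 hours at the stated rate.
capacity_per_server = REQUESTS_PER_HOUR * 24

print(min_requests, capacity_per_server)
```

Under these simplifying assumptions, one day of data needs at least 1750 requests, well below the 7200 requests a single server can issue per day.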
The target area and the grid cells are shown in Figures 7 and 8. The following section presents details of the collected data.
Figure 1 shows the daily variation in the number of geo-tweets collected in the entire target area. The number collected per day during the weekend is smaller than that collected during the week. The daily average of the number of geo-tweets on a weekday is 174,983 and that on a weekend is 160,369. Figure 2 shows the daily variation in the number of geo-tweets collected in the Odaiba area (measuring about 2 km x 2 km), a popular shopping and amusement area in Tokyo. The number of geo-tweets collected per day in this area during the weekend is larger than that collected during the week.
Figure 3 shows the hourly variation in the number of geo-tweets collected in the entire target area on the 2 days on which the number of collected tweets was largest (Thursday, the blue line) and smallest (Sunday, the red line). The smallest number of tweets was collected at around 04:00. There was a small peak at around 12:00 and a large one at around 00:00. The same pattern of peaks was observed every day. The graph seems to represent common patterns of daily life. The peak around 12:00 probably corresponds to the lunch break. The number increases through the evening and then peaks around 00:00. The low around 04:00 indicates that most people are sleeping at that time.
Figure 4 shows a graph of the hourly variation in the number of geo-tweets collected around Tokyo station (an area measuring about 2 km x 2 km) over 14 days, i.e., the duration of the experiment. There was a small peak around 04:00 on 1 day (7 July) and no peak around the same time on the other days. On 7 July, a small earthquake occurred at around 4 am.
Figure 5 shows the number of tweets per user. Most users posted fewer than four geo-tweets. Figure 6 shows the number of grid cells where each user posted geo-tweets. More than half the users posted geo-tweets in at least two different cells. One user posted geo-tweets in as many as 56 different cells. Although it is difficult to understand why some users post so many tweets, it is well-known (see Wilkinson 2008) that a small number of very active users can make the most contributions in social media. It is possible that in our case early adopters were responsible for many of the geo-tagged tweets posted.
This section explains the figures created by our data visualization system. When the map is zoomed out, the distribution of the number of tweets over the map is visualized as a density map, as shown in each map in Figure 9. Each cell of the grid is colored depending on the number of geo-tweets inside it. When one zooms in on the map (as is the case in Figures 7 and 8), a column or a line chart is overlaid on each cell as shown in the figures. Figures 7 and 8 show the daily and the hourly variation in the number of geo-tweets, respectively. The user can specify the period of days to visualize, which is drawn as a multiple line graph. In Figure 8, each grid cell shows the hourly variation in tweets for each of the 14 days. Clicking one of the cells opens a large image of the graph in that cell so that it can be examined in detail.
Figures 7 and 8 reflect the many events that occurred in the real world during the period of study. For example, in many graphs in Figure 7 there is a peak on 21 September. A typhoon hit Tokyo on that day. The peaks seen at around 18:00 in many cells in Figure 8 occurred on the day of a typhoon. At that time, almost all trains and buses were delayed and overcrowded, and commuters were stranded. The figures also show localized events. For example, in Figure 7, there is a peak on 14 September, in the cell that is fourth from the top and fourth on the right-hand side of the grid. There was a demonstration in the area on that day. The reason behind the clustering of the hourly change in geo-tweets in the center of the maps in Figure 9 is unclear given that tweeting occurs continuously throughout the day. But at the periphery of the maps, the hourly change in the number of tweets is obvious. The change can be recognized more clearly by an animation of the set of maps visualized by our system which clearly demonstrates the hourly changes in tweets in the study area.
In this study, we developed a distributed system for collecting Twitter geo-tagged data. The proposed method can collect several times more data than commonly used methods. We also developed a spatio-temporal visualization system to display the data and showed some characteristics of the collected data using the system. Future research will focus on:
* Data collection: We plan to scale up the system to enlarge the area for collecting geo-tweet data.
* Data visualization: We plan to develop a function for visualizing the collected data focusing on the relationships between locations and communication among users and the relationships between locations and particular events.
Becker, H., M. Naaman, and L. Gravano. 2011. "Beyond Trending Topics: Real-World Event Identification on Twitter." In Proceedings of the Fifth International AAAI Conference on Weblogs and Social Media, July 17-21, 438-441. Barcelona: AAAI.
Field, K., and J. O'Brien. 2010. "Cartoblography: Experiments in Using and Organising the Spatial Context of Micro-Blogging." Transactions in GIS 14 (s1): 5-23.
Fujisaka, T., R. Lee, and K. Sumiya. 2010. "Exploring Urban Characteristics Using the Movement History of Mass Mobile Microbloggers." In Proceedings of the Eleventh Workshop on Mobile Computing Systems and Applications, Annapolis, MD, February 22-23.
Hecht, B., L. Hong, B. Suh, and E. H. Chi. 2011. "Tweets from Justin Bieber's Heart: The Dynamics of the Location Field in User Profiles." In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI 2011), May 7-12, 237-246. Vancouver, BC: ACM.
MacEachren, A. M., A. Jaiswal, A. C. Robinson, S. Pezanowski, A. Savelyev, P. Mitra, X. Zhang, and J. Blanford. 2011. "SensePlace2: GeoTwitter Analytics Support for Situational Awareness." In Proceedings of the 2011 IEEE Conference on Visual Analytics Science and Technology (VAST), October 23-28, 181-190. Providence, RI: IEEE.
MacEachren, A. M., A. C. Robinson, A. Jaiswal, S. Pezanowski, A. Savelyev, J. Blanford, and P. Mitra. 2011. "Geo-Twitter Analytics: Applications in Crisis Management." In Proceedings of the 25th International Cartographic Conference, Paris, July 3-8.
Nakaji, Y., and K. Yanai. 2012. "Visualization of Real-world Events with Geotagged Tweet Photos." In Proceedings of the 2012 IEEE International Conference on Multimedia and Expo Workshops, July 9-13, 272-277. Melbourne: IEEE.
Sakaki, T., M. Okazaki, and Y. Matsuo. 2010. "Earthquake Shakes Twitter Users: Real-time Event Detection by Social Sensors." In Proceedings of the 18th International World Wide Web Conference, Madrid, April 20-24, 2009.
Sankaranarayanan, J., B. E. Teitler, H. Samet, M. D. Lieberman, and J. Sperling. 2009. "TwitterStand: News in Tweets." In Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, November 4-6, 42-51. Seattle, WA: ACM.
Van Liere, D. 2010. "How Far Does a Tweet Travel? Information Brokers in the Twitterverse." In Proceedings of the International Workshop on Modeling Social Media, 1-4. Toronto, ON: ACM.
Wilkinson, D. M. 2008. "Strong Regularities in Online Peer Production." In Proceedings of the 9th ACM conference on Electronic commerce, July 8-12, 302-309. Chicago, IL: ACM.
Hideyuki Fujita *
Graduate School of Information Systems, The University of Electro-Communications, 1-5-1, Chofugaoka, Chofu, Tokyo 182-8585, Japan
(Received 14 February 2013; accepted 17 April 2013)
* Email: email@example.com
Table 1. Number of geo-tweets collected.

Study                      Target area     Period (days)   Number of tweets   Number of users
1. Sakaki et al. (2010)    Japan           60              621                --
2. Fujisaka et al. (2010)  Japan           7               129,403            4131
3. Van Liere (2010)        World           0.5             3339               6424
This research              Central Tokyo   14              3,476,059          216,430

Table 2. Number of collected tweets.

Method                             Number of tweets
Common method (1) streaming API    31,711
Common method (2) search API       1500
Proposed method                    97,787

Table 3. Description of the data collected.

Target area        About 20 km x 20 km square around center of Tokyo (2nd Grid Square Code 533935, 533936, 533945, 533946)
Period             2 weeks (from 25 July 2011 0:00 JST)
Number of tweets   3,476,059
Number of users    216,430
Publication: Cartography and Geographic Information Science
Date: Jun 1, 2013