Previously, I used Google BigQuery to retrieve the GitHub Archive data. This time I used only R. First, I download one hour of data and decompress it.
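# get data
# build the file name for one hour of events
# (archive file names appear to use an unpadded hour, so hours before 10 may need the leading zero removed)
now <- format(as.POSIXct("2014-03-14 15:00:00"), "%Y-%m-%d-%H")
url <- sprintf("http://data.githubarchive.org/%s.json.gz", now)
tmpgz <- "/Users/myname/Downloads/github.json.gz"
tmpjson <- "/Users/myname/Downloads/github.json"
download.file(url, tmpgz)
# gunzip replaces github.json.gz with github.json; shQuote protects paths containing spaces
system(paste0("gunzip ", shQuote(tmpgz)))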
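As a side note, the separate decompression step is not strictly necessary, because base R can read a gzip file directly through a gzfile() connection. The following is only a minimal sketch of that idea: it writes a tiny stand-in archive to a temporary file (the two JSON lines are made up for illustration, not real GitHub Archive records) and reads it back line by line.

```r
# Write two made-up JSON lines to a temporary .gz file,
# standing in for the downloaded archive.
tmpgz <- tempfile(fileext = ".json.gz")
con <- gzfile(tmpgz, "w")
writeLines(c('{"type":"PushEvent"}', '{"type":"WatchEvent"}'), con)
close(con)

# Read the compressed file line by line without calling gunzip.
con <- gzfile(tmpgz, "r")
res <- readLines(con)
close(con)

length(res)
```

With the real download, the same readLines() call on gzfile(tmpgz) would yield one character string per event, ready to be passed to a JSON parser.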
Data cleaning
Some events lack certain fields, e.g. language or description, so I first check whether each event has them and use that information to subset the data. I also restrict the data to “PushEvent” records and keep only R repositories. Later I add an extra column, type, which indicates whether the repository is a package or not.
# data cleaning
library(rjson)
res <- scan(tmpjson, what="character", sep="\n")
parsed <- lapply(as.list(res), fromJSON)
library(plyr)
condition <- ldply(parsed, function(x){
  data.frame(type=x$type,
             lang=!is.null(x$repository$language),
             desc=!is.null(x$repository$description))})
dat <- ldply(parsed[condition$type=="PushEvent" & condition$lang & condition$desc],
             function(x){
               data.frame(created_at=x$created_at,
                          isFork=x$repository$fork,
                          forks=x$repository$forks,
                          url=x$repository$url,
                          description=x$repository$description,
                          owner=x$repository$owner,
                          name=x$repository$name,
                          language=x$repository$language,
                          stringsAsFactors=FALSE
               )}
)
dat <- subset(dat, language=="R")
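The filtering step above can be illustrated on a toy example. This is only a sketch using base R: the three events below are hand-built stand-ins (hypothetical names and values, with far fewer fields than real events), and vapply() is swapped in for ldply() so that no extra package is needed.

```r
# Toy stand-ins for parsed events; real events carry many more fields.
parsed <- list(
  list(type = "PushEvent",
       repository = list(name = "a", language = "R", description = "pkg a")),
  list(type = "PushEvent",
       repository = list(name = "b", description = "no language field")),
  list(type = "WatchEvent",
       repository = list(name = "c", language = "R", description = "pkg c"))
)

# TRUE only for push events that carry both optional fields.
keep <- vapply(parsed, function(x) {
  x$type == "PushEvent" &&
    !is.null(x$repository$language) &&
    !is.null(x$repository$description)
}, logical(1))

# Names of the repositories that survive the filter.
kept_names <- vapply(parsed[keep], function(x) x$repository$name, character(1))
kept_names
```

Accessing a missing element of a list with $ returns NULL rather than raising an error, which is why the is.null() checks are enough to detect absent fields.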
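# summarise
library(dplyr)
# %>% replaces the older %.% operator, which current dplyr no longer provides
result <- dat %>%
  group_by(owner, name, description, url, isFork) %>%
  summarise(forks=sum(forks), recent_activity=max(created_at))
# a repository counts as a package if a DESCRIPTION file exists on its master branch
result$type <- laply(sprintf("https://raw.github.com/%s/%s/master/DESCRIPTION",
                             result$owner, result$name),
                     # tolower() keeps the comparison below independent of the category's capitalisation
                     function(x) tolower(httr::http_status(httr::GET(x))$category))
result$type <- ifelse(result$type=="success", "package", "other")
# turn the repository URL into a clickable link for the table
result$url <- paste0("<a href='", result$url, "' target='_blank'>URL</a>")
result <- result %>% arrange(desc(type), desc(recent_activity))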
Visualize
For the visualization, I use dTable from the rCharts package, which renders the data frame as an interactive table. Here's the result.
library(rCharts)
dTable(result, sScrollX="600px", sScrollY="400px",
       bPaginate=TRUE, sPaginationType="full_numbers",
       bScrollInfinite=TRUE, bScrollCollapse=TRUE)