Tidewater Big Data Enthusiasts Presentation Repository
Introduction and background
We are a group of people in the Tidewater area who are interested in exploring, sharing, and understanding Big Data. Our members bring a mixture of interests and expertise in various aspects of Big Data: what it is, how it works, and how it affects each of us. We also help people establish personal networks of friends and colleagues with similar interests.
A collection of things that I hope the group will explore includes:
- What is "Big Data," how does it affect me, and how am I supplying "Big Data"? Probably include a simple demonstration of using Pig to process Medicare data, and then displaying the results via R.
- Where can I get my hands on some Big Data? Medicare payments, pharmaceutical payments, census data, ZCTAs, etc. Some sort of demo on getting data from these places, how the data has to be munged, and what kinds of problems exist with all data sets.
- What kinds of tools are available for processing Big Data? The world doesn't end at Hadoop or Cassandra. There are other tools/applications that might be a better fit.
- How do I visualize all this data? Getting Big Data is fun. Analyzing it can be a challenge. When it is all over, how can the data be made real with some sort of visualization technique?
- What are the challenges with real-time Big Data? Firstly, what does real-time mean? Secondly, what kinds of tools are available to handle masses of real-time data?
- How does the "Internet of Things" affect what we call Big Data? As more and more things (cars, phones, refrigerators, wearable devices) are wired, and more and more data is being collected, how does that affect what we do with Big Data?
As we talk and share ideas, other topics will come up and we will follow them to see where they go.
Come ready to share ideas, experiences, and interest in all things Big Data.
Presentations we have had along the way
2015 Oct. 27: A Presentation on Microsoft Azure and AWS Elastic Compute Cloud (EC2)
Steve Jones will be giving a short presentation on "Big Data in the Cloud: Microsoft Azure." Azure is a cloud computing platform and infrastructure, created by Microsoft, for building, deploying and managing applications and services through a global network of Microsoft-managed and Microsoft partner hosted datacenters.
After that, we'll have a hands-on introduction to Amazon Web Services (AWS) Elastic Compute Cloud (EC2). In order to access AWS EC2, you have to create an account, which includes giving them credit card information. Nothing that we'll do will result in charges on your card. They want the card in case you start using their service a lot, or use it to store data on the "cloud."
As part of the hands-on session we'll be using a virtual machine to run RStudio. RStudio is a GUI front end to R. R is a programming environment for statistical computing and graphics and is a mainstay for Big Data analysis.
Low resolution report
High resolution report
2015 Nov. 24: Medicare Payments to the Tidewater Area
In 2013, Medicare was used by about 53 million people. In 2014, there were about 11 million records made available by the Centers for Medicare and Medicaid Services on some of those 53 million people.
At our next meeting, we will explore some of these data records. We'll use a Hadoop/Hive back end to provide raw data to an R script, resulting in a PDF report showing the financial impact of selected procedures in various ZIP codes in the Tidewater area.
Bring your laptop or smartphone. This is a hands-on Big Data exploration in the wild!!
The final report.
2015 Dec. 22: No meet-up. Too close to Christmas
2016 Jan. 26: Tools and techniques to visualize Big Data
It is very difficult to come to grips with large data sets. Even the relatively small Medicare data set that we looked at in November (9.3 million records) can be overwhelming. We'll take a look at how to visualize large data sets (and have some hands-on activity) when we take ideas from Nathan Yau's book "Data Points: Visualization That Means Something."
2016 Feb. 23: Publicly Available Sources of Big Data
We constantly hear that we are awash in Big Data. We create Big Data. We are a part of Big Data. Everything that we do is somehow a part of Big Data.
What we don't hear is how can I get my hands on some of this Big Data stuff. Where is it? What does it look like? How can I get my hands dirty playing with Big Data?
We will take a look at Big Data sources: what kinds of sources are available, what Big Data looks like, and how to find more of it.
Report about looking at EXIF data
Report on how many Vs are there in Big Data
A dump of selected Tweets
2016 Mar. 22: Big Data Variety, or I've got your number
Doug Laney has been credited with identifying the original Big Data 3Vs: volume, velocity, and variety. He characterized these as being part of the 3D Data Management problem that was "breaking" traditional relational database management systems. His 3Vs caught on and now there are more Vs than you might care to count. At this meet-up, we'll be talking about one of his initial Vs: variety.
We'll focus on three simple types of numbers: phone numbers, credit card numbers, and Social Security numbers. We'll look at how many different ways these simple numbers can be, and are, requested on the Internet, and then we'll extend those ideas to tougher problems like street addresses.
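To make the variety problem concrete, here is a minimal sketch of reducing one phone number, written several different ways, to a single canonical form. The function name `normalize_us_phone` and the sample numbers are assumptions made up for illustration, and the rule handles only 10-digit US numbers with an optional leading country code.

```python
import re


def normalize_us_phone(raw):
    """Reduce a US phone number written in any common style to 10 digits.

    Returns None when the input does not hold a plausible 10-digit number.
    """
    digits = re.sub(r"\D", "", raw)  # keep only the digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading US country code
    return digits if len(digits) == 10 else None


# Fictional numbers (the 555-01xx range), for illustration only.
samples = [
    "(757) 555-0100",
    "757.555.0100",
    "+1 757 555 0100",
    "7575550100",
    "555-0100",  # no area code, so it is rejected
]
for sample in samples:
    print(f"{sample!r:>20} -> {normalize_us_phone(sample)}")
```

Street addresses are tougher because no simple rule like "keep the digits" exists for them.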
2016 Apr. 26: Big Data Velocity, or Where and Why is this Data Coming so Fast
Doug Laney has been credited with identifying the original Big Data 3Vs: volume, velocity, and variety. His 3Vs caught on and now there are more Vs than you might care to count. At this meet-up, we'll be talking about one of his initial Vs: velocity.
We'll start off with velocity in Laney's context, and then look at it from a more recent point of view. We'll talk about tweets, Twitter, and some of the things that you can learn about tweeters. We'll take a peek at a very small portion of the "tweeterverse" and then go live to see what is out there. There might be more to a tweet than meets the eye.
Real time tweet sample
2016 May 24: Using Big Data to Get Us from Where We Think We Are to Where We Might Want to Go
We'll explore some of the Internet Movie Database's more than 6,000,000 records. We'll try to get from where we thought we were (starting with Levenshtein's algorithm) to where we might want to be, using a PostgreSQL relational database. Levenshtein's algorithm is behind many of today's spell checker functions. We'll use it to make sense of some misspelled actors' names, and use those actors to make recommendations of movies that are sort of like the ones we already like. And we'll do all this live.
We'll talk about PostgreSQL a little, then about how we'll use the Internet Movie Database (IMDb) as a source of interesting data: data that is interesting in its own right, and data that we can ask interesting questions about. After we've talked about the prototype recommender system, we'll test it live.
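As a rough illustration of the Levenshtein idea, the sketch below computes the edit distance between two strings and uses it to pick the closest known name for a misspelled one. It is a plain Python stand-in, not the code used at the meet-up; the helper `closest_name` and the short actor list are assumptions for illustration.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def closest_name(query, known_names):
    """Pick the known name with the smallest edit distance to the query."""
    return min(known_names,
               key=lambda name: levenshtein(query.lower(), name.lower()))


# A tiny hand-picked list; the real data set holds millions of records.
actors = ["Kevin Bacon", "Meryl Streep", "Tom Hanks"]
print(closest_name("Kevn Bakon", actors))
```

Comparing a query against every name this way is slow on millions of rows, which is one reason to push the work into the database.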
2016 June 28: Using Big Data to Connect the Dots from One Place to Another
We'll explore the world of graph databases: databases that don't have tables, or rows, or columns, and don't use the Structured Query Language (SQL). These databases are behind things like Google Maps, finding board of director relationships between companies, first level technical support questions, and disease diagnosis.
We'll use our friend, the Internet Movie Database, and the Neo4j software to play the "Six Degrees of Kevin Bacon" game, to find things like the actor (or actress) who has appeared in the most movies, and to find the movie that was most influential in cinematographic arts. These are the same questions that are used to map out companies and terrorist cells.
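The "Six Degrees" game boils down to a shortest-path search over a graph whose nodes are actors and whose edges join actors who shared a movie. The sketch below uses a breadth-first search in plain Python rather than Neo4j; the function names and the made-up cast lists are assumptions for illustration only.

```python
from collections import deque


def build_costar_graph(casts):
    """Map each actor to the set of actors they share at least one movie with."""
    graph = {}
    for cast in casts.values():
        for actor in cast:
            costars = (other for other in cast if other != actor)
            graph.setdefault(actor, set()).update(costars)
    return graph


def degrees_of_separation(graph, start, goal):
    """Breadth-first search; returns the hop count, or None if unconnected."""
    if start not in graph or goal not in graph:
        return None
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        actor, hops = queue.popleft()
        if actor == goal:
            return hops
        for costar in graph[actor]:
            if costar not in seen:
                seen.add(costar)
                queue.append((costar, hops + 1))
    return None


# Made-up movies and actors, for illustration only.
casts = {
    "Movie A": ["Actor 1", "Actor 2"],
    "Movie B": ["Actor 2", "Actor 3"],
    "Movie C": ["Actor 3", "Actor 4"],
    "Movie D": ["Actor 5"],
}
graph = build_costar_graph(casts)
print(degrees_of_separation(graph, "Actor 1", "Actor 4"))  # three hops
print(degrees_of_separation(graph, "Actor 1", "Actor 5"))  # no connection
```

A graph database stores these edges directly, so a query like this does not need the repeated joins a relational schema would require.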
Live run of Internet mapping
2016 July 26: Using Big Data Tools when there are Holes in the Data
We'll explore the world of columnar databases: databases that have rows and columns, but where the intersection of a row and a column can have 0 or more values. Values can be versioned and timestamped for automatic deletion, among other neat features. We'll look at HBase (one of many databases built on top of Hadoop) to explore some of the data in the Internet Movie Database. HBase is used by Adobe, LinkedIn, Netflix, Spotify, and others.
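The "0 or more values per cell" idea can be pictured with a toy in-memory model. The class below is a hypothetical sketch and is not the HBase API; the class name, method names, and the row and column keys are all assumptions for illustration, and automatic deletion by timestamp is left out.

```python
import time
from collections import defaultdict


class SparseVersionedTable:
    """Toy model: row -> column -> list of (timestamp, value), newest first."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp=None):
        """Add a new version of a cell without overwriting older versions."""
        stamp = time.time() if timestamp is None else timestamp
        cell = self._rows[row][column]
        cell.append((stamp, value))
        cell.sort(key=lambda version: version[0], reverse=True)

    def get(self, row, column, versions=1):
        """Return up to `versions` newest values; an empty list marks a hole."""
        if row not in self._rows or column not in self._rows[row]:
            return []
        return [value for _, value in self._rows[row][column][:versions]]


table = SparseVersionedTable()
table.put("movie:0001", "info:title", "Working Title", timestamp=1)
table.put("movie:0001", "info:title", "Final Title", timestamp=2)
print(table.get("movie:0001", "info:title"))              # newest version only
print(table.get("movie:0001", "info:title", versions=2))  # both versions
print(table.get("movie:0001", "info:rating"))             # a hole: no values
```

Cells that were never written take up no space, which is what makes this layout a good fit for data with holes in it.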
2016 Aug 30: Using Big Data to look at Political Party Platforms
The political season is upon us. All of the major parties have declared their platforms and how they will change the country for the better.
We'll use some Big Data textual analysis tools and techniques to look at the platforms from the five major parties. Sentiment analysis will quantify how positive or negative a platform is. Vocabulary analysis will tell us what grade level the platform is speaking to. We'll look at how vocabulary, sentiment, and state population demographics could be combined to predict how a state might vote.
Lots of balls in the air. It will be interesting to see how they land.
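One simple way to quantify how positive or negative a text is counts words from a sentiment lexicon. The sketch below is an assumed approach for illustration, not necessarily the one used at the meet-up; the two tiny word sets are made up, and a real analysis would use a published lexicon.

```python
import re

# Hypothetical mini-lexicon, made up for this example.
POSITIVE = {"better", "strong", "freedom", "prosperity", "secure"}
NEGATIVE = {"crisis", "failure", "threat", "corrupt", "weak"}


def sentiment_score(text):
    """Return (positive - negative) / total words, a value in [-1, 1]."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)


print(sentiment_score("A strong and secure future"))
print(sentiment_score("A weak response to the crisis"))
```

Dividing by the word count keeps long and short platforms comparable.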
Exploration with Text Mining
2016 Sep 27: Where did we come from? Where might we go?
The US population is constantly moving. From the East westward. From the North southward, and then back. Is there a way to visualize these movements over time?
We'll dive into the US Census databases looking for population density data. We expect that we'll easily find data for recent censuses; earlier ones may be a challenge. The data should have at least these attributes: location, magnitude, and time. It may also have these: age, gender, race, or type of household. We'll see what data is available, and decide how to visualize the changes over time.
Come join us while we swim through this data lake.
2016 Oct 25: Big Data Potential of GDELT
The Global Database of Events, Language, and Tone (GDELT) is billed as the largest, most comprehensive, and highest resolution open database of human society ever created. It monitors the world's news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages, every moment of every day, and stretches back to January 1, 1979, with daily updates. GDELT has created a database of a quarter billion georeferenced records covering the entire world over 30 years.
We'll take a peek into GDELT and see what we can do with a standalone application, and also what can be done using Google's Big Table technology.
2016 Nov 3: Creating Shaped Wordclouds Using R
The R "wordcloud" library provides an easy way to create an image showing how often a word (or tag) appears in a corpus. In a word cloud, the size of a word indicates how often that word appears. Word cloud words can be colored as well. While word clouds are easy to create, often the clouds could be shaped differently to create a more lasting and profound impression.
2016 Nov 22: Connecting the Dots in Wikipedia
Dirk Gently's code is that everything is connected. This is the basis of his holistic detective agency, and is his life's guiding principle. We will take Dirk's principal idea and apply it to Wikipedia. We'll explore a portion of Wikipedia and see how many pages/links separate a specific page from another. In graph theory this is called the shortest path; Stanley Milgram made this idea popular with his "Small-World" paper, and many people have played the "Six Degrees of Kevin Bacon" game. We'll also look at the average number of links between pages.
Along the way, we'll look at words that are common between the pages, the educational level of the pages, and other things if we have time. For fun, we'll use SQL and non-SQL databases to manage the exploration process and keep track of the results.
2017 Jan 29: Will you live to see your program end?
In the world of Computer Science, there are all sorts of problems. Problems that can be easily stated, but not so easily answered. Problems for which there is a single, optimal answer that can be arrived at in a reasonable length of time. Problems for which an optimal answer cannot be arrived at using the world's fastest computers before the universe comes to an end. Some of these problems are simple, and some are complex. We will be taking a look at some of these problems in general, and then look at how some can be attacked using R.
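The text above does not name a specific problem, so as an assumed example, consider checking every possible round trip through a set of cities (the traveling salesman problem) by brute force. The sketch below counts the distinct routes and estimates the run time; the checking rate of one billion routes per second is a made-up figure for illustration.

```python
import math


def route_count(cities):
    """Distinct round trips through `cities` cities: (n - 1)! / 2 for n >= 3."""
    if cities < 3:
        return 1
    return math.factorial(cities - 1) // 2


def brute_force_years(cities, routes_per_second=1e9):
    """Years needed to check every route at an assumed checking rate."""
    seconds = route_count(cities) / routes_per_second
    return seconds / (365.25 * 24 * 3600)


for n in (10, 20, 30):
    print(f"{n} cities: {route_count(n):.3e} routes, "
          f"{brute_force_years(n):.3e} years")
```

Each added city multiplies the work, so a faster computer only postpones the wall by a city or two.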
2017 Mar 1: Weather or Not You Believe It
Weather is everywhere. It sounds trite to say that, but it is true, and a lot of real-time data is available for free, just for the downloading. We will look at semi-real-time weather data available from the National Oceanic and Atmospheric Administration (NOAA), collected and consolidated from over 25,000 uniquely identified United States Air Force (USAF) and 3,000 Weather-Bureau-Army-Navy (WBAN) weather stations worldwide. Some weather stations have both USAF and WBAN identifiers, while others may have only one. During this exploration, we will be creating "heat maps" of temperature for the states of Virginia, North Carolina, and South Carolina. These states were chosen because of local interest. The attached R script can be modified to display the same data for any collection of US states.
2017 Apr 1: In Search of the Royal Mail Ship (RMS) Titanic
The sinking of the Royal Mail Ship (RMS) Titanic on her maiden voyage is a source of constant mystery and romance. Now, after more than a century, there are still unanswered questions about the disaster that made her a part of the English lexicon. Perhaps the simplest question is: how many people (passengers and crew) were on board when she sank, and how many survived? Surprisingly, there is no definitive answer to this most simple of questions, neither from the White Star Line (her owner) nor from the British Wreck Commissioner assigned to inquire into her sinking. In this report, we will enumerate some of the disparate sources, and look at some data that has made its way into the R programming language.