Data Geeks Go Head to Head

For North Carolina college students, “big data” is becoming a big deal. The proof: signups for DataFest, a 48-hour number-crunching competition held at Duke last weekend, set a record for the third time in a row this year.

DataFest 2017

More than 350 data geeks swarmed Bostock Library this weekend for a 48-hour number-crunching competition called DataFest. Photo by Loreanne Oh, Duke University.

Expected turnout was so high that event organizer and Duke statistics professor Mine Cetinkaya-Rundel was even required by state fire code to sign up for “crowd manager” safety training — her certificate of completion is still proudly displayed on her Twitter feed.

Nearly 350 students from 10 schools across North Carolina, California and elsewhere flocked to Duke’s West Campus from Friday, March 31 to Sunday, April 2 to compete in the annual event.

Teams of two to five students worked around the clock over the weekend to make sense of a single real-world data set. “It’s an incredible opportunity to apply the modeling and computing skills we learn in class to actual business problems,” said Duke junior Angie Shen, who participated in DataFest for the second time this year.

The surprise dataset was revealed Friday night. Just taming it into a form that could be analyzed was a challenge. Containing millions of data points from an online booking site, it was too large to open in Excel. “It was bigger than anything I’ve worked with before,” said NC State statistics major Michael Burton.


The mystery data set was revealed Friday night in Gross Hall. Photo by Loreanne Oh.

Because of its size, even simple procedures took a long time to run. “The dataset was so large that we actually spent the first half of the competition fixing our crushed software and did not arrive at any concrete finding until late afternoon on Saturday,” said Duke junior Tianlin Duan.
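For readers curious what taming a file too big for Excel can look like, here is a minimal Python sketch. It reads a CSV one row at a time and keeps only a running tally, so memory use stays small however many rows there are. The column names (`booking_id`, `destination`) and the sample rows are made up for illustration; the actual DataFest data had not been released when this was written, and this is not any team's method.

```python
import csv
import io
from collections import Counter


def count_by_column(lines, column):
    """Tally the values of one column while reading a CSV row by row."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[row[column]] += 1
    return counts


# Tiny made-up sample standing in for a file with millions of rows.
# With a real file you would pass open("bookings.csv", newline="") instead.
sample = io.StringIO(
    "booking_id,destination\n"
    "1,Paris\n"
    "2,Tokyo\n"
    "3,Paris\n"
)

destination_counts = count_by_column(sample, "destination")
print(destination_counts.most_common(1))
```

Because only the tally is held in memory, the same loop works whether the file has three rows or thirty million; it just takes longer.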

The organizers of DataFest don’t specify research questions in advance. Participants are given free rein to analyze the data however they choose.

“We were overwhelmed with the possibilities. There was so much data and so little time,” said NCSU psychology major Chandani Kumar.

“While for the most part data analysis was decided by our teachers before now, this time we had to make all of the decisions ourselves,” said Kumar’s teammate Aleksey Fayuk, a statistics major at NCSU.

As a result, these budding data scientists don’t just write code. They form theories, find patterns, test hunches. Before the weekend is over they also visualize their findings, make recommendations and communicate them to stakeholders.

This year’s participants came from more than 10 schools, including Duke, UNC, NC State and North Carolina A&T. Students from UC Davis and UC Berkeley also made the trek. Photo by Loreanne Oh.

“The most memorable moment was when we finally got our model to start generating predictions,” said Duke neuroscience and computer science double major Luke Farrell. “It was really exciting to see all of our work come together a few hours before the presentations were due.”

Consultants were available throughout the weekend to help with any questions participants might have. Recruiters from both start-ups and well-established companies were also on site for participants looking to network or share their resumes.

“Even as late as 11 p.m. on Saturday we were still able to find a professor from the Duke statistics department at the Edge to help us,” said Duke junior Yuqi Yun, whose team presented their results in a winning interactive visualization. “The organizers treat the event not merely as a contest but more of a learning experience for everyone.”

Caffeine was critical. “By 3 a.m. on Sunday morning, we ended initial analysis with what we had, hoped for the best, and went for a five-hour sleep in the library,” said NCSU’s Fayuk, whose team DataWolves went on to win best use of outside data.

By Sunday afternoon, every surface of The Edge in Bostock Library was littered with coffee cups, laptops, nacho crumbs, pizza boxes and candy wrappers. Whiteboards were covered in scribbles from late-night brainstorming sessions.

“My team encouraged everyone to contribute ideas. I loved how everyone was treated as a valuable team member,” said Duke computer science and political science major Pim Chuaylua. She decided to sign up when a friend asked if she wanted to join their team. “I was hesitant at first because I’m the only non-stats major in the team, but I encouraged myself to get out of my comfort zone,” Chuaylua said.

“I learned so much from everyone since we all have different expertise and skills that we contributed to the discussion,” said Shen, whose teammates were majors in statistics, computer science and engineering. Students majoring in math, economics and biology were also well represented.

At the end, each team was allowed four minutes and at most three slides to present their findings to a panel of judges. Prizes were awarded in several categories, including “best insight,” “best visualization” and “best use of outside data.”

Duke is among more than 30 schools hosting similar events this year, coordinated by the American Statistical Association (ASA). The winning presentations and mystery data source will be posted on the DataFest website in May after all events are over.

The registration deadline for the next Duke DataFest will be in March 2018.


Bleary-eyed contestants pose for a group photo at Duke DataFest 2017. Photo by Loreanne Oh.


Post by Robin Smith

Young Scientists, Making the Rounds

“Can you make a photosynthetic human?!” an 8th grader enthusiastically asks me while staring at a tiny fern in a jar.

He’s not the only one who asked me that either — another student asked if Superman was a plant, since he gets his power from the sun.

These aren’t the normal questions I get about my research as a Biology PhD candidate studying how plants get nutrients, but they were perfect for the day’s activity: a science round robin with Durham eighth-graders.

Biology grad student Leslie Slota showing Durham 8th graders some fun science.

After seeing a post under #scicomm on Twitter describing a public engagement activity for scientists, I put together a group of Duke graduate scientists to visit local middle schools and share our science with kids. We had students from biomedical engineering, physics, developmental biology, statistics, and many others — a pretty diverse range of sciences.

With help from David Stein at the Duke-Durham Neighborhood Partnership, we made connections with science teachers at the Durham School of the Arts and Lakewood Montessori school, and the event was in motion!

The outreach activity we developed works like speed dating, where people pair up, talk for 3-5 mins, and then rotate. We started out calling it “Science Speed Dating,” but for a middle school audience, we thought “Science Round-Robin” was more appropriate. Typically, a round-robin is a tournament where every team plays each of the other teams. So, every middle schooler got to meet each of us graduate students and talk to us about what we do.

The topics ranged from growing back limbs and mapping the brain, to using math to choose medicines and manipulating the different states of matter.

The kids were really excited for our visit, and kept asking their teachers for the inside scoop on what we did.

After much anticipation, and a little training and practice with Jory Weintraub from the Science & Society Initiative, two groups of 7-12 graduate students armed themselves with photos, animals, plants, and activities related to our work and went to visit these science classes full of eager students.

First-year MGM grad student Tulika Singh (top right) brought cardboard props to show students how antibodies match up with cell receptors.

“The kids really enjoyed it!” said Alex LeMay, middle- and high-school science teacher at the Durham School of the Arts. “They also mentioned that the grad students were really good at explaining ideas in a simple way, while still not talking down to them.”

That’s the ultimate trick with science communication: simplifying what we do, but not talking to people like they’re stupid.

I’m sure you’ve heard the old saying, “dumb it down.” But it really doesn’t work that way. These kids were bright, and often we found them asking questions we’re actively researching in our work. We don’t need to talk down to them, we just need to talk to them without all of the exclusive trappings of science. That was one thing the grad students picked up on too.

“It’s really useful to take a step back from the minutiae of our projects and look at the big picture,” said Shannon McNulty, a PhD candidate in Molecular Genetics and Microbiology.

The kids also loved the enthusiasm we showed for our work! That made a big difference in whether they were interested in learning more and asking questions. Take note, fellow scientists: share your enthusiasm for what you do, it’s contagious!

Another thing that worked really well was connecting with the students in a personal way. According to Ms. LeMay, “if the person seemed to like them, they wanted to learn more.” Several of the grad students would ask each student their names and what they were passionate about, or even talk about their own passions outside of their research, and these simple questions allowed the students to connect as people.

There was one girl who shared with me that she didn’t know what she wanted to do when she grew up, and I told her that’s exactly where I was when I was in 8th grade too. We then bonded over our mutual love of baking, and through that interaction she saw herself reflected in me a little bit. That made a career in science seem like a possibility, which is especially important for a young girl with a growing interest in science.

Making the rounds in these science classrooms, we learned just as much from the students we spoke to as they did from us. Our lesson being: science outreach is a really rewarding way to spend our time, and who knows, maybe we’ll even spark someone who loves Superman to figure out how to make the first photosynthesizing super-person!

Guest post by Ariana Eily, PhD Candidate in Biology, shown sharing her floating ferns at left.

 

Would You Expect a ‘Real Man’ to Tweet “Cute” or Not?

There’s nothing cute about stereotypes, but as a species, we seem to struggle to live without them.

In a clever new study led by Jordan Carpenter, who is now a postdoctoral fellow at Duke, a University of Pennsylvania team of social psychologists and computer scientists figured out a way to test just how accurate our stereotypes about language use might be, using a huge collection of real tweets and a form of artificial intelligence called “natural language processing.”


Word clouds show the words in tweets that raters mistakenly attributed to Female authors (left) or Males (right). The larger the word appears, the more often the raters were fooled by it. Word color indicates the frequency of the word; gray is least frequent, then blue, and dark red is the most frequent. <url> means they used a link in their tweet.

Starting with a data set that included the 140-character bon mots of more than 67,000 Twitter users, they figured out the actual characteristics of 3,000 of the authors. Then they sorted the authors into piles using four criteria – male v. female; liberal v. conservative; younger v. older; and education (no college degree, college degree, advanced degree).

A random set of 100 tweets by each author over 12 months was loaded into the crowd-sourcing website Amazon Mechanical Turk. Intertubes users were then invited to come in and judge what they perceived about the author one characteristic at a time, like age, gender, or education, for 2 cents per rating. Some folks just did one set, others tried to make a day’s wage.

The raters were best at guessing politics, age and gender. “Everybody was better than chance,” Carpenter said. When guessing at education, however, they were worse than chance.
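To make “better than chance” concrete, here is a small Python sketch. It compares a rater's accuracy with what blind guessing would score if every category were equally likely. The labels are invented for illustration and are not data from the study, and the paper may define its chance baseline differently.

```python
def accuracy(guesses, truths):
    """Fraction of guesses that match the true labels."""
    correct = sum(g == t for g, t in zip(guesses, truths))
    return correct / len(truths)


def uniform_chance(labels):
    """Expected accuracy of blind guessing among equally likely categories."""
    return 1 / len(set(labels))


# Invented example: four authors, one rater's guesses about gender.
truths = ["female", "male", "female", "male"]
guesses = ["female", "male", "male", "male"]

gender_accuracy = accuracy(guesses, truths)
gender_chance = uniform_chance(truths)
print(gender_accuracy > gender_chance)
```

With two categories the bar is 50 percent. With the study's three education levels, blind guessing under this simple baseline would be right about a third of the time.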


Jordan Carpenter is a newly-arrived Duke postdoc working with Walter Sinnott-Armstrong in philosophy and brain science.

“When they saw the word S*** [this is a family blog folks, work with us here] they most often thought the author didn’t have a college degree. But where they went wrong was they overestimated the importance of that word,” Carpenter said. Raters seemed to believe that a highly-educated person would never tweet the S-word or the F-word. Unfortunately, not true! “But it is a road to people thinking you’re not a Ph.D.,” Carpenter wisely counsels.

The raters were 75 percent correct on gender, by assuming women would be tweeting words like Love, Cute, Baby and My, interestingly enough. But they got tricked most often by assuming women would not be talking about News, Research or Ebola or that the guys would not be posting Love, Life or Wonderful.

Female authors were slightly more likely to be liberal in this sample of tweets, but not as much as the raters assumed. Conservatism was viewed by raters as a male trait. Again, generally true, but not as much as the raters believed.

Youthful authors were correctly perceived to be more likely to namedrop a @friend, or say Me and Like and a few variations on the F-bomb, but they could throw the raters for a loop by using Community, Our and Original.

And therein lies the social psychology takeaway from all this: “An accurate stereotype should be one with accurate social judgments of people,” but clearly every stereotype breaks down at some point, leading to “mistaken social judgement,” Carpenter said. Just how much stereotypes should be used or respected is a hot area of discussion within the field right now, he said.

The other value of the paper is that it developed an entirely new way to apply the tools of Big Data analysis to a social psychology question without having to invite a bunch of undergraduates into the lab with the lure of a Starbucks gift card. Using tweets stripped of their avatars or any other identifier ensured that the study was testing what people thought of just the words, nothing else, Carpenter said.

The paper is “Real Men Don’t Say ‘Cute’: Using Automatic Language Analysis To Isolate Inaccurate Aspects Of Stereotypes.” You can see the paper in Social Psychological and Personality Science, if you have a university IP address and your library subscribes to Sage journals. Otherwise, here’s a press release from the journal. (DOI: 10.1177/1948550616671998)

Post by Karl Leif Bates

Diabetes — and Privacy — Meet ‘Big Data’

“Click here to consent forever.”

If consent to participate in medical research were that simple, Joanna Radin of Yale University would have to find a new focus for her research, and I would never have found the Trent Center for Bioethics, Humanities & History of Medicine.

Luckily for us both, this is not the case. Medical consent is a very complex issue that can, as Radin’s research attests, traverse generations.


Joanna Radin’s research focuses on the intersection of medical history, anthropology and ethics at Yale University. Source: Yale School of Medicine

Radin is an Associate Professor of Medical History at Yale, the perfect fit for the Humanities in Medicine Lecture Series taking place this month at the Trent Center. Her research nails the narrow intersection of medical history, anthropology, bioethics and data analytics. In fact, Radin’s appeal is so broad that her visit to Duke was sponsored by no fewer than six Duke departments, including the Departments of Computer Science, History, Electrical and Computer Engineering, Cultural Anthropology and Statistical Science.

Radin’s lecture homed in on a well-known case in the realm of bioethics and medical history: the Pima Native American tribe in Arizona, which is known for unusually high rates of diabetes and obesity. The Pima were the first Native American tribe to be granted a reservation in Arizona (30,000 acres) at the beginning of the California Gold Rush. In 1963, following nearly half a century of mass famine among the Pima, the National Institutes of Health (NIH) conducted a survey for rheumatoid arthritis in the Pima tribe and instead discovered a frighteningly high frequency of diabetes.

In 1965, the NIH initiated a long-term observational study of the Pima that continued for about 40 years, though it was meant to last no more than 10. The goal of the study was to learn about diabetes in the “natural laboratory” of sorts that the Pima reservation unwittingly provided. The data collected in this study came to be known as the Pima Indian Diabetes Data set (PIDD).

Machine learning enters the story around 1987, when David Aha and colleagues at the University of California, Irvine (UCI) created the UCI Machine Learning Repository, an archive containing thousands of data sets, databases and data generators. The repository is still active today, virtually a gold mine for researchers in machine learning to test their algorithms. The PIDD is one of the oldest data sets on file in the UCI archive, “a standard for testing data mining algorithms for accuracy in predicting diabetes,” according to Radin.

pima_indian_man_miguel_a_farmer_pima_arizona_ca-1900_chs-3625

A Pima farmer in Pima, Arizona, circa 1900. Source: Wikimedia Commons

Generations’ worth of data on the Pima tribe have been publicly accessible in the UCI archive for over two decades, creating ethical controversy around the accessibility of information as personal as blood pressure, body mass index (BMI) and number of pregnancies of Pima Native Americans. Though the PIDD can help refine machine learning algorithms that could accurately predict, and help prevent, diabetes, the privacy issues raised by the public availability of the data are impossible to ignore.

This is where “eternal” medical consent enters the equation: no researcher can realistically inform a study participant of what their medical data will be used for 40 years in the future.

These are the interdisciplinary questions that Radin brought forth in her lecture, weaving together seemingly opposite fields of study in an engaging, thought-provoking presentation. No one who left that room will look at the Apple Terms & Conditions the same way again.

 

Post by Maya Iskandarani

Walla Scores Grand Prize at 17th Annual Start-Up Challenge

The finalists of Duke’s 17th Annual Start-Up Challenge have found time between classes, homework, and West Union runs to research and develop pitches aiming to solve real-world problems with entrepreneurship. The event, hosted last week at the Fuqua School of Business, featured a Trinity alum as the keynote speaker. Beating out the other seven start-up pitches for the $50,000 Grand Prize was Walla, an app founded by Judy Zhu, a Pratt senior.


Judy Zhu and the Walla team pose with their $50,000 check, which is giant in more ways than one.

Walla aims to create a social health platform for college students by addressing widespread loneliness and creating a more inclusive campus community. The app’s users post open invitations to activities, from study groups to pick-up sports, allowing students to connect over shared interests.

Walla works closely with Duke Medicine, providing data from user activity to medical researchers. User engagement is analyzed to give professionals valuable information on mental health in young adults. The app currently has 700 monthly active users, with 3,000 anticipated within the next month, and many more as the app opens to other North Carolina colleges.

Tatiana Birgisson returned to Duke to talk about her own experience building a business as an undergrad, one that won the Start-Up Challenge in 2013. Birgisson’s venture, MATI energy drink, was born in her Central Campus dorm room and, with the support of Duke I&E resources, became the major energy drink contender it is today, a healthy alternative to Monster or Red Bull.

The $2,500 Audience Choice award went to Ebb, an app designed to empower women on their periods by keeping them informed of physical and emotional symptoms throughout the course of their cycles, and creating a community through which menstruating women can receive support from those they choose to share information with.


Tatiana Birgisson won the 2013 startup challenge with an energy drink brewed in her dorm room, now sold as MATI.

Other finalists included BioMetrix, a wearable platform for injury prevention; GoGlam, an application to connect working women with beauticians in Latin America; Grow With Nigeria, which provides engaging STEM experiences for students in Nigeria; MedServe; Tiba Health; and Teraphic.

This year’s Start-Up Challenge was a major success, with innovative entrepreneurs coming together to share their projects on changing the world. Be sure to come out next year; I’ll post an invite on Walla!

Post by Devin Nieusma

Students Mine Parking Data to Help You Find a Spot

No parking spot? No problem.

A group of students has teamed up with Duke Parking and Transportation to explore how data analysis and visualization can help make parking on campus a breeze.

As part of the Information Initiative’s Data+ program, students Mitchell Parekh (’19) and Morton Mo (’19), along with IIT student Nikhil Tank (’17), spent 10 weeks over the summer poring over parking data collected at 42 of Duke’s permitted lots.

Under the mentorship of graduate student Nicolas-Aldebrando Benelli, they identified common parking patterns across the campus, with the goal of creating a “redirection” tool that could help Duke students and employees figure out the best place to park if their preferred lot is full.

A map of parking patterns at Duke

To understand parking patterns at Duke, the team created “activity” maps, where each circle represents one of Duke’s parking lots. The size of the circle indicates the size of the lot, and the color of the circle indicates how many people entered and exited the lot within a given hour.

“We envision a mobile app where, before you head out for work, you could check your lot on your phone,” Mo said, speaking with Parekh at the Sept. 23 Visualization Friday Forum. “And if the lot is full, it would give you a pass for an alternate lot.”

Starting with parking data gathered in Fall 2013, which logged permit holders “swiping” in and out from each lot, they set out to map some basic parking habits at Duke, including how full each lot is, when people usually arrive, and how long they stay.

However, the data weren’t always very agreeable, Mo said.

“One of the things we got was a historical occupancy count, which is exactly what we wanted – the number of cars in the facility at a given time – but we were seeing negative numbers,” said Mo. “So we figured that table might not be as trustworthy as we expected it to be.”

Other unexpected features, such as “passback,” which occurs when two cars enter or exit under the same pass, also created challenges with interpreting the data.

However, with some careful approximations, the team was able to estimate the occupancy of each lot on campus at different times throughout an average weekday.
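The article doesn't say which approximations the team used, so the Python sketch below shows just one plausible approach, not their method. It rebuilds an hourly occupancy estimate from swipe-in and swipe-out events and clamps the running count between zero and the lot's capacity, which rules out the negative numbers Mo described. The event format, the `estimate_occupancy` name and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict


def estimate_occupancy(events, capacity, start_count=0):
    """Estimate cars in a lot at each hour from (hour, direction) swipe events.

    direction is "in" or "out". The running count is clamped to
    [0, capacity], an assumed correction for missed or shared swipes.
    """
    net_by_hour = defaultdict(int)
    for hour, direction in events:
        net_by_hour[hour] += 1 if direction == "in" else -1

    occupancy = {}
    count = start_count
    for hour in sorted(net_by_hour):
        count = min(max(count + net_by_hour[hour], 0), capacity)
        occupancy[hour] = count
    return occupancy


# Made-up swipes for a two-space lot. The 7 a.m. "out" has no matching "in",
# which would push a naive running total below zero.
events = [(7, "out"), (8, "in"), (8, "in"), (9, "in"), (17, "out")]
hourly = estimate_occupancy(events, capacity=2)
print(hourly)
```

Clamping hides errors rather than explaining them, so a real analysis would also want to count how often the correction kicks in for each lot.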

They then built an interactive, MATLAB-based tool that would suggest up to three alternative parking locations based on the user’s location and travel time plus the utilization and physical capacity of each lot.
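As a rough illustration of how such a “redirection” step could work, here is a Python sketch. It is not the team's tool, and its scoring rule is an assumption: it drops lots that are nearly full, then ranks the rest by travel time, breaking ties in favor of emptier lots. The `Lot` fields, the 95 percent “full” threshold and every lot except the blue lot named in the video caption are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Lot:
    name: str
    capacity: int          # physical number of spaces, assumed > 0
    occupied: int          # estimated cars currently parked
    travel_minutes: float  # hypothetical travel time from the user's location

    @property
    def utilization(self):
        return self.occupied / self.capacity


def suggest_alternatives(lots, preferred, max_suggestions=3, full_threshold=0.95):
    """Return up to max_suggestions lot names other than the preferred lot."""
    candidates = [
        lot for lot in lots
        if lot.name != preferred and lot.utilization < full_threshold
    ]
    # Closest first; among equally close lots, the emptier one wins.
    candidates.sort(key=lambda lot: (lot.travel_minutes, lot.utilization))
    return [lot.name for lot in candidates[:max_suggestions]]


# Invented numbers for illustration only.
lots = [
    Lot("Blue", 100, 100, 0),
    Lot("Green", 100, 50, 5),
    Lot("Red", 200, 199, 2),
    Lot("Gold", 80, 40, 8),
    Lot("Silver", 50, 10, 5),
    Lot("Bronze", 60, 30, 12),
]
suggestions = suggest_alternatives(lots, preferred="Blue")
print(suggestions)
```

A fuller version might blend travel time and utilization into a single score, or use the occupancy forecast for the commuter's arrival time rather than the current count.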

“Duke Parking is really happy with the interface that we built, and they want us to keep working on it,” Parekh said.

“The data team worked hard on real world challenges, and provided thoughtful insights to those challenges,” said Kyle Cavanaugh, Vice President of Administration at Duke. “The team was terrific to work with and we look forward to future collaboration.”

Hectic class schedules allowing, the team hopes to continue developing their application into a more user-friendly tool. You can watch a recording of Mo and Parekh’s Sept. 23 presentation here.

The team's algorithm recommends up to three alternative lots if a commuter's preferred lot is full. In this video, suggested alternatives to the blue lot are updated throughout the day to reflect changing traffic and parking patterns. Video courtesy of Nikhil Tank.


Post by Kara Manke