
AI companies train language models on YouTube’s archive − making family-and-friends videos a privacy risk

theconversation.com – Ryan McGrady, Senior Researcher, Initiative for Digital Public Infrastructure, UMass Amherst – 2024-06-27 07:23:53
Your kid's silly videos could be fodder for ChatGPT.
Halfpoint/iStock via Getty Images

Ryan McGrady, UMass Amherst and Ethan Zuckerman, UMass Amherst

The promised artificial intelligence revolution requires data. Lots and lots of data. OpenAI and Google have begun using YouTube videos to train their text-based AI models. But what does the YouTube archive actually include?

Our team of digital media researchers at the University of Massachusetts Amherst collected and analyzed random samples of YouTube videos to learn more about that archive. We published an 85-page paper about that dataset and set up a website called TubeStats for researchers and journalists who need basic information about YouTube.

Now, we're taking a closer look at some of our more surprising findings to better understand how these obscure videos might become part of powerful AI systems. We've found that many YouTube videos are meant for personal use or for small groups of people, and a significant proportion were created by people who appear to be under 13.


Bulk of the YouTube iceberg

Most people's experience of YouTube is algorithmically curated: Up to 70% of the videos users watch are recommended by the site's algorithms. Recommended videos are typically popular content such as influencer stunts, clips, explainer videos, travel vlogs and video game reviews, while content that is not recommended languishes in obscurity.

Some YouTube content emulates popular creators or fits into established genres, but much of it is personal: celebrations, selfies set to music, homework assignments, video game clips without context and kids dancing. The obscure side of YouTube – the vast majority of the estimated 14.8 billion videos created and uploaded to the platform – is poorly understood.

Illuminating this aspect of YouTube – and social media generally – is difficult because big tech companies have become increasingly hostile to researchers.

We've found that many videos on YouTube were never meant to be shared widely. We documented thousands of short, personal videos that have few views but high engagement – likes and comments – implying a small but highly engaged audience. These were clearly meant for a small audience of friends and family. Such social uses of YouTube contrast with videos that try to maximize their audience, suggesting another way to use YouTube: as a video-centered social network for small groups.


Other videos seem intended for a different kind of small, fixed audience: recorded classes from pandemic-era virtual instruction, school board meetings and work meetings. While not what most people think of as social uses, they likewise imply that their creators have a different expectation about the audience for the videos than creators of the kind of content people see in their recommendations.

Fuel for the AI machine

It was with this broader understanding that we read The New York Times exposé on how OpenAI and Google turned to YouTube in a race to find new troves of data to train their large language models. An archive of YouTube transcripts makes an extraordinary dataset for text-based models.

There is also speculation, fueled in part by an evasive answer from OpenAI's chief technology officer Mira Murati, that the videos themselves could be used to train AI text-to-video models such as OpenAI's Sora.


The New York Times story raised concerns about YouTube's terms of service and, of course, the copyright issues that pervade much of the debate about AI. But there's another problem: How could anyone know what an archive of more than 14 billion videos, uploaded by people all over the world, actually contains? It's not entirely clear that Google knows or even could know if it wanted to.

Kids as content creators

We were surprised to find an unsettling number of videos featuring kids or apparently created by them. YouTube requires uploaders to be at least 13 years old, but we frequently saw children who appeared to be much younger than that, typically dancing, singing or playing video games.

In our preliminary research, our coders determined nearly a fifth of random videos with at least one person's face visible likely included someone under 13. We didn't take into account videos that were clearly shot with the consent of a parent or guardian.

Our current sample size of 250 is relatively small – we are working on coding a much larger sample – but the findings thus far are consistent with what we've seen in the past. We're not aiming to scold Google. Age validation on the internet is infamously difficult and fraught, and we have no way of determining whether these videos were uploaded with the consent of a parent or guardian. But we want to underscore what is being ingested by these large companies' AI models.


Small reach, big influence

It's tempting to assume OpenAI is using highly produced influencer videos or TV newscasts posted to the platform to train its models, but previous research on large language model data shows that the most popular content is not always the most influential in training AI models. A virtually unwatched conversation between three friends could have much more linguistic value in training a chatbot language model than a music video with millions of views.

Unfortunately, OpenAI and other AI companies are quite opaque about their training materials: They don't specify what goes in and what doesn't. Most of the time, researchers can infer problems with training data through biases in AI systems' output. But when we do get a glimpse at training data, there's often cause for concern. For example, Human Rights Watch released a report on June 10, 2024, that showed that a popular training dataset includes many photos of identifiable kids.

The history of big tech self-regulation is filled with moving goal posts. OpenAI in particular is notorious for asking for forgiveness rather than permission and has faced increasing criticism for putting profit over safety.

Concerns over the use of user-generated content for training AI models typically center on intellectual property, but there are also privacy issues. YouTube is a vast, unwieldy archive, impossible to fully review.


An AI company could conceivably build its first training corpus from a subset of professionally produced videos. But without strong policies in place, any company that ingests more than the popular tip of the iceberg is likely including content that violates the Federal Trade Commission's Children's Online Privacy Protection Rule, which prevents companies from collecting data from children under 13 without notifying their parents.

With last year's executive order on AI and at least one promising proposal on the table for comprehensive privacy legislation, there are signs that legal protections for user data in the U.S. might become more robust.

When the Wall Street Journal's Joanna Stern asked OpenAI CTO Mira Murati whether OpenAI trained its text-to-video generator Sora on YouTube videos, she said she wasn't sure.

Have you unwittingly helped train ChatGPT?

The intentions of a YouTube uploader simply aren't as consistent or predictable as those of someone publishing a book, writing an article for a magazine or displaying a painting in a gallery. But even if YouTube's algorithm ignores your upload and it never gets more than a couple of views, it may be used to train models like ChatGPT and Gemini.

As far as AI is concerned, your family reunion video may be just as important as those uploaded by influencer giant Mr. Beast or CNN.

Ryan McGrady, Senior Researcher, Initiative for Digital Public Infrastructure, UMass Amherst and Ethan Zuckerman, Associate Professor of Public Policy, Communication, and Information, UMass Amherst


This article is republished from The Conversation under a Creative Commons license. Read the original article.


Supreme Court kicks cases about tech companies’ First Amendment rights back to lower courts − but appears poised to block states from hampering online content moderation

theconversation.com – Lynn Greenky, Professor Emeritus of Communication and Rhetorical Studies, Syracuse University – 2024-07-01 15:26:42
How much power do social media companies have over what users post?
Midnight Studio/iStock/Getty Images Plus

Lynn Greenky, Syracuse University

The U.S. Supreme Court has sent back to lower courts the cases about whether states can block social media companies such as Facebook and X, formerly Twitter, from regulating and controlling what users can post on their platforms.

Laws in Florida and Texas sought to impose restrictions on the internal policies and algorithms of social media platforms in ways that influence which posts will be promoted and spread widely and which will be made less visible or even removed.

In the unanimous decision, issued on July 1, 2024, the high court remanded the two cases, Moody v. NetChoice and NetChoice v. Paxton, to the 11th and 5th U.S. Circuit Courts of Appeals, respectively. The court admonished the lower courts for their failure to consider the full force of the laws' applications. It also warned the lower courts to consider the boundaries imposed by the Constitution against interference with private speech.


Contrasting views of social media sites

In their arguments before the court in February 2024, the two sides described competing visions of how social media fits into the often overwhelming flood of information that defines modern digital society.

The states said the platforms were mere conduits of communication, or “speech hosts,” similar to legacy telephone companies that were required to carry all calls and prohibited from discriminating against users. The states said that the platforms should have to carry all posts from users without discrimination among them based on what they were saying.

The states argued that the content moderation rules the social media companies imposed were not examples of the platforms themselves speaking – or choosing not to speak. Rather, the states said, the rules affected the platforms' behavior and caused them to censor certain views by determining who is allowed to speak on which topics, conduct that falls outside First Amendment protections.

By contrast, the social media platforms, represented by NetChoice, a tech industry trade group, argued that the platforms' guidelines about what is acceptable on their sites are protected by the First Amendment's guarantee of speech free from government interference. The companies say their platforms are not public forums that may be subject to government regulation but rather private services that can exercise their own editorial judgment about what does or does not appear on their sites.


They argued that their policies were aspects of their own speech and that they should be free to develop and implement guidelines about what is acceptable speech on their platforms based on their own First Amendment rights.

Here's what the First Amendment says and what it means.

A reframe by the Supreme Court

All the litigants – NetChoice, Texas and Florida – framed the issue around the effect of the laws on the content moderation policies of the platforms, specifically whether the platforms were engaged in protected speech. The 11th U.S. Circuit Court of Appeals upheld a lower court preliminary injunction against the Florida law, holding that the content moderation policies of the platforms were speech and the law was unconstitutional.

The 5th U.S. Circuit Court of Appeals came to the opposite conclusion and held that the platforms were not engaged in speech, but rather that the platforms' algorithms controlled platform behavior unprotected by the First Amendment. The 5th Circuit determined the behavior was censorship and reversed a lower court injunction against the Texas law.

The Supreme Court, however, reframed the inquiry. The court noted that the lower courts failed to consider the full range of activities the laws covered. Thus, while a First Amendment inquiry was in order, the decisions of the lower courts and the arguments by the parties were incomplete. The court added that neither the parties nor the lower courts engaged in a thorough analysis of whether and how the states' laws affected other elements of the platforms' products, such as Facebook's direct messaging applications, or even whether the laws have any impact on email providers or online marketplaces.


The Supreme Court directed the lower courts to engage in a much more exacting analysis of the laws and their implications and provided some guidelines.

First Amendment principles

The court held that content moderation policies reflect the constitutionally protected editorial choices of the platforms, at least regarding what the court describes as “heartland applications” of the laws – such as Facebook's Feed and YouTube's homepage.

The Supreme Court required the lower courts to consider two core constitutional principles of the First Amendment. One is that the amendment protects speakers from being compelled to communicate messages they would prefer to exclude. Editorial discretion by entities, such as social media companies, that compile and curate the speech of others is a protected First Amendment activity.

The other principle is that the amendment precludes the government from controlling private speech, even for the purpose of balancing the marketplace of ideas. Neither state nor federal government may manipulate that marketplace for the purposes of presenting a more balanced array of viewpoints.


The court also affirmed that these principles apply to digital media in the same way they apply to traditional or legacy media.

In the 96-page opinion, Justice Elena Kagan wrote: “The First Amendment … does not go on leave when social media are involved.” For now, it appears the social media platforms will continue to control their content.

Lynn Greenky, Professor Emeritus of Communication and Rhetorical Studies, Syracuse University

This article is republished from The Conversation under a Creative Commons license. Read the original article.


Disability community has long wrestled with ‘helpful’ technologies – lessons for everyone in dealing with AI

theconversation.com – Elaine Short, Assistant Professor of Computer Science, Tufts University – 2024-07-01 07:19:34

A robotic arm helps a disabled person paint a picture.

Jenna Schad/Tufts University

Elaine Short, Tufts University

You might have heard that artificial intelligence is going to revolutionize everything, save the world and give everyone superhuman powers. Alternatively, you might have heard that it will take your job, make you lazy and stupid, and make the world a cyberpunk dystopia.


Consider another way to look at AI: as an assistive technology – something that helps you function.

With that view, also consider a community of experts in giving and receiving assistance: the disability community. Many disabled people use technology extensively, both dedicated assistive technologies such as wheelchairs and general-use technologies such as smart home devices.

Equally, many disabled people receive professional and casual assistance from other people. And, despite stereotypes to the contrary, many disabled people regularly give assistance to the disabled and nondisabled people around them.

Disabled people are well experienced in receiving and giving social and technical assistance, which makes them a valuable source of insight into how everyone might relate to AI in the future. This potential is a key driver for my work as a disabled person and researcher in AI and robotics.


Actively learning to live with help

While virtually everyone values independence, no one is fully independent. Each of us depends on others to grow our food, care for us when we are ill, give us advice and emotional support, and assist us in thousands of interconnected ways. Being disabled means having support needs that are outside what is typical, and therefore those needs are much more visible. Because of this, the disability community has reckoned more explicitly than most nondisabled people with what it means to need help.

This disability community perspective can be invaluable in approaching new technologies that can assist both disabled and nondisabled people. You can't substitute pretending to be disabled for the experience of actually being disabled, but accessibility can benefit everyone.

The curb-cut effect – how technologies built for disabled people help everyone – has become a principle of good design.

This is sometimes called the curb-cut effect after the way that putting a ramp in a curb to help a wheelchair user access the sidewalk also helps people with strollers, rolling suitcases and bicycles.

Partnering in assistance

You have probably had the experience of someone trying to help you without listening to what you actually need. For example, a parent or friend might “help” you clean and instead end up hiding everything you need.


Disability advocates have long battled this type of well-meaning but intrusive assistance – for example, by putting spikes on wheelchair handles to keep people from pushing a person in a wheelchair without being asked to or advocating for services that keep the disabled person in control.

The disabled community instead offers a model of assistance as a collaborative effort. Applying this to AI can help to ensure that new AI tools support human autonomy rather than taking over.

A key goal of my lab's work is to develop AI-powered assistive robotics that treat the user as an equal partner. We have shown that this model is not just valuable, but inevitable. For example, most people find it difficult to use a joystick to move a robot arm: The joystick can only move from front to back and side to side, but the arm can move in almost as many ways as a human arm.

The author discusses her work on robots that are designed to help people.

To help, AI can predict what someone is planning to do with the robot and then move the robot accordingly. Previous research assumed that people would ignore this help, but we found that people quickly figured out that the system was doing something, actively worked to understand what it was doing and tried to work with the system to get it to do what they wanted.


Most AI systems don't make this easy, but my lab's new approaches to AI empower people to influence robot behavior. We have shown that this results in better interactions in tasks that are creative, like painting. We also have begun to investigate how people can use this control to solve problems outside the ones the robots were designed for. For example, people can use a robot that is trained to carry a cup of water to instead pour the water out to water their plants.

Training AI on human variability

The disability-centered perspective also raises concerns about the huge datasets that power AI. The very nature of data-driven AI is to look for common patterns. In general, the better-represented something is in the data, the better the model works.

If disability means having a body or mind outside what is typical, then disability means not being well-represented in the data. Whether it's AI systems designed to detect cheating on exams instead detecting students' disabilities or robots that fail to account for wheelchair users, disabled people's interactions with AI reveal how those systems are brittle.

One of my goals as an AI researcher is to make AI more responsive and adaptable to real human variation, especially in AI systems that learn directly from interacting with people. We have developed frameworks for testing how robust those AI systems are to real human teaching and explored how robots can learn better from human teachers even when those teachers change over time.


Thinking of AI as an assistive technology, and learning from the disability community, can help to ensure that the AI systems of the future serve people's needs – with people in the driver's seat.

Elaine Short, Assistant Professor of Computer Science, Tufts University

This article is republished from The Conversation under a Creative Commons license. Read the original article.


How was popcorn discovered? An archaeologist on its likely appeal for people in the Americas millennia ago

theconversation.com – Sean Rafferty, Professor of Anthropology, University at Albany, State University of New York – 2024-07-01 07:19:19

Could a spill by the cook fire have been popcorn's eureka moment?

Paul Taylor/Stone via Getty Images

Sean Rafferty, University at Albany, State University of New York

Curious Kids is a series for children of all ages. If you have a question you'd like an expert to answer, send it to curiouskidsus@theconversation.com.


How was popcorn discovered? – Kendra, age 11, Penn Yan, New York


You have to wonder how people originally figured out how to eat some foods that are beloved today. The cassava plant is toxic if not carefully processed through multiple steps. Yogurt is basically old milk that's been around for a while and contaminated with bacteria. And who discovered that popcorn could be a toasty, tasty treat?

These kinds of food mysteries are pretty hard to solve. Archaeology depends on solid remains to figure out what happened in the past, especially for people who didn't use any sort of writing. Unfortunately, most stuff people traditionally used that was made from wood, animal materials or cloth decays pretty quickly, and archaeologists like me never find it.

We have lots of evidence of hard stuff, such as pottery and stone tools, but softer things – such as leftovers from a meal – are much harder to find. Sometimes we get lucky, if softer stuff is found in very dry places that preserve it. Also, if stuff gets burned, it can last a very long time.

Corn's ancestors

Luckily, corn – also called maize – has some hard parts, such as the kernel shell. They're the bits at the bottom of the popcorn bowl that get caught in your teeth. And since you have to heat maize to make it edible, sometimes it got burned, and archaeologists find evidence that way. Most interesting of all, some plants, including maize, contain tiny, rock-like fragments called phytoliths that can last for thousands of years.



The ancestor of maize was a grass called teosinte.

vainillaychile/iStock via Getty Images Plus

Scientists are pretty sure they know how old maize is. We know maize was probably first farmed by Native Americans in what is now Mexico. Early farmers there domesticated maize from a kind of grass called teosinte.

Before farming, people would gather wild teosinte and eat the seeds, which contained a lot of starch, a carbohydrate like you'd find in bread or pasta. They would pick teosinte with the largest seeds and eventually started weeding and planting it. Over time, the wild plant developed into something like what we call maize today. You can tell maize from teosinte by its larger kernels.

There's evidence of maize farming from dry caves in Mexico as early as 9,000 years ago. From there, maize farming spread throughout North and South America.


Popped corn, preserved food

Figuring out when people started making popcorn is harder. There are several types of maize, most of which will pop if heated, but one variety, actually called “popcorn,” makes the best popcorn. Scientists have discovered phytoliths, as well as burned kernels, of this type of “poppable” maize in Peru dating from as early as 6,700 years ago.


Each popcorn kernel is a seed, ready to burst when heated.

Rick Madonik/Toronto Star via Getty Images

You can imagine that popping maize kernels was first discovered by accident. Some maize probably fell into a cooking fire, and whoever was nearby figured out that this was a handy new way of preparing the food. Popped maize would last a long time and was easy to make.

Ancient popcorn was probably not much like the snack you might munch at the theater today. There was probably no salt and definitely no butter, since there were no cows to milk in the Americas yet. It probably wasn't served hot and was likely pretty chewy compared with the version you're used to today.


It's impossible to know exactly why or how popcorn was invented, but I would guess it was a clever way to preserve the edible starch in corn by getting rid of the little bit of water inside each kernel that would make it more susceptible to spoiling. It's the heated water in the kernel escaping as steam that makes popcorn pop. The popped corn could then last a long time. What you may consider a tasty snack today probably started as a useful way of preserving and storing food.


Hello, curious kids! Do you have a question you'd like an expert to answer? Ask an adult to send your question to CuriousKidsUS@theconversation.com. Please tell us your name, age and the city where you live.

And since curiosity has no age limit – adults, let us know what you're wondering, too. We won't be able to answer every question, but we will do our best.

Sean Rafferty, Professor of Anthropology, University at Albany, State University of New York

This article is republished from The Conversation under a Creative Commons license. Read the original article.
