The Conversation

AI companies train language models on YouTube’s archive − making family-and-friends videos a privacy risk

theconversation.com – Ryan McGrady, Senior Researcher, Initiative for Digital Public Infrastructure, UMass Amherst – 2024-06-27 07:23:53
Your kid's silly video could be fodder for ChatGPT.
Halfpoint/iStock via Getty Images

Ryan McGrady, UMass Amherst and Ethan Zuckerman, UMass Amherst

The promised artificial intelligence revolution requires data. Lots and lots of data. OpenAI and Google have begun using YouTube to train their text-based AI models. But what does the YouTube archive actually include?

Our team of digital media researchers at the University of Massachusetts Amherst collected and analyzed random samples of YouTube videos to learn more about that archive. We published an 85-page paper about that dataset and set up a website called TubeStats for researchers and journalists who need basic information about YouTube.

Now, we're taking a closer look at some of our more surprising findings to better understand how these obscure videos might become part of powerful AI systems. We've found that many YouTube videos are meant for personal use or for small groups of people, and a significant proportion were created by children who appear to be under 13.

Bulk of the YouTube iceberg

Most people's experience of YouTube is algorithmically curated: Up to 70% of the videos users watch are recommended by the site's algorithms. Recommended videos are typically popular content such as influencer stunts, news clips, explainer videos, travel vlogs and video reviews, while content that is not recommended languishes in obscurity.

Some YouTube content emulates popular creators or fits into established genres, but much of it is personal: family celebrations, selfies set to music, homework assignments, video game clips without context and kids dancing. The obscure side of YouTube – the vast majority of the estimated 14.8 billion videos created and uploaded to the platform – is poorly understood.

Illuminating this aspect of YouTube – and social media generally – is difficult because big tech companies have become increasingly hostile to researchers.

We've found that many videos on YouTube were never meant to be shared widely. We documented thousands of short, personal videos that have few views but high engagement – likes and comments – implying a small but highly engaged audience. These were clearly meant for a small audience of friends and family. Such social uses of YouTube contrast with videos that try to maximize their audience, suggesting another way to use YouTube: as a video-centered social network for small groups.

Other videos seem intended for a different kind of small, fixed audience: recorded classes from pandemic-era virtual instruction, school board meetings and work meetings. While not what most people think of as social uses, they likewise imply that their creators have a different expectation about the audience for the videos than creators of the kind of content people see in their recommendations.

Fuel for the AI machine

It was with this broader understanding that we read The New York Times exposé on how OpenAI and Google turned to YouTube in a race to find new troves of data to train their large language models. An archive of YouTube transcripts makes an extraordinary dataset for text-based models.

There is also speculation, fueled in part by an evasive answer from OpenAI's chief technology officer Mira Murati, that the videos themselves could be used to train AI text-to-video models such as OpenAI's Sora.

The New York Times story raised concerns about YouTube's terms of service and, of course, the copyright issues that pervade much of the debate about AI. But there's another problem: How could anyone know what an archive of more than 14 billion videos, uploaded by people all over the world, actually contains? It's not entirely clear that Google knows or even could know if it wanted to.

Kids as content creators

We were surprised to find an unsettling number of videos featuring kids or apparently created by them. YouTube requires uploaders to be at least 13 years old, but we frequently saw children who appeared to be much younger than that, typically dancing, singing or playing video games.

In our preliminary research, our coders determined nearly a fifth of random videos with at least one person's face visible likely included someone under 13. We didn't take into account videos that were clearly shot with the consent of a parent or guardian.

Our current sample size of 250 is relatively small – we are working on coding a much larger sample – but the findings thus far are consistent with what we've seen in the past. We're not aiming to scold Google. Age validation on the internet is infamously difficult and fraught, and we have no way of determining whether these videos were uploaded with the consent of a parent or guardian. But we want to underscore what is being ingested by these large companies' AI models.

Small reach, big influence

It's tempting to assume OpenAI is using highly produced influencer videos or TV newscasts posted to the platform to train its models, but previous research on large language model training data shows that the most popular content is not always the most influential in training AI models. A virtually unwatched conversation between three friends could have much more linguistic value in training a chatbot language model than a music video with millions of views.

Unfortunately, OpenAI and other AI companies are quite opaque about their training materials: They don't specify what goes in and what doesn't. Most of the time, researchers can infer problems with training data through biases in AI systems' output. But when we do get a glimpse at training data, there's often cause for concern. For example, Human Rights Watch released a report on June 10, 2024, that showed that a popular training dataset includes many photos of identifiable kids.

The history of big tech self-regulation is filled with moving goal posts. OpenAI in particular is notorious for asking for forgiveness rather than permission and has faced increasing criticism for putting profit over safety.

Concerns over the use of user-generated content for training AI models typically center on intellectual property, but there are also privacy issues. YouTube is a vast, unwieldy archive, impossible to fully review.

Models trained on a subset of professionally produced videos could conceivably be an AI company's first training corpus. But without strong policies in place, any company that ingests more than the popular tip of the iceberg is likely ingesting content that violates the Federal Trade Commission's Children's Online Privacy Protection Rule, which prevents companies from collecting data from children under 13 without notice.

With last year's executive order on AI and at least one promising proposal on the table for comprehensive privacy legislation, there are signs that legal protections for user data in the U.S. might become more robust.

When the Wall Street Journal's Joanna Stern asked OpenAI CTO Mira Murati whether OpenAI trained its text-to-video generator Sora on YouTube videos, she said she wasn't sure.

Have you unwittingly helped train ChatGPT?

The intentions of a YouTube uploader simply aren't as consistent or predictable as those of someone publishing a book, writing an article for a magazine or displaying a painting in a gallery. But even if YouTube's algorithm ignores your upload and it never gets more than a couple of views, it may be used to train models like ChatGPT and Gemini.

As far as AI is concerned, your family reunion video may be just as important as those uploaded by influencer giant Mr. Beast or CNN.

Ryan McGrady, Senior Researcher, Initiative for Digital Public Infrastructure, UMass Amherst and Ethan Zuckerman, Associate Professor of Public Policy, Communication, and Information, UMass Amherst

This article is republished from The Conversation under a Creative Commons license. Read the original article.

The post AI companies train language models on YouTube's archive − making family-and-friends videos a privacy risk appeared first on theconversation.com

The Conversation

Even short trips to space can change an astronaut’s biology − a new set of studies offers the most comprehensive look at spaceflight health since NASA’s Twins Study

theconversation.com – Susan Bailey, Professor of Radiation Cancer Biology and Oncology, Colorado State University – 2024-07-03 07:22:56
Crew members from the Inspiration4 mission. New research looks at the biological effects of their short trip to space.
SpaceX, CC BY-NC

Susan Bailey, Colorado State University

Only about 600 people have ever traveled to space. The vast majority of astronauts over the past six decades have been middle-aged men on short-duration missions of fewer than 20 days.

Today, with private, commercial and multinational spaceflight providers and flyers entering the market, we are witnessing a new era of human spaceflight. Missions have ranged from minutes, hours and days to months.

As humanity looks ahead to returning to the Moon over the coming decade, space exploration missions will be much longer, with many more space travelers and even space tourists. This also means that a wider diversity of people will experience the extreme environment of space – more women and people of different ethnicities, ages and health status.

Since people respond differently to the unique stressors and exposures of space, researchers in space health, like me, seek to better understand the human health effects of spaceflight. With such information, we can figure out how to help astronauts stay healthy both while they're in space and once they return to Earth.

As part of the historic NASA Twins Study, in 2019, my colleagues and I published groundbreaking research on how one year on board the International Space Station affects the human body.

I am a radiation cancer biologist in Colorado State University's Department of Environmental and Radiological Health Sciences. I've spent the past few years continuing to build on that earlier research in a series of papers recently published across the portfolio of Nature journals.

These papers are part of the Space Omics and Medical Atlas package of manuscripts, data, protocols and repositories that represent the largest collection ever assembled for aerospace medicine and space biology. Over 100 institutions from 25 countries contributed to the coordinated release of a wide range of spaceflight data.

The NASA Twins Study

NASA's Twins Study seized on a unique research opportunity.

NASA selected astronaut Scott Kelly for the agency's first one-year mission, during which he spent a year on board the International Space Station from 2015 into 2016. Over the same time period, his identical twin brother, Mark Kelly, a former astronaut and current U.S. senator representing Arizona, remained on Earth.

Two identical men wearing blue jumpsuits stand next to each other.
NASA astronaut Scott Kelly, left, who went into space during the NASA Twins Study, stands next to his twin brother, Mark Kelly, who stayed on Earth.
AP Photo/Pat Sullivan

My team and I examined blood samples collected from the twin in space and his genetically matched twin back on Earth before, during and after spaceflight. We found that Scott's telomeres – the protective caps at the ends of chromosomes, much like the plastic tip that keeps a shoelace from fraying – lengthened, quite unexpectedly, during his year in space.

When Scott returned to Earth, however, his telomeres quickly shortened. Over the following months, his telomeres recovered but were still shorter after his journey than they had been before he went to space.

As you get older, your telomeres shorten because of a variety of factors, including stress. The length of your telomeres can serve as a biological indicator of your risk for developing age-related conditions such as dementia, cardiovascular disease and cancer.

In a separate study, my team studied a cohort of 10 astronauts on six-month missions on board the International Space Station. We also had a control group of age- and sex-matched participants who stayed on the ground.

We measured telomere length before, during and after spaceflight and again found that telomeres were longer during spaceflight and then shortened upon return to Earth. Overall, the astronauts had many more short telomeres after spaceflight than they had before.

One of the other Twins Study investigators, Christopher Mason, and I conducted another telomere study – this time with twin high-altitude mountain climbers – a somewhat similar extreme environment on Earth.

We found that while climbing Mount Everest, the climbers' telomeres were longer, and after they descended, their telomeres shortened. Their twins who remained at low altitude didn't experience the same changes in telomere length. These results indicate that it's not the space station's microgravity that led to the telomere length changes we observed in the astronauts – other culprits, such as increased radiation exposure, are more likely.

Civilians in space

In our latest study, we studied telomeres from the crew on board SpaceX's 2021 Inspiration4 mission. This mission had the first all-civilian crew, whose ages spanned four decades. All of the crew members' telomeres lengthened during the mission, and three of the four astronauts also exhibited telomere shortening once they were back on Earth.

Four people wearing black jumpsuits wave their hands in the air.
The crew members from SpaceX's 2021 Inspiration4 mission.
SpaceX, CC BY-NC

What's particularly interesting about these findings is that the Inspiration4 mission lasted only three days. So, not only do scientists now have consistent and reproducible data on telomeres' response to spaceflight, but we also know it happens quickly. These results suggest that even short trips, like a weekend getaway to space, will be associated with changes in telomere length.

Scientists still don't totally understand the health impacts of such changes in telomere length. We'll need more research to figure out how both long and short telomeres might affect an astronaut's long-term health.

Telomeric RNA

In another paper, we showed that the Inspiration4 crew – as well as Scott Kelly and the high-altitude mountain climbers – exhibited increased levels of telomeric RNA, termed TERRA.

Telomeres consist of lots of repetitive DNA sequences. These are transcribed into TERRA, which contributes to telomere structure and helps them do their job.

Together with laboratory studies, these findings tell us that telomeres are being damaged during spaceflight. While there is still a lot we don't know, we do know that telomeres are especially sensitive to oxidative stress. So, the chronic oxidative stress that astronauts experience when exposed to space radiation around the clock likely contributes to the telomeric responses we observe.

We also wrote a review article with a more futuristic perspective of how better understanding telomeres and aging might begin to inform the ability of humans to not only survive long-duration space travel but also to thrive and even colonize other planets. Doing so would require humans to reproduce in space and future generations to grow up in space. We don't know if that's even possible – yet.

Plant telomeres in space

My colleagues and I contributed other work to the Space Omics and Medical Atlas package, as well, including a paper published in Nature Communications. The study team, led by Texas A&M biologist Dorothy Shippen and Ohio University biologist Sarah Wyatt, found that, unlike people, plants flown in space did not have longer telomeres during their time on board the International Space Station.

The plants did, however, ramp up their production of telomerase, the enzyme that helps maintain telomere length.

As anyone who's seen “The Martian” knows, plants will play an essential role in long-term human survival in space. This finding suggests that plants are perhaps more naturally suited to withstand the stressors of space than humans.

Susan Bailey, Professor of Radiation Cancer Biology and Oncology, Colorado State University

This article is republished from The Conversation under a Creative Commons license. Read the original article.

The post Even short trips to space can change an astronaut's biology − a new set of studies offers the most comprehensive look at spaceflight health since NASA's Twins Study appeared first on theconversation.com

The Conversation

From diagnosing brain disorders to cognitive enhancement, 100 years of EEG have transformed neuroscience

theconversation.com – Erika Nyhus, Associate Professor of Psychology and Neuroscience, Bowdoin College – 2024-07-02 07:28:40
The electroencephalogram allows scientists to record and read brain activity.
Kateryna Kon/Science Photo Library via Getty Images

Erika Nyhus, Bowdoin College

Electroencephalography, or EEG, was invented 100 years ago. In the years since the invention of this device to monitor brain electricity, it has had an incredible impact on how scientists study the human brain.

Since its first use, the EEG has shaped researchers' understanding of cognition, from perception to memory. It has also been important for diagnosing and guiding treatment of multiple brain disorders, including epilepsy.

I am a cognitive neuroscientist who uses EEG to study how people remember events from their past. The EEG's 100-year anniversary is an opportunity to reflect on this discovery's significance in neuroscience and medicine.

Discovery of EEG

On July 6, 1924, psychiatrist Hans Berger performed the first EEG recording on a human, a 17-year-old boy undergoing neurosurgery. At the time, Berger and other researchers were performing electrical recordings on the brains of animals.

What set Berger apart was his obsession with finding the physical basis of what he called psychic energy, or mental effort, in people. Through a series of experiments spanning his early career, Berger measured brain volume and temperature to study changes in mental processes such as intellectual work, attention and desire.

He then turned to recording electrical activity. Though he recorded the first traces of EEG in the human brain in 1924, he did not publish the results until 1929. Those five intervening years were a tortuous phase of self-doubt about the source of the EEG signal in the brain and refining the experimental setup. Berger recorded hundreds of EEGs on multiple subjects, including his own son, with both experimental successes and setbacks.

This is among the first EEG readings published in Hans Berger's study. The top trace is the EEG while the bottom is a reference trace of 10 Hz.
Two EEG traces, the top more irregular in rhythm than the bottom.
Hans Berger/Über das Elektrenkephalogramm des Menschen. Archiv für Psychiatrie. 1929; 87:527-70 via Wikimedia Commons

Finally convinced of his results, he published a series of papers in the journal Archiv für Psychiatrie and had hopes of winning a Nobel Prize. Unfortunately, the research community doubted his results, and years passed before anyone else started using EEG in their own research.

Berger was eventually nominated for a Nobel Prize in 1940. But Nobels were not awarded that year in any category due to World War II and Germany's occupation of Norway.

Neural oscillations

When many neurons are active at the same time, they produce an electrical signal strong enough to spread instantaneously through the conductive tissue of the brain, skull and scalp. EEG electrodes placed on the head can record these electrical signals.

Since the discovery of EEG, researchers have shown that neural activity oscillates at specific frequencies. In his initial EEG recordings in 1924, Berger noted the predominance of oscillatory activity that cycled eight to 12 times per second, or 8 to 12 hertz, named alpha oscillations. Since the discovery of alpha rhythms, there have been many attempts to understand how and why neurons oscillate.

Neural oscillations are thought to be important for effective communication between specialized brain regions. For example, theta oscillations that cycle at 4 to 8 hertz are important for communication between brain regions involved in memory encoding and retrieval in animals and humans.
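
The two frequency bands named so far can be written down as a small lookup. The sketch below is only illustrative: the function name `classify_band` and the choice to assign the shared 8-hertz boundary to alpha are assumptions, not conventions taken from the research described here.

```python
# Frequency ranges as stated in the article, in hertz.
# Assumption: each band includes its lower edge and excludes its upper edge,
# so the shared 8 Hz boundary is counted as alpha.
BANDS = {
    "theta": (4.0, 8.0),
    "alpha": (8.0, 12.0),
}


def classify_band(freq_hz):
    """Return the name of the listed band containing freq_hz, or None."""
    for name, (low, high) in BANDS.items():
        if low <= freq_hz < high:
            return name
    # Hypothetical convention: the top edge of alpha (12 Hz) still counts as alpha.
    if freq_hz == BANDS["alpha"][1]:
        return "alpha"
    return None


print(classify_band(5))   # theta
print(classify_band(10))  # alpha
```

Frequencies outside the two listed ranges return `None` rather than a guess, since the article gives ranges only for theta and alpha.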

Finger pointing at EEG reading
Different frequencies of neural oscillations indicate different types of brain activity.
iStock via Getty Images Plus

Researchers then examined whether they could alter neural oscillations and therefore affect how neurons talk to each other. Studies have shown that many behavioral and noninvasive methods can alter neural oscillations and lead to changes in cognitive performance. Engaging in specific mental activities can induce neural oscillations in the frequencies those mental activities use. For example, my team's research found that mindfulness meditation can increase theta frequency oscillations and improve memory retrieval.

Noninvasive brain stimulation methods can target frequencies of interest. For example, my team's ongoing research found that brain stimulation at theta frequency can lead to improved memory retrieval.

EEG has also led to major discoveries about how the brain processes information in many other cognitive domains, including how people perceive the world around them, how they focus their attention, how they communicate through language and how they process emotions.

Diagnosing and treating brain disorders

EEG is commonly used to diagnose sleep disorders and epilepsy and to guide brain disorder treatments.

Scientists are using EEG to see whether memory can be improved with noninvasive brain stimulation. Although the research is still in its infancy, there have been some promising results. For example, one study found that noninvasive brain stimulation at gamma frequency – 25 hertz – improved memory and neurotransmitter transmission in Alzheimer's disease.

Back of person's head enveloped by the many, small round electrodes of an EEG cap
Researchers and clinicians use EEG to diagnose conditions like epilepsy.
BSIP/Collection Mix: Subjects via Getty Images

A new type of noninvasive brain stimulation called temporal interference uses two high frequencies to cause neural activity equal to the difference between the stimulation frequencies. The high frequencies can better penetrate the brain and reach the targeted area. Researchers recently tested this method in people using 2,000 hertz and 2,005 hertz to send 5 hertz theta frequency at a key brain region for memory, the hippocampus. This led to improvements in remembering the name associated with a face.
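
The arithmetic behind temporal interference is simple enough to check by hand. The sketch below is a toy illustration, not the stimulation protocol used in the study: it assumes two ideal sine waves of equal amplitude, and the function names are hypothetical. Adding two such waves gives a fast carrier whose strength rises and falls at the difference between the two frequencies.

```python
import math


def difference_frequency(f1_hz, f2_hz):
    """Envelope (beat) frequency of two summed sine waves, in hertz."""
    return abs(f1_hz - f2_hz)


def summed_wave(f1_hz, f2_hz, t):
    """Sum of two unit-amplitude sine waves at time t (seconds)."""
    return math.sin(2 * math.pi * f1_hz * t) + math.sin(2 * math.pi * f2_hz * t)


def carrier_times_envelope(f1_hz, f2_hz, t):
    """Same signal via the identity sin a + sin b = 2 sin((a+b)/2) cos((a-b)/2)."""
    carrier = math.sin(math.pi * (f1_hz + f2_hz) * t)
    envelope = 2 * math.cos(math.pi * (f1_hz - f2_hz) * t)
    return carrier * envelope


print(difference_frequency(2000, 2005))  # 5
```

With the 2,000 hertz and 2,005 hertz inputs mentioned above, the slow envelope swells and fades five times per second, which falls in the theta range.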

Although these results are promising, more research is needed to understand the exact role neural oscillations play in cognition and whether altering them can lead to long-lasting cognitive enhancement.

The future of EEG

The 100-year anniversary of the EEG provides an opportunity to consider what it has taught us about brain function and what this technique can do in the future.

In a survey commissioned by the journal Nature Human Behaviour, over 500 researchers who use EEG in their work were asked to make predictions on the future of the technique. What will be possible in the next 100 years of EEG?

Some researchers, including myself, predict that we'll use EEG to diagnose and create targeted treatments for brain disorders. Others anticipate that an affordable, wearable EEG will be widely used to enhance cognitive function at home or will be seamlessly integrated into virtual reality applications. The possibilities are vast.

Erika Nyhus, Associate Professor of Psychology and Neuroscience, Bowdoin College

This article is republished from The Conversation under a Creative Commons license. Read the original article.

The post From diagnosing brain disorders to cognitive enhancement, 100 years of EEG have transformed neuroscience appeared first on theconversation.com

The Conversation

Supreme Court kicks cases about tech companies’ First Amendment rights back to lower courts − but appears poised to block states from hampering online content moderation

theconversation.com – Lynn Greenky, Professor Emeritus of Communication and Rhetorical Studies, Syracuse University – 2024-07-01 15:26:42
How much power do social media companies have over what users post?
Midnight Studio/iStock/Getty Images Plus

Lynn Greenky, Syracuse University

The U.S. Supreme Court has sent back to lower courts the decision about whether states can block social media companies such as Facebook and X, formerly Twitter, from regulating and controlling what users can post on their platforms.

Laws in Florida and Texas sought to impose restrictions on the internal policies and algorithms of social media platforms in ways that influence which posts will be promoted and spread widely and which will be made less visible or even removed.

In the unanimous decision, issued on July 1, 2024, the high court remanded the two cases, Moody v. NetChoice and NetChoice v. Paxton, to the 11th and 5th U.S. Circuit Courts of Appeals, respectively. The court admonished the lower courts for their failure to consider the full force of the laws' applications. It also warned the lower courts to consider the boundaries imposed by the Constitution against interference with private speech.

Contrasting views of social media sites

In their arguments before the court in February 2024, the two sides described competing visions of how social media fits into the often overwhelming flood of information that defines modern digital society.

The states said the platforms were mere conduits of communication, or “speech hosts,” similar to legacy telephone companies that were required to carry all calls and prohibited from discriminating against users. The states said that the platforms should have to carry all posts from users without discrimination among them based on what they were saying.

The states argued that the content moderation rules the social media companies imposed were not examples of the platforms themselves speaking – or choosing not to speak. Rather, the states said, the rules affected the platforms' behavior and caused them to censor certain views by allowing them to determine whom to allow to speak on which topics, which is outside First Amendment protections.

By contrast, the social media platforms, represented by NetChoice, a tech industry trade group, argued that the platforms' guidelines about what is acceptable on their sites are protected by the First Amendment's guarantee of speech free from government interference. The companies say their platforms are not public forums that may be subject to government regulation but rather private services that can exercise their own editorial judgment about what does or does not appear on their sites.

They argued that their policies were aspects of their own speech and that they should be allowed to develop and implement guidelines about what is acceptable speech on their platforms based on their own First Amendment rights.

Here's what the First Amendment says and what it means.

A reframe by the Supreme Court

All the litigants – NetChoice, Texas and Florida – framed the issue around the effect of the laws on the content moderation policies of the platforms, specifically whether the platforms were engaged in protected speech. The 11th U.S. Circuit Court of Appeals upheld a lower court preliminary injunction against the Florida law, holding the content moderation policies of the platforms were speech and the law was unconstitutional.

The 5th U.S. Circuit Court of Appeals came to the opposite conclusion and held that the platforms were not engaged in speech, but rather the platforms' algorithms controlled platform behavior unprotected by the First Amendment. The 5th Circuit determined the behavior was censorship and reversed a lower court injunction against the Texas law.

The Supreme Court, however, reframed the inquiry. The court noted that the lower courts failed to consider the full range of activities the laws covered. Thus, while a First Amendment inquiry was in order, the decisions of the lower courts and the arguments by the parties were incomplete. The court added that neither the parties nor the lower courts engaged in a thorough analysis of whether and how the states' laws affected other elements of the platforms' products, such as Facebook's direct messaging applications, or even whether the laws have any impact on email providers or online marketplaces.

The Supreme Court directed the lower courts to engage in a much more exacting analysis of the laws and their implications and provided some guidelines.

First Amendment principles

The court held that content moderation policies reflect the constitutionally protected editorial choices of the platforms, at least regarding what the court describes as “heartland applications” of the laws – such as Facebook's News Feed and YouTube's homepage.

The Supreme Court required the lower courts to consider two core constitutional principles of the First Amendment. One is that the amendment protects speakers from being compelled to communicate messages they would prefer to exclude. Editorial discretion by entities, including social media companies, that compile and curate the speech of others is a protected First Amendment activity.

The other principle holds that the amendment precludes the government from controlling private speech, even for the purpose of balancing the marketplace of ideas. Neither state nor federal government may manipulate that marketplace for the purposes of presenting a more balanced array of viewpoints.

The court also affirmed that these principles apply to digital media in the same way they apply to traditional or legacy media.

In the 96-page opinion, Justice Elena Kagan wrote: “The First Amendment … does not go on leave when social media are involved.” For now, it appears the social media platforms will continue to control their content.

Lynn Greenky, Professor Emeritus of Communication and Rhetorical Studies, Syracuse University

This article is republished from The Conversation under a Creative Commons license. Read the original article.

The post Supreme Court kicks cases about tech companies' First Amendment rights back to lower courts − but appears poised to block states from hampering online content moderation appeared first on theconversation.com
