Daniel van Strien, a machine learning librarian at Hugging Face, has released a dataset containing one million public posts from Bluesky Social. The dataset, sourced from Bluesky’s Firehose API, includes text, metadata, media attachments, and reply relationships, offering a comprehensive snapshot of user activity on the decentralised platform.

Van Strien announced the dataset on Bluesky, describing it as a tool for machine learning research and experimentation with social media data. However, the dataset is not anonymised: it pairs each post with its author's decentralised identifier (DID), 404 Media reports.

A search tool for locating users based on their DIDs has also been made available on Hugging Face, adding to the dataset’s utility for research.
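
For readers unfamiliar with DIDs, the general syntax is `did:<method>:<identifier>`, and Bluesky accounts commonly use the `plc` method. The snippet below is a minimal sketch of how such a string could be split into its parts. The `parse_did` helper and the example identifier are hypothetical, made up for illustration, and are not taken from the dataset or its tooling.

```python
from typing import NamedTuple


class ParsedDid(NamedTuple):
    method: str       # e.g. "plc" or "web"
    identifier: str   # method-specific part of the DID


def parse_did(did: str) -> ParsedDid:
    """Split a DID of the form 'did:<method>:<identifier>' into its parts."""
    scheme, _, rest = did.partition(":")
    method, _, identifier = rest.partition(":")
    # Reject anything that lacks the 'did' scheme, a method, or an identifier.
    if scheme != "did" or not method or not identifier:
        raise ValueError(f"not a valid DID: {did!r}")
    return ParsedDid(method, identifier)


# Hypothetical identifier, invented for this example only.
example = parse_did("did:plc:abc123exampleonly")
print(example.method)      # plc
print(example.identifier)  # abc123exampleonly
```

Because the identifier is stable and unique to an account, any dataset that keeps it alongside post text can in principle be linked back to the author, which is the concern raised above.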

The dataset provides a glimpse into the diversity of content shared on Bluesky. Posts range from casual conversations about concerts and politics to humorous one-liners. However, it also includes a significant amount of adult content, highlighting the unfiltered nature of the data collection.

According to the project page, the dataset has broad potential applications, including training and testing language models, studying social media interaction patterns, analysing content moderation, and exploring natural language processing tasks.

However, van Strien explicitly outlined prohibited uses, such as creating automated posting systems, generating fake content, or violating Bluesky’s terms of service.

The Firehose API is built on the open AT Protocol. It streams all public user data in real time, making it accessible to developers. Tools such as Firesky and various visualisers have used this API to monitor platform activity and create new services.

However, this openness comes with challenges. While Bluesky has pledged not to use user content to train generative AI models, the public nature of its data means external entities can do so without explicit user consent.

Bluesky's pledge contrasts sharply with platforms like X and Meta, which have incorporated user data into their generative AI training efforts. For instance, X recently updated its terms of service to explicitly allow user-generated content to be used for AI model training.

Bluesky has acknowledged the concerns of artists and creators who fear their content might be repurposed for AI training. Emily Liu, a spokesperson for Bluesky, reiterated the platform’s commitment to finding ways for users to communicate consent to third-party developers.
