Chinese news dataset

Jun 7, 2025 · For the Chinese language, we obtain an F1-micro (the performance metric for SemEval task 3, subtask 2) score of 0.719 using only samples from our Chinese News Framing dataset and a score of 0.753 when we augment the SemEval dataset with Chinese news framing samples. The data have an exclusive focus on China and were collected from surveys, Chinese administrations, major data ... May 13, 2024 · To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Oct 20, 2021 · In this paper, we present a large-scale Chinese news summarization dataset CNewSum, which consists of 304,307 documents and human-written summaries for the news feed. Different from previous summarization datasets crawled from news websites, we called for news articles from hundreds of thousands of press ... The South China Sea Data Initiative is [1] creating a new, systematic dataset documenting conflict in the South China Sea over the past decade, using a mix of qualitative and quantitative methods, and [2] collecting new public opinion data via surveys from seven countries around the South China Sea. The LTCR dataset provides a valuable resource for accurately detecting misinformation ... Recently, the field of natural language processing (NLP) has grown rapidly, driven by massive datasets.
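The F1-micro metric mentioned above pools true positives, false positives, and false negatives over all labels before computing precision and recall. The following is a minimal sketch of that calculation for multi-label framing predictions, assuming each article's labels are represented as a set of strings. The function name `micro_f1` and the label names are hypothetical and are not taken from the SemEval task or the Chinese News Framing dataset.

```python
from typing import Iterable, Set


def micro_f1(gold: Iterable[Set[str]], pred: Iterable[Set[str]]) -> float:
    """Micro-averaged F1 over multi-label predictions (one label set per article)."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # labels predicted and present in the gold set
        fp += len(p - g)   # labels predicted but absent from the gold set
        fn += len(g - p)   # gold labels that were not predicted
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical frame labels for three articles.
gold = [{"Economic", "Political"}, {"Health_and_safety"}, {"Political"}]
pred = [{"Economic"}, {"Health_and_safety", "Political"}, {"Political"}]
score = micro_f1(gold, pred)
print(round(score, 3))
```

Because counts are pooled before averaging, frequent labels weigh more heavily than rare ones, which is the usual reason to report a macro-averaged score alongside the micro-averaged one.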
Repository for Chinese News Framing Dataset to be integrated with the SemEval dataset, or used as a standalone dataset, potentially useful for classification methods utilising disagreement at the annotator or annotation level. WordNews English-Chinese Cross-Lingual Word Sense Disambiguation dataset: this dataset allows evaluation of WSD systems on sentences from news articles written in 2015. The articles are labeled with multi-level topic categories, and some of them also have summaries. Our researchers are constructing new datasets that can be used for social scientific research on China. The dataset is introduced in "NEWSFARM: A Large-scale Chinese Corpus of Long News Summarization". Jun 27, 2024 · To probe this issue, we constructed a fine-grained Chinese-English parallel corpus of financial news called FFN. It covers multiple news categories, providing comprehensive support for text classification tasks across various domains, making it suitable for multi-scenario text mining research. The dataset contains a rich set of multimedia information for each microblog including ground-truth label, textual, visual, temporal, and network information. Feb 13, 2025 · This news-derived dataset enables the analysis of urban floods in China from both natural and societal perspectives. These cover topics such as health, environment, GIS, law, media, religion, among others, and are listed in the Link Library below.
Mar 31, 2025 · In this paper, we introduce a multimodal dataset that combines neuroimaging, behavioral data, and standardized Chinese social-lifestyle fake and true news materials. Similar to the FEVER dataset, claims in the “Supports” and “Refutes” ... The subset consists of 200,000 Chinese news titles, with text lengths ranging from 20 to 30 characters per title. NEWSFARM is a large-scale Chinese long news summarization corpus, containing more than 220K Chinese long news articles and summaries written by professional editors or authors. Chinese news dataset of 20 different categories. In this paper, we propose a new dataset, N24News, which is generated from the New York Times with 24 categories and contains both text and image information for each news item. The prevalence of fake news across various online sources has had a significant influence on the public. GitHub repository: twinkle121/CNC. Oct 17, 2022 · In this paper, we present a large Chinese news article dataset with 4.4 million articles.
May 16, 2022 · Here are our top picks for Mandarin Chinese Language datasets. CLTS is a new Chinese long text summarization dataset, extracted from the Chinese news website ThePaper.cn. The model reaches 20 BLEU on the test set after training for only 2 epochs (18 hours on 6 NVIDIA Tesla K40M GPUs), while the SOTA result is about 24 BLEU. Background: a collection of news articles in Traditional and Simplified Chinese. It includes some internet news outlets that are not Chinese state media (which should have a separate dataset), and complete coverage is not guaranteed. The dataset is therefore not suitable for analyzing event coverage; it is intended as a corpus for NLP algorithms. Data description: title: the article title, from the og:title or twitter:title meta tag; desc: from twitter:description or og ... However, fake news originating from multiple sources exhibits diversity in various aspects, including its content and social context. A collection of OCR-related datasets. Chinese news text classification is an important direction in natural language processing (NLP). Jul 13, 2023 · Compared with other Chinese epidemic rumor datasets, the LTCR dataset focuses on long-text data and contains longer and more usable fake news texts, filling the gap in Chinese long-text rumor detection datasets related to COVID-19. This project contains pre-processing scripts and Transformer baseline training scripts using pytorch/fairseq for the WMT 2017 Machine Translation of News Chinese->English track.
It has long documents with high-abstractive summaries, which can encourage document-level understanding and generation for current summarization models. In short, SportsSum2.0 is the cleaned version of SportsSum. Ren, Yubing; Cao, Yanan; Li, Hao; Li, Yingjie; Ma, Zixuan; Fang, Fang; Guo, Ping; Ma, Wei. DEIE: Benchmarking Document-level Event Information Extraction with a Large-scale Chinese News Dataset. Edited by Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, and Nianwen Xue. In: Proceedings of the 2024 Joint ... In response to the need for such large-scale and high-quality datasets, we introduce DEIE, a Unified Large-scale Document-level Event Information Extraction dataset. We contribute the dataset with a rich set of features on microblogs related to COVID-19. Multimodal news records with audience impact indicators. AISHELL-1 Dataset: AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin. Mar 6, 2025 · This study introduces the first Chinese News Framing dataset, to be used as either a stand-alone dataset or a supplementary resource to the SemEval-2023 task 3 dataset. Jan 3, 2022 · Parallel text datasets are valuable for educational purposes, machine translation, and cross-language information retrieval, but few are domain-oriented. It can efficiently automate the training, evaluation, and classification of user-defined text classification corpora.
May 5, 2023 · Existing work generally classifies news headlines as a matter of short text classification. In this paper, we propose a new method to identify keywords in ... Mar 26, 2025 · One academic who reviewed the dataset said it was "clear evidence" that China, or its affiliates, wants to use AI to improve repression. Jan 25, 2025 · The Sogou corpus is a widely used Chinese text classification dataset, sourced from Sogou News, containing 17,910 text samples. Ifeng and Chinanews consist of first paragraphs of news articles of different topic classes. Jan 23, 2024 · To address this limitation, we construct the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which contains news collected from diverse sources, such as social platforms, messaging apps, and traditional online news outlets, and fact-checked through 14 authoritative fact-checking agencies. The dataset is openly accessible for academic and research purposes through our website. A simple crawler to collect news text from China Daily (www.chinadaily.com) with an example English news corpus and a labeled dataset for Named Entity Recognition (NER). An English-Chinese COVID-19 fake and real news dataset from the ICDMW 2021 paper "Cross-lingual COVID-19 Fake News Detection".
The 2.0 version offers more datasets and an improved data description, including data types and sources. Oct 6, 2021 · In this paper, we present a large-scale Chinese news summarization dataset, CNewSum, to make up for the lack of Chinese document-level summarization, which can become an important supplement to current Chinese understanding and generation tasks. Feb 9, 2024 · To solve these issues, we introduce the Financial News and Stock Price Integration Dataset (FNSPID). Creating a national urban flood dataset for China from news texts (2000–2022) at the county level, by Shengnan Fu, David M. Schultz, Heng Lyu, Zhonghua Zheng, and Chi Zhang. Figure 1 illustrates an example of the financial sentiment analysis task for enterprise early warning. Sina Weibo is China's largest public social media platform. The experimental results on the THUCNews dataset show that the accuracy of the model for Chinese news text classification is 97.87%, and the recall rate and F1 score are better than those of the comparison model. CFEVER comprises 30,012 manually created claims based on content in Chinese Wikipedia. We have created a Chinese–English parallel dataset in the domain of finance technology, using the Financial Times website, from which we grabbed 60,473 news items published between 2007 and 2021. Methods trained on purely one single news source can hardly be ... The training data come from an annotated dataset for news classification, the Categorized News Dataset from Fudan University, downloaded from Kaggle (Fudan University’s Natural Language Processing Group, 2018). How to use high-quality text classification technology to help humans efficiently organize and manage the massive amount of web news is an urgent problem to be solved.
In this repository, we provide a curated collection of datasets specifically designed for chatbot training, including links, size, language, usage, and a brief description of each dataset. These articles are obtained from different news channels and sources. May 20, 2025 · False information can spread quickly on social media, negatively influencing citizens’ behaviors and responses to social events. The report, based on data collected by William & Mary students, reveals that the scale and scope of Beijing’s portfolio is vastly larger than previously understood: $2.2 trillion of aid and credit spread across ... 1,755,826 newspaper records are stored in news_six_all.csv. Unless otherwise indicated, the datasets are in simplified Chinese. It contains 17,000 documents and 29,223 events, which are all manually annotated based on a pre-defined schema for the military domain including 8 event types and 11 argument role types. Metatext is a platform that allows you to build, train and deploy NLP models in minutes. The resulting version of the dataset contains more than 180,000 long-sequence pairs, where each article consists of multiple paragraphs and each summary consists of multiple sentences. They are collected by the Glyph project, and more details are discussed in the corresponding ... To fill this gap and alleviate relevant problems, we proposed a large-scale document-level open-source Chinese Military News Event Extraction dataset (CMNEE), which draws its corpus from authoritative websites such as Huanqiu, China Military Online, Sina Military and Baidu Encyclopedia. Chinese News Causality Dataset.
Jan 25, 2025 · THUCTC (THU Chinese Text Classification) is a Chinese text classification toolkit developed by the Natural Language Processing Laboratory of Tsinghua University. Apr 11, 2025 · It is the first large-scale Chinese multi-document summarization dataset, containing 5,100 events and a total of 57,984 news documents, with an average of 11.4 input news documents and 13,471 characters per event. However, due to the strong domain nature and limited text length of news headlines, their classification results are usually determined by several specific keywords, which makes the traditional short text classification method ineffective. The Datasets page, created in collaboration with the Library, aims to serve as a starting point for students and scholars to search for data on China. Therefore, it is of great significance to build a real-time and full-scale Weibo public opinion dataset. Each document in this corpus contains one or more event templates. For more details, please refer to the following paper: SportsSum2.0: Generating High-Quality Sports ... Collect and analyze elite and public opinion survey data from the littoral countries. Refocus scholarly and policy analysis on the western Pacific from “US-China struggle” to ASEAN-China competition and cooperation. Jiangshu Du, Yingtong Dou, Congying Xia, Limeng Cui, Jing Ma, Philip S. Yu. William & Mary’s AidData research lab today released a new flagship report and massive dataset that comprehensively tracks China’s lending and grant-giving activities worldwide.
The latest and most popular social events will be disclosed and discussed on Weibo as soon as possible. The total numbers of domains and news articles are 39 and around 115,000, respectively. [1] (Cui et al., 2016) Consensus Attention-based Neural ... Temporally, flood events predominantly occur in the summer, accounting for 74% of total flooding events. Each zipped file is a collection of news documents from a specific domain. Each claim in CFEVER is labeled as “Supports”, “Refutes”, or “Not Enough Info” to depict its degree of factualness. News content data, about 35G in total; each piece of news comment content contains ID, time, news title and news body; this dataset can be used for tasks such as LLM training, chatgpt ... MCFEND: this repository houses resources and materials associated with the research paper titled "MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection," presented at the ACM Web Conference 2024 (WWW 2024). The FNSPID repository offers the FNSPID dataset, experimental results, and a news content scraper tool. MCFEND, the first multi-source Chinese fake news detection dataset, comprises multi-modal content and social context of 23,974 real-world Chinese news pieces collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. AISHELL-3 Dataset: AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus that is used to train multi-speaker Text-to-Speech (TTS) systems.
We also collect and index important public datasets that are commonly used by our researchers. On April 21 Beijing time, football competitions were in full swing: Bayern won big in round 30 of the Bundesliga, Arsenal returned to the top of the Premier League, and Manchester City beat Chelsea with a late winner to reach the FA Cup final on a wild night. Here is a roundup of the matches. Bayern, re-energized after reaching the Champions League semi-finals, showed no mercy against Union Berlin in the league. Goretzka opened the scoring with a shot inside the box, Neuer made a string of saves, and Kane scored directly from a free kick in first-half stoppage time to extend the lead. After the break the team kept attacking: Müller scored twice, with a header and a shot, and Tel added another. Vertessen pulled one back late on, and Bayern finished 5-1 winners over Union Berlin. Although they are already out of the Bundesliga title race, Bayern still have hope in Europe; they need strong league performances to stay in form so that the Champions League semi-final against Real Madrid can be an evenly matched contest. The Gunners had the upper hand, but the players looked below par because of competing on several fronts, especially in front of goal; fortunately, close to half-time Jesus held off a defender and laid the ball back for Trossard to score. It is a real-world dataset for cross-domain emotion distribution learning which was crawled from the ChinaNews website. If I missed something, feel free to inform me. Training data: 5 Chinese text classification datasets are used. However, the lack of large-scale and high-quality Chinese datasets remains a critical ... Oct 14, 2021 · Turenne N et al (2021) Mining an English-Chinese parallel Corpus of Financial News. May 13, 2024 · Yupeng Li and others published MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection. We present CFEVER, a Chinese dataset designed for Fact Extraction and VERification. Repository: KuroginQin/China_Daily_Cralwer. CMNEE is a large-scale, document-level open-source Chinese Military News Event Extraction dataset.
GitHub repository: yulia-wang/chinese_news_data. The first Chinese COVID-19 fake news dataset based on the Weibo platform. The China Data Lab Dataverse is a unique platform hosted by the China Data Lab at the University of California San Diego's 21st Century China Center. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. Oct 17, 2022 · With the filter reducing annotation overhead, we construct CStory, a large-scale Chinese news storyline dataset, which contains 11,978 news articles, 112,549 manually labeled storyline relation pairs, and 49,832 evidence sentences for annotation judgment. Jun 22, 2021 · We introduce the first fact-checked Chinese COVID-19 social media dataset, which enables more research on tracing the spread of microblog misinformation and on analyzing content patterns in COVID-19 fake news. Welcome to the Webz.io News Dataset Repository! This repository is created by Webz.io and is dedicated to providing free datasets of publicly available news articles. This resource meticulously annotates 20,000 documents, spanning 64 event types derived from publicly available Chinese news reports. Nov 14, 2025 · Current news datasets merely focus on text features of the news and rarely leverage image features, excluding numerous essential features for news classification.
This dataset uniquely combines time series news and stock prices, providing a groundbreaking resource for financial market analysis. The Chinese Financial Event Extraction Dataset (CFEED) is a financial-domain Chinese corpus regarding the major events in the announcements of listed companies. Sports Game Summarization is a challenging task, which aims to generate sports summaries (i.e., news articles) from corresponding live commentaries. Mar 8, 2025 · The Chinese News Framing dataset released by the University of Sheffield was created by the university's School of Computer Science and is the first dataset dedicated to the automatic detection of news framing in Chinese. It contains about 300,000 Chinese news articles collected from websites in 13 different countries, carefully selected and annotated ... Convert the CSV data from this Kaggle dataset into SQL. Create a containerized database to store the data. Write a small Flask app that exposes a list of news ordered by the most recent news and exposes a single news item by its unique identifier. Deploy the application on Heroku. Optionally, write a small e2e test with Cypress.io and build on Travis.ci. Sep 1, 2023 · Dataset of newspapers: six newspapers are chosen as the main sources for constructing the CCPU index: People’s Daily, Guangming Daily, Economic Daily, Global Times, Science and Technology Daily, and China News Service. This resource provides a diverse collection of datasets focused on understanding various aspects of Chinese society ...
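The exercise described above (convert CSV data to SQL, list news ordered by most recent, fetch a single item by its identifier) can be sketched with the Python standard library alone. This is a minimal illustration rather than the exercise's intended solution: it swaps the containerized database for an in-memory SQLite database and leaves out the Flask routes, and the column names `id`, `title`, and `published_at` are assumptions, since the Kaggle file's real schema is not given here.

```python
import csv
import io
import sqlite3

# Hypothetical stand-in for the Kaggle CSV; the real column names may differ.
SAMPLE_CSV = """id,title,published_at
1,First headline,2019-11-16
2,Second headline,2019-11-18
3,Third headline,2019-11-17
"""


def load_news(conn: sqlite3.Connection, csv_text: str) -> None:
    """Create the news table and bulk-insert the CSV rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS news ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, published_at TEXT NOT NULL)"
    )
    rows = [
        (int(r["id"]), r["title"], r["published_at"])
        for r in csv.DictReader(io.StringIO(csv_text))
    ]
    conn.executemany(
        "INSERT INTO news (id, title, published_at) VALUES (?, ?, ?)", rows
    )
    conn.commit()


def list_news(conn: sqlite3.Connection) -> list:
    """Return all news rows, most recent first (ISO dates sort as text)."""
    cur = conn.execute(
        "SELECT id, title, published_at FROM news ORDER BY published_at DESC"
    )
    return cur.fetchall()


def get_news(conn: sqlite3.Connection, news_id: int):
    """Return one news row by its unique identifier, or None if absent."""
    cur = conn.execute(
        "SELECT id, title, published_at FROM news WHERE id = ?", (news_id,)
    )
    return cur.fetchone()


conn = sqlite3.connect(":memory:")
load_news(conn, SAMPLE_CSV)
latest = list_news(conn)
one = get_news(conn, 3)
print(latest)
print(one)
```

In a Flask app, `list_news` and `get_news` would back the two endpoints, with a missing identifier (a `None` result) mapped to a 404 response.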
The Sogou corpus is characterized by a balanced distribution of text samples across ... Data Repositories: Anacode Chinese Web Datastore (a collection of crawled Chinese news and blogs in JSON format); Appen Open Source Datasets (over 270 audio, image, video and text datasets in over 80 languages); AssetMacro (historical data of macroeconomic indicators and market data); Awesome Public Datasets (a topic-centric list of HQ open datasets). The FinChina SA dataset and code for the FinLLM@IJCAI'23 paper "Chinese Fine-Grained Financial Sentiment Analysis with Large Language Models" - YerayL/FinChina-SA. Apr 18, 2024 · To alleviate this problem, we propose CMNEE, a large-scale, document-level open-source Chinese Military News Event Extraction dataset. FNSPID comprises 29.7 million stock prices and 15.7 million time-aligned financial news records for 4,775 S&P500 companies, covering the period from 1999 to 2023, sourced from 4 stock market news websites. We acquired financial news articles spanning January 1, 2014 to December 31, 2023, from mainstream media websites such as CNN, FOX, and China Daily. To better detect all of the fake news, especially long texts which are harder to find completely, a Long-Text Chinese Rumor detection dataset named LTCR is proposed. Extensive experiments have been conducted to analyze CHECKED data and to provide benchmark results for well-established methods when predicting fake news using CHECKED. The newspaper data are collected from the Wisenews database between January 2000 and December 2022. Chinese Datasets Archive 2.0. Apr 26, 2023 · (1) Background: Chinese news text is a popular form of media communication, which can be seen everywhere in China. GitHub repository: xinke-wang/OCRDatasets. It provides comprehensive financial data combining stock prices and news records for S&P500 companies, demonstrates the dataset's impact on prediction accuracy, and includes a tool for updating the dataset with new financial news.
The SCSDI aims: Create a new, systematic dataset documenting conflict in the South China Sea over the past decade, using a mix of qualitative and quantitative methods. Dataset Overview: the THUCNews dataset used in this system is a subset extracted from the original THUCNews corpus, which contains news articles collected from various Chinese news outlets. Feb 9, 2024 · To address this challenge, we introduce a large-scale financial dataset, namely, Financial News and Stock Price Integration Dataset (FNSPID). Aiming to address the lack of a comprehensive Chinese financial sentiment analysis dataset and meet the demands of enterprises regarding negative news alerts, we propose the FinChina SA dataset specifically designed for the financial domain. We release new datasets weekly, each containing around 1,000 news articles focused on various themes, topics, or metadata ...
SportsSum2.0 is a Chinese sports game summarization dataset which is based on SportsSum. This is the first Chinese news dataset that has both hierarchical topic labels and article full ... Aug 14, 2020 · CNewSum: A Large-scale Chinese News Summarization Dataset with Human-annotated Adequacy and Deducibility Level, by Danqing Wang, Jiaze Chen, Xianze Wu, Hao Zhou, and Lei Li. The acquired military news text itself is reliable. Here I list several Chinese reading comprehension datasets that are PUBLICLY available (with appropriate technical report or paper). A dataset of millions of news articles scraped from a curated list of data sources. At the same time, the need for automatic summarization systems has been rapidly increasing as the amount of textual information on the web and in large data centers became intractable for human readers. JD full, JD binary, and Dianping datasets consist of user reviews of different sentiment polarities. Two fake news datasets covering seven different news domains. Repository: several27/FakeNewsCorpus. Sep 17, 2024 · Compared with the review domain, which has abundant high-quality data for robust model evaluation, datasets from the news domain are relatively scarce, and each is dedicated to a single news subdomain for the Targeted Sentiment Analysis (TSA) task. Nov 18, 2019 · Yet Another Chinese News Dataset With Article Titles, Descriptions, Cover Images, and Links.