I'm following the guide at https://huggingface.co/docs/transformers/pipeline_tutorial to use the transformers pipeline for inference. For example, the following code snippet works for getting NER results from the ner pipeline.
# KeyDataset is a util that will just output the item we're interested in.
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset
model = ...
tokenizer = ...
pipe = pipeline("ner", model=model, tokenizer=tokenizer)
dataset = load_dataset("my_ner_dataset", split="test")
for extracted_entities in pipe(KeyDataset(dataset, "text")):
    print(extracted_entities)
In NER, as in many applications, I would also like to get the input back, so that I can store the result as (text, extracted_entities) pairs for later processing. Basically, I'm looking for something like:
# KeyDataset is a util that will just output the item we're interested in.
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset
from datasets import load_dataset
model = ...
tokenizer = ...
pipe = pipeline("ner", model=model, tokenizer=tokenizer)
dataset = load_dataset("my_ner_dataset", split="test")
# Desired behaviour, not what the pipeline does: it yields only the entities.
for text, extracted_entities in pipe(KeyDataset(dataset, "text")):
    print(text, extracted_entities)
Here text is the raw input text (possibly batched) that is fed into the pipeline.
Is this doable?
Thanks. I'm trying to get the exact original input, not just the extracted span.
# Datasets 2.11.0
from datasets import load_dataset
# Transformers 4.27.4, Torch 2.0.0+cu118
from transformers import (
    AutoTokenizer,
    AutoModelForTokenClassification,
    pipeline,
)
from transformers.pipelines.pt_utils import KeyDataset
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
pipe = pipeline(task="ner", model=model, tokenizer=tokenizer)
dataset = load_dataset("argilla/gutenberg_spacy-ner", split="train")
results = pipe(KeyDataset(dataset, "text"))
for idx, extracted_entities in enumerate(results):
    print("Original text:\n{}".format(dataset[idx]["text"]))
    print("Extracted entities:")
    for entity in extracted_entities:
        print(entity)
Original text:
Would I wish to send up my name now ? Again I declined , to the polite astonishment of the concierge , who evidently considered me a queer sort of a friend . He was called to his desk by a guest , who wished to ask questions , of course , and I waited where I was . At a quarter to eleven Herbert Bayliss emerged from the elevator . His appearance almost shocked me . Out late the night before ! He looked as if he had been out all night for many nights .
Extracted entities:
{'entity': 'B-PER', 'score': 0.9996532, 'index': 68, 'word': 'Herbert', 'start': 289, 'end': 296}
{'entity': 'I-PER', 'score': 0.9996567, 'index': 69, 'word': 'Bay', 'start': 297, 'end': 300}
{'entity': 'I-PER', 'score': 0.9991698, 'index': 70, 'word': '##lis', 'start': 300, 'end': 303}
{'entity': 'I-PER', 'score': 0.96547437, 'index': 71, 'word': '##s', 'start': 303, 'end': 304}
...
Original text:
And you think our run will be better than five hundred and eighty ? '' `` It should be , unless there is a remarkable change . This ship makes over six hundred , day after day , in good weather . She should do at least six hundred by to-morrow noon , unless there is a sudden change , as I said . '' `` But six hundred would be -- it would be the high field , by Jove ! '' `` Anything over five hundred and ninety-four would be that . The numbers are very low to-night .
Extracted entities:
{'entity': 'B-MISC', 'score': 0.40225995, 'index': 90, 'word': 'Jo', 'start': 363, 'end': 365}
Each sample in the dataset created by the load_dataset call can be accessed by its index and the relevant dictionary key.
Calling the pipeline object with a KeyDataset as input returns a PipelineIterator object, which is iterable. Hence, one can enumerate the PipelineIterator object to get both a result and the index of that result, and then use the index to retrieve the associated sample from the dataset.
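The index-based lookup described above can be tried out without downloading a model. In the sketch below, fake_ner_pipeline and KNOWN_NAMES are hypothetical stand-ins (they are not part of transformers); they only mimic the one property that matters here, namely that results are yielded lazily, one per input, in input order.

```python
from typing import Dict, Iterator, List

# Hypothetical stand-in for a loaded dataset: a list of rows with a "text" key.
dataset: List[Dict[str, str]] = [
    {"text": "Herbert Bayliss emerged from the elevator ."},
    {"text": "The numbers are very low to-night ."},
]

# Assumption: a fixed lookup table replaces the real NER model.
KNOWN_NAMES = {"Herbert", "Bayliss"}


def fake_ner_pipeline(texts: List[str]) -> Iterator[List[Dict[str, str]]]:
    """Toy replacement for pipe(KeyDataset(dataset, "text")).

    Yields one list of entity dicts per input text, in input order.
    """
    for text in texts:
        yield [
            {"entity": "PER", "word": word}
            for word in text.split()
            if word in KNOWN_NAMES
        ]


results = fake_ner_pipeline([row["text"] for row in dataset])

# enumerate() supplies the position of each result; that position is then
# used to look the original text up in the dataset.
pairs = []
for idx, extracted_entities in enumerate(results):
    pairs.append((dataset[idx]["text"], extracted_entities))

for text, extracted_entities in pairs:
    print(text, extracted_entities)
```

The pairing relies only on the results arriving in the same order as the inputs, which is the assumption the lookup by index makes as well.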
The Hugging Face pipeline abstraction is a wrapper around all the available pipelines. When one instantiates a pipeline object, it returns the appropriate pipeline based on the task argument:
pipe = pipeline(task = "ner", model=model, tokenizer=tokenizer)
Given that the NER task is specified, a TokenClassificationPipeline is returned (note: "ner" is an alias for "token-classification"). This pipeline (like all the others) inherits from the Pipeline base class. The Pipeline base class defines the __call__ function, which the TokenClassificationPipeline class relies on whenever the instantiated pipeline is called.
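The relationship described above (a task alias resolves to a concrete class, and that class inherits __call__ from a shared base) can be illustrated with a toy model. Every name below (ToyPipeline, ToyTokenClassificationPipeline, TASK_ALIASES, SUPPORTED_TASKS, toy_pipeline) is hypothetical and invented for this sketch; the real transformers classes are considerably more involved.

```python
from typing import Dict, List, Union


class ToyPipeline:
    """Hypothetical base class: it owns __call__ and delegates the real work."""

    def __call__(self, inputs: Union[str, List[str]]):
        # A single string is processed directly; a list is processed item by item.
        if isinstance(inputs, str):
            return self.run_single(inputs)
        return [self.run_single(item) for item in inputs]

    def run_single(self, text: str):
        raise NotImplementedError


class ToyTokenClassificationPipeline(ToyPipeline):
    """Hypothetical subclass: defines only the task-specific step."""

    def run_single(self, text: str) -> List[Dict[str, str]]:
        # Toy rule standing in for a model: tag capitalised words.
        return [
            {"word": word, "entity": "CAP"}
            for word in text.split()
            if word[:1].isupper()
        ]


# "ner" is treated as an alias for "token-classification", as in the text above.
TASK_ALIASES = {"ner": "token-classification"}
SUPPORTED_TASKS = {"token-classification": ToyTokenClassificationPipeline}


def toy_pipeline(task: str) -> ToyPipeline:
    """Resolve the alias, then instantiate the matching pipeline class."""
    canonical_task = TASK_ALIASES.get(task, task)
    return SUPPORTED_TASKS[canonical_task]()


pipe = toy_pipeline("ner")
print(type(pipe).__name__)
print(pipe("Herbert waited"))
```

Note that ToyTokenClassificationPipeline never defines __call__ itself; calling pipe(...) goes through the method inherited from the base class.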
After the pipeline is instantiated (see above), it is called with data passed in as a single string, a list, or, when working with full datasets, a Hugging Face dataset wrapped in the KeyDataset class from transformers.pipelines.pt_utils.
dataset = load_dataset("argilla/gutenberg_spacy-ner", split = "train")
results = pipe(KeyDataset(dataset, "text")) # pipeline call
When the pipeline is called, it checks whether the data passed in is iterable and then calls the appropriate function. For Hugging Face Dataset objects, the get_iterator function is called, which returns a PipelineIterator object. Given the known behaviour of iterator objects, one can enumerate the object to get tuples containing a count (from start, which defaults to 0) and the values obtained from iterating over the iterable. The values are the NER extractions for each sample in the dataset. Hence, the following produces the desired results:
for idx, extracted_entities in enumerate(results):
    print("Original text:\n{}".format(dataset[idx]["text"]))
    print("Extracted entities:")
    for entity in extracted_entities:
        print(entity)
Thanks for the solution and the explanation! Will the results be in the same order as dataset? That is, is it guaranteed that the i-th result in results corresponds to the i-th sample in dataset? I'm a bit surprised that there is no feature request for this, given how common (I think) this need is in real applications...
@Jing You're welcome. Yes, they will be in the same order, because the returned PipelineIterator operates on the same iterable dataset that was originally passed as input in the call pipe(KeyDataset(dataset, "text")). Yes, it is surprising - I suppose there are other ways around the problem. For example, you can append the outputs from the pipeline to a list and, once it has finished, iterate over that output list alongside your original data. Perhaps that is why it was never implemented as an explicit feature: people simply found workarounds.
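The list-based workaround mentioned in this comment might look like the following minimal sketch. fake_pipe is a hypothetical stand-in for the real pipeline call (it just returns the title-cased words of each text), so the snippet runs without a model; the length check is an added precaution, not something the comment asks for.

```python
from typing import Iterator, List


def fake_pipe(texts: List[str]) -> Iterator[List[str]]:
    """Hypothetical stand-in for pipe(...): one output per text, in input order."""
    for text in texts:
        yield [word for word in text.split() if word.istitle()]


texts = [
    "At a quarter to eleven Herbert Bayliss emerged .",
    "by Jove !",
]

# Step 1: drain the iterator into a list.
outputs = []
for result in fake_pipe(texts):
    outputs.append(result)

# Step 2: guard against a silent mismatch before pairing.
if len(outputs) != len(texts):
    raise ValueError("pipeline returned a different number of results than inputs")

# Step 3: walk the inputs and outputs side by side.
paired = list(zip(texts, outputs))
for text, output in paired:
    print(repr(text), "->", output)
```

Compared with looking samples up by index, this keeps all outputs in memory at once, which may matter for large datasets.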
It is a bit unusual for a Python method to return the parameter it was called with. But you can simply add a column to your dataset with map and save the dataset with save_to_disk.
Please have a look at the example below:
from datasets import load_dataset
dataset = load_dataset("wikicorpus", "raw_en", split = "train[:20]")
print(dataset[0])
print(dataset.features)
Output:
{'id': '214730', 'title': 'Henry Hallam', 'text': 'Henry Hallam (July 9, 1777 - January 21, 1859) was an English historian.\n\n\n\nThe only son of John Hallam, canon of Windsor and dean of Bristol, he was educated at Eton and Christ Church, Oxford, graduating in 1799. Called to the bar, he practised for some years on the Oxford circuit; but his tastes were literary, and when, on his father\'s death in 1812, he inherited a small estate in Lincolnshire, he gave himself up wholly to academic study. He had become connected with the brilliant band of authors and politicians who led the Whig party, a connection to which he owed his appointment to the well-paid and easy post of commissioner of stamps; but took no part in politics himself. He was, however, an active supporter of many popular movements--particularly of that which ended in the abolition of the slave trade; and he was sincerely attached to the political principles of the Whigs. \n\n\n\nHallam\'s earliest literary work was undertaken in connexion with the great organ of the Whig party, the Edinburgh Review, where his review of Scott\'s Dryden attracted attention. His first great work, The View of the State of Europe during the Middle Ages, was produced in 1818, and was followed nine years later by the Constitutional History of England. In 1838-1839 appeared the Introduction to the Literature of Europe in the 15th, 16th and 17th Centuries. These are the three works on which Hallam\'s fame rests. They took a place in English literature which was not seriously challenged until the 20th century. A volume of supplemental notes to his Middle Ages was published in 1848; these facts and dates represent nearly all of Hallam\'s career. The strongest personal interest in his life was the affliction which befell him in the loss of his children, one after another. His eldest son, Arthur Henry Hallam--the "A.H.H." 
of Tennyson\'s In Memoriam, and by the testimony of his contemporaries a man of the most brilliant promise--died in 1833 at the age of twenty-two. Seventeen years later, his second son, Henry Fitzmaurice Hallam, was cut off like his brother at the very threshold of what might have been a great career. The premature death and high talents of these young men, and the association of one of them with the most popular poem of the age, have made Hallam\'s family afflictions better known than any other incidents of his life. He survived wife, daughter and sons by many years.\n\n\n\nIn 1834 Hallam published The Remains in Prose and Verse of Arthur Henry Hallam, with a Sketch of his Life. In 1852 a selection of Literary Essays and Characters from the Literature of Europe was published. Hallam was a fellow of the Royal Society, and a trustee of the British Museum, and enjoyed many other appropriate distinctions. In 1830 he received the gold medal for history, founded by George IV. The Middle Ages is described by Hallam himself as a series of historical dissertations, a comprehensive survey of the chief circumstances that can interest a philosophical inquirer during the period from the 5th to the 15th century. The work consists of nine long chapters, each of which is a complete treatise in itself. The history of France, of Italy, of Spain, of Germany, and of the Greek and Saracenic empires, sketched in rapid and general terms, is the subject of five separate chapters. Others deal with the great institutional features of medieval society--the development of the feudal system, of the ecclesiastical system, and of the free political system of England. The last chapter sketches the general state of society, the growth of commerce, manners, and literature in the Middle Ages. 
The book may be regarded as a general view of early modern history, preparatory to the more detailed treatment of special lines of inquiry carried out in his subsequent works, although Hallam\'s original intention was to continue the work on the scale on which it had been begun. \n\n\n\nThe Constitutional History of England takes up the subject at the point at which it had been dropped in the View of the Middle Ages, viz, the accession of Henry VII, and carries it down to the accession of George III. Hallam stopped here for a characteristic reason, which it is impossible not to respect and to regret. He was unwilling to excite the prejudices of modern politics which seemed to him to run back through the whole period of the reign of George III; nevertheless, he was accused of bias. The Quarterly Review for 1828 contains an article on the Constitutional History, written by Southey, full of reproach. The work, he says. is the "production of a decided partisan," who "rakes in the ashes of long-forgotten and a thousand times buried slanders, for the means of heaping obloquy on all who supported the established institutions of the country." Hallam\'s view of constitutional history was that it should contain only so much of the political and general history of the time as bears directly on specific changes in the organization of the state, including judicial as well as ecclesiastical institutions. It was his cool treatment of such sanctified names as Charles I, Cranmer and Laud that provoked the indignation of Southey, who forgot that the same impartial measure was extended to statesmen on the other side.\n\n\n\nIf Hallam ever deviated from perfect fairness, it was in the tacit assumption that the 19th century theory of the constitution was the right theory in previous centuries, and that those who departed from it on one side or the other were in the wrong. 
He did unconsciously antedate the constitution, and it is clear from incidental allusions in his last work that he did not favour the democratic changes he thought to be impending. Hallam, like Macaulay, ultimately referred all political questions to the standard of Whig constitutionalism. But he was scrupulously conscientious in collecting and weighing his materials. In this he was helped by his legal training, and it was this which made the Constitutional History one of the standard text-books of English politics.\n\n\n\nLike the Constitutional History, the Introduction to the Literature of Europe continues a branch of inquiry which had been opened in the View of the Middle Ages. In the first chapter of the Literature, which is to a great extent supplementary to the last chapter of the Middle Ages, Hallam sketches the state of literature in Europe down to the end of the 14th century: the extinction of ancient learning which followed the fall of the Roman empire and the rise of Christianity; the preservation of the Latin language in the services of the church; and the slow revival of letters, which began to show itself soon after the 7th century--"the nadir of the human mind"--had been passed. For the first century and a half of his special period he is mainly occupied with a review of classical learning, and he adopts the plan of taking short decennial periods and noticing the most remarkable works which they produced. The rapid growth of literature in the 16th century compels him to resort to a classification of subjects: in the period 1520-1550 we have separate chapters on ancient literature, theology, speculative philosophy and jurisprudence, the literature of taste, and scientific and miscellaneous literature; and the subdivisions of subjects is carried further of course in the later periods. Thus poetry, the drama and polite literature form the subjects of separate chapters. 
One inconvenient result of this arrangement is that the same author is scattered over many chapters, according as his works fall within this category or that period of time. Names like Shakespeare, Grotius, Francis Bacon and Thomas Hobbes appear in half a dozen different places. The individuality of great authors is thus dissipated except when it has been preserved by an occasional sacrifice of the arrangement--and this defect, if it is to be esteemed a defect, is increased by the very sparing references to personal history and character with which Hallam was obliged to content himself.\n\n\n\nHis plan excluded biographical history, nor is the work, he tells us, to be regarded as one of reference. It is rigidly an account of the books which would make a complete library of the period, arranged according to the date of their publication and the nature of their subjects. The history of institutions like universities and academies, and that of great popular movements like the Reformation, are of course noticed in their immediate connection with literary results; but Hallam had little taste for the spacious generalization which such subjects suggest. The great qualities displayed in this work have been universally acknowledged--conscientiousness, accuracy, judgment and enormous reading. Not the least styiking testimony to Hallam\'s powers is his mastery over so many diverse forms of intellectual activity. In science and theology, mathematics and poetry, metaphysics and law, he is a competent and always a fair if not a profound critic. The bent of his own mind is manifest in his treatment of pure literature and of political speculation--which seems to be inspired with stronger personal interest and a higher sense of power than other parts of his work display. Not less worthy of notice in a literary history is the good sense by which both his learning and his tastes have been held in control. 
Probably no writer ever possessed a juster view of the relative importance of men and things. The labour devoted to an investigation is with Hallam no excuse for dwelling on the result, unless that is in itself important. He turns away contemptuously from the mere curiosities of literature, and is never tempted to make a display of trivial erudition. Nor do we find that his interest in special studies leads him to assign them a disproportionate place in his general view of the literature of a period.\n\n\n\nHallam is generally described as a "philosophical historian." The description is justified not so much by any philosophical quality in his method as by the nature of his subject and his own temper. Hallam is a philosopher to this extent that both in political and in literary history he fixed his attention on results rather than on persons. His conception of history embraced the whole movement of society. Beside that conception the issue of battles and the fate of kings fall into comparative insignificance. "We can trace the pedigree of princes," he reflects, "fill up the catalogue of towns besieged and provinces desolated, describe even the whole pageantry of coronations and festivals, but we cannot recover the genuine history of mankind." But, on the other hand, there is no trace in Hallam of anything like a philosophy of history or society.\n\n\n\nWise and generally melancholy reflections on human nature and political society are not infrequent in his writings, and they arise naturally and incidentally out of the subject he is discussing. His object is the attainment of truth in matters of fact. Sweeping theories of the movement of society, and broad characterizations of particular periods of history seem to have no attraction for him. The view of mankind on which such generalizations are usually based, taking little account of individual character, was distasteful to him. 
Thus he objects to the use of statistics because they favour the tendency to regard all men as mentally and morally equal. At the same time Hallam by no means assumes the tone of the mere scholar. He is solicitous to show that his point of view is that of the cultivated gentleman and not of the specialist. Thus he tells us that Montaigne is the first French author whom an English gentleman is ashamed not to have read. In fact, allusions to the necessary studies of a gentleman meet us constantly, reminding us of the unlikely erudition of the schoolboy in Macaulay. Hallam\'s prejudices, so far as he had any, belong to the same character. His criticism assumes a tone of moral censure when he has to deal with certain extremes of human thought--scepticism in philosophy, atheism in religion and democracy in politics.\n\n\n\nMacaulay\'s essay in review of the Constitutional History is available at: http://www.history1700s.com/etexts/html/texts/1cahe10.txt\n\n\n\nReferences.\n\n;'}
{'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None)}
from transformers import pipeline
pos_pipe = pipeline(model = "vblagoje/bert-english-uncased-finetuned-pos", aggregation_strategy = "simple")
# map without any parameters expects a dict as return value (every key will be a column name)
dataset = dataset.map(lambda examples: {"pos": pos_pipe(examples["text"])})
# [:5] prints only the first 5
print(dataset[0]["pos"][:5])
Output:
[{'end': 12, 'entity_group': 'PROPN', 'score': 0.9982394576072693, 'start': 0, 'word': 'henry hallam'}, {'end': 14, 'entity_group': 'PUNCT', 'score': 0.9956117868423462, 'start': 13, 'word': '('}, {'end': 18, 'entity_group': 'PROPN', 'score': 0.9955991506576538, 'start': 14, 'word': 'july'}, {'end': 20, 'entity_group': 'NUM', 'score': 0.9968919157981873, 'start': 19, 'word': '9'}, {'end': 21, 'entity_group': 'PUNCT', 'score': 0.9996525049209595, 'start': 20, 'word': ','}]
from datasets import load_from_disk
dataset.save_to_disk("blabla_my_dataset")
ds = load_from_disk("blabla_my_dataset")
print(ds.features)
Output:
{'id': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None), 'text': Value(dtype='string', id=None), 'pos': [{'end': Value(dtype='int64', id=None), 'entity_group': Value(dtype='string', id=None), 'score': Value(dtype='float32', id=None), 'start': Value(dtype='int64', id=None), 'word': Value(dtype='string', id=None)}]}
The ner pipeline should already return the classified token in addition to its predicted entity (see github.com/huggingface/transformers/blob/… for reference); maybe that helps? (Perhaps I misunderstood your point.)
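To make the distinction concrete: each entity dict in the sample output earlier in this thread carries 'word', 'start' and 'end', which identify a span, not the whole input. The sketch below builds such dicts by hand (the make_entity helper and the example sentence are assumptions for illustration; nothing here calls the real pipeline) and shows that the offsets can only be resolved if the original text is kept separately.

```python
from typing import Dict, Union

# The original input must be kept by the caller; the entities do not contain it.
text = "At a quarter to eleven Herbert Bayliss emerged from the elevator ."


def make_entity(label: str, surface: str) -> Dict[str, Union[str, int]]:
    """Hypothetical helper: build a dict with the same keys as the sample output."""
    start = text.index(surface)
    return {
        "entity": label,
        "word": surface,
        "start": start,
        "end": start + len(surface),
    }


entities = [make_entity("B-PER", "Herbert"), make_entity("I-PER", "Bayliss")]

# start/end are character offsets into the original text.
for entity in entities:
    span = text[entity["start"]:entity["end"]]
    print(entity["entity"], repr(span))

# Joining the extracted words does not give back the input sentence.
reconstructed = " ".join(str(entity["word"]) for entity in entities)
print(reconstructed == text)
```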